Python · Feature Selection · ANN · DT · KNN · NB · Springer InECCE2019

Student Performance Prediction
A Comparative Analysis of Four Classification Algorithms

A data-mining study that predicts university students' CGPA from semester results and survey reviews of campus facilities. Three feature-selection methods (Chi-Square, Euclidean distance, and correlation) each rank the attributes, and the 22 features common to all three rankings are kept. Four classifiers (ANN, Decision Tree, KNN, Naïve Bayes) are then benchmarked across 3-, 5-, and 9-class CGPA targets. The headline result: on the 3-class target with the combined features, the ANN reaches 93.70% accuracy, narrowly ahead of the Decision Tree (92.18%).

Python · ANN · Decision Tree · KNN · Naïve Bayes · Chi-Square · Euclidean · Correlation · EDM · Springer InECCE2019
Domain
Educational Data Mining (EDM)
Dataset
Kaggle "Student Survey" — Bangladeshi university (IQAC · UGC · World Bank), 2017
Target
CGPA class (3 / 5 / 9 buckets) from SGPA + facility reviews
Methods
3 feature-selection methods · 4 classifiers · 3 train/test splits
Publication
Springer · InECCE2019 (Lecture Notes in Electrical Engineering)
01 About

Student academic performance shapes graduate quality, employability, and — at scale — a country's economic and social development. Identifying why performance varies gives institutions the information they need to plan education policy and to flag students who need early support. While many studies have applied data mining to this problem, the paper notes that none had focused on Bangladeshi students — the gap this work sets out to fill.

The study uses a publicly available "Student Survey" dataset from Kaggle, collected via Google Forms at a Bangladeshi university as part of the Institutional Quality Assurance Program (initiated by the University Grants Commission and funded by the World Bank). A per-student average CGPA is computed as the prediction target.

The core idea is to predict a student's CGPA from their semester results (SGPA) and their review of university facilities — admission policy, laboratories, internet speed, gymnasium, safety, scholarships, and more. Rather than rely on a single feature ranker, the work runs three feature-selection methods and keeps only the features they agree on.

Four classifiers are then compared across three CGPA granularities. A central question the paper probes is how the number of prediction classes and the choice of features jointly affect accuracy, and which of the two strongest models, the ANN or the Decision Tree, comes out ahead under each combination.

02 Pipeline

The workflow runs as a six-stage educational-data-mining pipeline: raw survey data is cleaned and encoded, three feature-selection methods run in parallel, their results are intersected into a single combined feature set, and four classifiers are trained and compared across multiple CGPA class definitions and train/test splits.

01
Raw data — Kaggle Student Survey
Load the survey dataset collected at a Bangladeshi university (academic situation + facility-quality questionnaire). A per-student average CGPA is computed from semester records to serve as the prediction target.
load_dataset() # Kaggle "Student Survey", 2017
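The target construction in this step can be sketched as follows. This is a minimal, dependency-free illustration rather than the paper's code: it assumes the average CGPA is the plain mean of the available semester SGPAs, that grades lie on a 0 to 4 scale, and that the 3/5/9 classes are equal-width buckets. The page states none of these details, and the names `average_cgpa` and `cgpa_class` are hypothetical.

```python
from statistics import mean


def average_cgpa(sgpas):
    """Mean of the semester GPAs that are present (None marks a missing one).

    Assumption: the per-student target is a simple mean of SGPAs.
    """
    present = [g for g in sgpas if g is not None]
    return mean(present)


def cgpa_class(cgpa, n_classes, low=0.0, high=4.0):
    """Map a CGPA to one of n_classes equal-width buckets (0 = lowest).

    Assumption: equal-width bins on a 0-4 scale; the paper may cut differently.
    """
    if not low <= cgpa <= high:
        raise ValueError(f"CGPA {cgpa} outside [{low}, {high}]")
    width = (high - low) / n_classes
    # The top edge (cgpa == high) belongs to the last bucket.
    return min(int((cgpa - low) / width), n_classes - 1)


student_sgpas = [3.0, 3.5, None, 4.0]
cgpa = average_cgpa(student_sgpas)
labels = {n: cgpa_class(cgpa, n) for n in (3, 5, 9)}
```

The same CGPA lands in a different bucket index for each granularity, which is why the three targets are treated as separate classification problems.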
02
Null data reduction
Survey data is incomplete, noisy, and inconsistent. Empty fields are imputed with the column mean rather than dropped — a simple strategy the paper notes degrades when the share of missing values grows large.
if cell is null: cell = mean(column)
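A minimal sketch of column-mean imputation, written with plain Python dictionaries so it runs without extra packages. The row layout and the function name `fill_with_column_mean` are assumptions made for illustration, and the column names are made up.

```python
from statistics import mean


def fill_with_column_mean(rows):
    """Return new rows in which every None is replaced by its column's mean.

    rows: list of dicts mapping column name -> number or None.
    A column with no observed values is left as None.
    """
    columns = list(rows[0].keys())
    col_mean = {}
    for col in columns:
        present = [row[col] for row in rows if row[col] is not None]
        col_mean[col] = mean(present) if present else None
    return [
        {col: row[col] if row[col] is not None else col_mean[col] for col in columns}
        for row in rows
    ]


survey = [
    {"lab": 4, "internet": None},
    {"lab": None, "internet": 2},
    {"lab": 2, "internet": 4},
]
filled = fill_with_column_mean(survey)
```

With pandas, `df.fillna(df.mean(numeric_only=True))` performs the same fill for numeric columns. Either way, every imputed cell carries the same value, which is why the strategy degrades as the share of missing cells grows.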
03
Label encoding
Categorical word-labels are mapped to numeric form using a label encoder (pandas), since the classifiers operate on numeric features. Unique values per column are indexed and reversibly encoded.
for col in columns: encode(unique_values(col))
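The reversible encoding described here can be illustrated with a small dictionary-based encoder. This is a sketch, not the encoder the study used: assigning codes in sorted label order is an assumption, and the helper names are hypothetical.

```python
def fit_label_encoder(values):
    """Index the unique labels of one column; returns (classes, label -> code)."""
    classes = sorted(set(values))  # assumption: codes follow sorted label order
    to_code = {label: code for code, label in enumerate(classes)}
    return classes, to_code


def encode(values, to_code):
    """Replace each label with its integer code."""
    return [to_code[v] for v in values]


def decode(codes, classes):
    """Invert the encoding: integer code -> original label."""
    return [classes[c] for c in codes]


reviews = ["Good", "Poor", "Average", "Good"]
classes, to_code = fit_label_encoder(reviews)
codes = encode(reviews, to_code)
restored = decode(codes, classes)
```

Note that integer codes impose an ordering ("Average" < "Good" < "Poor" here) that may not match the survey scale, so an explicit ordered mapping could be preferable for ordinal answers.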
04
Feature selection — three methods in parallel
Chi-Square (significance vs. CGPA), Euclidean distance (features closest to the output), and Correlation (input–output correlation) each rank the attributes and keep their top ~39. Recurring strong features: admission policy, laboratory facilities, internet speed, sincerity & commitment, and the per-semester SGPAs.
chi_square() · euclidean() · correlation() → top-39 each
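The three rankers can be sketched in plain Python as below. The exact formulations used in the paper are not given on this page, so the following are assumptions: Chi-Square is computed as the Pearson statistic on the feature-by-class contingency table, Euclidean distance is taken between the min-max-scaled feature column and the scaled target (smaller is better), and correlation is the absolute Pearson coefficient. The toy data and feature names are made up.

```python
from collections import Counter
from math import sqrt


def chi_square(x, y):
    """Pearson chi-square statistic of the x-by-y contingency table."""
    n = len(x)
    observed = Counter(zip(x, y))
    x_totals, y_totals = Counter(x), Counter(y)
    stat = 0.0
    for xv, x_count in x_totals.items():
        for yv, y_count in y_totals.items():
            expected = x_count * y_count / n
            stat += (observed[(xv, yv)] - expected) ** 2 / expected
    return stat


def min_max(col):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(col), max(col)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in col]


def euclidean(x, y):
    """Distance between the scaled feature column and the scaled target."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(min_max(x), min_max(y))))


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def top_k(scores, k, largest=True):
    """Names of the k best-scoring features."""
    return set(sorted(scores, key=scores.get, reverse=largest)[:k])


def rank_features(features, target, k):
    """Return the top-k feature sets of the three rankers."""
    chi = {name: chi_square(col, target) for name, col in features.items()}
    euc = {name: euclidean(col, target) for name, col in features.items()}
    cor = {name: abs(pearson(col, target)) for name, col in features.items()}
    return top_k(chi, k), top_k(euc, k, largest=False), top_k(cor, k)


target = [0, 0, 1, 1, 2, 2]  # toy CGPA classes
features = {
    "sgpa_1": [0, 0, 1, 1, 2, 2],
    "lab": [0, 1, 1, 1, 2, 2],
    "noise": [1, 0, 1, 0, 1, 0],
}
chi_top, euc_top, cor_top = rank_features(features, target, k=2)
```

On this toy data all three rankers drop the `noise` column; on real survey data they will generally disagree, which is what motivates combining them.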
05
Combine features — intersection of the three rankers
Features appearing in the best-39 of every method are kept, yielding 22 combined features — the eleven per-semester SGPAs plus facility factors such as admission policy/procedure, lab facilities, internet speed, gymnasium, safety, scholarships, and department service/development policy.
combined = chi & euclidean & correlation  # set intersection, 22 features
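In Python the combination step reduces to a set intersection. The sketch below uses short, made-up top lists in place of the real top-39 sets; `cafeteria` and `transport` are hypothetical feature names included only to show features that one ranker selects and the others do not.

```python
def combine(*rankings):
    """Keep only the features that appear in every ranker's top list."""
    return sorted(set.intersection(*(set(r) for r in rankings)))


chi_top = ["sgpa_1", "admission_policy", "lab_facilities", "cafeteria"]
euclidean_top = ["sgpa_1", "lab_facilities", "internet_speed", "admission_policy"]
correlation_top = ["sgpa_1", "admission_policy", "lab_facilities", "transport"]

combined = combine(chi_top, euclidean_top, correlation_top)
```

Intersection is the strictest way to merge rankings: a feature survives only if every method selects it, which trades recall for agreement.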
06
Classification & comparison
Four classifiers — ANN, Decision Tree, KNN, Naïve Bayes — are trained on the combined features and evaluated across 3 / 5 / 9 CGPA classes and three train/test splits (85-15, 75-25, 65-35), then compared on accuracy.
for clf in [ANN, DT, KNN, NB]: evaluate(clf, classes=(3, 5, 9))
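The evaluation grid (class counts by train/test splits) can be sketched as follows. To stay dependency-free, the sketch evaluates only a small hand-written KNN rather than all four classifiers, runs on synthetic data, averages accuracy over the three splits, and uses equal-width CGPA buckets on a 0 to 4 scale. All of these are assumptions made for illustration, and the function names are hypothetical.

```python
import random
from collections import Counter
from math import dist


def to_class(cgpa, n_classes, low=0.0, high=4.0):
    """Equal-width CGPA bucket (assumed binning), 0 = lowest."""
    width = (high - low) / n_classes
    return min(int((cgpa - low) / width), n_classes - 1)


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Share of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def evaluate(X, cgpa, class_counts=(3, 5, 9), test_sizes=(0.15, 0.25, 0.35), seed=0):
    """Mean accuracy over the splits, for each CGPA granularity."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)  # one fixed shuffle, reused for every split
    results = {}
    for n_classes in class_counts:
        y = [to_class(c, n_classes) for c in cgpa]
        scores = []
        for test_size in test_sizes:
            n_test = max(1, round(len(X) * test_size))
            test_idx, train_idx = order[:n_test], order[n_test:]
            scores.append(accuracy(
                [X[i] for i in train_idx], [y[i] for i in train_idx],
                [X[i] for i in test_idx], [y[i] for i in test_idx],
            ))
        results[n_classes] = sum(scores) / len(scores)
    return results


# Synthetic students: one noisy feature that tracks the CGPA.
rng = random.Random(1)
cgpa = [rng.uniform(2.0, 4.0) for _ in range(120)]
X = [[c + rng.gauss(0, 0.1)] for c in cgpa]
results = evaluate(X, cgpa)
```

Swapping `knn_predict` for another model's predict function would extend the same loop to the ANN, Decision Tree, and Naïve Bayes comparisons.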
03 Interactive Charts

The two figures below reproduce the paper's accuracy comparisons with Plotly.js. The first compares all four classifiers after combined feature selection — toggle between 3-, 5-, and 9-class CGPA targets to see the crossover: ANN leads at low granularity, while the Decision Tree pulls ahead as classes increase. The second shows how feature selection lifted the two leading models on the 3-class task.

Figure 1 (chart placeholder): Classifier Accuracy, After Combined Feature Selection
Figure 2 (chart placeholder): Impact of Feature Selection on 3-Class Accuracy (ANN & Decision Tree, before vs. after combined FS)

Note: The values taken directly from the paper are the 3-class after-FS averages (ANN 93.70 · DT 92.18 · KNN 77.74 · NB 68.33), the 9-class DT score of 60.35 (the best at that granularity), and the 3-class before and after figures for ANN and DT. The remaining 5- and 9-class bars for the other classifiers are reconstructed from the trends in the paper's Fig. 3–4 and are illustrative only.

04 Key Results
Best Accuracy
93.70%
ANN · 3-class · after FS
Classifiers Compared
4
ANN · DT · KNN · NB
Combined Features
22
Intersection of 3 rankers (from ~39 each)
DT @ 9 Classes
60.35%
Best at high granularity
  • 01 ANN wins at low granularity — on the 3-class target after combined feature selection, the ANN reaches 93.70%, just ahead of the Decision Tree at 92.18%, with KNN (77.74%) and Naïve Bayes (68.33%) trailing.
  • 02 Decision Tree wins at high granularity — with more classes the DT is strongest, scoring 60.35% at 9 classes (best of all four) and beating the ANN by 11.83 points at 5 classes. The two models effectively swap ranks as the class count changes.
  • 03 Fewer classes → higher accuracy — accuracy improves substantially as the CGPA target is coarsened from 9 → 5 → 3 classes across all four classifiers, making class definition a key design choice for performance prediction.
  • 04 Feature selection materially lifts accuracy: intersecting the Chi-Square, Euclidean, and correlation rankings into 22 shared features raised the 3-class ANN from 76.50% to 93.70%, enough to overtake the Decision Tree.
  • 05 Facilities matter, not just grades — beyond per-semester SGPA, the selected features highlight admission policy, laboratory facilities, internet speed, sincerity & commitment, safety, scholarships, and department service policy as influential factors in students' CGPA.
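The margins behind these results follow from simple arithmetic on the reported 3-class figures. The sketch below uses only numbers quoted on this page; the relative error reduction is a derived quantity added for illustration and is not a figure from the paper.

```python
# Reported 3-class accuracies (%) after combined feature selection.
after_fs = {"ANN": 93.70, "DT": 92.18, "KNN": 77.74, "NB": 68.33}
ann_before = 76.50  # reported 3-class ANN accuracy before feature selection

# Absolute gain from feature selection, in percentage points.
gain_points = round(after_fs["ANN"] - ann_before, 2)

# Margin between the two leading models.
ann_dt_margin = round(after_fs["ANN"] - after_fs["DT"], 2)

# Derived: share of the ANN's original error removed by feature selection.
error_before = 100 - ann_before
error_after = 100 - after_fs["ANN"]
relative_error_cut = round(100 * (error_before - error_after) / error_before, 1)

ranking = sorted(after_fs, key=after_fs.get, reverse=True)
```

Reading the gain as an error reduction puts the 1.52-point ANN lead in context: it is small next to the 17.20-point lift that feature selection gave the ANN.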
05 Contributions
Domain: Educational Data Mining · academic performance prediction
Language: Python · pandas
Feature selection: Chi-Square · Euclidean distance · Correlation → 22 combined features
Classifiers: Artificial Neural Network · Decision Tree · K-Nearest Neighbors · Naïve Bayes
Preprocessing: Null reduction (column-mean fill) · label encoding
Dataset: Kaggle "Student Survey" · Bangladeshi university (IQAC · UGC · World Bank) · 2017
Evaluation: 3 / 5 / 9 CGPA classes · splits 85-15 · 75-25 · 65-35 · accuracy
Published: Springer · InECCE2019 · Lecture Notes in Electrical Engineering

Read the full paper

Open-access copy on CORE · author profile and code on GitHub.