Python · Feature Selection · ANN · DT · KNN · NB · Springer InECCE2019

Student Performance Prediction —
A Comparative Analysis of Four Classification Algorithms

A data-mining study that predicts university students' CGPA from semester results and survey reviews of campus facilities. Three feature-selection methods — Chi-Square, Euclidean distance, and correlation — are combined into 22 shared features, then four classifiers (ANN, Decision Tree, KNN, Naïve Bayes) are benchmarked across 3-, 5-, and 9-class CGPA targets. The headline result: with fewer classes and feature selection, the ANN reaches 93.70% accuracy, narrowly edging the Decision Tree (92.18%).


Domain

Educational Data Mining (EDM)

Dataset

Kaggle "Student Survey" — Bangladeshi university (IQAC · UGC · World Bank), 2017

Target

CGPA class (3 / 5 / 9 buckets) from SGPA + facility reviews

Methods

3 feature-selection methods · 4 classifiers · 3 train/test splits

Publication

Springer · InECCE2019 (Lecture Notes in Electrical Engineering)


01 About

Student academic performance shapes graduate quality, employability, and — at scale — a country's economic and social development. Identifying why performance varies gives institutions the information they need to plan education policy and to flag students who need early support. While many studies have applied data mining to this problem, the paper notes that none had focused on Bangladeshi students — the gap this work sets out to fill.

The study uses a publicly available "Student Survey" dataset from Kaggle, collected via Google Forms at a Bangladeshi university as part of the Institutional Quality Assurance Program (initiated by the University Grants Commission and funded by the World Bank). A per-student average CGPA is computed as the prediction target.

The core idea is to predict a student's CGPA from their semester results (SGPA) and their review of university facilities — admission policy, laboratories, internet speed, gymnasium, safety, scholarships, and more. Rather than rely on a single feature ranker, the work runs three feature-selection methods and keeps only the features they agree on.

Four classifiers are then compared across three CGPA granularities. A central question the paper probes is how the number of prediction classes and the choice of features jointly affect accuracy — and whether an ANN or a Decision Tree comes out ahead depending on those choices.

02 Pipeline

The workflow runs as a six-stage educational-data-mining pipeline: raw survey data is cleaned and encoded, three feature-selection methods run in parallel, their results are intersected into a single combined feature set, and four classifiers are trained and compared across multiple CGPA class definitions and train/test splits.

Raw data — Kaggle Student Survey

Load the survey dataset collected at a Bangladeshi university (academic situation + facility-quality questionnaire). A per-student average CGPA is computed from semester records to serve as the prediction target.

load_dataset() # Kaggle "Student Survey", 2017
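As a rough illustration of how the target could be built, the sketch below averages per-semester SGPA columns into a CGPA and cuts it into 3, 5, or 9 classes. The column names, the sample values, the `bucket_cgpa` helper, and the equal-width edges on a 0.0 to 4.0 scale are all assumptions; the class boundaries the paper actually used are not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical SGPA columns and values; the real dataset's names may differ.
df = pd.DataFrame({
    "sgpa_1": [3.9, 2.4, 3.0],
    "sgpa_2": [3.7, 2.8, 3.3],
})

# Per-student average across the semester columns.
df["cgpa"] = df[["sgpa_1", "sgpa_2"]].mean(axis=1)


def bucket_cgpa(cgpa: pd.Series, n_classes: int) -> pd.Series:
    """Assign each CGPA to one of n equal-width classes on a 0.0-4.0 scale.

    Equal-width edges are an assumption made for illustration only.
    """
    edges = np.linspace(0.0, 4.0, n_classes + 1)
    # labels=False returns the integer index of the bin each value falls in.
    return pd.cut(cgpa, bins=edges, labels=False, include_lowest=True)


for n in (3, 5, 9):
    df[f"class_{n}"] = bucket_cgpa(df["cgpa"], n)

print(df)
```

Coarser targets merge neighbouring CGPA ranges into one label, which is one plausible reason the 3-class task is easier than the 9-class one.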

Null data reduction

Survey data is incomplete, noisy, and inconsistent. Empty fields are imputed with the column mean rather than dropped — a simple strategy the paper notes degrades when the share of missing values grows large.

if cell is null: cell = mean(column)
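A minimal sketch of column-mean imputation with pandas, assuming the data sits in a DataFrame. The column names and values are made up for illustration. Only numeric columns are handled, because a mean is undefined for text answers; how the paper treated missing categorical answers is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical survey ratings with gaps (NaN marks a missing answer).
df = pd.DataFrame({
    "internet_speed": [4.0, np.nan, 2.0],
    "lab_facilities": [3.0, 5.0, np.nan],
})

# Replace each missing cell with the mean of its own column.
filled = df.fillna(df.mean(numeric_only=True))

print(filled)
```

Because every gap in a column receives the same value, heavy missingness flattens that column's variance, which matches the caveat that the strategy degrades as the share of missing values grows.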

Label encoding

Categorical word-labels are mapped to numeric form using a label encoder (pandas), since the classifiers operate on numeric features. Unique values per column are indexed and reversibly encoded.

for col in columns: encode(unique_values(col))
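One way to do this in pandas is `pandas.factorize`, which returns integer codes plus the array of unique values needed to reverse them. Whether the paper used this exact call is an assumption, and the column name and answers below are hypothetical.

```python
import pandas as pd

# Hypothetical categorical survey column.
df = pd.DataFrame({"admission_policy": ["good", "poor", "good", "average"]})

encoders = {}
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        continue  # already numeric, nothing to encode
    codes, uniques = pd.factorize(df[col])
    encoders[col] = uniques  # keep the lookup so the encoding is reversible
    df[col] = codes

# Decode by indexing the stored unique values with the integer codes.
decoded = list(encoders["admission_policy"][df["admission_policy"].to_numpy()])

print(df["admission_policy"].tolist(), decoded)
```

Codes are assigned in order of first appearance, so they carry no ordinal meaning; if the answers are ordered (poor, average, good), an explicit mapping would preserve that order.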

Feature selection — three methods in parallel

Chi-Square (significance vs. CGPA), Euclidean distance (features closest to the output), and Correlation (input–output correlation) each rank the attributes and keep their top ~39. Recurring strong features: admission policy, laboratory facilities, internet speed, sincerity & commitment, and the per-semester SGPAs.

chi_square() · euclidean() · correlation() → top-39 each
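The three rankers can be sketched as follows, using scikit-learn's `chi2` and NumPy. This is an interpretation, not the paper's code: the `rank_features` helper, the min-max scaling before measuring Euclidean distance, and the use of absolute Pearson correlation are all assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import chi2


def rank_features(X, y, names, k):
    """Return the top-k feature names under each of three rankers."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Chi-Square: a higher statistic means stronger dependence on the class.
    chi_scores, _ = chi2(X, y)

    # Euclidean: min-max scale so columns are comparable, then measure the
    # distance from each feature column to the target (smaller = closer).
    def scale(v):
        span = v.max(axis=0) - v.min(axis=0)
        return (v - v.min(axis=0)) / np.where(span == 0, 1, span)

    dist = np.linalg.norm(scale(X) - scale(y.astype(float))[:, None], axis=0)

    # Correlation: absolute Pearson correlation with the target.
    corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )

    def top(scores, largest=True):
        order = np.argsort(-scores if largest else scores)[:k]
        return {names[i] for i in order}

    return {
        "chi": top(chi_scores),
        "euclidean": top(dist, largest=False),
        "correlation": top(corr),
    }


# Synthetic demo: one feature tracks the class label, two are pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
X = np.column_stack([
    y + rng.integers(0, 2, size=200),  # informative
    rng.integers(0, 5, size=200),      # noise
    rng.integers(0, 5, size=200),      # noise
])
names = ["sgpa_like", "noise_a", "noise_b"]
top1 = rank_features(X, y, names, k=1)
print(top1)
```

In the study each ranker keeps roughly its top 39 attributes, which would correspond to `k=39` here. Note that `chi2` requires non-negative feature values, which label-encoded survey answers satisfy.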

Combine features — intersection of the three rankers

Features appearing in the best-39 of every method are kept, yielding 22 combined features — the eleven per-semester SGPAs plus facility factors such as admission policy/procedure, lab facilities, internet speed, gymnasium, safety, scholarships, and department service/development policy.

combined = chi ∩ euclidean ∩ correlation # 22 features
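The intersection step maps directly onto Python sets. The feature names below are placeholders rather than the paper's actual lists, and `combine_rankers` is a hypothetical helper.

```python
def combine_rankers(*rankings):
    """Keep only the features that every ranker selected."""
    return sorted(set.intersection(*(set(r) for r in rankings)))


# Placeholder shortlists standing in for each ranker's top features.
chi = ["sgpa_1", "internet_speed", "lab_facilities", "cafeteria"]
euclidean = ["sgpa_1", "internet_speed", "lab_facilities", "transport"]
correlation = ["sgpa_1", "internet_speed", "library", "lab_facilities"]

combined = combine_rankers(chi, euclidean, correlation)
print(combined)  # ['internet_speed', 'lab_facilities', 'sgpa_1']
```

Intersection is the strictest way to combine rankers: a feature is dropped if any single method leaves it out, which trades recall for agreement.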

Classification & comparison

Four classifiers — ANN, Decision Tree, KNN, Naïve Bayes — are trained on the combined features and evaluated across 3 / 5 / 9 CGPA classes and three train/test splits (85-15, 75-25, 65-35), then compared on accuracy.

for clf in [ANN, DT, KNN, NB]: evaluate(3,5,9 classes)
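A hedged sketch of the comparison loop with scikit-learn. The model classes, hyperparameters, stratified splitting, and averaging across the three splits are assumptions chosen for illustration, and the data is synthetic (22 features, 3 classes), so the printed accuracies bear no relation to the paper's figures.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, test_sizes=(0.15, 0.25, 0.35), seed=0):
    """Return each model's accuracy averaged over the given test splits."""
    # Factories so that every split trains a fresh, unfitted model.
    models = {
        "ANN": lambda: MLPClassifier(
            hidden_layer_sizes=(32,), max_iter=1000, random_state=seed
        ),
        "DT": lambda: DecisionTreeClassifier(random_state=seed),
        "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
        "NB": lambda: GaussianNB(),
    }
    results = {}
    for name, make_model in models.items():
        scores = []
        for test_size in test_sizes:
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, random_state=seed, stratify=y
            )
            model = make_model().fit(X_tr, y_tr)
            scores.append(accuracy_score(y_te, model.predict(X_te)))
        results[name] = sum(scores) / len(scores)
    return results


# Synthetic stand-in for the 22 combined features and a 3-class target.
X, y = make_classification(
    n_samples=300, n_features=22, n_informative=10, n_classes=3, random_state=0
)
results = compare_classifiers(X, y)
for name, acc in results.items():
    print(f"{name}: {acc:.2%}")
```

Repeating the call with 5- and 9-class labels would give the granularity comparison described above.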

03 Interactive Charts


The two figures below reproduce the paper's accuracy comparisons with Plotly.js. The first compares all four classifiers after combined feature selection — toggle between 3-, 5-, and 9-class CGPA targets to see the crossover: ANN leads at low granularity, while the Decision Tree pulls ahead as classes increase. The second shows how feature selection lifted the two leading models on the 3-class task.

[Chart] Classifier Accuracy: After Combined Feature Selection

[Chart] Impact of Feature Selection: 3-Class Accuracy (ANN & Decision Tree, before vs. after combined FS)

ⓘ Exact values reported in the paper: 3-class after-FS averages (ANN 93.70 · DT 92.18 · KNN 77.74 · NB 68.33), 9-class DT 60.35 (best), and the 3-class before→after ANN/DT figures. Intermediate 5- and 9-class bars for the other classifiers are reconstructed from the paper's Fig. 3–4 trends for illustration.

04 Key Results

Best Accuracy

93.70%

ANN · 3-class · after FS

Classifiers Compared

4

ANN · DT · KNN · NB

Combined Features

22

Intersection of 3 rankers (from ~39 each)

DT @ 9 Classes

60.35%

Best at high granularity

01 ANN wins at low granularity — on the 3-class target after combined feature selection, the ANN reaches 93.70%, just ahead of the Decision Tree at 92.18%, with KNN (77.74%) and Naïve Bayes (68.33%) trailing.
02 Decision Tree wins at high granularity — with more classes the DT is strongest, scoring 60.35% at 9 classes (best of all four) and beating the ANN by 11.83 points at 5 classes. The two models effectively swap ranks as the class count changes.
03 Fewer classes → higher accuracy — accuracy improves substantially as the CGPA target is coarsened from 9 → 5 → 3 classes across all four classifiers, making class definition a key design choice for performance prediction.
04 Feature selection materially lifts accuracy — combining the Chi-Square, Euclidean, and correlation rankings into 22 shared features raised the 3-class ANN from 76.50% → 93.70%, enough to overtake the Decision Tree.
05 Facilities matter, not just grades — beyond per-semester SGPA, the selected features highlight admission policy, laboratory facilities, internet speed, sincerity & commitment, safety, scholarships, and department service policy as influential factors in students' CGPA.

05 Contributions

01 First EDM study on Bangladeshi survey data — built a model to identify the key factors behind academic-performance variation and to flag students needing special attention, addressing a gap the paper notes was previously unexplored for Bangladeshi students.
02 Three-method feature-selection ensemble — implemented Chi-Square, Euclidean-distance, and correlation rankers, then intersected their top-39 results into a robust 22-feature combined set rather than trusting any single ranker.
03 End-to-end preprocessing pipeline — handled incomplete survey data via column-mean imputation and converted categorical responses to numeric form with label encoding, producing a clean matrix for all four classifiers.
04 Multi-granularity benchmark — evaluated ANN, Decision Tree, KNN, and Naïve Bayes across 3 / 5 / 9 CGPA class definitions and three train/test splits, surfacing the ANN-vs-DT crossover as granularity changes.
05 Actionable factor analysis — translated feature importance into concrete institutional levers (admission policy, lab and internet facilities, safety, scholarships), supporting data-driven education planning and early intervention.

Domain	Educational Data Mining · academic performance prediction
Language	Python · pandas
Feature Sel.	Chi-Square · Euclidean distance · Correlation → 22 combined features
Classifiers	Artificial Neural Network · Decision Tree · K-Nearest Neighbors · Naïve Bayes
Preprocess	Null reduction (column-mean fill) · label encoding
Dataset	Kaggle "Student Survey" · Bangladeshi university (IQAC · UGC · World Bank) · 2017
Eval	3 / 5 / 9 CGPA classes · splits 85-15 · 75-25 · 65-35 · accuracy
Published	Springer · InECCE2019 · Lecture Notes in Electrical Engineering

Read the full paper

Open-access copy on CORE · author profile and code on GitHub.
