Reading the Brain: Social Media vs. Long-Form Content

This project applied machine learning to predict, from EEG brainwave data alone, whether a person is watching short-form social media (e.g., TikToks or Reels) or long-form educational content. The task was formulated as a binary classification problem and tackled with supervised learning techniques on neurophysiological signals.

Data Collection and ML-Ready Preprocessing

EEG data was collected at 4096 Hz and downsampled to 256 Hz, which puts the Nyquist limit at 128 Hz and so retains the brain-relevant 0–100 Hz band. Recordings were segmented into 0.5-second windows, producing a labeled dataset of shape (21648, 32, 129): 21,648 samples, each holding the voltage on 32 EEG channels over 129 time steps (consistent with 0.5 s at 256 Hz when both endpoints are included). The dataset was split into training, validation, and test sets (80/10/10), with balanced classes labeled -1 (short-form) and 1 (long-form).
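
The windowing and split described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the hop of 128 samples, the function names, and the synthetic recordings are assumptions, and the downsampling step itself is omitted.

```python
import numpy as np

FS = 256        # sampling rate after downsampling (Hz), from the text
WIN_LEN = 129   # samples per window, from the dataset shape (n, 32, 129)
HOP = 128       # assumed hop: 0.5 s at 256 Hz, so neighbours share one sample


def segment_windows(recording, win_len=WIN_LEN, hop=HOP):
    """Cut a (channels, samples) recording into (n_windows, channels, win_len)."""
    n_samples = recording.shape[1]
    n_windows = (n_samples - win_len) // hop + 1
    starts = np.arange(n_windows) * hop
    return np.stack([recording[:, s:s + win_len] for s in starts])


def split_80_10_10(X, y, seed=0):
    """Shuffle once, then slice into train / validation / test partitions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    n_val = int(0.1 * len(X))
    train, val, test = np.split(order, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])


# Synthetic stand-ins for two 10-second, 32-channel recordings (not real EEG).
rng = np.random.default_rng(0)
short_form = rng.standard_normal((32, 10 * FS))
long_form = rng.standard_normal((32, 10 * FS))

X_short = segment_windows(short_form)
X_long = segment_windows(long_form)
X = np.concatenate([X_short, X_long])
y = np.concatenate([-np.ones(len(X_short)), np.ones(len(X_long))])  # -1 / +1

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_80_10_10(X, y)
print(X.shape, train_X.shape, val_X.shape, test_X.shape)
```

A window-level shuffle like this one can place adjacent windows from the same recording in different partitions. The text does not say whether the original split worked this way.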

Baseline Machine Learning Models in the Time Domain

Initial experiments used K-Nearest Neighbors (KNN) trained directly on the raw time series. The model reached 100% test accuracy, a result too good to take at face value for noisy EEG and one that suggested data leakage or spurious correlations. A 5-layer fully connected neural network with ReLU activations, dropout (0.5), and a sigmoid output was also trained; it reached perfect accuracy by the sixth epoch.
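
To make the time-domain baseline concrete, here is a small NumPy sketch of KNN on flattened raw windows. The value k=5, the Euclidean metric, and the synthetic data are assumptions. The class-dependent offset is planted purely to show how a non-neural difference in raw voltage can make the task trivially separable; it is not a claim about what the real recordings contained.

```python
import numpy as np


def knn_predict(train_X, train_y, query_X, k=5):
    """Majority vote over the k nearest training samples (Euclidean distance)."""
    # Squared distances via |a - b|^2 = |a|^2 - 2 a.b + |b|^2.
    d2 = (
        (query_X ** 2).sum(axis=1)[:, None]
        - 2.0 * query_X @ train_X.T
        + (train_X ** 2).sum(axis=1)[None, :]
    )
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Labels are -1 / +1, so the sign of the summed votes is the majority.
    # Ties (possible only for even k) go to +1.
    votes = train_y[nearest].sum(axis=1)
    return np.where(votes >= 0, 1, -1)


# Synthetic windows with the same per-sample shape as the dataset: (32, 129).
rng = np.random.default_rng(0)
n_per_class = 100
X = rng.standard_normal((2 * n_per_class, 32, 129))
y = np.repeat([-1, 1], n_per_class)
X[y == 1] += 3.0  # hypothetical class-dependent offset, not a neural signal

flat = X.reshape(len(X), -1)  # each window becomes one 32 * 129 vector
order = rng.permutation(len(X))
train, test = order[:160], order[160:]

pred = knn_predict(flat[train], y[train], flat[test], k=5)
accuracy = (pred == y[test]).mean()
print(f"test accuracy: {accuracy:.2f}")
```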

These results prompted a closer diagnostic analysis. Training a separate KNN on each channel indicated that the performance came not from meaningful EEG structure but, most likely, from recording artifacts. This highlighted a key ML engineering skill: treating high performance critically rather than assuming that accuracy implies generalisation.
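
One way such a per-channel diagnostic could look is sketched below. The source does not report what pattern the per-channel scores showed, so the data here is synthetic: an offset is planted in one hypothetical channel (index 7) only to show how the diagnostic localises it. The helper name and k=5 are also assumptions.

```python
import numpy as np


def knn_accuracy(train_X, train_y, test_X, test_y, k=5):
    """Accuracy of a k-nearest-neighbour majority vote with labels -1 / +1."""
    diff = test_X[:, None, :] - train_X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    pred = np.where(train_y[nearest].sum(axis=1) >= 0, 1, -1)
    return (pred == test_y).mean()


rng = np.random.default_rng(0)
n_per_class, n_channels, n_steps = 100, 32, 129
X = rng.standard_normal((2 * n_per_class, n_channels, n_steps))
y = np.repeat([-1, 1], n_per_class)

ARTIFACT_CHANNEL = 7  # hypothetical channel carrying a class-dependent offset
X[y == 1, ARTIFACT_CHANNEL, :] += 3.0

order = rng.permutation(len(X))
train, test = order[:160], order[160:]

# Fit and score one KNN per channel, using that channel's time series only.
channel_accuracy = np.array([
    knn_accuracy(X[train, ch], y[train], X[test, ch], y[test])
    for ch in range(n_channels)
])
best_channel = int(channel_accuracy.argmax())
print("best channel:", best_channel)
print("accuracy on that channel:", channel_accuracy[best_channel])
```

If a single channel in isolation separates the classes while the rest sit near chance, the signal driving the classifier is local to that electrode, which is easier to explain as an artifact than as distributed neural activity.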

Signal Transformation and Robust Feature Extraction

To reduce the overfitting and uncover generalisable features, I transformed the data into the frequency domain using spectrograms and discarded the 60 Hz power-line noise. This aligned the feature representation with the frequency bands in which neural activity is usually described.
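
A hedged sketch of this transformation, using a short-time Fourier transform written directly in NumPy. The segment length (64 samples), the 50% overlap, the Hann taper, the 100 Hz cut-off, and the 4 Hz notch half-width are all assumptions; the source states only that spectrograms were used and that 60 Hz was discarded.

```python
import numpy as np

FS = 256      # Hz, sampling rate after downsampling (from the text)
SEG_LEN = 64  # assumed STFT segment length, giving 4 Hz frequency resolution
HOP = 32      # assumed hop between segments (50% overlap)


def spectrogram_features(windows, fs=FS, seg_len=SEG_LEN, hop=HOP,
                         f_max=100.0, notch_hz=60.0, notch_width=4.0):
    """Power spectrogram per channel, keeping 0..f_max Hz minus the mains band.

    windows: (n_windows, n_channels, n_steps)
    returns: features (n_windows, n_channels, n_freqs, n_frames), freqs (n_freqs,)
    """
    n_steps = windows.shape[-1]
    starts = np.arange(0, n_steps - seg_len + 1, hop)
    taper = np.hanning(seg_len)
    # Stack the overlapping segments on a new axis: (..., n_frames, seg_len).
    frames = np.stack([windows[..., s:s + seg_len] for s in starts], axis=-2)
    power = np.abs(np.fft.rfft(frames * taper, axis=-1)) ** 2
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
    # Drop everything above f_max and every bin within notch_width of 60 Hz.
    keep = (freqs <= f_max) & (np.abs(freqs - notch_hz) > notch_width)
    # Reorder to (..., n_freqs, n_frames), the usual spectrogram layout.
    return np.swapaxes(power[..., keep], -1, -2), freqs[keep]


# Synthetic test signal: a 12 Hz tone plus stronger 60 Hz hum and a little noise.
rng = np.random.default_rng(0)
t = np.arange(129) / FS
signal = np.sin(2 * np.pi * 12 * t) + 2.0 * np.sin(2 * np.pi * 60 * t)
windows = signal + 0.1 * rng.standard_normal((4, 32, 129))

features, freqs = spectrogram_features(windows)
peak_hz = freqs[features.mean(axis=(0, 1, 3)).argmax()]
print(features.shape, "strongest remaining frequency:", peak_hz)
```

The notch is wider than a single bin because the taper spreads the 60 Hz line into its neighbours at 56 and 64 Hz. With the notch in place, the 12 Hz tone dominates the remaining spectrum even though the hum is stronger in the raw signal.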

KNN was then applied to this representation. Accuracy dropped to ~65%, but it was now consistent and more plausibly generalisable, suggesting that meaningful class information is encoded in the spectral features rather than in the raw voltages.

A grid search across different k values and channel subsets was conducted to fine-tune the classifier and assess stability.
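
A grid search of this kind could be sketched as below. The subset names, the k values, and the synthetic features are hypothetical, and the accuracies this toy data produces say nothing about the ~65% reported above.

```python
import numpy as np


def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)


def grid_search_knn(train_X, train_y, val_X, val_y, k_values, channel_subsets):
    """Validation accuracy for every (channel subset, k) pair.

    train_X / val_X: (n_samples, n_channels, n_features); labels are -1 / +1.
    channel_subsets: dict mapping a name to a list of channel indices.
    """
    results = {}
    for name, channels in channel_subsets.items():
        tr = train_X[:, channels].reshape(len(train_X), -1)
        va = val_X[:, channels].reshape(len(val_X), -1)
        # Neighbour order depends on the subset, not on k, so sort once.
        ranked = np.argsort(pairwise_sq_dists(va, tr), axis=1)
        for k in k_values:
            votes = train_y[ranked[:, :k]].sum(axis=1)
            pred = np.where(votes >= 0, 1, -1)
            results[(name, k)] = float((pred == val_y).mean())
    return results


# Synthetic features: 8 channels x 10 features; only channels 0-3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8, 10))
y = np.repeat([-1, 1], 150)
X[y == 1, :4, :] += 2.0

order = rng.permutation(len(X))
train, val = order[:240], order[240:]

channel_subsets = {
    "informative": [0, 1, 2, 3],
    "noise": [4, 5, 6, 7],
    "all": list(range(8)),
}
results = grid_search_knn(X[train], y[train], X[val], y[val],
                          k_values=[1, 3, 5, 7],
                          channel_subsets=channel_subsets)
best_subset, best_k = max(results, key=results.get)
print("best setting:", best_subset, best_k, results[(best_subset, best_k)])
```

Scoring on a validation partition keeps the test set untouched for the final estimate. Comparing accuracies across neighbouring k values and across subsets gives a rough picture of how stable the classifier is.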