The AHEPA EEG benchmark: setting the standard for machine learning in dementia diagnosis, a scoping review

Overview of included studies

The systematic search and screening process yielded 46 empirical papers that applied one or more ML methods to the AHEPA EEG dataset for the differential diagnosis of AD. Collectively, these studies form the largest body of work built on a single open-access EEG dataset in dementia research to date, highlighting the growing significance of shared resources in the field. The included studies are outlined in Table 1. Each row lists the first author, primary classifier, input features, validation method, and assigned validity score. The validity score (1–3) indicates the methodological rigor of each study, with 1 indicating high validity and 3 indicating major methodological inadequacies.

Table 1 Summary of the 46 included studies employing machine learning methods on the AHEPA EEG dataset for AD classification

The methodological validity of the included studies varied considerably. After the validity grouping analysis, 12 studies were categorized as Validity 1, 12 as Validity 2, and 22 as Validity 3 (see Fig. 5). Validity 1 studies generally followed best practices, such as epoch-based processing combined with subject-level cross-validation (e.g., leave-one-subject-out, LOSO). In contrast, Validity 3 studies often exhibited methodological issues such as data leakage (e.g., using k-fold cross-validation at the epoch rather than the subject level, or applying train–test splits without addressing subject overlap), inadequate preprocessing, or implausibly high reported performance resulting from poor validation practices.
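The difference between subject-level and epoch-level validation can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the function names, the toy subject IDs, and the interleaved fold assignment are all assumptions made for illustration. It only checks whether any subject contributes epochs to both the training and the test side of a split, which is the leakage pattern described above.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx); all epochs of one subject form the test set."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for held_out, test_idx in by_subject.items():
        train_idx = [i for sid, idxs in by_subject.items() if sid != held_out for i in idxs]
        yield held_out, train_idx, test_idx


def epoch_level_kfold(n_epochs, k):
    """Naive k-fold over epochs that ignores subject identity (leak-prone)."""
    for fold in range(k):
        test_idx = [i for i in range(n_epochs) if i % k == fold]
        train_idx = [i for i in range(n_epochs) if i % k != fold]
        yield train_idx, test_idx


def shared_subjects(subject_ids, train_idx, test_idx):
    """Subjects that appear on both sides of a split."""
    return {subject_ids[i] for i in train_idx} & {subject_ids[i] for i in test_idx}


# Hypothetical toy data: 4 subjects with 3 epochs each.
subject_ids = [s for s in ("S1", "S2", "S3", "S4") for _ in range(3)]

loso_leaks = [shared_subjects(subject_ids, tr, te)
              for _, tr, te in leave_one_subject_out(subject_ids)]
kfold_leaks = [shared_subjects(subject_ids, tr, te)
               for tr, te in epoch_level_kfold(len(subject_ids), 3)]
```

In this toy setting every LOSO split has no shared subjects, while every epoch-level fold contains epochs from all four subjects on both sides, so a model can be rewarded for recognizing individuals rather than diagnoses.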

To provide transparency, Table 2 summarizes the reasoning behind each validity rating. Each entry briefly explains the methodological strengths or weaknesses that justified the assigned score, as well as the criteria not met, according to the validation scheme.

Table 2 Overview of the 46 included studies using the AHEPA EEG dataset for Alzheimer’s classification, showing author, classifier, input, validation method, and validity score

Diagnostic classification tasks

In addition to the validity classification, the studies included in this review were also organized according to the specific diagnostic problems they addressed. The overwhelming majority of studies focused on the binary classification of patients with AD and CN. This task has become the prevailing benchmark in the field due to its clinical relevance and relative tractability. Overall, binary classification studies accounted for nearly half of all publications identified.

Aside from the AD vs. CN comparison, several studies aimed to explore more complex clinical questions. Multiple studies attempted to differentiate AD from FTD, which illustrates an immediate clinical need to distinguish dementia syndromes that share symptoms. Additionally, a significant number of works focused on the binary classification of FTD vs. CN. A limited but growing number of papers have extended this comparison to multi-class formulations, often using AD vs. FTD vs. CN. Multi-class classifications offer more potential insight into the generalizability of the model’s conclusions, but also represent a more challenging classification problem. An additional common classification approach involved grouping together several dementia conditions (e.g., AD and FTD vs. CN) to assess the ability of EEG-based ML models to generalize across pathological subtypes and to evaluate whether such models can distinguish “dementia” as an umbrella category from healthy brain activity.
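The problem formulations described above differ only in how the three diagnostic groups are mapped to class labels. The following sketch makes that mapping explicit; the `TASKS` dictionary, the `build_labels` helper, the integer label codes, and the example diagnosis list are hypothetical and chosen only for illustration.

```python
# Each task maps diagnosis codes to class labels; groups absent from a mapping are excluded.
TASKS = {
    "AD vs. CN": {"AD": 1, "CN": 0},
    "FTD vs. CN": {"FTD": 1, "CN": 0},
    "AD vs. FTD": {"AD": 1, "FTD": 0},
    "AD & FTD vs. CN": {"AD": 1, "FTD": 1, "CN": 0},   # pooled "dementia" class
    "AD vs. FTD vs. CN": {"AD": 0, "FTD": 1, "CN": 2},  # three-class problem
}


def build_labels(diagnoses, task):
    """Return (kept_indices, labels) for one task; subjects outside the task are dropped."""
    mapping = TASKS[task]
    kept = [i for i, d in enumerate(diagnoses) if d in mapping]
    return kept, [mapping[diagnoses[i]] for i in kept]


# Hypothetical per-subject diagnoses.
diagnoses = ["AD", "CN", "FTD", "AD", "CN"]
kept, labels = build_labels(diagnoses, "AD & FTD vs. CN")
```

Note that the binary tasks discard one diagnostic group entirely, so studies addressing different tasks are trained and evaluated on different subsets of the same dataset.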

Table 3 displays the distribution of problem formulations across the 46 studies included in the review. Overall, the binary AD vs. CN classification remains the principal and most frequently used problem formulation, which keeps studies comparable. At the same time, the move toward more ambitious multi-class and differential diagnostic problems reflects the heterogeneity and complexity of dementia in the clinic.

Table 3 Classification performance across different diagnostic comparisons

Performance on AD vs. CN classification

The binary classification of AD patients versus CN controls is the most widely used benchmark against which studies using the AHEPA dataset are compared. This task is clinically relevant because it parallels the fundamental clinical challenge of separating individuals with pathological findings from individuals representative of healthy aging. It also tends to be more straightforward for researchers, as differential diagnoses are often more complex.

Across the 46 included studies, AD vs. CN was the task studied most often and was reported in most of the articles. Table 4 summarizes the results of studies on AD vs. CN classification, listing the studies alongside their respective performance metrics. The differences in performance between papers assigned to different validity groups should also be noted: studies with better validity often report lower performance. This comparison is illustrated in Fig. 6, where the mean accuracy and F1-score are presented for each validity group.

Table 4 Reported performance of included studies on AD vs. CN classification

Fig. 6 Performance results comparison between the 3 different validity groups for the AD/CN problem

Across all studies, the overall mean accuracy for AD vs. CN classification was 90.81% (SD = 9.7). Accuracy varied considerably with the methodological rigor of the validation strategy. For studies rated Validity 1 (subject-level LOSO or equivalent, n = 13), the mean accuracy was more modest and plausible for an AD classification task, averaging 82.11% (SD = 6.93), whereas studies rated Validity 2 (n = 8) showed a higher mean accuracy of 89% (SD = 11). Among Validity 3 studies (n = 19) that reported AD vs. CN performance, many presented “almost perfect” results, with a mean accuracy of 97.07% (SD = 3.09), an implausibly high value given the size and complexity of the dataset. These results very likely stem from methodological limitations, including epoch-level cross-validation or data leakage, which produce unrealistic and inflated accuracy estimates.

A similar pattern was seen in the F1-score results. The overall mean F1-score for AD vs. CN across studies was 90.38% (SD = 9.41), but stratifying by validity category again revealed clear differences. Studies rated Validity 1 showed an average F1-score of 81.57% (SD = 4.85), studies rated Validity 2 averaged 89.48% (SD = 7.26), and Validity 3 studies again produced the highest values, with an average F1-score of 97.06% (SD = 2.93), consistent with over-optimistic evaluation protocols.

Overall, the results demonstrate a systematic difference between studies with rigorously validated designs, which produce moderate, believable performance estimates, and studies rated methodologically weaker, which report unrealistically high accuracies.
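The stratified summaries reported in this section amount to grouping per-study accuracies by validity rating and computing a mean and standard deviation per group. A minimal sketch follows. The `records` values are placeholders and are NOT taken from Table 4, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_validity(records):
    """Map validity rating -> (n, mean accuracy, sample SD) from (validity, accuracy) pairs."""
    groups = defaultdict(list)
    for validity, accuracy in records:
        groups[validity].append(accuracy)
    return {
        v: (len(acc), mean(acc), stdev(acc) if len(acc) > 1 else 0.0)
        for v, acc in sorted(groups.items())
    }


# Placeholder (validity, accuracy %) pairs for illustration only.
records = [(1, 80.0), (1, 84.0), (2, 88.0), (2, 92.0), (3, 96.0), (3, 98.0)]
summary = summarize_by_validity(records)

for validity, (n, m, sd) in summary.items():
    print(f"Validity {validity}: n = {n}, mean = {m:.2f}%, SD = {sd:.2f}")
```

Reporting the group size next to each mean matters here, because the pooled mean across all studies is dominated by whichever validity group contributes the most studies.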

Performance on FTD vs. CN

Among the included studies, the classification of FTD versus CN has been explored less frequently but remains clinically significant. Distinguishing FTD from controls helps clarify whether EEG-based signatures of dementia capture pathological processes that are not limited to AD, and it establishes a baseline for differential diagnosis tasks that are more complex than AD vs. CN. Table 5 provides an overview of the 25 studies addressing FTD vs. CN classification and their corresponding metrics. As in the previous section, the differences in classification performance between the validity groups are presented in Fig. 7.

Table 5 Reported performance of included studies on FTD vs. CN classification

Fig. 7 Performance results comparison between the 3 different validity groups for the FTD/CN problem

The average accuracy across all studies investigating the FTD vs. CN classification problem was 86.53% (SD = 11), and the average F1-score was 86.58% (SD = 3.31). When the studies were separated by methodological validity, distinct differences in performance became evident. Validity 1 studies (rigorous subject-level validation such as LOSO) reported a moderate but credible mean accuracy of 75.18% (SD = 7.55) and a mean F1-score of 65.85% (SD = 5.52), which reflects the greater difficulty of this task. In contrast, Validity 2 studies, which employed subject-level train–test splits without full cross-subject retesting, reported substantially higher values, with a mean accuracy of 91.53% (SD = 2.71) and a mean F1-score of 92.07% (SD = 4.6). Lastly, Validity 3 studies frequently employed per-epoch k-fold cross-validation or other non-independent data splits and consequently reported near-perfect performance, with a mean accuracy of 94.45% (SD = 5.41) and a mean F1-score of 96.16% (SD = 3.07).

These results demonstrate that, consistent with findings from the AD vs. CN classification task, methodological rigor exerts a strong influence on reported performance. The large disparity between Validity 1 and Validity 3 illustrates how weaker validation strategies can overinflate accuracy and F1-scores, overstating a model’s generalizability to unseen subjects. The more conservative yet realistic outcomes of Validity 1 studies suggest that reliable discrimination of FTD and CN is possible using EEG data, although it still requires substantially more effort than the AD vs. CN classification task.

Performance on AD vs. FTD classification

The AD vs. FTD classification problem represents one of the most clinically relevant yet challenging comparisons in EEG-based dementia research. Unlike the AD vs. CN task, which contrasts diseased versus healthy subjects, this problem requires distinguishing between two distinct neurodegenerative syndromes that can share overlapping clinical and electrophysiological features. As a result, it serves as a more stringent test of model generalizability and clinical utility.

Table 6 lists the overall results from studies that directly investigated the AD vs. FTD classification task using the AHEPA dataset. The mean accuracy of the studies was 88.97% (SD = 9.92), with an average F1-score of 94.09% (SD = 4.1). As in the prior benchmarks, the strictness of the validation approach shaped the performance results considerably.

Table 6 Reported performance on AD vs. FTD classification using the AHEPA EEG dataset

For example, studies using Validity 1 methods (rigorous subject-level validation such as LOSO) had a modest mean accuracy of 71.44% (SD = 2.04), which, as with the other classification tasks, indicates the greater difficulty of this type of differential diagnosis. Studies using Validity 2 methods had a higher mean accuracy (93%, SD = 5.09), and Validity 3 studies reported comparably high accuracy (92.82%, SD = 5.81); these values likely overstate generalization because of weaker validation schemes or data leakage. Nevertheless, the accuracies reported for the AD vs. FTD problem, although widely disparate, indicate that EEG-based classification can capture disease-specific patterns fairly well when more advanced representations of the EEG data are used.

Performance on AD & FTD vs. CN classification

The AD & FTD vs. CN formulation groups all AD and FTD cases into a single “dementia” category which is then contrasted with CN cases. This is clinically relevant because it assesses whether EEG-based ML can generalize across dementia subtypes to detect neurodegenerative pathology overall, rather than discriminate between specific disease types. Methodologically, this represents an intermediate level of difficulty, more complex than the AD vs. CN task, but not as demanding as a full multi-class classification (AD vs. FTD vs. CN).

Table 7 summarizes the studies on this task. Generally, reported performances were moderate to high, with an average accuracy of 87.1%. The mean accuracy indicates good separation of dementia from controls. However, the results should be interpreted with caution due to the heterogeneity resulting from collapsing AD and FTD into one diagnostic category.

The small number of studies in this category limits generalizability, but some trends were consistent with those observed in the previously reported tasks: studies implementing subject-level validation reported more realistic accuracies (around 75–85%), while studies using simpler train–test splits or epoch-level cross-validation reported higher values that are more likely to be inflated.

Table 7 Reported performance on AD & FTD vs. CN classification using the AHEPA EEG dataset

Performance on AD vs. FTD vs. CN classification

The AD/CN/FTD task represents a multi-class classification challenge, requiring EEG-based models to simultaneously distinguish among AD, FTD, and CN subjects. This setup provides a more clinically realistic scenario by testing whether models can generalize across different dementia subtypes rather than relying on simple binary contrasts. Multi-class configurations of this kind are generally more difficult, as the EEG markers of AD and FTD often overlap and exhibit substantial intra-subject variability.

Table 8 shows the reported classification performance in this three-class context using the AHEPA EEG dataset. Across all studies, the mean accuracy was 84.29% (SD = 10.72) and the mean F1-score was 85.29% (SD = 15.63). The differences in performance between the validity groups are presented in Fig. 8.

Table 8 Reported performance on the three-class (AD vs. FTD vs. CN) problem using the AHEPA EEG dataset

Fig. 8 Performance results comparison between the 3 different validity groups for the AD/FTD/CN problem

When disaggregated by validity, Validity 1 studies (subject-level LOSO) achieved a mean accuracy of 69.99% (SD = 8.51) and a mean F1-score of 59.53%, which reflects the complexity of the task. Validity 2 studies performed considerably better, with 88.25% (SD = 4.36) mean accuracy and 86.02% (SD = 3.56) mean F1-score. Finally, Validity 3 studies showed highly favorable performance, with 92.64% (SD = 9.84) accuracy and 97.44% (SD = 2.21) F1-score, which is indicative of methodological weaknesses including epoch-level cross-validation or subject overlap between training and test data.

Classifier distribution and benchmark performance

The evaluation of classifier use across the 46 studies revealed large variability in methodological decisions. The analysis showed a trend away from traditional feature-based classifiers toward deep learning, with convolutional and graph-based neural networks becoming the most common family of algorithms. Classical ML algorithms such as SVM or Random Forest continue to be popular, largely due to their ease of use and interpretability, although they are steadily being replaced or augmented by architectures that learn spatial–temporal patterns directly from the EEG data (for instance, neural approaches that incorporate encoder architectures).

To enable a more systematic comparison of methodological performance, the classifiers reported in the included studies were grouped into four overarching families: traditional ML (e.g., SVM, decision trees), CNN-based neural networks, other neural network architectures (e.g., recurrent or graph-based models), and hybrid or ensemble-based approaches. The following subsections discuss the classifier families represented in the included studies, focusing on those appropriate for both EEG data and the AHEPA dataset. A description of these classifier categories is provided in Table 9.

Table 9 Description of classifier categories

CNN-based neural networks

CNNs (convolutional neural networks) are the most commonly applied class of models in the examined corpus. Their main advantage is their ability to directly learn spatial and spectral representations from EEG-derived images—such as spectrograms, wavelet scalograms, and connectivity matrices—which makes them particularly well suited for modeling time–frequency aspects of neural slowing and topographic power redistribution associated with AD. Table 10 summarizes the studies that employed CNN-based models, along with their reported accuracy for the AD vs. CN classification task and their assigned validity.

Despite the widespread use of CNNs, accuracy differed widely across studies depending on the validation strategy. Studies that employed some form of subject-level validation typically reported accuracies in the 80–85% range, whereas those evaluating epochs with k-fold cross-validation or single train–test splits showed artificially high accuracies (> 95%), suggesting the presence of data leakage. This distinction highlights that validation strategy, rather than model architecture, is one of the most significant factors influencing reported performance.

Table 10 Summary of studies employing CNN-based architectures for EEG-based Alzheimer’s disease detection

Other neural networks

A small number of studies used neural network architectures outside the traditional CNN frameworks, which we collectively refer to as “Other Neural Networks”. Examples include Transformer-based models, Graph Neural Networks (GNNs), lightweight LSTMs, and fusion-based approaches. These architectures are designed to capture long-range temporal dependencies, multimodal relationships, or non-Euclidean spatial structures that may not be fully represented by convolutional architectures.

Transformers and graph-based networks can also model EEG functional connectivity and cross-channel relationships, yielding potentially richer descriptions of the dynamics of the brain networks. Studies using these architectures reported accuracy levels similar to those achieved by CNN-based models, while also offering enhanced interpretability and computational efficiency through attention mechanisms or other explainable components. Table 11 provides a summary of these studies, listing the first author, year, reported accuracy for AD vs. CN classification (if it was available), and the validation strategy or data split used.

Table 11 Summary of studies employing neural networks for EEG-based Alzheimer’s disease detection

Traditional ML classifiers

Traditional ML algorithms continue to be the method of choice in some studies that use EEG data to detect AD. These studies rely on hand-engineered EEG features (for example, power spectral ratios, entropy indices, and recurrence quantification) that are fed into classifiers such as Support Vector Machines (SVMs), Random Forests (RFs), and Decision Trees. While this approach lacks the end-to-end feature learning of deep neural networks, it remains popular due to its interpretability, computational simplicity, and robustness with small datasets.

The studies summarized in Table 12 report a wide range of accuracies, from 70% to 99%, depending on the feature selection, preprocessing, and validation procedures used. Traditional approaches may not model complex spatiotemporal EEG dynamics well, but these studies provide important interpretive information and serve as reasonable baselines as deep learning architectures develop.

Table 12 Summary of studies employing traditional ML techniques for EEG-based Alzheimer’s disease detection

Ensemble models

A limited number of investigations employed ensemble learning strategies in EEG-based diagnosis of AD. Ensemble models aggregate predictions from multiple base classifiers (e.g., neural networks, decision trees, support vector machines) to improve robustness to classification error, reduce the likelihood of overfitting, and improve generalization in heterogeneous EEG datasets. These models exploit the multi-modality of the different architectures to learn complementary EEG representations (both spatially and temporally), and as a result, have very stable predictive performance.
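One common way to aggregate predictions from several base classifiers is hard majority voting, sketched below. This is a generic illustration and not the fusion rule of any reviewed study, which may instead use weighted voting, stacking, or probability averaging. The `majority_vote` helper and the three prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Fuse per-model label lists (all of equal length) by taking the most frequent label per sample."""
    n_samples = len(predictions_per_model[0])
    fused = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        fused.append(votes.most_common(1)[0][0])
    return fused


# Hypothetical predictions from three base classifiers on four test subjects.
cnn_preds = ["AD", "CN", "AD", "CN"]
svm_preds = ["AD", "AD", "AD", "CN"]
tree_preds = ["CN", "CN", "AD", "AD"]

fused = majority_vote([cnn_preds, svm_preds, tree_preds])
```

With an odd number of models and two classes no ties can occur; for other configurations a tie-breaking rule would need to be chosen explicitly. Ensembling does not remove leakage: if the base models are evaluated on epoch-level splits, the fused predictions inherit the same optimistic bias.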

In the three articles identified in the review, ensemble methods achieved very high accuracy (> 95%) in classifying AD vs. CN participants. Specifically, Hachamnia et al. (2025) developed an ensemble learning framework in which multiple deep and classical models were used to classify patients with AD and FTD, reaching 95.08% accuracy. Barua et al. (2024) presented an ensemble-based EEG model called N-BodyPat, which reached 99.85% accuracy and was reported to generalize well across subjects. As summarized in Table 13, the ensemble frameworks are among the highest-performing approaches reported for EEG-based dementia classification, although no ensemble study was rated Validity 1, so an objective reference point is lacking.

Table 13 Summary of studies employing ensemble models for EEG-based Alzheimer’s disease detection

Hybrid models

Hybrid models are systems that integrate diverse neural network architectures or other classifiers to leverage the complementary strengths of each approach. These frameworks typically combine convolutional, graph-based, and classical ML components, linking automatic feature extraction with higher-level decision fusion. For instance, Ayanbek et al. (2025) proposed a hybrid ensemble and neural network model integrating CNN-based deep learning with boosting, achieving an accuracy of 78.87% for AD detection. Similarly, Jain and Srivastava (2025) introduced a hybrid neural network combining CNN and non-CNN components, which attained 95.90% accuracy, while Latifoğlu et al. (2025) [43] developed a hybrid of a non-CNN neural network and traditional ML that reached 98.46%, demonstrating the high discriminative power of multi-paradigm integration. Other studies, such as those by Nayana et al. (2025) and Khalfallah et al. (2025), explored integrative CNN–ML frameworks, merging deep representations with handcrafted EEG features to improve generalization and mitigate overfitting. The hybrid approaches are summarized in Table 14.

Table 14 Summary of studies employing hybrid techniques for EEG-based Alzheimer’s disease detection

Figure 9 visualizes and summarizes the performance differences between the classifier families for the AD/CN problem. Each ML category is shown in a different color, with its mean accuracy reported overall and for each validity category. Hybrid and CNN approaches achieve the best overall performance, with 91.1% and 91% accuracy, respectively. Under the stricter criterion of considering only the Validity 1 group, traditional ML approaches have reported the best performance so far, with 85.3% average accuracy.

Fig. 9 Accuracy of the different ML categories for the AD/CN problem, also separated by their validity group

Quantitative analysis of the impact of validation rigor on reported performance

For the AD–CN classification task, methodological rigor demonstrated a strong and statistically robust association with reported performance. A Kruskal–Wallis test revealed significant differences in accuracy across validity groups (H = 23.62, p < 0.001), with a very large effect size (η² = 0.60), indicating that validation quality accounts for a substantial proportion of performance variance. Post-hoc Dunn comparisons with Bonferroni correction showed that Validity 3 studies reported significantly higher accuracies than both Validity 1 (p < 0.001) and Validity 2 (p = 0.037) studies, whereas the difference between Validity 1 and Validity 2 was not statistically significant (p = 0.367). Pairwise Cliff’s delta values confirmed large to extreme effect sizes, particularly between Validity 1 and Validity 3 (δ = −0.98), consistent with systematic performance inflation in studies employing weaker validation protocols. Complementary linear regression analysis further demonstrated a strong and statistically significant association between validity level and reported accuracy (β = 7.51, p < 0.001), with methodological validity explaining 52% of the variance in performance (R² = 0.52). This indicates that each step toward lower validation rigor is associated with an average increase of approximately 7.5 percentage points in reported accuracy.
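The statistics named in this analysis can be sketched in plain Python. The sketch makes several assumptions: the toy accuracy lists are placeholders rather than study data, the Kruskal–Wallis H is computed without the tie correction, and η² is derived from H as (H − k + 1) / (n − k), which is one common definition that the text does not confirm was the one used. Dunn post-hoc tests and p-values are not covered.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, without tie correction (assumption)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def eta_squared_from_h(h, k, n):
    """Effect size from H for k groups and n observations (assumed definition)."""
    return (h - k + 1) / (n - k)


def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Placeholder accuracies (%) per validity group, for illustration only.
v1, v2, v3 = [78.0, 80.0, 82.0], [85.0, 88.0, 91.0], [95.0, 97.0, 99.0]
h = kruskal_wallis_h([v1, v2, v3])
eta2 = eta_squared_from_h(h, k=3, n=9)
delta_v1_v3 = cliffs_delta(v1, v3)
```

A negative Cliff's delta for the (Validity 1, Validity 3) pair means that Validity 1 accuracies are almost always lower than Validity 3 accuracies, which matches the sign convention of the δ values reported above.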

For the FTD–CN classification task, methodological rigor was again strongly and statistically significantly associated with reported performance. A Kruskal–Wallis test revealed significant differences in accuracy across validity groups (H = 15.23, p < 0.001), with a very large effect size (η² = 0.66), indicating that validation quality explains a substantial proportion of performance variance. Post-hoc Dunn comparisons with Bonferroni correction showed that Validity 3 studies reported significantly higher accuracies than Validity 1 studies (p < 0.001), whereas differences between Validity 1 and Validity 2 (p = 0.157) and between Validity 2 and Validity 3 (p = 1.000) were not statistically significant. Pairwise Cliff’s delta values indicated large to extreme effect sizes, particularly between Validity 1 and Validity 3 (δ = −0.96), supporting the presence of marked performance inflation in studies employing weaker validation protocols. Complementary linear regression analysis further confirmed a strong and statistically significant association between validity level and reported accuracy (β = 9.55, p < 0.001), with methodological validity explaining 67.6% of the variance in performance (R² = 0.68). These results indicate that each step toward lower validation rigor is associated with an average increase of approximately 9.5 percentage points in reported accuracy.
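The regression reported for both tasks treats the validity level as a numeric predictor of accuracy, so the slope β reads as the change in accuracy per validity step. The following is a minimal ordinary least squares sketch under stated assumptions: validity is coded 1, 2, 3, the accuracy values are placeholders rather than study data, and significance testing is omitted.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Placeholder data: validity level (1 = most rigorous) and reported accuracy (%).
validity = [1, 1, 2, 2, 3, 3]
accuracy = [80.0, 84.0, 88.0, 92.0, 96.0, 100.0]

slope, intercept, r_squared = fit_line(validity, accuracy)
print(f"beta = {slope:.2f} percentage points per validity step, R^2 = {r_squared:.2f}")
```

Coding the validity rating as an equally spaced numeric variable assumes that the step from Validity 1 to 2 is comparable to the step from 2 to 3, which is a modeling choice rather than a property of the rating scheme.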

Figure 10 shows the linear regression between validation rigor and reported classification accuracy across the included studies for the AD/CN (left) and for the FTD/CN (right) problem.

Fig. 10 Linear regression analysis illustrating the relationship between validation rigor and reported classification accuracy for the AD–CN (left) and FTD–CN (right) tasks. Each point represents an individual study, and the fitted regression lines demonstrate the positive association between weaker validation protocols and higher reported performance
