This study analysed data from the LIFE Child study, an ongoing longitudinal epidemiological cohort study conducted at the Research Centre for Civilization Diseases in Leipzig, Germany (for details see [24]). The LIFE Child study protocols adhere to the Declaration of Helsinki. Written informed consent was obtained from all parents and from all children over 12 years of age.
We included children aged 5 to 18 years for whom both a speaking-voice examination and SDQ HI scores were available at one or more time points. Data from 1460 children (female: n = 716; male: n = 744) with a mean age of 10.96 years (range: 5.65–18.22) were included. Participants belonged either to the Health Cohort (n = 1297) or to the Obesity Cohort (n = 163). Examinations took place between 2012 and 2015. The number of visits per child ranged from 1 to 4 (1 visit: n = 736; 2 visits: n = 462; 3 visits: n = 231; 4 visits: n = 31).
Methods
Demographic and health data
Demographic and health data were collected through clinical interviews and questionnaires or by clinical examination. All examinations were carried out by trained medical investigators or paediatricians in the child-friendly facilities of the outpatient clinic of the Research Centre for Civilization Diseases in Leipzig. In addition to voice-specific data, we included the child's age, sex, Body Mass Index (BMI), socio-economic status (SES) and pubertal status. Pubertal status was included because information on voice change was missing for most participants. The examination methods relevant to this study are described below and in detail in the study protocol [24].
Socio-economic status
SES was calculated using a multidimensional index based on information provided by parents on their school education and vocational training, their occupational status, and their net equivalent income. The operationalisation follows that of the KiGGS study of the Robert Koch Institute [25]. SES scores range from 3.0 to 21.0 and are classified as low (3.0–8.4), medium (8.5–15.4), or high (15.5–21.0).
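The SES banding above amounts to a simple threshold lookup. A minimal sketch, assuming scores are reported to one decimal place so that values between the stated band limits (e.g. between 8.4 and 8.5) do not occur; the function name `ses_category` is hypothetical and not part of the KiGGS materials:

```python
def ses_category(score: float) -> str:
    """Classify a SES index score into the low / medium / high bands given in the text."""
    if not 3.0 <= score <= 21.0:
        raise ValueError(f"SES score {score} is outside the 3.0-21.0 range")
    if score < 8.5:    # low: 3.0-8.4
        return "low"
    if score < 15.5:   # medium: 8.5-15.4
        return "medium"
    return "high"      # high: 15.5-21.0


print([ses_category(s) for s in (5.2, 12.0, 18.7)])  # ['low', 'medium', 'high']
```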
BMI-standard deviation score (BMI-SDS)
BMI was calculated as body mass (kg) divided by the square of body height (m²) [24]. To ensure comparability across age and sex groups, BMI standard deviation scores (BMI-SDS) were calculated using the Kromeyer-Hauschild reference [26].
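The Kromeyer-Hauschild percentiles were constructed with Cole's LMS method, so a BMI-SDS is usually obtained from the age- and sex-specific L (skewness), M (median) and S (coefficient of variation) values of the reference table. The sketch below shows the BMI calculation and the LMS transformation; the function names are hypothetical, and the L, M and S values in any call must be taken from the published reference:

```python
import math


def bmi(mass_kg: float, height_m: float) -> float:
    """Body Mass Index: body mass (kg) divided by squared body height (m^2)."""
    return mass_kg / height_m ** 2


def lms_sds(value: float, L: float, M: float, S: float) -> float:
    """Standard deviation score via the LMS transformation.

    L, M and S come from the age- and sex-specific row of a reference table.
    For L == 0 the Box-Cox transform reduces to the logarithm.
    """
    if abs(L) < 1e-12:
        return math.log(value / M) / S
    return ((value / M) ** L - 1.0) / (L * S)
```

For example, `lms_sds(bmi(40.0, 1.5), L=-1.5, M=16.5, S=0.12)` gives the SDS of a 40 kg, 1.50 m child under these placeholder reference values, which are not taken from the actual Kromeyer-Hauschild table.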
Pubertal status
Pubertal status, as an indicator of physical maturity, was assessed according to Tanner's criteria for pubertal development [27, 28]. For this study, a corrected pubertal status was used: in some cases, the Tanner stage did not agree with the child's hormone levels (endocrine values), which indicated lower or higher physical maturity. In these cases, laboratory data were used to adjust the Tanner stage. Values range from 1 to 5, with higher values indicating greater maturity.
Voice measurements and features
Voice measurements were conducted according to a standard operating procedure based on the recommendations of the Union of the European Phoniatricians [29] and as described by Berger et al. [30]. Voice examinations were carried out in a soundproof room by trained investigators. Voice recording and analysis were conducted using the DiVAS software and a dedicated self-calibrating microphone headset (XION Medical, Berlin, Germany; dynamic range 40–120 dB(A) (sine signal), frequency range 70 Hz–20 kHz, signal-to-noise ratio (SNR) > 62 dB, article number 352 009 010). This software and hardware are designed specifically for clinical and acoustic voice assessment and are widely used in clinical voice diagnostics. The head-mounted headset kept a constant distance of 30 cm between the child's mouth and the microphone, even when the child turned their head during the examination. This is particularly important because the measured sound pressure level depends strongly on the mouth-to-microphone distance.
Participants performed two tasks. For the analysis of the speaking voice, children counted from 21 to 30 at five different sound pressure levels: quietest voice (quiet_I), conversational voice (conversation_II), presentation voice (presentation_III), loudest voice (loud_IV), and quietest voice again (quiet_V) as a voice reset test. Children were instructed to speak as softly as possible without whispering (I); to speak as in a normal conversation with someone sitting opposite them (II); to speak as if giving a presentation in a classroom (III); or to shout with their loudest voice, but without screaming (IV). Immediately after the shouting task, they were instructed to count as quietly as possible again (V). The voice reset test is considered unremarkable if these values are comparable to those of the initial quiet measurement [30, 31].
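Why a fixed mouth-to-microphone distance matters can be illustrated with the inverse-square law. Under the idealised assumption of a point source in a free field (a speaking child in a room only approximates this), the level changes by 20 log10(d_ref / d) dB when the distance changes from d_ref to d, i.e. by about 6 dB per halving or doubling of distance. A minimal sketch; the function name is hypothetical:

```python
import math

HEADSET_DISTANCE_M = 0.30  # fixed mouth-to-microphone distance stated in the text


def level_change_db(distance_m: float, reference_m: float = HEADSET_DISTANCE_M) -> float:
    """Level difference (dB) at distance_m relative to reference_m for an ideal
    point source in a free field (inverse-square law)."""
    return 20.0 * math.log10(reference_m / distance_m)


for d in (0.15, 0.25, 0.30, 0.60):
    print(f"{d:.2f} m: {level_change_db(d):+.1f} dB")
```

Under this idealisation, moving the mouth from 30 cm to 25 cm would already raise the measured level by roughly 1.6 dB, which is the kind of error a head-mounted microphone avoids.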
The second task consisted of sustaining the syllable “na” at a comfortable, medium pitch and loudness. After a maximal inhalation, children sustained the tone for as long as possible. They then sustained the tone once more for a few seconds, aiming for a sound as clear and pure as possible.
For the analysis, we included the following parameters: fundamental frequency (f0; Hz) and intensity (dB(A)) from each condition of the first task, and maximum phonation time (MPT) and jitter from the second task. Additionally, the highest pitch (f0max) and the lowest sound pressure level (SPLmin) achievable with the singing voice were measured. The Dysphonia Severity Index (DSI) was calculated from f0max, SPLmin, MPT, and jitter using the formula of Wuyts et al. [32]. The DSI ranges from +5 (normal voice) to −5 (severely dysphonic voice). While normative DSI values are well established for adults [33], fewer normative data are available for children. Because children's voices change with growth and puberty (voice change), DSI values in children are highly variable and age dependent; compared with adults, children often show slightly lower DSI values. A total of 13 voice-derived features were used for analysis.
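For reference, the DSI formula as commonly cited from Wuyts et al. [32] is DSI = 0.13 × MPT + 0.0053 × F0-High − 0.26 × I-Low − 1.18 × Jitter (%) + 12.4, with MPT in seconds, F0-High (here f0max) in Hz, I-Low (here SPLmin) in dB and jitter in percent. A minimal sketch, assuming these units; the function name is hypothetical and the input values in the usage note are illustrative, not study data:

```python
def dysphonia_severity_index(mpt_s: float, f0_high_hz: float,
                             i_low_db: float, jitter_percent: float) -> float:
    """DSI after Wuyts et al.: weighted sum of MPT (s), F0-High (Hz),
    I-Low (dB) and jitter (%), plus a constant."""
    return (0.13 * mpt_s
            + 0.0053 * f0_high_hz
            - 0.26 * i_low_db
            - 1.18 * jitter_percent
            + 12.4)
```

With MPT = 15 s, f0max = 600 Hz, SPLmin = 55 dB and jitter = 1.0 %, the function returns approximately 2.05. Longer phonation and a higher f0max raise the index, while a higher SPLmin and more jitter lower it.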
Behavioural difficulties/strengths
Behavioural data were taken from the hyperactivity/inattention (HI) subscale of the parent version of the Strengths and Difficulties Questionnaire (SDQ), a widely used instrument for measuring behaviour in children and adolescents [22]. The HI subscale consists of five items on hyperactivity/impulsivity (e.g. restless/overactive; fidgeting) and inattention (e.g. easily distracted/concentration wanders), each rated on a three-point scale (0–2), resulting in total scores ranging from 0 to 10.
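Scoring the HI subscale follows the standard SDQ rule: responses are coded 0 (not true), 1 (somewhat true) and 2 (certainly true), and the two positively worded items ("thinks things out before acting"; "sees tasks through to the end, good attention span") are reverse-scored before summing. A minimal sketch, assuming complete answers for all five items; the item keys and the function name are hypothetical shorthand, not official SDQ names:

```python
RESPONSE_CODES = {"not true": 0, "somewhat true": 1, "certainly true": 2}

# Hypothetical shorthand keys for the five HI items.
HI_ITEMS = ("restless", "fidgety", "distractible", "reflective", "persistent")
# Positively worded items ("thinks things out before acting",
# "sees tasks through to the end") are reverse-scored.
REVERSED_ITEMS = {"reflective", "persistent"}


def sdq_hi_score(answers: dict) -> int:
    """Sum the five HI items (0-2 each) into a 0-10 subscale score."""
    total = 0
    for item in HI_ITEMS:
        code = RESPONSE_CODES[answers[item]]
        total += 2 - code if item in REVERSED_ITEMS else code
    return total
```

Because of the reverse scoring, answering "not true" to every item yields 4 rather than 0.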
Statistical analysis
We used two complementary analytical approaches to address different research questions. First, linear mixed models (LMMs) were used to examine associations between voice features and SDQ HI scores. This approach allowed us to account for repeated measurements per child and to estimate how specific voice features were related to symptom levels after controlling for relevant covariates. Second, we applied machine learning (ML) methods to evaluate whether the voice features could predict SDQ HI scores at the individual level. Importantly, the ML analyses predict SDQ HI scores (a proxy for ADHD-related symptoms) rather than a clinical ADHD diagnosis. The results therefore concern symptom-level associations and predictions, not diagnostic classification. All analyses were conducted in Python (3.9) using libraries such as scikit-learn [34]. A complete list of libraries and their versions is available in the public GitHub repository.
Linear mixed models
Before the LMM analysis, all variables were standardised (z-scaled) so that they were on the same scale. LMMs allowed us to model both fixed effects (voice features, age, sex, SES, BMI-SDS), representing effects assumed to be the same across all children, and random effects, which account for child-specific variation.
A random intercept for each child was included to account for within-subject correlation, and a random slope over visits (visit number) allowed individual differences in symptom trajectories over time. The main analysis examined whether voice features predicted SDQ HI scores after adjusting for covariates (age, sex, SES, and BMI-SDS). Additional analyses were conducted (a) separately for boys and girls, and (b) with pubertal status as an additional covariate. LMMs were fitted using the Python module statsmodels, with Powell's method as the optimiser (1000 iterations). For each model, we calculated the variance of the fixed effects and residuals and report coefficients, standard errors, confidence intervals, and p-values. To reduce the chance of false positives, p-values were adjusted for multiple testing using the false discovery rate (FDR; Benjamini–Hochberg procedure). Results with FDR-corrected p-values (q-values) < 0.05 were considered statistically significant.
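The model structure can be sketched with statsmodels on simulated data. This is a hypothetical illustration, not the study code: the column names (`child_id`, `visit`, `f0_conv`, `spl_conv`, `sdq_hi`) are invented, `visit` is assumed to be the visit index used for the random slope, and one model per voice feature is assumed, with the Benjamini–Hochberg correction then applied across features:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

COVARIATES = ["age", "sex", "ses", "bmi_sds"]


def simulate_cohort(n_children=200, seed=0):
    """Toy long-format data: one row per child and visit (1 to 4 visits per child)."""
    rng = np.random.default_rng(seed)
    rows = []
    for child in range(n_children):
        n_visits = int(rng.integers(1, 5))
        u0, u1 = rng.normal(0, 1.0), rng.normal(0, 0.3)  # child-specific intercept and slope
        sex, ses, age0 = int(rng.integers(0, 2)), rng.normal(12, 3), rng.uniform(5, 15)
        for visit in range(n_visits):
            f0_conv, spl_conv = rng.normal(size=2)
            # Only f0_conv is associated with the outcome in this toy example.
            sdq_hi = 3 + 0.5 * f0_conv + u0 + u1 * visit + rng.normal(0, 1.0)
            rows.append(dict(child_id=child, visit=visit, age=age0 + visit, sex=sex,
                             ses=ses, bmi_sds=rng.normal(), f0_conv=f0_conv,
                             spl_conv=spl_conv, sdq_hi=sdq_hi))
    return pd.DataFrame(rows)


def fit_feature(data, feature):
    """Random intercept per child plus a random slope over visits; returns (beta, p)."""
    formula = "sdq_hi ~ " + " + ".join([feature] + COVARIATES)
    model = smf.mixedlm(formula, data, groups=data["child_id"], re_formula="~visit")
    result = model.fit(method="powell", maxiter=1000)
    return result.params[feature], result.pvalues[feature]


data = simulate_cohort()
features = ["f0_conv", "spl_conv"]
scaled = features + COVARIATES + ["sdq_hi"]
data[scaled] = (data[scaled] - data[scaled].mean()) / data[scaled].std()  # z-scaling

estimates = [fit_feature(data, f) for f in features]
pvals = [p for _, p in estimates]
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for f, (beta, p), q, sig in zip(features, estimates, qvals, reject):
    print(f"{f}: beta={beta:.3f} p={p:.2g} q={q:.2g} significant={sig}")
```

Leaving `visit` unscaled keeps the random slope interpretable as change per visit; whether the study scaled it is not stated.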
Machine learning
We conducted a series of ML analyses using ridge regression, chosen because its L2 penalty makes it well suited to high-dimensional, correlated data, which is typical of voice features. All numerical features were standardised and, where relevant, demographic covariates were regressed out of the voice features (linear confound removal). We trained three models: (1) a full model (voice features + demographics), (2) a demographics-only model, and (3) a voice-only model. Comparing models 1 and 2 allowed us to test whether voice features add predictive value beyond demographics. Differences in prediction error (mean absolute error, MAE) were tested using the corrected resampled t-test of Nadeau and Bengio, which was developed specifically for comparing cross-validated ML results [35].
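Two ingredients of this setup can be sketched compactly: linear confound removal fitted on the training fold only, and the Nadeau and Bengio corrected resampled t-test, which inflates the variance of the per-fold score differences by the ratio n_test / n_train to account for overlapping training sets. A minimal sketch; the function names are hypothetical, and the exact variant used in the study may differ:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression


def remove_confounds(X_train, C_train, X_test, C_test):
    """Linear confound removal: regress confounds C (e.g. age, sex) out of every
    feature column in X. The regression is fitted on the training fold only and
    then applied to both folds, so no test information leaks into training."""
    reg = LinearRegression().fit(C_train, X_train)
    return X_train - reg.predict(C_train), X_test - reg.predict(C_test)


def corrected_resampled_ttest(diffs, n_train, n_test):
    """Nadeau & Bengio corrected resampled t-test.

    diffs: per-fold differences in a score (e.g. MAE of model 1 minus model 2),
    one entry per outer test fold across all repetitions. The usual 1/k variance
    factor is increased by n_test/n_train because training sets overlap.
    """
    diffs = np.asarray(diffs, dtype=float)
    k = diffs.size
    t = diffs.mean() / np.sqrt((1.0 / k + n_test / n_train) * diffs.var(ddof=1))
    p = 2.0 * stats.t.sf(abs(t), df=k - 1)
    return t, p
```

With 5 folds repeated 20 times, `diffs` would hold 100 values, and n_test / n_train is roughly 1/4.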
The models were evaluated using nested 5-fold cross-validation, repeated 20 times. This avoids overly optimistic performance estimates and ensures that results are stable rather than an artefact of a particular data split. The hyperparameters (regularisation strength α, feature set size) were tuned in the inner cross-validation loop. GroupKFold was used to ensure that repeated measurements from the same child never appeared in both training and test sets, thus avoiding data leakage. To evaluate performance, we calculated the MAE, which reflects the average prediction error in SDQ HI points; the proportion of explained variance (R²); and the correlation between predicted and observed scores (r). To determine whether models performed better than chance, permutation tests were applied by repeatedly shuffling the outcome labels and re-training the models. A p-value < 0.05 indicated that the observed performance exceeded what would be expected if there were no real association. Finally, to understand which features influenced model predictions, we calculated SHapley Additive exPlanations (SHAP) values, which quantify each feature's contribution to a prediction.
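One repetition of the nested, child-grouped cross-validation and a SHAP computation for a linear model can be sketched with scikit-learn. This is an illustration under stated assumptions, not the study pipeline: feature set size is assumed to be tuned via `SelectKBest`, the grid values are arbitrary, the 20 repetitions with different splits are omitted, and SHAP values are computed with the closed form for linear models under feature independence rather than with the `shap` package. The function names are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_group_cv(X, y, groups, n_splits=5):
    """One repetition of nested cross-validation.

    The outer GroupKFold estimates performance; the inner GroupKFold (inside
    GridSearchCV) tunes ridge alpha and the number of selected features.
    All rows of one child (same group) always stay in the same fold.
    """
    grid = {"selectkbest__k": [3, "all"], "ridge__alpha": [0.1, 1.0, 10.0, 100.0]}
    y_pred = np.empty(len(y), dtype=float)
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups):
        pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), Ridge())
        search = GridSearchCV(pipe, grid, cv=GroupKFold(n_splits=n_splits),
                              scoring="neg_mean_absolute_error")
        search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
        y_pred[test_idx] = search.predict(X[test_idx])
    # Metrics on the pooled out-of-fold predictions.
    return {"mae": mean_absolute_error(y, y_pred),
            "r2": r2_score(y, y_pred),
            "r": float(np.corrcoef(y, y_pred)[0, 1])}


def linear_shap(model, X, X_background):
    """Closed-form SHAP values for a fitted linear model, assuming independent
    features: phi_ij = coef_j * (x_ij - mean_j), means taken from X_background."""
    return model.coef_ * (X - X_background.mean(axis=0))
```

Calling `nested_group_cv(X, y, groups)` with a feature matrix, SDQ HI scores and child IDs returns the pooled out-of-fold MAE, R² and r for one repetition. For a linear model, the SHAP values of each row sum to the difference between its prediction and the prediction at the background mean.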