The effects of ageing on fatigue and endurance of the spinal extensor muscles: a systematic review and meta-analysis

This review is reported in line with the 2020 guidelines of Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) (Supplemental file 1) [25]. The protocol for this review was registered on the International Prospective Register of Systematic Reviews (PROSPERO; CRD42025645295) on 31 January 2025.

Eligibility criteria

The PICOS framework, which includes Population, Intervention, Comparison, Outcomes and Study design, was used to define the inclusion criteria for this systematic review. However, the original term “Intervention” was replaced with “Indicator” due to the nature of potentially eligible studies, as reported by previous similar systematic reviews [26, 27].

Population

Studies were considered eligible for inclusion if they involved physically independent older adults aged 60 years and above. Although the age cutoff for defining an “older” person may vary across countries, the 60+ threshold was selected based on the United Nations’ definition [28]. Additionally, studies were required to include a control group of young and/or middle-aged individuals as comparators. Only studies involving healthy participants were considered, defined as individuals free from musculoskeletal disorders, neuro-systemic degenerative diseases, and chronic cardiovascular or respiratory diseases. Participants also had to be free from current spinal pain, spinal deformities, and any history of chronic spinal pain or spinal surgery. No restrictions were applied regarding gender or ethnicity.

Indicator

Eligible studies were those assessing fatigue and/or endurance of the spinal extensor muscles, including the lumbar, thoracic, and neck extensors. Studies involving voluntary fatiguing contractions performed either for a fixed duration or until task failure were included. The tasks involved the use of a specific dynamometer for trunk muscles or a standardised test for spinal extensor endurance, such as the Biering–Sørensen test or the Ito test. The Biering–Sørensen test and its variations (i.e., modified Sørensen test on a Roman chair) and the Ito test are standardised isometric tests designed to assess the endurance of the back extensor muscles [29]. In these tests, participants are positioned prone and instructed to lift the upper body and maintain a horizontal posture for as long as possible or for a predetermined duration. These tests differ primarily in the type of support (e.g., Roman chair or bench), fixation methods (e.g., lower body stabilisation) and arm positioning. All types of contractions were included (i.e., isometric or dynamic), whether performed at an absolute force/torque level or at a level relative to the individual’s maximal voluntary contraction (MVC).

Comparison

Eligible studies included comparisons of specific outcome variables (e.g., endurance time, EMG parameters) between groups, such as older versus younger participants or older versus middle-aged participants.

Outcomes

Outcome variables included any measure related to muscle fatigue development and endurance. Both mechanical and neuromuscular variables were considered. Mechanical variables included endurance time and the decline in force between pre- and post-fatigue assessments. Neuromuscular variables included the median frequency (MF) or mean power frequency (MPF) of the EMG signals and muscle fibre conduction velocity (MFCV).

Study design

Only cross-sectional studies were included in this systematic review, as this design is most commonly used to address the research question. Non-original literature (e.g., systematic and narrative reviews) and other research designs (e.g., randomised controlled trials) were therefore excluded. To minimise the risk of bias, the search was not restricted by language. However, non-English studies were ultimately excluded due to time and resource constraints.

Information sources

Electronic searches were performed on Medline (Ovid Interface), EMBASE (Ovid Interface), PubMed, Web of Science (Clarivate Analytics) and CINAHL Plus (EBSCO Interface) databases from their inception to 28 June 2025. Hand searching of key journals (Journal of Physiology, Journal of Neurophysiology, Journal of Electromyography and Kinesiology, European Journal of Applied Physiology, Journal of Applied Physiology, and Muscle & Nerve) was also conducted. Additionally, the reference lists of all included papers were checked manually to identify any relevant studies that could have been missed during the search.

Search strategy

The search was conducted by the lead author (MP) without any restrictions in terms of date, geographical area and language. Search strategies were tailored to each database and combined free-text keywords with Medical Subject Headings (MeSH), where appropriate, to optimise the search process. Although the syntax was adapted to each database, the strategy was kept consistent across databases. The full electronic search strategies for all databases are reported in Supplemental file 2.

Selection process

All search results were imported into EndNote 20 (Clarivate Analytics) by the lead author (MP). Duplicates were identified and automatically removed by the software. All remaining references were then imported into Covidence (Veritas Health Innovation, Melbourne, Australia), where the title/abstract and full-text screening processes were conducted. Titles and abstracts of the studies were independently screened by two reviewers (MP and MA) using a pretested screening form. Studies were categorised as eligible, ineligible or doubtful. Doubtful studies were discussed between the two reviewers, and a third reviewer (DF) was involved in resolving any disagreements or uncertainties. After the initial screening phase, eligible studies underwent full-text assessment, which was conducted independently by the same two reviewers (MP and MA), with the third reviewer (DF) assisting in case of disagreements. If the full text of an eligible study could not be retrieved by the reviewers, the study authors were contacted and given two weeks to reply.

Data collection process and data items

Data extraction was conducted by one reviewer (MP) using a pretested extraction form that was specifically designed to align with the review’s aims. A second reviewer (MA) verified the accuracy of the extracted data. If clarifications about the data were necessary (e.g., incomplete data), the study authors were contacted and given two weeks to reply. When data were available only in graphs or tables, WebPlotDigitizer (version 5.1) software was used to extract them [30]. Relevant data for each aspect of the PICOS framework were extracted. General study information (e.g., participant characteristics) and subjective measures of muscle fatigue (e.g., Borg scale ratings) were also reported. Where studies included groups not relevant to the review’s aims, those groups were not considered for data extraction.

Risk of bias assessment

Two independent reviewers (MP and MA) assessed the methodological quality of the included papers using the Appraisal tool for Cross-Sectional Studies (AXIS) [31]. The original version of the tool consists of 20 questions divided into five sections: introduction, methods, results, discussion, and other. However, the tool was slightly modified to better align with the objectives of this review, as previously reported [32]. Specifically, questions 7 and 14 were removed, as non-response bias had little relevance in our review due to the nature of the included studies. Question 13 of the original tool (renumbered as question 12 in the modified version) was revised to assess whether information about participant dropouts during testing sessions was reported. Each question was answered with “Yes”, “No”, or “Don’t know”. A response of “Yes” was awarded 1 point, while “No” or “Don’t know” responses were awarded 0 points. However, question 19 of the original tool (renumbered as question 17 in the modified version), which relates to conflicts of interest, was scored in reverse: a response of “Yes” was awarded 0 points, while “No” received 1 point. The maximum possible score for each study was 18, with higher scores indicating lower risk of bias. To ensure comparability across studies, scores were converted into percentages. Studies scoring ≥ 75% were classified as “good” quality, those scoring between 50 and 74% as “moderate” quality, and those scoring < 50% as “poor” quality [33].
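
The scoring rule described above can be sketched in Python as follows. This is an illustrative sketch only, not the reviewers' actual tooling: the function names, the answer strings and the dictionary layout are assumptions made for the example. Treating “Don’t know” on the reverse-scored conflicts-of-interest question as 0 points is also an assumption, since the text does not specify it.

```python
# Hypothetical helpers illustrating the modified AXIS scoring rule.
N_QUESTIONS = 18           # 20 original questions minus the two removed
REVERSED_QUESTIONS = {17}  # conflicts of interest (question 19 in the original tool)


def axis_score(answers):
    """Return (points, percentage) for one study.

    `answers` maps modified question numbers 1..18 to "yes", "no" or "dont_know".
    """
    if set(answers) != set(range(1, N_QUESTIONS + 1)):
        raise ValueError("expected answers for questions 1..18")
    points = 0
    for question, answer in answers.items():
        if question in REVERSED_QUESTIONS:
            # Reverse-scored item: "no" earns the point.
            # Assumption: "dont_know" earns 0 points here as well.
            points += 1 if answer == "no" else 0
        else:
            points += 1 if answer == "yes" else 0
    return points, 100.0 * points / N_QUESTIONS


def quality_label(percentage):
    """Map a percentage score to the quality categories used in the text."""
    if percentage >= 75:
        return "good"
    if percentage >= 50:
        return "moderate"
    return "poor"
```

For instance, a study answering “Yes” to 13 of the 17 regular questions and “No” to question 17 would score 14 of 18 points, which falls in the “good” category.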

Synthesis methods

Both narrative and meta-analytic approaches were employed to synthesise the data from the included studies [34, 35]. The narrative synthesis was conducted for endurance time, EMG parameters, and force decline while the meta-analysis was performed for endurance time only. No grouping was necessary for the meta-analysis, as all studies included the same outcome measure (endurance time in seconds), allowing for direct comparisons. The narrative synthesis involved outlining and tabulating study characteristics to facilitate a comprehensive interpretation of findings across all studies [34]. For the meta-analysis, only studies with available or retrievable mean ± SD data on endurance times (in seconds) and sample sizes for both older and younger groups were included. The endurance results reported by Tsuboi et al. [24] were pooled into a single group, as the authors originally reported data separately for male and female participants. The pooled mean and standard deviation were calculated using the following formulas:

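$$\text{Mean}_{\text{pooled}}=\frac{\left(N_{\text{male}}\times \text{Mean}_{\text{male}}\right)+\left(N_{\text{female}}\times \text{Mean}_{\text{female}}\right)}{N_{\text{male}}+N_{\text{female}}}$$

$$\text{SD}_{\text{pooled}}=\sqrt{\frac{\left[\left(N_{\text{male}}-1\right)\times \text{SD}^{2}_{\text{male}}\right]+\left[\left(N_{\text{female}}-1\right)\times \text{SD}^{2}_{\text{female}}\right]}{N_{\text{male}}+N_{\text{female}}-2}}$$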
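
As a minimal sketch, the pooling of two subgroups (here, male and female participants) into a single group can be written in Python as below. The pooled mean is the sample-size-weighted mean, and the pooled SD is the square root of the degrees-of-freedom-weighted average of the subgroup variances. The function name and the example values are hypothetical and are not taken from Tsuboi et al. [24].

```python
import math


def pool_two_groups(n_a, mean_a, sd_a, n_b, mean_b, sd_b):
    """Return (pooled_mean, pooled_sd) for two subgroups.

    Hypothetical helper: the pooled mean is weighted by sample size, and the
    pooled variance is weighted by each subgroup's degrees of freedom (n - 1).
    """
    pooled_mean = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return pooled_mean, math.sqrt(pooled_var)


# Illustrative values only: 10 men (100 s, SD 20 s) and 10 women (120 s, SD 20 s).
pooled_mean, pooled_sd = pool_two_groups(10, 100.0, 20.0, 10, 120.0, 20.0)
```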

The mean difference (MD) was selected as the outcome measure, as endurance time in seconds was a common metric across studies.
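
For illustration, the per-study mean difference and its standard error can be computed as sketched below. The function name is hypothetical. Two further choices are assumptions, as the text does not state either: the direction of the difference (older minus younger) and the use of an unpooled-variance standard error.

```python
import math


def mean_difference(n_old, mean_old, sd_old, n_young, mean_young, sd_young):
    """Return (md, se) for one study, with endurance times in seconds.

    Assumptions: MD is older minus younger, so a negative value means shorter
    endurance in older adults, and the SE uses the unpooled variances.
    """
    md = mean_old - mean_young
    se = math.sqrt(sd_old ** 2 / n_old + sd_young ** 2 / n_young)
    return md, se


# Illustrative values only (not taken from any included study).
md, se = mean_difference(25, 90.0, 30.0, 25, 120.0, 40.0)
```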

Meta-analysis procedures and forest plot generation were performed using R software (version 4.4.3) [36] with the “meta” package (v. 8.0.2) [37]. A random-effects model was chosen to account for the anticipated heterogeneity across studies, common in ageing research due to variations in study protocols, participant characteristics (including age ranges and fitness levels), sample sizes, task specificity, and measurement techniques. The Knapp-Hartung adjustment was applied to calculate the 95% confidence intervals around the pooled effect estimate [38], providing more conservative estimates particularly suited for meta-analyses with smaller numbers of studies. Additionally, prediction intervals were included alongside summary estimates to illustrate the expected range of true effects in future studies. This approach enhances clinical interpretation by showing the breadth of potential outcomes rather than just the average effect [35, 39].

While a random-effects model was implemented to account for heterogeneity, this approach accommodates heterogeneity statistically rather than eliminating it [35]. Heterogeneity was therefore explored beyond the commonly used I² statistic, which has limitations in fully characterising heterogeneity patterns. To gain further insight into the variation in true effects across studies, the between-study variance (τ²) was estimated using the restricted maximum-likelihood (REML) estimator, as recommended for continuous outcomes [35, 39]. The prediction intervals calculated for the pooled effect sizes complemented these statistics by indicating the potential range of true differences in endurance times between older and younger adults across studies [35, 39].
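
For reference, the quantities named above take the following textbook forms, where $y_i$ is the mean difference in study $i$, $s_i^2$ its sampling variance, $\hat{\tau}^2$ the estimated between-study variance, and $k$ the number of studies. This is a sketch of the standard formulation only. The exact variant implemented by the software is not stated in the text, so the degrees of freedom shown for the prediction interval are an assumption.

Random-effects weights and pooled estimate:

$$w_i=\frac{1}{s_i^{2}+\hat{\tau}^{2}}, \qquad \hat{\mu}=\frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}$$

Knapp-Hartung variance estimate and 95% confidence interval:

$$\widehat{\text{Var}}_{\text{HK}}\left(\hat{\mu}\right)=\frac{\sum_{i=1}^{k} w_i\left(y_i-\hat{\mu}\right)^{2}}{\left(k-1\right)\sum_{i=1}^{k} w_i}, \qquad \hat{\mu}\pm t_{k-1,\,0.975}\sqrt{\widehat{\text{Var}}_{\text{HK}}\left(\hat{\mu}\right)}$$

95% prediction interval (one common formulation, with $k-2$ degrees of freedom):

$$\hat{\mu}\pm t_{k-2,\,0.975}\sqrt{\hat{\tau}^{2}+\widehat{\text{SE}}\left(\hat{\mu}\right)^{2}}$$

The prediction interval is wider than the confidence interval whenever $\hat{\tau}^{2}>0$, because it adds the between-study variance to the uncertainty of the pooled estimate.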

Sensitivity analyses

Given the anticipated heterogeneity in ageing studies comparing endurance measures, comprehensive sensitivity analyses were conducted [35]. This involved sequentially excluding each study to assess the consistency of the meta-analysis results and evaluate the impact of individual studies on overall heterogeneity. Outliers and influential cases were also quantitatively explored using established deletion diagnostics adapted from linear regression [40, 41]. These diagnostics included externally standardised residuals, DFFITS values, Cook’s distances, covariance ratios, DFBETAS values, estimates of τ² and Q when each study was removed sequentially, diagonal elements of the hat matrix, and the weights assigned to observed outcomes during model fitting. These analyses were performed using the “influence” function from the “metafor” package (v. 4.6.0) in R [40], which allows quantitative identification of potential outliers and influential studies that may disproportionately affect the meta-analysis results. Together, the leave-one-out analyses and influence diagnostics were used to assess the robustness of the findings regarding age-related differences in endurance times.
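
The sequential exclusion step can be sketched as a simple leave-one-out loop. This Python sketch is illustrative and is not the authors' R code. All function names are hypothetical, and a plain inverse-variance weighted mean stands in for the REML random-effects fit; any other pooling function with the same signature could be passed in its place.

```python
def inverse_variance_pool(effects, variances):
    """Stand-in pooling function: inverse-variance weighted mean.

    Assumption: used here for brevity instead of a REML random-effects model.
    """
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances, pool=inverse_variance_pool):
    """Re-pool the data k times, each time omitting one study.

    `effects` and `variances` are lists of equal length; element i of the
    result is the pooled estimate with study i removed.
    """
    results = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        results.append(pool(kept_effects, kept_variances))
    return results


# Illustrative mean differences (s) and variances; the third study is extreme.
estimates = leave_one_out([-10.0, -20.0, -90.0], [4.0, 4.0, 4.0])
```

A pooled estimate that shifts markedly when one study is omitted, as happens here when the third study is removed, flags that study as potentially influential.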

Certainty of evidence

The overall certainty of the evidence was independently assessed by two reviewers (MP and MA) using the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) approach. This involved a five-step process as outlined by Goldet and Howick [42], resulting in a final rating of the quality of evidence as high, moderate, low, or very low. Based on GRADE guidelines, evidence from observational studies was initially rated as low quality, but could be upgraded in the presence of specific factors such as large effect sizes or clear dose–response relationships. Cohen’s d values were used to estimate effect sizes. Conversely, the certainty was downgraded based on concerns related to risk of bias, inconsistency, imprecision, indirectness, or potential publication bias. Publication bias was assessed by generating funnel plots using the funnel function from the “metafor” package (v. 4.6.0) in R. Because the meta-analysis included fewer than 10 studies, formal statistical tests such as Egger’s regression test were not performed, as they lack statistical power with so few studies and can give misleading results [43]. Instead, we relied on visual inspection of the funnel plot to qualitatively evaluate potential asymmetry that might suggest publication bias. The GRADE approach was applied to the studies included in the meta-analysis, following the removal of the outlier (described below), to enable a tailored interpretation of the evidence in accordance with established guidelines for evaluating observational studies [44].
