Unveiling pathogens and contaminants: refining metagenomics for clinical diagnostics

Abstract

Introduction:

Shotgun metagenomic sequencing (mNGS), an untargeted approach that sequences all nucleic acids in a sample, has emerged as a powerful tool for pathogen detection and genome characterization. However, its implementation in clinical diagnostics remains limited due to technical challenges such as contamination and reduced sensitivity, especially in low-biomass samples.

Methods:

We applied mNGS to 144 clinical samples representing chronic infections, acute infections, and respiratory co-infections. To address contamination, we established a framework integrating negative controls, lab-specific contaminant watchlists, and computational filtering. Viral detection performance and genome recovery were assessed across sample types and viral loads.

Results:

Viral load was shown to be the primary determinant of sensitivity, with reliable recovery achieved only at higher titers. Our framework substantially improved contamination management, reducing false-positive signals and enhancing viral genome recovery. mNGS enabled the detection of clinically relevant co-infections and refined viral classification beyond targeted diagnostics, while also revealing the substantial risk of spurious detections in the absence of contamination-aware workflows.

Discussion:

These findings define practical sensitivity thresholds for clinical mNGS and underscore the need for contamination-aware workflows, particularly for low-biomass samples, while providing an open-source contaminants watchlist that enhances reliability and utility of clinical metagenomics.

1 Introduction

Shotgun metagenomics (mNGS) is a next-generation sequencing (NGS) approach that enables comprehensive analysis of all nucleic acids (both DNA and RNA) within a given sample (Quer et al., 2022; Ibañez-Lligoña et al., 2023). By using a non-targeted strategy, mNGS allows for unbiased sequencing of microbial and host genetic material, facilitating the detection and characterization of a wide range of microorganisms, including viruses, bacteria, fungi, archaea, and parasites (Chrzastek et al., 2022; Parras-Moltó et al., 2018; Iyer and Damania, 2020; Xie et al., 2023; Zhao et al., 2024; Miao et al., 2018; Su et al., 2024). This hypothesis-free approach is particularly well suited for identifying novel or unexpected pathogens and understanding microbial diversity and dynamics in complex clinical settings (Ibañez-Lligoña et al., 2023). Metagenomics has proven valuable in a wide range of applications, including the study of microbial community composition, pathogen evolution, antimicrobial resistance, and the discovery of novel microorganisms through whole-genome sequencing (Quer et al., 2022; Chiu and Miller, 2019; von Fricken et al., 2023).

Over the last decade, mNGS has been introduced into the clinical sphere (Comas et al., 2020), where it offers several advantages over conventional diagnostics (Chiu and Miller, 2019; Gu et al., 2019). Traditional methods such as culture, PCR or serological tests rely on prior knowledge of the target organism and fail to identify unculturable, unknown, or unexpected pathogens (Ibañez-Lligoña et al., 2023; Schloss and Handelsman, 2005). Moreover, whole-genome sequencing enables functional characterization, identification of resistance mutations and virulence factors, thereby supporting both diagnosis and informed therapeutic decision-making.

Accurate disease diagnosis is essential for effective treatment, as misdiagnosis causes delays, increased mortality and higher healthcare costs. Acute infections remain particularly challenging, with 50–60% of hospitalized patients being discharged without an identified cause. In emerging infections and chronic conditions, low-abundance or resolved infections often go undetected by conventional methods. Since syndromic presentations can result from multiple pathogens, a high-throughput approach that simultaneously identifies all potential agents in one assay is needed (Edridge et al., 2019). Moreover, mNGS could be used for the prompt detection of emerging, re-emerging, or novel pathogens, which is essential for timely outbreak containment (Quer et al., 2022).

Despite its potential, clinical implementation of mNGS remains limited due to technical, analytical and interpretative challenges (Ibañez-Lligoña et al., 2023). Contamination is one of the major technical barriers, which is exacerbated by the technique’s high sensitivity (Jurasz et al., 2021). Exogenous nucleic acids can be introduced at multiple stages of the workflow, from sample collection to sequencing (Fierer et al., 2025). Common sources include reagents and consumables (collectively known as the “kitome” [Olomu et al., 2020]), laboratory environments, and operator handling (Lou et al., 2023). These contaminants can lead to false-positive results or obscure genuine microbial signals. This is particularly critical in low-biomass samples, where the actual amount of microbial genetic material is very small compared to the host or background content, allowing contaminants and host-derived reads to dominate the sequencing output (Jurasz et al., 2021; Fierer et al., 2025; Minich et al., 2019).

Several strategies have been developed to mitigate the risk of contamination, including physical decontamination methods (e.g., UV treatment, DNase digestion), bioinformatic filtering, and the use of negative controls (Fierer et al., 2025; Corless et al., 2000). However, no single approach has proven to fully remove background noise. In addition, few studies have systematically catalogued the composition and behaviour of contaminants across large clinical datasets or assessed their impact on diagnostic performance. This knowledge gap complicates efforts to set detection thresholds or confidently interpret results from low-biomass samples. In this study, we aim to address key limitations in clinical mNGS workflows by characterizing recurrent contamination and evaluating the natural viral sensitivity across 144 clinical samples and 18 methodological controls.

2 Materials and methods

2.1 Sample collection

A total of 144 clinical samples were collected from patients representing diverse diagnostic categories and subjected to shotgun metagenomic sequencing (Table 1). All samples were leftover material obtained from routine clinical diagnostic testing. Samples were completely anonymized. Human data of the participants related to sex, gender, race, ethnicity, social grouping, and other social variables are not relevant to the scope of the analysis. Our research has been approved by the Vall d’Hebron Barcelona Hospital Campus Ethics Committee under the following approval numbers: PR(AG)429–2021 and PR(AG)287–2022 for neuropathy samples, PR(AG)259–2020 for the respiratory virus samples, PR(AG)118–2021 for HEV samples, and PR(AMI)437–2023 for pregnant women samples. Written informed consent was obtained from all patients for the collection and research use of their biological samples.

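| Condition | Sample type | Article code | Number of samples | Paired samples |
| --- | --- | --- | --- | --- |
| Alzheimer’s disease | CSF | ALZ-CSF | 39 | NA |
| Guillain-Barré Syndrome | CSF | GBS-CSF | 26 | 21 |
| | Serum | GBS-S | 21 | |
| Neuralgic amyotrophy | CSF | NA-CSF | 1 | 1 |
| | Serum | NA-S | 1 | |
| Preterm birth | Plasma | PB-P | 4 | 4 |
| | Amniotic fluid | PB-AF | 6 | |
| Positive controls | Nasopharyngeal swab | PC-NFS | 11 | NA |
| | Nasopharyngeal exudate | PC-NFE | 4 | |
| | Plasma | PC-P | 10 | |
| Acute HEV | Amniotic fluid | HEVA-AF | 1 | 2 |
| | Feces | HEVA-F | 1 | |
| | Plasma | HEVA-P | 2 | |
| Chronic HEV | Plasma | HEVC-P | 17 | NA |

Table 1. Distribution of the 144 samples analyzed, categorized by sample type and indicating paired samples.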

Codes based on condition and sample type are provided to ease interpretation of the results. AF, amniotic fluid; ALZ, Alzheimer’s disease; CSF, cerebrospinal fluid; F, feces; GBS, Guillain-Barré Syndrome; HEV, Hepatitis E virus; HEVA, Acute Hepatitis E virus infection; HEVC, Chronic Hepatitis E virus infection; P, plasma; PB, preterm birth; PC, positive control; NA, neuralgic amyotrophy; NFE, nasopharyngeal exudate; NFS, nasopharyngeal swab.

Among these were samples from patients with Guillain-Barré Syndrome (GBS), comprising 26 cerebrospinal fluid (CSF) and 21 serum specimens, of which 21 formed matched CSF-serum pairs. Both serum and CSF samples were collected in the context of acute symptomatology for suspected GBS cases, with subsequent diagnostic confirmation of included cases. Additionally, 1 matched CSF-serum pair was obtained from a case of neuralgic amyotrophy, another acute inflammatory peripheral nervous system disorder. CSF was obtained through a lumbar puncture performed for clinical diagnostic purposes, with a surplus of 2 to 4 mL retained for research. The CSF samples were then frozen at −80 °C for preservation. For serum collection, whole blood was drawn into 9 mL blood tubes and centrifuged at 3500 rpm at 4 °C for 15 min. The resulting serum was aliquoted into 1.5 mL tubes and frozen at −80 °C for preservation.

Additionally, 39 CSF samples were collected from individuals diagnosed with Alzheimer’s disease (AD), following the methodology described in Cano et al. (2024). Six amniotic fluid samples and four plasma samples, including four matched pairs, were collected from pregnant women in the clinical context of suspected intra-amniotic infection (chorioamnionitis). Gestational ages at sampling ranged from 16 to 33.6 weeks. Upon hospital admission, an amniocentesis was performed under ultrasound guidance for diagnostic purposes. Using continuous ultrasound monitoring, a sterile needle was inserted via transabdominal puncture to aspirate a sample of amniotic fluid (AF) under strict aseptic conditions.

In addition to the clinical samples described above, a set of positive controls was included, consisting of clinical specimens with viral infections confirmed using conventional clinical microbiology methods. Eleven nasopharyngeal swabs, four nasopharyngeal exudates and ten plasma samples were obtained from cases of suspected acute infection in patients admitted to the hospital or emergency department. Moreover, one stool, one amniotic fluid and two plasma samples were collected from a patient who had tested positive for acute HEV infection. Furthermore, 17 plasma samples were obtained from a patient with a chronic HEV infection.

Samples were classified as low-biomass or high-biomass according to their anatomical origin. Low-biomass samples were defined as specimens derived from physiologically sterile body sites, including cerebrospinal fluid, plasma, serum, and amniotic fluid, which are not expected to harbour a resident microbiome under basal conditions. High-biomass samples comprised specimen types known to contain abundant microbial communities, such as stool and nasopharyngeal swabs and exudates (Fierer et al., 2025). Blank negative controls consisted of DNase- and RNase-free sterile water (ThermoFisher Scientific, Waltham, MA, USA), which is routinely used for nucleic acid extractions and nested RT-PCR reactions.

2.2 Nucleic acid extraction, library preparation and Illumina sequencing

For nucleic acid extraction from nasopharyngeal swabs and exudates, the STARMag 96 × 4 Universal Cartridge Kit was used on the Microlab STARlet automated platform (Seegene, South Korea). In contrast, RNA and DNA from amniotic fluid, cerebrospinal fluid, plasma, serum and stool samples were extracted using the QIAamp MinElute Virus Spin extraction kit (QIAGEN, Hilden, Germany), omitting the addition of the RNA carrier. Next, sequencing libraries were prepared with the TruSeq Stranded total RNA library kit (Illumina, San Diego, CA, USA) according to the manufacturer’s instructions. Unique dual indices (IDT for Illumina – TruSeq RNA UD Indexes v2) were ligated to each sample to enable multiplexing. Following library quantification and normalization to 4 nM, libraries were pooled prior to sequencing.
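The normalization to 4 nM mentioned above can be illustrated with a short calculation. This is only a sketch under stated assumptions: it uses the common mass-based conversion for double-stranded DNA (660 g/mol per base pair), whereas qPCR-based quantification reports molarity directly. The function names and the example values are hypothetical and are not taken from the study.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA library concentration (ng/uL) to molarity (nM).

    Assumes an average mass of 660 g/mol per base pair.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size > 0")
    return conc_ng_per_ul / (660.0 * mean_fragment_bp) * 1e6


def diluent_volume_ul(library_nm, library_ul, target_nm=4.0):
    """Volume of diluent to add to library_ul of stock to reach target_nm."""
    if library_nm < target_nm:
        raise ValueError("library is already below the target molarity")
    return library_ul * (library_nm / target_nm - 1.0)


# Hypothetical library: 2.64 ng/uL with a 400 bp mean fragment size.
stock_nm = library_molarity_nm(2.64, 400)
print(round(stock_nm, 2), "nM stock")
print(round(diluent_volume_ul(stock_nm, 5.0), 2), "uL diluent for 5 uL of stock")
```

Libraries already below the target molarity raise an error in this sketch rather than being silently pooled undiluted.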

Additionally, two types of negative methodological controls were included in the workflow: extraction controls (CE), added during nucleic acid extraction (sterile water), and library controls (CL), added during library preparation and processed in parallel with the clinical samples throughout the workflow. A total of eighteen “blank” controls were included across different sequencing runs to assess and control for inter- and intra-run sequencing variations.

Library quality was assessed using the KAPA Library Quantification Kit (Roche Applied Science, Pleasanton, CA, USA) by RT-qPCR in the LightCycler 480 instrument in combination with the TapeStation 4200 system (Agilent, Santa Clara, CA, USA) using the D1000 ScreenTape Assay. Finally, libraries were sequenced on either the NextSeq2000 or NovaSeq6000 platforms (Illumina, San Diego, CA, USA), with PhiX v3 (Illumina, San Diego, CA, USA) used as internal control for sequencing.

Sequencing generated a total of 9.81 × 10¹⁰ reads across all eleven runs. Two of these runs were performed in a NovaSeq6000 platform, while the other nine runs were done in the NextSeq2000 platform. Libraries sequenced on the NextSeq2000 platform produced a median of 9.53 × 10⁷ reads per sample (3.4 × 10⁶–1.54 × 10⁸), whereas NovaSeq6000 runs yielded a median of 1.61 × 10⁸ reads per sample (6.67 × 10⁷–4.51 × 10⁸ reads).

2.3 Cross-contamination control experiment

Additionally, three experiments were carried out to assess putative cross-contamination occurring during library preparation and/or sequencing. For this purpose, a NextSeq2000 (Illumina, San Diego, CA, USA) was loaded with a pool containing eight water “blank” controls, a previously produced in-lab HCV clone, and RNA from sample HEVC-P-2, all of which underwent library preparation with the TruSeq Stranded total RNA library kit (Illumina, San Diego, CA, USA).

2.4 Bioinformatic analysis

Sequencing data was retrieved from the sequencing platforms in either FASTQ or BCL format for all clinical samples and corresponding negative controls. When required, BCL files were converted to FASTQ format using the bcl2fastq software (Illumina, San Diego, CA, USA). Then, samples underwent an initial quality control (QC) assessment.

QC analysis began with the generation of quality reports using FastQC (Babraham Bioinformatics, 2010) and MultiQC (Ewels et al., 2016). Next, raw reads were trimmed using Trimmomatic (Bolger et al., 2014) to remove adapter sequences, low-quality bases, and short reads. Duplicate sequences were removed using the clumpify tool from the BBMap suite (Bushnell, 2014). To minimize the influence of technical contamination, BBDuk from the BBMap suite (Bushnell, 2014) was used to filter out reads present in matched extraction (CE) and library (CL) controls.
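The idea behind subtracting control-derived reads can be illustrated with a minimal k-mer matching sketch. It is not BBDuk and does not reproduce its algorithm or parameters, which are not stated here. The function names, the k-mer length `k`, the `min_hits` threshold and the toy sequences are all illustrative assumptions.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence; non-ACGT bases map to N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def build_control_index(control_reads, k):
    """Collect k-mers from both strands of the negative-control reads."""
    index = set()
    for read in control_reads:
        index |= kmers(read, k)
        index |= kmers(reverse_complement(read), k)
    return index


def subtract_control_reads(sample_reads, control_index, k, min_hits=1):
    """Keep sample reads sharing fewer than min_hits distinct k-mers with controls."""
    kept = []
    for read in sample_reads:
        hits = len(kmers(read, k) & control_index)
        if hits < min_hits:
            kept.append(read)
    return kept


# Toy example: the third read is the reverse complement of the control read.
controls = ["ACGTACGTAC"]
sample = ["ACGTACGTAC", "TTTTGGGGCC", "GTACGTACGT"]
index = build_control_index(controls, k=8)
print(subtract_control_reads(sample, index, k=8))  # ['TTTTGGGGCC']
```

A lower `min_hits` removes more reads; the trade-off is that genuine reads sharing short stretches with control sequences may also be discarded.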

Human reads were identified by mapping against the human reference genome (GRCh38) using Bowtie2 (Langmead and Salzberg, 2012) and subsequently removed from the dataset. The remaining non-human reads were subjected to taxonomic classification using Kraken2 (Wood et al., 2019) with the PlusPF database (RefSeq database for archaea, bacteria, viruses, plasmids, human, UniVec_Core, protozoa and fungi; accessed in April 2025). Taxonomic results were then filtered using an in-house script to remove known environmental and reagent-derived contaminants, based on both internal negative controls and previously published contaminant lists. De novo assembly of high-quality, non-human reads was performed using MEGAHIT (Li et al., 2015). Assembly quality was evaluated by remapping reads back to the assembled contigs. Assembled contigs were further classified taxonomically using Kraken2 (Wood et al., 2019). To validate the presence of specific pathogens identified through taxonomic classification and to assess genome coverage, reference-based mapping was conducted on selected positive samples using Bowtie2 (Langmead and Salzberg, 2012).

All downstream analyses were conducted in R (R Core Team, 2022) (v 4.2.5). Data processing and visualization, including Principal Coordinate Analysis, bar plots, line graphs, and statistical summaries, were performed using the tidyverse (Wickham et al., 2019), ggplot2 (Wickham, 2016), and other relevant R packages. Custom scripts were used to generate read-retention plots across pipeline steps, explore taxonomic profiles, and compare microbial compositions between sample groups and control types. In parallel, decontam (Davis et al., 2018) was used to identify contaminants in our data.

To contextualize the presence of recurrent contaminant genera in clinical samples, the proportion of non-human reads assigned to genera included in the defined contaminant watchlist was calculated for each sample using raw read counts prior to normalization. Moreover, rarefaction (saturation) analyses were performed on host-depleted, pre-filtered genus-level files to assess the adequacy of sequencing depth across sample types, using the R package vegan (Oksanen et al., 2025).
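The per-sample proportion of watchlist reads described above amounts to a simple ratio over raw genus-level counts. The sketch below assumes each sample is available as a mapping from genus name to raw read count after host removal. The function name, the placeholder genus `ExampleVirusGenus` and the toy counts are hypothetical.

```python
def watchlist_fraction(genus_counts, watchlist):
    """Fraction of classified non-human reads assigned to watchlist genera."""
    total = sum(genus_counts.values())
    if total == 0:
        return 0.0
    flagged = sum(n for genus, n in genus_counts.items() if genus in watchlist)
    return flagged / total


# Toy counts for one sample; the values are illustrative, not study data.
sample_counts = {"Phyllobacterium": 900, "Burkholderia": 50, "ExampleVirusGenus": 50}
watchlist = {"Phyllobacterium", "Burkholderia"}
print(f"{100 * watchlist_fraction(sample_counts, watchlist):.1f}% watchlist reads")
```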

3 Results

3.1 Characterization of recurrent contaminants in negative controls

Negative controls showed highly consistent contaminant profiles dominated by recurrent taxa, with no significant differences between extraction vs. library controls. Notably, the next step involving host read depletion altered overall compositional clustering patterns and reduced the total read counts, indicating that high host content dominates Bray–Curtis dissimilarity when present and can obscure microbial-specific variation (Supplementary Figures S1, S2).

To assess variability in community composition across control types, we performed PERMANOVA and beta-dispersion analyses after each processing step. No significant differences were detected between extraction and library controls at any stage (PERMANOVA p-value > 0.05, dispersion p-value > 0.05). Although statistical testing did not reveal any significant difference in beta dispersion at the human-read-removed stage, there is an apparent difference in spread between extraction and library controls observed in the dispersion boxplot (Supplementary Figure S3).

After filtering, a contaminant watchlist was generated by identifying taxonomic families (Supplementary Table S1) and genera (1271) (Supplementary Table S2) that were present in at least half of all negative controls and had counts per million (CPM) > 0.5 (Table 2). From this list, 81 genera, such as Propionibacterium, Gammaretrovirus or Flavobacterium, matched previously published contaminants (Jurasz et al., 2021; Lou et al., 2023; Asplund et al., 2019; Piro and Renard, 2022; Zinter et al., 2019; Salter et al., 2014; Hornung et al., 2019; Miller et al., 2019) (Supplementary Table S3). Moreover, we extracted specific genera according to methodological control type (Supplementary Tables S4, S5).

| Genus | NCBI taxonomy ID | Prevalence (%) | Mean CPM | Median CPM | Interquartile range of CPM |
| --- | --- | --- | --- | --- | --- |
| Phyllobacterium | 28100 | 100.00 | 434929.23 | 746526.62 | 770279.24 |
| Burkholderia | 32008 | 100.00 | 166363.15 | 7837.61 | 360049.24 |
| Escherichia | 561 | 100.00 | 40490.29 | 4817.07 | 34218.42 |
| Alcaligenes | 507 | 100.00 | 18907.37 | 774.93 | 39799.55 |
| Pseudomonas | 286 | 100.00 | 16802.39 | 3018.45 | 4813.42 |
| Micrococcus | 1269 | 100.00 | 13736.17 | 1327.76 | 6938.39 |
| Cutibacterium | 1912216 | 100.00 | 11400.18 | 3053.41 | 22247.96 |
| Sphingomonas | 13687 | 100.00 | 10633.93 | 1176.37 | 3583.25 |
| Moraxella | 475 | 100.00 | 10615.59 | 671.47 | 9709.87 |
| Homo | 9605 | 100.00 | 9350.71 | 4009.78 | 4866.26 |
| Mesorhizobium | 68287 | 100.00 | 9274.51 | 5730.84 | 15776.96 |
| Corynebacterium | 1716 | 100.00 | 7134.13 | 1714.06 | 8339.69 |
| Shewanella | 22 | 100.00 | 6549.39 | 424.87 | 10470.74 |
| Paraburkholderia | 1822464 | 100.00 | 5699.98 | 670.65 | 3090.06 |

Table 2. Top 15 most prevalent and abundant genera detected across all methodological controls.

The table includes genus name, NCBI Taxonomy ID, and abundance statistics: mean counts per million (CPM), calculated from raw genus-level read counts and normalized to the total number of reads per sample. Summary statistics (mean, median, and interquartile range (IQR)) are shown across all negative controls.
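The watchlist criteria (presence in at least half of the negative controls and CPM > 0.5) can be expressed as a short filter. The sketch below makes one explicit assumption that is not specified here: the CPM threshold is applied to the mean CPM across all controls. Function names and the toy control profiles are hypothetical.

```python
def counts_per_million(genus_counts, total_reads):
    """Normalize raw genus-level counts to counts per million reads."""
    return {g: n * 1_000_000 / total_reads for g, n in genus_counts.items()}


def build_watchlist(controls, min_prevalence=0.5, min_mean_cpm=0.5):
    """Select genera recurrent in negative controls.

    controls: list of (genus_counts, total_reads) tuples, one per control.
    A genus is kept when it is detected in at least min_prevalence of the
    controls and its mean CPM across all controls exceeds min_mean_cpm
    (assumption: the threshold applies to the mean).
    """
    n_controls = len(controls)
    detected = {}  # genus -> number of controls in which it was detected
    cpm_sum = {}   # genus -> summed CPM across controls
    for genus_counts, total_reads in controls:
        for genus, cpm in counts_per_million(genus_counts, total_reads).items():
            if cpm > 0:
                detected[genus] = detected.get(genus, 0) + 1
                cpm_sum[genus] = cpm_sum.get(genus, 0.0) + cpm
    watchlist = set()
    for genus, n_detected in detected.items():
        prevalence = n_detected / n_controls
        mean_cpm = cpm_sum[genus] / n_controls
        if prevalence >= min_prevalence and mean_cpm > min_mean_cpm:
            watchlist.add(genus)
    return watchlist


# Four toy controls with one million reads each (values are illustrative).
toy_controls = [
    ({"A": 10, "B": 1}, 1_000_000),
    ({"A": 5, "B": 1}, 1_000_000),
    ({"A": 20, "C": 3}, 1_000_000),
    ({"A": 1}, 1_000_000),
]
print(build_watchlist(toy_controls))  # {'A'}
```

In this toy example, genus B is present in half of the controls but its mean CPM equals 0.5 and therefore does not exceed the threshold, while genus C fails the prevalence criterion.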

Application of this contaminant watchlist reduced the total number of genera to be assessed for downstream analysis in clinical samples by 44.7%, substantially simplifying clinical reporting. After host removal, the proportion of non-host reads classified as recurrent contaminants from the watchlist varied across sample types. Among low-biomass samples, amniotic fluid showed the highest mean percentage of contaminant reads (98.6%), followed by serum (97.7%), cerebrospinal fluid (96.9%), and plasma (94.4%).

High-biomass samples showed greater variability. Nasopharyngeal exudates (97.3%) and nasopharyngeal swabs (92.6%) displayed contaminant proportions comparable to low-biomass samples, whereas the fecal sample showed a substantially lower proportion of contaminant reads (48.9%), consistent with a higher endogenous microbial content (Supplementary Tables S6, S7).

To explore sample-control relationships, PCoA was performed on 151 samples (including replicates) and 18 blank negative controls after each pipeline step. Samples were stratified by biomass level (low-biomass samples, high-biomass samples). Moreover, rarefaction (saturation) analyses were performed stratified by clinical sample type to evaluate sequencing depth adequacy (Supplementary Figures S4–S6). Across sample types and methodological controls, curves demonstrated early plateauing behaviour indicating that sequencing depth was sufficient to capture the majority of the detectable taxa within each matrix.

In early processing steps (after the removal of low-quality sequences), low-biomass samples clustered tightly with controls, and PERMANOVA revealed no significant compositional differences (PERMANOVA p-value > 0.05, dispersion p-value > 0.05, Figure 1a), indicating that contamination dominated because of the limited true microbial signal. For decontamination, we employed BBDuk (Bushnell, 2014)-based read subtraction using negative controls as reference. This led to partial separation of low-biomass samples from controls in the Bray-Curtis space (Figure 1b), although the PCoA still showed overlap. PERMANOVA confirmed a statistically significant difference between low-biomass samples and methodological negative controls (R² = 0.053, p-value < 0.001). However, beta dispersion analysis indicated a significantly higher within-group variability (F = 25.0, p-value < 0.001), suggesting that heterogeneity in dispersion may partially contribute to the observed group differences. Conversely, PCoA of high-biomass samples showed a clear separation from controls as early as the trimming step (PERMANOVA p-value < 0.05) (Figure 1c). This trend continued across the subsequent pipeline stages, and by the last step, separation was pronounced, with PCoA1 explaining 36.4% of the variance (Figure 1d). In contrast, low-biomass samples clustered tightly with controls across steps.

Panel a shows a PCoA plot comparing low-biomass samples (green circles) and controls (blue triangles) along axes PCoA1 (37.7%) and PCoA2 (19.6%). Panel b contains a PCoA plot with low-biomass samples and controls, plotted on PCoA1 (18.7%) and PCoA2 (10.7%). Panel c displays a PCoA plot of high-biomass samples (green circles) and controls on axes PCoA1 (42.3%) and PCoA2 (29%). Panel d features a PCoA plot of high-biomass samples and controls, shown on PCoA1 (36.4%) and PCoA2 (15.9%). Group identities are indicated by color and shape in each panel.

Principal coordinates analysis (PCoA) at genus level based on Bray-Curtis distances, showing clinical samples and methodological controls. (a) Low-biomass samples after initial preprocessing (adapter and quality trimming). (b) Low-biomass samples after BBDuk (Bushnell, 2014) read subtraction decontamination strategy. (c) High-biomass samples after initial preprocessing (adapter and quality trimming). (d) High-biomass samples after BBDuk (Bushnell, 2014) read subtraction decontamination.
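The Bray-Curtis distances underlying these ordinations can be computed directly from genus-level counts. The following sketch is a plain implementation of the standard formula, not the code used in the study. The toy profiles are hypothetical and only illustrate why a large shared host component pulls two otherwise distinct profiles together.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two genus-count dictionaries."""
    genera = set(a) | set(b)
    shared = sum(min(a.get(g, 0), b.get(g, 0)) for g in genera)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * shared / total


# Two toy profiles with no microbial genera in common.
sample = {"GenusX": 10}
control = {"GenusY": 10}
print(bray_curtis(sample, control))  # 1.0

# Same profiles with an abundant shared host component added.
sample_host = {"Homo": 1000, "GenusX": 10}
control_host = {"Homo": 1000, "GenusY": 10}
print(round(bray_curtis(sample_host, control_host), 4))
```

With the host reads included, the dissimilarity in this toy case drops below 0.01, which mirrors the observation that host content dominates the distance when present.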

To further reduce background noise, we applied prevalence-based filtering using decontam (Davis et al., 2018), alongside our contaminant watchlist for interpretative purposes to flag ambiguous or low-abundance taxa without risking the exclusion of true biological signals (Supplementary Figure S7). Despite elevated group heterogeneity (beta dispersion p-value < 0.05), the total retained microbial abundance per sample was significantly higher in clinical samples than in negative controls (Wilcoxon p-value < 0.001), supporting the presence of true biological content distinct from background contamination.

3.2 Assessment of cross-contamination and index hopping

To evaluate cross-contamination, nine negative controls (sterile water), one of which was spiked with a hepatitis C virus (HCV) clone, were sequenced. A hepatitis E virus (HEV)-positive plasma sample was included in the same batch in three separate sequencing runs. Despite stringent protocols, HCV reads were detected in other samples in all three sequencing runs at significantly lower read counts (between 2 and 161 reads) compared to the HCV sample. HEV reads appeared only in the HEV samples in two out of the three replicate sequencing runs. However, HEV reads appeared in the HCV-spiked sample at very low read counts in one of the batches (Figure 2a).

Panel a illustrates a laboratory workflow with multiple sample tubes, some labeled as containing HCV or HEV, and colored arrows indicating three experimental runs and contamination pathways; bidirectional and unidirectional contamination routes are indicated by dashed and solid arrows. Panel b presents a heatmap matrix with numerical values and a blue color gradient denoting contamination levels between samples, with sample identifiers listed along both axes and a color scale for interpretation.

Cross-contamination results. (a) Schematic figure of the three cross-contamination experiments when performing library preparation. Arrows indicate the presence of viral reads detected in each sample post-sequencing and post-upstream analysis, demonstrating cross-contamination. Green virions in the third tube represent the presence of HCV in the tube, while purple virions in the eighth tube represent the presence of HEV. (b) Heatmap showing the mean percentage of reads from the three sequencing runs misassigned from donor samples (rows) to receiver samples (columns), indicative of index hopping during Illumina sequencing. The values represent the percentage of reads erroneously assigned across sample indexes.

To further investigate these low-level detections, we analyzed shared tile and index combinations across samples to identify potential index misassignment events in the three sequencing runs. This analysis revealed pairwise misassignment events across samples in the three sequencing runs, with a pattern of dual-index sharing consistent with index hopping. Most sample pairs showed minimal read transfer (<0.01%), indicating low background cross-contamination. However, a few pairs displayed higher misassignment rates (0.2–0.5%), consistent with the previously reported range for index hopping (0.1–2%) (https://sapac.illumina.com/techniques/sequencing/ngs-library-prep/multiplexing/index-hopping.html) (Figure 2b).
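The pairwise misassignment percentages discussed above can be illustrated with a small calculation. This sketch assumes the rate is expressed as misassigned reads relative to the donor library's total reads; the exact denominator used for the heatmap is not specified here. The function names and read counts are hypothetical, while the 0.01% and 0.1–2% bounds are the values quoted in the text.

```python
def misassignment_percent(transferred_reads, donor_total_reads):
    """Percentage of a donor library's reads found under a receiver's index."""
    if donor_total_reads <= 0:
        raise ValueError("donor_total_reads must be positive")
    return 100.0 * transferred_reads / donor_total_reads


def classify_rate(percent, background_max=0.01, reported_min=0.1, reported_max=2.0):
    """Label a misassignment rate relative to the ranges quoted in the text."""
    if percent < background_max:
        return "background"
    if reported_min <= percent <= reported_max:
        return "within reported index-hopping range"
    if percent > reported_max:
        return "above reported range"
    return "intermediate"


# Hypothetical donor library of one million reads.
for transferred in (5, 3000):
    pct = misassignment_percent(transferred, 1_000_000)
    print(transferred, f"{pct:.4f}%", classify_rate(pct))
```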

More specifically, for the HCV-spiked sample, which exhibited a strong abundance gradient, low-level HCV reads were detected in other samples, consistent with potential transfer from high-abundance libraries. However, index-sharing analysis does not allow unequivocal determination of read origin, and directionality is inferred from relative abundance patterns rather than directly observed.

3.3 mNGS diagnostic performance across clinical infection types

3.3.1 Assessment of a chronic infection

We analyzed 17 longitudinal plasma samples from a patient with a chronic HEV infection, with viral loads ranging from 35 IU/mL to 2.2 × 10⁶ IU/mL. Metagenomic sequencing, along with background subtraction, successfully detected HEV reads in 15 out of 17 samples. Detection was consistent at viral loads above 10³ IU/mL, with HEV reads ranging from 82.09 to 7.10 × 10⁵ counts per million (CPM) (Supplementary Table S8). In contrast, samples with viral loads lower than 10³ IU/mL showed inconsistent or no detection of HEV (Figure 3), despite being qPCR-positive. This suggests that 10³ IU/mL is the lowest viral load at which detection by mNGS becomes reproducible and reliable. In addition, we observed a strong positive correlation between the log-transformed viral load (IU/mL) and the log-transformed normalized read counts (CPM) obtained by metagenomic sequencing (Pearson’s r = 0.93, p-value < 0.001), indicating that higher viral loads lead to a proportionally enhanced detection of the pathogen. This correlation underscores the quantitative potential of metagenomic sequencing in virome studies and highlights detection limits relevant to clinical diagnostics.

Panel a displays a line graph comparing log10(CPM) and log10(viral load IU/mL) across 17 HEVC samples, with vertical dashed lines marking RBV600, RBV800, and RBV1000 groups. Panel b shows a scatter plot of log10(CPM) versus log10(viral load IU/mL), including a regression line with R squared value of 0.88 and a statistically significant p-value less than 0.001.

Viral load vs metagenomic detection (CPM). (a) Line plot showing log-transformed viral load (IU/mL) for hepatitis E virus (HEV) and its corresponding log-transformed viral counts per million. (b) Correlation plot between log-transformed viral load and log-transformed normalized read counts (CPM).
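The log-log correlation between viral load and CPM can be reproduced in outline with a few lines of code. This sketch makes an assumption that is not stated here: a pseudocount is added to CPM so that samples with zero viral reads can be log-transformed. Function names and the example values are hypothetical and are not the study's measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_log_correlation(viral_loads, cpms, pseudocount=1.0):
    """Correlate log10(viral load) with log10(CPM + pseudocount).

    The pseudocount is an assumption that lets undetected samples
    (CPM = 0) stay in the analysis.
    """
    log_load = [math.log10(v) for v in viral_loads]
    log_cpm = [math.log10(c + pseudocount) for c in cpms]
    return pearson_r(log_load, log_cpm)


# Illustrative values only (IU/mL and CPM), not data from the study.
loads = [50, 800, 5_000, 60_000, 2_000_000]
cpms = [0, 0, 150, 20_000, 600_000]
print(round(log_log_correlation(loads, cpms), 3))
```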

In samples with viral loads above 5 × 10⁴ IU/mL, near-complete recovery was achieved, with > 99% genome coverage, CPM values exceeding 170,000, and full-length assemblies generated as single contigs (N50 = longest contig = total length ≈ 7,200 bp) (Supplementary Table S8; Supplementary Figure S8). In contrast, samples with low viral loads (< 10³ IU/mL), which exhibited inconsistent HEV detection, showed minimal or no genome coverage, CPMs below 1,000, and failed to yield contig assemblies, underscoring the reduced sensitivity of metagenomic sequencing at low input concentrations. Intermediate viral load cases (e.g., HEVC-P-8, HEVC-P-9, HEVC-P-16) produced partial genome coverage (24–75%) and fragmented assemblies composed of multiple shorter contigs with reduced N50 values, suggesting a gradual relationship between input viral load and assembly completeness (Supplementary Table S8).
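The assembly metric used above, N50, is the length of the contig at which the cumulative length of contigs sorted from longest to shortest first reaches half of the total assembly length. A minimal sketch follows; the contig lengths are illustrative and are not taken from Supplementary Table S8.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if 2 * running >= total:
            return length
    return 0


print(n50([7200]))                    # 7200: a single full-length contig
print(n50([3000, 2000, 1000, 500]))   # 2000: a fragmented assembly
```

For a single-contig assembly, N50, longest contig and total length coincide, which is the pattern reported for the high viral load samples.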

3.3.2 Acute viral infections

We analyzed 14 clinical samples from patients with serologically and RT-qPCR confirmed acute viral infections. These included ten plasma samples (low-biomass) with Cytomegalovirus (CMV), Human polyomavirus 1 (BKV) and Epstein–Barr virus (EBV), with viral loads ranging from 226 IU/mL to 2.87 × 10⁵ IU/mL. In addition, four samples from a patient with acute HEV genotype 1 infection were analyzed: two plasma samples, one amniotic fluid sample (all low biomass), and one fecal sample (high biomass).

mNGS detected the qPCR-confirmed pathogen in eleven out of fourteen samples (Table 3). Consistent with our prior observations, detection became unreliable below viral loads of 10³ IU/mL. Moreover, the log-transformed viral load (IU/mL) strongly correlated with log-transformed CPM values (Pearson’s r = 0.77, p-value < 0.05), further confirming that higher viral loads correlate with higher read counts in metagenomics and reinforcing the threshold for reliable viral detection in both chronic and acute infection contexts (Supplementary Figure S9). Viral load also correlated positively with genome coverage (Pearson’s r = 0.61, p-value < 0.05), with coverage spanning from 0% in low-titer samples up to 100% in two HEV samples (HEVA-P-1, HEVA-F-1). While viral load was the main driver of detection and assembly, some exceptions (PC-P-3) suggest that factors such as read quality, library preparation biases, or strain divergence also influence assembly and detection success.

Sample ID | Target virus | Viral load (IU/mL) | CPM | Genome covered (%) | Number of contigs | N50 | Longest contig | Total length
PC-P-1 | CMV | 23,208 | 13083.1511.076
