Staphylococcus aureus (S. aureus) causes more than 100 000 deaths worldwide each year that are attributable to antimicrobial resistance (AMR), and it remains a persistent threat because its resistance mechanisms continue to evolve [1,2]. It acquires resistance through multiple strategies, including horizontal gene transfer, chromosomal mutations, efflux pump activity, and enzymatic modification or inactivation of antibiotics [1,3]. Moreover, new resistance mechanisms have emerged and spread globally, reducing the efficacy of current treatments against common bacteria that cause severe and often fatal infections [4].
Machine learning approaches have been applied to large-scale genome sequencing and drug susceptibility datasets to uncover genetic determinants that shape AMR [5–8]. These models typically rely on two major categories of genomic features: (1) predefined markers such as known resistance genes, which are interpretable but limited by prior knowledge [5,9–11]; and (2) reference-agnostic k-mer representations, which can capture novel and complex determinants without requiring a reference genome [12–15]. However, k-mer-based approaches often yield high-dimensional models with limited biological interpretability [12,16]. Thus, an ideal framework would balance the discovery power of k-mers with the interpretability offered by gene-based features.
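To make the dimensionality issue concrete, the following minimal sketch builds a reference-agnostic k-mer presence/absence matrix from raw sequences. It is an illustration only, not the pipeline used in this study: the function names (`kmers`, `presence_matrix`) and the toy sequences are hypothetical, and reverse-complement canonicalisation, which real k-mer tools usually apply, is omitted for brevity.

```python
def kmers(sequence, k):
    """Return the set of distinct k-mers in one sequence (no reference needed)."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_matrix(genomes, k):
    """Build a binary presence/absence matrix over all k-mers seen in any genome.

    genomes: dict mapping an isolate name to its sequence string (toy input).
    Returns (vocabulary, rows), where rows[name][j] is 1 if vocabulary[j]
    occurs in that isolate and 0 otherwise.
    """
    per_genome = {name: kmers(seq, k) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*per_genome.values()))
    rows = {
        name: [int(kmer in found) for kmer in vocabulary]
        for name, found in per_genome.items()
    }
    return vocabulary, rows


# Toy example with k = 3; hypothetical sequences, not real isolates.
toy_genomes = {"isolate_a": "ACGTA", "isolate_b": "CGTAC"}
vocabulary, rows = presence_matrix(toy_genomes, k=3)
```

Even in this toy case the feature columns are anonymous strings, and the number of columns grows with every new sequence variant, which is why such matrices become high-dimensional and hard to interpret without a mapping back to genes.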
Furthermore, resistance is often governed by coordinated genetic networks rather than single genes. For instance, mecA expression is controlled by the regulatory elements mecR1 and mecI, and additional two-component systems modulate cell-wall homeostasis [17–19]. These observations motivate approaches that retain genomic context, yield interpretable gene-level features, and explicitly consider multi-gene effects (genetic interactions [GI]/epistasis) that are actionable for combination strategies [20,21].
Therefore, we developed a fine-grained, reference-agnostic gene-context 22-mer (gkmer) representation that links sequence signals to specific genes and functional domains, enabling interpretable gene-level features while retaining the discovery power of k-mer analysis. Specifically, we introduced: (1) a two-step random forest pipeline (RF1 and RF2) that transitions from gkmer screening and mapping to interpretable gene-level models; (2) co-information-based gene-synergy networks that capture higher-order (epistasis-like) effects beyond pairwise associations; and (3) protein-structure mapping to anchor features in putative functional regions. Collectively, these innovations establish a generalizable framework that extends k-mer analysis beyond black-box prediction, providing a systematic and interpretable approach for modelling AMR, uncovering higher-order GI, and generating mechanistic hypotheses.
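As an illustration of the co-information idea in item (2), the sketch below scores the synergy of two binary gene features with respect to a binary resistance phenotype. It assumes one common formulation, synergy = I(G1,G2;Y) − I(G1;Y) − I(G2;Y), estimated from empirical frequencies; the exact definition, sign convention, and estimator used in this study may differ, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable observations."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def synergy(gene1, gene2, phenotype):
    """Information the gene pair carries jointly beyond the sum of its parts.

    Positive values suggest an epistasis-like (synergistic) effect;
    negative values suggest redundancy. This is an assumed convention.
    """
    joint = list(zip(gene1, gene2))
    return (
        mutual_information(joint, phenotype)
        - mutual_information(gene1, phenotype)
        - mutual_information(gene2, phenotype)
    )


# Toy data: the phenotype is the XOR of two gene features, so neither gene
# is informative alone, but the pair determines the phenotype completely.
g1 = [0, 0, 1, 1]
g2 = [0, 1, 0, 1]
resistant = [0, 1, 1, 0]
xor_score = synergy(g1, g2, resistant)
```

In this XOR toy case each gene alone shares no information with the phenotype, whereas the pair shares one full bit, so the score is 1 bit. Two perfectly redundant genes would instead give a negative score. Scores of this kind, computed over gene pairs, could serve as edge weights in a gene-synergy network.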