A novel explainable AI for revealing determinants of cancer drug response through integrative multi-omics analysis

Abstract

Introduction:

Cancer drug response rates differ across patients and cell lines; however, many current computational prediction models still function as uninterpretable black boxes, offering limited insight into why a given treatment works well for some patients or cell lines but poorly for others.

Methods:

Here, we introduce an interpretable cancer drug response prediction framework that leverages multi-omics data from the Genomics of Drug Sensitivity in Cancer 2 (GDSC2) resource, including genomics, transcriptomics, and proteomics, where available, together with explicit chemical drug descriptors derived from SMILES and InChI representations. We use a Modified Neighbor-Joining Algorithm (MNJA) to generate topology-aware gene-sequence trees. Combined multi-omics and drug features are summarized into high-level deep descriptors via a decimal-scaled GoogLeNet (DS-GoogLeNet), together with lightweight handcrafted features. A Smoluchowski Kookaburra Optimization Algorithm (SKOA) then selects informative multimodal features, which are used to classify the sensitivity or resistance of each cell line-drug pair with an explainable Aranda Graph Attention Network (EA-GAT).

Results:

By analyzing model behavior using SHAP-based feature attributions and subsequently subjecting SHAP-ranked genes to pathway enrichment analysis, we highlight the recurrent involvement of the PI3K/AKT/mTOR pathway and related downstream signaling cascades in drug response. Under leakage-safe stratified 10-fold cross-validation on 2614 GDSC2 cell line-drug pairs, the framework attains an accuracy of 95.87% and an F1-score of 95.87%, with an area under the receiver operating characteristic curve (AUROC) of 0.957 and an area under the precision-recall curve (AUPRC) of 0.946.

Discussion:

Overall, the framework appears to predict drug response accurately while also supporting biologically meaningful interpretation, making it a useful computational tool for hypothesis generation and biomarker-focused investigation in oncology.

1 Introduction

Cancer remains one of the major causes of illness and death worldwide, despite substantial progress in screening and treatment (1). Drug therapy is, of course, central to the management of many human diseases, including allergies, asthma, infections, and cancer (2). To inhibit tumor proliferation and improve patient outcomes, modern oncology relies on pharmaceutical products that disrupt critical biological pathways involved in tumor cell division (e.g., DNA replication), mitotic progression, or growth factor signal transduction (3). Unfortunately, not all patients experiencing the same type of cancer, nor all tumors of the same histopathologic classification, react equally well to the same drug. While some patients may achieve considerable benefits from using the drug, other patients may receive little benefit but experience adverse reactions.

Accurately predicting how a given tumor will respond to a particular drug before treatment is therefore critical. Cancer drug response prediction (CDRP) seeks to learn the relationship between molecular profiles and observed drug sensitivity so that ineffective drugs can be avoided, unnecessary toxicity reduced, and treatment decisions aligned more closely with the biological characteristics of each patient’s disease (4, 5). Large-scale pharmacogenomic resources, such as cancer cell line panels with matched molecular and drug-response data, provide an important substrate for developing and benchmarking such predictive models.

In recent years, artificial intelligence and machine learning have become crucial in predicting cancer drug responses. Early prediction methods relied on classic machine learning approaches such as support vector machines, decision trees, and ensemble learning to predict drug sensitivity from a single omics layer, typically gene expression data. Newer methodologies tend to use deep learning-based methods, including reference drug-based deep neural networks, which use fully connected models and CNNs to model nonlinear relationships between molecular features and drug responses (6–9). These approaches have improved predictions in many cases, but they generally use only part of the available molecular information and often act as black boxes that reveal little about the reasons for their predictions (10).

The current research in this area has several shortcomings. First, most CDRP models use data from a single molecular modality (e.g., transcriptomic data) and therefore do not fully integrate complementary information from genomics, transcriptomics, and proteomics. This partial view of tumor biology can constrain predictive power and obscure mechanisms that drive sensitivity or resistance. Second, drug molecular structure is rarely modeled by CDRP models. However, molecular structure significantly affects both how drugs interact with cells (i.e., pharmacodynamics) and how they are metabolized and excreted (i.e., pharmacokinetics), so modeling structure could improve prediction accuracy (11). Third, although some CDRP models identify genes whose expression correlates with drug response, they often provide only limited interpretability and do not support clear gene- or pathway-level explanations that can be examined biologically (12).

These gaps motivate the development of an integrative and explainable framework for cancer drug response prediction. In this work, we propose a model that combines multi-omics data with explicit drug molecular representations and is coupled to an Explainable AI (XAI) layer. Briefly, our model produces topology-aware sequence trees using the Modified Neighbor-Joining Algorithm (MNJA), extracts deep features with DS-GoogLeNet alongside handcrafted features and auxiliary attributes, selects an optimized feature subset with the Smoluchowski Kookaburra Optimization Algorithm (SKOA), and then classifies cell line–drug pairs using a graph-attention classifier. We use post-hoc SHAP analysis and pathway enrichment on top of the predictive core to determine which genes and signaling pathways contribute most to each individual prediction. An in-depth case study involving PI3K inhibitors demonstrates that our model can recover known predictors of response and that the PI3K/AKT/mTOR signaling pathway plays a significant role in determining response to these drugs. Our model therefore connects numeric predictions with interpretable biological evidence.

In summary, this study advances cancer drug response prediction by addressing both the integration and interpretability challenges. The major contributions of this study include:

Integrative multi-omics modeling: A single framework represents each cancer cell line by integrating three types of data (genomics, transcriptomics, and proteomics), which provides a more complete basis for predicting cancer drug response.

Explicit incorporation of drug structure: Each drug’s molecular structure is encoded in the model alongside the cellular features, which helps capture how drugs may affect cells differently based on their chemical makeup.

Topology-aware gene representation: We generate sequence- and tree-based features via MNJA to obtain more informative and structured gene-level representations.

Rich, optimized feature extraction and selection: We extract a large, diverse set of features using attribute-based descriptors and DS-GoogLeNet, followed by an optimization-guided feature selection step that favors compact, generalizable subsets.

Explainable prediction with pathway-level insight: We employ SHAP and pathway enrichment analysis to provide gene and pathway-level explanations of model predictions. A detailed PI3K inhibitor case study demonstrating a direct link between the model’s rationale and established biological mechanisms of cancer is also provided.

Experimental in vitro validation: The model’s predictions were validated in a separate 72-hour dose-response experiment in which six genomically characterized cell lines were treated with the PI3K inhibitor Pictilisib. The model achieved a binary accuracy of 83.3% and good agreement between its ranked-order sensitivity values and the measured IC50 values.

1.1 Related work

A large body of work has applied machine learning and deep learning to cancer drug response prediction, spanning different data modalities and model architectures. An early example of deep learning for predicting drug sensitivity from gene expression was reported by Chawla et al., who paired a high-dimensional gene expression matrix with drug information and trained a deep neural network to distinguish between effective and ineffective drugs (11). They demonstrated that predicting drug responsiveness from gene expression alone yields reasonable accuracy. However, the work was confined to a single omics dimension, and its explanations were limited to broad feature-importance interpretations without systematic integration of additional molecular dimensions.

Hostallero et al. introduced TINDL (Tissue-Informed Normalisation Deep Learning) to strengthen preclinical-to-clinical transfer (12). The approach normalizes gene expression in a tissue-aware manner before training deep models to predict anti-cancer drug response in patients. By accounting for tissue context, TINDL narrows the gap between cell lines and clinical samples and surfaces candidate biomarkers along the way. The pipeline is nevertheless computationally demanding, relies almost exclusively on transcriptomics, and offers no explicit encoding of drug molecular structure or fine-grained mechanistic rationale for its predictions.

Paltun et al. approached the problem from a data integration angle with DIVERSE, a Bayesian framework that fuses gene expression, drug similarity, and protein–protein interaction information for precise drug response prediction (13). Joint modeling of these heterogeneous sources yielded clear gains over single-modality baselines, reinforcing the case for integrative learning. Yet feature selection was not explicitly optimized for generalization, and the model remained comparatively opaque: recovering clean gene- or pathway-level rationales for individual predictions is difficult.

In a different vein, Lee et al. developed a gene-centric CDRP method built around convolutional encoders (14). Here, gene expression and somatic mutation profiles are reshaped into tensor representations and passed through a CNN to predict sensitive versus resistant phenotypes. The design exploits spatial structure in the constructed tensors and effectively uses combined genomic features. Drugs, however, are largely handled as categorical labels, and detailed molecular descriptors are not deeply integrated—constraints that limit the method’s ability to disentangle drug-specific mechanisms.

Complementing these efforts, Qureshi et al. assessed machine learning-based drug response prediction in lung cancer patients using an Extreme Gradient Boosting model trained on genomics and clinical variables (15). The study made a convincing case for combining clinical context with molecular features. Its reliance on time-intensive longitudinal data collection is a practical drawback, and drug chemistry is not explicitly encoded. Interpretability, in turn, is delivered through tree-based feature importance rather than pathway-aware explanations.

Recent work has expanded the use of multi-omics data in CDRP. Wang et al. proposed a deep learning approach that integrates multiple omics layers—including gene expression, copy number variation, mutation, and protein array data—together with graph embeddings based on biological networks, using attention mechanisms to weight contributions from different omics types (16). Liu and Mei developed NDSP, which combines multi-omics data with similarity network fusion and deep learning to reduce dimensionality and mitigate overfitting, while preserving meaningful similarity structure among samples (17). Sharma et al. introduced DeepInsight-3D, which transforms multi-omics profiles into multi-channel images and applies convolutional neural networks to predict anti-cancer drug responses (18). Ahmad et al. recently applied machine learning to genomic profiling and drug discovery in lung cancer (19). These methods demonstrate that structured representations of multi-omics data can substantially improve predictive performance, but they often provide limited direct interpretability at the level of specific genes and pathways.

Other studies have explicitly integrated multi-omics features with detailed drug representations. Mohammadzadeh-Vardin et al. proposed DeepDRA, which uses autoencoders to integrate multi-omics data with drug descriptors and fingerprints in a drug repurposing context (20). Wu et al. introduced PASO, a framework that uses pathway-based difference features derived from multi-omics data and SMILES representations of drugs, combined with transformer encoders, multi-scale convolutions, and attention mechanisms to capture complex drug–cell interactions (21). These approaches highlight the advantages of representing both the cellular and drug dimensions in a richer, more mechanistic manner, but their interpretability is often limited to attention weights or global feature rankings, and proteomics is not always systematically incorporated.

Explainability has begun to receive more focused attention in this domain. Tang and Gottlieb proposed PathDSP, an explainable model that combines chemical structure fingerprints with pathway-level enrichment scores derived from gene expression, mutation, and copy number variation data (22). By operating at the pathway level and applying SHAP analysis, PathDSP provides more transparent, pathway-centric explanations of drug sensitivity. Nevertheless, it does not fully exploit the breadth of available omics modalities, such as proteomics, and its explanations are primarily aggregated at the pathway level rather than at the level of individual genes and their interactions with specific drugs. Broader surveys and mapping studies on deep learning for CDRP further emphasize that many high-performing models still trade interpretability for accuracy and rarely deliver mechanistic insight that can be directly related to known signaling cascades and drug mechanisms (23).

In parallel, several multi-omics integration studies in oncology more generally, including recent work in breast cancer, have shown that AI-based integration of heterogeneous molecular data can improve precision stratification and drug resistance prediction (24). Recently, machine learning approaches have also shown considerable promise in lung cancer treatment-response prediction. Interpretable frameworks using genomic and clinical variables have been proposed for drug-response analysis in non-small cell lung cancer (NSCLC) ()?, while integrated multi-omics transcriptomic signatures have been validated for immunotherapy-response stratification through FOXO-mediated transcriptional features ()?. Collectively, these studies reinforce the importance of jointly modeling multiple omics layers and suggest that the same principles can be extended to drug-response prediction in a more mechanistically grounded way (10).

Taken together, the literature points to two broad conclusions: deep learning combined with multi-omics integration can deliver strong performance on cancer drug response prediction, and performance generally benefits from the inclusion of drug molecular information and biological priors such as pathways and network structure. Three critical gaps nevertheless persist. Many frameworks still privilege a subset of omics modalities, or combine them only loosely, rather than constructing a genuinely unified multi-omics representation. Detailed drug molecular structure is not consistently encoded alongside cellular features in ways that support mechanistic interpretation. And interpretability is too often reduced to generic feature-importance scores or attention maps; relatively few models are built to yield coherent gene- and pathway-level explanations that can be cross-checked against established cancer biology.

The framework proposed here is designed to close these gaps. It integrates genomics, transcriptomics, and proteomics with explicit drug structural representations, and it treats explainability as a core design principle rather than an afterthought—using SHAP-based attributions and pathway enrichment analyses to expose the mechanistic determinants of drug response.

The rest of the paper is organized as follows. Section 2 describes the materials and methods, including data processing, feature construction, model architecture, and evaluation protocol. Section 3 presents the experimental results and the associated biological interpretation. Section 4 discusses the findings in the context of existing work and outlines the limitations of the study. Finally, Section 5 concludes the paper and highlights directions for future development.

2 Materials and methods

2.1 Data source and cohort curation

The primary pharmacogenomic data source for our investigation was the Genomics of Drug Sensitivity in Cancer 2 (GDSC2) resource (25). GDSC2 provides drug sensitivity measurements for nearly 700 cancer cell lines and more than 138 anti-cancer drugs across an estimated 75,000 experiments. We did not use the complete original GDSC resource. Rather, we created a cohort of matched cell line–drug pairs for which response measures, molecular characteristics, and drug structural characteristics were all available.

Our starting cohort had 8,146 cell line–drug pairs. From this cohort, we removed 854 pairs with missing or invalid response measurements. An additional 4,179 pairs were removed because of incomplete molecular profiles. Lastly, 499 pairs were removed because no usable drug structural information was available. A total of 2,614 labeled cell line–drug pairs therefore remained for analysis. Table 1 summarizes the data curation process.

| Curation step | Retained pairs | Excluded at this step | Reason |
|---|---|---|---|
| Initial matched candidate cohort | 8,146 | – | Candidate cell line–drug pairs after matching core response, molecular, and drug-structure fields |
| After removing missing/invalid response measurements | 7,292 | 854 | No usable IC50 or invalid response summary |
| After removing incomplete molecular profiles | 3,113 | 4,179 | Missing required molecular feature information for the selected omics feature set |
| After removing pairs without usable drug structural information | 2,614 | 499 | Missing or unusable structural representation fields |
| Final labeled cohort | 2,614 | 5,532 (cumulative) | Used in all experiments |

Table 1. Summary of the data curation process used to derive the final analytical cohort from the matched GDSC2 cell line–drug pairs.

Bold values indicate the final labeled analytical cohort used in all experiments.
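The step-by-step counts in the curation table can be checked with simple arithmetic. The sketch below only replays the reported numbers; the variable and function names are illustrative and are not part of the original pipeline.

```python
# Counts taken from the curation table; names here are illustrative only.
initial_pairs = 8146
exclusions = [
    ("missing/invalid response", 854),
    ("incomplete molecular profile", 4179),
    ("no usable drug structure", 499),
]


def apply_exclusions(start, steps):
    """Return the retained pair count after each exclusion step."""
    retained = []
    remaining = start
    for _reason, removed in steps:
        remaining -= removed
        retained.append(remaining)
    return retained


retained_counts = apply_exclusions(initial_pairs, exclusions)
total_excluded = sum(removed for _reason, removed in exclusions)

print(retained_counts)  # [7292, 3113, 2614]
print(total_excluded)   # 5532
```

The final retained count (2,614) and the cumulative exclusions (5,532) agree with the last row of the table.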

All subsequent preprocessing, feature extraction, feature selection, and model training steps were carried out within the training folds of the stratified 10-fold cross-validation procedure. The same curated cohort was used throughout the comparative experiments reported in this study.

2.2 Response label definition

Cancer drug response was formulated as a binary classification task. Each retained cell line–drug pair was assigned a sensitivity label using an IC50-based threshold. Pairs with IC50 < 1 µM were labeled as sensitive, whereas pairs with IC50 ≥ 1 µM were labeled as resistant. Applying this threshold to the final curated cohort of 2,614 cell line–drug pairs yielded 1,241 sensitive pairs (47.48%) and 1,373 resistant pairs (52.52%). The same threshold was used in the orthogonal Pictilisib validation analysis to maintain consistency between the computational and experimental evaluations. Stratified 10-fold cross-validation was used to preserve class proportions across folds during training and testing.
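A minimal sketch of the labeling rule and of a stratified fold assignment is given below. The 1 µM threshold and the class counts come from the text; the round-robin fold assignment, the seed, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter

THRESHOLD_UM = 1.0  # IC50 cut-off in micromolar, as stated in the text


def label_pair(ic50_um):
    """Sensitive if IC50 < 1 µM, otherwise resistant."""
    return "sensitive" if ic50_um < THRESHOLD_UM else "resistant"


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds class by class (round-robin),
    so that each fold keeps roughly the overall class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class counts reported for the curated cohort.
labels = ["sensitive"] * 1241 + ["resistant"] * 1373
folds = stratified_folds(labels)
for fold in folds[:2]:
    print(len(fold), Counter(labels[i] for i in fold))
```

With this assignment every fold holds 124 or 125 sensitive pairs and 137 or 138 resistant pairs, which mirrors the 47.48% / 52.52% split of the full cohort.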

2.3 Proposed cancer drug response prediction methodology

The proposed work performs Cancer Drug Response Prediction (CDRP) based on the Smoluchowski Kookaburra Optimization Algorithm (SKOA) and an Enhanced Graph Attention Network (EA-GAT). The overall framework integrates multi-omics information from cancer cell lines with explicit drug molecular structure and an explainability layer based on SHAP and pathway enrichment. Figure 1 depicts the architecture of the proposed model.

Figure 1. Architecture of the proposed cancer drug response prediction model. The flowchart shows a pipeline over the GDSC dataset in which gene, drug, and cell data pass through sequence identification, structure conversion, feature extraction, feature selection, and classification to predict drug sensitivity or resistance.

2.3.1 Input data

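The curated GDSC2 cohort described in Sections 2.1 and 2.2 was used to define three main entities in this study: cancer cell lines, genes, and drugs, represented as shown in Equations (1)–(3):

\[ C = \{c_1, c_2, \dots, c_{N_c}\} \tag{1} \]
\[ G = \{g_1, g_2, \dots, g_{N_g}\} \tag{2} \]
\[ D = \{d_1, d_2, \dots, d_{N_d}\} \tag{3} \]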

where Nc, Ng, and Nd denote the number of cell lines, genes, and drugs under study, respectively.

For each cell line ci ∈ C, we extracted genomics and transcriptomics features, including binary mutation calls, copy number variation, and normalized gene expression values. Where protein or phospho-protein measurements were available within the matched molecular profile, these were incorporated as additional covariates to approximate proteomic activity. All continuous features were standardized to zero mean and unit variance within the training folds, as described in the evaluation protocol.
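The fold-wise standardization mentioned above can be sketched as follows: the mean and standard deviation are estimated on the training fold only and then reused for the held-out fold, which is what keeps the procedure leakage-safe. The toy values, the use of the population standard deviation, and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Estimate mean and standard deviation from training values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    # Guard against constant features, which would give a zero divisor.
    return mu, (sigma if sigma > 0 else 1.0)


def transform(column, mu, sigma):
    """Apply training-fold statistics to any column (training or held-out)."""
    return [(x - mu) / sigma for x in column]


train_expr = [2.0, 4.0, 6.0, 8.0]  # toy expression values, training fold
test_expr = [5.0, 10.0]            # held-out fold, never used for fitting

mu, sigma = fit_scaler(train_expr)
train_z = transform(train_expr, mu, sigma)
test_z = transform(test_expr, mu, sigma)
print(train_z, test_z)
```

The held-out values are not guaranteed to have zero mean or unit variance after the transform; that is expected, because their statistics were never seen during fitting.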

For each drug dj ∈ D, we collected the corresponding response summaries together with structural encodings such as SMILES and InChI strings, along with metadata including drug id, name, and annotated targets. Binary sensitivity labels for each cell line–drug pair were assigned as described in Section 2.2. These components form the basis of the integrative multi-omics and structural representation used in the proposed framework.

2.3.2 Sliding window

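For efficient Sequence Identification (SI), a sliding window is generated for each gene sequence. Let sk denote the sequence associated with gene gk, let Lk denote the length of sk, let Lw denote the window length, and let Sw denote the step size. Writing sk[a : b] for the subsequence of sk from position a to position b, the set of windows for gk is defined as shown in Equation (4):

\[ W_k = \left\{\, s_k[\,p : p + L_w - 1\,] \;:\; p = 1 + m S_w,\ m = 0, 1, \dots, \left\lfloor \frac{L_k - L_w}{S_w} \right\rfloor \right\} \tag{4} \]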

where each window spans Lw consecutive positions of the sequence. In this way, each gene sequence is partitioned into overlapping or non-overlapping windows, depending on the value of Sw, and the resulting windows are passed to the next stage for sequence identification.
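The windowing step can be sketched in a few lines. The function below keeps only full-length windows, which is an assumption consistent with each window spanning Lw consecutive positions; the function name and the toy sequence are illustrative.

```python
def sliding_windows(sequence, window_length, step):
    """Return all full-length windows of `sequence`, taken every `step` positions."""
    if window_length <= 0 or step <= 0:
        raise ValueError("window_length and step must be positive")
    last_start = len(sequence) - window_length
    return [sequence[p:p + window_length] for p in range(0, last_start + 1, step)]


seq = "ACGTACGTAC"
overlapping = sliding_windows(seq, window_length=4, step=2)
non_overlapping = sliding_windows(seq, window_length=4, step=4)
print(overlapping)      # ['ACGT', 'GTAC', 'ACGT', 'GTAC']
print(non_overlapping)  # ['ACGT', 'ACGT']
```

Choosing a step smaller than the window length gives overlapping windows, while a step equal to the window length gives a non-overlapping partition. A sequence shorter than the window yields an empty list.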

2.3.3 Sequence identification

In this step, Sequence Identification (SI) is applied to the set of sliding-window segments Wk derived for each gene gk. The purpose of SI is not to infer biological evolution, but to convert each window into a structured segment representation that can later be organized into the tree-based representation described in Section 2.3.4. For each window, the local ordering and repetition pattern of the encoded sequence symbols are examined to determine whether the window contains one of a small set of predefined structural-event patterns.

The event categories considered in this study are insertion, deletion, inversion, mirror, and duplication. An insertion indicates the appearance of an additional short subsequence within the current window relative to the local reference ordering; a deletion indicates the absence of an expected subsequence; an inversion indicates that a local subsequence appears in reversed order; a mirror event indicates a reflected local arrangement pattern; and a duplication indicates repetition of a subsequence within the same windowed region.

Following the above pattern verification, each analyzed window is represented as a detected segment and associated with its corresponding event label. Thus, for each gene gk, the SI stage outputs a collection of descriptor-ready segment units of the form shown in Equation (5):

\[ \{\, (s_{k,r},\ e_{k,r}) \,\}_{r=1}^{n_k} \tag{5} \]

where sk,r denotes the rth detected segment for gene gk, ek,r denotes its corresponding event label, and nk is the number of detected segments obtained from the windows of gk. These segment–event pairs form the direct input to the descriptor-based dissimilarity calculation and Sequence Tree construction stage described next.

2.3.4 Sequence tree construction

By utilizing the Modified Neighbor-Joining Algorithm (MNJA), the Sequence Tree (ST)—from which gene features are extracted—is constructed for the genes in G. For ST construction, the classical Neighbor Joining Algorithm (NJA) is used as the basis. In the present framework, NJA is adapted as a computational procedure for organizing the detected sequence segments and their pairwise relationships into a structured tree representation. It is not used here for phylogenetic or evolutionary inference.

For each gene, the identified segments are represented using descriptors that are used to compute a pair-wise dissimilarity matrix. This matrix serves as input to MNJA’s iterative merging of similar segments to construct, step by step, a tree structure (i.e., an ordered graph). The final constructed tree provides a topological-computational model that describes both the sequences’ local patterns as well as the inter-relationship among all event types detected.

The tree structure can be written as shown in Equation (6):

\[ T_g = (V_g, E_g) \tag{6} \]

where Vg indicates the set of nodes (segments) for gene g and Eg indicates the set of branches. The branch lengths between segment representations u and v are estimated using their descriptor-based dissimilarity as shown in Equation (7)

where d(·,·) denotes the descriptor-based dissimilarity function defined over segment representations.

Thus, the tree Tg is constructed based on the dissimilarity matrix and the estimated branch lengths, as shown in Figure 2.

Figure 2. Illustrative sequence tree constructed from detected genomic segments and their hierarchical grouping. One root node branches into three segment groups, each of which branches further into two or more leaf nodes. Internal nodes represent descriptor-based segment groupings, and leaf nodes correspond to detected segments used for downstream feature extraction.

When compared to commonly employed “bag-of-mutations” or other flat gene-level representations, MNJA enables the maintenance of structural relationships (i.e., topological structure) among the detected genomic segments and the events they represent, allowing downstream feature extractors to use topology-aware patterns rather than independent binary indicators. The features are extracted from Tg as explained in Section 2.3.6.
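For orientation, the classical Neighbor-Joining procedure on which MNJA is based can be sketched as below. This is the textbook algorithm applied to a toy dissimilarity matrix, not the modified variant used in this work, whose specific changes are not reproduced here. The segment labels, the `group_*` names for internal nodes, and the distances are hypothetical.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Classical NJ. `dist[a][b]` is a symmetric dissimilarity.
    Returns a list of (node, node, branch_length) edges."""
    d = {a: dict(dist[a]) for a in labels}  # working copy
    active = list(labels)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        totals = {a: sum(d[a][b] for b in active if b != a) for a in active}
        # Join the pair that minimises the Q criterion.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - totals[p[0]] - totals[p[1]],
        )
        new = f"group_{next_id}"
        next_id += 1
        len_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges += [(i, new, len_i), (j, new, len_j)]
        # Dissimilarities from the new internal node to the remaining nodes.
        d[new] = {}
        for k in active:
            if k not in (i, j):
                d[new][k] = d[k][new] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        active = [k for k in active if k not in (i, j)] + [new]
    a, b = active
    edges.append((a, b, d[a][b]))
    return edges


# Toy pairwise dissimilarities between five detected segments.
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
         ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
segments = ["a", "b", "c", "d", "e"]
dist = {s: {} for s in segments}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = float(value)

edges = neighbor_joining(segments, dist)
for edge in edges:
    print(edge)
```

On this toy matrix the first join groups segments `a` and `b` with branch lengths 2 and 3. The resulting edge list is one possible encoding of the node set and branch set of a sequence tree.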

2.3.5 Molecular structure conversion

Meanwhile, the SMILES and InChI strings representing the drug molecules are collected from the drug set D and converted into a structured representation Mj for downstream feature extraction. For each drug dj ∈ D, Mj captures the molecular structure in two complementary forms: a molecular graph describing atom-level connectivity and a standardized two-dimensional depiction used for image-based representation, as illustrated in Figure 3.

Figure 3. Example of drug molecule structure representation. The diagram shows the molecule piperlongumine, with a benzene ring carrying two methoxy groups and an extended chain leading to a six-membered lactam ring with a double bond and carbonyl groups.

This conversion provides a consistent structural input for both handcrafted descriptor extraction and deep feature learning in the subsequent stage. In this way, the framework treats each drug as a structured entity whose geometry and connectivity can be used during feature extraction.

2.3.6 Feature extraction

We first extract descriptive features from both the sequence trees Tg and the molecular structures Mj. The sequence-tree component provides structural information derived from the detected sequence motifs and events, while the molecular component encodes structural characteristics derived from the corresponding drug compounds. Together, these two components form the input used for downstream feature learning.

For DS-GoogLeNet input construction, each matched cell line–drug pair was represented as a fixed-size three-channel tensor. The sequence-tree representation derived from Tg was first encoded as a two-dimensional structured map. This map preserved the detected segment relationships and event patterns. The drug representation derived from Mj was expressed separately as a standardized two-dimensional depiction. To maintain structural consistency, the sequence-tree maps were standardized using zero-padding, whereas the drug depictions were resized to a common spatial resolution. The resulting representations were then aligned to a shared size of 224 × 224 and fused along the channel dimension to form a composite tensor. Continuous-valued channels were normalized within the training folds, while binary event indicators were retained in encoded form. This channel-wise fusion enabled DS-GoogLeNet to jointly analyze cellular and drug representations while preserving their complementary structural information.

Features are learned from the sequence trees using DS-GoogLeNet. In addition, handcrafted descriptors are used to capture statistical and texture-related information. From the molecular structures, features such as Gray-Level Co-occurrence Matrix (GLCM), texture descriptors, Local Tetra Pattern, and Local Binary Pattern (LBP) are extracted, together with the deep features produced by DS-GoogLeNet.

Standard GoogLeNet is adopted as the base feature extraction model because the fused representation of each cell line–drug pair is expressed as a structured three-channel tensor in which local spatial neighborhoods preserve segment-event arrangements from the sequence tree together with the standardized drug depiction. Under this tensorized representation, convolutional filtering becomes appropriate for learning joint local patterns across the cellular and drug channels. GoogLeNet was selected because its multi-scale convolutional blocks can capture patterns at different receptive fields within this fused representation, which is useful when informative structures occur at different spatial granularities.

In the proposed method, DS-GoogLeNet is used as the deep feature learner for the hybrid representation obtained from Tg and Mj. Here, decimal scaling is applied as a numerical range control step on the convolutional responses before non-linear activation, so that feature magnitudes from the fused channels remain on a comparable scale during feature extraction. Let the tensor formed by combining the sequence-tree and molecular-structure representations be denoted by X.
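Decimal scaling followed by ReLU can be illustrated on a handful of toy responses. The sketch assumes the standard definition of decimal scaling (divide by the smallest power of ten that brings every absolute value below 1) and applies it to a flat list; whether the paper scales per tensor, per channel, or per layer is not stated, so that choice is an assumption.

```python
def decimal_scale(values):
    """Divide by the smallest non-negative power of ten that brings every |value| below 1.
    Returns the scaled values and the exponent j that was used."""
    max_abs = max((abs(v) for v in values), default=0.0)
    j = 0
    while max_abs / (10 ** j) >= 1.0:
        j += 1
    return [v / (10 ** j) for v in values], j


def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


pre_activation = [120.0, -45.0, 3.0]  # toy convolutional responses
scaled, j = decimal_scale(pre_activation)
activated = relu(scaled)
print(j, scaled, activated)
```

Here the largest magnitude is 120, so the exponent is 3 and every response is divided by 1000 before the negative value is clipped to zero by ReLU.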

The tensor X is first convolved to produce the pre-activation response map (Equation (8))

after which a decimal-scaling operation is applied to Z before the ReLU transformation. The resulting activated feature map is written as (Equation (9))

where the scaling operator denotes decimal-scaling normalization of the convolutional responses and ReLU denotes the Rectified Linear Unit activation.

The resulting feature map is then down-sampled with weight value Wd and passed through a fully connected layer as (Equations (10, 11))

where the pooling operation reduces the spatial resolution and O denotes the high-level feature vector. During pretraining, the output is activated using the softmax function as (Equation (12))

where the softmax output represents the predicted class probabilities. For the integrated framework, the penultimate-layer representation is used as the extracted deep feature vector.
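The pooling, fully connected, and softmax steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' layer configuration: plain 2x2 max pooling stands in for the weighted down-sampling with Wd, the layer sizes are arbitrary, and all names (`max_pool2x2`, `classification_head`, `w_fc`, `w_out`) are hypothetical.

```python
import numpy as np

def max_pool2x2(a: np.ndarray) -> np.ndarray:
    """Non-overlapping 2x2 max pooling on a 2-D map (odd edges are cropped)."""
    h, w = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2
    return a[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(v: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def classification_head(a, w_fc, b_fc, w_out, b_out):
    """Return the penultimate feature vector O and the class probabilities."""
    pooled = max_pool2x2(a)
    o = np.maximum(w_fc @ pooled.ravel() + b_fc, 0.0)   # high-level features O
    return o, softmax(w_out @ o + b_out)

rng = np.random.default_rng(0)
a = np.arange(16, dtype=float).reshape(4, 4)            # toy activated map
o, probs = classification_head(
    a, rng.normal(size=(3, 4)), np.zeros(3), rng.normal(size=(2, 3)), np.zeros(2)
)
```

In this sketch `o` plays the role of the penultimate-layer representation that is kept as the deep feature vector, while `probs` is only needed during pretraining.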

In addition to these deep features, handcrafted descriptors including GLCM, texture, Local Tetra Pattern, and LBP features are computed from Tg and Mj and concatenated with the DS-GoogLeNet features to form the complete feature representation used in the subsequent stages.
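As an illustration of one of the handcrafted descriptors, a basic 8-neighbour LBP histogram can be computed as below. This is a minimal sketch assuming a 2-D grayscale array as input and the simplest (radius 1, non-rotation-invariant) LBP variant; the source does not specify which LBP variant or image rendering is used, and `lbp_histogram` is a hypothetical name.

```python
import numpy as np

def lbp_histogram(img: np.ndarray) -> np.ndarray:
    """Normalized 256-bin histogram of basic 3x3 LBP codes (borders skipped)."""
    center = img[1:-1, 1:-1]
    # Clockwise neighbour offsets (dy, dx), starting at the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    rows, cols = img.shape
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy : rows - 1 + dy, 1 + dx : cols - 1 + dx]
        # Set this bit where the neighbour is at least as bright as the centre
        codes += (neighbour >= center) * (1 << bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

flat = np.full((5, 5), 7)                                # constant patch
peak = np.array([[0, 0, 0], [0, 9, 0], [0, 0, 0]])       # bright centre pixel
hist_flat, hist_peak = lbp_histogram(flat), lbp_histogram(peak)
```

The resulting 256-dimensional histogram is the kind of fixed-length vector that can be concatenated with the deep features.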

2.3.7 Attributes extraction

Beyond the structural features extracted from the sequence trees and drug molecular representations, we retain a set of auxiliary annotation variables for each matched cell line–drug pair. On the cell-line side, these attributes are drawn from the matched annotation metadata and include lineage or primary disease category, molecular subtype, where available, and other matched line-level metadata that help contextualize the omics profile. On the drug side, we retain identifier and annotation fields including drug id, drug name, annotated targets, and the corresponding SMILES and InChI representations.

The DS-GoogLeNet-based feature extraction procedure is summarized in Algorithm 1.

After suitable preprocessing and numerical encoding, these auxiliary variables are organized into an attribute matrix A and concatenated with the extracted feature matrix F to form the full representation Xfull = [F ∥ A], as shown in Equation (13)

where [·∥·] denotes concatenation. This combined representation integrates learned structural features with contextual annotation variables and is then passed to the feature-selection stage.
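The concatenation step can be sketched with a toy example. The one-hot encoding of a categorical attribute is an assumption (the text only says "numerical encoding"), the column order [F, A] is also assumed, and the values and the `one_hot` helper are purely illustrative.

```python
import numpy as np

def one_hot(labels):
    """One-hot encode a list of categorical labels (categories sorted by name)."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)))
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded, categories

# Toy extracted features F for two cell line-drug pairs (illustrative values)
F = np.array([[0.2, 1.1, 0.0],
              [0.7, 0.3, 0.5]])
# Toy auxiliary attribute: lineage of the cell line in each pair
A, categories = one_hot(["lung", "breast"])

# Row-wise concatenation of learned features and encoded attributes
X_full = np.hstack([F, A])
```

Each row of `X_full` still corresponds to one cell line-drug pair; only the number of columns grows.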

2.3.8 Feature selection

The feature vector Xfull is formed by combining omics features, drug structural features, and auxiliary attributes, resulting in a heterogeneous representation. To retain the most informative subset before classification, a feature-selection stage is applied. In this work, we employ the Smoluchowski Kookaburra Optimization Algorithm (SKOA), an enhanced variant of the Kookaburra Optimization Algorithm (KOA) that incorporates a Smoluchowski-guided search mechanism. Rather than using the full feature set directly, SKOA searches for a smaller group of features that still captures the patterns most useful for cancer drug response prediction. The Smoluchowski-guided component is incorporated to improve the search process and reduce the tendency of KOA to become trapped in local optima.

The full representation Xfull initially contained 2,378 candidate features. During stratified 10-fold cross-validation, SKOA retained an average of 244.6 ± 5.9 features per fold, with a median of 244 features. On average, the selected subset therefore accounted for 10.29% of the original feature space (244.6 of 2,378). The consistency of the selected subsets across folds was assessed using the pairwise Jaccard index, which gave a mean stability of 0.69. The optimal subset obtained from SKOA is then provided to the EA-GAT classifier described in the next subsection.
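The two summary statistics reported above can be reproduced in form with a short sketch. The fold subsets below are toy index sets, not the study's selected features, and the function names are hypothetical; only the 244.6 and 2,378 figures come from the text.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two feature-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(subsets):
    """Average Jaccard index over all unordered pairs of per-fold subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def retained_percent(n_selected, n_total):
    """Share of the original feature space that is retained, in percent."""
    return 100.0 * n_selected / n_total

# Toy selections from three folds (hypothetical feature indices)
folds = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
stability = mean_pairwise_jaccard(folds)

# Figures reported in the text: 244.6 of 2,378 features on average
pct = retained_percent(244.6, 2378)
```

A stability of 1 would mean identical subsets in every fold, and 0 would mean no overlap at all, so the reported 0.69 indicates substantial but not complete agreement between folds.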

2.3.8.1 Initialization

The population matrix E and the initial positions eg,c of the Kookaburras are defined as (Equation (14))

where each row Eg corresponds to the gth Kookaburra and each column index c = 1,…,k corresponds to one decision variable.

The initial position of each element eg,c is given by Equation (15)

where a specifies the number of Kookaburras (population size), k denotes the number of decision variables (feature dimensions), α is a random number, and (lbc, ubc) are the lower and upper bound values for the cth dimension of the search space.

For notational convenience, we also denote the gth row Eg as a candidate vector xg, and write the population as the set of these NK candidate vectors, with NK = a. In vector form, the initialization in Equation (15) can be compactly expressed as Equation (16)

where ri collects the random values α for all k dimensions.
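The initialization can be sketched as below, assuming the usual form in which each coordinate is placed at the lower bound plus a random fraction of the bound width, with α drawn uniformly from [0, 1). The source does not state the distribution of α, so this is an assumption, and `init_population` is a hypothetical name.

```python
import numpy as np

def init_population(a, k, lb, ub, rng):
    """Return an (a, k) population with every entry inside [lb, ub].

    Assumption: alpha is drawn independently and uniformly from [0, 1).
    """
    lb = np.broadcast_to(np.asarray(lb, dtype=float), (k,))
    ub = np.broadcast_to(np.asarray(ub, dtype=float), (k,))
    alpha = rng.random((a, k))          # one random value per entry e[g, c]
    return lb + alpha * (ub - lb)

rng = np.random.default_rng(42)
E = init_population(a=6, k=10, lb=0.0, ub=1.0, rng=rng)
```

For feature selection, bounds of 0 and 1 are a natural choice because each coordinate can later be thresholded into a keep or drop decision for one feature.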

2.3.8.2 Fitness

The fitness function, which guides the position updates of the Kookaburras, is defined so as to maximize classification accuracy, as in Equation (17)

where the accuracy term denotes the classification accuracy obtained using the feature subset encoded by xi.
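An accuracy-based fitness of this kind can be sketched as follows. Two details are assumptions made only for illustration: a continuous position is decoded into a feature mask with a 0.5 threshold, and a simple nearest-centroid classifier stands in for the actual classifier. All function names and the toy data are hypothetical.

```python
import numpy as np

def decode(x, threshold=0.5):
    """Turn a continuous position into a boolean keep/drop mask (assumed rule)."""
    return np.asarray(x) > threshold

def nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Validation accuracy of a nearest-centroid classifier (stand-in model)."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dist = ((X_va[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dist.argmin(axis=1)] == y_va).mean())

def fitness(x, X_tr, y_tr, X_va, y_va):
    """Accuracy obtained with the feature subset encoded by position x."""
    mask = decode(x)
    if not mask.any():
        return 0.0                      # an empty subset cannot classify
    return nearest_centroid_accuracy(X_tr[:, mask], y_tr, X_va[:, mask], y_va)

# Toy data: column 0 separates the classes, column 1 does not
X_tr = np.array([[0.0, 5.0], [0.1, -5.0], [1.0, 5.0], [0.9, -5.0]])
y_tr = np.array([0, 0, 1, 1])
X_va = np.array([[0.05, 5.0], [0.95, -5.0]])
y_va = np.array([0, 1])

fit_informative = fitness([0.9, 0.1], X_tr, y_tr, X_va, y_va)
fit_uninformative = fitness([0.1, 0.9], X_tr, y_tr, X_va, y_va)
```

In the toy data, the position that keeps the informative column scores higher than the one that keeps the uninformative column, which is the signal the optimizer exploits.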

2.3.8.3 Position update

The position update depends on the selection of prey, which is random in standard KOA. In SKOA, the Smoluchowski approach is instead used to select a suitable prey p, as given by (Equation (18))

where the selection operator in Equation (18) is the Smoluchowski-based selection function.

Exploration phase: In this phase, the prey is attacked by Kookaburras, leading to a detailed search in the space. The new position of Kookaburra with constant C1 is expressed by (Equation (19))

where t denotes the current iteration.

Exploitation phase: In this phase, the Kookaburra carries the prey away and kills it. The new position of the Kookaburra is expressed by (Equation (20))

where C2 is a constant and t is the current iteration, which runs up to the maximum number of iterations Tmax. After the final iteration, the best position x*, which encodes the optimal feature subset, is obtained.
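The two-phase structure can be illustrated with a generic sketch. It does not reproduce Equations (18) to (20): the constants C1 and C2 and the Smoluchowski-based selection are not available in this section, so the sketch substitutes a uniform random choice among better candidates as the prey, a KOA-style move towards that prey for exploration, and a perturbation shrinking with 1/t for exploitation, each with greedy acceptance. All names are hypothetical.

```python
import numpy as np

def two_phase_step(pop, fit, fitness_fn, lb, ub, t, rng):
    """One iteration of a generic KOA-style update (not the paper's SKOA)."""
    pop, fit = pop.copy(), fit.copy()
    k = pop.shape[1]
    for i in range(len(pop)):
        # Exploration: move towards a randomly chosen better candidate (prey).
        # Assumption: uniform choice replaces the Smoluchowski-based selection.
        better = np.flatnonzero(fit > fit[i])
        if better.size:
            prey = pop[rng.choice(better)]
            intensity = rng.integers(1, 3)          # 1 or 2
            cand = np.clip(pop[i] + rng.random(k) * (prey - intensity * pop[i]), lb, ub)
            f = fitness_fn(cand)
            if f > fit[i]:                          # greedy acceptance
                pop[i], fit[i] = cand, f
        # Exploitation: local perturbation whose size shrinks as t grows.
        cand = np.clip(pop[i] + (1 - 2 * rng.random(k)) * (ub - lb) / t, lb, ub)
        f = fitness_fn(cand)
        if f > fit[i]:
            pop[i], fit[i] = cand, f
    return pop, fit

def toy_fitness(x):
    """Toy objective with its maximum at 0.3 in every dimension."""
    return -float(np.sum((x - 0.3) ** 2))

rng = np.random.default_rng(0)
pop = rng.random((8, 4))
fit = np.array([toy_fitness(x) for x in pop])
initial_fit = fit.copy()
for t in range(1, 31):                              # Tmax = 30 in this toy run
    pop, fit = two_phase_step(pop, fit, toy_fitness, 0.0, 1.0, t, rng)
x_best = pop[np.argmax(fit)]
```

Because a candidate is accepted only when it improves the fitness, no individual gets worse across iterations, and `x_best` plays the role of x* at the end of the run.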

The complete SKOA-based feature-selection procedure is summarized in Algorithm 2.

In contrast to simple filter-based or L1-regularized feature selection, SKOA optimizes feature subsets directly with respect to classifier performance and can capture higher-order interactions between heterogeneous features (omics, structure, attributes), which is critical in the high-dimensional, multimodal CDRP setting. The optimal feature set obtained from SKOA is then used for classification.

2.3.9 Classification

Classification is performed on the optimized feature set selected by SKOA, using the Enhanced Aranda Graph Attention Network (EA-GAT). This classifier uses self-attention to capture dependencies among the selected features. In the proposed framework, the Aranda activation function is used as the non-linear transformation applied after neighborhood aggregation within the Graph Attention Network (GAT), resulting in the EA-GAT model. Its inclusion is motivated empirically in this study: as shown later in the ablation analysis, replacing it with a conventional activation leads to a modest but consistent reduction in predictive performance. The general architecture of this classifier is shown in Figure 4. A learnable linear transformation is applied to the input feature vector before it is passed to the attention mechanism.
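The order of operations described here (linear transformation, attention over neighbors, aggregation, then non-linearity) matches a standard single-head graph attention layer, sketched below in NumPy. The Aranda activation is not defined in this section, so the sketch takes the activation as a pluggable argument and uses ELU as a conventional stand-in; the graph, weights, and function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x):
    """Conventional stand-in for the Aranda activation, which is not given here."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gat_layer(H, adj, W, a, activation=elu):
    """Single-head graph attention layer.

    H: (n, d_in) node features, adj: (n, n) adjacency including self-loops,
    W: (d_in, d_out) linear transformation, a: (2 * d_out,) attention vector.
    """
    Z = H @ W                                       # learnable linear transformation
    d_out = Z.shape[1]
    # Score e_ij = LeakyReLU(a^T [Z_i || Z_j]), split into source and target parts
    scores = leaky_relu((Z @ a[:d_out])[:, None] + (Z @ a[d_out:])[None, :])
    scores = np.where(adj > 0, scores, -np.inf)     # attend to neighbors only
    scores = scores - scores.max(axis=1, keepdims=True)
    att = np.exp(scores)
    att = att / att.sum(axis=1, keepdims=True)      # softmax over each neighborhood
    return activation(att @ Z), att                 # aggregate, then activate

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))                         # 3 nodes with 4 features each
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])                         # self-loops avoid empty rows
H_out, att = gat_layer(H, adj, rng.normal(size=(4, 2)), rng.normal(size=4))
```

Swapping in another activation only requires passing a different callable as `activation`, which is how an ablation against a conventional activation could be set up.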
