CAT-GxD: Centralized access to gene expression datasets

Publicly available gene expression compendia include transcriptomics (RNA-seq and microarray) and proteomics datasets, as well as gene regulatory and protein structure information. For bacteria, such expression data are available for diverse isolates and mutant derivatives, and for exposure to varied environmental conditions. While the datasets can be retrieved from the Gene Expression Omnibus (GEO) database [1] and the ProteomeXchange Consortium [2], the data are not readily queryable, and biologists, who may have limited expertise in bioinformatics, must expend considerable time and effort to use the information. Integrating these individual datasets requires multiple analysis and curation steps: managing non-standardized gene IDs across datasets, comparing results across experimental trials (i.e., calculating replicate ratios), calculating fold change for each gene, and incorporating the constantly increasing number of dataset submissions. A resource whose main function is to compile and interrogate differential expression at the transcription and translation levels across different experiments would support both the exploratory and analysis stages of a biologist's research.
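To make the integration steps concrete, the following is a minimal sketch of harmonizing gene IDs and computing a per-gene log2 fold change from replicate measurements. It is not the CAT-GxD implementation: the alias table, the gene IDs (`old_0001`, `STD_0001`, and so on), the expression values, the pseudocount, and the function names are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean

# Hypothetical alias table: dataset-specific gene IDs -> one standard ID.
ID_ALIASES = {
    "old_0001": "STD_0001",
    "geneA": "STD_0001",
    "old_0002": "STD_0002",
}


def standardize(gene_id):
    """Return the standard ID for a gene, or the input unchanged if unknown."""
    return ID_ALIASES.get(gene_id, gene_id)


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of mean treated to mean control expression across replicates.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def summarize(dataset):
    """Map {gene_id: (control_reps, treated_reps)} to {standard_id: log2FC}."""
    return {standardize(g): log2_fold_change(c, t) for g, (c, t) in dataset.items()}


# Two made-up datasets that name the same gene differently.
dataset_a = {
    "old_0001": ([2.0, 4.0], [14.0, 16.0]),
    "old_0002": ([1.0, 1.0], [1.0, 1.0]),
}
dataset_b = {"geneA": ([7.0, 7.0], [1.0, 1.0])}

# Merge per-dataset fold changes under the standardized gene ID.
merged = {}
for name, dataset in (("A", dataset_a), ("B", dataset_b)):
    for gene, lfc in summarize(dataset).items():
        merged.setdefault(gene, {})[name] = lfc

for gene in sorted(merged):
    print(gene, merged[gene])
```

Keying the merged results by standardized ID lets one gene's response be compared side by side across experiments, even when the source datasets use different identifiers.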

Initially focusing on the Centers for Disease Control and Prevention (CDC) Urgent Threat pathogen Clostridioides difficile, we built CAT-GxD (Centralized Access to Gene Expression Datasets), an integrated and queryable engine for accessing and comparing the expression of specific genes or gene subsets in accordance with the Findability, Accessibility, Interoperability, and Reusability (FAIR) guiding principles [3]. By providing seamless access to disparate public datasets and embedded tools for comparative analysis and discovery, CAT-GxD enables biologists to rapidly and effectively query public expression data to understand C. difficile gene expression and how it changes under different conditions.

The Gram-positive, spore-forming anaerobic pathogen Clostridioides difficile (formerly Clostridium difficile) is considered an ‘Urgent Threat’ to US healthcare by the CDC [4]. C. difficile is the leading cause of antibiotic-associated diarrhea, which may be self-limiting or may progress to severe and fulminant (pseudomembranous) colitis, ileus, or toxic megacolon [5]. C. difficile infections (CDI) account for approximately 500,000 US cases annually, of which ∼30,000 result in death, and impose up to $6 billion in overall healthcare costs [6,7]. There are currently no vaccines against CDI, and antibiotic therapy has several limitations, including the emergence of resistant strains and the perpetuation of gut dysbiosis. The mechanisms by which C. difficile causes disease are poorly understood and are an active area of investigation. The ∼4000 genes of C. difficile are controlled by complex regulatory networks that are responsive to metabolic and environmental cues; the functions of most of these genes in bacterial physiology and virulence remain undefined. Understanding their roles in pathogenesis will facilitate the development of novel therapeutic agents.

While C. difficile 630 has been widely used for mechanistic studies, the genetic and phenotypic diversity of the species, particularly among current clinical isolates, is well recognized. Indeed, publicly available transcriptomics and proteomics datasets have been generated with at least five different C. difficile strains and their cognate mutant derivatives. Here we describe how we built CAT-GxD to access this larger data collection, and demonstrate the utility of this resource for users who wish to easily extract gene expression information from otherwise difficult-to-compare multi-omics datasets.
