Scaling Sensor Metadata Extraction for Exposure Health Using LLMs

Abstract

Background The rapid evolution and diversity of sensor technologies, coupled with inconsistencies in how sensor metadata is reported across formats and sources, present significant challenges for generating exposomes and conducting exposure health research.

Objective Despite the development of standardized metadata schemas, extracting sensor metadata from unstructured sources remains largely manual and does not scale. To address this bottleneck, we developed and evaluated a large language model (LLM)-based pipeline that automates sensor metadata extraction and harmonization from publicly available exposure health literature.

Methods Using GPT-4 in a zero-shot setting, we constructed a pipeline that parses full-text PDFs to extract metadata and harmonizes the output into structured formats.

Results Our automated pipeline completed extractions much faster than manual review and demonstrated strong performance, with average accuracy and precision of 94.74%, recall of 100%, and F1-score of 97.28%.
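As a rough illustration of the kind of pipeline described above, the following Python sketch builds a zero-shot prompt, harmonizes the model's JSON reply into a fixed schema, and scores the result against a hand-annotated record. It is not the authors' implementation. The schema fields, the alias map, the scoring convention, and the stub `fake_llm` are all hypothetical, and the sketch assumes the PDF text has already been extracted. The model is passed in as a plain callable, so no particular vendor API is implied.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical target schema; the field names are illustrative only.
SCHEMA_FIELDS = ["sensor_name", "manufacturer", "measured_parameter", "unit"]

# Hypothetical alias map used to harmonize keys the model might emit.
KEY_ALIASES = {"name": "sensor_name", "vendor": "manufacturer", "units": "unit"}

Record = Dict[str, Optional[str]]


def build_prompt(paper_text: str) -> str:
    """Zero-shot prompt: instructions only, no worked examples."""
    fields = ", ".join(SCHEMA_FIELDS)
    return (
        "Extract sensor metadata from the paper below. "
        f"Return one JSON object with the keys: {fields}. "
        "Use null for anything the paper does not report.\n\n" + paper_text
    )


def harmonize(raw: Dict[str, object]) -> Record:
    """Map model output onto the schema, dropping unknown keys."""
    record: Record = {field: None for field in SCHEMA_FIELDS}
    for key, value in raw.items():
        lowered = key.strip().lower()
        canonical = KEY_ALIASES.get(lowered, lowered)
        if canonical in record and value not in (None, ""):
            record[canonical] = str(value).strip()
    return record


def extract(paper_text: str, llm: Callable[[str], str]) -> Record:
    """Run one paper through the prompt, the model, and harmonization."""
    return harmonize(json.loads(llm(build_prompt(paper_text))))


def score(predicted: Record, gold: Record) -> Dict[str, float]:
    """Field-level precision, recall, and F1 under one simple convention."""
    tp = fp = fn = 0
    for field in SCHEMA_FIELDS:
        p, g = predicted.get(field), gold.get(field)
        if p is not None and p == g:
            tp += 1  # extracted value matches the annotation
        elif p is not None:
            fp += 1  # extracted value is wrong or not in the annotation
        elif g is not None:
            fn += 1  # annotated value was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned, made-up reply."""
    return '{"Name": "ExampleSense 100", "Vendor": "Example Corp", "units": "ug/m3"}'


if __name__ == "__main__":
    gold = {
        "sensor_name": "ExampleSense 100",
        "manufacturer": "Example Corp",
        "measured_parameter": "PM2.5",
        "unit": "ug/m3",
    }
    predicted = extract("(full text of a paper)", fake_llm)
    print(predicted)
    print(score(predicted, gold))
```

Keeping the model behind a callable makes the harmonization and scoring steps testable without network access. When metrics are averaged over many papers, the mean F1 need not equal the F1 computed from the mean precision and mean recall.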

Conclusions This study demonstrates the feasibility and scalability of leveraging LLMs to automate sensor metadata extraction for exposure health, reducing manual burden while enhancing metadata completeness and consistency. Our findings support the integration of LLM-driven pipelines into exposure health informatics platforms.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This research was supported by the NIEHS (1R24ES036134 [SMARTER]) and NCATS (UL1TR002538, UM1TR004409 [CTSI]). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study used openly available published papers.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present work are contained in the manuscript.

Abbreviations

GPT: Generative Pre-trained Transformer
LLM: Large Language Model
SMARTER: Sensors and Metadata for Analytics and Research in Exposure Health

medRxiv - Occupational and Environmental Health
