RESEARCH ARTICLE

HDI Corpus: A Dataset for Named Entity Recognition for In-Context Herb-Drug Interactions

The Open Bioinformatics Journal, 26 Jan 2026. DOI: 10.2174/0118750362377947250903082648

Abstract

Introduction

This article presents a new Named Entity Recognition dataset, built from PubMed articles, that targets herb-drug interactions. It is designed for recognizing the entities involved in these interactions together with their contextual information.

Background

Machine learning and deep learning provide powerful tools for task automation but require large quantities of data to perform well. In Natural Language Processing, training deep learning models requires annotating large text corpora. While some corpora exist for the medical literature, each specific task requires an adapted corpus.

Methods

The dataset was evaluated using a classical Named Entity Recognition pipeline, as well as approaches based on generative AI.

Results

The dataset provides annotated sentences from around a hundred articles and covers 15 entity types, including herbs, drugs, and pathologies, as well as contextual information such as cohort composition, patient information, and pharmacological clues.

Discussion

The study demonstrates that, for drug recognition, this dataset performs comparably to the DDI (Drug-Drug Interaction) corpus, a standard dataset in drug Named Entity Recognition, and that it performs well on most of the entities.

Conclusion

We believe this corpus could help diversify pharmacological Named Entity Recognition.

Keywords: Natural Language Processing, Unstructured herb-drug interaction, Named Entity Recognition, Pharmacology, Natural health products.