LaBDA
Corpora

Grupo Labda

Corpora for Drug-Drug Interaction Extraction

The DDI Corpus

DDI Corpus (dataset 2013): a corpus annotated with drug-drug interactions. Texts were collected from the DrugBank database and MedLine. This version of the corpus was used in the SemEval-2013 Task 9 Drug-Drug Interaction extraction task (http://www.cs.york.ac.uk/semeval-2013/task9/). Download DDI (annotated documents XML format)

DDI Corpus (dataset 2011): this version of the corpus was used as training and test dataset in the DDIExtraction 2011 shared task (http://labda.sintonia.inf.uc3m.es/DDIExtraction2011/). Download DDI (annotated documents XML format)

The DDI corpus is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Safe Creative #1006026486995

Managing drug-drug interactions (DDIs) is a critical issue because of the overwhelming amount of information available on them. Natural Language Processing (NLP) techniques can provide an interesting way to reduce the time healthcare professionals spend reviewing the biomedical literature. However, the shortage of annotated corpora for DDI extraction is the main bottleneck in the development of NLP systems for this area of pharmacovigilance.

The DDI corpus is made up of 792 texts selected from the DrugBank database and a further 233 MedLine abstracts on the subject of DDIs. The corpus was annotated with a total of 18,502 pharmacological substances and 5,028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions. To date, corpora annotated with DDIs have focused on PK DDIs and have not covered PD DDIs.

Annotation guidelines were developed by domain experts to ensure a high-quality, reliable and accurate annotation of the corpus. Pharmacological substances were classified into four entity types: drug (for generic drugs), brand (for trade drugs), group (for drug classes) and drug_n (for active substances not approved for human use). DDIs were also classified into four types: mechanism (for DDIs describing the way the interaction occurs), effect (for DDIs describing the consequence of the interaction), advice (for DDIs described by a recommendation or advice) and int (for DDIs without any additional information). Inter-Annotator Agreement (IAA) was measured to assess the consistency and quality of the corpus. The agreement was almost perfect (Kappa up to 0.96 and generally over 0.80), except for the DDIs in the MedLine abstracts (0.55–0.72).
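To make the agreement figures more concrete, the following is a minimal sketch of how a kappa score can be computed for two annotators. It assumes Cohen's kappa (the text above only says "Kappa"), and the label lists are hypothetical examples, not data from the corpus. The "none" label for non-interacting pairs is also an assumption made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six candidate drug pairs (not taken from the corpus).
annotator_1 = ["effect", "mechanism", "advice", "int", "effect", "none"]
annotator_2 = ["effect", "mechanism", "advice", "effect", "effect", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple proportion of matching labels.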

The DDI corpus was developed for the SemEval-2013 DDIExtraction 2013 challenge, whose main goal was to provide a common framework for the evaluation of information extraction techniques applied to the recognition and classification of pharmacological substances (DrugNER subtask) and the detection and classification of drug-drug interactions (DDIExtraction subtask) from biomedical texts. The DDI corpus is a valuable gold standard for research groups interested in the recognition of pharmacologically active substances, including drugs, groups of drugs, toxins, etc., and for those working specifically on DDI relation extraction.

The DDI corpus is divided into two datasets: training and test. The training dataset is the same for both subtasks and contains gold-standard annotations of pharmacological substances and their interactions. It consists of 714 texts (572 from DrugBank and 142 MedLine abstracts) annotated with a total of 14,855 pharmacological substances (13,029 from DrugBank and 1,826 from MedLine) and 4,037 DDIs (3,805 from DrugBank and 232 from MedLine). The test dataset for the DrugNER subtask consists of 52 DrugBank texts (annotated with 303 pharmacological substances) and 58 MedLine abstracts (with 382 pharmacological substances). The test dataset for the DDIExtraction subtask consists of 158 DrugBank texts (annotated with 889 DDIs) and 33 MedLine abstracts (with 95 DDIs).
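Counts like the ones above can be checked by tallying annotations in the XML files. The sketch below is only illustrative: it assumes a layout in which each document contains sentence elements holding entity elements (with a type attribute) and pair elements (with ddi and type attributes). These element and attribute names, and the sample document itself, are assumptions that should be verified against the downloaded files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document in the ASSUMED layout; not an excerpt from the corpus.
SAMPLE = """<document id="example.d0">
  <sentence id="example.d0.s0" text="DrugA may increase the effect of DrugB.">
    <entity id="example.d0.s0.e0" charOffset="0-4" type="drug" text="DrugA"/>
    <entity id="example.d0.s0.e1" charOffset="33-37" type="drug" text="DrugB"/>
    <pair id="example.d0.s0.p0" e1="example.d0.s0.e0" e2="example.d0.s0.e1"
          ddi="true" type="effect"/>
  </sentence>
  <sentence id="example.d0.s1" text="No interaction between BrandX and anticoagulants was seen.">
    <entity id="example.d0.s1.e0" charOffset="23-28" type="brand" text="BrandX"/>
    <entity id="example.d0.s1.e1" charOffset="34-47" type="group" text="anticoagulants"/>
    <pair id="example.d0.s1.p0" e1="example.d0.s1.e0" e2="example.d0.s1.e1"
          ddi="false"/>
  </sentence>
</document>"""


def count_annotations(xml_text):
    """Return (entity type counts, DDI type counts) for one document."""
    root = ET.fromstring(xml_text)
    # Every annotated pharmacological substance, grouped by entity type.
    entity_types = Counter(e.get("type") for e in root.iter("entity"))
    # Only pairs marked as interacting count as DDIs.
    ddi_types = Counter(
        p.get("type", "unspecified")
        for p in root.iter("pair")
        if p.get("ddi") == "true"
    )
    return entity_types, ddi_types


entity_types, ddi_types = count_annotations(SAMPLE)
print(dict(entity_types))
print(dict(ddi_types))
```

To tally a whole dataset, the same function could be applied to each file's contents and the resulting counters summed.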

We hope that the release of this dataset will encourage further research on the DDI problem.

In any work that uses the DDI Corpus, please acknowledge the authors, as follows:

María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, Thierry Declerck, The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions, Journal of Biomedical Informatics, Volume 46, Issue 5, October 2013, Pages 914-920, http://dx.doi.org/10.1016/j.jbi.2013.07.011

Isabel Segura-Bedmar, Paloma Martínez, María Herrero-Zazo, Lessons learnt from the DDIExtraction-2013 shared task, Journal of Biomedical Informatics, Volume 51, 2014, Pages 152-164.

A description of the corpus can be downloaded here.

Contact

Isabel Segura-Bedmar
isegura@inf.uc3m.es