Biomedical Text Corpora

Listed below are several biomedical text corpora.

Name Description


A large biomedical citation database.


An annotated corpus of biomedical text consisting of multiple layers of annotation, both syntactic and semantic.


A collection of biomedical annotations on MEDLINE abstracts: the AbGene corpus of sentences annotated with gene and protein named entities, the MedPost corpus of part-of-speech-tagged sentences, and the GENETAG corpus for named entity identification, used for BioCreAtIvE I.

BioCreative corpus

Dataset produced by the BioCreative assessment: text passages relevant for Gene Ontology (GO) annotations of human proteins.

BioMed Central's corpus

Open-access corpus of full-text articles provided by BioMed Central.

OHSUMED text collection

Document collection used for the TREC-9 contest.