Biomedical NLP Tools

Listed below are several biomedical NLP tools.

Name Description

eHOST: The Extensible Human Oracle Suite of Tools

A prototype annotation system that provides an open-source, stand-alone client for manual annotation of clinical texts. The eHOST annotation tool has been used by several institutions and projects for a variety of tasks, including both the 2010 and 2011 i2b2/VA Challenges and annotation tasks for the Consortium for Healthcare Informatics Research (CHIR) projects.

Platform: eHOST is written in Java.

Datasets that can be used: MTSamples dataset


Automated Retrieval Console (ARC)

If you are familiar with NLP and use information retrieval techniques in your research, you can add ARC to your existing toolkit to help get relevant data to your clinicians faster.

The Automated Retrieval Console (ARC) is open-source software designed to improve the processes of information retrieval (e.g., natural language processing, machine learning, information extraction). Behind the scenes, ARC processes text with open-source NLP pipelines, converting unstructured text to structured data such as SNOMED or UMLS codes. The structured data is then automatically converted to features and fed to open-source supervised machine learning algorithms. For developers and IR researchers, ARC provides a suite of tools to mix and match NLP with machine learning, along with interfaces to quickly calculate performance. ARC imports UIMA-based pipelines to convert free text into different feature types for classification and can be downloaded either 'bundled' with cTAKES or as a standalone version.
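The flow described above (free text to concept codes, codes to features, then a performance calculation) can be sketched roughly as follows. This is a minimal illustration and not ARC's code: the lexicon, the placeholder codes, and every function name are assumptions made for the example, and the supervised learner itself is left out.

```python
# Hypothetical lexicon mapping surface terms to placeholder concept codes.
# A real pipeline would emit SNOMED or UMLS codes instead.
LEXICON = {
    "fever": "CODE_FEVER",
    "cough": "CODE_COUGH",
    "fracture": "CODE_FRACTURE",
}


def extract_codes(text):
    """Stand-in for an NLP pipeline: map free text to a set of concept codes."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return {LEXICON[token] for token in tokens if token in LEXICON}


def to_features(codes, vocabulary):
    """Convert a set of concept codes to a binary feature vector."""
    return [1 if code in codes else 0 for code in vocabulary]


def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The feature vectors returned by `to_features` are what a supervised classifier would be trained on, and `precision_recall_f1` compares its predictions with the gold labels.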

Platform: Java.

Datasets that can be used: Clinical text; see Access Clinical Text Data.

BADREX - Biomedical Abbreviation Expander with Dynamic Regular Expressions

BADREX uses dynamically generated regular expressions to annotate term definition–term abbreviation pairs, and corefers unpaired acronyms and abbreviations back to their initial (or most recent) definition in the text. BADREX achieves precision and recall of 98% and 97% on the Medstract corpus, and 90% and 85% on the BioText corpus. Against these corpora, BADREX yields improved performance over previous approaches, requires no training data and allows runtime customisation of its input parameters.

In addition, there is the option of annotating and classifying common medical abbreviations extracted from Wikipedia.
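The idea of generating a regular expression at runtime from a candidate abbreviation can be sketched as below. This is a simplified illustration, not BADREX's actual implementation: it assumes the "long form (ABBR)" layout and that each letter of the abbreviation starts one word of the definition. The function names are hypothetical.

```python
import re

# Matches a parenthesised candidate abbreviation such as "(MRI)".
CANDIDATE = re.compile(r"\(([A-Za-z]{2,10})\)")


def build_definition_regex(abbreviation):
    """Build a regex with one word per abbreviation letter,
    followed by the parenthesised abbreviation itself."""
    words = [re.escape(letter) + r"\w*" for letter in abbreviation]
    pattern = (
        r"\b" + r"[\s-]+".join(words)
        + r"\s*\(" + re.escape(abbreviation) + r"\)"
    )
    return re.compile(pattern, re.IGNORECASE)


def find_pairs(text):
    """Return (definition, abbreviation) pairs found in text."""
    pairs = []
    for candidate in CANDIDATE.finditer(text):
        abbreviation = candidate.group(1)
        found = build_definition_regex(abbreviation).search(text)
        if found:
            matched = found.group(0)
            # Drop the trailing "(ABBR)" to keep only the long form.
            long_form = matched[: matched.rfind("(")].strip()
            pairs.append((long_form, abbreviation))
    return pairs
```

For example, `find_pairs("Patients underwent magnetic resonance imaging (MRI) yesterday.")` pairs "magnetic resonance imaging" with "MRI", while a parenthesised unit such as "(mg)" with no matching long form yields no pair.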

Platform: Java.

Datasets that can be used: Medstract or BioText corpus

BioEnEx (Bio-entity Extractor)

BioEnEx is a tool designed to annotate multiple biomedical entity types (e.g., genes/proteins, diseases, species, chemicals) with high performance. It uses a first-order CRF classifier. Separate feature sets are used for diseases and genes/proteins. For other types of entities (e.g., species), a generic feature set is used. More information is available on the tool's website.

Platform: Java.

Datasets that can be used: BioCreative II GM Corpus

BoB - a Best-of-Breed automated clinical text de-identification system

BoB is an automated clinical text de-identification system developed within the Consortium for Healthcare Informatics Research (CHIR) at the Department of Veterans Affairs. For its development, we studied several existing de-identification systems and implemented the best methods from these systems, along with other later-developed features, to de-identify Veterans Health Administration clinical documents. BoB consists of a UIMA-based NLP framework with different NLP procedures, pattern matching techniques, and machine learning-based predictions. BoB's design is based on two major modules developed to achieve high sensitivity and precision.

Platform: Java.

Datasets that can be used: Clinical text data

brat rapid annotation tool (brat)

The brat rapid annotation tool is a free, open-source, web-based tool for text annotation, visualisation, and editing.

Platform: Python, Web service.

Datasets that can be used: Biomedical text data; see the example data and manual.


ConText

ConText is based on a negation algorithm called NegEx. ConText's input is a sentence with indexed clinical conditions; ConText's output for each indexed condition is the value of each contextual feature, or modifier. The initial version of ConText determines values for three modifiers:

  • Negation: affirmed or negated.
  • Temporality: recent, historical, or hypothetical.
  • Experiencer: patient or other.

A newer version (pyConText) is more extensible and supports user-defined modifiers. One project involving radiology reports added the following modifiers:

  • Uncertainty: certain or uncertain.
  • Quality of radiologic exam: limited or not limited.

The Google Code site contains Java and Python versions of ConText and NegEx, links to papers describing and evaluating the algorithms, a description of the algorithm (including a list of the trigger terms used for each type of modifier), and a dataset of 120 reports of six types with manually assigned values for the three modifiers in the original version of ConText.

Platform: Python, Java.

Datasets that can be used: A set of 2,376 sentences from 120 clinical reports of six types (emergency department, discharge summary, surgical pathology, radiology, operative notes, and echocardiogram) is available for download.


An excellent general-purpose clinical NLP pipeline, based on UIMA.

Platform: Java, UIMA-based framework.

Datasets that can be used: Clinical text; see Access Clinical Text Data.

The MITRE Identification Scrubber Toolkit (MIST)

The MITRE Identification Scrubber Toolkit (MIST) is a suite of tools for identifying and redacting personally identifiable information (PII) in free-text medical records. MIST helps you replace this PII either with obscuring fillers, such as [NAME], or with artificial, synthesized, but realistic English fillers.

MIST decomposes the deidentification task into two subtasks:

  • an annotation subtask, where the tools of trainable, corpus-based natural language processing are brought to bear to identify the PII phrases, and
  • a replacement subtask, where information in the PII phrases is used to generate suitable replacements, given a chosen replacement strategy.
The first subtask is addressed by the MITRE Annotation Toolkit (MAT), which is a highly customizable suite of tools for natural language processing upon which MIST is built. The customizations for MIST itself address the second subtask. The MIST documentation uses the terms annotation and tagging interchangeably for the task of identifying, either by hand or automatically, the PII phrases in your documents. The labels for your PII types (e.g., NAME, PHYSICIAN, AGE, DATE) will be the tags that you'll be applying to your documents.
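The replacement subtask with obscuring fillers can be sketched as follows. This is an illustrative sketch only, not MIST's code or API: the function name and the `(start, end, tag)` annotation format are assumptions, and the annotations are taken as already produced by the annotation subtask.

```python
def redact(text, annotations):
    """Replace annotated PII spans with obscuring fillers such as [NAME].

    annotations is a list of (start, end, tag) tuples, where start and end
    are character offsets into text and the spans do not overlap.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, tag in sorted(annotations, reverse=True):
        text = text[:start] + "[" + tag + "]" + text[end:]
    return text


note = "John Smith, 54, was seen on 3 May."
spans = [(0, 10, "NAME"), (12, 14, "AGE"), (28, 33, "DATE")]
print(redact(note, spans))
```

Generating realistic synthetic fillers instead would replace the bracketed tag with a value drawn according to the chosen replacement strategy.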

Platform: Java, Python.

Datasets that can be used: Clinical text; see Access Clinical Text Data.