🏥 Medical Natural Language Processing with spaCy 🏥
MedaCy is a text processing and learning framework built over spaCy to support the lightning fast prototyping, building, and application of highly predictive named entity recognition and relationship extraction systems in the medical domain.
- Highly predictive out-of-the-box trained models for clinical named entity recognition and relationship extraction.
- Customizable feature extraction pipelines for custom model building.
- Integrated converters for common text annotation formats (Prodigy, BRAT, etc).
- Pre-compiled medical terminology and abbreviation lexicons.
Using medaCy is simple: all one needs is to select a pipeline and provide it with training data to learn from.
Training a Named Entity Recognition model for Clinical Text using medaCy:
from medacy.pipelines import ClinicalPipeline
from medacy.tools import DataLoader
from medacy.pipeline_component import MetaMap
import joblib
from medacy.learn import Learner
#Some more powerful pipelines require an outside knowledge source such as MetaMap.
metamap = MetaMap(metamap_path="/home/share/programs/metamap/2016/public_mm/bin/metamap")
#Automatically organizes your training files.
train_loader = DataLoader("/directory/containing/your/training/data/")
#Pre-metamap our training data to speed up building models.
train_loader.metamap(metamap)
#Create pipeline and specify entities to learn.
pipeline = ClinicalPipeline(metamap, entities=['Strength'])
#create a Learner using our pipeline and data
learner = Learner(pipeline, train_loader)
#Build a model (defaults to Conditional Random Field)
model = learner.train()
joblib.dump(model,'/location/to/save/model')
Prediction utilizing medaCy:
from medacy.pipelines import ClinicalPipeline
from medacy.tools import DataLoader
from medacy.pipeline_component import MetaMap
import joblib
from medacy.predict import Predictor
model = joblib.load('/location/containing/saved/model')
#Some more powerful pipelines require an outside knowledge source such as MetaMap.
metamap = MetaMap(metamap_path="/home/share/programs/metamap/2016/public_mm/bin/metamap")
data_loader = DataLoader("/directory/containing/your/text/to/label")
#Pre-metamap our data we wish to label to speed up prediction. Not necessary.
data_loader.metamap(metamap)
pipeline = ClinicalPipeline(metamap, entities=['Strength'])
#create a Learner using our pipeline and data
predictor = Predictor(pipeline, data_loader, model=model)
predictor.predict()
#prediction appear in a /predictions sub-directory of your data.
An example combined pipeline script:
from medacy.learn import Learner
from medacy.predict import Predictor
from medacy.pipelines import ClinicalPipeline
from medacy.tools import DataLoader
from medacy.pipeline_components import MetaMap
import logging, sys, joblib
#See what medaCy is doing at any part of the learning or prediction process
logging.basicConfig(stream=sys.stdout,level=logging.INFO) #set level=logging.DEBUG for more information
train_loader = DataLoader("/training/directory")
test_loader = DataLoader("/evaluation/directory")
metamap = MetaMap(metamap_path="/home/share/programs/metamap/2016/public_mm/bin/metamap")
train_loader.metamap(metamap)
test_loader.metamap(metamap)
pipeline = ClinicalPipeline(metamap, entities=['Drug', 'Form', 'Route', 'ADE', 'Reason', 'Frequency', 'Duration', 'Dosage', 'Strength'])
learner = Learner(pipeline, train_loader)
model = learner.train()
joblib.dump(model,'medacy_model')
learner.cross_validate() #perform 10 fold cross validation on predicted model, this takes time.
predictor = Predictor(pipeline, test_loader, model=model)
predictor.predict()
#prediction appear in a /predictions sub-directory of your data.
Note, the ClinicalPipeline requires spaCy's small model - install it with pip:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
To install this repository from source do the following:
- Enter into a python3 virtual envirorment, once inside make sure to upgrade pip to the latest version.
- Run the following instruction - this should take a bit and may throw some non-fatal warnings.
pip install git+https://github.com/NanoNLP/medaCy.git
- Install spaCy's small model.
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
MedaCy leverages the text-processing power of spaCy with state-of-the-art research tools and techniques in medical named entity recognition. MedaCy consists of a set of lightning-fast pipelines that are specialized for learning specific types of medical entities. A pipeline consists of a stackable and interchangeable set of PipelineComponents - these are bite-sized code blocks that each overlay a feature onto the text being processed.
You can write your own PipelineComponents to utilize in custom pipelines by interfacing the BasePipeline and BaseComponent classes. Alternatively use the components already included with medaCy. Some more powerful components require outside software - an example is the MetaMapComponent which interfaces with MetaMap to overlay rich medical concept information onto text. Components are chained or stacked in pipelines and can themselves depend on the outputs of previous components to function.
To contribute do the following:
- Enter into a python3 virtual envirorment, once inside make sure to upgrade pip to the latest version.
- Fork and clone this repository, enter into the cloned repo and run:
pip install -e .
This will install medaCy in editable mode. Any changes you make to medaCy sources code will be reflected immediately when used.
- Insure you are developing in the development branch or your own branch of the development branch.
This package is licensed under the GNU General Public License
Andriy Mulyar, Bobby Best, Steele Farnsworth, Yadunandan Pillai, Corey Sutphin, Bridget McInnes