In this chapter, you will learn how to train, store, and use custom statistical pipeline components. First, we will discuss when exactly we should perform custom model training. Then, you will learn a fundamental step of model training – how to collect and label your own data.
In this chapter, you will also learn how to make the best use of Prodigy, the annotation tool. Next, you will learn how to update an existing statistical pipeline component with your own data. We will update the spaCy pipeline's named entity recognizer (NER) component with our own labeled data.
Finally, you will learn how to create a statistical pipeline component from scratch with your own data and labels. For this purpose, we will again train an NER model. This chapter takes you through a complete machine learning practice, including collecting data, annotating data, and training a model for information extraction.
By the end of this chapter, you'll be ready to train spaCy models on your own data. You'll have the full skillset of collecting data, preprocessing data into the format that spaCy can recognize, and finally, training spaCy models with this data. In this chapter, we're going to cover the following main topics:
The chapter code can be found at the book's GitHub repository: https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter07.
In the previous chapters, we saw how to make the best of spaCy's pre-trained statistical models (including the POS tagger, NER, and dependency parser) in our applications. In this chapter, we will see how to customize the statistical models for our custom domain and data.
spaCy models are very successful for general NLP purposes, such as understanding a sentence's syntax, splitting a paragraph into sentences, and extracting some entities. However, sometimes, we work on very specific domains that spaCy models didn't see during training.
For example, Twitter text contains many irregular tokens, such as hashtags, emoticons, and mentions. Also, tweets are usually just phrases, not full sentences. Here, it's entirely reasonable that spaCy's POS tagger performs in a substandard manner, as the POS tagger is trained on full, grammatically correct English sentences.
Another example is the medical domain. The medical domain contains many entities, such as drug, disease, and chemical compound names. These entities are not expected to be recognized by spaCy's NER model because it has no disease or drug entity labels. NER does not know anything about the medical domain at all.
Training your custom models requires time and effort. Before even starting the training process, you should decide whether the training is really necessary. To determine whether you really need custom training, you will need to ask yourself the following questions:
Let's discuss these questions in detail in the following sections.
If the model performs well enough (above 0.75 accuracy), then you can customize the model output by means of another spaCy component. For example, let's say we work on the navigation domain and we have utterances such as the following:
navigate to my home
navigate to Oxford Street
Let's see what entities spaCy's NER model outputs for these sentences:
import spacy
nlp = spacy.load("en_core_web_md")
doc1 = nlp("navigate to my home")
doc1.ents
()
doc2 = nlp("navigate to Oxford Street")
doc2.ents
(Oxford Street,)
doc2.ents[0].label_
'FAC'
spacy.explain("FAC")
'Buildings, airports, highways, bridges, etc.'
Here, home isn't recognized as an entity at all, but we want it to be recognized as a location entity. Also, spaCy's NER model labels Oxford Street as FAC, which means a building/highway/airport/bridge type entity, which is not what we want.
We want this entity to be recognized as GPE, a location. Here, we can train NER further to recognize street names as GPE, and also to recognize some location words, such as work, home, and my mama's house, as GPE.
Another example is the newspaper domain. In this domain, person, place, date, time, and organization entities are extracted, but you need one more entity type – vehicle (car, bus, airplane, and so on). Hence, instead of training from scratch, you can add a new entity type by using spaCy's EntityRuler (explained in Chapter 4, Rule-Based Matching). Always examine your data first and calculate the spaCy models' success rate. If the success rate is satisfying, then use other spaCy components to customize.
For instance, in the preceding newspaper example, only one entity label, vehicle, is missing from the spaCy's NER model's labels. Other entity types are recognized. In this case, you don't need custom training.
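The rule-based route can be sketched as follows. This is a minimal illustration, assuming spaCy v3 and a blank English pipeline so that no model download is needed; the VEHICLE patterns are hypothetical and far from exhaustive.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")

# The entity ruler matches patterns and writes the matches to doc.ents.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical patterns for a VEHICLE label (illustrative, not exhaustive).
patterns = [
    {"label": "VEHICLE", "pattern": [{"LOWER": {"IN": ["car", "bus", "airplane"]}}]},
]
ruler.add_patterns(patterns)

doc = nlp("The minister arrived by bus and left by car.")
vehicles = [(ent.text, ent.label_) for ent in doc.ents]
print(vehicles)
```

In a pre-trained pipeline, the ruler would typically sit next to the existing NER component (for example, added with before="ner"), so that both rule-based and statistical entities end up in doc.ents.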
Consider the medical domain again. The entities are diseases, symptoms, drugs, dosages, chemical compound names, and so on. This is a specialized and long list of entities. Obviously, for the medical domain, you require custom model training.
If we need custom model training, we usually follow these steps:
In the data collection step, we decide how much data to collect: 1,000 sentences, 5,000 sentences, or more. The amount of data depends on the complexity of your task and domain. Usually, we start with an acceptable amount of data, train a first model, and see how it performs; then we can add more data and retrain the model.
After collecting your dataset, you need to annotate your data in such a way that the spaCy training code recognizes it. In the next section, we will see the training data format and how to annotate data with spaCy's Prodigy tool.
The third point is to decide whether to train a blank model from scratch or update an existing model. Here, the rule of thumb is as follows: if your entities/labels are present in the existing model but you don't see a very good performance, then update the model with your own data, such as in the preceding navigation example. If your entities are not present in the current spaCy model at all, then most probably you need custom training.
Tip
Don't rush into training your own models. First, examine if you really need to customize the models. Always keep in mind that training a model from scratch requires data preparation, training a model, and saving it, which means spending your time, money, and effort. Good engineering is about spending your resources wisely.
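One way to make the "examine first" advice concrete is to score a model's predicted entities against a small hand-labeled sample. The sketch below is an assumption about how such a check could look, not a spaCy API: it counts the share of gold entities that were reproduced with exactly the same span and label (a recall-style score), and the sample data is hypothetical.

```python
def entity_success_rate(gold, predicted):
    """Fraction of gold entities reproduced exactly (same span and label)."""
    total = 0
    correct = 0
    for gold_ents, pred_ents in zip(gold, predicted):
        pred_set = set(pred_ents)
        total += len(gold_ents)
        correct += sum(1 for ent in gold_ents if ent in pred_set)
    return correct / total if total else 1.0

# Hypothetical hand-labeled sample: (start_char, end_char, label) per sentence.
gold = [
    [(9, 13, "GPE")],    # "navigate home"
    [(12, 25, "GPE")],   # "navigate to Oxford Street"
    [(12, 16, "GPE")],   # "navigate to Soho"
    [],                  # "navigate"
]
# Hypothetical model output for the same sentences.
predicted = [
    [],                  # entity missed
    [(12, 25, "FAC")],   # right span, wrong label
    [(12, 16, "GPE")],   # exact match
    [],
]

rate = entity_success_rate(gold, predicted)
print(round(rate, 2))
```

A score well below the threshold you consider acceptable suggests that rules alone will not be enough and that further training is worth the effort.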
We'll start our journey of building a model with the first step: preparing our training data. Let's move on to the next section and see how to prepare and annotate our training data.
The first step of training a model is always preparing training data. You usually collect data from customer logs and then turn them into a dataset by dumping the data as a CSV file or a JSON file. spaCy model training code works with JSON files, so we will be working with JSON files in this chapter.
After collecting our data, we annotate our data. Annotation means labeling the intent, entities, POS tags, and so on.
This is an example of annotated data:
{
    "sentence": "I visited JFK Airport.",
    "entities": {
        "label": "LOC",
        "value": "JFK Airport"
    }
}
As you see, we point the statistical algorithm to what we want the model to learn. In this example, we want the model to learn about the entities, hence, we feed examples with entities annotated.
Writing down JSON files manually can be error-prone and time-consuming. Hence, in this section, we'll also see spaCy's annotation tool, Prodigy, along with an open source data annotation tool, Brat. Prodigy is not open source or free, but we will go over how it works to give you a better view of how annotation tools work in general. Brat is open source and immediately available for your use.
Prodigy is a modern tool for data annotation. We will be using the Prodigy web demo (https://prodi.gy/demo) to exhibit how an annotation tool works.
Let's get started:
The preceding screenshot shows an example text that we want to annotate. The buttons at the bottom of the screenshot showcase the means to accept this training example, to reject this example, or to ignore this example. If the example is irrelevant to our domain/task (but involved in the dataset somehow), we ignore this example. If the text is relevant and the annotation is good, then we accept this example, and it joins our dataset.
After we're finished annotating the text, we click the accept button. When you're finished with your annotation job, you can click the Save button to end the session properly. Clicking Save dumps the annotated data as a JSON file automatically. That's it. Prodigy offers a really efficient way of annotating your data.
Another annotation tool is Brat, which is a free and web-based tool for text annotation (https://brat.nlplab.org/introduction.html). It's possible to annotate relations as well as entities in Brat. You can also download Brat onto your local machine and use it for annotation tasks. Basically, you upload your dataset to Brat and annotate the text on the interface. The following screenshot shows an annotated sentence from an example of a CoNLL dataset:
You can play with example datasets on the Brat demo website (https://brat.nlplab.org/examples.html) or get started by uploading a small subset of your own data. After the annotation session is finished, Brat dumps a JSON of annotated data as well.
As we remarked earlier, spaCy training code works with the JSON file format. Let's see the details of the training data format.
For the NER, you need to provide a list of pairs of sentences and their annotations. Each annotation should include the entity type, the start position of the entity in terms of characters, and the end position of the entity in terms of characters. Let's see an example of a dataset:
training_data = [
    ("I will visit you in Munich.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Victoria's house.", {
        "entities": [
            (13, 23, "PERSON"),
            (24, 29, "GPE")
        ]}),
    ("I go there.", {"entities": []})
]
This dataset consists of three example pairs. Each example pair includes a sentence as the first element. The second element of the pair is a dictionary whose entities key holds a list of annotated entities. In the first example sentence, there is only one entity, Munich. This entity's label is GPE; it starts at character position 20 in the sentence, and its last character is at position 25, so the end offset is 26 (positions are counted from 0 and the end offset is exclusive). Similarly, the second sentence includes two entities; one is PERSON, Victoria's, and the second entity is GPE, house. The third sentence does not include any entities, hence the list is empty.
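Counting character positions by hand is error-prone, so the offsets can be derived from the entity strings instead. The helper below is a hypothetical convenience function (not part of spaCy) that uses only the standard library; it assumes each entity string occurs in the sentence and uses the first occurrence only.

```python
def find_entity_offsets(text, entities):
    """Return (start, end, label) triples for (surface_string, label) pairs.

    Simplification: only the first occurrence of each string is used.
    """
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
        # The end offset is exclusive, matching Python slicing.
        spans.append((start, start + len(surface), label))
    return spans

text = "I'm going to Victoria's house."
spans = find_entity_offsets(text, [("Victoria's", "PERSON"), ("house", "GPE")])
print(spans)

# Slicing with the offsets gives back the entity strings.
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because the end offset is exclusive, text[start:end] returns exactly the annotated string, which makes a quick sanity check for any annotated dataset.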
We cannot feed the raw text and annotations directly to spaCy. Instead, we need to create an Example object for each training example. Let's see the code:
import spacy
from spacy.training import Example
nlp = spacy.load("en_core_web_md")
doc = nlp("I will visit you in Munich.")
annotations = {"entities": [(20, 26, "GPE")]}
example_sent = Example.from_dict(doc, annotations)
In this code segment, first, we created a doc object from the example sentence. Then we fed the doc object and its annotations in a dictionary form to create an Example object. We'll use Example objects in the next section's training code.
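The same conversion can be applied to a whole dataset. The following sketch assumes spaCy v3 and uses a blank English pipeline, since only the tokenizer is needed to build Example objects; the call to offsets_to_biluo_tags is an optional check that shows how the character offsets map onto tokens.

```python
import spacy
from spacy.training import Example, offsets_to_biluo_tags

nlp = spacy.blank("en")  # tokenizer only; enough to build Example objects

training_data = [
    ("I will visit you in Munich.", {"entities": [(20, 26, "GPE")]}),
    ("I go there.", {"entities": []}),
]

examples = []
for text, annotations in training_data:
    doc = nlp.make_doc(text)  # tokenize without running any components
    # Token-level view of the character offsets; a "-" tag would signal a
    # span that does not line up with token boundaries.
    tags = offsets_to_biluo_tags(doc, annotations["entities"])
    print(tags)
    examples.append(Example.from_dict(doc, annotations))

# The gold annotations live on the reference Doc of each Example.
first_ents = [(ent.text, ent.label_) for ent in examples[0].reference.ents]
print(first_ents)
```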
Creating example sentences for training the dependency parser is a bit different, and we'll cover this in the Training a pipeline component from scratch section.
Now, we're ready to train our own spaCy models. We'll first see how to update an NLP pipeline statistical model. For this purpose, we'll train the NER component further with the help of our own examples.
In this section, we will train spaCy's NER component further with our own examples to recognize the navigation domain. We already saw some examples of navigation domain utterances and how spaCy's NER model labeled entities of some example utterances:
navigate/0 to/0 my/0 home/0
navigate/0 to/0 Oxford/FAC Street/FAC
Obviously, we want NER to perform better and recognize location entities, such as street names, district names, and other location names, such as home, work, and office. Now, we'll feed our examples to the NER component and will do more training. We will train NER in three steps:
Also, we will learn how to do the following:
Let's get started and dive into the NER model training procedure. As we pointed out in the preceding list, we'll train the NER model in several steps. We'll start with the first step, disabling the other statistical models of the spaCy NLP pipeline.
Before starting the training procedure, we disable the other pipeline components so that we train only the intended component. The following code segment disables all the pipeline components except NER. We call this code block before starting the training procedure:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)
Another way of writing this code is as follows:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    # training code goes here
    pass
In the preceding code block, we made use of the fact that nlp.disable_pipes returns a context manager. Using a with statement makes sure that our code releases the allocated resources (such as file handles, database locks, or threads) when the block ends; here, the disabled components are restored once the block is left. If you're not familiar with with statements, you can read more at this Python tutorial: https://book.pythontips.com/en/latest/context_managers.html.
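To see what a context manager does in this situation, here is a toy sketch built with the standard library's contextlib. The ToyPipeline class is a hypothetical stand-in and not spaCy's implementation; it only mimics the idea of switching components off inside a with block and restoring them afterward.

```python
from contextlib import contextmanager

class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not spaCy's implementation."""

    def __init__(self, names):
        self.active = list(names)

    @contextmanager
    def disable(self, *names):
        original = list(self.active)
        self.active = [n for n in self.active if n not in names]
        try:
            yield self
        finally:
            # Runs even if the block raises, so the pipeline is always restored.
            self.active = original

pipeline = ToyPipeline(["tagger", "parser", "ner"])
with pipeline.disable("tagger", "parser"):
    inside = list(pipeline.active)   # only "ner" is active here
after = list(pipeline.active)        # everything is active again
print(inside, after)
```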
We have completed the first step of the training code. Now, we are ready to move on to the model training procedure.
As we mentioned in Chapter 3, Linguistic Features, in the Introducing named entity recognition section, spaCy's NER model is a neural network model. To train a neural network, we need to configure some parameters as well as provide training examples. Each prediction of the neural network is computed from its weight values; hence, the training procedure adjusts the weights of the neural network with our examples. If you want to learn more about how neural networks function, you can read the excellent guide at http://neuralnetworksanddeeplearning.com/.
In the training procedure, we'll go over the training set several times and show each example several times (one iteration is called one epoch) because showing an example only once is not enough. At each iteration, we shuffle the training data so that the order of the training data does not matter. This shuffling of training data helps train the neural network thoroughly.
In each epoch, the training code updates the weights of the neural network by a small amount. Optimizers are functions that update the neural network weights subject to a loss. At each epoch, a loss value is calculated by comparing the actual label with the neural network's current output. Then, the optimizer function can update the neural network's weights with respect to this loss value.
In the following code, we used the stochastic gradient descent (SGD) algorithm as the optimizer. SGD itself is also an iterative algorithm. It aims to minimize a function (for neural networks, we want to minimize the loss function). SGD starts from a random point on the loss function and travels down its slope in steps until it reaches the lowest point of that function. If you want to learn more about SGD, you can visit Stanford's excellent neural network class at http://deeplearning.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/.
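The interplay of epochs, shuffling, loss, and SGD can be illustrated without any neural network. The sketch below is a toy example under simplifying assumptions: the "model" is a single weight w, the data comes from y = 2 * x, the loss is the squared error, and the learning rate of 0.01 is an arbitrary choice.

```python
import random

random.seed(0)

# Toy data generated from y = 2 * x; the "model" is a single weight w.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

w = 0.0               # initial weight
learning_rate = 0.01  # step size (an arbitrary choice for this sketch)
epochs = 50

for epoch in range(epochs):
    random.shuffle(data)               # new example order in every epoch
    epoch_loss = 0.0
    for x, y in data:
        error = w * x - y              # prediction minus actual value
        epoch_loss += error ** 2       # squared-error loss for this example
        gradient = 2 * error * x       # derivative of the loss w.r.t. w
        w -= learning_rate * gradient  # take a small step down the slope

print(round(w, 3), epoch_loss)
```

Each pass nudges w toward 2.0, the value that minimizes the loss; spaCy's training loop follows the same pattern, with the optimizer handling the weight updates for the whole network.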
Putting it all together, here's the code to train spaCy's NER model for the navigation domain. Let's go step by step:
import random
import spacy
from spacy.training import Example
nlp = spacy.load("en_core_web_md")
trainset = [
    ("navigate home", {"entities": [(9, 13, "GPE")]}),
    ("navigate to office", {"entities": [(12, 18, "GPE")]}),
    ("navigate", {"entities": []}),
    ("navigate to Oxford Street", {"entities": [(12, 25, "GPE")]})
]
epochs = 20
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.create_optimizer()
    for i in range(epochs):
        random.shuffle(trainset)
        for text, annotation in trainset:
            # Tokenize the raw text and pair it with its gold annotations
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotation)
            nlp.update([example], sgd=optimizer)
ner = nlp.get_pipe("ner")
ner.to_disk("navi_ner")
nlp.update outputs a loss value each time it is called. After invoking this code, you should see an output similar to the following screenshot (the loss values might be different):
That's it! We trained the NER component for the navigation domain! Let's try some example sentences and see whether it really worked.
Now we can test our brand-new updated NER component. We can try some examples with synonyms and paraphrases to test whether the neural network really learned the navigation domain, instead of memorizing our examples. Let's see how it goes:
navigate home
navigate to office
navigate
navigate to Oxford Street
doc = nlp("navigate to my house")
doc.ents
(house,)
doc.ents[0].label_
'GPE'
doc = nlp("drive me to home")
doc.ents
(home,)
doc.ents[0].label_
'GPE'
doc = nlp("navigate to Soho")
doc.ents
(Soho,)
doc.ents[0].label_
'GPE'
doc = nlp("I watched a documentary about Lady Diana.")
doc.ents
(Lady Diana,)
doc.ents[0].label_
'PERSON'
Great! spaCy's neural networks can recognize not only synonyms but also entities of the same type. This is one of the reasons why we use spaCy for NLP. Statistical models are incredibly powerful.
In the next section, we'll learn how to save the model we trained and load a model into our Python scripts.
In the preceding code segment, we already saw how to serialize the updated NER component as follows:
ner = nlp.get_pipe("ner")
ner.to_disk("navi_ner")
We serialize models so that we can load them in other Python scripts whenever we want. When we want to load a custom-made spaCy component, we perform the following steps:
import spacy
# Load the same base pipeline that the component was trained in
nlp = spacy.load("en_core_web_md")
# Overwrite the built-in NER weights with the ones saved in "navi_ner"
ner = nlp.get_pipe("ner")
ner.from_disk("navi_ner")
print(nlp.pipe_names)
Here are the steps that we follow:
Now, we also learned how to serialize and load custom components. Hence, we can move forward to a bigger mission: training a spaCy statistical model from scratch. We'll again train the NER component, but this time we'll start from scratch.
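Besides serializing a single component, a whole pipeline can be written to a folder and loaded back in one call. The sketch below assumes spaCy v3 and uses a small blank pipeline with a freshly initialized (untrained) NER component, purely to show the round trip; the temporary directory stands in for whatever folder you would choose.

```python
import tempfile
import spacy

# Small blank pipeline with an NER component, used only to show the round trip.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("GPE")
nlp.initialize()  # allocate model weights so there is something to save

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)            # writes config, vocab, and components
    reloaded = spacy.load(tmp_dir)  # rebuilds the pipeline from that folder

print(reloaded.pipe_names)
print(reloaded.get_pipe("ner").labels)
```

Saving the full pipeline keeps the tokenizer, vocabulary, and component settings together, which avoids having to recreate a matching component before loading its weights.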
In the previous section, we saw how to update the existing NER component according to our data. In this section, we will create a brand-new NER component for the medicine domain.
Let's start with a small dataset to understand the training procedure. Then we'll be experimenting with a real medical NLP dataset. The following sentences belong to the medicine domain and include medical entities such as drug and disease names:
Methylphenidate/DRUG is effectively used in treating children with epilepsy/DISEASE and ADHD/DISEASE.
Patients were followed up for 6 months.
Antichlamydial/DRUG antibiotics/DRUG may be useful for curing coronary-artery/DISEASE disease/DISEASE.
The following code block shows how to train an NER component from scratch. As we mentioned before, it's better to create our own NER rather than updating spaCy's default NER model as medical entities are not recognized by spaCy's NER component at all. Let's see the code and also compare it to the code from the previous section. We'll go step by step:
import random
import spacy
from spacy.training import Example
train_set = [
    ("Methylphenidate is effectively used in treating children with epilepsy and ADHD.", {"entities": [(0, 15, "DRUG"), (62, 70, "DIS"), (75, 79, "DIS")]}),
    ("Patients were followed up for 6 months.", {"entities": []}),
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.", {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]})
]
# The labels here must match the ones used in train_set
entities = ["DIS", "DRUG"]
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner
<spacy.pipeline.ner.EntityRecognizer object at 0x7f54b50044c0>
for ent in entities:
    ner.add_label(ent)
epochs = 25
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for i in range(epochs):
        random.shuffle(train_set)
        for text, annotation in train_set:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotation)
            nlp.update([example], sgd=optimizer)
Here's what this code segment outputs (the loss values may be different):
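If you prefer to collect the loss values explicitly rather than read them off the console, nlp.update accepts a losses dictionary that it fills in. The following is a minimal sketch assuming spaCy v3; it uses nlp.initialize (the v3 name for begin_training), a shortened training set, and only five epochs to keep the run short, so the numbers it prints are not meaningful in themselves.

```python
import random
import spacy
from spacy.training import Example

random.seed(0)

train_set = [
    ("Patients were followed up for 6 months.", {"entities": []}),
    ("Antichlamydial antibiotics may be useful for curing coronary-artery disease.",
     {"entities": [(0, 26, "DRUG"), (52, 75, "DIS")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("DIS", "DRUG"):
    ner.add_label(label)
optimizer = nlp.initialize()

history = []
for epoch in range(5):
    random.shuffle(train_set)
    losses = {}  # nlp.update adds each component's loss to this dict
    for text, annotation in train_set:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer, losses=losses)
    history.append(losses["ner"])
    print(epoch, losses)
```

Keeping the per-epoch values in a list such as history makes it easy to check afterward whether the loss is still falling or has flattened out.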
Did it really work? Let's test the newly trained NER component:
doc = nlp("I had a coronary disease.")
doc.ents
(coronary disease,)
doc.ents[0].label_
'DIS'
Great – it worked! Let's also test some negative examples, entities that are recognized by spaCy's pre-trained NER model but not ours:
doc = nlp("I met you at Trump Tower.")
doc.ents
()
doc = nlp("I meet you at SF.")
doc.ents
()
This looks good, too. Our brand new NER recognizes only medical entities. Let's visualize our first example sentence and see how displaCy exhibits new entities:
from spacy import displacy
doc = nlp("I had a coronary disease.")
displacy.serve(doc, style="ent")
This code block generates the following visualization:
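When starting a local web server is inconvenient, displaCy can also return the markup as a string. The sketch below assumes spaCy v3 and feeds displaCy manually specified entity data (its "manual" input format), so it runs without a trained pipeline; the entity span is written by hand for illustration.

```python
from spacy import displacy

# Manually specified entity data in displaCy's "manual" format, so this
# sketch needs no trained pipeline. Offsets are character positions.
manual_doc = {
    "text": "I had a coronary disease.",
    "ents": [{"start": 8, "end": 24, "label": "DIS"}],
    "title": None,
}

# render returns the HTML markup as a string instead of serving it.
html = displacy.render(manual_doc, style="ent", manual=True, page=True, jupyter=False)
print(len(html))
```

The resulting string can be written to an .html file and opened in a browser, or embedded in a report.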
We successfully trained the NER model on a small dataset. Now it's time to work with a real-world dataset. In the next section, we'll dive into processing a very interesting dataset regarding a hot topic: mining Corona medical texts.
In this section, we will train on a real-world corpus. We will train an NER model on the CORD-19 corpus provided by the Allen Institute for AI (https://allenai.org/). This is an open challenge for text miners to extract information from this dataset to help medical professionals around the world fight against Corona disease. CORD-19 is an open source dataset that is collected from over 500,000 scholarly articles about Corona disease. The training set consists of 20 annotated medical text samples:
The antiviral drugs amantadine and rimantadine inhibit a viral ion channel (M2 protein), thus inhibiting replication of the influenza A virus.[86] These drugs are sometimes effective against influenza A if given early in the infection but are ineffective against influenza B viruses, which lack the M2 drug target.[160] Measured resistance to amantadine and rimantadine in American isolates of H3N2 has increased to 91% in 2005.[161] This high level of resistance may be due to the easy availability of amantadines as part of over-the-counter cold remedies in countries such as China and Russia,[162] and their use to prevent outbreaks of influenza in farmed poultry.[163][164] The CDC recommended against using M2 inhibitors during the 2005–06 influenza season due to high levels of drug resistance.[165]
As we see from this example, real-world medical text can be quite long, and it can include many medical terms and entities. Nouns, verbs, and entities are all related to the medicine domain. Entities can be numbers (91%), number and units (100 ng/ml, 25 microg/ml), number-letter combinations (H3N2), abbreviations (CDC), and also compound words (qRT-PCR, PE-labeled).
The medical entities come in several shapes (numbers, number and letter combinations, and compounds) as well as being very domain-specific. Hence, a medical text is very different from everyday spoken/written language and definitely needs custom training.
Pathogen
MedicalCondition
Medicine
We transformed the dataset so that it's ready to use with spaCy training. The dataset is available under the book's GitHub repository: https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter07/data.
$ wget -P data https://github.com/PacktPublishing/Mastering-spaCy/raw/main/Chapter07/data/corona.json
This will download the dataset into your machine. If you wish, you can manually download the dataset from GitHub, too.
import json
with open("data/corona.json") as f:
    data = json.load(f)
TRAIN_DATA = []
for (text, annotation) in data:
    new_anno = []
    for anno in annotation["entities"]:
        st, end, label = anno
        new_anno.append((st, end, label))
    TRAIN_DATA.append((text, {"entities": new_anno}))
This code segment will read the dataset's JSON file and format it according to the spaCy training data conventions.
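Before training on a converted dataset, it can be worth checking that the annotations are well formed. The function below is a hypothetical helper written with the standard library only; it assumes the (text, {"entities": [...]}) layout described above and flags unknown labels, spans that fall outside the text, and overlapping spans, which spaCy's entity annotations are not expected to contain. The two sample records are made up for the demonstration.

```python
def validate_training_data(train_data, allowed_labels):
    """Collect readable problems found in (text, {"entities": [...]}) pairs."""
    problems = []
    for index, (text, annotation) in enumerate(train_data):
        previous_end = 0
        for start, end, label in sorted(annotation["entities"]):
            if label not in allowed_labels:
                problems.append(f"example {index}: unknown label {label!r}")
            if not 0 <= start < end <= len(text):
                problems.append(f"example {index}: bad span ({start}, {end})")
            elif start < previous_end:
                problems.append(f"example {index}: overlapping span ({start}, {end})")
            previous_end = max(previous_end, end)
    return problems

# Made-up records: the second one has a misspelled label and a span that
# runs past the end of the text.
sample = [
    ("Tuberculosis is caused by bacteria.", {"entities": [(0, 12, "MedicalCondition")]}),
    ("Amantadine inhibits influenza A.", {"entities": [(0, 10, "Drug"), (20, 40, "Pathogen")]}),
]
labels = {"Pathogen", "MedicalCondition", "Medicine"}

problems = validate_training_data(sample, labels)
for problem in problems:
    print(problem)
```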
a) First, we'll do the related imports:
import random
import spacy
from spacy.training import Example
b) Secondly, we'll initialize a blank spaCy English model and add an NER component to this blank model:
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
print(ner)
print(nlp.meta)
c) Next, we define the labels we'd like the NER component to recognize and introduce these labels to it:
labels = ['Pathogen', 'MedicalCondition', 'Medicine']
for ent in labels:
    ner.add_label(ent)
print(ner.labels)
d) Finally, we're ready to define the training loop:
epochs = 100
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for i in range(epochs):
        random.shuffle(TRAIN_DATA)
        for text, annotation in TRAIN_DATA:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotation)
            nlp.update([example], sgd=optimizer)
This code block is identical to the training code from the previous section, except for the value of the epochs variable. This time, we iterated for 100 epochs, because the entity types, entity values, and the training sample text are semantically more complicated. We recommend you do at least 500 iterations for this dataset if you have the time. 100 iterations over the data are sufficient to get good results, but 500 iterations will take the performance further.
from spacy import displacy
doc = nlp("One of the bacterial diseases with the highest disease burden is tuberculosis, caused by Mycobacterium tuberculosis bacteria, which kills about 2 million people a year.")
displacy.serve(doc, style="ent")
The following screenshot highlights two entities – tuberculosis and the name of the bacteria that causes it as the pathogen entity:
doc2 = nlp("Pathogenic bacteria contribute to other globally important diseases, such as pneumonia, which can be caused by bacteria such as Streptococcus and Pseudomonas, and foodborne illnesses, which can be caused by bacteria such as Shigella, Campylobacter, and Salmonella. Pathogenic bacteria also cause infections such as tetanus, typhoid fever, diphtheria, syphilis, and leprosy. Pathogenic bacteria are also the cause of high infant mortality rates in developing countries.")
displacy.serve(doc2, style="ent")
Here is the visual generated by the preceding code block:
Looks good! We successfully trained spaCy's NER model for the medicine domain and now the NER can extract information from medical text. This concludes our section. We learned how to train a statistical pipeline component as well as prepare the training data and test the results. These are great steps in both mastering spaCy and machine learning algorithm design.
In this chapter, we explored how to customize spaCy statistical models according to our own domain and data. First, we learned the key points of deciding whether we really need custom model training. Then, we went through an essential part of statistical algorithm design – data collection, and labeling.
Here we also learned about two annotation tools – Prodigy and Brat. Next, we started model training by updating spaCy's NER component with our navigation domain data samples. We learned the necessary model training steps, including disabling the other pipeline components, creating example objects to hold our examples, and feeding our examples to the training code.
Finally, we learned how to train an NER model from scratch on a small toy dataset and on a real medical domain dataset.
With this chapter, we took a step into the statistical NLP playground. In the next chapter, we will take more steps in statistical modeling and learn about text classification with spaCy. Let's move forward and see what spaCy brings us!