Chapter 2: Flair Base Types

A good place to start with any NLP framework is getting comfortable with its basic objects and methods used frequently throughout the code. In Flair, the first step is getting familiar with its base types. These are the basic objects that are used for defining sentences or text fragments and forming tokens through tokenization.

One of the main challenges in NLP today is support for underrepresented languages. Most state-of-the-art prebuilt NLP models cover only a handful of the most widely spoken languages, providing no support for the roughly 7,000 other languages spoken on the planet today. While Flair stands out with its excellent language coverage and its work on multilingual embeddings, it is still far from supporting every language in areas such as corpora availability, tokenization methods, and prebuilt sequence tagging models. To form tokens for a language with special tokenization rules that Flair does not currently support, you will need to implement your own tokenizer. Luckily, doing so in Flair couldn't be easier.

In this chapter, you will learn about the Sentence object, which is used for representing sentences or text fragments, and the Token base type, which is used for representing words in tokenized sentences. You will learn how to implement custom tokenizers and how to use them. You will also learn about the Corpus object in Flair and how to use its helper functions to load and interact with corpora and datasets.

This chapter covers the following topics:

  • Sentence and Token objects
  • Using custom tokenizers
  • Corpus objects

Technical requirements

For this chapter, you will need a Python environment with Flair version 0.11 installed and a stable internet connection since we will be downloading corpora.

The code examples in this chapter can be found in this book's official GitHub repository, in the following Jupyter notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-Flair/tree/main/Chapter02.

Sentence and Token objects

Sentence and Token objects can be regarded as the most common objects in Flair syntax. The former is used for representing sentences or any other text fragments, such as paragraphs. A Sentence is essentially a list of Token objects together with their corresponding label assignments.

If you are wondering what objects, classes, and methods mean, this simply suggests that you're not particularly familiar with object-oriented programming (OOP). Luckily, OOP is super easy to get a grasp of in Python. In OOP, pieces of code that store and/or process data are organized into blueprints called classes. An example of a class could be the Word class, which can store and process a single word. Classes in OOP can include several procedures called methods. An example of a Word class method could be get_length(), which would simply return a word's length. Classes can also contain a special type of method called a constructor, which gets called when the class is instantiated. The result of instantiating a class is an object. For example, calling word_1 = Word('potato') would call the constructor method with 'potato' as the first argument value, return a Word object, and store it as word_1. We can then use this variable to call this object's methods, for example, word_1.get_length().
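
To make this concrete, here is a minimal sketch of the hypothetical Word class described previously (the class and its get_length() method are purely illustrative and not part of Flair):

class Word:
    # the constructor, called when the class is instantiated
    def __init__(self, text):
        self.text = text

    # a method that returns the length of the stored word
    def get_length(self):
        return len(self.text)

word_1 = Word('potato')
print(word_1.get_length())  # prints 6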

This brief introduction to OOP should give you enough understanding to be able to dive straight into Flair's arguably most fundamental entity – the Sentence class.

Understanding the Sentence class

The Sentence class constructor accepts the following parameters:

  • text: Original string (sentence) or a list of string tokens (words).
  • use_tokenizer: Tokenizer to use. If True, SegtokTokenizer will be used. If False, SpaceTokenizer will be used instead. If a Tokenizer object is provided, it will be used as the custom tokenizer.
  • language_code: The language code that's used for some embeddings (optional).
  • start_position: The start char offset of the sentence in the document (optional).

For example, let's create a simple Sentence object and examine its string representation behavior:

from flair.data import Sentence

sentence = Sentence('Some nice text.')
print(sentence, len(sentence))

The preceding script should print out the following:

Sentence: "Some nice text ." 4

The preceding output indicates that the sentence has been split into four tokens, one of which is a full stop. The separate . token is a direct result of not providing the use_tokenizer parameter, which defaults to True and therefore uses SegtokTokenizer.
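
Since the text parameter also accepts a pre-tokenized list of strings, tokenization can be bypassed entirely. Here is a minimal sketch, which should produce the same four tokens as the preceding example:

from flair.data import Sentence

# passing a list of token strings skips tokenization altogether
sentence = Sentence(['Some', 'nice', 'text', '.'])
print(sentence, len(sentence))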

Understanding the Token class

The Token class represents a single entity such as a word in a tokenized sentence.

Each token can have zero, one, or multiple labels. For example, Flair allows using the same Sentence object for both POS tagging and NER. This means that a single token can, in theory, have both a named entity label and a part-of-speech label.

The Token object's corresponding text can easily be obtained by referring to its Token.text property.
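
The following short sketch illustrates both points by manually attaching two label types to a single token and reading its text back (the label type names ner and pos, and the label values, are arbitrary choices for this illustration):

from flair.data import Sentence

sentence = Sentence('Berlin is big.')
token = sentence[0]
# a single token can carry labels of several types at once
token.add_label('ner', 'S-LOC')
token.add_label('pos', 'NNP')
print(token.text)  # prints Berlin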

We now have a rough understanding of how the two base types work. To understand them fully, however, we need to dive deeper into tokenization techniques in Flair.

Tokenization in Flair

Depending on your needs and the type of language used, you may choose from any of the following tokenizers available in Flair:

  • SpaceTokenizer(): A simple tokenizer that splits text by the space character (" ") only. Note that it does not split by the other characters included in Python's string.whitespace, such as the newline and tab characters, even though these are often used in methods for splitting or formatting text.
  • SpacyTokenizer(model): A tokenizer that uses a commonly used NLP library called spaCy. It accepts a parameter called model that can either be a string representing a spaCy model (for example, en_core_web_sm) or a spaCy Language object.
  • SciSpacyTokenizer(): A tokenizer that uses spaCy's en_core_sci_sm model, extended by special characters such as (, ), and -, which serve as additional token separators.
  • SegtokTokenizer(): A tokenizer that uses a Flair dependency called segtok that focuses on splitting (Indo-European) text.
  • JapaneseTokenizer(tokenizer, sudachi_mode): A tokenizer class that uses the konoha library and provides a selection of tokenizers for the Japanese language. The tokenizer parameter can be either mecab, janome, or sudachi.

Now that we have a good understanding of the tokenizers available in Flair, let's try to use the Sentence object with a non-default tokenizer:

from flair.data import Sentence
from flair.tokenization import SpaceTokenizer

tokenizer = SpaceTokenizer()
s = Sentence('Some nice text.', use_tokenizer=tokenizer)

# getting the string representation using magic method __str__
print(s, len(s))

The preceding script, which uses SpaceTokenizer, should print out the following:

Sentence: "Some nice text." 3

Unlike SegtokTokenizer, SpaceTokenizer treats the text. string as a single token because it splits on the space character only.

In the next section, we will look at how to extract information from the Sentence and Token objects and how to tag them with additional data.

Sentence and Token object helper methods

The Sentence object has several helpful methods and properties. The most straightforward one, and one that we already used in the preceding example, is the implementation of the __str__ magic method, which returns a string representation of the object, for example, when trying to print it out.

To be able to fully understand the format of the string representation of the Sentence object, we need to tag at least one token. To do that, we will use the get_token(n) method to get the Token object for the nth token, and then use the add_label(label_type, value) method to assign a label to the token:

from flair.data import Sentence

sentence = Sentence('A short sentence')
sentence.get_token(1).add_label('manual-pos', 'DT')
print(sentence)

The preceding script will print out the following:

Sentence: "A short sentence" → ["A"/DT]

The first part of the printed-out string displays the entire sentence in its original form, whereas the second part of the sentence displays the tokens with their corresponding tags. In this code example, we manually tagged the first token to be able to test out the string representation of the object, whereas in Flair, tokens will usually be labeled by sequence taggers instead.

A similar string representation of the Sentence object can also be directly obtained by calling the to_tagged_string() method.
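
For example, reusing the sentence variable from the preceding script, the following call should produce an equivalent tagged string:

print(sentence.to_tagged_string())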

Important Note

The get_token(n) method uses 1-based indexing, meaning that n=1 will retrieve the first token, whereas n=0 will return None. A 0-based indexing alternative to using get_token(n) would be to use the indexer operator. For example, to get the first token for a Sentence object called sentence, we can simply call sentence[0].
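
As a quick sanity check of the two indexing conventions, the following sketch retrieves the first token both ways; both calls should print the same token:

from flair.data import Sentence

sentence = Sentence('A short sentence')
print(sentence.get_token(1))  # 1-based indexing
print(sentence[0])            # 0-based indexing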

Tokens in a tokenized Sentence object can also be obtained via its __iter__() magic method, which allows us to get all the tokens by simply iterating over the object, for example, with a for loop.

Let's run the following script using the same sentence variable from the preceding script:

for token in sentence:
    print(token)

The script will print out the following:

Token[0]: "A" → DT (1.0)

Token[1]: "short"

Token[2]: "sentence"

This allows us to get a clear visual representation of how our original string was tokenized.

When working with the Sentence object, it is important to understand what the built-in len() function returns for it: the number of tokens, not the length of the original string. This means that calling len(sentence) with the variable from the preceding script would return 3. If you wish to get the length of the actual sentence (that is, the original text), you will need to run len(sentence.to_original_text()).
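
A short sketch contrasting the two lengths:

from flair.data import Sentence

sentence = Sentence('A short sentence')
print(len(sentence))                     # number of tokens: 3
print(len(sentence.to_original_text()))  # characters in the original text: 16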

If none of the tokenizers described here work for you, Flair has you covered. All you need to do is implement your own custom tokenizer and pass it into Flair. Let's learn how.

Using custom tokenizers

While Flair ships with several tokenizers that support the most commonly spoken languages, it is entirely possible that you will be working with a language whose tokenization rules are currently not covered by Flair. Luckily, Flair offers a simple interface that allows us to implement our own tokenizers or use third-party libraries.

Using the TokenizerWrapper class

The TokenizerWrapper class provides an easy interface for building custom tokenizers. To build one, you simply need to instantiate the class by passing the tokenizer_func parameter. The parameter is a function that receives the entire sentence text as input and returns a list of token strings.

As an exercise, let's try to implement a custom tokenizer that splits the text into characters. This tokenizer will treat every character as a token:

from flair.tokenization import TokenizerWrapper

# split the text into a list of single-character strings
def char_splitter(sentence):
    return list(sentence)

char_tokenizer = TokenizerWrapper(char_splitter)

In the preceding code, we implemented the char_splitter function, which simply splits the sentence into a list of characters. We then created the new tokenizer object by instantiating the TokenizerWrapper class with our new function.

Now, we can test our tokenizer by using it with a Sentence object and printing out the generated tokens:

from flair.data import Sentence

text = "Good day."
sentence = Sentence(text, use_tokenizer=char_tokenizer)

for token in sentence:
    print(token)

The resulting script will print out the following:

Token[0]: "G"

Token[1]: "o"

Token[2]: "o"

Token[3]: "d"

Token[4]: " "

Token[5]: "d"

Token[6]: "a"

Token[7]: "y"

Token[8]: "."

The printed text indicates that our tokenizer performs as expected and that it is compatible with the Sentence object.

Important Note on Tokenizers

Tokenization rules for languages are often complex and include many edge cases. Therefore, it is always recommended to use third-party tokenizers from established NLP packages as opposed to implementing your own.
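
For instance, TokenizerWrapper makes it easy to plug in a tokenizer from an established package. The following sketch wraps NLTK's word_tokenize function; it assumes the nltk package and its punkt tokenizer data are installed, and is an illustration rather than a tokenizer that ships with Flair:

from flair.data import Sentence
from flair.tokenization import TokenizerWrapper
from nltk.tokenize import word_tokenize

# word_tokenize takes a string and returns a list of token strings,
# which is exactly the signature TokenizerWrapper expects
nltk_tokenizer = TokenizerWrapper(word_tokenize)
sentence = Sentence("Don't panic!", use_tokenizer=nltk_tokenizer)
for token in sentence:
    print(token)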

Now that we've covered tokenization and the Sentence and Token base types, it's time to move on and dive into understanding the Corpus object.

Understanding the Corpus object

The Corpus object is the main object that stores corpora in memory in Flair. Each Corpus is a collection of three datasets that behave like lists of Sentence objects. These datasets can be accessed via the following properties:

  • The train property, which contains the dataset that will be used for training models.
  • The test property, which contains a dataset that's independent of the train dataset. It is used for the final evaluation of trained models.
  • The dev property, which contains the dataset that's used for hyperparameter tuning.

These three datasets ideally contain data from the same data source and follow the same probability distribution.

An example corpus object can be obtained by loading one of Flair's prepared datasets:

from flair import datasets

corpus = datasets.UD_ENGLISH()

The corpus summary can be obtained by simply printing out the object:

print(corpus)

This should print out the following:

Corpus: 12543 train + 2002 dev + 2077 test sentences

The train dataset of the corpus can be obtained by simply calling the train property:

train_dataset = corpus.train

Then, the nth sentence of the dataset can be obtained by using the indexer operator. For example, the 101st sentence can be obtained by using the following code:

sentence = train_dataset[100]
print(sentence)

The preceding script will print out the 101st sentence in the train dataset.

The object also ships with several helper methods. One of the most widely used is the downsample(percentage) method, which reduces the size of the corpus; percentage is a value between 0 and 1 that determines the fraction of the original corpus to keep.

Note

The downsample(percentage) method modifies and returns the original Corpus object. This means that calling it several times downsamples the corpus multiple times.

For example, if we wanted to downsample our corpora to 1%, we would use the following code:

corpus.downsample(0.01)
print(corpus)

This will print out the following:

Corpus: 125 train + 20 dev + 21 test sentences

This is 1% of the size of the original corpus.

Flair ships with several helper classes and subclasses of the Corpus object that allow us to load data from CSV files and read our own sequence labeling datasets. It also offers a vast selection of prepared datasets, all of which will be covered in more detail later in the book.
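
As a brief preview, the following sketch loads a hypothetical CoNLL-style sequence labeling dataset using the ColumnCorpus class; the folder and file names are placeholders for your own data:

from flair.datasets import ColumnCorpus

# column 0 holds the word, column 1 its label
columns = {0: 'text', 1: 'ner'}
corpus = ColumnCorpus('path/to/dataset', columns,
                      train_file='train.txt',
                      dev_file='dev.txt',
                      test_file='test.txt')
print(corpus)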

Summary

In this chapter, we covered Flair's base types, such as the Sentence and Token objects, explained how to initialize and use them, and tried out some of their basic helper methods. This should allow us to handle, transform, and understand data in Flair more easily as we move toward more complex topics. We also covered using custom tokenizers in Flair and implemented our own character-based tokenizer. Finally, we scratched the surface of what Flair's datasets and Corpus objects can do. We learned how to load corpora and datasets, assess their size, extract and read individual sentences, and downsample entire datasets.

We are now familiar enough with the syntax, basic objects, and helper methods to be able to move on to Flair's most powerful NLP technique – sequence tagging. We will cover it in the next chapter.
