Lemmatization

What is Lemmatization?

Lemmatization is the process of reducing a word to its base or root form, also known as its lemma, while still retaining its meaning. The base form or lemma of a word is its canonical or dictionary form, and is the word that appears when you look it up in a dictionary.

How Lemmatization Works

Lemmatization is performed by a tool called a lemmatizer, which combines a dictionary (or vocabulary) of the base forms of words in a language with a set of linguistic rules. When a word is passed to the lemmatizer, it uses the dictionary and the rules to determine the word's base form. Getting this right depends on the word's part of speech (e.g., noun, verb, adjective), which is usually inferred from the word's context in the sentence.
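To make the dictionary-plus-rules idea concrete, here is a minimal sketch of a toy lemmatizer. The exception table, vocabulary, suffix rules, and the name toy_lemmatize are all made up for illustration; real lemmatizers rely on much larger lexicons and more careful rules.

```python
# Toy lemmatizer: a hand-written exception table plus a few suffix rules.
# Every table entry and rule below is an illustrative assumption, not a real lexicon.

# Irregular forms, keyed by (word, part of speech).
EXCEPTIONS = {
    ("were", "verb"): "be",
    ("am", "verb"): "be",
    ("ran", "verb"): "run",
    ("mice", "noun"): "mouse",
}

# Known base forms, used to check that a rule produced a real word.
VOCABULARY = {"run", "play", "cat", "garden", "amaze", "be", "mouse"}

# Suffix rewrite rules per part of speech: (suffix to strip, replacement).
RULES = {
    "noun": [("s", "")],
    "verb": [("ning", ""), ("ing", "e"), ("ing", ""), ("ed", ""), ("s", "")],
}


def toy_lemmatize(word, pos):
    """Return a lemma for `word`, given a coarse part of speech such as "noun" or "verb"."""
    word = word.lower()
    # 1. Irregular forms are looked up directly.
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # 2. A word that is already a known base form is returned as-is.
    if word in VOCABULARY:
        return word
    # 3. Otherwise try each suffix rule and keep the first result found in the vocabulary.
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    # 4. Nothing matched: leave the word unchanged.
    return word


print(toy_lemmatize("running", "verb"))
print(toy_lemmatize("amazing", "verb"))
print(toy_lemmatize("amazing", "adj"))
```

Note how the same surface form, "amazing", gets a different lemma depending on the part of speech passed in. This is why the part of speech matters so much in practice.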

For example, the word “running” can be reduced to its base form “run” through lemmatization, and the word “amazing” can be reduced to “amaze” when it is treated as a verb form (as an adjective, “amazing” is already its own lemma). This process is useful in natural language processing (NLP) tasks such as text classification, information retrieval, and sentiment analysis, because it maps the different inflected forms of a word to a single entry, which shrinks the vocabulary a model has to handle.
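The effect on vocabulary size can be illustrated with a small sketch. The LEMMAS table below is a hypothetical stand-in for a real lemmatizer and covers only the tokens in this example.

```python
from collections import Counter

# Hypothetical lemma table for this example only; a real lemmatizer would supply these.
LEMMAS = {"running": "run", "runs": "run", "ran": "run", "cats": "cat"}

tokens = ["run", "running", "runs", "ran", "cat", "cats"]

# Count tokens as they appear, then again after mapping each to its lemma.
surface_counts = Counter(tokens)
lemma_counts = Counter(LEMMAS.get(token, token) for token in tokens)

print(len(surface_counts), "distinct surface forms")
print(len(lemma_counts), "distinct lemmas")
```

Six distinct surface forms collapse into two lemmas, so counts that were spread across several forms of the same word are pooled into one entry.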

Lemmatization is typically accomplished using software libraries or tools that contain dictionaries or algorithms to identify the base form of a word. Python has several NLP libraries that include lemmatization capabilities, such as spaCy, NLTK, and TextBlob. These libraries can be used to preprocess text data, reducing complexity and variability to improve the accuracy and performance of NLP models.

Lemmatization In Code

Lemmatization is available in many programming languages through NLP libraries. The examples below use three Python libraries.

Example 1 (NLTK)

Here’s an example of how to use the Python Natural Language Toolkit (NLTK) library to perform lemmatization:

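import nltk
from nltk.stem import WordNetLemmatizer

# one-time download of the tokenizer and WordNet data
# (recent NLTK releases may ask for "punkt_tab" instead of "punkt")
nltk.download("punkt")
nltk.download("wordnet")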

# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# input sentence
sentence = "The cats were playing in the garden"

# tokenize sentence into words
words = nltk.word_tokenize(sentence)

# perform lemmatization on each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# print lemmatized words
print(lemmatized_words)

In this example, we first import the necessary libraries and initialize the lemmatizer using the WordNetLemmatizer class from the NLTK library. We then tokenize the input sentence into individual words using the word_tokenize() function. We use a list comprehension to perform lemmatization on each word using the lemmatize() method of the WordNetLemmatizer class, and store the resulting lemmas in a new list. Finally, we print the lemmatized words.

The output of this code would be:

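['The', 'cat', 'were', 'playing', 'in', 'the', 'garden']

As you can see, the lemmatizer reduced “cats” to “cat” but left “were” and “playing” unchanged. By default, lemmatize() treats every word as a noun and does not look at the surrounding sentence. To lemmatize a verb, pass the part of speech explicitly: lemmatizer.lemmatize("were", pos="v") returns “be”.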
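Since lemmatize() accepts a pos argument ("n", "v", "a", or "r" for noun, verb, adjective, or adverb), a common pattern is to run a part-of-speech tagger first and translate its tags. The sketch below shows one way to do that translation. The helper names penn_to_wordnet and lemmatize_tagged are made up for this example, and the lemmatizer is passed in as a plain function so the helpers do not depend on any particular library.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (the kind nltk.pos_tag returns) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, lemmatize()'s default


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize (word, tag) pairs with a `lemmatize(word, pos)` callable."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_words]


# Show how the tags of the example sentence would be translated.
tagged = [("The", "DT"), ("cats", "NNS"), ("were", "VBD"), ("playing", "VBG")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK installed and its tagger data downloaded, the pieces would be combined as lemmatize_tagged(nltk.pos_tag(words), lemmatizer.lemmatize), so that verbs such as “were” and “playing” are lemmatized as verbs rather than as nouns.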

Example 2 (spaCy)

Example of lemmatization code using the spaCy library in Python:

import spacy

# load the English language model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# define a sentence to be lemmatized
sentence = "I am running in the park"

# process the sentence using the language model
doc = nlp(sentence)

# extract the lemmatized version of each word in the sentence
lemmas = [token.lemma_ for token in doc]

# print the lemmatized sentence
print(' '.join(lemmas))

In this example, we first import the spaCy library and load the pre-trained English language model. We then define a sentence to be lemmatized and pass it to the nlp object, which is callable and returns a spaCy Doc object representing the processed sentence. Next, we read the lemma of each token from its .lemma_ attribute. Finally, we join the lemmas into a single space-separated string with ' '.join() and print it.

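The output of this code depends on the spaCy version. spaCy 2.x uses the placeholder -PRON- as the lemma of every pronoun and prints:

-PRON- be run in the park

spaCy 3.x no longer uses this placeholder and prints:

I be run in the park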

As you can see, the spaCy lemmatizer reduces “running” to “run” and “am” to “be” to provide the base form of each word in the sentence.
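Older spaCy releases (2.x) use the placeholder -PRON- as the lemma of pronouns. If downstream code needs an actual word in that position, one option is to fall back to the lowercased token text. The helper below is a hypothetical sketch that works on plain (text, lemma) pairs, which could be built from a Doc with [(token.text, token.lemma_) for token in doc].

```python
def replace_pron_placeholder(pairs, placeholder="-PRON-"):
    """Swap the pronoun placeholder for the lowercased original token text."""
    return [text.lower() if lemma == placeholder else lemma for text, lemma in pairs]


# (text, lemma) pairs written out by hand here so the example runs without spaCy.
pairs = [
    ("I", "-PRON-"),
    ("am", "be"),
    ("running", "run"),
    ("in", "in"),
    ("the", "the"),
    ("park", "park"),
]

print(" ".join(replace_pron_placeholder(pairs)))  # prints: i be run in the park
```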

Example 3 (TextBlob)

Example of lemmatization in TextBlob using the sentence “The cats were playing in the garden”:

from textblob import TextBlob

# define the input sentence
sentence = "The cats were playing in the garden"

# create a TextBlob object
blob = TextBlob(sentence)

# perform lemmatization on each word in the sentence
lemmatized_words = [word.lemmatize() for word in blob.words]

# print the lemmatized words
print(lemmatized_words)

The output of this code would be:

['The', 'cat', 'were', 'playing', 'in', 'the', 'garden']

As you can see, TextBlob reduces “cats” to “cat” but leaves “were” and “playing” unchanged. TextBlob uses the WordNet lemmatizer from the NLTK library under the hood, which treats every word as a noun unless told otherwise. Passing a part of speech, as in word.lemmatize("v"), lemmatizes the word as a verb instead.

Additional Resources: