Stemming and lemmatization are both text normalization techniques in Natural Language Processing. Text normalization transforms the words in a sentence into a standard form, making the text distribution more compact. In other words, it reduces the many surface forms of a word to a single representative form. For example, the words ‘smile’, ‘smiling’ and ‘smiled’ are three different words, but they have the same meaning; the variation is only due to context. A human can easily see this, but a machine cannot. In Natural Language Processing, it is therefore good practice to let your model interpret those words as one word, with one meaning.
Text normalization is a vital preprocessing step when dealing with textual data, and by extension, so are stemming and lemmatization. Let’s understand what these are.
What is Stemming?
Stemming is a text normalization technique that cuts down the affixes of words to extract their base form, also called the root word or stem. Stemming is a crude process, and the stem it produces may not have any grammatical meaning. In fact, some other NLP libraries, such as spaCy, do not include a stemmer at all.
There are various stemming programs used to carry out stemming. These programs are called stemmers or stemming algorithms. NLTK provides the Porter Stemmer, the Lancaster Stemmer, the Regular Expression Stemmer, and the Snowball Stemmer. The most common is the Porter stemming algorithm.
Porter Stemming Algorithm
The Porter Stemming Algorithm is arguably the most popular stemming algorithm in Natural Language Processing. In NLTK, it can be instantiated using the PorterStemmer class. The algorithm takes tokenized words as input and outputs their stems. Let’s take a simple code example using the PorterStemmer class.
#import the PorterStemmer class from the nltk.stem library
from nltk.stem import PorterStemmer

#instantiate the PorterStemmer class
stemmer = PorterStemmer()

#create a list of tokens
tokens = ['smiling', 'smile', 'smiled', 'smiles']

#create an empty list to take in the stemmed words
stemmed_words = []

#loop over each token in the list
for each_word in tokens:
    #stem each word in the list
    stemmed_word = stemmer.stem(each_word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)

#print the stemmed words list
print(stemmed_words)
Output:
['smile', 'smile', 'smile', 'smile']
As seen, all variations of the word have been stemmed to the root word, ‘smile’. As mentioned earlier, some words may not be stemmed into meaningful root words. If we attempt to stem the words ‘cry’, ‘crying’, ‘cries’ and ‘cried’, the stemmer outputs the word ‘cri’, which does not have any grammatical meaning.
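You can verify this quickly with the same PorterStemmer instance; a minimal sketch (assuming NLTK is installed, as in the example above):

#stem the variations of 'cry' with the PorterStemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(word) for word in ['cry', 'crying', 'cries', 'cried']])
#per the discussion above, every variation reduces to 'cri'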
Let’s take another example where we pass sentences as input.
#import the PorterStemmer class from the nltk.stem library
from nltk.stem import PorterStemmer
#import the word_tokenize function from the nltk library
from nltk import word_tokenize

#instantiate the PorterStemmer class
stemmer = PorterStemmer()

#define some sentences
sentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'

#tokenize the sentences into words
each_sentence = word_tokenize(sentences)

#create an empty list to take in the stemmed words
stemmed_words = []

#loop over each token in the list
for word in each_sentence:
    #stem each word in the list
    stemmed_word = stemmer.stem(word)
    #add the stemmed word to the stemmed words list
    stemmed_words.append(stemmed_word)

#print the stemmed words list
print(stemmed_words)
Output:
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'I', 'will', 'not', 'stop', 'learn']
Again, some of the stemmed words do not have a dictionary meaning. Mind you, there are other stemming algorithms; the only tweak to the code is importing the new stemming algorithm and instantiating it in the same way (a short sketch follows the list below).
- Using the Lancaster stemmer, via the LancasterStemmer class, outputs
['nltk', 'is', 'a', 'very', 'interest', 'subject', '.', 'i', 'hav', 'learnt', 'a', 'lot', 'from', 'thi', 'websit', '.', 'i', 'wil', 'not', 'stop', 'learn']
- The regular expression stemmer takes a regular expression and removes any part of the word that matches the defined expression. Using the regular expression stemmer with the RegexpStemmer class, and defining ‘ing’ as the regular expression, outputs
['NLTK', 'is', 'a', 'very', 'interest', 'subject', '.', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'website', '.', 'I', 'will', 'not', 'stop', 'learn']
- The Snowball stemmer allows for stemming in 15 other languages, including Arabic, French, German, Italian, Portuguese, Russian, Spanish and Swedish. When using the Snowball stemmer, the language has to be defined. Using the Snowball stemmer with the SnowballStemmer class and defining the language as ‘english’ outputs
['nltk', 'is', 'a', 'veri', 'interest', 'subject', '.', 'i', 'have', 'learnt', 'a', 'lot', 'from', 'this', 'websit', '.', 'i', 'will', 'not', 'stop', 'learn']
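As a rough sketch of how those outputs were produced (exact results can vary slightly with the NLTK version), swapping stemmers only changes the import and the instantiation line; the RegexpStemmer additionally needs the regular expression, and the SnowballStemmer needs the language:

#the same pipeline as before, only the stemmer object changes
from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer
from nltk import word_tokenize

sentences = 'NLTK is a very interesting subject. I have learnt a lot from this website. I will not stop learning'
tokens = word_tokenize(sentences)

#Lancaster stemmer: no arguments needed
lancaster = LancasterStemmer()
print([lancaster.stem(word) for word in tokens])

#regular expression stemmer: removes any substring matching the pattern ('ing' here)
regexp = RegexpStemmer('ing')
print([regexp.stem(word) for word in tokens])

#Snowball stemmer: the language must be given
snowball = SnowballStemmer('english')
print([snowball.stem(word) for word in tokens])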
Let’s now take a look at lemmatization.
What is Lemmatization?
Lemmatization is similar to stemming in that it reduces a word to its base form; the difference is that in lemmatization the root word, called the ‘lemma’, is always a word with a dictionary meaning. Lemmatization performs a morphological analysis of words to remove inflectional endings and output base words that have a dictionary meaning. Lemmatization with the NLTK library is done using the WordNetLemmatizer class, with almost the same methodology as using PorterStemmer. Let’s take a simple coding example.
#import the WordNetLemmatizer class from the nltk.stem library
from nltk.stem import WordNetLemmatizer

#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

#create a list of tokens
tokens = ['crying', 'cry', 'cried']

#create an empty list to take in the lemmatized words
lemmatized_words = []

#loop over each token in the list
for each_word in tokens:
    #lemmatize each word in the list
    lemmatized_word = lemmatizer.lemmatize(each_word)
    #add the lemmatized word to the lemmatized words list
    lemmatized_words.append(lemmatized_word)

#print the lemmatized words list
print(lemmatized_words)
Output:
['cry', 'cry', 'cried']
The WordNetLemmatizer outputs meaningful words, cry and cried, as opposed to the cri that the PorterStemmer returned.
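Note that ‘cried’ was left unchanged because WordNetLemmatizer treats every word as a noun by default. If you pass the pos argument (for example 'v' for verb), ‘cried’ should also reduce to ‘cry’. A small sketch, assuming the WordNet corpus has been downloaded:

#lemmatize with an explicit part-of-speech tag
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
#default pos is 'n' (noun); pos='v' tells the lemmatizer to treat the word as a verb
print(lemmatizer.lemmatize('cried'))           #treated as a noun, stays 'cried'
print(lemmatizer.lemmatize('cried', pos='v'))  #treated as a verb, should give 'cry'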
Use Case Scenarios
Let’s say we have a text we want a machine learning model to understand. We need to preprocess the text using stemming or lemmatization. Let’s take a code example for each of them starting with stemming.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions,
conjunctions, and interjections. POS tagging in simple terms means allocating every word in a sentence to a part of speech.
NLTK has a method called pos_tag that performs POS tagging on a sentence. The methods apply supervised learning approaches
that utilize features such as context, the capitalization of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''

#instantiate the PorterStemmer class
stemmer = PorterStemmer()

#tokenize the text into a list of sentences
sentence = sent_tokenize(text)

#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the stemmer if the word is not a stopword
    words = [stemmer.stem(word) for word in tokenized_words
             if word not in set(stopwords.words('english'))]
    #replace the sentence with the stemmed words joined back together
    sentence[index] = ' '.join(words)

#print the preprocessed sentences
print(sentence)
Output:
Observe that words such as ‘of’, ‘in’, ‘the’, etc. are completely taken out. They are called stopwords. Stopwords do not add significant meaning to a sentence, so it is good practice to remove them.
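If you want to see which words NLTK treats as English stopwords (assuming the stopwords corpus has been downloaded), you can inspect the list directly:

#inspect NLTK's built-in list of English stopwords
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
print(len(english_stopwords))   #number of stopwords in the list
print(english_stopwords[:10])   #the first few entries, such as 'i', 'me', 'my', ...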
Now, let’s carry out lemmatization on the same text and see the result.
#import the necessary libraries
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#define the text to be preprocessed
text = '''
A part of speech is a classification system in the English Language that reveals the role a word plays in a context.
There are eight parts of speech in the English Language: nouns, pronouns, verbs, adverbs, adjectives, prepositions,
conjunctions, and interjections. POS tagging in simple terms means allocating every word in a sentence to a part of speech.
NLTK has a method called pos_tag that performs POS tagging on a sentence. The methods apply supervised learning approaches
that utilize features such as context, the capitalization of words, punctuations, and so on to determine the part of speech.
POS tagging is a critical procedure to understand the meaning of a sentence and know the relationship between words.
'''

#instantiate the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

#tokenize the text into a list of sentences
sentence = sent_tokenize(text)

#loop over each sentence based on its index
for index, _ in enumerate(sentence):
    #tokenize each sentence into a list of words
    tokenized_words = word_tokenize(sentence[index])
    #apply the lemmatizer if the word is not a stopword
    words = [lemmatizer.lemmatize(word) for word in tokenized_words
             if word not in set(stopwords.words('english'))]
    #replace the sentence with the lemmatized words joined back together
    sentence[index] = ' '.join(words)

#print the preprocessed sentences
print(sentence)
Output:
So rounding off…
Stemming or Lemmatization: Which should you go for?
No doubt, lemmatization is better than stemming, but there are tradeoffs. Lemmatization relies on morphological analysis and dictionary lookups, which makes it more computationally intensive. If speed is what you require, you should consider stemming. Likewise, if you are building something like a sentiment analyzer or an email classifier, the base word is usually sufficient for your model, so stemming is a reasonable choice there as well.
If, however, your model will actively interact with humans, say a chatbot or a language translation system, lemmatization would be the better option.