Stemming and Lemmatization

Harsh
3 min read · May 4, 2023

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that focuses on enabling machines to understand and interact with human language. It involves processing, analyzing, and generating human language to perform various tasks such as sentiment analysis, language translation, text summarization, and speech recognition, among others.

One of the essential tasks in NLP is text preprocessing, which involves cleaning and transforming raw text data into a more manageable format for analysis. Text preprocessing techniques such as tokenization, stop-word removal, and normalization are commonly used in NLP.
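As a rough illustration of these preprocessing steps, here is a minimal sketch using only the Python standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example; real projects typically rely on a library tokenizer and a much longer stop-word list.

```python
import re

# A tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text):
    """Lowercase the text, tokenize it on runs of letters, and drop stop words."""
    lowered = text.lower()                             # normalization: case folding
    tokens = re.findall(r"[a-z]+", lowered)            # tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("The cat is sitting in the garden, and the dog is barking!"))
# ['cat', 'sitting', 'garden', 'dog', 'barking']
```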

Stemming and lemmatization are two text normalization techniques used to reduce words to their base or root form. Stemming involves removing suffixes from words to obtain their base form, while lemmatization involves converting words to their dictionary or morphological base form. The primary goal of these techniques is to reduce the number of unique words in a text document, making it easier to analyze and understand.
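To make the idea of "reducing the number of unique words" concrete, the sketch below counts the vocabulary size before and after normalization. The `BASE_FORMS` mapping is a hand-written, hypothetical stand-in for a real stemmer or lemmatizer.

```python
# Hypothetical hand-written mapping from inflected forms to a base form.
BASE_FORMS = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "runs": "run", "running": "run",
}

tokens = ["walk", "walks", "walked", "walking", "run", "runs", "running"]

# Words without an entry in the mapping are kept unchanged.
normalized = [BASE_FORMS.get(t, t) for t in tokens]

print(len(set(tokens)))      # 7 unique words before normalization
print(len(set(normalized)))  # 2 unique words after normalization
```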

Stemming is simpler and faster than lemmatization. It applies a set of rules to strip suffixes and obtain the base form of a word. However, stemming can produce a base form that is not a valid word, and it can map unrelated words to the same stem (for example, the Porter stemmer reduces both universe and university to univers), which introduces ambiguity.
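The rule-based idea can be sketched with a deliberately naive suffix stripper. This is not the Porter algorithm, which uses several ordered rule phases and extra conditions; `naive_stem`, `SUFFIXES`, and `min_stem_len` are hypothetical names chosen for this example.

```python
# Suffixes are tried in this order, so longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walked", "walks", "studies", "news"]:
    print(w, "->", naive_stem(w))
# walking -> walk
# walked -> walk
# walks -> walk
# studies -> studi
# news -> new
```

The last two lines show both weaknesses of pure suffix stripping: studi is not a valid word, and news is collapsed into the unrelated word new.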

Lemmatization, on the other hand, is a more sophisticated technique that uses a vocabulary and morphological analysis to determine the base form of a word. It produces a valid base form that can be found in a dictionary, making it more accurate than stemming. However, lemmatization is slower and more complex than stemming.
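A vocabulary-based approach can be sketched as a simple lookup keyed by the word and its part of speech. The `LEMMA_TABLE` below is a hypothetical four-entry vocabulary; a real lemmatizer relies on a full lexical database and morphological rules rather than a hand-written table.

```python
# Hypothetical mini-vocabulary: (surface form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("ran", "verb"): "run",
    ("walking", "verb"): "walk",
    ("better", "adj"): "good",
    ("mice", "noun"): "mouse",
}

def toy_lemmatize(word, pos="noun"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("ran", "verb"))      # run
print(toy_lemmatize("mice"))             # mouse
print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("walking"))          # walking (no noun entry, returned unchanged)
print(toy_lemmatize("walking", "verb"))  # walk
```

Irregular forms such as ran or mice cannot be recovered by stripping suffixes, which is why a lookup of this kind tends to be more accurate, at the cost of needing a vocabulary and the part of speech.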

Applying stemming and lemmatization in your NLP code is straightforward. Below we will see how the two differ and how to use them in Python.

Stemming

import nltk
from nltk.stem import PorterStemmer
porter = PorterStemmer()

porter.stem("walking")

The result is 'walk'. The result is the same for walked and walks.

Lemmatization

from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize("walking")

Intuitively, we would expect the lemma of walking to be walk, since that is the root form of the word, but the answer here is walking. This is because the lemmatizer assumes by default that the word is a noun, and walking is a valid noun. To get the verb lemma, we have to specify the part of speech of the word we want to lemmatize.

lemmatizer.lemmatize("walking", pos=wordnet.VERB)

Now, the answer would be walk.

Let us now compare a few more examples of stemming and lemmatization:

porter.stem("ran")  # 'ran'
lemmatizer.lemmatize("ran", pos=wordnet.VERB)  # 'run'
lemmatizer.lemmatize("ran")  # 'ran' (treated as a noun by default)
porter.stem("following")  # 'follow'
lemmatizer.lemmatize("following", pos=wordnet.VERB)  # 'follow'
lemmatizer.lemmatize("following")  # 'following' (treated as a noun by default)

Applying stemming and lemmatization on a sentence

Stemming

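sentence = "lemmatization is more sophisticated than stemming"
sentence_split = sentence.split()  # split on whitespace into word tokens
for token in sentence_split:  # iterate over the words, not the characters of the string
    print(porter.stem(token), end=" ")
# Output: lemmat is more sophist than stem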

Lemmatization

The Penn Treebank part-of-speech tags returned by nltk.pos_tag are not directly compatible with the pos argument of the WordNet lemmatizer, so we first need a small helper that maps one tag set to the other.

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


nltk.download('averaged_perceptron_tagger')
# pos_tag expects a list of tokens; passing the raw string would tag each character
words_and_tags = nltk.pos_tag(sentence_split)

for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    print(lemma, end=' ')
# Output: lemmatization be more sophisticated than stem
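To see what the tag conversion does in isolation, here is a table-driven sketch of the same mapping that runs without NLTK. It assumes that WordNet's part-of-speech constants in NLTK are the single-character strings 'a', 'v', 'n', and 'r'; `treebank_to_wordnet` and `TAG_PREFIX_TO_WORDNET` are hypothetical names used only here.

```python
# WordNet POS codes written as plain strings so the sketch runs without NLTK.
# They are assumed to match wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], "n")

for tag in ["JJ", "VBZ", "VBG", "NNS", "RBR", "IN"]:
    print(tag, "->", treebank_to_wordnet(tag))
# JJ -> a
# VBZ -> v
# VBG -> v
# NNS -> n
# RBR -> r
# IN -> n
```

Only the first letter of the Treebank tag matters, so VBZ and VBG both map to the verb code, and tags with no WordNet counterpart, such as IN (preposition), fall back to noun.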

While stemming is simpler than lemmatization, both techniques have their respective use cases, and the choice between them depends on the specific task at hand. In some cases, stemming may produce better results than lemmatization, while in other cases, lemmatization may be more accurate. Therefore, it is essential to weigh the trade-offs between simplicity, speed, and accuracy when selecting a text normalization technique.

Link to my LinkedIn profile:

https://www.linkedin.com/in/harshharsh/
