Basics of NLP: Text Classification with the Markov Assumption

Harsh
5 min read · May 8, 2023

Text classification, also known as text categorization, is a fundamental task in natural language processing (NLP) that involves assigning predefined categories or labels to a given text document. The goal is to automatically analyze and categorize unstructured text data into meaningful classes based on their content, sentiment, topic, or any other predefined criteria.

The Markov assumption, named after Russian mathematician Andrey Markov, is a fundamental assumption in probability theory and statistics. In the context of text generation or language modeling, the Markov assumption is often applied to simplify the modeling process.

The Markov assumption states that the future state or event in a sequence depends only on the current state or event, and not on the past states or events. In other words, the future is assumed to be independent of the past given the present.

In the context of text generation, the Markov assumption implies that the probability of observing a particular word in a sequence depends only on the preceding word or a small fixed window of previous words. This assumption simplifies the modeling process because it allows us to estimate the probability distribution of the next word based solely on the current context without considering the entire history of the sequence.

For example, in a first-order Markov model, the probability of a word is conditioned on the immediately preceding word. In a second-order Markov model, the probability of a word is conditioned on the two preceding words, and so on.
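
To make this concrete, here is a minimal sketch (the toy corpus and variable names are made up for illustration) of how a first-order Markov model estimates next-word probabilities from bigram counts:

from collections import defaultdict

toy_corpus = ["i love nlp", "i love pizza", "i hate spam"]  # hypothetical data

# count how often each word follows each other word (bigram counts)
bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for sentence in toy_corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens[:-1], tokens[1:]):
        bigram_counts[prev][curr] += 1
        context_counts[prev] += 1

# P(next | current) = count(current -> next) / count(current as a context)
print(bigram_counts['i']['love'] / context_counts['i'])  # 2/3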

Applying the Markov assumption allows for efficient modeling and generation of sequences, including text. It has been widely used in various NLP tasks such as language modeling, part-of-speech tagging, and machine translation. However, it is important to note that the Markov assumption oversimplifies the complexity of language and may not capture long-range dependencies or context that extends beyond a fixed window of words.

The Markov assumption can be written in probability terms as follows. For a first-order model, the probability of the next word given the entire history collapses to the probability given only the previous word:

P(w_t | w_1, w_2, …, w_{t-1}) ≈ P(w_t | w_{t-1})

A full derivation of the mathematics behind this equation is beyond the scope of this article; there are plenty of articles and free courses available online that cover it in depth.
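
Concretely, under a first-order Markov assumption the probability of a whole sentence w_1 w_2 … w_T factorizes as

P(w_1, w_2, …, w_T) ≈ pi(w_1) · A(w_1, w_2) · A(w_2, w_3) · … · A(w_{T-1}, w_T)

where pi(w) is the probability that a sentence starts with w, and A(i, j) is the probability that word j immediately follows word i. These pi and A are exactly the quantities we estimate from counts in the code below.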

Code

##### downloading spam dataset #####
!wget https://lazyprogrammer.me/course_files/spam.csv

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

spam_df = pd.read_csv('spam.csv', encoding="ISO-8859-1")

# the csv contains some extra empty columns; keep only v1 (label) and v2 (text)
spam_df = spam_df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
# mapping the category names to labels #
map_dict = {'ham': 0, 'spam': 1}
spam_df['labels'] = spam_df['v1'].map(map_dict)

input_text = spam_df['v2']
label = spam_df['labels']

# splitting the data into test and train #

train_text, test_text, Ytrain, Ytest = train_test_split(input_text, label)
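
Note that train_test_split shuffles the rows randomly on every run; if you want a reproducible split, you can pass a seed, e.g. train_test_split(input_text, label, random_state=42).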

##### starting point for word indexing #####
##### we are setting aside 0 for unknown values, i.e. the tokens that are present in
##### test but not in train.

idx = 1
word2idx = {'<unk>': 0}

# populate word2idx
for text in train_text:
    tokens = text.split()
    for token in tokens:
        if token not in word2idx:
            word2idx[token] = idx
            idx += 1


# convert data into integer format
train_text_int = []
test_text_int = []

for text in train_text:
    tokens = text.split()
    line_as_int = [word2idx[token] for token in tokens]
    train_text_int.append(line_as_int)

for text in test_text:
    tokens = text.split()
    # tokens unseen during training map to the <unk> index 0
    line_as_int = [word2idx.get(token, 0) for token in tokens]
    test_text_int.append(line_as_int)
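
A quick note on tokenization: we split on whitespace only, so "Free" and "free!" are treated as different tokens. In practice you might lowercase the text and strip punctuation first; the simple split just keeps the example easy to follow.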

# initialize A and pi matrices, one pair per class; the number of A and pi
# matrices depends on the number of classes or categories we have.
# starting from ones instead of zeros gives add-one (Laplace) smoothing, so
# transitions never seen in training still get a small nonzero probability.
V = len(word2idx)

A0 = np.ones((V, V))
pi0 = np.ones(V)

A1 = np.ones((V, V))
pi1 = np.ones(V)

# compute counts for A and pi

def compute_counts(text_as_int, A, pi):
    for tokens in text_as_int:
        last_idx = None
        for idx in tokens:
            if last_idx is None:
                # it's the first word in a sentence
                pi[idx] += 1
            else:
                # the last word exists, so count a transition
                A[last_idx, idx] += 1

            # update last idx
            last_idx = idx


compute_counts([t for t, y in zip(train_text_int, Ytrain) if y == 0], A0, pi0)
compute_counts([t for t, y in zip(train_text_int, Ytrain) if y == 1], A1, pi1)



# normalize A and pi so they are valid probability matrices

A0 /= A0.sum(axis=1, keepdims=True)
pi0 /= pi0.sum()

A1 /= A1.sum(axis=1, keepdims=True)
pi1 /= pi1.sum()

# take logs of A and pi: multiplying many small probabilities underflows,
# so we add log-probabilities instead
logA0 = np.log(A0)
logpi0 = np.log(pi0)

logA1 = np.log(A1)
logpi1 = np.log(pi1)

# compute priors for both categories #


count0 = sum(y == 0 for y in Ytrain)
count1 = sum(y == 1 for y in Ytrain)
total = len(Ytrain)
p0 = count0 / total
p1 = count1 / total
logp0 = np.log(p0)
logp1 = np.log(p1)
p0, p1

Now that we have the complete A and pi matrices for both classes, we can go ahead and build the classifier.
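
The decision rule comes straight from Bayes' rule: the posterior P(class | text) is proportional to P(text | class) · P(class), because the evidence P(text) is the same for every class and can be dropped. Working in log space, we score each class as log-likelihood plus log-prior and pick the argmax, which is exactly what the predict method below does.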

# build a classifier
class Classifier:
    def __init__(self, logAs, logpis, logpriors):
        self.logAs = logAs
        self.logpis = logpis
        self.logpriors = logpriors
        self.K = len(logpriors)  # number of classes

    def _compute_log_likelihood(self, input_, class_):
        logA = self.logAs[class_]
        logpi = self.logpis[class_]

        last_idx = None
        logprob = 0
        for idx in input_:
            if last_idx is None:
                # it's the first token
                logprob += logpi[idx]
            else:
                logprob += logA[last_idx, idx]

            # update last_idx
            last_idx = idx

        return logprob

    def predict(self, inputs):
        predictions = np.zeros(len(inputs))
        for i, input_ in enumerate(inputs):
            # log posterior (up to a constant): log-likelihood + log-prior
            posteriors = [self._compute_log_likelihood(input_, c) + self.logpriors[c]
                          for c in range(self.K)]
            pred = np.argmax(posteriors)
            predictions[i] = pred
        return predictions


# each array must be in order since classes are assumed to index these lists
clf = Classifier([logA0, logA1], [logpi0, logpi1], [logp0, logp1])

Training and testing the model

## evaluating on training and test data ##

Ptrain = clf.predict(train_text_int)
print(f"Train acc: {np.mean(Ptrain == Ytrain)}")
Train acc: 0.99
Ptest = clf.predict(test_text_int)
print(f"Test acc: {np.mean(Ptest == Ytest)}")
Test acc: 0.95

Although the accuracy is high on both the train and the test set, we have an imbalanced dataset (there are far more ham messages than spam), and hence the F1-score is a more informative metric to take into account than accuracy.

from sklearn.metrics import f1_score

f1_score(Ytrain, Ptrain)

ans: 0.99
f1_score(Ytest, Ptest)

ans: 0.82

This gives us a better picture of how our model is performing. Even with such a simple assumption, we are able to attain a solid F1-score on our test data.
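
As a final sanity check, here is a minimal sketch (the message below is made up) of how you could classify a new, raw message with the trained model: split it into tokens, map each token to its index (unknown words map to 0), and call predict:

new_message = "Congratulations! You have won a free prize"  # hypothetical example
new_as_int = [word2idx.get(token, 0) for token in new_message.split()]
pred = clf.predict([new_as_int])
print("spam" if pred[0] == 1 else "ham")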
