Text classification, also known as text categorization, is a fundamental task in natural language processing (NLP) that involves assigning predefined categories or labels to a given text document. The goal is to automatically analyze and categorize unstructured text data into meaningful classes based on their content, sentiment, topic, or any other predefined criteria.

The Markov assumption, named after Russian mathematician Andrey Markov, is a fundamental assumption in probability theory and statistics. In the context of text generation or language modeling, the Markov assumption is often applied to simplify the modeling process.

The Markov assumption states that the future state or event in a sequence depends only on the current state or event, and not on the past states or events. In other words, the future is assumed to be independent of the past given the present.

In the context of text generation, the Markov assumption implies that the probability of observing a particular word in a sequence depends only on the preceding word or a small fixed window of previous words. This assumption simplifies the modeling process because it allows us to estimate the probability distribution of the next word based solely on the current context without considering the entire history of the sequence.

For example, in a first-order Markov model, the probability of a word is conditioned on the immediately preceding word. In a second-order Markov model, the probability of a word is conditioned on the two preceding words, and so on.

Applying the Markov assumption allows for efficient modeling and generation of sequences, including text. It has been widely used in various NLP tasks such as language modeling, part-of-speech tagging, and machine translation. However, it is important to note that the Markov assumption oversimplifies the complexity of language and may not capture long-range dependencies or context that extends beyond a fixed window of words.

The markov equation can be simplified in probably terms as below:

The mathematics behind the equation is a bit complex and beyond the scope of this article. There are lots of articles and free courses available online.

**Code**

`#### downloading spam dataset #####`

!wget https://lazyprogrammer.me/course_files/spam.csv

from sklearn.model_selection import train_test_split

import numpy as np

import pandas as pd

spam_df = pd.read_csv('spam.csv', encoding = "ISO-8859-1")

spam_df = spam_df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1)

`# mapping the category names to labels #`

map_dict = {'ham': 0, 'spam': 1}

spam_df['labels'] = spam_df['v1'].map(map_dict)

input_text = spam_df['v2']

label = spam_df['labels']

# splitting the data into test and train #

train_text, test_text, Ytrain, Ytest = train_test_split(input_text, label)

##### starting point for word indexing #####

##### we are setting aside 0 for unknown values, i.e. the tokens that are present in

##### test but not in train.

idx = 1

word2idx = {'<unk>': 0}

# populate word2idx

for text in train_text:

tokens = text.split()

for token in tokens:

if token not in word2idx:

word2idx[token] = idx

idx += 1

# convert data into integer format

train_text_int = []

test_text_int = []

for text in train_text:

tokens = text.split()

line_as_int = [word2idx[token] for token in tokens]

train_text_int.append(line_as_int)

for text in test_text:

tokens = text.split()

line_as_int = [word2idx.get(token, 0) for token in tokens]

test_text_int.append(line_as_int)

`# initialize A and pi matrices - for both classes. The number of A and pi matrices depends on the number of classes or categories we have #`

V = len(word2idx)

A0 = np.ones((V, V))

pi0 = np.ones(V)

A1 = np.ones((V, V))

pi1 = np.ones(V)

# compute counts for A and pi

def compute_counts(text_as_int, A, pi):

for tokens in text_as_int:

last_idx = None

for idx in tokens:

if last_idx is None:

# it's the first word in a sentence

pi[idx] += 1

else:

# the last word exists, so count a transition

A[last_idx, idx] += 1

# update last idx

last_idx = idx

compute_counts([t for t, y in zip(train_text_int, Ytrain) if y == 0], A0, pi0)

compute_counts([t for t, y in zip(train_text_int, Ytrain) if y == 1], A1, pi1)

# normalize A and pi so they are valid probability matrices

A0 /= A0.sum(axis=1, keepdims=True)

pi0 /= pi0.sum()

A1 /= A1.sum(axis=1, keepdims=True)

pi1 /= pi1.sum()

# log A and pi since we don't need the actual probs

logA0 = np.log(A0)

logpi0 = np.log(pi0)

logA1 = np.log(A1)

logpi1 = np.log(pi1)

# compute priors for both categories #

count0 = sum(y == 0 for y in Ytrain)

count1 = sum(y == 1 for y in Ytrain)

total = len(Ytrain)

p0 = count0 / total

p1 = count1 / total

logp0 = np.log(p0)

logp1 = np.log(p1)

p0, p1

Now, we have the complete A and pi matrix, we can go ahead with building the classifier.

`# build a classifier`

class Classifier:

def __init__(self, logAs, logpis, logpriors):

self.logAs = logAs

self.logpis = logpis

self.logpriors = logpriors

self.K = len(logpriors) # number of classes

def _compute_log_likelihood(self, input_, class_):

logA = self.logAs[class_]

logpi = self.logpis[class_]

last_idx = None

logprob = 0

for idx in input_:

if last_idx is None:

# it's the first token

logprob += logpi[idx]

else:

logprob += logA[last_idx, idx]

# update last_idx

last_idx = idx

return logprob

def predict(self, inputs):

predictions = np.zeros(len(inputs))

for i, input_ in enumerate(inputs):

posteriors = [self._compute_log_likelihood(input_, c) + self.logpriors[c] \

for c in range(self.K)]

pred = np.argmax(posteriors)

predictions[i] = pred

return predictions

# each array must be in order since classes are assumed to index these lists

clf = Classifier([logA0, logA1], [logpi0, logpi1], [logp0, logp1])

Training and testing for the model

`## training data ##`

Ptrain = clf.predict(train_text_int)

print(f"Train acc: {np.mean(Ptrain == Ytrain)}")

`Train acc: 0.99`

`Ptest = clf.predict(test_text_int)`

print(f"Test acc: {np.mean(Ptest == Ytest)}")

`Test acc: 0.95`

Even though the accuracy for both the train and the test is the same, we have an imbalanced dataset and hence, F1-score would be a more useful metric to take into account than accuracy.

`from sklearn.metrics import f1_score`

f1_score(Ytrain, Ptrain)

ans: 0.99

`f1_score(Ytest, Ptest)`

ans: 0.82

This gives us a better picture of how our model is performing. Even with such a simple assumption, we are able to attain a high F1-score on our test data.