
Naive Bayes Classifier (NBC)

Hindavi Churi

The goal of this project is to apply the Naive Bayes Classifier to sentiment-labelled reviews for movies, in other words, to perform sentiment analysis of the data. The classifier is trained and tested without the use of any library. The project aims to classify, for a given test sample, what the sentiment will be based on the trained data.


Data

The data describes reviews for movies and the sentiment associated with each review. The total length of the data is 1000 reviews. This data is divided into train, development, and test datasets: we take 80% as the training dataset and 20% as the test dataset, and further divide the training dataset into 80% actual training data and 20% development data. A sketch of this split is shown below.
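A minimal sketch of this split (a hypothetical helper, not the post's actual code; the ratios follow the description above):

```python
import random

def train_dev_test_split(reviews, seed=0):
    """Shuffle (review, sentiment) pairs and split them into
    64% train, 16% development, and 20% test."""
    random.seed(seed)
    data = list(reviews)
    random.shuffle(data)

    test_cut = int(0.8 * len(data))       # 80% train+dev, 20% test
    train_dev, test = data[:test_cut], data[test_cut:]

    dev_cut = int(0.8 * len(train_dev))   # of that, 80% train, 20% dev
    return train_dev[:dev_cut], train_dev[dev_cut:], test
```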


Naive Bayes

The Naive Bayes method is based on Bayes' theorem. The reason it is called naive is that each feature/variable is assumed to be independent of every other feature/variable; this assumption of independence between features rarely holds in real-life applications. The method is used to find a conditional probability, where the probability of an event is obtained with knowledge of some other previously occurred event.

The conditional probability is given by,


P( X | Y ) = P( X and Y ) / P( Y )

here, the probability of event X is estimated with knowledge of the previously occurred event Y.


The Naive Bayes classifier is mostly used for classification tasks. Using Bayes' theorem, it can be defined as,

P( A | B ) = P( B | A ) · P( A ) / P( B )

Now, to calculate the probability of A given that event B has already occurred, we combine the probability of event B given that A has already occurred with the probability of event A, divided by the probability of event B.

Also, the features here are assumed to be independent of each other.


A Naive Bayes classifier can be of 3 types:

  • Multinomial Naive Bayes

  • Bernoulli Naive Bayes

  • Gaussian Naive Bayes

Each type serves its purpose depending on the kind of data it is dealing with. For example, if the features take continuous values, Gaussian Naive Bayes is used, while Multinomial Naive Bayes is used for count features such as word frequencies. In this project, we will be using the Bernoulli Naive Bayes type, since our features are binary, i.e. each word is either absent (0) or present (1) in a review, and each review is labelled 0 (negative) or 1 (positive). The scoring rule this implies is shown below.
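Concretely, if a review is represented as a binary vector x over the vocabulary, where x_w = 1 when word w appears in the review, the standard Bernoulli model (a textbook formulation, not quoted from the original post) scores a class c as,

P( x | c ) = Π over w of  P( w | c )^x_w · ( 1 − P( w | c ) )^( 1 − x_w )

so every vocabulary word contributes to the score, whether it is present or absent.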


The Naive Bayes method is widely used in sentiment analysis, recommendation systems, etc. Its biggest disadvantage, however, is that it assumes the variables to be independent of each other.


In this project, we will see how the Naive Bayes Classifier is used for sentiment analysis.


K-fold cross-validation

Cross-validation is normally used to assess how well a model learns from the training dataset. It is applied to the training dataset to tune hyperparameters and to detect overfitting or underfitting on the given data.


Cross-validation can be of many types, one of which is K-fold cross-validation.


K-fold cross-validation is a procedure that divides the data into k folds, where k − 1 folds are treated as training data and the remaining fold as a test dataset. Training and testing are repeated so that each fold serves as the test set once, which helps in understanding the hyperparameters.


This validation can also be used to check whether the model is overfitting the training data, so that steps can be taken accordingly before testing the actual data with the trained model.

The procedure can be seen in the sketch below,
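A minimal sketch of how the folds can be generated (the post does not state its exact fold count, so k = 10 here is an assumption):

```python
def k_fold_splits(data, k=10):
    """Yield (train, test) pairs; each fold is the test set exactly once."""
    fold_size = len(data) // k
    for i in range(k):
        lo, hi = i * fold_size, (i + 1) * fold_size
        yield data[:lo] + data[hi:], data[lo:hi]
```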


Smoothing

While testing a sample against the trained model, there can be an event that is new and has not occurred in the training dataset. For such an event, the probability would be estimated as zero, making the whole product of probabilities zero, which is not a feasible solution.


Smoothing is a technique that takes care of such events. It adds a small value to these probabilities to keep them non-zero, and thereby keeps the overall probability with respect to the class non-zero.

There are many smoothing techniques, one of which is Laplace smoothing.

Laplace smoothing is a well-known technique for this issue in Naive Bayes classifiers. It adds a Laplace constant a to the numerator and a times k to the denominator, where k is the number of classes.

The formula looks like,

P( w | c ) = ( count(w, c) + a ) / ( count(c) + a · k )
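For example (illustrative numbers, not from the original post): with a = 1 and k = 2, a word with zero occurrences out of 100 counted for a class gets probability ( 0 + 1 ) / ( 100 + 1 · 2 ) ≈ 0.0098 instead of 0, so it no longer zeroes out the whole product.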




Data-Preprocessing

For training and testing, data preprocessing is done. In order to estimate the probability of each event, in this case, the words from each review (sentence) must be considered. Hence, each review is split into tokens.


Tokenization is done so that each word is purely formed, containing no special characters, in other words, only alphanumeric text. After tokenization, there remain many words that carry no meaning for what we are trying to estimate. Such words are useless for this purpose and are commonly called stop words. Hence, we go over each tokenized word, check whether it is a stop word, and discard it in that case.


The effect of these preprocessing steps, an original sentence and the same sentence after preprocessing, can be seen in the sketch below.
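A minimal sketch of this preprocessing (the stop-word list here is a tiny illustrative subset, not the one used in the project):

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "to", "and", "of"}

def preprocess(review):
    """Lowercase, keep purely alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This movie was an absolute delight to watch!"))
# ['movie', 'absolute', 'delight', 'watch']
```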


The next step is to take the words from each sentence and form a bag of words that are unique across the combined document of reviews. Then, we estimate the occurrence of each word relative to the whole document, i.e. estimate the probability of each word.
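A minimal sketch of this estimation (names are illustrative; the printed sample stands in for the output the original post showed):

```python
from collections import Counter

def word_probabilities(tokenized_reviews):
    """P(word) = occurrences of word / total tokens across all reviews."""
    counts = Counter(t for review in tokenized_reviews for t in review)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

sample = word_probabilities([["movie", "great"], ["movie", "bad"]])
print(sample)  # {'movie': 0.5, 'great': 0.25, 'bad': 0.25}
```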


We then calculate the probability of each word with respect to each class (positive or negative). Lastly, we calculate the probability of each class in the document, as sketched below.
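A minimal sketch of these two steps under the Bernoulli model, with Laplace smoothing built in (the + 2a in the denominator covers the two outcomes, present or absent, of each word; names are illustrative):

```python
def train_bernoulli_nb(tokenized_reviews, labels, a=1.0):
    """Estimate P(word present | class) with Laplace smoothing,
    plus the prior P(class) for each class in the document."""
    vocab = {w for review in tokenized_reviews for w in review}
    priors, cond = {}, {}
    for c in set(labels):
        docs = [set(r) for r, y in zip(tokenized_reviews, labels) if y == c]
        priors[c] = len(docs) / len(labels)
        cond[c] = {w: (sum(w in d for d in docs) + a) / (len(docs) + 2 * a)
                   for w in vocab}
    return vocab, priors, cond
```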


Hence, these are all the steps of data preprocessing and training, i.e. the model is trained and can now be used for testing.


Development dataset

The purpose of the development dataset is to understand how well the model we are building works on the training data, and to see whether it is fit to be tested or whether it is overfitting.


For this purpose, we divide our training dataset into train and development parts. We use the development dataset to train and evaluate on the training data, which can be achieved with the cross-validation method. The results of this method tell us the accuracy of the model for the different folds formed during training and testing.

In other words, the development dataset is used to tune the hyperparameters. This knowledge can then be applied to the test data to achieve maximum accuracy.


The results from the development dataset are per-fold accuracies; the average performance of the model, estimated as the mean of these accuracies, is 88.1%.


Hyperparameters

Laplace values:

We evaluate on the development dataset for different Laplace values:

for a = 0.001, we get an accuracy of 87.79%

for a = 0.1, we get an accuracy of 87.89%

for a = 1, we get an accuracy of 88.1%

for a = 10, we get an accuracy of 88.3%

for a = 100, we get an accuracy of 88.69%


We can see that the performance of the model increases as the Laplace value increases. This may seem good at first sight, but it can be a sign of overfitting to the development data, so we use a = 1 for our further testing. A sketch of this sweep is shown below.
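A minimal sketch of this sweep, assuming a hypothetical cross_validate helper that runs the k-fold procedure above and returns one accuracy per fold:

```python
for a in [0.001, 0.1, 1, 10, 100]:
    accuracies = cross_validate(train_data, laplace=a)  # hypothetical helper
    print(f"a={a}: mean accuracy {sum(accuracies) / len(accuracies):.2%}")
```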


Testing

For testing, we preprocess the data in the same way as the training data, and apply smoothing to the probabilities with a = 1, the hyperparameter value optimised during development testing. A sketch of the prediction step is shown below.
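A minimal sketch of the prediction and accuracy computation, reusing the illustrative helpers sketched above (preprocess, train_bernoulli_nb, and a test_data list of (review, label) pairs):

```python
import math

def predict(review, vocab, priors, cond):
    """Return the class with the highest Bernoulli NB log-score."""
    present = set(preprocess(review))
    def score(c):
        s = math.log(priors[c])
        for w in vocab:
            p = cond[c][w]
            s += math.log(p) if w in present else math.log(1.0 - p)
        return s
    return max(priors, key=score)

hits = sum(predict(r, vocab, priors, cond) == y for r, y in test_data)
print(f"test accuracy: {hits / len(test_data):.0%}")
```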

The results show an accuracy of 39%.


Challenges faced

The biggest challenge was understanding how to use the development dataset and how to tune the hyperparameters during training and testing.

The issue was resolved by experimenting, for example with different values of the Laplace coefficient.

Trying out different preprocessing methods also helped a lot in understanding how the performance of the model changes.


Experiments

We tried different preprocessing methods to make the data as clean as possible, which helps training and testing run more smoothly and accurately.

We have already discussed the preprocessing methods above.




