Movie Genre Prediction - NLP

Want to know what type of movie it is based only on its story or review?

This data science project explores how a movie's genre can be predicted from its story or review, with a view to applying the approach in recommendation systems such as those used by Netflix and Amazon Prime, or in search engine recommendations such as those from Google and Yahoo.

Tools Used: Python, Jupyter Notebook

Libraries Used: NLTK, scikit-learn, Pandas, Matplotlib

Models: Logistic Regression, Multinomial Naive Bayes, Stochastic Gradient Descent, Extreme Gradient Boosting, K-Nearest Neighbors, Random Forest and Support Vector Machine.

The project focuses on classifying a movie as Drama or Non-Drama from its story, using text-analysis models together with machine learning algorithms. The data is stored in an Excel file and read with the Pandas library; it consists of 20,000 samples for training and 3,498 samples for evaluating the model. The stories are converted into features for training using text-analysis models such as:

  • Bag-of-words

  • TFIDF

  • LDA 

  • Word Embedding
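
As an illustration of the first two representations, here is a minimal sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The three stories are hypothetical toy examples, not rows from the project data, and the English stop-word setting is an assumption rather than the project's actual preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy stories (assumption: the real data is read from an Excel file).
stories = [
    "a family struggles with grief after a tragic loss",
    "two rival spies chase a stolen code across europe",
    "a young lawyer fights for justice in a divided town",
]

# Bag-of-words: one column per vocabulary term, each cell holds a raw count.
bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(stories)

# TF-IDF: same vocabulary, but terms that occur in many stories are down-weighted.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(stories)

print(X_bow.shape, X_tfidf.shape)   # (number of stories, vocabulary size)
print(sorted(bow.vocabulary_)[:5])  # first few vocabulary terms, alphabetically
```

Both calls return sparse matrices with one row per story, so either can be passed directly to the classifiers listed above.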

Each machine learning model is trained on each of these text representations, and their performance on the training data is compared to pick the best model for the evaluation data. Because the class labels are imbalanced, the comparison uses the ROC-AUC score and the F1 score.
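
To show why accuracy alone can mislead on imbalanced labels, here is a small self-contained sketch. The labels and scores are made-up toy values, and the helper functions `accuracy`, `f1` and `roc_auc` are hypothetical names written for this illustration (scikit-learn offers `f1_score` and `roc_auc_score` in `sklearn.metrics` for real use).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Baseline that always predicts the majority class.
base_scores = [0.5] * 10
base_pred = [0] * 10

# A toy model that ranks the positives higher but makes two mistakes at 0.5.
model_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.7, 0.4]
model_pred = [int(s >= 0.5) for s in model_scores]

for name, pred, scores in [("baseline", base_pred, base_scores),
                           ("model", model_pred, model_scores)]:
    print(name, accuracy(y_true, pred), f1(y_true, pred), roc_auc(y_true, scores))
```

Both toy classifiers have the same accuracy here, yet the F1 and ROC-AUC scores separate them clearly, which is the motivation for comparing models on those two metrics.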

Word Cloud from the cleaned data

On the initial imbalanced data, the model scores suggested that Logistic Regression outperformed the other models with both the Bag-of-words and TFIDF representations. After balancing the classes with class weights and oversampling methods, Logistic Regression with class weights and TFIDF emerged as the best-performing model, with an accuracy of 70.7%. Finally, after this model was retrained on the whole training dataset, it reached an accuracy of 71.27% on the evaluation data.
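
A rough sketch of the TFIDF plus class-weighted Logistic Regression combination is given below. It is not the project's actual code: the stories and labels are hypothetical toy data, and `class_weight="balanced"` and `max_iter=1000` are assumed settings, since the write-up does not state which weights or hyperparameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy stories with imbalanced labels (1 = Drama, 0 = Non-Drama).
stories = [
    "a widow rebuilds her life after a painful divorce",
    "a soldier returns home and confronts his past",
    "two estranged sisters reunite at their mother's funeral",
    "a teacher fights to keep a failing school open",
    "a crew of thieves plans a casino heist",
    "aliens invade earth and a pilot fights back",
]
labels = [1, 1, 1, 1, 0, 0]

# class_weight="balanced" (an assumed setting) reweights each class inversely
# to its frequency, so the minority class contributes equally to the loss.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(stories, labels)

# Estimated probability that an unseen story is Drama.
probs = model.predict_proba(["a grieving father searches for his lost son"])[:, 1]
print(probs)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned only from the data passed to `fit`, which keeps evaluation stories from leaking into the features.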

The Link to the Code

  • Github-Pro1

This project was a collaborative effort to which I dedicated 10 hours a week for one month. My responsibilities included cleaning the data before training and training some of the models.

At the end of the presentation, our team was named the best project team, with the remarks: good business understanding, best presentation, and highest model score for the project in the whole class.
