![Drama Globe.jpg](https://static.wixstatic.com/media/63a1ce_4c13e92da487453db298d6bcfeedb04b~mv2.jpg/v1/fill/w_323,h_318,al_c,q_80,usm_0.66_1.00_0.01,enc_avif,quality_auto/63a1ce_4c13e92da487453db298d6bcfeedb04b~mv2.jpg)
Movie Genre Prediction - NLP
Want to know what type of movie it is based only on its story or a review?
This data science project explores how a movie's genre can be predicted from its story or review, with a view to applying the idea in recommendation systems such as those used by Netflix and Amazon Prime, or in search engine recommendations from Google, Yahoo, and others.
Tools Used: Python, Jupyter Notebook
Libraries Used: NLTK, scikit-learn, Pandas, Matplotlib
Models: Logistic Regression, Multinomial Naive Bayes, Stochastic Gradient Descent, Extreme Gradient Boosting, K-Nearest Neighbors, Random Forest and Support Vector Machine.
The focus of this project is on classifying whether a movie is Drama or Non-Drama based on its story, using text-analysis models together with machine learning algorithms. The data consists of 20,000 samples for training and 3,498 samples for evaluating the model, stored in an Excel file and accessed using the Pandas library. The data is formatted for training using various text-analysis models:
- Bag-of-words
- TFIDF
- LDA
- Word Embedding
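To make the first two representations concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the project's actual code: the whitespace tokenizer stands in for real NLTK preprocessing, the two plot summaries are made up, and the helper names (`tokenize`, `bag_of_words`, `tfidf`) are hypothetical. The TF-IDF variant shown is the plain `count * log(N / df)` form; scikit-learn's `TfidfVectorizer` applies smoothing and L2 normalization by default, so its numbers would differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf(docs):
    """Weight raw counts by the plain inverse document frequency log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]


# Hypothetical plot summaries, not taken from the project's dataset.
stories = [
    "a family struggles with loss",
    "a detective chases a robot",
]
vocab, bow = bag_of_words(stories)
_, weighted = tfidf(stories)
```

A word such as "a" that appears in every story gets a TF-IDF weight of zero, while a word unique to one story keeps a positive weight. That down-weighting of common words is the main difference from raw Bag-of-words counts.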
Several machine learning models are trained on the output of each text-analysis model, and their performance on the training data is compared to pick the best model for the evaluation data. Since the class labels are imbalanced, the ROC-AUC score and F1 score are used for the comparison.
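A small sketch of why these two scores are preferred over plain accuracy when classes are imbalanced. Everything here is illustrative: the labels and scores are invented, `1` is assumed to mark the minority class, and the helper names (`accuracy`, `f1_binary`, `roc_auc_pairwise`) are hypothetical stand-ins for what scikit-learn offers as `accuracy_score`, `f1_score`, and `roc_auc_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc_pairwise(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is quadratic in the number of
    samples, which is fine for a toy example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy data: 2 positives out of 10 samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.1]

always_negative = [0] * len(y_true)
thresholded = [1 if s >= 0.35 else 0 for s in scores]

acc_naive = accuracy(y_true, always_negative)
f1_naive = f1_binary(y_true, always_negative)
f1_model = f1_binary(y_true, thresholded)
auc_model = roc_auc_pairwise(y_true, scores)
```

A classifier that always predicts the majority class still scores 80% accuracy on this toy data, yet its F1 score is zero. F1 and ROC-AUC both expose that failure, which is why they are the more useful comparison here.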
![WDC_edited.png](https://static.wixstatic.com/media/63a1ce_6519efa1b4be458296faf4024ad3a267~mv2.png/v1/fill/w_250,h_226,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/WDC_edited.png)
Word Cloud from the cleaned data
Model scores on the initial imbalanced data suggested that Logistic Regression outperformed the other models with the Bag-of-words and TFIDF models. After further balancing the data with class weights and oversampling methods, Logistic Regression with class weights and the TFIDF model emerged as the best-performing model, with an accuracy of 70.7%. Finally, with this model trained on the whole training dataset, the evaluation data yielded an accuracy of 71.27%.
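The two balancing approaches mentioned above can be sketched as follows. This is a hedged illustration, not the project's code: the 3-to-7 class split is invented, the helper names are hypothetical, and the oversampling shown is simple random duplication, since the write-up does not say which oversampling method was used. The weight formula `n_samples / (n_classes * class_count)` is the heuristic scikit-learn applies when `class_weight="balanced"` is passed to a model such as `LogisticRegression`.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, cnt in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - cnt):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Invented toy split; the real class ratio is not stated in the write-up.
labels = ["drama"] * 3 + ["non-drama"] * 7
samples = [f"story {i}" for i in range(len(labels))]

weights = balanced_class_weights(labels)
over_x, over_y = random_oversample(samples, labels)
resampled_counts = Counter(over_y)
```

Class weights leave the data untouched and instead make errors on the rarer class cost more during training, whereas oversampling changes the training set itself. Either one should be applied to the training data only, so that the evaluation set keeps its original class distribution.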