
People-Interests Classification System

Hindavi Churi



There are various fields where grouping people becomes essential: college housing, office projects, academic projects, and many more. The criteria for grouping depend on the setting. For college housing, people can be grouped by similar personalities or hobbies; for projects, the criteria can be track record, how determined the person is, expertise, and so on.


In this project, we implement a classification system for people. We group people on a general basis rather than specializing in any specific area. We use a KNN classifier together with collaborative filtering and a similarity measure. We will see what kind of dataset is used, how the data cleaning is done, and how the classifier is built, and we further use it to recommend interests to users who did not rate some aspects that similar users rated.


DATA

Our project groups people based on factors such as personality and hobbies. For this, we collected an interests dataset from Kaggle, the link to which is here. The dataset has 1010 rows and 150 columns covering different interests and personal details.

A peek at the dataset can be seen below,



and the column names are given as,


The value in each row and column is the rating that the user (row) gave to that specific criterion (column) based on how much they like it. The rating is between 1 and 5 for most of the criteria, and some columns contain categorical data.


Data pre-processing

As we can see, the loaded data contains many criteria that are not wanted here, such as the number of siblings, left- or right-handedness, and more. These criteria may be useful for other, more focused projects, but not here. Hence, we drop such columns from our data.

Next, we set the value to 0 wherever a value is missing.
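A minimal sketch of this cleanup in pandas; the file name and the names of the dropped columns are assumptions, since the exact names in the Kaggle file may differ:

import pandas as pd

# Load the interests dataset (file name is an assumption).
df = pd.read_csv("responses.csv")

# Drop criteria that are not relevant to grouping; the column names here are assumed.
unwanted = ["Number of siblings", "Left - right handed"]
df = df.drop(columns=[c for c in unwanted if c in df.columns])

# Replace missing values with 0.
df = df.fillna(0)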


Label encoder:

Some columns contain categorical data, so we convert this data to numeric labels. This is done using the LabelEncoder module from scikit-learn.
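Continuing the sketch above, a hedged example of the encoding step; which columns are categorical is inferred from their dtype here, not taken from the original notebook:

from sklearn.preprocessing import LabelEncoder

# Convert every remaining text column to integer labels.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))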


Adjusting the features:

There are many columns describing music, movies, personality, spending habits, and so on. Working with such wide data becomes messy, so we group the columns by the category they belong to. For music, for example, we group columns such as pop, slow songs, and so on; a similar process is carried out for the other categories, as in the sketch below.
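A sketch of the grouping step, with hypothetical column lists per category (the real notebook groups all of the remaining columns); each group's rating is the average of its member columns, matching the description of the new data frame further below:

# Assumed mapping from group name to the original rating columns it contains.
groups = {
    "Music":       ["Pop", "Rock", "Slow songs or fast songs"],
    "Movies":      ["Horror", "Comedy", "Romantic"],
    "Personality": ["Energy levels", "Happiness in life"],
    "Spending":    ["Shopping centres", "Spending on looks"],
}

# Group rating = average of the member columns for each user.
grouped = pd.DataFrame({name: df[cols].mean(axis=1) for name, cols in groups.items()})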


Adding Label:

We add the label for classification ourselves. To label each user, we take into account their ratings and how they vary across the different criteria, and use this knowledge to label them as either "Less_interactive" or "Fun and Talk".
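The exact labelling rule is not spelled out here, so the sketch below only assumes a simple threshold on each user's average rating:

# Assumed rule: users with a higher overall average rating are labelled "Fun and Talk",
# the rest "Less_interactive".
grouped["Label"] = (grouped.mean(axis=1) >= 3).map(
    {True: "Fun and Talk", False: "Less_interactive"}
)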


The new data frame is shown below,

After processing, we get a total of 1006 rows and 5 columns. The rating value for the new data frame was calculated as the average of all the features for that group.

Hence, we have obtained a clean data frame to start training with. But before that, let's look at the methods and algorithms used.


KNN - K Nearest Neighbors

KNN is a supervised learning method that can be used both for classification and for regression. KNN is based on the idea that similar things stay close to each other. As with any classification method, the output is a discrete value that describes the data.


KNN works by taking the test sample as a point and gradually widening its search around that point to detect the training points that lie closest to it. It keeps expanding until it has found the K nearest points and then groups the test sample with them.
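A minimal NumPy illustration of this idea (not the project code): measure the distance from the test sample to every training point, keep the K closest, and let them vote on the label.

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # Distance from the test sample to every training point (Euclidean here).
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the K closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the K neighbours.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]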

The image below illustrates the KNN algorithm,

To estimate the closeness of a point with any other point, we have different metrics.

The different metrics are:

  • Euclidean distance

  • Manhattan distance

  • Cosine distance

and many more.

The cosine distance is mainly used to find the similarity between documents. There are many cases where a common distance measure such as Euclidean distance cannot capture the closeness between two documents: the documents can be logically close without being physically close. Here, the cosine measure estimates the angle between the two documents. Apart from documents, cosine measures can also be used for other data, such as ratings, where a direct distance measurement is not the right criterion.



Cosine Similarity

The cosine measure is a way to estimate the similarity between two vectors by measuring the angle between them. This measurement is very useful when two vectors do not look similar physically but are logically the same.

The diagram below explains the cosine similarity,

Since the cosine of 0 degrees is 1, two vectors pointing in the same direction are treated as overlapping even if the distance between them is large. Vectors at 90 degrees to each other can be regarded as different or dissimilar, since the cosine of 90 degrees is 0.

Hence, cosine similarity can give more insight than what can be detected with the bare eye.
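A small worked example of this point, using assumed rating vectors: two users with the same taste but different rating scales are far apart in Euclidean distance yet almost identical under cosine similarity.

import numpy as np

u = np.array([5, 4, 1, 1])   # enthusiastic rater
v = np.array([3, 2, 1, 1])   # reserved rater with a similar profile

euclidean = np.linalg.norm(u - v)
cosine_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(euclidean)    # about 2.83 - noticeably far apart
print(cosine_sim)   # about 0.98 - almost the same direction, i.e. very similar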


Collaborative filtering

Collaborative filtering is a method where new data is assigned to a group based on its similarity to existing data. It filters a large collection of data to find the group that the new data is most similar to and can belong to. It is a type of recommendation system.



As seen in the above image, the similarity of two users is estimated from the books they have read, and a new book is recommended to one user if the other, similar user has already read it.


This type of method is useful when there is no content describing the items against which new data could be matched; when such content is available, content-based filtering can be used instead. Collaborative filtering works in three basic steps: finding the similarity between users, predicting the rating the new user would give based on the similar users' ratings, and checking how accurate the recommendation was.

In our project, we use a collaborative filtering technique since our data consists of users and items.


Training

The training uses the KNN classifier with cosine similarity between users, based on their ratings for the different criteria. We split the data into train and test sets with a 30% test size and use the KNN classifier directly from the scikit-learn module, trying both the default distance metric and cosine similarity.
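A sketch of this step, assuming the grouped data frame built in the pre-processing section; in scikit-learn, metric="cosine" is used together with the brute-force search.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = grouped.drop(columns="Label")
y = grouped["Label"]

# 70/30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN classifier using cosine distance between users.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the 30% test split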


Contribution and Experiments:

Different K values:

We try the KNN classifier for different K values and find different accuracies for each value.

The accuracies obtained are shown below:

We can see the variation from the below graph:

The accuracy for k=2 is the highest, which may be because of the binary label that is used. However, there is not much difference between the accuracies, and all are feasible.


Cosine similarity:

We try cosine similarity because it captures logical closeness rather than physical closeness.

The results for cosine similarity at the different K values are shown above.

Whereas, when we use the default metric for classification, the results rise greatly,

The variation is seen below,

As we can see, the graph stays constant from k=3 onwards, and the accuracies we obtain are almost 100%. This can be an indication of overfitting.
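Both comparisons above (varying K, and cosine versus the default metric) can be reproduced with a small sweep; this sketch reuses the split from the Training section.

# Sweep K for the cosine metric and the default Minkowski (Euclidean) metric.
for metric in ["cosine", "minkowski"]:
    for k in range(2, 8):
        model = KNeighborsClassifier(n_neighbors=k, metric=metric, algorithm="brute")
        model.fit(X_train, y_train)
        print(metric, k, round(model.score(X_test, y_test), 3))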


Hence, for our model, we choose cosine similarity as our measure.


TOP-3 features:

We estimate the top 3 features for each user. These features come from the original dataset, and we select them from each category, such as music, movies, personality, etc. Hence, for each user, we show the top 3 interests associated with each category.

The results can be seen below,

This result is for the music category. We obtained similar results for the other categories.
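A sketch of how the top 3 can be extracted for one category, assuming df still holds the original rating columns and groups is the category mapping from the grouping step:

# For every user, pick the 3 highest-rated music columns.
top3_music = df[groups["Music"]].apply(lambda row: row.nlargest(3).index.tolist(), axis=1)
print(top3_music.head())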


Recommendation system:

We further use the model to build a recommender system. A few values in the data frame are 0, indicating that the particular user did not rate that category.

Based on the knowledge we obtained, we predict a rating for those values using the users most similar to that user.

The results are shown below:

This can partly be described as a KNN regression method.
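A sketch of the idea, using scikit-learn's NearestNeighbors as an assumption consistent with the rest of the post: for a category a user left at 0, average what the most similar users gave for that category.

from sklearn.neighbors import NearestNeighbors

# Fit a neighbour index on the grouped ratings (X from the Training section).
nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(X)

def predict_missing(user_idx, category):
    # Find the user's nearest neighbours and drop the user itself.
    _, idx = nn.kneighbors(X.iloc[[user_idx]])
    neighbours = idx[0][1:]
    ratings = X.iloc[neighbours][category]
    ratings = ratings[ratings > 0]          # ignore neighbours who also did not rate it
    return ratings.mean() if len(ratings) else 0

print(predict_missing(0, "Music"))          # predicted rating for an unrated category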



Challenges

  • The biggest challenge was collecting the data. Getting a suitable dataset was not an easy job considering the goal we wanted to achieve. Finally, I settled on modifying an existing dataset to fit the classification model that I wanted to build.

  • The next challenge was data cleaning and associating each feature clearly. It took up a larger share of the project time, but it came to a good end.


CODE






References



  • Cosine similarity: https://www.delftstack.com/howto/python/cosine-similarity-between-lists-python/

  • Image: https://www.machinelearningplus.com/nlp/cosine-similarity/




