
Titanic - Machine Learning from Disaster

Hindavi Churi

This machine learning project is about predicting whether the passengers travelling on the Titanic survived. Various attributes contribute to a passenger's survival chances, and the main challenge of this project is to work with these attributes and mine the patterns that prove useful for predicting the outcome.



Data

We have a train data set and a test data set. Both have the same attributes, except that only the train data set includes the survival status.



Data Pre-processing
Age

This attribute has many missing values. A common approach is to drop the rows with missing values, but since almost 20% of the data is missing, we instead fill these values with the mean of the Age attribute. This is convenient because the filled values do not introduce much bias.
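A minimal pandas sketch of this step (the file names train.csv and test.csv follow the Kaggle competition; the exact code in the original notebook may differ):

```python
import pandas as pd

# Kaggle's Titanic competition ships train.csv and test.csv
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Fill missing ages with the mean age from the training set,
# so the test set is imputed consistently
mean_age = train_df["Age"].mean()
train_df["Age"] = train_df["Age"].fillna(mean_age)
test_df["Age"] = test_df["Age"].fillna(mean_age)
```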




Looking at the data, we can say that passengers younger than 11, older than 60, and between 11 and 25 have higher survival rates. To capture this information and represent it in a form suitable for data mining, we assign a group value to each age group we create.
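One possible way to encode these groups, assuming the bin edges roughly match the ranges described above (the original notebook may use different cut-offs):

```python
def age_group(age):
    # Bin edges are an assumption based on the ranges mentioned above
    if age < 11:
        return 0
    elif age <= 25:
        return 1
    elif age <= 60:
        return 2
    return 3

train_df["Age"] = train_df["Age"].apply(age_group)
test_df["Age"] = test_df["Age"].apply(age_group)
```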



Gender

This attribute holds either Female or Male as a string value, so we map these values to 1 and 0 respectively.
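In the Kaggle data this column is called Sex and holds lowercase strings, so the mapping can look like this (a sketch, not necessarily the exact code used):

```python
# Female -> 1, Male -> 0, as described above
gender_map = {"female": 1, "male": 0}
train_df["Sex"] = train_df["Sex"].map(gender_map)
test_df["Sex"] = test_df["Sex"].map(gender_map)
```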


Embarked

This attribute also has missing values. Here we fill or replace them with the most common value, since we cannot use the mean because the values are strings.
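A sketch of the mode-based fill; the integer encoding at the end is an extra assumption (scikit-learn needs numeric features) and may be handled differently in the full code:

```python
# Replace missing embarkation ports with the most frequent value
most_common_port = train_df["Embarked"].mode()[0]
train_df["Embarked"] = train_df["Embarked"].fillna(most_common_port)
test_df["Embarked"] = test_df["Embarked"].fillna(most_common_port)

# Assumed extra step: encode the ports (S, C, Q) as integers for the classifier
port_map = {"S": 0, "C": 1, "Q": 2}
train_df["Embarked"] = train_df["Embarked"].map(port_map)
test_df["Embarked"] = test_df["Embarked"].map(port_map)
```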



Family_size and Is_alone

We create new columns from the Parch and SibSp columns. The new attributes tell us how many family members a passenger has aboard and whether the passenger is alone. This insight is useful because we can observe that passengers who are alone or have just one other family member with them have better survival chances.
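A sketch of how these columns can be built; whether the count includes the passenger themselves (the + 1) is an assumption:

```python
for df in (train_df, test_df):
    # Family_size = siblings/spouses + parents/children + the passenger themselves
    df["Family_size"] = df["SibSp"] + df["Parch"] + 1
    # Is_alone is 1 when the passenger has no family members aboard
    df["Is_alone"] = (df["Family_size"] == 1).astype(int)
```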



Drop columns

We drop the Name, Ticket, Fare and Cabin attributes, since they are mostly unique per passenger and do not contribute significantly to the prediction.
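In pandas this is a single drop call:

```python
drop_cols = ["Name", "Ticket", "Fare", "Cabin"]
train_df = train_df.drop(columns=drop_cols)
test_df = test_df.drop(columns=drop_cols)
```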


Data training and Prediction

We split the train data into X_train, which contains all the attributes of the train data set except the survival status, and Y_train, which is the survival status.
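A sketch of the split; setting PassengerId aside is an assumption, since the post does not say how that column is handled:

```python
# Features are every column except the target; PassengerId is kept only
# for building the submission file later (assumed, not stated in the post)
X_train = train_df.drop(columns=["Survived", "PassengerId"])
Y_train = train_df["Survived"]
```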


We use a Random Forest Classifier for our project because, among all the classifiers we tried, it gives the highest accuracy. A Random Forest Classifier fits a number of decision tree classifiers on sub-samples of the dataset and averages their outputs to improve the predictive accuracy.
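A minimal scikit-learn sketch; the hyperparameters are assumed, since the post does not state them, and it is not specified whether the accuracy figure below comes from the training data, a hold-out split, or the Kaggle leaderboard:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators and random_state are assumed values
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, Y_train)

# Accuracy on the training data (the evaluation behind the reported
# figure may differ, e.g. a hold-out split or the Kaggle leaderboard)
print(round(model.score(X_train, Y_train) * 100, 2))
```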


We get an accuracy of 81.25%.


Finally, we make predictions using the test data.
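A sketch of the prediction step, writing a Kaggle-style submission file (the output file name is hypothetical):

```python
# Predict survival for the test passengers
X_test = test_df.drop(columns=["PassengerId"])
predictions = model.predict(X_test)

# Kaggle expects a CSV with PassengerId and Survived columns
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)  # hypothetical file name
```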


Code






Contribution

For my contribution, I performed the data pre-processing on the training and testing data sets. All the pre-processing steps are explained above; they were not included in the tutorial that was suggested.








