Project 6: Classifying Disasters on Social Media

October 8, 2020

This project is my take on the “Real or Not? NLP with Disaster Tweets” Kaggle competition, in which we are given a set of tweets, each of which may or may not be about a real disaster.

All of these tweets were selected based on keywords such as “ablaze”, and it’s up to the competitor to build a model that determines whether these words were used in the context of an actual event.

While the true labels of the test set are public (which explains the abnormally high number of submissions with 100% accuracy on the leaderboard), my goal was to build the most efficient model and to get the highest accuracy I could by using machine learning algorithms.

While XGBoost, Random Forest and SVM models all gave reliable predictions, I once again saw how important feature engineering is when training models: it was what had the most impact when I worked on improving my score.

By focusing on creating new variables and reducing the dimensions of the document-feature matrices with singular value decomposition, I was able to submit predictions with accuracies similar to those of submissions that relied on deep-learning models rather than feature engineering.
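
To make the idea concrete, here is a minimal sketch of that kind of pipeline: build a document-term count matrix, reduce it with a truncated singular value decomposition, and append a few hand-made variables. It is not the code from the notebook (which, judging by the RPubs link, is presumably written in R). It is a Python illustration using NumPy, and the toy tweets, the three engineered variables and the choice of k = 2 are all assumptions made up for the example.

```python
import re
import numpy as np

# Hypothetical toy tweets, invented for illustration (not competition data).
tweets = [
    "Forest fire near the highway, residents evacuated",
    "My new mixtape is straight fire!!!",
    "Flood warning issued for the river valley http://example.com",
    "I am flooded with homework tonight",
]

def tokenize(text):
    """Lowercase a tweet and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Build a dense document-term count matrix and its vocabulary."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: j for j, tok in enumerate(vocab)}
    dtm = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            dtm[i, index[tok]] += 1
    return dtm, vocab

def truncated_svd(matrix, k):
    """Keep the k strongest singular directions: rows become k-dim vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def handcrafted_features(docs):
    """Assumed example variables: character length, '!' count, URL flag."""
    return np.array(
        [[len(d), d.count("!"), float("http" in d)] for d in docs],
        dtype=float,
    )

dtm, vocab = document_term_matrix(tweets)
reduced = truncated_svd(dtm, k=2)  # k=2 is arbitrary for this toy example
features = np.hstack([reduced, handcrafted_features(tweets)])
print(dtm.shape, "->", features.shape)
```

The resulting `features` matrix is what would then be passed to a classifier such as a random forest or an SVM. On a real corpus the count matrix is sparse and far wider, so a sparse truncated SVD routine would be a better fit than the dense `np.linalg.svd` call used here.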

Link to the RPub notebook

Link to the GitHub repository (includes a ReadMe notebook)