This project tries to predict the daily direction of Tesla's stock price (up or down) from the tweets posted by Elon Musk.
- The tweets dataset contains all of Elon Musk's tweets from November 16, 2012 to September 29, 2017. This dataset was taken from Kaggle.
- The stock price data was downloaded using pandas_datareader (refer to the code for more information). It contains the Date, the Open, Close, Low and High prices, and the Volume for Tesla's stock from November 16, 2012 to September 29, 2017.
- Extract date from the column called "Time".
- Sort the dataset by date, in ascending order.
- Drop unwanted columns like "row ID", "Time", "Retweet from" and "User".
- To merge this dataset with the stock price data, it needs one row per day. Since there can be multiple tweets in a day, concatenate all the tweets posted on the same day.
- Merge with the stock price dataset on the date.
- Calculate the price difference with the formula price_diff = Close - Open, which shows whether the stock closed above or below its opening price that day.
- If price_diff is positive, set the target variable to 1; otherwise set it to 0.
- Drop unwanted columns like "High", "Low", "Volume", "Adj Close" and "price_diff".
- Save the merged dataset in CSV format.
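
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the project's actual code: the tweet text column name "Tweet", the "target" column name, the price values, and the assumption that the stock data has "Date" as a regular column are all placeholders.

```python
import pandas as pd

# Made-up stand-ins for the Kaggle tweets and the downloaded stock prices.
tweets = pd.DataFrame({
    "row ID": [1, 2, 3],
    "Tweet": ["Model 3 is here", "Great launch today", "New factory update"],
    "Time": ["2017-09-28 09:15:00", "2017-09-28 18:40:00", "2017-09-29 11:05:00"],
    "Retweet from": [None, None, None],
    "User": ["elonmusk", "elonmusk", "elonmusk"],
})
stock = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-28", "2017-09-29"]),
    "Open": [100.0, 103.0],
    "High": [104.0, 104.0],
    "Low": [99.0, 100.0],
    "Close": [103.0, 101.0],
    "Adj Close": [103.0, 101.0],
    "Volume": [1000, 1200],
})

# Extract the date from "Time", sort by date, and drop unused columns.
tweets["Date"] = pd.to_datetime(tweets["Time"]).dt.normalize()
tweets = tweets.sort_values("Date", kind="stable")
tweets = tweets.drop(columns=["row ID", "Time", "Retweet from", "User"])

# One row per day: concatenate all tweets posted on the same day.
daily = tweets.groupby("Date")["Tweet"].agg(" ".join).reset_index()

# Merge with the stock prices and derive the binary target.
merged = daily.merge(stock, on="Date", how="inner")
merged["price_diff"] = merged["Close"] - merged["Open"]
merged["target"] = (merged["price_diff"] > 0).astype(int)
merged = merged.drop(columns=["High", "Low", "Volume", "Adj Close", "price_diff"])

# With no path, to_csv returns the CSV text; pass a file path to write to disk.
csv_text = merged.to_csv(index=False)
print(merged)
```

An inner merge keeps only the days that have both tweets and a stock price, so weekends and market holidays drop out; whether the project handled those days this way is an assumption.
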
- Tokenize and lemmatize each tweet, remove unwanted words, and store the cleaned text in a column called "new_tweet".
- Use TfidfVectorizer to convert the cleaned tweets into TF-IDF features and store them in X as an array.
- Convert X to a dataframe and add a new column called "len_tweets", which stores the length of each new_tweet.
- Apply a train/test split to the dataset with test_size = 0.2.
- For this project, we tried three algorithms: Logistic Regression, XGBoost and a Naive Bayes classifier.
For model selection, we compare accuracy, precision and recall, one at a time.
Accuracy: XGBoost is the best model with an accuracy of ~61%, followed by Logistic Regression with ~55%.
Precision: XGBoost is the best model with a precision of 0.61, followed by Logistic Regression with 0.56.
Recall: XGBoost is the best model with a recall of 0.61, followed by Logistic Regression with 0.56.
Overall, XGBoost is the best of the three algorithms for this task, as it scores highest on all three metrics.
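
For reference, the three metrics can be computed with scikit-learn as in the sketch below. The labels are made up to show how the metrics differ from one another; they are not the project's predictions, and the averaging mode used for the reported precision and recall is not stated in this write-up, so the default binary mode here is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 1 = price went up that day, 0 = it did not.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# 3 true positives, 2 true negatives, 2 false positives, 1 false negative.
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 5 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```
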
- Since the best accuracy was only ~61%, the most direct way to improve the model is to get more data.
- Another way to improve the model is to add more features that better distinguish between tweets.
- Try other machine learning algorithms, such as a decision tree classifier or random forest.