Suppose we were a large book publisher and a writer came to us with a manuscript: how could we know whether the book would be successful? And if we were the authors, could we ever know whether the book would win the audience's sympathy, or even reach the cinema? To answer these questions, we set a goal for our research: to see whether we can build a model that predicts, from a list of book features, whether a book is successful enough to also win awards.
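The goal above can be framed as a binary classification problem: given numeric book features, predict whether the book wins an award. A minimal sketch using scikit-learn's RandomForestClassifier on entirely synthetic data (the feature names and the "awarded" rule below are illustrative assumptions, not the notebook's real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic book features (hypothetical columns, for illustration only):
# average rating, number of ratings, page count, edition count
X = np.column_stack([
    rng.uniform(1.0, 5.0, n),       # avg_rating
    rng.integers(10, 100_000, n),   # ratings_count
    rng.integers(50, 1_200, n),     # num_pages
    rng.integers(1, 50, n),         # editions_count
])

# Toy label: pretend highly rated, widely rated books get awarded
y = ((X[:, 0] > 3.8) & (X[:, 1] > 20_000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

The real experiment (sections below) scrapes actual book data, cleans it, and compares a single decision tree against a tuned random forest.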
- Introduction
- Imports
- Data acquisition
3.1 Scraping challenges
3.2 Scraping clean data
3.3 Authentication process
3.4 Authentication class
3.5 Scraping Process
3.6 Book Spider Class
3.7 Scraping route creation
3.8 Genre spider
- Scraping and threading
4.1 First crawl
4.2 Concatenating data
4.3 Total data scraped
- Data cleaning
5.1 Corrupted data cleaning
5.2 Replace missing data - original title
5.3 None values - discussion and strategy
- Pre-outliers-cleaning EDA
6.1 Genre distribution
6.2 Mean rating by genre
6.3 Language distribution
6.4 Edition count to rating
6.5 Rating to award
6.6 Pages count to books count
- Dealing with outliers
7.1 Outliers detection
7.2 Outliers cleaning
7.3 Outliers cleaning results
- EDA after outliers cleaning
8.1 Thoughts on the results
8.2 Aggregation metrics
8.3 Original title correlation with awards
8.4 Awards count per genre
8.5 Awards percentage by genre
- Machine learning preparation
- Machine learning - Decision tree
10.1 Single decision tree
10.2 First prediction
10.3 New dimension - The ace up the sleeve
10.4 Depth optimization
- Machine learning - Random forest
11.1 Overfitting?
11.2 Model improvement
11.3 Adjusting features
11.4 Grid search over many forests
11.5 F-score accuracy addition
11.6 Random state tests
- Conclusion and credits
For the full implementation, visit the hosted notebook:
https://chapost1.github.io/books-success-prediction-experiment/