This project uses machine learning models to explore and predict insights from the top films of the last 20 years. Using the TMDb dataset, we analyse key factors that influence the success of films, focusing on revenue prediction and genre classification.
We analyzed the TMDb Movies Dataset, a comprehensive resource with over 1 million movies, updated daily. It includes information such as titles, ratings, release dates, revenue, genres, and more.
- Data: Contains raw and cleaned datasets in CSV format.
- Notebooks: Jupyter notebooks with data cleaning, preprocessing, model training and analysis code.
- Slides: Presentation slides.
Data cleaning was critical due to the large dataset size (1 million+ records):
- Dropped irrelevant columns to streamline the dataset.
- Filtered data: Retained only movies released in the last 20 years, focusing on "Released" status.
- Outlier handling: Removed entries with unusual values for runtime, budget, and revenue.
- Sorted by revenue to prioritize high-performing movies.
Feature engineering was essential to improve model inputs. We used techniques such as:
- Scaling: Applied MinMaxScaler to normalize numeric variables.
- Date Features: Split the release_date into release_month and release_day.
- Genres: Normalized and tokenized genres for multi-label classification.
- Text Features: Extracted key trends from movie titles for feature representation.
Revenue Prediction Model: RandomForestRegressor
- Input Features: budget, runtime, release_year
- Evaluation Metric: Mean Absolute Error (MAE)
- Performance: Achieved MAE of $X.
- Genre Classification:
Revenue Prediction Model: RandomForestClassifier
- Input Features: budget, runtime, release_year
- Evaluation Metric: Accuracy
- Performance: Achieved accuracy of X%.
Validation Techniques
- Train-Test Split: 80/20 split for training and testing.
- Cross-Validation: Employed k-fold cross-validation to ensure robust performance.
We optimized models using the following techniques:
- Random Search & Grid Search: Tuned hyperparameters like the number of estimators and tree depth.
- Feature Pruning: Reduced features to minimize noise and enhance interpretability.
- Revenue Drivers: High budgets and longer runtimes correlate with higher revenue but show diminishing returns beyond a threshold.
- Genre Trends: Action, adventure, and family genres dominate top revenue earners.
- Title Impact: Short, impactful titles with emotional or intriguing keywords perform better.