GitHub - claracachapa/next-best-movie

The Next Best Movie

Project Overview

This project uses machine learning models to explore and predict insights from the top films of the last 20 years. Using the TMDb dataset, we analyse key factors that influence the success of films, focusing on revenue prediction and genre classification.

Datasets

We analyzed the TMDb Movies Dataset, a comprehensive resource with over 1 million movies, updated daily. It includes information such as titles, ratings, release dates, revenue, genres, and more.

Project Structure

Data: Contains raw and cleaned datasets in CSV format.
Notebooks: Jupyter notebooks with data cleaning, preprocessing, model training and analysis code.
Slides: Presentation slides.

Analysis Approach

1. Data Cleaning

Data cleaning was critical due to the large dataset size (1 million+ records):

Dropped irrelevant columns to streamline the dataset.
Filtered data: Retained only movies released in the last 20 years, focusing on "Released" status.
Outlier handling: Removed entries with unusual values for runtime, budget, and revenue.
Sorted by revenue to prioritize high-performing movies.

2. Feature Engineering and Selection

Feature engineering was essential to improve model inputs. We used techniques such as:

Scaling: Applied MinMaxScaler to normalize numeric variables.
Date Features: Split the release_date into release_month and release_day.
Genres: Normalized and tokenized genres for multi-label classification.
Text Features: Extracted key trends from movie titles for feature representation.

3. Model Building and Evaluation

Revenue Prediction Model: RandomForestRegressor

Input Features: budget, runtime, release_year
Evaluation Metric: Mean Absolute Error (MAE)
Performance: Achieved MAE of $X.
Genre Classification:

Revenue Prediction Model: RandomForestClassifier

Input Features: budget, runtime, release_year
Evaluation Metric: Accuracy
Performance: Achieved accuracy of X%.

Validation Techniques

Train-Test Split: 80/20 split for training and testing.
Cross-Validation: Employed k-fold cross-validation to ensure robust performance.

4. Hyperparameter Tuning and Model Optimization

We optimized models using the following techniques:

Random Search & Grid Search: Tuned hyperparameters like the number of estimators and tree depth.
Feature Pruning: Reduced features to minimize noise and enhance interpretability.

Key Findings & Implications

Revenue Drivers: High budgets and longer runtimes correlate with higher revenue but show diminishing returns beyond a threshold.
Genre Trends: Action, adventure, and family genres dominate top revenue earners.
Title Impact: Short, impactful titles with emotional or intriguing keywords perform better.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
data/clean		data/clean
notebooks		notebooks
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
The next best movie.pdf		The next best movie.pdf
config.yaml		config.yaml
hello.py		hello.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Next Best Movie

Project Overview

Datasets

Project Structure

Analysis Approach

1. Data Cleaning

2. Feature Engineering and Selection

3. Model Building and Evaluation

4. Hyperparameter Tuning and Model Optimization

Key Findings & Implications

About

Releases

Packages

Languages

claracachapa/next-best-movie

Folders and files

Latest commit

History

Repository files navigation

The Next Best Movie

Project Overview

Datasets

Project Structure

Analysis Approach

1. Data Cleaning

2. Feature Engineering and Selection

3. Model Building and Evaluation

4. Hyperparameter Tuning and Model Optimization

Key Findings & Implications

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages