- This project develops a machine learning model that predicts whether a credit card transaction is fraudulent. 🕵️‍♂️💳
- Data analysis and cleaning 🧹
- Data exploration and visualization 🔍📊
- Training multiple machine learning models 🤖
- Evaluating and comparing models 📈
- Visualizing results and metrics 📉
- Python 🐍
- Pandas for data manipulation 🐼
- NumPy for numerical operations 🔢
- Matplotlib and Seaborn for data visualization 📊🎨
- Scikit-learn for machine learning modeling and evaluation 🤖
- Data Analysis and Cleaning 🧹
- Data Loading: Import the credit card transactions dataset 📥
- Basic Analysis: Explore the structure and basic statistics of the dataset 📊
- Duplicate Check: Identify and remove duplicate rows 🗑️
- Data Cleaning: Check for and handle missing values 🚫
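The loading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the tiny in-memory frame, the column names `amount` and `is_fraud`, and the helper `clean_transactions` are all hypothetical stand-ins, and dropping rows with missing values is only one possible way to handle them.

```python
import numpy as np
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then drop rows with any missing value."""
    deduplicated = df.drop_duplicates()
    cleaned = deduplicated.dropna()
    return cleaned.reset_index(drop=True)


# Made-up data standing in for the real dataset (hypothetical column names).
# With a real file this would be something like pd.read_csv(<path to dataset>).
raw = pd.DataFrame({
    "amount": [10.0, 10.0, 25.5, np.nan, 99.9],
    "is_fraud": [0, 0, 0, 1, 1],
})

# Basic analysis: summary statistics, duplicate count, missing values per column.
print(raw.describe())
print("duplicate rows:", raw.duplicated().sum())
print("missing values per column:")
print(raw.isna().sum())

cleaned = clean_transactions(raw)
print("rows before:", len(raw), "rows after:", len(cleaned))
```

In fraud data the positive class is rare, so it is worth checking how many fraudulent rows a blanket `dropna()` would discard before choosing it over imputation.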
- Exploratory Data Analysis (EDA) 🔍
- Target Variable Distribution: Visualize the distribution of fraudulent and non-fraudulent transactions 📊
- Correlation Matrix: Analyze the correlation between variables 🔗
- Feature Distribution: Visualize the distribution of each feature 📈
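As a rough sketch of the three EDA views listed above, the snippet below builds them with pandas and Matplotlib on synthetic data. The column names (`feature_a`, `feature_b`, `is_fraud`) and the 10% fraud rate are assumptions made for illustration only. The project also lists Seaborn, which could draw the heatmap instead of `imshow`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the transactions dataset (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "is_fraud": (rng.random(200) < 0.1).astype(int),
})

class_counts = df["is_fraud"].value_counts().sort_index()
corr = df.corr()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Target variable distribution as a bar chart.
class_counts.plot(kind="bar", ax=axes[0], title="Target distribution")

# 2. Correlation matrix as a heatmap.
image = axes[1].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1].set_xticks(range(len(corr.columns)))
axes[1].set_xticklabels(corr.columns, rotation=45)
axes[1].set_yticks(range(len(corr.columns)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("Correlation matrix")
fig.colorbar(image, ax=axes[1])

# 3. Distribution of a single feature as a histogram.
df["feature_a"].plot(kind="hist", bins=30, ax=axes[2], title="feature_a distribution")

fig.tight_layout()
# fig.savefig("eda_overview.png")  # uncomment to write the figure to disk
```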
- Data Preprocessing 🧪
- Feature and Target Separation: Split the dataset into features (X) and target variable (y) ✂️
- Dataset Splitting: Divide the data into training and testing sets 🧩
- Feature Scaling: Standardize the features so that scale-sensitive models such as logistic regression and SVM train well 📏
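The preprocessing steps above might look like the following sketch. It assumes scikit-learn's `StandardScaler` for the scaling step and a stratified 80/20 split. Neither the split ratio nor the stratification is stated in this README, and the data comes from `make_classification` as a stand-in for the real features and target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the features (X) and target (y).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify so the rare positive class keeps the same share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no statistics from the test data leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train shape:", X_train_scaled.shape, "test shape:", X_test_scaled.shape)
```

Fitting the scaler before splitting would let test-set means and variances influence training, which tends to make evaluation scores look better than they should.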
- Machine Learning Modeling 🤖
- Logistic Regression: Train and evaluate a logistic regression model 📉
- Random Forest: Train and evaluate a Random Forest model 🌳
- Support Vector Machine (SVM): Train and evaluate an SVM model 🧠
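A minimal sketch of training the three models listed above is shown below. The hyperparameters (`max_iter=1000`, `n_estimators=100`, the default RBF kernel) are illustrative assumptions rather than the notebook's settings, and the data is synthetic. Wrapping the scale-sensitive models in a pipeline keeps scaling tied to each model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the transactions data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters here are placeholders, not tuned values.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
}

predictions = {}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```

Accuracy is printed only as a quick sanity check. With so few fraudulent rows, a model that always predicts "not fraud" already scores high, which is why precision and recall matter more for the comparison.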
- Model Evaluation and Comparison 📈
- ROC Curve and AUC: Compare models using the ROC curve and area under the curve (AUC) 📊
- Precision-Recall Curve: Evaluate the precision and recall of the models 📉
- Additional Metrics: Calculate precision, recall, F1-score, and accuracy for each model 📏
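The metrics named above can be collected per model along these lines. This is a sketch on synthetic data with placeholder hyperparameters. The helper `continuous_scores` is hypothetical and exists because an `SVC` left at its defaults exposes `decision_function` rather than `predict_proba`. Average precision is used here as a single-number summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def continuous_scores(model, X):
    """Return a ranking score for the positive class (hypothetical helper)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = continuous_scores(model, X_test)
    results[name] = {
        "roc_auc": roc_auc_score(y_test, y_score),
        "avg_precision": average_precision_score(y_test, y_score),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "accuracy": accuracy_score(y_test, y_pred),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```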
- Logistic Regression: Good precision but lower recall 📉
- Random Forest: Best balance between precision and recall 🌳
- SVM: High precision but lower recall compared to Random Forest 🧠
- Target Variable Distribution: Bar chart showing the distribution of fraudulent and non-fraudulent transactions 📊
- Correlation Matrix: Heatmap showing the correlation between variables 🔗
- Feature Distribution: Histograms showing the distribution of each feature 📈
- ROC Curve: Plot comparing the ROC curves of the models 📉
- Precision-Recall Curve: Plot comparing the precision and recall of the models 📊
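The ROC and precision-recall comparison plots listed above could be produced along these lines. This sketch uses synthetic data and only two of the models to stay short, with default hyperparameters chosen for illustration. The resulting curves say nothing about the project's actual results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

fig, (roc_ax, pr_ax) = plt.subplots(1, 2, figsize=(12, 5))
curve_aucs = {}
for name, model in models.items():
    # Probability of the positive (fraud) class on the held-out split.
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    precision, recall, _ = precision_recall_curve(y_test, scores)
    curve_aucs[name] = auc(fpr, tpr)
    roc_ax.plot(fpr, tpr, label=f"{name} (AUC = {curve_aucs[name]:.2f})")
    pr_ax.plot(recall, precision, label=name)

# Diagonal reference line: the ROC curve of a random classifier.
roc_ax.plot([0, 1], [0, 1], linestyle="--", color="grey")
roc_ax.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curve")
pr_ax.set(xlabel="Recall", ylabel="Precision", title="Precision-recall curve")
roc_ax.legend()
pr_ax.legend()
fig.tight_layout()
# fig.savefig("model_curves.png")  # uncomment to write the figure to disk
```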
- Notebook
- Random Forest is the most balanced model for detecting credit card fraud 🌳
- Feature standardization and duplicate removal are crucial steps in data preprocessing 🧹
- Evaluating multiple metrics is essential for a comprehensive model comparison 📏
- For inquiries or collaborations, contact me at jotaduranbon.com 📧