This project is a Web Application Firewall (WAF) designed to protect web applications from malicious requests. By leveraging Machine Learning, specifically Logistic Regression, the WAF distinguishes between good (legitimate) and bad (malicious) requests. The solution is a proxy server that intercepts incoming requests, evaluates each one with a trained ML model, and allows or blocks it based on the prediction.
| Section | Description |
|---|---|
| Overview | Introduction to the Web Application Firewall (WAF) project. |
| Features | Key features of the WAF, including the proxy server, ML model, and logging. |
| Architecture | Overview of the components and workflow of the WAF. |
| Tech Stack | Technologies and tools used in the project. |
| Installation | Step-by-step guide to install the WAF. |
| Usage | Instructions on how to run and use the WAF. |
| Dataset | Details on the dataset used for training the ML model. |
| Machine Learning Model | Information on the ML model and training process. |
| Contributing | Guidelines for contributing to the project. |
Web Application Firewalls (WAFs) are critical components for protecting web applications from attacks such as SQL injection, Cross-Site Scripting (XSS), and other OWASP Top 10 vulnerabilities. This WAF uses a Logistic Regression model to classify incoming HTTP requests as either good or bad, enhancing the security of the web application it protects.
- Proxy Server: Intercepts incoming HTTP requests and forwards them to the web server if deemed safe.
- Machine Learning Model: Logistic Regression model trained to detect malicious requests.
- Real-Time Request Analysis: Analyzes and classifies requests in real-time.
- Logging: Logs all requests and their classification for auditing and further analysis.
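To illustrate the logging feature, here is a minimal sketch of how requests and their classifications could be written to an audit log with Python's logging module. The file name and log format are assumptions for illustration, not the project's actual configuration:

```python
# Illustrative audit logging with Python's logging module.
# The file name and format below are assumptions, not the project's actual setup.
import logging

logging.basicConfig(
    filename="waf_requests.log",                 # assumed log file name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("waf")

def log_request(method: str, path: str, verdict: str) -> None:
    """Record one request and the classification it received."""
    logger.info("%s %s classified as %s", method, path, verdict)

log_request("GET", "/index.html", "good")
```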
The architecture of the WAF is composed of the following components:
- Proxy Server: Acts as an intermediary between the client and the web server.
- Request Logger: Logs incoming requests for analysis and model training.
- Feature Extractor: Extracts relevant features from HTTP requests for ML model input.
- Logistic Regression Model: Trained model to classify requests as good or bad.
- Decision Engine: Uses the model's prediction to allow or block the request.
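These components interact roughly as follows. The sketch below is hypothetical: the function names `handle_request` and `extract_features` and the label encoding are assumptions for illustration, not the actual code in `proxy_server.py`.

```python
# Hypothetical sketch of the request-handling flow; names are illustrative,
# not the actual implementation in proxy_server.py.
import logging

logger = logging.getLogger("waf")

def handle_request(raw_request: str, model, extract_features):
    """Classify one request and decide whether to forward or block it."""
    features = extract_features(raw_request)        # Feature Extractor
    prediction = model.predict([features])[0]       # Logistic Regression model
    is_malicious = prediction == 1                  # assumes label 1 means "bad"
    logger.info("request classified as %s",
                "bad" if is_malicious else "good")  # Request Logger
    return not is_malicious                         # Decision Engine: True = forward, False = block
```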
- Programming Language: Python
- Machine Learning Library: Scikit-learn
- Data Handling: Pandas
- HTTP Handling: Requests
- Logging: Python's logging module
- Network Security: Integration of security best practices and protocols
- Web Security: Security measures to protect against common web vulnerabilities such as SQL injection and XSS.
- Clone the Repository:

  ```bash
  git clone https://github.com/Pratham-verma/Web_Application_Firewall.git
  ```

- Run the Proxy Server:

  ```bash
  python proxy_server.py
  ```

- Monitor Logs: Check the logs generated by the proxy server to see the classification of each request.
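Once the proxy is running, you can send test traffic through it. The snippet below is a hedged example: the proxy address and port are assumptions, so check `proxy_server.py` for the host and port it actually binds to.

```python
# Hedged usage example: send test requests through the WAF proxy.
# The address below is an assumption; check proxy_server.py for the real host/port.
import requests

PROXY_URL = "http://localhost:8080"  # assumed WAF proxy address

# A benign request that the model should classify as good and forward.
resp = requests.get(PROXY_URL + "/index.html", timeout=5)
print(resp.status_code)

# A suspicious request (SQL-injection style) that the model should classify as bad and block.
resp = requests.get(PROXY_URL + "/search", params={"q": "' OR 1=1 --"}, timeout=5)
print(resp.status_code)
```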
The dataset used for training the Logistic Regression model consists of labeled HTTP requests. Each request is classified as either good (legitimate) or bad (malicious). The dataset includes various features extracted from the HTTP headers, body, and other metadata.
To prepare the dataset:
- Collect a large number of HTTP requests from various sources.
- Label the requests as good or bad.
- Extract features from each request (a minimal sketch follows this list).
- Split the dataset into training and testing sets.
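As an example of the feature-extraction step, here is a minimal sketch of hand-crafted numeric features derived from a raw request string. The specific features (and the `extract_features` name) are illustrative assumptions, not necessarily the feature set used by this project:

```python
# Illustrative hand-crafted features for a raw HTTP request string;
# the choice of features is an assumption, not the project's actual feature set.
import re

def extract_features(raw_request: str) -> list:
    """Turn a raw request string into a fixed-length numeric feature vector."""
    lowered = raw_request.lower()
    return [
        len(raw_request),                                  # overall request length
        raw_request.count("'"),                            # single quotes (SQL injection hint)
        raw_request.count("<"),                            # angle brackets (XSS hint)
        len(re.findall(r"%[0-9a-fA-F]{2}", raw_request)),  # percent-encoded bytes
        int("select" in lowered or "union" in lowered),    # SQL keywords present
        int("<script" in lowered),                         # script tag present
    ]

print(extract_features("GET /search?q=' OR 1=1 -- HTTP/1.1"))
```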
The Logistic Regression model is trained using the prepared dataset. The model learns to identify patterns and features that distinguish good requests from bad ones.
- Prepare the Dataset: Ensure your dataset is in a suitable format (e.g., CSV) with labeled features.

- Train the Model:

  ```python
  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression

  # Load the dataset
  data = pd.read_csv('dataset.csv')
  X = data.drop('label', axis=1)
  y = data['label']

  # Split the dataset into training and testing sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Train the model
  model = LogisticRegression()
  model.fit(X_train, y_train)
  ```

- Evaluate the Model:

  ```python
  from sklearn.metrics import accuracy_score, classification_report

  # Predict on the test set and report accuracy plus per-class metrics
  y_pred = model.predict(X_test)
  print(accuracy_score(y_test, y_pred))
  print(classification_report(y_test, y_pred))
  ```
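Continuing from the snippets above, the trained model can be persisted so the proxy can load it at request time. This is a minimal sketch assuming joblib is used for persistence and that the file is named `waf_model.joblib`; the project may store and load the model differently.

```python
# Hedged sketch: persisting the trained model and reusing it for prediction.
# Assumes joblib and the file name 'waf_model.joblib'; continues from the training snippet above.
import joblib

# Save the trained model once, after training.
joblib.dump(model, "waf_model.joblib")

# Later (e.g. when the proxy server starts), load it and classify one request's features.
loaded_model = joblib.load("waf_model.joblib")
sample = X_test.iloc[[0]]            # one row of features, keeping the DataFrame shape
print(loaded_model.predict(sample))  # predicted label for that request
```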
Contributions are welcome! Please fork the repository and submit a pull request with your improvements.