Malicious URL Detection System

A machine learning project for detecting malicious URLs using Random Forest and Logistic Regression classifiers. This project includes a REST API for real-time predictions and a live dashboard for visualizing model performance.

About This Project

This is my machine learning project where I built a system to classify URLs as malicious or benign. I learned about:

Feature engineering for URL analysis
Machine learning classification models
REST API development with Flask
Data visualization with Plotly
Model deployment and serving

Features

Binary Classification: Classifies URLs as malicious or benign with about 94% accuracy on the 2,000-sample test split
Feature Extraction: Extracts 40+ features from URLs using regex-based tokenization
REST API: Real-time URL prediction with under 120 ms average latency
Live Dashboard: Interactive dashboard with Plotly visualizations
Two Models: Random Forest and Logistic Regression for comparison
Docker Support: Easy deployment with Docker

Project Structure

Malicious/
├── app.py                          # Main Flask application
├── train.py                        # Model training script
├── test_api.py                     # API testing script
├── requirements.txt                # Python dependencies
├── Dockerfile                      # Docker configuration
├── docker-compose.yml              # Docker Compose setup
├── src/
│   ├── preprocessing/
│   │   └── feature_extractor.py    # Feature extraction module
│   ├── models/
│   │   ├── model_trainer.py        # Model training
│   │   └── model_predictor.py      # Model prediction
│   └── dashboard/
│       └── dashboard.py            # Dashboard visualization
├── data/
│   ├── raw/                        # Raw dataset (if you have one)
│   └── processed/                  # Processed data
└── models/                         # Trained model files (generated)

Installation

Prerequisites

Python 3.8 or higher
pip

Setup

Clone the repository:

git clone <repository-url>
cd Malicious

Create virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Train the model:

python train.py

Run the application:

python app.py

The API will be available at http://localhost:5000

Usage

Web Interface

The easiest way to test URLs is through the web interface:

Open: http://localhost:5000/test
Enter a URL and click "Check URL"

REST API

Single URL Prediction

curl -X POST http://localhost:5000/api/predict \
  -H "Content-Type: application/json" \
  -d '{"url": "http://example.com", "model": "random_forest"}'

Response:

{
  "url": "http://example.com",
  "prediction": "benign",
  "probability": 0.95,
  "benign_probability": 0.95,
  "malicious_probability": 0.05,
  "model": "random_forest"
}

Batch Prediction

curl -X POST http://localhost:5000/api/predict/batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["http://example.com", "http://github.com"]}'
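
The same two calls can also be made from Python. The sketch below uses only the standard library and assumes the server is running locally on port 5000 and accepts the request fields shown in the curl examples. The helper names (`build_request`, `send`, `predict_url`, `predict_batch`) are illustrative and not part of the project, and the shape of the batch response is not documented here, so it is returned as decoded JSON without further parsing.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: default address from the setup steps


def build_request(path, payload):
    """Build a JSON POST request for the given API path (nothing is sent yet)."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send a prepared request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def predict_url(url, model="random_forest"):
    """Classify one URL via POST /api/predict."""
    return send(build_request("/api/predict", {"url": url, "model": model}))


def predict_batch(urls):
    """Classify several URLs via POST /api/predict/batch."""
    return send(build_request("/api/predict/batch", {"urls": urls}))
```

With the server running, `predict_url("http://example.com")` should return a dictionary with the same keys as the single-prediction response shown earlier.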

Dashboard

Access the live dashboard at: http://localhost:5000/dashboard

The dashboard shows:

Real-time statistics
Prediction distribution charts
Response time histograms
Prediction timeline

API Endpoints

GET / - API information
GET /api/health - Health check
POST /api/predict - Single URL prediction
POST /api/predict/batch - Batch URL prediction
GET /api/stats - Model statistics
GET /dashboard - Performance dashboard
GET /test - Web interface for testing URLs

Model Details

Training Data

Total Samples: 10,000
- Benign URLs: 5,000
- Malicious URLs: 5,000
Train/Test Split: 80/20
- Training: 8,000 samples
- Testing: 2,000 samples
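
The split sizes above follow directly from an 80/20 split of 10,000 samples. As a minimal sketch, assuming the split is stratified by class (this README does not say whether it is), the following standard-library helper reproduces the 8,000/2,000 counts while keeping both parts balanced. `stratified_split` is a hypothetical helper, not the project's code.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test


# 5,000 benign (0) and 5,000 malicious (1) labels, matching the counts above.
labels = [0] * 5000 + [1] * 5000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 8000 2000
```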

Models

Random Forest
- 100 trees
- Max depth: 20
- Accuracy: ~94%
Logistic Regression
- Max iterations: 1000
- Solver: lbfgs
- Accuracy: ~94%
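
For reference, the hyperparameters listed above map onto scikit-learn roughly as follows. This is a sketch rather than the contents of `model_trainer.py`: the function names, the `logistic_regression` key, the fixed random seed, and the stratified split are assumptions, and the data in the demo is synthetic stand-in data from `make_classification`, so its accuracy says nothing about the real models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def build_models(seed=42):
    """Create the two classifiers with the hyperparameters listed above."""
    return {
        "random_forest": RandomForestClassifier(
            n_estimators=100, max_depth=20, random_state=seed
        ),
        "logistic_regression": LogisticRegression(
            max_iter=1000, solver="lbfgs", random_state=seed
        ),
    }


def compare_models(X, y, seed=42):
    """Fit both models on an 80/20 split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    # Synthetic stand-in data: 40 numeric features, two roughly balanced classes.
    X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
    for name, accuracy in compare_models(X, y).items():
        print(f"{name}: {accuracy:.3f}")
```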

Features

The model extracts 40+ features from URLs including:

URL structure (length, domain, path, query)
Special character counts
TLD analysis
Entropy calculations
Suspicious keyword detection
Pattern matching
Tokenization features
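
To make the feature categories above concrete, here is a small illustrative extractor covering a handful of them (structure lengths, character counts, TLD length, entropy, keyword hits, and regex tokenization). It is a sketch and not the project's `feature_extractor.py`: the feature names, the keyword list, and the tokenization regex are assumptions, and it produces far fewer than 40 features.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: an illustrative keyword list, not the project's actual one.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url):
    """Return a small dictionary of numeric features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    # Regex-based tokenization: split on any run of non-alphanumeric characters.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", lowered) if t]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "query_length": len(parsed.query),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "special_char_count": len(re.findall(r"[^A-Za-z0-9]", url)),
        "tld_length": len(host.rsplit(".", 1)[-1]) if "." in host else 0,
        "host_is_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "url_entropy": shannon_entropy(url),
        "suspicious_keyword_count": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
        "token_count": len(tokens),
        "max_token_length": max(map(len, tokens), default=0),
    }


if __name__ == "__main__":
    for name, value in extract_features("http://secure-login.example.com/verify?id=123").items():
        print(f"{name}: {value}")
```

Returning a flat dictionary of numbers keeps the output easy to load into a pandas DataFrame row for training.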

Technologies Used

Python 3.8+
scikit-learn - Machine learning models
Flask - Web framework
Plotly - Data visualization
pandas, numpy - Data processing
Docker - Containerization

Docker Deployment

Using Docker Compose

docker-compose up --build

Using Docker

docker build -t malicious-url-detector .
docker run -p 5000:5000 malicious-url-detector

Performance

Model Accuracy: ~94% (both models, on the test split)
Response Latency: <120ms average
Features Extracted: 40+
Training Time: ~1-2 minutes
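
An average latency figure like the one above can be checked with a simple timing loop. The sketch below is one possible way to do that and is not taken from the project: `measure_latency_ms` and `summarize` are hypothetical helpers, and the demo times a stand-in `time.sleep` call rather than a real request to `/api/predict`.

```python
import statistics
import time


def measure_latency_ms(func, runs=20):
    """Call func() `runs` times and return each call's duration in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms):
    """Summarize timings with mean, median, and maximum."""
    return {
        "mean_ms": statistics.mean(durations_ms),
        "median_ms": statistics.median(durations_ms),
        "max_ms": max(durations_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; swap in a real API call when the server is running.
    timings = measure_latency_ms(lambda: time.sleep(0.01), runs=5)
    print(summarize(timings))
```

Reporting the median and maximum alongside the mean helps show whether a few slow requests are skewing the average.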

Future Improvements

Use a real malicious URL dataset
Add more features
Implement model retraining pipeline
Add authentication to API
Implement rate limiting
Add logging and monitoring

License

MIT License

Author

Daksh Patel

Daksh-patel3/Malicious-URL-Detection-Project-End-to-End
