Machine learning system for detecting malicious URLs using Random Forest and Logistic Regression. Features REST API, live dashboard, and Docker deployment.

Daksh-patel3/Malicious-URL-Detection-Project-End-to-End

Malicious URL Detection System

A machine learning project for detecting malicious URLs using Random Forest and Logistic Regression classifiers. This project includes a REST API for real-time predictions and a live dashboard for visualizing model performance.

About This Project

This is my machine learning project where I built a system to classify URLs as malicious or benign. I learned about:

  • Feature engineering for URL analysis
  • Machine learning classification models
  • REST API development with Flask
  • Data visualization with Plotly
  • Model deployment and serving

Features

  • Binary Classification: Detects malicious URLs with 94% accuracy
  • Feature Extraction: Extracts 40+ features from URLs using regex-based tokenization
  • REST API: Real-time URL prediction with an average response latency under 120 ms
  • Live Dashboard: Interactive dashboard with Plotly visualizations
  • Two Models: Random Forest and Logistic Regression for comparison
  • Docker Support: Easy deployment with Docker

Project Structure

Malicious/
├── app.py                         # Main Flask application
├── train.py                       # Model training script
├── test_api.py                    # API testing script
├── requirements.txt               # Python dependencies
├── Dockerfile                     # Docker configuration
├── docker-compose.yml             # Docker Compose setup
├── src/
│   ├── preprocessing/
│   │   └── feature_extractor.py   # Feature extraction module
│   ├── models/
│   │   ├── model_trainer.py       # Model training
│   │   └── model_predictor.py     # Model prediction
│   └── dashboard/
│       └── dashboard.py           # Dashboard visualization
├── data/
│   ├── raw/                       # Raw dataset (if you have one)
│   └── processed/                 # Processed data
└── models/                        # Trained model files (generated)

Installation

Prerequisites

  • Python 3.8 or higher
  • pip

Setup

  1. Clone the repository:
git clone <repository-url>
cd Malicious
  2. Create virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Train the model:
python train.py
  5. Run the application:
python app.py

The API will be available at http://localhost:5000

Usage

Web Interface

The easiest way to test URLs is through the web interface at http://localhost:5000/test

REST API

Single URL Prediction

curl -X POST http://localhost:5000/api/predict \
  -H "Content-Type: application/json" \
  -d '{"url": "http://example.com", "model": "random_forest"}'

Response:

{
  "url": "http://example.com",
  "prediction": "benign",
  "probability": 0.95,
  "benign_probability": 0.95,
  "malicious_probability": 0.05,
  "model": "random_forest"
}

Batch Prediction

curl -X POST http://localhost:5000/api/predict/batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["http://example.com", "http://github.com"]}'
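
The same two calls can also be made from Python. The following is a minimal sketch using only the standard library; the helper names (build_request, send, demo) are made up for illustration and are not part of this project. It assumes the API is running at http://localhost:5000 and accepts the JSON bodies shown in the curl examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address the app listens on after `python app.py`


def build_request(path, payload):
    """Build a JSON POST request for an API path without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payloads as the curl examples above.
single = build_request(
    "/api/predict", {"url": "http://example.com", "model": "random_forest"}
)
batch = build_request(
    "/api/predict/batch", {"urls": ["http://example.com", "http://github.com"]}
)


def demo():
    """Call both endpoints; only works while the API is running locally."""
    print(send(single))
    print(send(batch))
```

Call demo() once the server is up. Building the request separately from sending it keeps the payload easy to inspect or unit-test without a running server.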

Dashboard

Access the live dashboard at: http://localhost:5000/dashboard

The dashboard shows:

  • Real-time statistics
  • Prediction distribution charts
  • Response time histograms
  • Prediction timeline

API Endpoints

  • GET / - API information
  • GET /api/health - Health check
  • POST /api/predict - Single URL prediction
  • POST /api/predict/batch - Batch URL prediction
  • GET /api/stats - Model statistics
  • GET /dashboard - Performance dashboard
  • GET /test - Web interface for testing URLs

Model Details

Training Data

  • Total Samples: 10,000
    • Benign URLs: 5,000
    • Malicious URLs: 5,000
  • Train/Test Split: 80/20
    • Training: 8,000 samples
    • Testing: 2,000 samples

Models

  1. Random Forest

    • 100 trees
    • Max depth: 20
    • Accuracy: ~94%
  2. Logistic Regression

    • Max iterations: 1000
    • Solver: lbfgs
    • Accuracy: ~94%
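
For reference, the two model configurations above map onto scikit-learn roughly as follows. This is a sketch, not the contents of train.py: the feature matrix is a synthetic stand-in produced by make_classification, the random_state values are arbitrary, and the stratified 80/20 split is an assumption about how the split is done. Accuracy on this synthetic data will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted URL features (assumption, not project data).
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=10, random_state=42
)

# 80/20 split, stratified so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters taken from the list above.
models = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=20, random_state=42
    ),
    "logistic_regression": LogisticRegression(max_iter=1000, solver="lbfgs"),
}

# Fit each model and record its accuracy on the held-out 20%.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```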

Features

The model extracts 40+ features from URLs including:

  • URL structure (length, domain, path, query)
  • Special character counts
  • TLD analysis
  • Entropy calculations
  • Suspicious keyword detection
  • Pattern matching
  • Tokenization features
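
To make these categories concrete, here is a small sketch of what a few such features might look like. It is not the code in feature_extractor.py: the function names, the feature names, and the keyword list are all hypothetical, and it covers only a handful of features rather than the full 40+.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; the project's real list is not shown.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url):
    """Return a small dictionary of illustrative URL features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    # Regex-based tokenization: split on any run of non-alphanumeric characters.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    has_ip_host = bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "path_length": len(parsed.path),
        "query_length": len(parsed.query),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "tld": "" if has_ip_host or "." not in host else host.rsplit(".", 1)[-1],
        "entropy": shannon_entropy(url),
        "num_suspicious_keywords": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
        "has_ip_host": has_ip_host,
        "num_tokens": len(tokens),
    }


features = extract_features("http://example.com/login?id=1")
print(features)
```

Non-numeric values such as the TLD string would need encoding (for example one-hot or a lookup table) before being passed to a classifier.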

Technologies Used

  • Python 3.8+
  • scikit-learn - Machine learning models
  • Flask - Web framework
  • Plotly - Data visualization
  • pandas, numpy - Data processing
  • Docker - Containerization

Docker Deployment

Using Docker Compose

docker-compose up --build

Using Docker

docker build -t malicious-url-detector .
docker run -p 5000:5000 malicious-url-detector

Performance

  • Model Accuracy: 94%
  • Response Latency: <120ms average
  • Features Extracted: 40+
  • Training Time: ~1-2 minutes
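
A latency figure like the one above can be checked with a simple timing loop. The sketch below is one possible way to do that, not how the numbers in this README were produced; measure_latency_ms and summarize are hypothetical helpers, and the lambda stands in for a real call to the prediction endpoint.

```python
import statistics
import time


def measure_latency_ms(func, inputs, warmup=1):
    """Return per-call latencies in milliseconds for func over inputs."""
    for item in inputs[:warmup]:
        func(item)  # warm-up calls are not timed
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Collapse a list of latencies into mean, max, and count."""
    return {
        "mean_ms": statistics.mean(latencies),
        "max_ms": max(latencies),
        "count": len(latencies),
    }


# Placeholder workload; replace the lambda with a function that calls the API.
urls = ["http://example.com", "http://github.com", "http://example.org"]
latencies = measure_latency_ms(lambda url: len(url), urls)
print(summarize(latencies))
```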

Future Improvements

  • Use a real-world malicious URL dataset
  • Add more features
  • Implement model retraining pipeline
  • Add authentication to API
  • Implement rate limiting
  • Add logging and monitoring

License

MIT License

Author

Daksh Patel

