A machine learning project for detecting malicious URLs using Random Forest and Logistic Regression classifiers. This project includes a REST API for real-time predictions and a live dashboard for visualizing model performance.
I built this system to classify URLs as malicious or benign. Along the way, I learned about:
- Feature engineering for URL analysis
- Machine learning classification models
- REST API development with Flask
- Data visualization with Plotly
- Model deployment and serving

Features:

- Binary Classification: Detects malicious URLs with 94% accuracy
- Feature Extraction: Extracts 40+ features from URLs using regex-based tokenization
- REST API: Real-time URL prediction with <120ms latency
- Live Dashboard: Interactive dashboard with Plotly visualizations
- Two Models: Random Forest and Logistic Regression for comparison
- Docker Support: Easy deployment with Docker

Project structure:

    Malicious/
    ├── app.py                          # Main Flask application
    ├── train.py                        # Model training script
    ├── test_api.py                     # API testing script
    ├── requirements.txt                # Python dependencies
    ├── Dockerfile                      # Docker configuration
    ├── docker-compose.yml              # Docker Compose setup
    ├── src/
    │   ├── preprocessing/
    │   │   └── feature_extractor.py    # Feature extraction module
    │   ├── models/
    │   │   ├── model_trainer.py        # Model training
    │   │   └── model_predictor.py      # Model prediction
    │   └── dashboard/
    │       └── dashboard.py            # Dashboard visualization
    ├── data/
    │   ├── raw/                        # Raw dataset (if you have one)
    │   └── processed/                  # Processed data
    └── models/                         # Trained model files (generated)

Prerequisites:

- Python 3.8 or higher
- pip

Installation:

- Clone the repository:

      git clone <repository-url>
      cd Malicious

- Create virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Train the model:

      python train.py

- Run the application:

      python app.py

The API will be available at http://localhost:5000

The easiest way to test URLs is through the web interface:
- Open: http://localhost:5000/test
- Enter a URL and click "Check URL"

Single URL prediction:

    curl -X POST http://localhost:5000/api/predict \
      -H "Content-Type: application/json" \
      -d '{"url": "http://example.com", "model": "random_forest"}'

Response:

    {
      "url": "http://example.com",
      "prediction": "benign",
      "probability": 0.95,
      "benign_probability": 0.95,
      "malicious_probability": 0.05,
      "model": "random_forest"
    }

Batch URL prediction:

    curl -X POST http://localhost:5000/api/predict/batch \
      -H "Content-Type: application/json" \
      -d '{"urls": ["http://example.com", "http://github.com"]}'

Access the live dashboard at: http://localhost:5000/dashboard

The dashboard shows:
- Real-time statistics
- Prediction distribution charts
- Response time histograms
- Prediction timeline

API endpoints:

- GET / - API information
- GET /api/health - Health check
- POST /api/predict - Single URL prediction
- POST /api/predict/batch - Batch URL prediction
- GET /api/stats - Model statistics
- GET /dashboard - Performance dashboard
- GET /test - Web interface for testing URLs
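
The two prediction endpoints can also be called from Python. The sketch below uses only the standard library and assumes the app is running locally on port 5000 as described above. The helper names (`build_request`, `post_json`, `predict`, `predict_batch`) are made up for this example and are not part of the project.

```python
import json
import urllib.request

# Assumption: the Flask app is running locally on the default port.
BASE_URL = "http://localhost:5000"


def build_request(path, payload):
    """Build a JSON POST request for an API path without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=5.0):
    """Send the request and decode the JSON response body."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def predict(url, model="random_forest"):
    """Call POST /api/predict for a single URL."""
    return post_json("/api/predict", {"url": url, "model": model})


def predict_batch(urls):
    """Call POST /api/predict/batch for several URLs."""
    return post_json("/api/predict/batch", {"urls": list(urls)})
```

With the server running, `predict("http://example.com")["prediction"]` should return the `prediction` field shown in the sample response. Nothing is sent until `predict` or `predict_batch` is called.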

Dataset:

- Total Samples: 10,000
- Benign URLs: 5,000
- Malicious URLs: 5,000
- Train/Test Split: 80/20
- Training: 8,000 samples
- Testing: 2,000 samples
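
As a sanity check on the numbers above, here is a minimal sketch of a class-balanced 80/20 split in plain Python. It assumes the split is stratified (the same benign/malicious ratio in both parts), which the project may or may not do. The function name `stratified_split`, the seed, and the 0/1 label encoding are made up for illustration and are not taken from `train.py`.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Placeholder labels mirroring the counts above: 0 = benign, 1 = malicious.
labels = [0] * 5000 + [1] * 5000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 8000 2000
```

Splitting per class keeps each part at 50% malicious, so accuracy on the test part is not skewed by class imbalance.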

Models:

- Random Forest
  - 100 trees
  - Max depth: 20
  - Accuracy: ~94%
- Logistic Regression
  - Max iterations: 1000
  - Solver: lbfgs
  - Accuracy: ~94%
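
For reference, the two configurations above map onto scikit-learn roughly as follows. This is a sketch, not the code in `model_trainer.py`: the synthetic data from `make_classification`, the `random_state` values, and the `logistic_regression` dictionary key are assumptions, and the accuracy it prints says nothing about the real URL dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 samples with 40 numeric features.
X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters taken from the list above; everything else is left at defaults.
models = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=20, random_state=0
    ),
    "logistic_regression": LogisticRegression(max_iter=1000, solver="lbfgs"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Training both models on the same split keeps the comparison fair, since each is scored on identical test samples.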
The model extracts 40+ features from URLs including:
- URL structure (length, domain, path, query)
- Special character counts
- TLD analysis
- Entropy calculations
- Suspicious keyword detection
- Pattern matching
- Tokenization features
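
To make the feature categories above concrete, here is a small sketch that computes a handful of them with the standard library. It is not the project's `feature_extractor.py`: the feature names, the keyword list, and the tokenization regex are assumptions chosen for illustration, and it covers far fewer than 40 features.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url):
    """Return a small dict of numeric features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Regex-based tokenization: split on anything that is not a letter or digit.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]
    return {
        # URL structure
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "query_length": len(parsed.query),
        # Special character counts
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        # Pattern matching: host written as a dotted IPv4 address
        "host_is_ip": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        # TLD analysis (length only, to keep every feature numeric)
        "tld_length": len(host.rsplit(".", 1)[-1]) if "." in host else 0,
        # Entropy and tokenization
        "entropy": shannon_entropy(url),
        "num_tokens": len(tokens),
        "suspicious_keyword_count": sum(t in SUSPICIOUS_KEYWORDS for t in tokens),
    }


features = extract_features("http://example.com/login?user=1")
print(features["url_length"], features["suspicious_keyword_count"])  # 31 1
```

Keeping every feature numeric means the resulting dict can be turned into a row of a feature matrix without further encoding.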

Technologies:

- Python 3.8+
- scikit-learn - Machine learning models
- Flask - Web framework
- Plotly - Data visualization
- pandas, numpy - Data processing
- Docker - Containerization

With Docker Compose:

    docker-compose up --build

With Docker directly:

    docker build -t malicious-url-detector .
    docker run -p 5000:5000 malicious-url-detector

Performance:

- Model Accuracy: 94%
- Response Latency: <120ms average
- Features Extracted: 40+
- Training Time: ~1-2 minutes
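
An average latency figure like the one above can be checked with a simple timing loop. The sketch below is one way to do it, not how the project measured it. `measure_latency_ms` and `fake_predict` are hypothetical names, and `fake_predict` just sleeps for about 2 ms in place of a real prediction call.

```python
import statistics
import time


def measure_latency_ms(func, inputs):
    """Call func once per input and return each call's wall-clock latency in ms."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def fake_predict(url):
    """Stand-in for a real prediction call (assumption): sleeps about 2 ms."""
    time.sleep(0.002)
    return "benign"


latencies = measure_latency_ms(fake_predict, ["http://example.com"] * 20)
print(f"mean: {statistics.mean(latencies):.1f} ms, max: {max(latencies):.1f} ms")
```

To time the real API, replace `fake_predict` with a function that sends a request to `/api/predict`. Reporting the maximum alongside the mean helps spot slow outliers that an average hides.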

Future improvements:

- Use real malicious URL dataset
- Add more features
- Implement model retraining pipeline
- Add authentication to API
- Implement rate limiting
- Add logging and monitoring
MIT License
Daksh Patel