Job Scraper & Analytics Dashboard

Automated job data pipeline for LinkedIn with intelligent skill extraction and real-time analytics.

Python 3.11+ · Playwright · Streamlit · SQLite


Overview

A production-ready job scraping system that collects job listings from LinkedIn, extracts technical skills using regex-based pattern matching, and provides interactive analytics through a Streamlit dashboard.

Key Capabilities

Feature                    Description
Two-Phase Scraping         Separate URL collection and detail extraction for resilience
3-Layer Skill Extraction   977 skills with regex patterns, minimal false positives
150 Role Categories        Automatic role normalization with pattern matching
Real-Time Analytics        Interactive charts, skill trends, and export capabilities
Adaptive Rate Limiting     Circuit breaker with auto-tuning concurrency (2-10 workers)
Resume Capability          Checkpoint-based recovery from interruptions

Project Structure

Job_Scrapper/
├── README.md                     # This file
├── requirements.txt              # Production dependencies
├── requirements-dev.txt          # Development dependencies
├── .gitignore                    # Git ignore rules
│
├── code/                         # All source code
│   ├── streamlit_app.py          # Main dashboard entry point
│   ├── run_scraper.py            # CLI scraper runner
│   ├── save_linkedin_cookies.py  # LinkedIn authentication helper
│   ├── setup_playwright.sh       # Playwright browser installer (WSL/Linux)
│   │
│   ├── data/
│   │   ├── jobs.db               # SQLite database (auto-created)
│   │   └── Analysis_Report/      # Generated analysis reports
│   │       ├── Data_Analyst/
│   │       ├── Data_Engineer/
│   │       └── GenAI_DataScience/
│   │
│   ├── src/
│   │   ├── config/               # Configuration files
│   │   │   ├── skills_reference_2025.json  # 977 skills with regex patterns
│   │   │   ├── roles_reference_2025.json   # 150 role categories
│   │   │   ├── countries.py                # Country/location mappings
│   │   │   └── naukri_locations.py
│   │   │
│   │   ├── db/                   # Database layer
│   │   │   ├── connection.py     # SQLite connection manager
│   │   │   ├── schema.py         # Table schemas
│   │   │   └── operations.py     # CRUD operations
│   │   │
│   │   ├── models/
│   │   │   └── models.py         # Pydantic data models
│   │   │
│   │   ├── scraper/
│   │   │   ├── unified/
│   │   │   │   ├── linkedin/     # LinkedIn scraper components
│   │   │   │   │   ├── concurrent_detail_scraper.py  # Multi-tab scraper (up to 10 tabs)
│   │   │   │   │   ├── sequential_detail_scraper.py  # Single-tab scraper
│   │   │   │   │   ├── playwright_url_scraper.py     # URL collection
│   │   │   │   │   ├── selector_config.py            # CSS selectors
│   │   │   │   │   ├── retry_helper.py               # 404/503 handling
│   │   │   │   │   └── job_validator.py              # Field validation
│   │   │   │   │
│   │   │   │   ├── naukri/       # Naukri scraper components
│   │   │   │   │   ├── url_scraper.py
│   │   │   │   │   ├── detail_scraper.py
│   │   │   │   │   └── selectors.py
│   │   │   │   │
│   │   │   │   ├── scalable/     # Rate limiting & resilience
│   │   │   │   │   ├── adaptive_rate_limiter.py
│   │   │   │   │   ├── checkpoint_manager.py
│   │   │   │   │   └── progress_tracker.py
│   │   │   │   │
│   │   │   │   ├── linkedin_unified.py  # LinkedIn orchestrator
│   │   │   │   └── naukri_unified.py    # Naukri orchestrator
│   │   │   │
│   │   │   └── services/         # External service clients
│   │   │       ├── playwright_browser.py
│   │   │       └── session_manager.py
│   │   │
│   │   ├── analysis/
│   │   │   └── skill_extraction/        # 3-layer skill extraction
│   │   │       ├── extractor.py         # Main AdvancedSkillExtractor class
│   │   │       ├── layer3_direct.py     # Pattern matching from JSON
│   │   │       ├── batch_reextract.py   # Re-process existing jobs
│   │   │       └── deduplicator.py      # Skill normalization
│   │   │
│   │   ├── ui/
│   │   │   └── components/       # Streamlit UI components
│   │   │       ├── kpi_dashboard.py
│   │   │       ├── link_scraper_form.py
│   │   │       ├── detail_scraper_form.py
│   │   │       └── analytics/
│   │   │           ├── skills_charts.py
│   │   │           └── overview_metrics.py
│   │   │
│   │   ├── utils/
│   │   │   └── cleanup_expired_urls.py
│   │   │
│   │   └── validation/
│   │       ├── validation_pipeline.py
│   │       └── single_job_validator.py
│   │
│   ├── scripts/
│   │   ├── extraction/
│   │   │   └── reextract_skills.py
│   │   │
│   │   └── validation/           # Validation suite
│   │       ├── layer1_syntax_check.sh
│   │       ├── layer2_coverage.sh
│   │       ├── layer3_fp_detection.sh
│   │       ├── layer4_fn_detection.sh
│   │       ├── cross_verify_skills.py
│   │       └── run_all_validations.sh
│   │
│   ├── tests/
│   │   ├── test_skill_validation_comprehensive.py
│   │   └── test_linkedin_selectors.py
│   │
│   └── docs/                     # Documentation
│       └── archive/              # Historical docs
│
└── Analysis/                     # Downloaded CSVs and notebooks (gitignored)
    ├── Data Analysis/
    │   ├── data_visualizer.ipynb # Analysis notebook (update CSV path for charts)
    │   └── csv/                  # Add exported CSVs here
    │
    ├── Data Engineering/
    │   ├── data_visualizer.ipynb
    │   └── csv/
    │
    └── GenAI & DataScience/
        ├── data_visualizer.ipynb
        └── csv/

Installation

Prerequisites

  • Python 3.11 or higher
  • Git

Step 1: Clone & Create Virtual Environment

Windows (PowerShell)

git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper

# Create virtual environment
python -m venv venv-win

# Activate
.\venv-win\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

Linux / WSL

git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper

# Create virtual environment
python3 -m venv venv-linux

# Activate
source venv-linux/bin/activate

# Install dependencies
python -m pip install -r requirements.txt

Note for dual-boot users: keep separate venvs (venv-win/ and venv-linux/), because Python virtual environments are not portable across operating systems.

Step 2: Install Playwright Browsers

# Windows
playwright install chromium

# Linux/WSL (use python -m prefix)
python -m playwright install chromium

Step 3: Launch Dashboard

cd code

# Windows
streamlit run streamlit_app.py

# Linux/WSL (use python -m prefix)
python -m streamlit run streamlit_app.py

The dashboard opens at http://localhost:8501


Architecture

Why Two-Phase Scraping?

 Phase 1: URL Collection          Phase 2: Detail Scraping
┌─────────────────────┐          ┌─────────────────────┐
│  Search Results     │          │  Individual Jobs    │
│  ├── Fast scroll    │   ──▶    │  ├── Full desc      │
│  ├── Extract URLs   │          │  ├── Skills parse   │
│  └── Store to DB    │          │  └── Store details  │
└─────────────────────┘          └─────────────────────┘
     job_urls table                    jobs table

Benefits:

  • Resilience: If detail scraping fails, URLs are preserved
  • Efficiency: Batch process up to 10 jobs concurrently in Phase 2
  • Resumable: Pick up exactly where you left off
  • Deduplication: Skip already-scraped URLs automatically (a minimal query sketch follows this list)
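
Resumability and deduplication both come down to the scraped flag on the job_urls table (see Database Schema below): Phase 2 only ever picks rows that have not yet been marked as scraped. A minimal sketch of that query using Python's built-in sqlite3 module; the helper name here is illustrative, and the project's own queries live in src/db/operations.py and may differ:

import sqlite3

def pending_urls(db_path: str = "data/jobs.db", limit: int = 10) -> list[tuple[str, str]]:
    """Return (job_id, url) pairs that Phase 2 has not processed yet."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT job_id, url FROM job_urls WHERE scraped = 0 LIMIT ?",
            (limit,),
        ).fetchall()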

Why Regex-Based Skill Extraction?

Approach         Speed        Accuracy   Maintenance
Regex (chosen)   0.3s/job     85-90%     Pattern file updates
spaCy NER        3-5s/job     75-80%     Model retraining
GPT-based        2-10s/job    90%+       API costs

Our 3-layer approach reaches 85-90% accuracy at roughly 10x the speed of NLP-based extraction (a simplified sketch of layer 3 follows the list):

  1. Layer 1: Multi-word phrase extraction (priority matching)
  2. Layer 2: Context-aware extraction (technical context detection)
  3. Layer 3: Direct pattern matching (977 skill patterns from JSON)
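
To make the layering concrete, here is a heavily simplified sketch of layer 3 (direct pattern matching, cf. layer3_direct.py). It assumes only the pattern format shown under Configuration below; the real AdvancedSkillExtractor adds the phrase and context layers on top of this and handles normalization, so treat the snippet as illustrative rather than the project's implementation:

import json
import re

def load_patterns(path: str = "src/config/skills_reference_2025.json") -> dict[str, list[re.Pattern]]:
    """Compile each skill's regex patterns once, keyed by skill name."""
    with open(path, encoding="utf-8") as f:
        reference = json.load(f)
    return {
        skill["name"]: [re.compile(p) for p in skill["patterns"]]
        for skill in reference["skills"]
    }

def extract_skills_direct(description: str, patterns: dict[str, list[re.Pattern]]) -> set[str]:
    """Layer 3: report a skill if any of its patterns matches the job description."""
    return {
        name
        for name, regexes in patterns.items()
        if any(rx.search(description) for rx in regexes)
    }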

Usage

Dashboard Workflow

  1. KPI Dashboard - View overall statistics
  2. Link Scraper - Phase 1: Collect job URLs
  3. Detail Scraper - Phase 2: Extract job details & skills
  4. Analytics - Analyze skill trends and export data

Command Line

cd code

# Run validation suite
bash scripts/validation/run_all_validations.sh

# Re-extract skills for existing jobs
python -m src.analysis.skill_extraction.batch_reextract --batch-size 100

LinkedIn Authentication (Optional)

For authenticated scraping with higher limits:

cd code
python save_linkedin_cookies.py

This saves cookies to linkedin_cookies.json for subsequent sessions.
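
If you want to reuse those cookies in your own Playwright scripts, loading them back into a browser context looks roughly like the sketch below. The exact shape of linkedin_cookies.json is whatever save_linkedin_cookies.py writes; the snippet assumes it is a list of cookie dicts in the format Playwright's add_cookies expects, and it is not the project's own session manager:

import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    with open("linkedin_cookies.json", encoding="utf-8") as f:
        context.add_cookies(json.load(f))  # reuse the saved LinkedIn session
    page = context.new_page()
    page.goto("https://www.linkedin.com/jobs/")
    browser.close()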


Configuration

Skills Reference (code/src/config/skills_reference_2025.json)

{ "total_skills": 977, "skills": [ { "name": "Python", "patterns": ["\\bPython\\b", "\\bpython\\b", "\\bPython3\\b"] } ] }

Environment Variables (Optional)

Create .env file in code/ directory:

# Database path (default: data/jobs.db)
DB_PATH=data/jobs.db

# Playwright browser path (for WSL)
PLAYWRIGHT_BROWSERS_PATH=.playwright-browsers
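
Inside the code these are read like ordinary environment variables. A minimal sketch of the lookup with the documented defaults (whether the project loads the .env file via python-dotenv or relies on the shell is not confirmed here, so the snippet only uses os.environ):

import os
from pathlib import Path

# Defaults mirror the values documented above; the real config module may differ.
DB_PATH = Path(os.environ.get("DB_PATH", "data/jobs.db"))
PLAYWRIGHT_BROWSERS_PATH = os.environ.get("PLAYWRIGHT_BROWSERS_PATH")  # optional, WSL only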

Database Schema

-- Phase 1: URL Collection
CREATE TABLE job_urls (
    job_id TEXT PRIMARY KEY,
    platform TEXT NOT NULL,
    input_role TEXT NOT NULL,
    actual_role TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    scraped INTEGER DEFAULT 0
);

-- Phase 2: Full Details
CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    platform TEXT NOT NULL,
    actual_role TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    job_description TEXT,
    skills TEXT,
    company_name TEXT,
    posted_date TEXT,
    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
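
The Phase 1 to Phase 2 handoff is a pair of statements against these two tables: insert the scraped details into jobs, then flip the scraped flag on job_urls. A sketch of that handoff in a single transaction; treat it as illustrative, since the project's own writes live in src/db/operations.py:

import sqlite3

def store_job_details(db_path: str, job: dict) -> None:
    """Persist one scraped job and mark its URL as done, atomically."""
    with sqlite3.connect(db_path) as conn:  # commits on success, rolls back on error
        conn.execute(
            """INSERT OR REPLACE INTO jobs
               (job_id, platform, actual_role, url, job_description,
                skills, company_name, posted_date)
               VALUES (:job_id, :platform, :actual_role, :url, :job_description,
                       :skills, :company_name, :posted_date)""",
            job,
        )
        conn.execute(
            "UPDATE job_urls SET scraped = 1 WHERE job_id = :job_id",
            {"job_id": job["job_id"]},
        )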

Performance

Metric             Value
URL Collection     200-300 URLs/min
Detail Scraping    15-20 jobs/min (10 workers)
Skill Extraction   0.3s/job
Storage per Job    ~2KB
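
As a back-of-the-envelope example derived from these figures (not a measured benchmark): 10,000 job URLs take about 33-50 minutes to collect at 200-300 URLs/min, roughly 8-11 hours to detail-scrape at 15-20 jobs/min, and add only about 20 MB to the SQLite database at ~2 KB per job.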

Troubleshooting

Playwright Browser Not Found (WSL/Linux)

cd code
chmod +x setup_playwright.sh
./setup_playwright.sh

"python" command not found (Linux)

Use python3 and invoke tools through python3 -m:

python3 -m streamlit run streamlit_app.py
python3 -m pip install package_name

Rate Limited (429 Errors)

The adaptive rate limiter handles this automatically:

  • Concurrency reduces from 10 → 2
  • Circuit breaker triggers 60s pause
  • Gradually recovers when stable (see the simplified sketch after this list)
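
The real thresholds live in src/scraper/unified/scalable/adaptive_rate_limiter.py; conceptually the control loop looks something like this simplified sketch (the class and method names are invented for illustration, and only the 2-10 worker range and the 60 s pause come from the behaviour described above):

import time

class AdaptiveLimiter:
    """Toy version of the scale-down / pause / recover behaviour described above."""

    def __init__(self, min_workers: int = 2, max_workers: int = 10) -> None:
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.workers = max_workers

    def on_response(self, status: int) -> None:
        if status == 429:
            # Back off hard, then hold for 60 seconds (circuit breaker).
            self.workers = self.min_workers
            time.sleep(60)
        elif status < 400:
            # Recover one worker at a time while requests keep succeeding.
            self.workers = min(self.max_workers, self.workers + 1)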

Database Locked

pkill -f streamlit
python -m streamlit run streamlit_app.py

Development

Install Dev Dependencies

pip install -r requirements-dev.txt

Run Tests

cd code
python -m pytest tests/ -v

Type Checking

cd code
python -m basedpyright src/

License

MIT License - See LICENSE file for details.
