Automated job data pipeline for LinkedIn with intelligent skill extraction and real-time analytics.
A production-ready job scraping system that collects job listings from LinkedIn, extracts technical skills using regex-based pattern matching, and provides interactive analytics through a Streamlit dashboard.
| Feature | Description |
|---|---|
| Two-Phase Scraping | Separate URL collection and detail extraction for resilience |
| 3-Layer Skill Extraction | 977 skills with regex patterns, minimal false positives |
| 150 Role Categories | Automatic role normalization with pattern matching |
| Real-Time Analytics | Interactive charts, skill trends, and export capabilities |
| Adaptive Rate Limiting | Circuit breaker with auto-tuning concurrency (2-10 workers) |
| Resume Capability | Checkpoint-based recovery from interruptions |
```
Job_Scrapper/
├── README.md                        # This file
├── requirements.txt                 # Production dependencies
├── requirements-dev.txt             # Development dependencies
├── .gitignore                       # Git ignore rules
│
├── code/                            # All source code
│   ├── streamlit_app.py             # Main dashboard entry point
│   ├── run_scraper.py               # CLI scraper runner
│   ├── save_linkedin_cookies.py     # LinkedIn authentication helper
│   ├── setup_playwright.sh          # Playwright browser installer (WSL/Linux)
│   │
│   ├── data/
│   │   ├── jobs.db                  # SQLite database (auto-created)
│   │   └── Analysis_Report/         # Generated analysis reports
│   │       ├── Data_Analyst/
│   │       ├── Data_Engineer/
│   │       └── GenAI_DataScience/
│   │
│   ├── src/
│   │   ├── config/                  # Configuration files
│   │   │   ├── skills_reference_2025.json  # 977 skills with regex patterns
│   │   │   ├── roles_reference_2025.json   # 150 role categories
│   │   │   ├── countries.py         # Country/location mappings
│   │   │   └── naukri_locations.py
│   │   │
│   │   ├── db/                      # Database layer
│   │   │   ├── connection.py        # SQLite connection manager
│   │   │   ├── schema.py            # Table schemas
│   │   │   └── operations.py        # CRUD operations
│   │   │
│   │   ├── models/
│   │   │   └── models.py            # Pydantic data models
│   │   │
│   │   ├── scraper/
│   │   │   ├── unified/
│   │   │   │   ├── linkedin/        # LinkedIn scraper components
│   │   │   │   │   ├── concurrent_detail_scraper.py  # Multi-tab scraper (up to 10 tabs)
│   │   │   │   │   ├── sequential_detail_scraper.py  # Single-tab scraper
│   │   │   │   │   ├── playwright_url_scraper.py     # URL collection
│   │   │   │   │   ├── selector_config.py            # CSS selectors
│   │   │   │   │   ├── retry_helper.py               # 404/503 handling
│   │   │   │   │   └── job_validator.py              # Field validation
│   │   │   │   │
│   │   │   │   ├── naukri/          # Naukri scraper components
│   │   │   │   │   ├── url_scraper.py
│   │   │   │   │   ├── detail_scraper.py
│   │   │   │   │   └── selectors.py
│   │   │   │   │
│   │   │   │   ├── scalable/        # Rate limiting & resilience
│   │   │   │   │   ├── adaptive_rate_limiter.py
│   │   │   │   │   ├── checkpoint_manager.py
│   │   │   │   │   └── progress_tracker.py
│   │   │   │   │
│   │   │   │   ├── linkedin_unified.py  # LinkedIn orchestrator
│   │   │   │   └── naukri_unified.py    # Naukri orchestrator
│   │   │   │
│   │   │   └── services/            # External service clients
│   │   │       ├── playwright_browser.py
│   │   │       └── session_manager.py
│   │   │
│   │   ├── analysis/
│   │   │   └── skill_extraction/    # 3-layer skill extraction
│   │   │       ├── extractor.py         # Main AdvancedSkillExtractor class
│   │   │       ├── layer3_direct.py     # Pattern matching from JSON
│   │   │       ├── batch_reextract.py   # Re-process existing jobs
│   │   │       └── deduplicator.py      # Skill normalization
│   │   │
│   │   ├── ui/
│   │   │   └── components/          # Streamlit UI components
│   │   │       ├── kpi_dashboard.py
│   │   │       ├── link_scraper_form.py
│   │   │       ├── detail_scraper_form.py
│   │   │       └── analytics/
│   │   │           ├── skills_charts.py
│   │   │           └── overview_metrics.py
│   │   │
│   │   ├── utils/
│   │   │   └── cleanup_expired_urls.py
│   │   │
│   │   └── validation/
│   │       ├── validation_pipeline.py
│   │       └── single_job_validator.py
│   │
│   ├── scripts/
│   │   ├── extraction/
│   │   │   └── reextract_skills.py
│   │   │
│   │   └── validation/              # Validation suite
│   │       ├── layer1_syntax_check.sh
│   │       ├── layer2_coverage.sh
│   │       ├── layer3_fp_detection.sh
│   │       ├── layer4_fn_detection.sh
│   │       ├── cross_verify_skills.py
│   │       └── run_all_validations.sh
│   │
│   ├── tests/
│   │   ├── test_skill_validation_comprehensive.py
│   │   └── test_linkedin_selectors.py
│   │
│   └── docs/                        # Documentation
│       └── archive/                 # Historical docs
│
└── Analysis/                        # Downloaded CSVs and notebooks (gitignored)
    ├── Data Analysis/
    │   ├── data_visualizer.ipynb    # Analysis notebook (update CSV path for charts)
    │   └── csv/                     # Add exported CSVs here
    │
    ├── Data Engineering/
    │   ├── data_visualizer.ipynb
    │   └── csv/
    │
    └── GenAI & DataScience/
        ├── data_visualizer.ipynb
        └── csv/
```

Prerequisites:

- Python 3.11 or higher
- Git
Windows (PowerShell):

```powershell
git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper

# Create virtual environment
python -m venv venv-win

# Activate
.\venv-win\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt
```

Linux/WSL:

```bash
git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper

# Create virtual environment
python3 -m venv venv-linux

# Activate
source venv-linux/bin/activate

# Install dependencies
python -m pip install -r requirements.txt
```

Note for dual-boot users: keep separate venvs (venv-win/ and venv-linux/) because Python virtual environments are not cross-platform compatible.
Install the Playwright browser:

```bash
# Windows
playwright install chromium

# Linux/WSL (use the python -m prefix)
python -m playwright install chromium
```

Launch the dashboard:

```bash
cd code

# Windows
streamlit run streamlit_app.py

# Linux/WSL (use the python -m prefix)
python -m streamlit run streamlit_app.py
```

The dashboard opens at http://localhost:8501.
```
Phase 1: URL Collection               Phase 2: Detail Scraping
┌─────────────────────┐               ┌─────────────────────┐
│ Search Results      │               │ Individual Jobs     │
│ ├── Fast scroll     │      ──▶      │ ├── Full desc       │
│ ├── Extract URLs    │               │ ├── Skills parse    │
│ └── Store to DB     │               │ └── Store details   │
└─────────────────────┘               └─────────────────────┘
     job_urls table                        jobs table
```

Benefits (a minimal sketch of the Phase 1 → Phase 2 handoff follows this list):
- Resilience: If detail scraping fails, URLs are preserved
- Efficiency: Batch process up to 10 jobs concurrently in Phase 2
- Resumable: Pick up exactly where you left off
- Deduplication: Skip already-scraped URLs automatically
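As an illustration of this handoff, here is a minimal Python sketch against the job_urls/jobs schema shown later in this README. The scrape_detail stub is a placeholder for the real Playwright-based detail scraper; this is not the project's actual code.

```python
import sqlite3


def scrape_detail(url: str) -> dict[str, str]:
    """Placeholder for the Playwright-based detail scraper."""
    return {"role": "Data Engineer", "description": f"Scraped from {url}"}


def run_phase2(db_path: str = "data/jobs.db", batch: int = 10) -> None:
    """Process a batch of URLs collected in Phase 1 and store full details."""
    conn = sqlite3.connect(db_path)
    # Phase 2 only reads URLs that Phase 1 stored and that are not yet scraped,
    # so an interrupted run simply resumes from the remaining rows.
    pending = conn.execute(
        "SELECT job_id, url FROM job_urls WHERE scraped = 0 LIMIT ?", (batch,)
    ).fetchall()
    for job_id, url in pending:
        details = scrape_detail(url)
        conn.execute(
            "INSERT OR IGNORE INTO jobs (job_id, platform, actual_role, url, job_description) "
            "VALUES (?, ?, ?, ?, ?)",
            (job_id, "linkedin", details["role"], url, details["description"]),
        )
        conn.execute("UPDATE job_urls SET scraped = 1 WHERE job_id = ?", (job_id,))
    conn.commit()
    conn.close()
```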
| Approach | Speed | Accuracy | Maintenance |
|---|---|---|---|
| Regex (chosen) | 0.3s/job | 85-90% | Pattern file updates |
| spaCy NER | 3-5s/job | 75-80% | Model retraining |
| GPT-based | 2-10s/job | 90%+ | API costs |
Our 3-layer approach achieves 85-90% accuracy at roughly 10x the speed of NLP-based extraction (a minimal sketch follows the list):
- Layer 1: Multi-word phrase extraction (priority matching)
- Layer 2: Context-aware extraction (technical context detection)
- Layer 3: Direct pattern matching (977 skill patterns from JSON)
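For illustration only, here is a minimal sketch of Layer 3-style direct matching, assuming the skills_reference_2025.json format shown further below; the project's AdvancedSkillExtractor layers phrase and context matching on top of this.

```python
import json
import re


def load_patterns(path: str) -> dict[str, list[re.Pattern]]:
    """Compile every skill's regex patterns from the reference JSON."""
    with open(path, encoding="utf-8") as f:
        reference = json.load(f)
    return {
        skill["name"]: [re.compile(p) for p in skill["patterns"]]
        for skill in reference["skills"]
    }


def extract_skills(description: str, patterns: dict[str, list[re.Pattern]]) -> set[str]:
    """A skill counts as present if any of its patterns matches the description."""
    return {
        name
        for name, pats in patterns.items()
        if any(p.search(description) for p in pats)
    }


if __name__ == "__main__":
    # Path assumes the script is run from the code/ directory.
    patterns = load_patterns("src/config/skills_reference_2025.json")
    print(extract_skills("We need Python and strong SQL skills.", patterns))
```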
The Streamlit dashboard provides four pages:

- KPI Dashboard - View overall statistics
- Link Scraper - Phase 1: Collect job URLs
- Detail Scraper - Phase 2: Extract job details & skills
- Analytics - Analyze skill trends and export data
```bash
cd code

# Run validation suite
bash scripts/validation/run_all_validations.sh

# Re-extract skills for existing jobs
python -m src.analysis.skill_extraction.batch_reextract --batch-size 100
```

For authenticated scraping with higher limits:
```bash
cd code
python save_linkedin_cookies.py
```

This saves cookies to linkedin_cookies.json for subsequent sessions.
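A hedged sketch of how the saved cookies could be reused with Playwright. It assumes linkedin_cookies.json contains a list of Playwright cookie dicts, which may not match exactly what save_linkedin_cookies.py writes.

```python
import json

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    # Assumption: the file holds a list of cookie dicts accepted by add_cookies().
    with open("linkedin_cookies.json", encoding="utf-8") as f:
        context.add_cookies(json.load(f))
    page = context.new_page()
    page.goto("https://www.linkedin.com/jobs/")
    print(page.title())
    browser.close()
```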
{ "total_skills": 977, "skills": [ { "name": "Python", "patterns": ["\\bPython\\b", "\\bpython\\b", "\\bPython3\\b"] } ] }Create .env file in code/ directory:
```bash
# Database path (default: data/jobs.db)
DB_PATH=data/jobs.db

# Playwright browser path (for WSL)
PLAYWRIGHT_BROWSERS_PATH=.playwright-browsers
```

Database schema (a small query sketch follows the performance table below):

```sql
-- Phase 1: URL Collection
CREATE TABLE job_urls (
    job_id TEXT PRIMARY KEY,
    platform TEXT NOT NULL,
    input_role TEXT NOT NULL,
    actual_role TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    scraped INTEGER DEFAULT 0
);

-- Phase 2: Full Details
CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    platform TEXT NOT NULL,
    actual_role TEXT NOT NULL,
    url TEXT NOT NULL UNIQUE,
    job_description TEXT,
    skills TEXT,
    company_name TEXT,
    posted_date TEXT,
    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```

Performance:

| Metric | Value |
|---|---|
| URL Collection | 200-300 URLs/min |
| Detail Scraping | 15-20 jobs/min (10 workers) |
| Skill Extraction | 0.3s/job |
| Storage per Job | ~2KB |
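As a usage illustration, here is a small sqlite3 sketch that summarizes the jobs table above. It assumes skills are stored as a comma-separated string, which may differ from the project's actual serialization.

```python
import sqlite3
from collections import Counter


def top_skills(db_path: str = "data/jobs.db", n: int = 10) -> list[tuple[str, int]]:
    """Count the most frequent skills across scraped jobs."""
    conn = sqlite3.connect(db_path)
    counter: Counter[str] = Counter()
    for (skills,) in conn.execute("SELECT skills FROM jobs WHERE skills IS NOT NULL"):
        # Assumption: skills is a comma-separated string such as "Python, SQL".
        counter.update(s.strip() for s in skills.split(",") if s.strip())
    conn.close()
    return counter.most_common(n)


if __name__ == "__main__":
    print(top_skills())
```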
If Playwright browsers are missing on WSL/Linux, run the setup script:

```bash
cd code
chmod +x setup_playwright.sh
./setup_playwright.sh
```

If python or pip is not found on Linux/WSL, use python3 or the python -m prefix:
```bash
python3 -m streamlit run streamlit_app.py
python3 -m pip install package_name
```

If LinkedIn starts rate-limiting requests, the adaptive rate limiter handles it automatically (a rough sketch follows this list):
- Concurrency reduces from 10 → 2
- Circuit breaker triggers 60s pause
- Gradually recovers when stable
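A rough sketch of this behavior (not the project's AdaptiveRateLimiter), assuming rate-limit responses are counted as consecutive failures:

```python
import time


class SimpleAdaptiveLimiter:
    """Illustrative only: scales concurrency between 2 and 10 workers and
    pauses for 60 seconds after repeated rate-limit hits."""

    def __init__(self, min_workers: int = 2, max_workers: int = 10) -> None:
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.workers = max_workers
        self.failures = 0

    def record_success(self) -> None:
        # Gradually recover concurrency while responses stay healthy.
        self.failures = 0
        self.workers = min(self.max_workers, self.workers + 1)

    def record_rate_limit(self) -> None:
        # Back off quickly when the site starts rejecting requests.
        self.failures += 1
        self.workers = max(self.min_workers, self.workers - 2)
        if self.failures >= 3:
            # Circuit breaker: stop everything for 60 seconds, then reset.
            time.sleep(60)
            self.failures = 0
```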
To restart a stuck Streamlit process:

```bash
pkill -f streamlit
python -m streamlit run streamlit_app.py
```

Install development dependencies:

```bash
pip install -r requirements-dev.txt
```

Run the tests:

```bash
cd code
python -m pytest tests/ -v
```

Run the type checker:

```bash
cd code
python -m basedpyright src/
```

MIT License - see the LICENSE file for details.