A Python-based web scraper that collects GitHub developer information, their followers, and repository details using Selenium and stores the data in a MySQL database.
- Scrapes trending developers across multiple programming languages
- Collects follower information (up to 1000 per developer)
- Gathers repository details including name, URL, description, language, stars, and forks
- Supports authentication via cookies or username/password
- Stores data in a MySQL database with automatic schema creation
- Includes error handling and logging
- Follows clean architecture principles
```
github-toolkit/
├── config/
│   └── settings.py                     # Configuration and environment variables
├── core/
│   ├── entities.py                     # Domain entities
│   └── exceptions.py                   # Custom exceptions
├── infrastructure/
│   ├── database/                       # Database-related code
│   │   ├── connection.py
│   │   └── models.py
│   └── auth/                           # Authentication service
│       └── auth_service.py
├── services/
│   └── scraping/                       # Scraping services
│       ├── github_developer_scraper.py
│       └── github_repo_scraper.py
├── utils/
│   └── helpers.py                      # Utility functions
├── controllers/
│   └── github_scraper_controller.py    # Main controller
├── main.py                             # Entry point
└── README.md
```

- Python 3.8+
- MySQL database
- Chrome browser
- Chrome WebDriver (driver setup is sketched below)
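How the scraper attaches to Chrome is defined in `services/scraping/`; purely as a hedged sketch of the WebDriver prerequisite (the headless flag and window size are assumptions, not taken from this project), initialization with Selenium 4 typically looks like the following. Selenium 4.6+ can resolve a matching ChromeDriver automatically; with older versions the chromedriver binary must be on your PATH.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def build_driver() -> webdriver.Chrome:
    """Hypothetical helper: create the Chrome WebDriver used for scraping."""
    options = Options()
    options.add_argument("--headless=new")           # assumption: run without a visible window
    options.add_argument("--window-size=1920,1080")  # assumption: stable layout for selectors
    return webdriver.Chrome(options=options)
```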
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/github-scraper.git
  cd github-scraper
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the root directory with the following variables:

  ```
  GITHUB_USERNAME=your_username
  GITHUB_PASSWORD=your_password
  DB_USERNAME=your_db_username
  DB_PASSWORD=your_db_password
  DB_HOST=your_db_host
  DB_NAME=your_db_name
  ```

- Create a `config` directory:

  ```bash
  mkdir config
  ```

- Create a `requirements.txt` file with:

  ```
  selenium
  sqlalchemy
  python-dotenv
  ```

Run the scraper:
```bash
python main.py
```

The scraper will:
- Authenticate with GitHub
- Scrape trending developers for the configured languages (this step is sketched after this list)
- Collect their followers (up to 1000 per developer)
- Scrape their repositories
- Store all data in the MySQL database
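The selectors actually used live in `services/scraping/github_developer_scraper.py`; as an illustration of the trending-developers step above, here is a self-contained Selenium sketch. The URL pattern and the CSS selectors are assumptions based on GitHub's public trending page and may need adjusting if the markup changes.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

language = "python"  # assumption: one entry from LANGUAGES in config/settings.py

driver = webdriver.Chrome()
try:
    driver.get(f"https://github.com/trending/developers/{language}?since=daily")
    # Each trending developer is rendered as an <article class="Box-row"> entry
    # (selector is an assumption and may change with GitHub's markup).
    for row in driver.find_elements(By.CSS_SELECTOR, "article.Box-row"):
        link = row.find_element(By.CSS_SELECTOR, "h1 a")
        profile_url = link.get_attribute("href")
        username = profile_url.rstrip("/").rsplit("/", 1)[-1]
        print(username, profile_url)
finally:
    driver.quit()  # always release the browser, even if scraping fails
```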
- Modify `config/settings.py` (sketched below) to change:
  - `LANGUAGES`: the list of programming languages to scrape
  - `USE_COOKIE`: toggle between cookie-based and credential-based authentication
- Adjust sleep times in the scraping services if needed for rate limiting
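The shipped `config/settings.py` is authoritative; the following is only a minimal sketch of how it might load the `.env` values with python-dotenv and expose the two options above (everything beyond `LANGUAGES` and `USE_COOKIE` is an assumption).

```python
# config/settings.py -- illustrative sketch, not the file shipped with this repo
import os
from dotenv import load_dotenv

load_dotenv()  # read the .env file created during installation

GITHUB_USERNAME = os.getenv("GITHUB_USERNAME")
GITHUB_PASSWORD = os.getenv("GITHUB_PASSWORD")

DB_USERNAME = os.getenv("DB_USERNAME")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")

# Programming languages whose trending developers should be scraped.
LANGUAGES = ["python", "javascript", "go"]

# True: authenticate by reusing saved cookies; False: log in with username/password.
USE_COOKIE = True
```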
Developer records:

- id (PK)
- username (unique)
- profile_url
- created_at
- updated_at
- published_at
Repository records (both record types are sketched as SQLAlchemy models after this list):

- id (PK)
- username
- repo_name
- repo_intro
- repo_url (unique)
- repo_lang
- repo_stars
- repo_forks
- created_at
- updated_at
- published_at
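The real models live in `infrastructure/database/models.py`; as a hedged sketch of how the two record types above could map to SQLAlchemy models (table names, column types, lengths, and the `pymysql` driver in the connection string are assumptions), with `create_all` providing the automatic schema creation mentioned in the feature list:

```python
# Illustrative sketch only -- see infrastructure/database/models.py for the real models.
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Developer(Base):
    __tablename__ = "developers"              # table name is an assumption
    id = Column(Integer, primary_key=True)
    username = Column(String(255), unique=True, nullable=False)
    profile_url = Column(String(512))
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime)

class Repository(Base):
    __tablename__ = "repositories"            # table name is an assumption
    id = Column(Integer, primary_key=True)
    username = Column(String(255), nullable=False)
    repo_name = Column(String(255))
    repo_intro = Column(String(1024))
    repo_url = Column(String(512), unique=True)
    repo_lang = Column(String(64))
    repo_stars = Column(Integer)
    repo_forks = Column(Integer)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime)

if __name__ == "__main__":
    # Automatic schema creation: build both tables if they do not exist yet.
    engine = create_engine("mysql+pymysql://user:password@localhost/github")  # placeholder DSN
    Base.metadata.create_all(engine)
```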
- Custom exceptions for authentication, scraping, and database operations
- Logging configured at INFO level
- Graceful shutdown of the browser instance (sketched below)
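Concrete exception classes live in `core/exceptions.py`; a rough sketch of the pattern described above (the class names and the `run` wrapper are assumptions) might look like this:

```python
import logging

from selenium import webdriver

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AuthenticationError(Exception):
    """Raised when GitHub login or cookie-based auth fails (name is an assumption)."""

class ScrapingError(Exception):
    """Raised when a page cannot be scraped (name is an assumption)."""

def run() -> None:
    driver = webdriver.Chrome()
    try:
        ...  # authenticate, scrape developers/followers/repos, persist to MySQL
    except (AuthenticationError, ScrapingError) as exc:
        logger.error("Run aborted: %s", exc)
    finally:
        driver.quit()  # graceful shutdown of the browser instance
        logger.info("Browser closed")
```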
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
This project is licensed under the MIT License - see the LICENSE file for details (create one if needed).
- Built with Selenium, SQLAlchemy, and Python.
- Inspired by the need to automate GitHub data collection.