A Python-based web scraper that collects GitHub developer information, their followers, and repository details using Selenium and stores the data in a MySQL database.
- Scrapes trending developers across multiple programming languages
- Collects follower information (up to 1000 per developer)
- Gathers repository details including name, URL, description, language, stars, and forks
- Supports authentication via cookies or username/password
- Stores data in a MySQL database with automatic schema creation
- Includes error handling and logging
- Follows clean architecture principles
```
github-toolkit/
├── config/
│   └── settings.py                     # Configuration and environment variables
├── core/
│   ├── entities.py                     # Domain entities
│   └── exceptions.py                   # Custom exceptions
├── infrastructure/
│   ├── database/                       # Database-related code
│   │   ├── connection.py
│   │   └── models.py
│   └── auth/                           # Authentication service
│       └── auth_service.py
├── services/
│   └── scraping/                       # Scraping services
│       ├── github_developer_scraper.py
│       └── github_repo_scraper.py
├── utils/
│   └── helpers.py                      # Utility functions
├── controllers/
│   └── github_scraper_controller.py    # Main controller
├── main.py                             # Entry point
└── README.md
```

- Python 3.8+
- MySQL database
- Chrome browser
- Chrome WebDriver (driver setup is sketched below)
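How the scraper attaches to Chrome is defined in `services/scraping/`; purely as a hedged sketch of the WebDriver prerequisite (the headless flag and window size are assumptions, not taken from this project), initialization with Selenium 4 typically looks like the following. Selenium 4.6+ can resolve a matching ChromeDriver automatically; with older versions the chromedriver binary must be on your PATH.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def build_driver() -> webdriver.Chrome:
    """Hypothetical helper: create the Chrome WebDriver used for scraping."""
    options = Options()
    options.add_argument("--headless=new")           # assumption: run without a visible window
    options.add_argument("--window-size=1920,1080")  # assumption: stable layout for selectors
    return webdriver.Chrome(options=options)
```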
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/github-scraper.git
  cd github-scraper
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the root directory with the following variables:

  ```
  GITHUB_USERNAME=your_username
  GITHUB_PASSWORD=your_password
  DB_USERNAME=your_db_username
  DB_PASSWORD=your_db_password
  DB_HOST=your_db_host
  DB_NAME=your_db_name
  ```

- Create a `config` directory:

  ```bash
  mkdir config
  ```

- Create a `requirements.txt` file with:

  ```
  selenium
  sqlalchemy
  python-dotenv
  ```

Run the scraper:
```bash
python main.py
```

The scraper will:
- Authenticate with GitHub
- Scrape trending developers for the configured languages (this step is sketched after this list)
- Collect their followers (up to 1000 per developer)
- Scrape their repositories
- Store all data in the MySQL database
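The selectors actually used live in `services/scraping/github_developer_scraper.py`; as an illustration of the trending-developers step above, here is a self-contained Selenium sketch. The URL pattern and the CSS selectors are assumptions based on GitHub's public trending page and may need adjusting if the markup changes.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

language = "python"  # assumption: one entry from LANGUAGES in config/settings.py

driver = webdriver.Chrome()
try:
    driver.get(f"https://github.com/trending/developers/{language}?since=daily")
    # Each trending developer is rendered as an <article class="Box-row"> entry
    # (selector is an assumption and may change with GitHub's markup).
    for row in driver.find_elements(By.CSS_SELECTOR, "article.Box-row"):
        link = row.find_element(By.CSS_SELECTOR, "h1 a")
        profile_url = link.get_attribute("href")
        username = profile_url.rstrip("/").rsplit("/", 1)[-1]
        print(username, profile_url)
finally:
    driver.quit()  # always release the browser, even if scraping fails
```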
- Modify `config/settings.py` (sketched below) to change:
  - `LANGUAGES`: the list of programming languages to scrape
  - `USE_COOKIE`: toggle between cookie-based and credential-based authentication
- Adjust sleep times in the scraping services if needed for rate limiting
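The shipped `config/settings.py` is authoritative; the following is only a minimal sketch of how it might load the `.env` values with python-dotenv and expose the two options above (everything beyond `LANGUAGES` and `USE_COOKIE` is an assumption).

```python
# config/settings.py -- illustrative sketch, not the file shipped with this repo
import os
from dotenv import load_dotenv

load_dotenv()  # read the .env file created during installation

GITHUB_USERNAME = os.getenv("GITHUB_USERNAME")
GITHUB_PASSWORD = os.getenv("GITHUB_PASSWORD")

DB_USERNAME = os.getenv("DB_USERNAME")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")

# Programming languages whose trending developers should be scraped.
LANGUAGES = ["python", "javascript", "go"]

# True: authenticate by reusing saved cookies; False: log in with username/password.
USE_COOKIE = True
```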
Developer records:

- id (PK)
- username (unique)
- profile_url
- created_at
- updated_at
- published_at
Repository records (both record types are sketched as SQLAlchemy models after this list):

- id (PK)
- username
- repo_name
- repo_intro
- repo_url (unique)
- repo_lang
- repo_stars
- repo_forks
- created_at
- updated_at
- published_at
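The real models live in `infrastructure/database/models.py`; as a hedged sketch of how the two record types above could map to SQLAlchemy models (table names, column types, lengths, and the `pymysql` driver in the connection string are assumptions), with `create_all` providing the automatic schema creation mentioned in the feature list:

```python
# Illustrative sketch only -- see infrastructure/database/models.py for the real models.
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Developer(Base):
    __tablename__ = "developers"              # table name is an assumption
    id = Column(Integer, primary_key=True)
    username = Column(String(255), unique=True, nullable=False)
    profile_url = Column(String(512))
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime)

class Repository(Base):
    __tablename__ = "repositories"            # table name is an assumption
    id = Column(Integer, primary_key=True)
    username = Column(String(255), nullable=False)
    repo_name = Column(String(255))
    repo_intro = Column(String(1024))
    repo_url = Column(String(512), unique=True)
    repo_lang = Column(String(64))
    repo_stars = Column(Integer)
    repo_forks = Column(Integer)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime)

if __name__ == "__main__":
    # Automatic schema creation: build both tables if they do not exist yet.
    engine = create_engine("mysql+pymysql://user:password@localhost/github")  # placeholder DSN
    Base.metadata.create_all(engine)
```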
- Custom exceptions for authentication, scraping, and database operations
- Logging configured at INFO level
- Graceful shutdown of the browser instance (sketched below)
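Concrete exception classes live in `core/exceptions.py`; a rough sketch of the pattern described above (the class names and the `run` wrapper are assumptions) might look like this:

```python
import logging

from selenium import webdriver

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AuthenticationError(Exception):
    """Raised when GitHub login or cookie-based auth fails (name is an assumption)."""

class ScrapingError(Exception):
    """Raised when a page cannot be scraped (name is an assumption)."""

def run() -> None:
    driver = webdriver.Chrome()
    try:
        ...  # authenticate, scrape developers/followers/repos, persist to MySQL
    except (AuthenticationError, ScrapingError) as exc:
        logger.error("Run aborted: %s", exc)
    finally:
        driver.quit()  # graceful shutdown of the browser instance
        logger.info("Browser closed")
```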
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
This project is licensed under the MIT License - see the LICENSE file for details (create one if needed).
- Built with Selenium, SQLAlchemy, and Python.
- Inspired by the need to automate GitHub data collection.