🚀 GitHub toolkit

A Python-based web scraper that collects GitHub developer information, their followers, and repository details using Selenium and stores the data in a MySQL database.

✨ Features

  • 🔥 Scrapes trending developers across multiple programming languages
  • 👥 Collects follower information (up to 1000 per developer)
  • 📦 Gathers repository details including name, URL, description, language, stars, and forks
  • 🔐 Supports authentication via cookies or username/password
  • 🗄️ Stores data in a MySQL database with automatic schema creation
  • ⚠️ Includes error handling and logging
  • 🧩 Follows clean architecture principles

🗂️ Project Structure

github-toolkit/
├── config/
│   └── settings.py                      # Configuration and environment variables
├── core/
│   ├── entities.py                      # Domain entities
│   └── exceptions.py                    # Custom exceptions
├── infrastructure/
│   ├── database/                        # Database-related code
│   │   ├── connection.py
│   │   └── models.py
│   └── auth/                            # Authentication service
│       └── auth_service.py
├── services/
│   └── scraping/                        # Scraping services
│       ├── github_developer_scraper.py
│       └── github_repo_scraper.py
├── utils/
│   └── helpers.py                       # Utility functions
├── controllers/
│   └── github_scraper_controller.py     # Main controller
├── main.py                              # Entry point
└── README.md

🛠️ Prerequisites

  • 🐍 Python 3.8+
  • 🗄️ MySQL database
  • 🌐 Chrome browser
  • 🧰 Chrome WebDriver

⚙️ Installation

  1. Clone the repository:

git clone git@github.com:trinhminhtriet/github-toolkit.git
cd github-toolkit

  2. Create a virtual environment and activate it:

python3 -m venv .venv
source .venv/bin/activate

  3. Install dependencies:

pip install -r requirements.txt

  4. Create a .env file in the root directory with the following variables:

GITHUB_USERNAME=your_username
GITHUB_PASSWORD=your_password
DB_USERNAME=your_db_username
DB_PASSWORD=your_db_password
DB_HOST=your_db_host
DB_NAME=your_db_name

  5. Create a config directory if it does not already exist:

mkdir config
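The .env keys above are typically loaded in config/settings.py. Below is a minimal sketch of what that module might look like, assuming python-dotenv and the pymysql driver; the variable names mirror the .env keys, but the defaults and the connection-URL format are assumptions, not the repository's actual code:

```python
# config/settings.py -- a minimal sketch; variable names mirror the .env keys above.
import os

try:
    from dotenv import load_dotenv  # python-dotenv, listed in requirements.txt
    load_dotenv()                   # read .env from the project root
except ImportError:
    pass                            # fall back to plain environment variables

GITHUB_USERNAME = os.getenv("GITHUB_USERNAME", "")
GITHUB_PASSWORD = os.getenv("GITHUB_PASSWORD", "")
DB_USERNAME = os.getenv("DB_USERNAME", "root")
DB_PASSWORD = os.getenv("DB_PASSWORD", "")
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_NAME = os.getenv("DB_NAME", "github_toolkit")

# SQLAlchemy connection URL for the MySQL database (pymysql driver assumed)
DATABASE_URL = f"mysql+pymysql://{DB_USERNAME}:{DB_PASSWORD}@{DB_HOST}/{DB_NAME}"
```

Keeping all environment access in one module means the rest of the code imports plain constants and never touches os.environ directly.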

📋 Requirements

Create a requirements.txt file with:

selenium
sqlalchemy
python-dotenv

▶️ Usage

Run the scraper:

python main.py

The scraper will:

  1. 🔑 Authenticate with GitHub
  2. 🌟 Scrape trending developers for specified languages
  3. 👥 Collect their followers (up to 1000 per developer)
  4. 📦 Scrape their repositories
  5. 💾 Store all data in the MySQL database
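Steps 3–5 depend on turning GitHub's abbreviated count labels ("1.2k" followers, "3,405" stars) into integers before they reach MySQL. A hypothetical helper in the spirit of utils/helpers.py might look like this (the name parse_count is an illustration, not the repository's actual function):

```python
def parse_count(text: str) -> int:
    """Convert a GitHub count label such as '1.2k' or '3,405' to an int.

    Hypothetical helper: GitHub renders large counts abbreviated with a
    'k' or 'm' suffix and uses thousands separators for exact values.
    """
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        return int(float(text[:-1]) * 1_000)
    if text.endswith("m"):
        return int(float(text[:-1]) * 1_000_000)
    return int(text or 0)
```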

⚙️ Configuration

  • Modify config/settings.py to change:
    • LANGUAGES: List of programming languages to scrape
    • USE_COOKIE: Toggle between cookie-based and credential-based authentication
  • ⏱️ Adjust sleep times in services if needed for rate limiting
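For the rate-limiting point above, a fixed sleep between page loads is easy to fingerprint; jittering the delay around a base value is gentler on GitHub and harder to detect. A sketch of such a helper (polite_sleep is an assumed name, not part of the repository):

```python
import random
import time


def polite_sleep(base: float = 3.0, jitter: float = 1.0) -> float:
    """Sleep between (base - jitter) and (base + jitter) seconds, never below 0.

    Returns the actual delay so callers can log it. Hypothetical helper
    illustrating the kind of sleep tuning the services already do.
    """
    delay = max(0.0, base + random.uniform(-jitter, jitter))
    time.sleep(delay)
    return delay
```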

🗃️ Database Schema

github_users

  • 🆔 id (PK)
  • 👤 username (unique)
  • 🔗 profile_url
  • 🕒 created_at
  • 🕒 updated_at
  • 📅 published_at

github_repos

  • 🆔 id (PK)
  • 👤 username
  • 📦 repo_name
  • 📝 repo_intro
  • 🔗 repo_url (unique)
  • 🏷️ repo_lang
  • ⭐ repo_stars
  • 🍴 repo_forks
  • 🕒 created_at
  • 🕒 updated_at
  • 📅 published_at
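The two tables above map naturally onto SQLAlchemy models in infrastructure/database/models.py. The sketch below assumes the column types and lengths, since the schema lists only the names; the actual models in the repository may differ:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class GithubUser(Base):
    __tablename__ = "github_users"

    id = Column(Integer, primary_key=True)
    username = Column(String(255), unique=True, nullable=False)
    profile_url = Column(String(512))
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime)


class GithubRepo(Base):
    __tablename__ = "github_repos"

    id = Column(Integer, primary_key=True)
    username = Column(String(255), nullable=False)
    repo_name = Column(String(255))
    repo_intro = Column(String(1024))
    repo_url = Column(String(512), unique=True)
    repo_lang = Column(String(64))
    repo_stars = Column(Integer, default=0)
    repo_forks = Column(Integer, default=0)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime)
```

With Base.metadata.create_all(engine), SQLAlchemy emits the CREATE TABLE statements automatically, which is how the "automatic schema creation" feature works.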

🛡️ Error Handling

  • ❗ Custom exceptions for authentication, scraping, and database operations
  • 📝 Logging configured at INFO level
  • 🛑 Graceful shutdown of browser instance
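A plausible sketch of the exception hierarchy in core/exceptions.py — a shared base class lets the controller catch any toolkit error in one place (the exact class names here are assumptions):

```python
# core/exceptions.py -- sketch of the custom exception hierarchy.
class ScraperError(Exception):
    """Base class for all github-toolkit errors."""


class AuthenticationError(ScraperError):
    """Raised when the GitHub login or cookie restore fails."""


class ScrapingError(ScraperError):
    """Raised when a page cannot be loaded or parsed."""


class DatabaseError(ScraperError):
    """Raised when a MySQL insert or query fails."""
```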

🤝 Contributing

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/your-feature).
  3. Commit changes (git commit -m "Add your feature").
  4. Push to the branch (git push origin feature/your-feature).
  5. Open a pull request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details (create one if needed).
