A Python-based web scraper that collects GitHub developer information, their followers, and repository details using Selenium and stores the data in a MySQL database.
```
github-toolkit/
├── config/
│   └── settings.py                   # Configuration and environment variables
├── core/
│   ├── entities.py                   # Domain entities
│   └── exceptions.py                 # Custom exceptions
├── infrastructure/
│   ├── database/                     # Database-related code
│   │   ├── connection.py
│   │   └── models.py
│   └── auth/                         # Authentication service
│       └── auth_service.py
├── services/
│   └── scraping/                     # Scraping services
│       ├── github_developer_scraper.py
│       └── github_repo_scraper.py
├── utils/
│   └── helpers.py                    # Utility functions
├── controllers/
│   └── github_scraper_controller.py  # Main controller
├── main.py                           # Entry point
└── README.md
```
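The layering above separates domain entities from persistence. As a hedged sketch of what `infrastructure/database/models.py` could contain (class and column names here are illustrative assumptions, not the repository's actual schema), the SQLAlchemy models might look like:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class Developer(Base):
    """A scraped GitHub developer profile (illustrative schema)."""
    __tablename__ = "developers"

    id = Column(Integer, primary_key=True)
    username = Column(String(100), unique=True, nullable=False)
    followers = Column(Integer, default=0)
    repositories = relationship("Repository", back_populates="owner")


class Repository(Base):
    """A repository belonging to a developer (illustrative schema)."""
    __tablename__ = "repositories"

    id = Column(Integer, primary_key=True)
    name = Column(String(200), nullable=False)
    language = Column(String(50))
    owner_id = Column(Integer, ForeignKey("developers.id"))
    owner = relationship("Developer", back_populates="repositories")
```

In production these tables would be created against the MySQL connection configured in `.env`; for a quick local check, `create_engine("sqlite://")` plus `Base.metadata.create_all(engine)` works the same way.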
```bash
git clone git@github.com:trinhminhtriet/github-toolkit.git
cd github-toolkit
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a `.env` file in the root directory with the following variables:

```
GITHUB_USERNAME=your_username
GITHUB_PASSWORD=your_password
DB_USERNAME=your_db_username
DB_PASSWORD=your_db_password
DB_HOST=your_db_host
DB_NAME=your_db_name
```
Additional settings live in the `config` directory. Create a `requirements.txt` file with:

```
selenium
sqlalchemy
python-dotenv
```
Run the scraper with `python main.py`. It will log in to GitHub, collect developer profiles, their followers, and repository details for each configured language, and store the results in the MySQL database.
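The scraping itself is driven by Selenium. As a hedged sketch (the function names, search-URL format, and CSS selector below are assumptions for illustration, not the repository's actual code), `services/scraping/github_developer_scraper.py` might find developers for one language like this:

```python
from urllib.parse import quote


def build_search_url(language: str, page: int = 1) -> str:
    """Build a GitHub user-search URL for one language (assumed format)."""
    query = quote(f"language:{language}")
    return f"https://github.com/search?q={query}&type=users&p={page}"


def scrape_usernames(driver, language: str, page: int = 1) -> list:
    """Load one search-results page and collect usernames.

    `driver` is an already-authenticated selenium.webdriver instance;
    the CSS selector is illustrative and may need adjusting to GitHub's
    current markup.
    """
    from selenium.webdriver.common.by import By  # selenium is a runtime dependency

    driver.get(build_search_url(language, page))
    elements = driver.find_elements(By.CSS_SELECTOR, "div.search-title a")
    return [el.text for el in elements]
```

The URL builder is pure and easy to unit-test; only `scrape_usernames` needs a live browser session.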
Edit `config/settings.py` to change:

- `LANGUAGES`: list of programming languages to scrape
- `USE_COOKIE`: toggle between cookie-based and credential-based authentication

To contribute, create a feature branch (`git checkout -b feature/your-feature`), commit your changes (`git commit -m "Add your feature"`), and push the branch (`git push origin feature/your-feature`).

This project is licensed under the MIT License - see the LICENSE file for details (create one if needed).