Git-Pandas is a Python library that transforms Git repository data into pandas DataFrames, making it easy to analyze and visualize your codebase's history, contributors, and development patterns. Built on top of GitPython, it provides a simple yet powerful interface for extracting meaningful insights from your Git repositories.
The `Repository` class provides a wrapper around a single Git repository, offering methods to extract commit history, blame information, file details, and contributor statistics.
The `ProjectDirectory` class enables the same analysis across multiple repositories at once.
The `GitHubProfile` object extends this analysis to all of a GitHub user's repositories.

Installation

Git-Pandas requires Python 3.8+ and can be installed using pip:
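For the base library (published on PyPI as `git-pandas`):

```shell
pip install git-pandas
```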
For enhanced functionality, install additional packages:
```shell
# For parallel processing
pip install joblib

# For Redis caching
pip install redis

# For visualization
pip install matplotlib seaborn
```

Basic Repository Analysis
```python
from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create repository with persistent caching
cache = DiskCache('/tmp/git_cache.gz', max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)

# Get commit history with filtering
commits_df = repo.commit_history(
    branch='main',
    limit=1000,
    ignore_globs=['*.pyc', '*.log'],
    include_globs=['*.py', '*.js']
)

# Analyze blame information
blame_df = repo.blame(by='repository')

# Calculate bus factor for entire repository
bus_factor_df = repo.bus_factor(by='repository')

# NEW: Calculate file-wise bus factor
file_bus_factor_df = repo.bus_factor(by='file')
```

Cache Management (New in v2.5.0)
```python
# Get cache statistics
stats = repo.get_cache_stats()
print(f"Cache usage: {stats['global_cache_stats']['cache_usage_percent']:.1f}%")

# Warm cache for better performance
result = repo.warm_cache(
    methods=['commit_history', 'blame', 'file_detail'],
    limit=100
)
print(f"Created {result['cache_entries_created']} cache entries")

# Invalidate specific cache entries
repo.invalidate_cache(keys=['commit_history'])

# Clear all cache for this repository
repo.invalidate_cache()
```

Remote Operations (New in v2.5.0)
```python
# Safely fetch changes from remote (read-only)
result = repo.safe_fetch_remote(dry_run=True)
if result['remote_exists'] and result['changes_available']:
    # Actually fetch the changes
    fetch_result = repo.safe_fetch_remote()
    print(f"Fetch status: {fetch_result['message']}")
```

Multi-Repository Analysis
```python
from gitpandas import ProjectDirectory

# Analyze multiple repositories with shared cache
project = ProjectDirectory('/path/to/projects', cache_backend=cache)

# NEW: Bulk operations across all repositories
result = project.bulk_fetch_and_warm(
    fetch_remote=True,
    warm_cache=True,
    parallel=True,
    cache_methods=['commit_history', 'blame']
)
print(f"Processed {result['repositories_processed']} repositories")
print(f"Cache entries created: {result['summary']['total_cache_entries_created']}")

# Get project-wide cache statistics
cache_stats = project.get_cache_stats()
print(f"Total repositories: {cache_stats['total_repositories']}")
print(f"Cache coverage: {cache_stats['cache_coverage_percent']:.1f}%")
```
```python
# Core Analysis
repo.commit_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)
repo.file_change_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)
repo.blame(rev="HEAD", committer=True, by="repository", ignore_globs=None, include_globs=None)
repo.bus_factor(by="repository", ignore_globs=None, include_globs=None)  # by="file" for file-wise
repo.punchcard(branch=None, limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)

# Repository Information
repo.list_files(rev="HEAD")
repo.has_branch(branch)
repo.is_bare()
repo.has_coverage()
repo.coverage()
repo.get_commit_content(rev, ignore_globs=None, include_globs=None)

# NEW: Remote Operations (v2.5.0)
repo.safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)
repo.warm_cache(methods=None, **kwargs)

# NEW: Cache Management (v2.5.0)
repo.invalidate_cache(keys=None, pattern=None)
repo.get_cache_stats()
```
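A common definition of bus factor, and roughly what a blame-based tool can compute, is the smallest number of contributors who together own at least half of the surviving lines of code. The following is a sketch of that idea over plain blame-style counts, not git-pandas's actual implementation:

```python
def bus_factor(loc_by_committer, threshold=0.5):
    """Smallest set of committers owning at least `threshold` of all lines."""
    total = sum(loc_by_committer.values())
    covered = 0
    # Greedily take the largest owners until the threshold is met
    for n, loc in enumerate(sorted(loc_by_committer.values(), reverse=True), start=1):
        covered += loc
        if covered / total >= threshold:
            return n
    return len(loc_by_committer)

print(bus_factor({"alice": 600, "bob": 250, "carol": 150}))  # 1 (alice alone owns 60%)
```

A low bus factor flags repositories where knowledge is concentrated in very few people.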
```python
# Initialize with multiple repositories
project = ProjectDirectory(
    working_dir='/path/to/project',  # or list of repo paths
    ignore_repos=None,
    verbose=True,
    cache_backend=None,
    default_branch='main'
)

# NEW: Bulk Operations (v2.5.0)
project.bulk_fetch_and_warm(fetch_remote=False, warm_cache=False, parallel=True, **kwargs)
project.invalidate_cache(keys=None, pattern=None, repositories=None)
project.get_cache_stats()
```

EphemeralCache (In-Memory)
```python
from gitpandas.cache import EphemeralCache

cache = EphemeralCache(max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)
```
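The `max_keys` bound means the in-memory cache evicts entries once it fills up. The effect can be pictured with a tiny bounded cache; this is an illustration of the idea, not `EphemeralCache`'s actual code:

```python
from collections import OrderedDict

class BoundedCache:
    """Illustrative bounded cache: drops the oldest entry past max_keys."""
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)          # mark as most recently written
        if len(self._data) > self.max_keys:
            self._data.popitem(last=False)   # evict the oldest key

    def get(self, key):
        return self._data.get(key)

c = BoundedCache(max_keys=2)
c.set("blame", 1)
c.set("commits", 2)
c.set("files", 3)
print(c.get("blame"))  # None -- evicted when "files" was added
```

Size the bound to the number of distinct method/argument combinations you expect to reuse; too small a bound turns the cache into pure overhead.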
DiskCache (Persistent)

```python
from gitpandas.cache import DiskCache

cache = DiskCache('/path/to/cache.gz', max_keys=500)
repo = Repository('/path/to/repo', cache_backend=cache)
```

RedisDFCache (Distributed)
```python
from gitpandas.cache import RedisDFCache

cache = RedisDFCache(
    host='localhost',
    port=6379,
    db=12,
    max_keys=1000,
    ttl=3600  # 1 hour expiration
)
repo = Repository('/path/to/repo', cache_backend=cache)
```
Most analysis methods support these filtering parameters:
- `branch`: Branch to analyze (defaults to the repository's default branch)
- `limit`: Maximum number of commits to analyze
- `days`: Limit analysis to the last N days
- `ignore_globs`: List of glob patterns for files to ignore
- `include_globs`: List of glob patterns for files to include
- `by`: How to group results (usually 'repository' or 'file')

For comprehensive documentation, examples, and the full API reference, see the project's documentation.
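The include/ignore semantics of these glob parameters can be pictured with Python's `fnmatch`-style matching. This is a sketch of the filtering idea, not git-pandas internals: a file is kept if it matches some include glob (when include globs are given) and no ignore glob.

```python
from fnmatch import fnmatch

def keep_file(path, ignore_globs=None, include_globs=None):
    """Sketch of include/ignore glob filtering semantics."""
    if include_globs and not any(fnmatch(path, g) for g in include_globs):
        return False  # not covered by any include pattern
    if ignore_globs and any(fnmatch(path, g) for g in ignore_globs):
        return False  # explicitly ignored
    return True

files = ["app.py", "app.pyc", "notes.log", "ui.js"]
kept = [f for f in files
        if keep_file(f, ignore_globs=["*.pyc", "*.log"],
                     include_globs=["*.py", "*.js"])]
print(kept)  # ['app.py', 'ui.js']
```

Ignore patterns win over include patterns here, which matches the usual convention of passing both `include_globs` and `ignore_globs` to narrow and then prune the file set.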
We welcome contributions! Please review our Contributing Guidelines before opening a pull request.
```shell
# Clone the repository
git clone https://github.com/wdm0006/git-pandas.git
cd git-pandas

# Install in development mode
make install-dev

# Run tests
make test

# Run linting and formatting
make lint
make format
```
This project is BSD licensed (see LICENSE.md).