Git-Pandas is a Python library that transforms Git repository data into pandas DataFrames, making it easy to analyze and visualize your codebase's history, contributors, and development patterns. Built on top of GitPython, it provides a simple yet powerful interface for extracting meaningful insights from your Git repositories.
The `Repository` class provides a wrapper around a single Git repository, offering methods to extract commit history, blame information, file details, and contributor statistics.
The `ProjectDirectory` class enables the same analysis across multiple repositories at once.
The `GitHubProfile` object extends this analysis to all of a GitHub user's repositories.

Installation

Git-Pandas requires Python 3.8+ and can be installed using pip:
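For the base library (published on PyPI as `git-pandas`):

```shell
pip install git-pandas
```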
For enhanced functionality, install additional packages:
```shell
# For parallel processing
pip install joblib

# For Redis caching
pip install redis

# For visualization
pip install matplotlib seaborn
```

Basic Repository Analysis
```python
from gitpandas import Repository
from gitpandas.cache import DiskCache

# Create repository with persistent caching
cache = DiskCache('/tmp/git_cache.gz', max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)

# Get commit history with filtering
commits_df = repo.commit_history(
    branch='main',
    limit=1000,
    ignore_globs=['*.pyc', '*.log'],
    include_globs=['*.py', '*.js']
)

# Analyze blame information
blame_df = repo.blame(by='repository')

# Calculate bus factor for entire repository
bus_factor_df = repo.bus_factor(by='repository')

# NEW: Calculate file-wise bus factor
file_bus_factor_df = repo.bus_factor(by='file')
```

Cache Management (New in v2.5.0)
```python
# Get cache statistics
stats = repo.get_cache_stats()
print(f"Cache usage: {stats['global_cache_stats']['cache_usage_percent']:.1f}%")

# Warm cache for better performance
result = repo.warm_cache(
    methods=['commit_history', 'blame', 'file_detail'],
    limit=100
)
print(f"Created {result['cache_entries_created']} cache entries")

# Invalidate specific cache entries
repo.invalidate_cache(keys=['commit_history'])

# Clear all cache for this repository
repo.invalidate_cache()
```

Remote Operations (New in v2.5.0)
```python
# Safely fetch changes from remote (read-only)
result = repo.safe_fetch_remote(dry_run=True)
if result['remote_exists'] and result['changes_available']:
    # Actually fetch the changes
    fetch_result = repo.safe_fetch_remote()
    print(f"Fetch status: {fetch_result['message']}")
```

Multi-Repository Analysis
```python
from gitpandas import ProjectDirectory

# Analyze multiple repositories with shared cache
project = ProjectDirectory('/path/to/projects', cache_backend=cache)

# NEW: Bulk operations across all repositories
result = project.bulk_fetch_and_warm(
    fetch_remote=True,
    warm_cache=True,
    parallel=True,
    cache_methods=['commit_history', 'blame']
)
print(f"Processed {result['repositories_processed']} repositories")
print(f"Cache entries created: {result['summary']['total_cache_entries_created']}")

# Get project-wide cache statistics
cache_stats = project.get_cache_stats()
print(f"Total repositories: {cache_stats['total_repositories']}")
print(f"Cache coverage: {cache_stats['cache_coverage_percent']:.1f}%")
```
```python
# Core Analysis
repo.commit_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)
repo.file_change_history(branch=None, limit=None, days=None, ignore_globs=None, include_globs=None)
repo.blame(rev="HEAD", committer=True, by="repository", ignore_globs=None, include_globs=None)
repo.bus_factor(by="repository", ignore_globs=None, include_globs=None)  # by="file" for file-wise
repo.punchcard(branch=None, limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)

# Repository Information
repo.list_files(rev="HEAD")
repo.has_branch(branch)
repo.is_bare()
repo.has_coverage()
repo.coverage()
repo.get_commit_content(rev, ignore_globs=None, include_globs=None)

# NEW: Remote Operations (v2.5.0)
repo.safe_fetch_remote(remote_name='origin', prune=False, dry_run=False)
repo.warm_cache(methods=None, **kwargs)

# NEW: Cache Management (v2.5.0)
repo.invalidate_cache(keys=None, pattern=None)
repo.get_cache_stats()
```
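A common definition of bus factor, and roughly what a blame-based tool can compute, is the smallest number of contributors who together own at least half of the surviving lines of code. The following is a sketch of that idea over plain blame-style counts, not git-pandas's actual implementation:

```python
def bus_factor(loc_by_committer, threshold=0.5):
    """Smallest set of committers owning at least `threshold` of all lines."""
    total = sum(loc_by_committer.values())
    covered = 0
    # Greedily take the largest owners until the threshold is met
    for n, loc in enumerate(sorted(loc_by_committer.values(), reverse=True), start=1):
        covered += loc
        if covered / total >= threshold:
            return n
    return len(loc_by_committer)

print(bus_factor({"alice": 600, "bob": 250, "carol": 150}))  # 1 (alice alone owns 60%)
```

A low bus factor flags repositories where knowledge is concentrated in very few people.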
```python
# Initialize with multiple repositories
project = ProjectDirectory(
    working_dir='/path/to/project',  # or list of repo paths
    ignore_repos=None,
    verbose=True,
    cache_backend=None,
    default_branch='main'
)

# NEW: Bulk Operations (v2.5.0)
project.bulk_fetch_and_warm(fetch_remote=False, warm_cache=False, parallel=True, **kwargs)
project.invalidate_cache(keys=None, pattern=None, repositories=None)
project.get_cache_stats()
```

EphemeralCache (In-Memory)
```python
from gitpandas.cache import EphemeralCache

cache = EphemeralCache(max_keys=1000)
repo = Repository('/path/to/repo', cache_backend=cache)
```
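The `max_keys` bound means the in-memory cache evicts entries once it fills up. The effect can be pictured with a tiny bounded cache; this is an illustration of the idea, not `EphemeralCache`'s actual code:

```python
from collections import OrderedDict

class BoundedCache:
    """Illustrative bounded cache: drops the oldest entry past max_keys."""
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)          # mark as most recently written
        if len(self._data) > self.max_keys:
            self._data.popitem(last=False)   # evict the oldest key

    def get(self, key):
        return self._data.get(key)

c = BoundedCache(max_keys=2)
c.set("blame", 1)
c.set("commits", 2)
c.set("files", 3)
print(c.get("blame"))  # None -- evicted when "files" was added
```

Size the bound to the number of distinct method/argument combinations you expect to reuse; too small a bound turns the cache into pure overhead.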
DiskCache (Persistent)

```python
from gitpandas.cache import DiskCache

cache = DiskCache('/path/to/cache.gz', max_keys=500)
repo = Repository('/path/to/repo', cache_backend=cache)
```

RedisDFCache (Distributed)
```python
from gitpandas.cache import RedisDFCache

cache = RedisDFCache(
    host='localhost',
    port=6379,
    db=12,
    max_keys=1000,
    ttl=3600  # 1 hour expiration
)
repo = Repository('/path/to/repo', cache_backend=cache)
```
Most analysis methods support these filtering parameters:
- `branch`: Branch to analyze (defaults to the repository's default branch)
- `limit`: Maximum number of commits to analyze
- `days`: Limit analysis to the last N days
- `ignore_globs`: List of glob patterns for files to ignore
- `include_globs`: List of glob patterns for files to include
- `by`: How to group results (usually 'repository' or 'file')

For comprehensive documentation, examples, and the full API reference, see the project's documentation.
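The include/ignore semantics of these glob parameters can be pictured with Python's `fnmatch`-style matching. This is a sketch of the filtering idea, not git-pandas internals: a file is kept if it matches some include glob (when include globs are given) and no ignore glob.

```python
from fnmatch import fnmatch

def keep_file(path, ignore_globs=None, include_globs=None):
    """Sketch of include/ignore glob filtering semantics."""
    if include_globs and not any(fnmatch(path, g) for g in include_globs):
        return False  # not covered by any include pattern
    if ignore_globs and any(fnmatch(path, g) for g in ignore_globs):
        return False  # explicitly ignored
    return True

files = ["app.py", "app.pyc", "notes.log", "ui.js"]
kept = [f for f in files
        if keep_file(f, ignore_globs=["*.pyc", "*.log"],
                     include_globs=["*.py", "*.js"])]
print(kept)  # ['app.py', 'ui.js']
```

Ignore patterns win over include patterns here, which matches the usual convention of passing both `include_globs` and `ignore_globs` to narrow and then prune the file set.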
We welcome contributions! Please review our Contributing Guidelines before opening a pull request.
```shell
# Clone the repository
git clone https://github.com/wdm0006/git-pandas.git
cd git-pandas

# Install in development mode
make install-dev

# Run tests
make test

# Run linting and formatting
make lint
make format
```
This project is BSD licensed (see LICENSE.md).