When I started building GitNarrative, I thought the hardest part would be the AI integration. Turns out, the real challenge was analyzing git repositories in a way that actually captures meaningful development patterns.
Here's how I built the git analysis engine that powers GitNarrative's story generation.
## The Challenge: Making Sense of Messy Git History
Every git repository tells a story, but extracting that story programmatically is complex. Consider these real commit messages from a typical project:
"fix bug" "refactor" "update dependencies" "THIS FINALLY WORKS" "revert last commit" "actually fix the bug this time"
The challenge is identifying patterns that reveal the actual development journey: the struggles, breakthroughs, and decision points that make compelling narratives.
## Library Choice: pygit2 vs GitPython
I evaluated both major Python git libraries:
**GitPython**: more Pythonic, easier to use

```python
import git

repo = git.Repo('/path/to/repo')
commits = list(repo.iter_commits())
```
**pygit2**: lower-level, better performance, more control

```python
import pygit2

repo = pygit2.Repository('/path/to/repo')
walker = repo.walk(repo.head.target)
```
I chose pygit2 because GitNarrative needs to process repositories with thousands of commits efficiently. The performance difference is significant for large repositories.
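If you want to sanity-check that difference on your own repositories, a quick (admittedly unscientific) timing harness like the one below is enough to see the gap. The repo path is a placeholder; both loops just walk the full commit history.

```python
import time

import git      # GitPython
import pygit2

REPO_PATH = '/path/to/repo'  # placeholder: point this at a local clone

start = time.perf_counter()
gp_count = sum(1 for _ in git.Repo(REPO_PATH).iter_commits())
print(f"GitPython: {gp_count} commits in {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
pg_repo = pygit2.Repository(REPO_PATH)
pg_count = sum(1 for _ in pg_repo.walk(pg_repo.head.target))
print(f"pygit2:    {pg_count} commits in {time.perf_counter() - start:.2f}s")
```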
## Core Analysis Architecture
Here's the foundation of my git analysis engine:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

import pygit2


@dataclass
class CommitAnalysis:
    sha: str
    message: str
    timestamp: datetime
    files_changed: List[str]
    additions: int
    deletions: int
    author: str
    is_merge: bool
    complexity_score: float
    commit_type: str  # 'feature', 'bugfix', 'refactor', 'docs', etc.


class GitAnalyzer:
    def __init__(self, repo_path: str):
        self.repo = pygit2.Repository(repo_path)

    def analyze_repository(self) -> Dict:
        commits = self._extract_commits()
        patterns = self._identify_patterns(commits)
        timeline = self._build_timeline(commits)
        milestones = self._detect_milestones(commits)

        return {
            "commits": commits,
            "patterns": patterns,
            "timeline": timeline,
            "milestones": milestones,
            "summary": self._generate_summary(commits, patterns),
        }
```
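For orientation, here's a minimal usage sketch of the class above. The path is a placeholder, and the exact shape of `patterns` is inferred from the prompt-formatting code later in this post:

```python
# Minimal usage sketch: point the analyzer at a local clone and
# inspect the results. The path below is a placeholder.
analyzer = GitAnalyzer('/path/to/repo')
analysis = analyzer.analyze_repository()

print(analysis['summary'])
for phase in analysis['patterns'].get('phases', []):
    print(phase['type'], '-', phase['description'])
```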
## Pattern Recognition: The Heart of Story Extraction
The key insight is that commit patterns reveal development phases. Here's how I identify them:
### 1. Commit Type Classification

```python
def _classify_commit(self, commit_message: str, files_changed: List[str]) -> str:
    message_lower = commit_message.lower()

    # Bug fix patterns
    if any(keyword in message_lower for keyword in ['fix', 'bug', 'issue', 'error']):
        return 'bugfix'

    # Feature patterns
    if any(keyword in message_lower for keyword in ['add', 'implement', 'create', 'feature']):
        return 'feature'

    # Refactor patterns
    if any(keyword in message_lower for keyword in ['refactor', 'restructure', 'reorganize']):
        return 'refactor'

    # Documentation
    if any(keyword in message_lower for keyword in ['doc', 'readme', 'comment']):
        return 'docs'

    # Dependency/config changes
    if any(file.endswith(('.json', '.yml', '.yaml', '.toml')) for file in files_changed):
        return 'config'

    return 'other'
```
### 2. Development Phase Detection

```python
def _identify_development_phases(self, commits: List[CommitAnalysis]) -> List[Dict]:
    phases = []
    current_phase = None

    for i, commit in enumerate(commits):
        # Look for phase transition indicators
        if self._is_architecture_change(commit):
            if current_phase:
                phases.append(current_phase)
            current_phase = {
                'type': 'architecture_change',
                'start_commit': commit.sha,
                'description': 'Major architectural refactoring',
                'commits': [],
            }
        elif self._is_feature_burst(commits[max(0, i - 5):i + 1]):
            # Multiple feature commits in a short timeframe
            if not current_phase or current_phase['type'] != 'feature_development':
                if current_phase:
                    phases.append(current_phase)
                current_phase = {
                    'type': 'feature_development',
                    'start_commit': commit.sha,
                    'description': 'Rapid feature development phase',
                    'commits': [],
                }

        if current_phase:
            current_phase['commits'].append(commit)

    # Don't drop the phase still open when the history ends
    if current_phase:
        phases.append(current_phase)

    return phases


def _is_architecture_change(self, commit: CommitAnalysis) -> bool:
    # High file change count + specific message patterns
    return (len(commit.files_changed) > 10
            and commit.complexity_score > 0.8
            and any(keyword in commit.message.lower()
                    for keyword in ['refactor', 'restructure', 'migrate']))
```
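`_is_feature_burst` is referenced above but never shown in this post, so here's a hedged sketch of how it could work. The thresholds (three feature commits within a week) are my assumptions, not GitNarrative's actual values:

```python
def _is_feature_burst(self, window: List[CommitAnalysis]) -> bool:
    # Sketch of the helper referenced above; thresholds are assumptions.
    # A "burst" here means mostly feature commits landing within days
    # of each other.
    if len(window) < 3:
        return False
    feature_count = sum(1 for c in window if c.commit_type == 'feature')
    time_span = (window[-1].timestamp - window[0].timestamp).days
    return feature_count >= 3 and time_span <= 7
```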
### 3. Struggle and Breakthrough Detection
This is where the storytelling magic happens:
```python
from typing import Optional


def _detect_struggle_patterns(self, commits: List[CommitAnalysis]) -> List[Dict]:
    struggles = []

    for i in range(len(commits) - 3):
        window = commits[i:i + 4]

        # Look for multiple attempts at the same issue
        if self._is_struggle_sequence(window):
            struggles.append({
                'type': 'debugging_struggle',
                'commits': window,
                'description': self._describe_struggle(window),
                'resolution_commit': self._find_resolution(commits[i + 4:i + 10]),
            })

    return struggles


def _is_struggle_sequence(self, commits: List[CommitAnalysis]) -> bool:
    # Multiple bug fix attempts in a short timeframe
    bugfix_count = sum(1 for c in commits if c.commit_type == 'bugfix')

    # Time clustering (all within days of each other)
    time_span = (commits[-1].timestamp - commits[0].timestamp).days

    return bugfix_count >= 2 and time_span <= 3


def _find_resolution(self, following_commits: List[CommitAnalysis]) -> Optional[CommitAnalysis]:
    # Look for the commit that likely resolved the issue; may find none
    for commit in following_commits:
        if ('work' in commit.message.lower()
                or 'fix' in commit.message.lower()
                or commit.complexity_score > 0.6):
            return commit
    return None
```
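`_describe_struggle` is another helper that doesn't appear in the post. A minimal sketch could summarize the window using fields `CommitAnalysis` already carries:

```python
def _describe_struggle(self, window: List[CommitAnalysis]) -> str:
    # Sketch of the helper referenced above (not the production version):
    # summarize the struggle from what the window already tells us.
    files = {f for c in window for f in c.files_changed}
    days = (window[-1].timestamp - window[0].timestamp).days or 1
    return (f"{len(window)} fix attempts across {len(files)} files "
            f"within {days} day(s)")
```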
## Timeline Correlation: When Things Happened
Understanding timing is crucial for narrative flow:
```python
from collections import defaultdict


def _build_timeline(self, commits: List[CommitAnalysis]) -> Dict:
    # Group commits by time periods
    monthly_activity = defaultdict(list)
    for commit in commits:
        month_key = commit.timestamp.strftime('%Y-%m')
        monthly_activity[month_key].append(commit)

    timeline = {}
    for month, month_commits in monthly_activity.items():
        timeline[month] = {
            'total_commits': len(month_commits),
            'commit_types': self._analyze_commit_distribution(month_commits),
            'major_changes': self._identify_major_changes(month_commits),
            'development_velocity': self._calculate_velocity(month_commits),
        }

    return timeline


def _calculate_velocity(self, commits: List[CommitAnalysis]) -> float:
    if not commits:
        return 0.0

    # Factor in commit frequency, complexity, and file changes
    total_complexity = sum(c.complexity_score for c in commits)
    total_files = sum(len(c.files_changed) for c in commits)

    return (total_complexity * total_files) / len(commits)
```
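`_analyze_commit_distribution` is referenced but not shown above; my assumption is that it's essentially a counter over the commit types assigned during classification, along these lines:

```python
from collections import Counter


def _analyze_commit_distribution(self, commits: List[CommitAnalysis]) -> Dict[str, int]:
    # Sketch of the helper referenced above: tally how many commits of
    # each type landed in the period, e.g. {'feature': 12, 'bugfix': 5}.
    return dict(Counter(c.commit_type for c in commits))
```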
## Performance Optimizations
Processing large repositories efficiently required several optimizations:
### 1. Lazy Loading

```python
def _extract_commits(self, max_commits: int = 1000) -> List[CommitAnalysis]:
    # Walk commits lazily and cap the total to avoid memory issues
    walker = self.repo.walk(self.repo.head.target)

    commits = []
    for i, commit in enumerate(walker):
        if i >= max_commits:
            break
        commits.append(self._analyze_single_commit(commit))

    return commits
```
### 2. Caching Results

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def _calculate_complexity_score(self, sha: str) -> float:
    # Expensive calculation, cached per commit. Note that lru_cache on an
    # instance method keys on (self, sha), so the cache also keeps the
    # analyzer instance alive for its lifetime.
    commit = self.repo[sha]
    # ... complexity calculation
    return score
```
### 3. Parallel Processing for Multiple Repositories

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List


async def analyze_multiple_repos(repo_paths: List[str]) -> List[Dict]:
    with ProcessPoolExecutor() as executor:
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(executor, analyze_single_repo, path)
            for path in repo_paths
        ]
        return await asyncio.gather(*tasks)
```
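One detail worth calling out: `analyze_single_repo` has to live at module level so the process pool can pickle it. A hedged sketch of the wrapper and the call site, assuming the `GitAnalyzer` class defined earlier:

```python
def analyze_single_repo(path: str) -> Dict:
    # Module-level wrapper so ProcessPoolExecutor can pickle it;
    # a sketch assuming the GitAnalyzer class shown above.
    return GitAnalyzer(path).analyze_repository()


# Kick off the whole batch from synchronous code:
# results = asyncio.run(analyze_multiple_repos(['/path/a', '/path/b']))
```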
## Integration with AI Story Generation
The analysis output feeds directly into AI prompts:
```python
def format_for_ai_prompt(self, analysis: Dict) -> str:
    prompt_data = {
        'repository_summary': analysis['summary'],
        'development_phases': analysis['patterns']['phases'],
        'key_struggles': analysis['patterns']['struggles'],
        'breakthrough_moments': analysis['milestones'],
        'timeline': analysis['timeline'],
    }

    return self._build_narrative_prompt(prompt_data)
```
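`_build_narrative_prompt` isn't shown in this post. As a sketch of the idea (not GitNarrative's actual prompt), the simplest version serializes the structured facts and prepends an instruction:

```python
import json


def _build_narrative_prompt(self, prompt_data: Dict) -> str:
    # Sketch only: the production prompt is not shown in this post.
    # The idea is to hand the model structured facts plus an instruction.
    return (
        "You are writing the development story of a software project.\n"
        "Use only the facts below; highlight struggles and breakthroughs.\n\n"
        f"{json.dumps(prompt_data, indent=2, default=str)}"
    )
```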
## Challenges and Solutions
**Challenge 1:** Repositories with inconsistent commit message styles

**Solution:** Pattern matching with multiple fallback strategies and file-based analysis
**Challenge 2:** Merge commits creating noise in the analysis

**Solution:** A filtering strategy that focuses on meaningful commits while preserving merge context (a sketch follows below)
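To make that concrete, here's a hedged sketch of what such a filter could look like. The heuristic (keep merges only when they carry their own changes) is my assumption, not the exact production rule:

```python
def _filter_meaningful_commits(self, commits: List[CommitAnalysis]) -> List[CommitAnalysis]:
    # Sketch of a merge-noise filter (assumed heuristic): drop merge
    # commits that bring no changes of their own, but keep merges that
    # actually touch files, so merge context isn't lost entirely.
    return [c for c in commits if not c.is_merge or c.files_changed]
```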
**Challenge 3:** Very large repositories (10k+ commits)

**Solution:** A sampling strategy that captures representative commits from different time periods (see the sketch below)
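A simple way to implement that kind of sampling is to bucket the history into equal slices and take an evenly spaced subset from each. The sketch below assumes the commit list is in chronological order, and the bucket/sample sizes are placeholders:

```python
def _sample_commits(self, commits: List[CommitAnalysis],
                    buckets: int = 20, per_bucket: int = 50) -> List[CommitAnalysis]:
    # Sketch of a representative-sampling strategy (sizes are placeholders):
    # split the history into equal time slices and keep an evenly spaced
    # subset from each slice, so every era of the project is represented.
    if len(commits) <= buckets * per_bucket:
        return commits
    bucket_size = len(commits) // buckets
    sampled = []
    for b in range(buckets):
        chunk = commits[b * bucket_size:(b + 1) * bucket_size]
        step = max(1, len(chunk) // per_bucket)
        sampled.extend(chunk[::step][:per_bucket])
    return sampled
```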
## Results and Validation
The analysis engine successfully processes repositories ranging from small personal projects to large open source codebases. When tested on React's repository, it correctly identified:
- The initial experimental phase (2013)
- Major architecture rewrites (Fiber, Hooks)
- Performance optimization periods
- API stabilization phases
## What's Next
Improvements currently in development:
- Better natural language processing of commit messages
- Machine learning models for commit classification
- Integration with issue tracker data for richer context
- Support for monorepo analysis
The git analysis engine is the foundation that makes GitNarrative's storytelling possible. By extracting meaningful patterns from commit history, we can transform boring git logs into compelling narratives about software development.
GitNarrative is available at https://gitnarrative.io - try it with your own repositories to see these patterns in action.
What patterns have you noticed in your own git history? I'd love to hear about interesting commit patterns you've discovered in your projects.