When I started building GitNarrative, I thought the hardest part would be the AI integration. Turns out, the real challenge was analyzing git repositories in a way that actually captures meaningful development patterns.
Here's how I built the git analysis engine that powers GitNarrative's story generation.
## The Challenge: Making Sense of Messy Git History
Every git repository tells a story, but extracting that story programmatically is complex. Consider these real commit messages from a typical project:
"fix bug" "refactor" "update dependencies" "THIS FINALLY WORKS" "revert last commit" "actually fix the bug this time"
The challenge is identifying patterns that reveal the actual development journey: the struggles, breakthroughs, and decision points that make compelling narratives.
## Library Choice: pygit2 vs GitPython
I evaluated both major Python git libraries:
**GitPython**: more Pythonic, easier to use

```python
import git

repo = git.Repo('/path/to/repo')
commits = list(repo.iter_commits())
```
**pygit2**: lower-level, better performance, more control

```python
import pygit2

repo = pygit2.Repository('/path/to/repo')
walker = repo.walk(repo.head.target)
```
I chose pygit2 because GitNarrative needs to process repositories with thousands of commits efficiently. The performance difference is significant for large repositories.
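If you want to sanity-check that difference on your own repositories, a quick (admittedly unscientific) timing harness like the one below is enough to see the gap. The repo path is a placeholder; both loops just walk the full commit history.

```python
import time

import git      # GitPython
import pygit2

REPO_PATH = '/path/to/repo'  # placeholder: point this at a local clone

start = time.perf_counter()
gp_count = sum(1 for _ in git.Repo(REPO_PATH).iter_commits())
print(f"GitPython: {gp_count} commits in {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
pg_repo = pygit2.Repository(REPO_PATH)
pg_count = sum(1 for _ in pg_repo.walk(pg_repo.head.target))
print(f"pygit2:    {pg_count} commits in {time.perf_counter() - start:.2f}s")
```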
## Core Analysis Architecture
Here's the foundation of my git analysis engine:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

import pygit2


@dataclass
class CommitAnalysis:
    sha: str
    message: str
    timestamp: datetime
    files_changed: List[str]
    additions: int
    deletions: int
    author: str
    is_merge: bool
    complexity_score: float
    commit_type: str  # 'feature', 'bugfix', 'refactor', 'docs', etc.


class GitAnalyzer:
    def __init__(self, repo_path: str):
        self.repo = pygit2.Repository(repo_path)

    def analyze_repository(self) -> Dict:
        commits = self._extract_commits()
        patterns = self._identify_patterns(commits)
        timeline = self._build_timeline(commits)
        milestones = self._detect_milestones(commits)

        return {
            "commits": commits,
            "patterns": patterns,
            "timeline": timeline,
            "milestones": milestones,
            "summary": self._generate_summary(commits, patterns),
        }
```
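For orientation, here's a minimal usage sketch of the class above. The path is a placeholder, and the exact shape of `patterns` is inferred from the prompt-formatting code later in this post:

```python
# Minimal usage sketch: point the analyzer at a local clone and
# inspect the results. The path below is a placeholder.
analyzer = GitAnalyzer('/path/to/repo')
analysis = analyzer.analyze_repository()

print(analysis['summary'])
for phase in analysis['patterns'].get('phases', []):
    print(phase['type'], '-', phase['description'])
```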
## Pattern Recognition: The Heart of Story Extraction
The key insight is that commit patterns reveal development phases. Here's how I identify them:
### 1. Commit Type Classification

```python
def _classify_commit(self, commit_message: str, files_changed: List[str]) -> str:
    message_lower = commit_message.lower()

    # Bug fix patterns
    if any(keyword in message_lower for keyword in ['fix', 'bug', 'issue', 'error']):
        return 'bugfix'

    # Feature patterns
    if any(keyword in message_lower for keyword in ['add', 'implement', 'create', 'feature']):
        return 'feature'

    # Refactor patterns
    if any(keyword in message_lower for keyword in ['refactor', 'restructure', 'reorganize']):
        return 'refactor'

    # Documentation
    if any(keyword in message_lower for keyword in ['doc', 'readme', 'comment']):
        return 'docs'

    # Dependency/config changes
    if any(file.endswith(('.json', '.yml', '.yaml', '.toml')) for file in files_changed):
        return 'config'

    return 'other'
```
### 2. Development Phase Detection

```python
def _identify_development_phases(self, commits: List[CommitAnalysis]) -> List[Dict]:
    phases = []
    current_phase = None

    for i, commit in enumerate(commits):
        # Look for phase transition indicators
        if self._is_architecture_change(commit):
            if current_phase:
                phases.append(current_phase)
            current_phase = {
                'type': 'architecture_change',
                'start_commit': commit.sha,
                'description': 'Major architectural refactoring',
                'commits': [],
            }
        elif self._is_feature_burst(commits[max(0, i - 5):i + 1]):
            # Multiple feature commits in a short timeframe
            if not current_phase or current_phase['type'] != 'feature_development':
                if current_phase:
                    phases.append(current_phase)
                current_phase = {
                    'type': 'feature_development',
                    'start_commit': commit.sha,
                    'description': 'Rapid feature development phase',
                    'commits': [],
                }

        if current_phase:
            current_phase['commits'].append(commit)

    # Don't drop the phase still open when the history ends
    if current_phase:
        phases.append(current_phase)

    return phases


def _is_architecture_change(self, commit: CommitAnalysis) -> bool:
    # High file change count + specific message patterns
    return (len(commit.files_changed) > 10
            and commit.complexity_score > 0.8
            and any(keyword in commit.message.lower()
                    for keyword in ['refactor', 'restructure', 'migrate']))
```
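`_is_feature_burst` is referenced above but never shown in this post, so here's a hedged sketch of how it could work. The thresholds (three feature commits within a week) are my assumptions, not GitNarrative's actual values:

```python
def _is_feature_burst(self, window: List[CommitAnalysis]) -> bool:
    # Sketch of the helper referenced above; thresholds are assumptions.
    # A "burst" here means mostly feature commits landing within days
    # of each other.
    if len(window) < 3:
        return False
    feature_count = sum(1 for c in window if c.commit_type == 'feature')
    time_span = (window[-1].timestamp - window[0].timestamp).days
    return feature_count >= 3 and time_span <= 7
```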
### 3. Struggle and Breakthrough Detection
This is where the storytelling magic happens:
```python
from typing import Optional


def _detect_struggle_patterns(self, commits: List[CommitAnalysis]) -> List[Dict]:
    struggles = []

    for i in range(len(commits) - 3):
        window = commits[i:i + 4]

        # Look for multiple attempts at the same issue
        if self._is_struggle_sequence(window):
            struggles.append({
                'type': 'debugging_struggle',
                'commits': window,
                'description': self._describe_struggle(window),
                'resolution_commit': self._find_resolution(commits[i + 4:i + 10]),
            })

    return struggles


def _is_struggle_sequence(self, commits: List[CommitAnalysis]) -> bool:
    # Multiple bug fix attempts in a short timeframe
    bugfix_count = sum(1 for c in commits if c.commit_type == 'bugfix')

    # Time clustering (all within days of each other)
    time_span = (commits[-1].timestamp - commits[0].timestamp).days

    return bugfix_count >= 2 and time_span <= 3


def _find_resolution(self, following_commits: List[CommitAnalysis]) -> Optional[CommitAnalysis]:
    # Look for the commit that likely resolved the issue; may find none
    for commit in following_commits:
        if ('work' in commit.message.lower()
                or 'fix' in commit.message.lower()
                or commit.complexity_score > 0.6):
            return commit
    return None
```
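`_describe_struggle` is another helper that doesn't appear in the post. A minimal sketch could summarize the window using fields `CommitAnalysis` already carries:

```python
def _describe_struggle(self, window: List[CommitAnalysis]) -> str:
    # Sketch of the helper referenced above (not the production version):
    # summarize the struggle from what the window already tells us.
    files = {f for c in window for f in c.files_changed}
    days = (window[-1].timestamp - window[0].timestamp).days or 1
    return (f"{len(window)} fix attempts across {len(files)} files "
            f"within {days} day(s)")
```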
## Timeline Correlation: When Things Happened
Understanding timing is crucial for narrative flow:
```python
from collections import defaultdict


def _build_timeline(self, commits: List[CommitAnalysis]) -> Dict:
    # Group commits by time periods
    monthly_activity = defaultdict(list)
    for commit in commits:
        month_key = commit.timestamp.strftime('%Y-%m')
        monthly_activity[month_key].append(commit)

    timeline = {}
    for month, month_commits in monthly_activity.items():
        timeline[month] = {
            'total_commits': len(month_commits),
            'commit_types': self._analyze_commit_distribution(month_commits),
            'major_changes': self._identify_major_changes(month_commits),
            'development_velocity': self._calculate_velocity(month_commits),
        }

    return timeline


def _calculate_velocity(self, commits: List[CommitAnalysis]) -> float:
    if not commits:
        return 0.0

    # Factor in commit frequency, complexity, and file changes
    total_complexity = sum(c.complexity_score for c in commits)
    total_files = sum(len(c.files_changed) for c in commits)

    return (total_complexity * total_files) / len(commits)
```
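`_analyze_commit_distribution` is referenced but not shown above; my assumption is that it's essentially a counter over the commit types assigned during classification, along these lines:

```python
from collections import Counter


def _analyze_commit_distribution(self, commits: List[CommitAnalysis]) -> Dict[str, int]:
    # Sketch of the helper referenced above: tally how many commits of
    # each type landed in the period, e.g. {'feature': 12, 'bugfix': 5}.
    return dict(Counter(c.commit_type for c in commits))
```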
## Performance Optimizations
Processing large repositories efficiently required several optimizations:
### 1. Lazy Loading

```python
def _extract_commits(self, max_commits: int = 1000) -> List[CommitAnalysis]:
    # Walk commits lazily and cap the total to avoid memory issues
    walker = self.repo.walk(self.repo.head.target)

    commits = []
    for i, commit in enumerate(walker):
        if i >= max_commits:
            break
        commits.append(self._analyze_single_commit(commit))

    return commits
```
### 2. Caching Results

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def _calculate_complexity_score(self, sha: str) -> float:
    # Expensive calculation, cached per commit. Note that lru_cache on an
    # instance method keys on (self, sha), so the cache also keeps the
    # analyzer instance alive for its lifetime.
    commit = self.repo[sha]
    # ... complexity calculation
    return score
```
### 3. Parallel Processing for Multiple Repositories

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List


async def analyze_multiple_repos(repo_paths: List[str]) -> List[Dict]:
    with ProcessPoolExecutor() as executor:
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(executor, analyze_single_repo, path)
            for path in repo_paths
        ]
        return await asyncio.gather(*tasks)
```
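One detail worth calling out: `analyze_single_repo` has to live at module level so the process pool can pickle it. A hedged sketch of the wrapper and the call site, assuming the `GitAnalyzer` class defined earlier:

```python
def analyze_single_repo(path: str) -> Dict:
    # Module-level wrapper so ProcessPoolExecutor can pickle it;
    # a sketch assuming the GitAnalyzer class shown above.
    return GitAnalyzer(path).analyze_repository()


# Kick off the whole batch from synchronous code:
# results = asyncio.run(analyze_multiple_repos(['/path/a', '/path/b']))
```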
## Integration with AI Story Generation
The analysis output feeds directly into AI prompts:
```python
def format_for_ai_prompt(self, analysis: Dict) -> str:
    prompt_data = {
        'repository_summary': analysis['summary'],
        'development_phases': analysis['patterns']['phases'],
        'key_struggles': analysis['patterns']['struggles'],
        'breakthrough_moments': analysis['milestones'],
        'timeline': analysis['timeline'],
    }

    return self._build_narrative_prompt(prompt_data)
```
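`_build_narrative_prompt` isn't shown in this post. As a sketch of the idea (not GitNarrative's actual prompt), the simplest version serializes the structured facts and prepends an instruction:

```python
import json


def _build_narrative_prompt(self, prompt_data: Dict) -> str:
    # Sketch only: the production prompt is not shown in this post.
    # The idea is to hand the model structured facts plus an instruction.
    return (
        "You are writing the development story of a software project.\n"
        "Use only the facts below; highlight struggles and breakthroughs.\n\n"
        f"{json.dumps(prompt_data, indent=2, default=str)}"
    )
```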
## Challenges and Solutions
**Challenge 1:** Repositories with inconsistent commit message styles

**Solution:** Pattern matching with multiple fallback strategies and file-based analysis
**Challenge 2:** Merge commits creating noise in the analysis

**Solution:** A filtering strategy that focuses on meaningful commits while preserving merge context (a sketch follows below)
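To make that concrete, here's a hedged sketch of what such a filter could look like. The heuristic (keep merges only when they carry their own changes) is my assumption, not the exact production rule:

```python
def _filter_meaningful_commits(self, commits: List[CommitAnalysis]) -> List[CommitAnalysis]:
    # Sketch of a merge-noise filter (assumed heuristic): drop merge
    # commits that bring no changes of their own, but keep merges that
    # actually touch files, so merge context isn't lost entirely.
    return [c for c in commits if not c.is_merge or c.files_changed]
```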
**Challenge 3:** Very large repositories (10k+ commits)

**Solution:** A sampling strategy that captures representative commits from different time periods (see the sketch below)
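A simple way to implement that kind of sampling is to bucket the history into equal slices and take an evenly spaced subset from each. The sketch below assumes the commit list is in chronological order, and the bucket/sample sizes are placeholders:

```python
def _sample_commits(self, commits: List[CommitAnalysis],
                    buckets: int = 20, per_bucket: int = 50) -> List[CommitAnalysis]:
    # Sketch of a representative-sampling strategy (sizes are placeholders):
    # split the history into equal time slices and keep an evenly spaced
    # subset from each slice, so every era of the project is represented.
    if len(commits) <= buckets * per_bucket:
        return commits
    bucket_size = len(commits) // buckets
    sampled = []
    for b in range(buckets):
        chunk = commits[b * bucket_size:(b + 1) * bucket_size]
        step = max(1, len(chunk) // per_bucket)
        sampled.extend(chunk[::step][:per_bucket])
    return sampled
```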
## Results and Validation
The analysis engine successfully processes repositories ranging from small personal projects to large open source codebases. When tested on React's repository, it correctly identified:
- The initial experimental phase (2013)
- Major architecture rewrites (Fiber, Hooks)
- Performance optimization periods
- API stabilization phases
## What's Next
Improvements currently in development:
- Better natural language processing of commit messages
- Machine learning models for commit classification
- Integration with issue tracker data for richer context
- Support for monorepo analysis
The git analysis engine is the foundation that makes GitNarrative's storytelling possible. By extracting meaningful patterns from commit history, we can transform boring git logs into compelling narratives about software development.
GitNarrative is available at https://gitnarrative.io - try it with your own repositories to see these patterns in action.
What patterns have you noticed in your own git history? I'd love to hear about interesting commit patterns you've discovered in your projects.