README.md
# StackOverflow Python Question Monitor *Dockerized crawler tracking new Python questions on StackOverFlow. Features HTML scraping, state persistence, and code visualization. Includes logging decorators for method tracing.* --- ## 🛠️ **Run with Docker** **1. Run Crawler**: ```bash docker build -f Dockerfile.crawler -t crawler . docker run --rm crawler # Saves last_seen_id_python.txt 2. Generate Code Diagrams:
docker build -f Dockerfile.diagrams -t diagrams . docker run -v $(pwd)/diagrams:/app/diagrams --rm diagrams . ├── Dockerfile.crawler # Builds crawler image ├── Dockerfile.diagrams # Generates code2flow diagrams ├── StackOverFlow_Crawler_Kafka/ │ ├── Crawler/ │ │ ├── fetcher.py # Fetches HTML with retries (FetcherStrategy) │ │ ├── parser.py # Extracts Q&A via BeautifulSoup (QuestionParserTemplateMethod) │ │ ├── watcher.py # Polls for new questions (QuestionWatcher) │ │ ├── notification_handler.py # Handles 15+ event types (Notifier + NotificationType enum) │ │ └── tracedecorator.py # Logs method entries/exits to usage.log │ ├── main.py # CLI entry point with dependency setup │ └── models.py # Pydantic models (Question, Constants, ParsConstants) | File/Class | Key Features |
|---|---|
| fetcher.py | Retry logic (3 attempts), User-Agent rotation, URL builder |
| parser.py | CSS selectors for StackOverFlow DOM, Question data extraction |
| watcher.py | Persistent state (last_seen_id), Interval polling (60s default) |
| notification_handler.py | 15+ event types (FETCH_FAILED, NEW_QUESTIONS, etc.) |
| tracedecorator.py | Logs method calls/errors with timestamps to usage.log |
- Question (models.py):
- ID, title, link, tags, votes, answers, views
- Pydantic model for validation
- Constants:
- Configurable parameters (user agent, scrape interval, max questions)