tokencount is a new Rust CLI that helps you answer a deceptively simple question: how many GPT tokens are hidden across this project? If you build AI features, write prompt-heavy docs, or just keep an eye on context windows, this tool makes the audit painless.
## Why I built it
Most token counters either process one file at a time or ignore the filesystem realities of big projects. I wanted something that:
- Walks a codebase quickly (parallel Rayon workers + OS-native ignore rules)
- Respects `.gitignore` by default and lets me layer custom `--exclude` globs
- Talks the same language as OpenAI models (`cl100k_base`, `o200k_base`, etc.)
- Gives a useful summary out of the box: per-file counts, totals, percentiles, and top-N offenders
- Plays nicely with automation (JSON and NDJSON streaming modes)
## Features at a glance
- Blazing fast scan – `ignore::WalkBuilder` + Rayon for concurrent IO/tokenization (sketched below)
- Smart defaults – only scans `*.elm` unless you add `--include-ext` flags (good for Elm-heavy repos)
- Flexible filtering – combine `--include-ext`, `--exclude`, `--max-bytes`, and `--follow-symlinks`
- Multiple outputs – table, JSON array with summary, or NDJSON stream for pipelines
- Rich stats – totals, average per file, and P50/P90/P99 percentiles to spot outliers fast
- Quiet/verbose modes – keep CI logs clean or turn on detailed warnings locally
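To make that first bullet concrete, here's a condensed sketch of the walk-then-tokenize shape in Rust. It's an illustration under tokencount's stated defaults (`*.elm` only, `cl100k_base`), not the actual source; it assumes `ignore`, `rayon`, and `tiktoken-rs` in Cargo.toml and trims error handling for brevity:

```rust
use ignore::WalkBuilder;
use rayon::prelude::*;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // cl100k_base is the GPT-3.5/GPT-4 encoding shipped by tiktoken-rs.
    let bpe = tiktoken_rs::cl100k_base()?;

    // Walk the tree; the `ignore` crate honors .gitignore and friends
    // by default. Keep only regular files with an .elm extension.
    let paths: Vec<_> = WalkBuilder::new(".")
        .build()
        .filter_map(Result::ok)
        .filter(|entry| entry.file_type().map_or(false, |ft| ft.is_file()))
        .filter(|entry| entry.path().extension().map_or(false, |ext| ext == "elm"))
        .map(|entry| entry.into_path())
        .collect();

    // Tokenize concurrently with Rayon; silently skip non-UTF-8 files here
    // (the real tool warns instead).
    let total: usize = paths
        .par_iter()
        .filter_map(|path| fs::read_to_string(path).ok())
        .map(|text| bpe.encode_with_special_tokens(&text).len())
        .sum();

    println!("total tokens: {total}");
    Ok(())
}
```

The real tool layers `--include-ext`/`--exclude` filtering, per-file reporting, and warnings on top of this same shape.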
## Install
```bash
cargo install tokencount
```
## Quick tour
```bash
# default: scan current directory, only *.elm files, table output
tokencount

# include Elm + TypeScript
tokencount ./frontend --include-ext elm --include-ext ts

# show top 10 largest files by tokens
tokencount --top 10

# machine-readable summary for CI
tokencount --format json > tokens.json

# streaming counts for further processing
tokencount --format ndjson

# sort descending by token count
tokencount --sort tokens
```
Each run ends with a footer like this:
```
---
total files: 42
total tokens: 128730
average/file: 3065.00
p50: 812  p90: 7194  p99: 24403
```
Need only the top offenders? Combine `--top N` with either `--sort tokens` or the default path sort.
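For scripted use, the NDJSON stream drops straight into standard tooling. A hypothetical pipeline sketch: the per-record field names (`path`, `tokens`) are my assumption rather than documented output, so check them against what your tokencount version actually emits.

```bash
# Top offenders, sorted by token count (flags shown in the Quick tour).
tokencount --top 10 --sort tokens

# Sum tokens across all records (assumes a `tokens` field per record).
tokencount --format ndjson | jq -s 'map(.tokens) | add'

# List files over 5,000 tokens (assumes `path` and `tokens` fields).
tokencount --format ndjson | jq -r 'select(.tokens > 5000) | .path'
```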
## Under the hood
- Ignore handling uses the `ignore` crate, with `.gitignore`, `.git/info/exclude`, and global git ignores respected automatically. I also add common junk folders (`node_modules`, `target`, `.git`) so you don't have to.
- Tokenization relies on `tiktoken-rs`, so you get the same counts as OpenAI's `cl100k_base`/`o200k_base` encodings.
- Error handling is friendly by default: non-UTF-8 files or oversized blobs are skipped with warnings (or silently with `--quiet`).
- Percentiles use a nearest-rank approach and degrade gracefully when there are zero files (see the sketch below).
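Nearest-rank is a simple rule: for percentile p over n sorted values, report the value at 1-based rank ceil(p/100 * n). Here's a minimal sketch of that approach (the actual tokencount implementation may differ in details):

```rust
// Nearest-rank percentile over pre-sorted token counts: for percentile p
// and n values, take the value at 1-based rank ceil(p/100 * n).
fn percentile(sorted: &[u64], p: f64) -> Option<u64> {
    if sorted.is_empty() {
        return None; // zero files scanned: no percentile to report
    }
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

fn main() {
    let mut counts = vec![812, 120, 7194, 24403, 3065];
    counts.sort_unstable();
    for p in [50.0, 90.0, 99.0] {
        println!("p{:.0}: {:?}", p, percentile(&counts, p));
    }
}
```

With the sample counts, p50 lands on the 3rd of 5 sorted values (3065), while p90 and p99 both hit the largest file.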
## Roadmap & feedback
I’m exploring:
- More encodings (if you need a different tokenizer, open an issue)
- Optional HTML/Markdown report outputs
- Built-in file size histogram to complement token stats
Repo & issues live here: github.com/CharlonTank/tokencount
If you try `tokencount`, I'd love to hear how it fits into your prompt-engineering workflow or CI pipelines; reach out in the repo or drop a comment below.