tokencount is a new Rust CLI that helps you answer a deceptively simple question: how many GPT tokens are hidden across this project? If you build AI features, write prompt-heavy docs, or just keep an eye on context windows, this tool makes the audit painless.
## Why I built it
Most token counters either process one file at a time or ignore the filesystem realities of big projects. I wanted something that:
- Walks a codebase quickly (parallel Rayon workers + OS-native ignore rules)
- Respects `.gitignore` by default and lets me layer custom `--exclude` globs
- Talks the same language as OpenAI models (`cl100k_base`, `o200k_base`, etc.)
- Gives a useful summary out of the box: per-file counts, totals, percentiles, and top-N offenders
- Plays nicely with automation (JSON and NDJSON streaming modes)
## Features at a glance
- Blazing fast scan – `ignore::WalkBuilder` + Rayon for concurrent IO/tokenization (sketched below)
- Smart defaults – only scans `*.elm` unless you add `--include-ext` flags (good for Elm-heavy repos)
- Flexible filtering – combine `--include-ext`, `--exclude`, `--max-bytes`, and `--follow-symlinks`
- Multiple outputs – table, JSON array with summary, or NDJSON stream for pipelines
- Rich stats – totals, average per file, and P50/P90/P99 percentiles to spot outliers fast
- Quiet/verbose modes – keep CI logs clean or turn on detailed warnings locally
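To make that first bullet concrete, here's a condensed sketch of the walk-then-tokenize shape in Rust. It's an illustration under tokencount's stated defaults (`*.elm` only, `cl100k_base`), not the actual source; it assumes `ignore`, `rayon`, and `tiktoken-rs` in Cargo.toml and trims error handling for brevity:

```rust
use ignore::WalkBuilder;
use rayon::prelude::*;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // cl100k_base is the GPT-3.5/GPT-4 encoding shipped by tiktoken-rs.
    let bpe = tiktoken_rs::cl100k_base()?;

    // Walk the tree; the `ignore` crate honors .gitignore and friends
    // by default. Keep only regular files with an .elm extension.
    let paths: Vec<_> = WalkBuilder::new(".")
        .build()
        .filter_map(Result::ok)
        .filter(|entry| entry.file_type().map_or(false, |ft| ft.is_file()))
        .filter(|entry| entry.path().extension().map_or(false, |ext| ext == "elm"))
        .map(|entry| entry.into_path())
        .collect();

    // Tokenize concurrently with Rayon; silently skip non-UTF-8 files here
    // (the real tool warns instead).
    let total: usize = paths
        .par_iter()
        .filter_map(|path| fs::read_to_string(path).ok())
        .map(|text| bpe.encode_with_special_tokens(&text).len())
        .sum();

    println!("total tokens: {total}");
    Ok(())
}
```

The real tool layers `--include-ext`/`--exclude` filtering, per-file reporting, and warnings on top of this same shape.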
## Install
```bash
cargo install tokencount
```
## Quick tour
```bash
# default: scan current directory, only *.elm files, table output
tokencount

# include Elm + TypeScript
tokencount ./frontend --include-ext elm --include-ext ts

# show top 10 largest files by tokens
tokencount --top 10

# machine-readable summary for CI
tokencount --format json > tokens.json

# streaming counts for further processing
tokencount --format ndjson

# sort descending by token count
tokencount --sort tokens
```
Each run ends with a footer like this:
```
---
total files: 42
total tokens: 128730
average/file: 3065.00
p50: 812  p90: 7194  p99: 24403
```
Need only the top offenders? Combine `--top N` with either `--sort tokens` or the default path sort.
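For scripted use, the NDJSON stream drops straight into standard tooling. A hypothetical pipeline sketch: the per-record field names (`path`, `tokens`) are my assumption rather than documented output, so check them against what your tokencount version actually emits.

```bash
# Top offenders, sorted by token count (flags shown in the Quick tour).
tokencount --top 10 --sort tokens

# Sum tokens across all records (assumes a `tokens` field per record).
tokencount --format ndjson | jq -s 'map(.tokens) | add'

# List files over 5,000 tokens (assumes `path` and `tokens` fields).
tokencount --format ndjson | jq -r 'select(.tokens > 5000) | .path'
```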
## Under the hood
- Ignore handling uses the `ignore` crate, with `.gitignore`, `.git/info/exclude`, and global git ignores respected automatically. I also add common junk folders (`node_modules`, `target`, `.git`) so you don't have to.
- Tokenization relies on `tiktoken-rs`, so you get the same counts as OpenAI's `cl100k_base`/`o200k_base` encodings.
- Error handling is friendly by default: non-UTF-8 files or oversized blobs are skipped with warnings (or silently with `--quiet`).
- Percentiles use a nearest-rank approach and degrade gracefully when there are zero files (see the sketch below).
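Nearest-rank is a simple rule: for percentile p over n sorted values, report the value at 1-based rank ceil(p/100 * n). Here's a minimal sketch of that approach (the actual tokencount implementation may differ in details):

```rust
// Nearest-rank percentile over pre-sorted token counts: for percentile p
// and n values, take the value at 1-based rank ceil(p/100 * n).
fn percentile(sorted: &[u64], p: f64) -> Option<u64> {
    if sorted.is_empty() {
        return None; // zero files scanned: no percentile to report
    }
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

fn main() {
    let mut counts = vec![812, 120, 7194, 24403, 3065];
    counts.sort_unstable();
    for p in [50.0, 90.0, 99.0] {
        println!("p{:.0}: {:?}", p, percentile(&counts, p));
    }
}
```

With the sample counts, p50 lands on the 3rd of 5 sorted values (3065), while p90 and p99 both hit the largest file.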
## Roadmap & feedback
I’m exploring:
- More encodings (if you need a different tokenizer, open an issue)
- Optional HTML/Markdown report outputs
- Built-in file size histogram to complement token stats
Repo & issues live here: github.com/CharlonTank/tokencount
If you try `tokencount`, I'd love to hear how it fits into your prompt-engineering workflow or CI pipelines; reach out in the repo or drop a comment below.