A command-line tool for deduplicating entries in a file or stream with constant memory usage

mterron/swuniq

swuniq

Deduplicate matching lines (within a configurable window) from a file or standard input, writing to standard output.

Like uniq, but it works on unsorted input, so it can be used as a pipe filter with constant memory usage.

Why?

Sometimes you need to consume a data stream (a Certificate Transparency log, for example) that has non-consecutive duplicates, and you don't want to deal with them. The usual solution involving awk has unbounded memory usage, which can be a problem. swuniq does not have that problem.

Memory Usage

swuniq uses a ring buffer of configurable size (-w option) as a FIFO queue that stores a hash of each line, which keeps memory use constant at 64 bits × the -w value (800 bytes for the default window of 100).
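
The idea can be sketched in a few lines of C++. This is only an illustration, not swuniq's actual source: the names HashWindow, insert_if_new and dedup are hypothetical, std::hash stands in for whatever 64-bit hash the tool really uses, and the choice to record a hash only when a line is printed is an assumption (one that happens to reproduce the outputs in the Example section).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper, not swuniq's real code: a fixed-size FIFO of 64-bit
// line hashes. Storage is window * 8 bytes no matter how long the input is.
class HashWindow {
public:
    explicit HashWindow(std::size_t window) : slots_(window), used_(0), next_(0) {
        assert(window > 0);
    }

    // Returns true (and records the hash) if the line's hash is not in the
    // window; returns false if it is, leaving the window unchanged.
    bool insert_if_new(const std::string& line) {
        // Assumption: std::hash is a stand-in for the tool's own hash function.
        const std::uint64_t h =
            static_cast<std::uint64_t>(std::hash<std::string>{}(line));
        for (std::size_t i = 0; i < used_; ++i) {
            if (slots_[i] == h) {
                return false;
            }
        }
        slots_[next_] = h;                    // overwrites the oldest entry once full
        next_ = (next_ + 1) % slots_.size();  // advance the ring position
        if (used_ < slots_.size()) {
            ++used_;
        }
        return true;
    }

private:
    std::vector<std::uint64_t> slots_;
    std::size_t used_;  // number of slots holding a valid hash
    std::size_t next_;  // slot that the next new hash will be written to
};

// Keeps each line whose hash is not among the last `window` lines kept.
std::vector<std::string> dedup(const std::vector<std::string>& lines,
                               std::size_t window) {
    HashWindow seen(window);
    std::vector<std::string> out;
    for (const auto& line : lines) {
        if (seen.insert_if_new(line)) {
            out.push_back(line);
        }
    }
    return out;
}
```

A real filter would read lines from standard input with std::getline and print each line for which insert_if_new returns true, instead of collecting them in a vector. Because only hashes are stored, two different lines with the same hash would be treated as duplicates; with a 64-bit hash that is unlikely but not impossible.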

Example

    # swuniq -h
    Usage: swuniq [-w N] [INPUT]
       Filter matching lines (within a configurable window) from INPUT
       (or standard input), writing to standard output.

        -w N Size of the sliding window to use for deduplication

    Note: By default swuniq will use a window of 100 lines.

    # cat input.txt
    apple
    apple
    apple
    banana
    banana
    strawberry
    blueberry
    apple
    banana
    strawberry
    blueberry
    kiwifruit
    orange
    peach
    watermelon
    orange
    watermelon
    kiwifruit
    banana
    banana
    banana
    apple
    kiwifruit

    # swuniq < input.txt
    apple
    banana
    strawberry
    blueberry
    kiwifruit
    orange
    peach
    watermelon

    # swuniq -w 4 < input.txt
    apple
    banana
    strawberry
    blueberry
    kiwifruit
    orange
    peach
    watermelon
    banana
    apple
    kiwifruit

    # swuniq -w 2 < input.txt
    apple
    banana
    strawberry
    blueberry
    apple
    banana
    strawberry
    blueberry
    kiwifruit
    orange
    peach
    watermelon
    orange
    kiwifruit
    banana
    apple
    kiwifruit
