@klauspost klauspost commented Sep 25, 2022

Use a 5-byte hash instead of a 4-byte hash.

This improves compression in most cases and will also yield faster decompression. Little to no impact on compression speed.
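To illustrate the idea, here is a minimal sketch of a multiplicative hash over 4 versus 5 input bytes. It is not the code from this PR: the function names `hash4` and `hash5`, the multiplier constants, and the 15-bit table size are assumptions chosen for illustration. The point it shows is that a 4-byte hash sends all inputs sharing a 4-byte prefix to the same table slot, while a 5-byte hash also mixes in the next byte, so candidates found through the table tend to be longer matches.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	prime4 = 2654435761   // odd 32-bit multiplier (illustrative assumption)
	prime5 = 889523592379 // odd 40-bit multiplier (illustrative assumption)
)

// hash4 hashes the first 4 bytes of b into a tableBits-bit value.
// b must hold at least 4 bytes.
func hash4(b []byte, tableBits uint) uint32 {
	u := binary.LittleEndian.Uint32(b)
	return (u * prime4) >> (32 - tableBits)
}

// hash5 hashes the first 5 bytes of b into a tableBits-bit value.
// b must hold at least 8 bytes, because a full 64-bit word is loaded.
// Shifting left by 24 bits discards bytes 6 to 8 of the loaded word.
func hash5(b []byte, tableBits uint) uint32 {
	u := binary.LittleEndian.Uint64(b)
	return uint32(((u << 24) * prime5) >> (64 - tableBits))
}

func main() {
	const tableBits = 15 // hypothetical hash table size of 1<<15 entries

	a := []byte("abcdEfgh")
	b := []byte("abcdXfgh") // same 4-byte prefix as a, different 5th byte

	// The 4-byte hash cannot tell a and b apart.
	fmt.Println("hash4 equal:", hash4(a, tableBits) == hash4(b, tableBits))
	// The 5-byte hash also depends on the 5th byte.
	fmt.Println("hash5 equal:", hash5(a, tableBits) == hash5(b, tableBits))
}
```

The trade-off in this sketch is that matches of exactly 4 bytes can no longer be found through the table, which may explain why some inputs compress slightly worse.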

Before/after (each level is listed twice: first before, then after this change):

| file | out | level | insize | outsize | millis | mb/s |
|---|---|---|---|---|---|---|
| nyc-taxi-data-10M.csv | gzkp | 1 | 3325605752 | 922273214 | 14065 | 225.49 |
| nyc-taxi-data-10M.csv | gzkp | 1 | 3325605752 | 846471964 | 14342 | 221.12 |
| nyc-taxi-data-10M.csv | gzkp | 2 | 3325605752 | 883782053 | 15683 | 202.22 |
| nyc-taxi-data-10M.csv | gzkp | 2 | 3325605752 | 815766227 | 14865 | 213.35 |
| nyc-taxi-data-10M.csv | gzkp | 3 | 3325605752 | 878726683 | 17308 | 183.24 |
| nyc-taxi-data-10M.csv | gzkp | 3 | 3325605752 | 808448239 | 16882 | 187.86 |
| nyc-taxi-data-10M.csv | gzkp | 4 | 3325605752 | 789447233 | 20651 | 153.57 |
| nyc-taxi-data-10M.csv | gzkp | 4 | 3325605752 | 789447233 | 20657 | 153.53 |

| file | out | level | insize | outsize | millis | mb/s |
|---|---|---|---|---|---|---|
| enwik9 | gzkp | 1 | 1000000000 | 382781160 | 5713 | 166.90 |
| enwik9 | gzkp | 1 | 1000000000 | 374131553 | 5826 | 163.69 |
| enwik9 | gzkp | 2 | 1000000000 | 371351753 | 6131 | 155.55 |
| enwik9 | gzkp | 2 | 1000000000 | 361881529 | 5910 | 161.36 |
| enwik9 | gzkp | 3 | 1000000000 | 364881746 | 6891 | 138.39 |
| enwik9 | gzkp | 3 | 1000000000 | 355065173 | 6960 | 137.02 |
| enwik9 | gzkp | 4 | 1000000000 | 342732211 | 8339 | 114.36 |
| enwik9 | gzkp | 4 | 1000000000 | 342732211 | 8252 | 115.57 |

| file | reset | out | level | files | insize | outsize | millis | mb/s |
|---|---|---|---|---|---|---|---|---|
| objectfiles | true | gzkp | 1 | 708 | 300491980 | 56114777 | 1008 | 284.27 |
| objectfiles | true | gzkp | 1 | 708 | 300491980 | 55300071 | 998 | 286.90 |
| objectfiles | true | gzkp | 2 | 708 | 300491980 | 53946448 | 1147 | 249.71 |
| objectfiles | true | gzkp | 2 | 708 | 300491980 | 52750260 | 1109 | 258.36 |
| objectfiles | true | gzkp | 3 | 708 | 300491980 | 53110452 | 1220 | 234.82 |
| objectfiles | true | gzkp | 3 | 708 | 300491980 | 51947585 | 1211 | 236.46 |

One of the few regressions:

| file | out | level | insize | outsize | millis | mb/s |
|---|---|---|---|---|---|---|
| rawstudio-mint14.tar | gzkp | 1 | 8558382592 | 3960117298 | 36682 | 222.50 |
| rawstudio-mint14.tar | gzkp | 1 | 8558382592 | 3985295228 | 36619 | 222.88 |
| rawstudio-mint14.tar | gzkp | 2 | 8558382592 | 3899597850 | 38683 | 210.99 |
| rawstudio-mint14.tar | gzkp | 2 | 8558382592 | 3921716642 | 36754 | 222.06 |
| rawstudio-mint14.tar | gzkp | 3 | 8558382592 | 3848762302 | 46588 | 175.19 |
| rawstudio-mint14.tar | gzkp | 3 | 8558382592 | 3846475496 | 45611 | 178.94 |
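For reading the figures, the mb/s column appears to be the input size in MiB (2^20 bytes) divided by the elapsed time, and the size effect of the change is the relative difference between the paired outsize values. The sketch below checks this against rows from the table above. The helper names `throughputMiB` and `sizeChangePct` are hypothetical, and the throughput formula is an inference from the numbers, not something stated in the PR. Results can deviate slightly from the table, probably because millis is rounded.

```go
package main

import "fmt"

// throughputMiB returns input throughput in MiB/s for a run that
// consumed insize bytes in millis milliseconds.
func throughputMiB(insize, millis int64) float64 {
	return float64(insize) / (1 << 20) / (float64(millis) / 1000)
}

// sizeChangePct returns the relative change in output size in percent.
// Negative values mean the output got smaller.
func sizeChangePct(before, after int64) float64 {
	return 100 * float64(after-before) / float64(before)
}

func main() {
	// nyc-taxi-data-10M.csv, level 1, first (before) row.
	fmt.Printf("throughput: %.2f MiB/s\n", throughputMiB(3325605752, 14065))

	// nyc-taxi-data-10M.csv, level 1: before vs. after outsize.
	fmt.Printf("nyc-taxi level 1: %+.2f%%\n", sizeChangePct(922273214, 846471964))

	// rawstudio-mint14.tar, level 1: one of the few regressions.
	fmt.Printf("rawstudio level 1: %+.2f%%\n", sizeChangePct(3960117298, 3985295228))
}
```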
Use 5 byte hash instead of 4 byte hash.

This improves compression in most cases and will also yield faster decompression. Little to no performance impact.

Before/after:

```
file                   out   level  insize      outsize    millis  mb/s
nyc-taxi-data-10M.csv  gzkp  1      3325605752  922273214  14065   225.49
nyc-taxi-data-10M.csv  gzkp  1      3325605752  846471964  14564   217.76
nyc-taxi-data-10M.csv  gzkp  2      3325605752  883782053  15683   202.22
nyc-taxi-data-10M.csv  gzkp  2      3325605752  815766227  15057   210.63
nyc-taxi-data-10M.csv  gzkp  3      3325605752  878726683  17308   183.24
nyc-taxi-data-10M.csv  gzkp  3      3325605752  807241782  17184   184.56
nyc-taxi-data-10M.csv  gzkp  4      3325605752  789447233  20651   153.57
nyc-taxi-data-10M.csv  gzkp  4      3325605752  789447233  20862   152.02

file    out   level  insize      outsize    millis  mb/s
enwik9  gzkp  1      1000000000  382781160  5713    166.90
enwik9  gzkp  1      1000000000  374131553  5926    160.90
enwik9  gzkp  2      1000000000  371351753  6131    155.55
enwik9  gzkp  2      1000000000  361881529  6007    158.74
enwik9  gzkp  3      1000000000  364881746  6891    138.39
enwik9  gzkp  3      1000000000  355065173  7043    135.39
enwik9  gzkp  4      1000000000  342732211  8339    114.36
enwik9  gzkp  4      1000000000  342732211  8327    114.52
```
@klauspost klauspost merged commit b8a3c61 into master Sep 25, 2022
@klauspost klauspost deleted the improve-l1-3-flate-compression branch September 25, 2022 16:41