Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

OpenZL compressed SAM/BAM vs. CRAM is the interesting comparison. It would really test the flexibility of the framework. Can OpenZL reach the same level of compression, and how much effort does it take?

I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.



Do you happen to have a pointer to a good open source dataset to look at?

Naively and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.

We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.


For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further


Amazing, thank you!

I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.


Another format that might be worth looking at in the bioinformatics world is hdf5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the hdf5 format with the self-describing decompression routines of openZL.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact