Improve chromsizes File Validation to Catch Formatting Errors Early #458
Fixes: #209
Original Issues: #142 & #124
Related Issues:
Overview
This pull request improves the `read_chromsizes` function to catch formatting errors in chromsizes files early and provide clear, actionable error messages. Previously, issues like spaces instead of tabs, hidden characters, or malformed rows could slip through, causing confusing downstream errors (e.g., `ValueError: cannot convert float NaN to integer`). Now, the function validates the file format upfront, ensuring it's tab-delimited, has exactly two columns, and contains valid integer lengths, which makes it more robust and user-friendly.
What Was Happening Before?
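The old failure mode can be reproduced with plain pandas. This is a minimal sketch that assumes the file was read with a tab separator and two named columns (`name`, `length`); the exact call inside `read_chromsizes` may differ.

```python
import io

import pandas as pd

# A chromsizes file that uses spaces instead of tabs.
space_delimited = "chr1 1000000\nchr2 2000000\n"

# Assumed parsing call: tab separator and two named columns.
df = pd.read_csv(io.StringIO(space_delimited), sep="\t", names=["name", "length"])

# No line contains a tab, so each whole line lands in "name"
# and the "length" column is padded with NaN.
print(df)

# Downstream code that expects integer lengths then fails.
try:
    int(df["length"].iloc[0])
except ValueError as exc:
    print(f"Downstream error: {exc}")
```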
Malformed chromsizes files passed through `pandas.read_csv` without validation. This led to `NaN` values in the `length` column, which crashed later steps like binning with vague errors.
For example, a row could be read with `1000000 extra_column` as a single value, resulting in `NaN` for `length`. Similarly, spaces instead of tabs (e.g., `chr1 1000000`) caused misparsing.
What's Changed?
This update adds proactive checks to `read_chromsizes` to catch these issues right away. Here's what's new:
Strict Tab Enforcement:
Files that use spaces instead of tabs are rejected early with a `ValueError`.
Exact Two-Column Validation:
We use `pandas.read_csv` with `on_bad_lines="error"`, which rejects files with too few or too many columns (e.g., `chr1\t1000000\textra` or `chr1`). This prevents silent misparsing.
Numeric Length Check:
We convert the `length` column to numbers with `pd.to_numeric(errors="coerce")`. If any values turn into `NaN` (e.g., due to text like `allele1` or hidden characters), we raise a detailed error.
How It Works Now
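The three checks can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: `read_chromsizes_sketch` is a hypothetical name, it takes the file contents as a string, and it counts columns by hand instead of relying on `on_bad_lines="error"` so that its behavior is easy to verify. The error messages are modeled on the ones quoted in this description.

```python
import pandas as pd


def read_chromsizes_sketch(text):
    """Hypothetical validated reader; illustrative only, not cooler's code."""
    names, lengths = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines

        # Check 1: several whitespace-separated fields but no tab at all
        # means the file is space-delimited.
        if "\t" not in line and len(line.split()) > 1:
            raise ValueError(
                f"Chromsizes file uses spaces instead of tabs (line {lineno}): {line!r}"
            )

        # Check 2: exactly two tab-separated columns.
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"Expected 2 tab-separated columns on line {lineno}, found {len(fields)}"
            )
        names.append(fields[0])
        lengths.append(fields[1])

    # Check 3: coerce lengths to numbers; anything non-numeric becomes NaN.
    raw = pd.Series(lengths, index=names, dtype="object")
    numeric = pd.to_numeric(raw, errors="coerce")
    bad = numeric[numeric.isna()]
    if not bad.empty:
        raise ValueError(
            "Chromsizes file contains invalid length values. Invalid rows: "
            + ", ".join(bad.index)
        )
    return numeric.astype("int64").rename("length")


sizes = read_chromsizes_sketch("chr1\t1000000\nchr2\t2000000\n")
print(sizes)
```

Reporting the offending line number or chromosome name in the message is what makes the error actionable; the sketch only checks for `NaN` after coercion, as described above.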
Good File:
→ Works perfectly, returns a `pd.Series` with lengths indexed by chromosome names.
Bad File with Spaces:
→ Fails early: `ValueError: Chromsizes file uses spaces instead of tabs...`
Bad File with Invalid Lengths:
→ Fails with: `ValueError: Chromsizes file contains invalid length values... Invalid rows: chr2 NaN`
Bad File with Extra Columns:
→ Fails with a `pandas` parsing error about mismatched columns.
Benefits
Notes
No `verbose` option is included, as per maintainer feedback; it's not needed here.
Testing
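One way the cases above could be exercised is a small reader-agnostic check. This is a sketch under stated assumptions: `check_reader`, `GOOD`, and `BAD` are hypothetical names, and the fixture contents are made up to match the cases listed in this description. It expects `ValueError` for every bad file, which also covers pandas parsing errors because `pandas.errors.ParserError` is a subclass of `ValueError`.

```python
import os
import tempfile

# Hypothetical fixtures modeled on the cases in this description.
GOOD = "chr1\t1000000\nchr2\t2000000\n"
BAD = {
    "spaces": "chr1 1000000\nchr2 2000000\n",
    "invalid_length": "chr1\t1000000\nchr2\tallele1\n",
    "extra_column": "chr1\t1000000\nchr2\t2000000\textra\n",
}


def _write(text):
    """Write text to a temporary chromsizes file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".chrom.sizes")
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    return path


def check_reader(reader):
    """Assert that `reader(path)` accepts GOOD and rejects every BAD file."""
    path = _write(GOOD)
    try:
        sizes = reader(path)
        assert int(sizes["chr1"]) == 1000000
    finally:
        os.remove(path)

    for label, text in BAD.items():
        path = _write(text)
        try:
            try:
                reader(path)
            except ValueError:
                continue  # expected: the malformed file was rejected
            raise AssertionError(f"{label}: expected ValueError")
        finally:
            os.remove(path)
```

With this PR's branch installed, it could be invoked as `check_reader(cooler.util.read_chromsizes)`, assuming the function accepts a file path as its only required argument.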
This update makes `cooler` more reliable and easier to use by catching chromsizes issues upfront with clear guidance for users.