Skip to content

Conversation

@ssam18
Copy link

@ssam18 ssam18 commented Nov 12, 2025

Closes #63089

Description

This PR fixes a segmentation fault that occurs when reading CSV files containing numbers with extremely large exponents in scientific notation (e.g., 4e492493924924).

Root Cause

The issue was an integer overflow in the xstrtod() function in pandas/_libs/src/parser/tokenizer.c. When parsing the exponent portion of scientific notation, the code accumulated digits into an int variable without bounds checking:

int n = 0; while (isdigit_ascii(*p)) { n = n * 10 + (*p - '0'); // Integer overflow with large exponents num_digits++; p++; }

With an exponent like 492493924924, the variable n would overflow, causing undefined behavior that manifests as a segmentation fault.

Solution

I added a maximum digit cap (MAX_EXPONENT_DIGITS = 4) when accumulating the exponent value:

  • Only the first 4 digits are used for the actual exponent value (allowing up to 9999)
  • Remaining digits are still consumed to maintain correct parsing position
  • This is sufficient since valid double-precision exponents are limited to roughly ±308 anyway
  • The existing range check (DBL_MIN_EXP to DBL_MAX_EXP) will properly handle out-of-range values

Testing

Added test_issue_63089.py with test cases covering:

  • The exact case from the issue report
  • Various edge cases with extremely large positive and negative exponents
  • Numbers with decimal points and large exponents

The fix prevents the overflow while maintaining correct parsing behavior for valid scientific notation.


Checklist:

Fixes pandas-dev#63089 When parsing scientific notation in CSV files, extremely large exponent values (e.g., '4e492493924924') caused integer overflow in the exponent accumulation loop, leading to undefined behavior and segmentation faults. The issue occurred in xstrtod() at pandas/_libs/src/parser/tokenizer.c where exponent digits were accumulated without bounds checking: int n = 0; while (isdigit_ascii(*p)) { n = n * 10 + (*p - '0'); // Overflow here with large exponents ... } Solution: - Add a maximum exponent digits cap (MAX_EXPONENT_DIGITS = 4) to prevent overflow while still allowing valid scientific notation - Continue consuming remaining digits to maintain correct parsing position - The capped value (up to 9999) is sufficient since the subsequent range check (DBL_MIN_EXP to DBL_MAX_EXP) will catch invalid exponents This fix prevents the overflow while maintaining correct parsing behavior for both valid and invalid exponent values. Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
@jbrockmendel jbrockmendel added the AI Slop Suspected of being AI-generated, which is not welcome. label Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Slop Suspected of being AI-generated, which is not welcome.

2 participants