ilevkivskyi
Member

After looking more at some real data I found that:

  • More than 99.9% of all ints are between -10 and 117. Values are a bit arbitrary TBH, the idea is that we should include small negative values (for TypeVarIds) and still be able to fit them in 1 byte.
  • More than 99.9% of strings are shorter than 128 bytes (again, the idea is to fit the length into a single byte).

Note that there are currently very few integers that would fit in two bytes. This is because we only store the line for type alias nodes, and type aliases are usually defined at the top of a module. We can add a special case for two bytes later when needed.

We could probably save another byte for long strings and medium integers, but I don't want to add anything fancy that would only affect fewer than 0.1% of cases.
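For illustration, here is a minimal sketch of a one-byte fast path for small integers. It is not the encoding merged in this PR: the bit layout, the marker byte, and the fallback format are assumptions chosen only to show how the 128 values in -10..117 can share a byte with a "short or long" flag.

```python
import io

# Range quoted in the comment above: 128 values, so 7 bits of payload.
MIN_SHORT, MAX_SHORT = -10, 117


def write_int(buf: io.BytesIO, value: int) -> None:
    """Hypothetical encoder: 1 byte for small ints, length-prefixed otherwise."""
    if MIN_SHORT <= value <= MAX_SHORT:
        # Shift the value into 0..127 and keep the low bit 0 as the "short" flag.
        buf.write(bytes([(value - MIN_SHORT) << 1]))
        return
    # Assumed fallback: marker byte with low bit 1, then a one-byte length,
    # then the signed big-endian payload (limits this sketch to 255 bytes).
    raw = value.to_bytes((value.bit_length() + 8) // 8, "big", signed=True)
    buf.write(bytes([1, len(raw)]))
    buf.write(raw)


def read_int(buf: io.BytesIO) -> int:
    """Inverse of write_int for the same hypothetical layout."""
    first = buf.read(1)[0]
    if first & 1 == 0:
        return (first >> 1) + MIN_SHORT
    size = buf.read(1)[0]
    return int.from_bytes(buf.read(size), "big", signed=True)


# Round-trip a few values, including ones outside the one-byte range.
b = io.BytesIO()
values = [0, -10, 117, 255, -255, 2**85]
for v in values:
    write_int(b, v)
b.seek(0)
assert [read_int(b) for _ in values] == values
```

The same flag-bit trick would apply to string lengths below 128; a dedicated two-byte case could be added between the two branches if medium integers became common.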

Finally, you may notice that I added a small correctness fix that I came across while working on this. It is not really related, but it is so minor that it doesn't deserve a separate PR.

Collaborator

@JukkaL JukkaL left a comment

Nice -- did you check how much this reduces the size of binary cache files?

write_int(b, 2 ** 85)
write_int(b, 255)
write_int(b, -1)
write_int(b, -255)
Collaborator

Test also the edge cases (-11, -10, -9, 116, 117, 118). Test a few more different lengths of integers (e.g. 15 bits, 23 bits, 30 bits) with arbitrary lower bits.

Member Author

OK, I think it would also make sense to test something like len(data.getvalue()) == 1 etc.
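The checks suggested in this thread could look roughly like the sketch below. `encode_int` and `decode_int` are hypothetical stand-ins, not the `write_int` from the PR, and their byte layout is an assumption; the point is only the shape of the tests: boundary values, encoded length, and wider integers with arbitrary lower bits.

```python
import random


def encode_int(value: int) -> bytes:
    """Hypothetical stand-in: 1 byte for -10..117, length-prefixed otherwise."""
    if -10 <= value <= 117:
        return bytes([(value + 10) << 1])
    raw = value.to_bytes((value.bit_length() + 8) // 8, "big", signed=True)
    return bytes([1, len(raw)]) + raw


def decode_int(data: bytes) -> int:
    """Inverse of encode_int for the same assumed layout."""
    if data[0] & 1 == 0:
        return (data[0] >> 1) - 10
    return int.from_bytes(data[2 : 2 + data[1]], "big", signed=True)


# Boundary values: only -10..117 should take exactly one byte.
for value in (-11, -10, -9, 116, 117, 118):
    data = encode_int(value)
    assert decode_int(data) == value
    assert (len(data) == 1) == (-10 <= value <= 117)

# Wider integers: fix the top bit, randomize the lower bits.
rng = random.Random(0)
for bits in (15, 23, 30):
    value = (1 << (bits - 1)) | rng.getrandbits(bits - 1)
    assert value.bit_length() == bits
    for v in (value, -value):
        assert decode_int(encode_int(v)) == v
```

Seeding the generator keeps the "arbitrary" lower bits reproducible, so a failure can be replayed.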

@ilevkivskyi
Member Author

@JukkaL

> did you check how much this reduces the size of binary cache files?

It looks like it's ~40% smaller now, for example:

  • .mypy_cache/3.12/types.data.ff 183K (master)
  • .mypy_cache/3.12/types.data.ff 103K (this PR)
  • also btw .mypy_cache/3.12/types.data.json 381K
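As a quick check of the figures above, the relative sizes can be computed directly from the three file sizes quoted in the list (this is only arithmetic on those numbers, not a new measurement):

```python
# Sizes in KB, as quoted in the list above.
master_kb = 183  # types.data.ff on master
pr_kb = 103      # types.data.ff with this PR
json_kb = 381    # types.data.json

reduction = 1 - pr_kb / master_kb  # fraction saved relative to master
vs_json = pr_kb / json_kb          # binary size as a fraction of the JSON size

print(f"reduction vs master: {reduction:.1%}")  # 43.7%
print(f"size relative to JSON: {vs_json:.1%}")  # 27.0%
```

So for this one file the saving is roughly 44%, consistent with the "~40%" estimate, and the binary cache is a bit over a quarter of the JSON size.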
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@ilevkivskyi ilevkivskyi merged commit 6a88c21 into python:master Aug 29, 2025
20 checks passed
@ilevkivskyi ilevkivskyi deleted the compact-int branch August 29, 2025 09:53