@brandtbucher brandtbucher commented Jul 7, 2025

As the new comment says, manual review of -O3, -O2, and -Os suggests that -Os generates the best code for the JIT's use case. The performance impact is close to noise: slightly positive on x86-64 Linux and AArch64 macOS, neutral on AArch64 Linux, and slightly negative on x86-64 Windows. According to the stats, the size of JIT code is down by about 1-2%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250628-3.15.0a0-33054dd-JIT/README.md

Here's an example of how skipping tail-duplication removes an extra jump and a duplicate instruction from _POP_TOP (also reducing its size by 19%):

```diff
-    // 11: 75 04                         jne     0x17 <_JIT_ENTRY+0x17>
+    // 11: 75 0f                         jne     0x22 <_JIT_ENTRY+0x22>
     // 13: ff 0f                         decl    (%rdi)
-    // 15: 74 07                         je      0x1e <_JIT_ENTRY+0x1e>
-    // 17: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    // 1c: eb 10                         jmp     0x2e <_JIT_CONTINUE>
-    // 1e: 50                            pushq   %rax
-    // 1f: ff 15 00 00 00 00             callq   *(%rip)                 # 0x25 <_JIT_ENTRY+0x25>
-    // 0000000000000021:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
-    // 25: 48 83 c4 08                   addq    $0x8, %rsp
-    // 29: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    const unsigned char code_body[46] = {
+    // 15: 75 0b                         jne     0x22 <_JIT_ENTRY+0x22>
+    // 17: 50                            pushq   %rax
+    // 18: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1e <_JIT_ENTRY+0x1e>
+    // 000000000000001a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
+    // 1e: 48 83 c4 08                   addq    $0x8, %rsp
+    // 22: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
+    const unsigned char code_body[39] = {
         0x49, 0x8b, 0x7d, 0xf8, 0x49, 0x83, 0xc5, 0xf8,
         0x4d, 0x89, 0x6c, 0x24, 0x40, 0x40, 0xf6, 0xc7,
-        0x01, 0x75, 0x04, 0xff, 0x0f, 0x74, 0x07, 0x4d,
-        0x8b, 0x6c, 0x24, 0x40, 0xeb, 0x10, 0x50, 0xff,
-        0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc4,
-        0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
+        0x01, 0x75, 0x0f, 0xff, 0x0f, 0x75, 0x0b, 0x50,
+        0xff, 0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83,
+        0xc4, 0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
     };
```

Full diff for the stencils here:

https://gist.github.com/brandtbucher/7340be56f2d2cf7061b5c9bf1c87939c
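
For anyone reading the byte-level _POP_TOP diff by hand: short x86-64 jumps (`74`, `75`, or `eb` followed by one byte) are encoded relative to the end of the two-byte instruction. The sketch below is illustrative only. `rel8_target` is a hypothetical helper, not part of the JIT tooling, and it simply recomputes the jump targets that appear in the disassembly comments.

```python
def rel8_target(offset: int, displacement: int) -> int:
    """Return the target of a 2-byte short jump located at `offset`.

    `displacement` is the raw second byte of the instruction. It is a signed
    8-bit value measured from the end of the instruction (offset + 2).
    """
    if not 0 <= displacement <= 0xFF:
        raise ValueError("displacement must fit in one byte")
    signed = displacement - 0x100 if displacement >= 0x80 else displacement
    return offset + 2 + signed


# Before (-O3): `75 04` at 0x11 lands on the first copy of the duplicated tail.
before = rel8_target(0x11, 0x04)
# After (-Os): `75 0f` at 0x11 and `75 0b` at 0x15 both land on the one shared tail.
after_first = rel8_target(0x11, 0x0F)
after_second = rel8_target(0x15, 0x0B)
print(hex(before), hex(after_first), hex(after_second))  # 0x17 0x22 0x22
```

With the tail no longer duplicated, both conditional jumps share a single `movq 0x40(%r12), %r13`, so the extra `jmp` to `_JIT_CONTINUE` is not needed.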

@brandtbucher brandtbucher self-assigned this Jul 7, 2025
@brandtbucher brandtbucher added performance Performance or resource usage skip news interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-JIT labels Jul 7, 2025
```python
            f"-I{CPYTHON / 'Python'}",
            f"-I{CPYTHON / 'Tools' / 'jit'}",
            "-O3",
            # -O2 and -O3 include some optimizations that make sense for
```
Member

Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious if it's worth investigating as well.

Member Author

Nice idea! I'm definitely down to try benchmarking it after this lands.

I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side effect of not aligning jumps or duplicating tails, etc.). So smaller isn't necessarily always better.
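
One rough way to compare the levels on a standalone C file is sketched below. This is an assumption-laden illustration, not the JIT build: `clang_command` and `object_sizes` are hypothetical helpers, `clang` is assumed to be on `PATH`, and object-file size is only a proxy for code size (the numbers in this thread come from the generated stencils and benchmark runs instead).

```python
import os
import subprocess
import tempfile

# Optimization levels discussed in this thread; all are standard clang flags.
LEVELS = ("-O2", "-O3", "-Os", "-Oz")


def clang_command(source: str, output: str, level: str) -> list[str]:
    """Build a clang invocation that compiles `source` to an object file."""
    if level not in LEVELS:
        raise ValueError(f"unexpected optimization level: {level}")
    return ["clang", level, "-c", source, "-o", output]


def object_sizes(source: str) -> dict[str, int]:
    """Compile `source` once per level and return each object file's size in bytes."""
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        for level in LEVELS:
            output = os.path.join(tmp, f"out{level}.o")
            subprocess.run(clang_command(source, output, level), check=True)
            sizes[level] = os.path.getsize(output)
    return sizes
```

Calling `object_sizes("example.c")` (with a C file of your own) returns a mapping from each flag to a byte count. A smaller number does not imply faster code, which is the point made above about -Oz.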

Member Author

Yeah, I'm not sure -Oz is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO goes from this on -Os:

```
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50                            pushq   %rax
// 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
// 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
// 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
// d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
// 12: 40 f6 c7 01                  testb   $0x1, %dil
// 16: 75 0a                        jne     0x22 <_JIT_ENTRY+0x22>
// 18: ff 0f                        decl    (%rdi)
// 1a: 75 06                        jne     0x22 <_JIT_ENTRY+0x22>
// 1c: ff 15 00 00 00 00            callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
// 000000000000001e:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
// 22: 49 83 44 24 40 f8            addq    $-0x8, 0x40(%r12)
// 28: f6 c3 01                     testb   $0x1, %bl
// 2b: 75 0d                        jne     0x3a <_JIT_ENTRY+0x3a>
// 2d: ff 0b                        decl    (%rbx)
// 2f: 75 09                        jne     0x3a <_JIT_ENTRY+0x3a>
// 31: 48 89 df                     movq    %rbx, %rdi
// 34: ff 15 00 00 00 00            callq   *(%rip)                 # 0x3a <_JIT_ENTRY+0x3a>
// 0000000000000036:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
// 3a: 4d 8b 6c 24 40               movq    0x40(%r12), %r13
// 3f: 58                           popq    %rax
```

Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):

```
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50                            pushq   %rax
// 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
// 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
// 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
// d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
// 12: e8 16 00 00 00               callq   0x2d <PyStackRef_CLOSE>
// 17: 49 83 44 24 40 f8            addq    $-0x8, 0x40(%r12)
// 1d: 48 89 df                     movq    %rbx, %rdi
// 20: e8 08 00 00 00               callq   0x2d <PyStackRef_CLOSE>
// 25: 4d 8b 6c 24 40               movq    0x40(%r12), %r13
// 2a: 58                           popq    %rax
// 2b: eb 11                        jmp     0x3e <_JIT_CONTINUE>
//
// 000000000000002d <PyStackRef_CLOSE>:
// 2d: 40 f6 c7 01                  testb   $0x1, %dil
// 31: 75 04                        jne     0x37 <PyStackRef_CLOSE+0xa>
// 33: ff 0f                        decl    (%rdi)
// 35: 74 01                        je      0x38 <PyStackRef_CLOSE+0xb>
// 37: c3                           retq
// 38: ff 25 00 00 00 00            jmpq    *(%rip)                 # 0x3e <_JIT_CONTINUE>
// 000000000000003a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
```

I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
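
The "2 bytes shorter" figure can be rechecked from the two _POP_TWO listings. The sketch below is only an arithmetic aid with hypothetical helper names (`code_size`, `rel32_call_target`); it takes the offset and length of the last instruction in each listing, and also decodes the `e8` call operands to confirm that both calls reach the outlined `PyStackRef_CLOSE`.

```python
def code_size(last_offset: int, last_length: int) -> int:
    """Total size of a listing, given its last instruction's offset and byte length."""
    return last_offset + last_length


def rel32_call_target(offset: int, operand: bytes) -> int:
    """Target of a 5-byte `e8` near call at `offset` with a little-endian rel32 operand."""
    if len(operand) != 4:
        raise ValueError("rel32 operand must be exactly 4 bytes")
    return offset + 5 + int.from_bytes(operand, "little", signed=True)


size_os = code_size(0x3F, 1)  # -Os listing ends with the 1-byte `popq %rax` at 0x3f
size_oz = code_size(0x38, 6)  # -Oz listing ends with the 6-byte `jmpq *(%rip)` at 0x38
print(size_os, size_oz, size_os - size_oz)  # 64 62 2

# Both calls in the -Oz listing resolve to the outlined helper at 0x2d.
first_call = rel32_call_target(0x12, bytes([0x16, 0x00, 0x00, 0x00]))
second_call = rel32_call_target(0x20, bytes([0x08, 0x00, 0x00, 0x00]))
print(hex(first_call), hex(second_call))  # 0x2d 0x2d
```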

Member Author

Yep, -Oz is about 1-2% slower across the board.

@brandtbucher brandtbucher merged commit c49dc3b into python:main Jul 9, 2025
72 checks passed