-
- Notifications
You must be signed in to change notification settings - Fork 760
Description
Hello!
I recently updated to a more recent version of this repo for my team's git hooks and we noticed that check_case_conflict got a lot slower. I believe I've tracked down what's going on and I have a suggested improvement, though I'm not sure if it's a great approach.
For context, this is the Legends of Runeterra game's monorepo :) It's got around 1m files and some are pretty deeply nested in directories (a little bit more on this at https://technology.riotgames.com/news/legends-runeterra-cicd-pipeline if anyone's curious)
Testing done using Python 3.6.3 on Windows 10 with pre-commit-hooks==4.0.1
and pre-commit==2.13.0
#575 introduced also checking directories, which is great! But this snippet ends up being slow because os.path.dirname
is slow (at least on windows :( )
def parents(file: str) -> Iterator[str]: file = os.path.dirname(file) while file: yield file file = os.path.dirname(file)
I used cProfile
to run this against a single file:
from pre_commit_hooks import check_case_conflict import cProfile cProfile.run('check_case_conflict.find_conflicting_filenames("Jenkinsfile")')
results:
λ py test.py 41899740 function calls in 20.114 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.047 0.047 20.114 20.114 <string>:1(<module>) 4 0.000 0.000 0.000 0.000 _weakrefset.py:38(_remove) 4 0.000 0.000 0.000 0.000 _weakrefset.py:81(add) 3 0.000 0.000 0.277 0.092 check_case_conflict.py:13(lower_set) 3 0.186 0.062 0.277 0.092 check_case_conflict.py:14(<setcomp>) 3189815 1.179 0.000 17.790 0.000 check_case_conflict.py:17(parents) 2 0.000 0.000 18.831 9.416 check_case_conflict.py:24(directories_for) 2 1.042 0.521 18.831 9.416 check_case_conflict.py:25(<setcomp>) 1 0.147 0.147 20.066 20.066 check_case_conflict.py:28(find_conflicting_filenames) 3189815 3.710 0.000 5.528 0.000 ntpath.py:121(splitdrive) 3189815 7.833 0.000 15.701 0.000 ntpath.py:199(split) 3189815 0.910 0.000 16.611 0.000 ntpath.py:240(dirname) 3189815 0.813 0.000 1.216 0.000 ntpath.py:33(_get_bothseps) 2 0.000 0.000 0.000 0.000 subprocess.py:1023(_internal_poll) 2 0.000 0.000 0.022 0.011 subprocess.py:1040(wait) 2 0.000 0.000 0.665 0.332 subprocess.py:1067(_communicate) 22 0.000 0.000 0.000 0.000 subprocess.py:179(Close) ...
I messed with it a bit locally and this approach seems to perform significantly better:
def parents(file: str) -> Iterator[str]: path_parts = file.split(os.sep) file = path_parts[-1] while path_parts: yield file file = path_parts.pop()
λ py test.py 1962384 function calls in 2.469 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.042 0.042 2.469 2.469 <string>:1(<module>) 4 0.000 0.000 0.000 0.000 _weakrefset.py:38(_remove) 4 0.000 0.000 0.000 0.000 _weakrefset.py:81(add) 3 0.000 0.000 0.283 0.094 check_case_conflict.py:13(lower_set) 3 0.195 0.065 0.283 0.094 check_case_conflict.py:14(<setcomp>) 784800 0.369 0.000 0.670 0.000 check_case_conflict.py:17(parents) 2 0.000 0.000 1.079 0.539 check_case_conflict.py:26(directories_for) 2 0.409 0.204 1.079 0.539 check_case_conflict.py:27(<setcomp>) 1 0.231 0.231 2.427 2.427 check_case_conflict.py:30(find_conflicting_filenames) 2 0.000 0.000 0.000 0.000 subprocess.py:1023(_internal_poll) 2 0.000 0.000 0.020 0.010 subprocess.py:1040(wait) 2 0.000 0.000 0.695 0.348 subprocess.py:1067(_communicate) ...
But it doesn't really feel good doing this instead of using the built-in os.path
stuff. I'm not really sure how correct/robust/multi-OS-compatible this is.
I'm curious if anyone has alternative ideas! But also if this is simply out of scope for this tool that's fine too :) Thanks for reading!