Change logging formatting to lazily evaluate (performance critical) #60

OneRaynyDay · 2020-12-21T20:52:16Z

While comparing the performance of presto-python-client with pyhive, I found that this client runs extremely slowly for high-throughput queries. Running line_profiler on fetchall() shows this:

Total time: 42.8376 s
dbapi.py
Function: fetchall at line 470

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
   470                                             def fetchall(self):
   471                                                 # type: () -> List[List[Any]]
   472         1   42837588.0 42837588.0    100.0      return list(self.genall())

Total time: 14.4723 s
client.py
Function: process at line 397

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
   397                                             def process(self, http_response):
   ...
   402       313        336.0        1.1      0.0      http_response.encoding = "utf-8"
   403       313    7311835.0    23360.5     50.5      response = http_response.json()
   404       313    7147288.0    22834.8     49.4      logger.debug("HTTP {}: {}".format(http_response.status_code, response))
   405       313        551.0        1.8      0.0      if "error" in response:
   ...
   428       313        296.0        0.9      0.0              rows=response.get("data", []),
   429       313        281.0        0.9      0.0              columns=response.get("columns"),
   430                                                     )

Total time: 42.4591 s
client.py
Function: __iter__ at line 451

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
   451                                             def __iter__(self):
   ...
   458                                                     # Subsequent fetches from GET requests until next_uri is empty.
   459       314        880.0        2.8      0.0          while not self._query.is_finished():
   460       313   35453558.0   113270.2     83.5              rows = self._query.fetch()
   461    160313      68578.0        0.4      0.2              for row in rows:
   462    160000      90325.0        0.6      0.2                  self._rownumber += 1
   463    160000    6775992.0       42.3     16.0                  logger.debug("row {}".format(row))
   464    160000      69789.0        0.4      0.2                  yield row

Total time: 35.4451 s
client.py
Function: fetch at line 532

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
   532                                             def fetch(self):
   533                                                 # type: () -> List[List[Any]]
   534                                                 """Continue fetching data for the current query_id"""
   535       313   20928528.0    66864.3     59.0      response = self._request.get(self._request.next_uri)
   536       313   14479264.0    46259.6     40.8      status = self._request.process(response)
   537       313        252.0        0.8      0.0      if status.columns:
   ...

Zooming into two lines:

   404       313    7147288.0    22834.8     49.4  logger.debug("HTTP {}: {}".format(http_response.status_code, response))
   463    160000    6775992.0       42.3     16.0  logger.debug("row {}".format(row))

Although our log level is set to INFO, these strings are formatted eagerly and then discarded. The formatting alone takes about 7.1 seconds and 6.8 seconds respectively, out of roughly 43 seconds in total. For context, PyHive takes roughly 28 seconds. The exact setup to reproduce this benchmark is redacted, but a comparable benchmark would be running SELECT * FROM some_big_table against a localhost coordinator. We would love to use this Presto client in our production workflow, but given this performance issue we have decided not to adopt it until a fix has been applied.
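To illustrate the difference between eager and lazy formatting, here is a minimal sketch using only the standard `logging` module. It is not the client's code: the logger name and the `CountingRow` class are hypothetical stand-ins that count how often a row is rendered to text while DEBUG records are disabled.

```python
import logging

# Hypothetical logger, configured like the benchmark: level INFO, so DEBUG is dropped.
logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.NullHandler())
logger.propagate = False


class CountingRow:
    """Hypothetical stand-in for a result row; counts how often it is rendered."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "row-data"


eager_row = CountingRow()
lazy_row = CountingRow()

for _ in range(1000):
    # Eager: str.format() builds the message before logger.debug() is called,
    # so the row is rendered even though the record is then discarded.
    logger.debug("row {}".format(eager_row))
    # Lazy: the arguments are passed through unformatted; the logger checks the
    # level first and only interpolates "%s" if the record will be emitted.
    logger.debug("row %s", lazy_row)

print(eager_row.renders, lazy_row.renders)  # prints: 1000 0
```

If computing an argument is itself expensive (not just rendering it), the call can additionally be guarded with `logger.isEnabledFor(logging.DEBUG)`.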

cla-bot · 2020-12-21T20:52:17Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

findepi · 2020-12-21T21:10:35Z

@OneRaynyDay just making sure -- have you filed the CLA already?

OneRaynyDay · 2020-12-22T01:23:51Z

> @OneRaynyDay just making sure -- have you filed the CLA already?

Yep, I sent an email to cla@prestosql.io regarding this PR with a signed document for CLA. Let me know if you need anything on my side before we merge it :)

martint · 2020-12-23T04:01:09Z

@cla-bot check

cla-bot · 2020-12-23T04:01:11Z

The cla-bot has been summoned, and re-checked this pull request!

python2 logging style

abea26f

findepi approved these changes Dec 21, 2020

cla-bot bot added the cla-signed label Dec 23, 2020

findepi merged commit 33aeede into trinodb:master Dec 27, 2020

findepi mentioned this pull request Jan 12, 2021

Change logging formatting to lazily evaluate (performance critical) #59

Closed
