Skip to content

Conversation

@OneRaynyDay
Copy link
Member

When comparing presto-python-client with pyhive for performance reasons, I found this client to run extremely slow for queries with large throughput. Running line_profiler on fetchall(), we see this:

Total time: 42.8376 s dbapi.py Function: fetchall at line 470 Line # Hits Time Per Hit % Time Line Contents ============================================================== 470 def fetchall(self): 471 # type: () -> List[List[Any]] 472 1 42837588.0 42837588.0 100.0 return list(self.genall()) Total time: 14.4723 s client.py Function: process at line 397 Line # Hits Time Per Hit % Time Line Contents ============================================================== 397 def process(self, http_response): ... 402 313 336.0 1.1 0.0 http_response.encoding = "utf-8" 403 313 7311835.0 23360.5 50.5 response = http_response.json() 404 313 7147288.0 22834.8 49.4 logger.debug("HTTP {}: {}".format(http_response.status_code, response)) 405 313 551.0 1.8 0.0 if "error" in response: ... 428 313 296.0 0.9 0.0 rows=response.get("data", []), 429 313 281.0 0.9 0.0 columns=response.get("columns"), 430 ) Total time: 42.4591 s client.py Function: __iter__ at line 451 Line # Hits Time Per Hit % Time Line Contents ============================================================== 451 def __iter__(self): ... 458 # Subsequent fetches from GET requests until next_uri is empty. 459 314 880.0 2.8 0.0 while not self._query.is_finished(): 460 313 35453558.0 113270.2 83.5 rows = self._query.fetch() 461 160313 68578.0 0.4 0.2 for row in rows: 462 160000 90325.0 0.6 0.2 self._rownumber += 1 463 160000 6775992.0 42.3 16.0 logger.debug("row {}".format(row)) 464 160000 69789.0 0.4 0.2 yield row Total time: 35.4451 s client.py Function: fetch at line 532 Line # Hits Time Per Hit % Time Line Contents ============================================================== 532 def fetch(self): 533 # type: () -> List[List[Any]] 534 """Continue fetching data for the current query_id""" 535 313 20928528.0 66864.3 59.0 response = self._request.get(self._request.next_uri) 536 313 14479264.0 46259.6 40.8 status = self._request.process(response) 537 313 252.0 0.8 0.0 if status.columns: ... 

Zooming into two lines:

 404 313 7147288.0 22834.8 49.4 logger.debug("HTTP {}: {}".format(http_response.status_code, response)) 463 160000 6775992.0 42.3 16.0 logger.debug("row {}".format(row)) 

We see that although our LEVEL is set to INFO, the formatting on these strings are done eagerly and then discarded. The formatting itself takes a whole 7 seconds and 6 seconds respectively out of 43 seconds. For context, Pyhive takes roughly 28 seconds. The exact setup to reproduce this benchmark is redacted, but a comparable benchmark could be running SELECT * FROM some_big_table from a localhost coordinator. We would love to use this presto client in our production workflow but with the performance issue here we have decided not to consider it until a fix has been applied.

@cla-bot
Copy link

cla-bot bot commented Dec 21, 2020

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

@findepi
Copy link
Member

findepi commented Dec 21, 2020

@OneRaynyDay just making sure -- have you filed the CLA already?

@OneRaynyDay
Copy link
Member Author

@OneRaynyDay just making sure -- have you filed the CLA already?

Yep, I sent an email to cla@prestosql.io regarding this PR with a signed document for CLA. Let me know if you need anything on my side before we merge it :)

@martint
Copy link
Member

martint commented Dec 23, 2020

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Dec 23, 2020
@cla-bot
Copy link

cla-bot bot commented Dec 23, 2020

The cla-bot has been summoned, and re-checked this pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

3 participants