Skip to content

Conversation

@OneRaynyDay
Copy link
Member

@OneRaynyDay OneRaynyDay commented Dec 15, 2020

When comparing presto-python-client with pyhive for performance reasons, I found this client to run extremely slow for queries with large throughput. Running line_profiler on fetchall(), we see this:

Total time: 42.8376 s dbapi.py Function: fetchall at line 470 Line # Hits Time Per Hit % Time Line Contents ============================================================== 470 def fetchall(self): 471 # type: () -> List[List[Any]] 472 1 42837588.0 42837588.0 100.0 return list(self.genall()) Total time: 14.4723 s client.py Function: process at line 397 Line # Hits Time Per Hit % Time Line Contents ============================================================== 397 def process(self, http_response): ... 402 313 336.0 1.1 0.0 http_response.encoding = "utf-8" 403 313 7311835.0 23360.5 50.5 response = http_response.json() 404 313 7147288.0 22834.8 49.4 logger.debug("HTTP {}: {}".format(http_response.status_code, response)) 405 313 551.0 1.8 0.0 if "error" in response: ... 428 313 296.0 0.9 0.0 rows=response.get("data", []), 429 313 281.0 0.9 0.0 columns=response.get("columns"), 430 ) Total time: 42.4591 s client.py Function: __iter__ at line 451 Line # Hits Time Per Hit % Time Line Contents ============================================================== 451 def __iter__(self): ... 458 # Subsequent fetches from GET requests until next_uri is empty. 459 314 880.0 2.8 0.0 while not self._query.is_finished(): 460 313 35453558.0 113270.2 83.5 rows = self._query.fetch() 461 160313 68578.0 0.4 0.2 for row in rows: 462 160000 90325.0 0.6 0.2 self._rownumber += 1 463 160000 6775992.0 42.3 16.0 logger.debug("row {}".format(row)) 464 160000 69789.0 0.4 0.2 yield row Total time: 35.4451 s client.py Function: fetch at line 532 Line # Hits Time Per Hit % Time Line Contents ============================================================== 532 def fetch(self): 533 # type: () -> List[List[Any]] 534 """Continue fetching data for the current query_id""" 535 313 20928528.0 66864.3 59.0 response = self._request.get(self._request.next_uri) 536 313 14479264.0 46259.6 40.8 status = self._request.process(response) 537 313 252.0 0.8 0.0 if status.columns: ... 

Zooming into two lines:

 404 313 7147288.0 22834.8 49.4 logger.debug("HTTP {}: {}".format(http_response.status_code, response)) 463 160000 6775992.0 42.3 16.0 logger.debug("row {}".format(row)) 

We see that although our LEVEL is set to INFO, the formatting on these strings are done eagerly and then discarded. The formatting itself takes a whole 7 seconds and 6 seconds respectively out of 43 seconds. For context, Pyhive takes roughly 28 seconds. The exact setup to reproduce this benchmark is redacted, but a comparable benchmark could be running SELECT * FROM some_big_table from a localhost coordinator. We would love to use this presto client in our production workflow but with the performance issue here we have decided not to consider it until a fix has been applied.

To do this, people typically use logging.info("%s", x) syntax. However, the official python docs don't encourage this style. So how can we use logging.info("{}", x), preserving our original logs and formatting style? The change here is a common one - the logging library provides a LoggerAdapter class that allow us to return a "logger" of the same API but perform python3 formatting style only if the level is appropriate (lazily evaluated). For example, see the official python cookbook.

@cla-bot
Copy link

cla-bot bot commented Dec 15, 2020

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

@cla-bot
Copy link

cla-bot bot commented Dec 15, 2020

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

"redirect {} from {} to {}".format(
http_response.status_code, location, url
)
"redirect {} from {} to {}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely a good improvement!

nit: do you think it would be ok to format this in one line

logger.info("redirect {} from {} to {}", http_response.status_code, location, url) 

?

(BTW i think this could go at debug level, but that's perhaps out of scope of this PR)

Copy link
Member Author

@OneRaynyDay OneRaynyDay Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah of course @ formatting :)

logger = logging.getLogger(name)
logger.setLevel(log_level)
return logger
return FormatLogger(logger)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the builtin logging's formatting capabilities?
Is there something we could use to avoid adding a wrapper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builtin logging's simplest formatting capabilities can allow you to lazily log in the old python2 syntax:

logging.log("%s", something) 

There currently is no support for python 3 formatting due to ambiguous kwargs that might arise. In general, python2 style is not encouraged. @ wrapper, the LoggingAdapter is actually a pattern that python suggests on using - we're not actually adding any overhead because all I've done is replace the string processing function on the logger level :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMo the old-school %s formatting is good enough for our modest needs.
I would rather keep the code simple using what is immediately available from the standard library -- and wait for the standard library to gain more reach formatting options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I can open another PR for that instead 👍

@findepi
Copy link
Member

findepi commented Jan 12, 2021

I believe this has been replaced with #60, so let me close this.
Please reopen if i missed anything.

@findepi findepi closed this Jan 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants