
Conversation

@fruch

@fruch fruch commented Mar 11, 2020

No description provided.

@fruch fruch requested review from bentsi and vladzcloudius March 12, 2020 07:07
@fruch fruch self-assigned this Mar 12, 2020
@vladzcloudius

Please don't squash commits - you can always do it later.
Please put a proper description of what each patch is about.

@ultrabug
Collaborator

ultrabug commented Apr 8, 2020

Hi guys, as discussed on Slack I've pushed the changes necessary for a more graceful shard-aware connection pooling ;)

@fruch
Author

fruch commented Apr 13, 2020

@vladzcloudius care to take a look at the test in 5b636fc?

I guess you have some more ideas on how/what should be tested in that regard.

@vladzcloudius

@fruch I'd like to see how the driver behaves when you tear connections down: send a RST from the remote side.
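A minimal sketch of how a test could provoke such a teardown, assuming a plain TCP socket under the test's control (this is not part of the driver's API): enabling SO_LINGER with a zero timeout makes `close()` abort the connection, so the peer receives a RST instead of a normal FIN.

```python
import socket
import struct

def close_with_rst(sock: socket.socket) -> None:
    """Close a socket so the peer sees a TCP RST rather than a graceful FIN.

    l_onoff=1, l_linger=0 tells the kernel to abort the connection on close().
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack('ii', 1, 0))
    sock.close()
```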

@dimaqq dimaqq mentioned this pull request Jun 2, 2020
fruch added 3 commits June 2, 2020 13:33
* unit test for parsing and shard_id calculation
* integration test using trace to verify requests going to the correct shard
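A minimal sketch of the kind of unit test the first item describes. The formula follows Scylla's publicly documented shard selection (bias the signed Murmur3 token into uint64 space, drop the ignored most-significant bits, scale into the shard range); the helper name and the test itself are illustrative assumptions, not the driver's actual code.

```python
import unittest

def shard_id_from_token(token: int, nr_shards: int, ignore_msb_bits: int) -> int:
    """Map a signed Murmur3 token to a shard id (illustrative implementation)."""
    biased = (token + 2**63) & 0xFFFFFFFFFFFFFFFF        # signed -> uint64
    biased = (biased << ignore_msb_bits) & 0xFFFFFFFFFFFFFFFF  # drop ignored MSBs
    return (biased * nr_shards) >> 64                     # scale into [0, nr_shards)

class ShardIdCalculationTest(unittest.TestCase):
    def test_tokens_map_into_shard_range(self):
        nr_shards, ignore_msb_bits = 12, 12
        for token in (-2**63, -1, 0, 1, 2**63 - 1):
            shard = shard_id_from_token(token, nr_shards, ignore_msb_bits)
            self.assertTrue(0 <= shard < nr_shards)

if __name__ == '__main__':
    unittest.main()
```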
Since the callback is async, to be on the safe side, add a lock around the `set.remove()` call.
fruch and others added 15 commits June 2, 2020 13:33
Refactor the logic of re-opening connections to all shards into its own function so we can reuse it in `_replace()`.
Based on @ultrabug's comment, change the debug print to be clearer and more informative.
Based on @ultrabug's comments, this should be named better.
Better explain why we loop at most twice over the shards in this function.
The protocol does not (yet?) allow clients to specify which shard they want to connect to, so they depend on a round-robin allocation of the shard_id made by the host node (see the system.clients table). This means that on a live cluster where client connections come and go, we cannot guarantee that we will get a connection to every shard... even by retrying twice (which also slows down connection startup). This commit switches to an optimistic approach where we initially try to connect as many times as there are shards on the remote cluster. Then, when a routing_key is used and shard-aware connection picking can take place, we try to open missing connections as we detect them. This is more graceful: it allows us to not fail if we miss shard connections, and it reduces connection startup time! A long-running client will hopefully get a connection to all shards after a while. That's the best we can do for now until the protocol evolves.
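A minimal sketch of that optimistic strategy. The class, the injected `open_connection` / `schedule` callables, and the attribute names are hypothetical stand-ins for the driver's pool internals, not its real API.

```python
class ShardAwarePool:
    """Optimistically keep one connection per shard, filling gaps lazily."""

    def __init__(self, nr_shards, open_connection, schedule):
        self.nr_shards = nr_shards
        self._open_connection = open_connection  # hypothetical: returns an object with .shard_id
        self._schedule = schedule                # hypothetical: runs a callable in the background
        self._connections = {}                   # shard_id -> connection

    def connect_initial(self):
        # Optimistic: as many attempts as there are shards, no retry loop;
        # the node assigns shards round-robin, so some shards may stay unconnected.
        for _ in range(self.nr_shards):
            conn = self._open_connection()
            self._connections.setdefault(conn.shard_id, conn)

    def borrow_connection(self, shard_id):
        conn = self._connections.get(shard_id)
        if conn is None:
            # Miss detected via the routing key: lazily schedule one more attempt
            # for this shard and fall back to any existing connection for now.
            self._schedule(self._open_missing_shard, shard_id)
            conn = next(iter(self._connections.values()), None)
        return conn

    def _open_missing_shard(self, shard_id):
        conn = self._open_connection()
        if conn.shard_id not in self._connections:
            self._connections[conn.shard_id] = conn
        else:
            conn.close()  # landed on a shard we already cover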
Initial connection attempts to shards are now scheduled instead of blocking on startup. This allows the shard-aware driver to connect to a Scylla cluster as fast as to a Cassandra one!
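A minimal sketch of the scheduling idea, assuming a generic `ThreadPoolExecutor` and a hypothetical `open_connection_to_shard` callable; the driver's actual scheduling machinery differs.

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_initial_shard_connections(executor, host, nr_shards, open_connection_to_shard):
    """Submit one connection attempt per remote shard without blocking startup.

    open_connection_to_shard(host, shard_id) is a hypothetical callable that
    opens a single connection; startup continues while the attempts run.
    """
    return [executor.submit(open_connection_to_shard, host, shard_id)
            for shard_id in range(nr_shards)]

# Illustrative usage:
# executor = ThreadPoolExecutor(max_workers=4)
# futures = schedule_initial_shard_connections(executor, host, nr_shards, open_connection_to_shard)
```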
* add test for multiple clients at the same time
* add test for closing connections
* add test for blocking (timing out) connections
On busy systems we could overwhelm the thread pool executor queue with repeated submissions of speculative shard connections to the same shard, which results in an unbounded number of connection openings to Scylla nodes. This could be seen as a "connection leak" and also did not respect the signal of cluster connections shutting down.
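A minimal sketch of the guard described above, with hypothetical names: record which shard attempts are already in flight so the same shard is not submitted to the executor repeatedly, and stop scheduling once shutdown has been signalled.

```python
import threading

class SpeculativeShardConnector:
    """Avoid flooding the executor with duplicate shard connection attempts."""

    def __init__(self, executor, open_connection_to_shard):
        self._executor = executor
        self._open = open_connection_to_shard  # hypothetical callable(shard_id)
        self._lock = threading.Lock()
        self._in_flight = set()                # shard_ids currently being opened
        self._shutdown = False

    def maybe_connect(self, shard_id):
        with self._lock:
            if self._shutdown or shard_id in self._in_flight:
                return  # already being opened, or the cluster is shutting down
            self._in_flight.add(shard_id)
        self._executor.submit(self._connect, shard_id)

    def _connect(self, shard_id):
        try:
            self._open(shard_id)
        finally:
            # Runs on an executor thread, so removal is guarded by the lock.
            with self._lock:
                self._in_flight.discard(shard_id)

    def shutdown(self):
        with self._lock:
            self._shutdown = True
```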
@fruch
Author

fruch commented Jun 2, 2020

@ultrabug FYI, I've rebased and force-pushed this branch (not very social of me... but I want Travis to start working on top of this PR too).

@fruch fruch merged commit a49555b into scylladb:master Jun 4, 2020
