This repository was archived by the owner on Dec 13, 2023. It is now read-only.
39 changes: 23 additions & 16 deletions 3.4/smart-joins.md → 3.4/smartjoins.md
@@ -1,9 +1,12 @@
---
layout: default
description: Introduced in
description: SmartJoins allow you to execute co-located join operations among identically sharded collections.
title: SmartJoins for ArangoDB Clusters
redirect_from:
- /3.4/smart-joins.html
---
Smart Joins
===========
SmartJoins
==========

<small>Introduced in: v3.4.5, v3.5.0</small>

@@ -12,8 +15,12 @@ This feature is only available in the
[**Enterprise Edition**](https://www.arangodb.com/why-arangodb/arangodb-enterprise/){:target="_blank"}
{% endhint %}

When doing joins in an ArangoDB cluster, data has to be exchanged between different servers.
SmartJoins allow you to execute co-located join operations among identically sharded collections.

Cluster joins without being smart
---------------------------------

When doing joins in an ArangoDB cluster, data has to be exchanged between different servers.
Joins between different collections in a cluster normally require roundtrips between the
shards of these collections for fetching the data. Requests are routed through an extra
coordinator hop.
@@ -140,8 +147,8 @@ This is a precondition for running joins locally, and thanks to the effects of
`distributeShardsLike` it is now satisfied!
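
As an illustration of why identical sharding makes a local join possible, here is a minimal sketch in plain JavaScript. It assumes a toy hash function (`toyHash`, not ArangoDB's actual shard hashing) and a hypothetical attribute name `parent` for the second collection's shard key. The only point is that equal shard key values map to the same shard index when both collections use the same number of shards and the same placement function.

```javascript
// Toy hash, NOT ArangoDB's real shard hashing; it only needs to be deterministic.
function toyHash(value) {
  let h = 0;
  for (const ch of String(value)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep the value in unsigned 32-bit range
  }
  return h;
}

// Shard index of a document, given its collection's shard key attribute.
function shardFor(doc, shardKey, numberOfShards) {
  return toyHash(doc[shardKey]) % numberOfShards;
}

const numberOfShards = 4;
const c1Doc = { _key: "test42" };                // first collection, sharded by _key
const c2Doc = { _key: "abc", parent: "test42" }; // second collection, sharded by "parent" (hypothetical)

// Both documents carry the same join value, so they land on the same shard index.
const sameShard =
  shardFor(c1Doc, "_key", numberOfShards) === shardFor(c2Doc, "parent", numberOfShards);
console.log("co-located:", sameShard);
```

Because the matching documents sit on the same shard index, a join on these values does not need to fetch data from any other shard.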


Smart joins using distributeShardsLike
--------------------------------------
SmartJoins using distributeShardsLike
-------------------------------------

With the two collections in place like this, an AQL query that uses a FILTER condition
that refers from the shard key of the one collection to the shard key of the other collection
@@ -166,7 +173,7 @@ As can be seen above, the extra hop via the coordinator is gone here, which will
less cluster-internal traffic and a faster response time.


Smart joins will also work if the shard key of the second collection is not *_key*,
SmartJoins will also work if the shard key of the second collection is not *_key*,
and even for non-unique shard key values, e.g.:

arangosh> db._create("c1", {numberOfShards: 4, shardKeys: ["_key"]});
@@ -194,13 +201,13 @@ and even for non-unique shard key values, e.g.:
6 ReturnNode COOR 2000 - RETURN doc1

{% hint 'tip' %}
All above examples used two collections only. Smart joins will also work when joining
All above examples used two collections only. SmartJoins will also work when joining
more than two collections which have the same data distribution enforced via their
`distributeShardsLike` attribute and using the shard keys as the join criteria as shown above.
{% endhint %}

Smart joins using smartJoinAttribute
------------------------------------
SmartJoins using smartJoinAttribute
-----------------------------------

In case the join on the second collection must be performed on a non-shard key
attribute, there is the option to specify a *smartJoinAttribute* for the collection.
@@ -252,8 +259,8 @@ The join can now be performed via the collection's *smartJoinAttribute*:
6 ReturnNode COOR 101 - RETURN doc1


Restricting smart joins to a single shard
-----------------------------------------
Restricting SmartJoins to a single shard
----------------------------------------

If a FILTER condition is used on one of the shard keys, the optimizer will also try
to restrict the queries to just the required shards:
@@ -277,12 +284,12 @@ to restrict the queries to just the required shards:
Limitations
-----------

In ArangoDB 3.4, the smart join optimization must explicitly be turned on in the
In ArangoDB 3.4, the SmartJoin optimization must explicitly be turned on in the
server configuration, using the startup option `--query.smart-joins true`. If that
configuration is not set, the smart join optimization will not be performed.
configuration is not set, the SmartJoin optimization will not be performed.
Future versions of ArangoDB will lift that requirement.

The smart join optimization is currently triggered only for data selection queries,
The SmartJoin optimization is currently triggered only for data selection queries,
but not for any data-manipulation operations such as INSERT, UPDATE, REPLACE, REMOVE
or UPSERT, nor for traversals or subqueries.

@@ -294,5 +301,5 @@ It is restricted to be used with simple shard key attributes (such as `_key`, `p
but not with nested attributes (e.g. `name.first`). There should be exactly one shard
key attribute defined for each collection.

Finally, the smart join optimization requires that the collections are joined on their
Finally, the SmartJoin optimization requires that the collections are joined on their
shard key attributes (or smartJoinAttribute) using an equality comparison.
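
The shard key restrictions above can be sketched as a small check. This is a hypothetical helper (`isEligibleShardKeys` is not part of ArangoDB), assuming the shard keys are given as an array of attribute names, as in the `shardKeys` option used in the examples, and that a nested attribute is recognizable by a dot in its name.

```javascript
// Hypothetical helper: checks the two shard key restrictions named in the text.
// 1. exactly one shard key attribute per collection
// 2. a simple attribute name, not a nested one such as "name.first"
function isEligibleShardKeys(shardKeys) {
  return (
    Array.isArray(shardKeys) &&
    shardKeys.length === 1 &&
    typeof shardKeys[0] === "string" &&
    !shardKeys[0].includes(".")
  );
}

console.log(isEligibleShardKeys(["_key"]));         // simple attribute
console.log(isEligibleShardKeys(["name.first"]));   // nested attribute
console.log(isEligibleShardKeys(["_key", "type"])); // more than one shard key
```

The check only covers the shard key shape. The join itself must still use an equality comparison on those attributes.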
30 changes: 15 additions & 15 deletions 3.5/release-notes-new-features35.md
@@ -73,12 +73,12 @@ paths of increasing length from a start vertex to a target vertex. For more deta
see the [k Shortest Paths documentation](aql/graphs-kshortest-paths.html).


Smart Joins
-----------
SmartJoins
----------

The "smart joins" feature available in the ArangoDB Enterprise Edition allows running
joins between two sharded collections with performance close to that of a local join
operation.
The SmartJoins feature available in the ArangoDB Enterprise Edition allows
running joins between two sharded collections with performance close to that
of a local join operation.

The prerequisite for this is that the two collections have an identical sharding setup,
established via the `distributeShardsLike` attribute of one of the collections.
@@ -90,15 +90,15 @@ Quick example setup for two collections with identical sharding:
> db.orders.ensureIndex({ type: "hash", fields: ["productId"] });

Now an AQL query that joins the two collections via their shard keys will benefit from
the smart join optimization, e.g.
the SmartJoin optimization, e.g.

FOR p IN products
FOR o IN orders
FILTER p._key == o.productId
RETURN o

This query's execution plan can skip the extra hop via the coordinator
that is normally there for generic joins. Thanks to the smart join optimization,
that is normally there for generic joins. Thanks to the SmartJoin optimization,
the query's execution is as simple as:

Execution plan:
@@ -110,7 +110,7 @@ the query's execution is as simple as:
11 GatherNode COOR 0 - GATHER
6 ReturnNode COOR 0 - RETURN o

Without the smart join optimization, there will be an extra hop via the
Without the SmartJoin optimization, there will be an extra hop via the
coordinator for shipping the data from each shard of the one collection to
each shard of the other collection, which will be a lot more expensive:

@@ -127,25 +127,25 @@ each shard of the other collection, which will be a lot more expensive:
11 GatherNode COOR 3 - GATHER
6 ReturnNode COOR 3 - RETURN o

In the end, smart joins can optimize away a lot of the inter-node network
In the end, SmartJoins can optimize away a lot of the inter-node network
requests normally required for performing a join between sharded collections.
The performance advantage of smart joins compared to regular joins will grow
The performance advantage of SmartJoins compared to regular joins will grow
with the number of shards of the underlying collections.

In general, for two collections with `n` shards each, the minimal number of
network requests for the general join (_no_ smart joins optimization) will be
network requests for the general join (_no_ SmartJoins optimization) will be
`n * (n + 2)`. The number of network requests increases quadratically with the
number of shards.

Smart joins can get away with a minimal number of `n` requests here, which scales
SmartJoins can get away with a minimal number of `n` requests here, which scales
linearly with the number of shards.
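
To make the difference concrete, the two formulas above can be evaluated for a few shard counts. This is a minimal sketch in plain JavaScript. The function names are illustrative only, and the numbers are the minimal request counts given in the text, not measurements.

```javascript
// Minimal number of network requests for n shards per collection, as stated in the text.
function generalJoinRequests(n) {
  return n * (n + 2); // general join, no SmartJoins optimization
}

function smartJoinRequests(n) {
  return n; // SmartJoins: grows linearly with the number of shards
}

for (const n of [2, 4, 8, 16]) {
  console.log(`n=${n}: general=${generalJoinRequests(n)}, smart=${smartJoinRequests(n)}`);
}
```

For example, with 4 shards per collection the general join needs at least 24 requests, while the SmartJoin needs 4.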

Smart joins will also be especially advantageous for queries that have to ship a lot
SmartJoins will also be especially advantageous for queries that have to ship a lot
of data around for performing the join, but that will filter out most of the data
after the join. In this case smart joins should greatly outperform the general join,
after the join. In this case SmartJoins should greatly outperform the general join,
as they will eliminate most of the inter-node data shipping overhead.

Also see the [Smart Joins](smart-joins.html) page.
Also see the [SmartJoins](smartjoins.html) page.


Background Index Creation
33 changes: 19 additions & 14 deletions 3.5/smart-joins.md → 3.5/smartjoins.md
@@ -1,9 +1,10 @@
---
layout: default
description: Introduced in
description: SmartJoins allow you to execute co-located join operations among identically sharded collections.
title: SmartJoins for ArangoDB Clusters
---
Smart Joins
===========
SmartJoins
==========

<small>Introduced in: v3.4.5, v3.5.0</small>

@@ -12,8 +13,12 @@ This feature is only available in the
[**Enterprise Edition**](https://www.arangodb.com/why-arangodb/arangodb-enterprise/){:target="_blank"}
{% endhint %}

When doing joins in an ArangoDB cluster, data has to be exchanged between different servers.
SmartJoins allow you to execute co-located join operations among identically sharded collections.

Cluster joins without being smart
---------------------------------

When doing joins in an ArangoDB cluster, data has to be exchanged between different servers.
Joins between different collections in a cluster normally require roundtrips between the
shards of these collections for fetching the data. Requests are routed through an extra
coordinator hop.
@@ -140,8 +145,8 @@ This is a precondition for running joins locally, and thanks to the effects of
`distributeShardsLike` it is now satisfied!


Smart joins using distributeShardsLike
--------------------------------------
SmartJoins using distributeShardsLike
-------------------------------------

With the two collections in place like this, an AQL query that uses a FILTER condition
that refers from the shard key of the one collection to the shard key of the other collection
@@ -166,7 +171,7 @@ As can be seen above, the extra hop via the coordinator is gone here, which will
less cluster-internal traffic and a faster response time.


Smart joins will also work if the shard key of the second collection is not *_key*,
SmartJoins will also work if the shard key of the second collection is not *_key*,
and even for non-unique shard key values, e.g.:

arangosh> db._create("c1", {numberOfShards: 4, shardKeys: ["_key"]});
@@ -194,13 +199,13 @@ and even for non-unique shard key values, e.g.:
6 ReturnNode COOR 2000 - RETURN doc1

{% hint 'tip' %}
All above examples used two collections only. Smart joins will also work when joining
All above examples used two collections only. SmartJoins will also work when joining
more than two collections which have the same data distribution enforced via their
`distributeShardsLike` attribute and using the shard keys as the join criteria as shown above.
{% endhint %}

Smart joins using smartJoinAttribute
------------------------------------
SmartJoins using smartJoinAttribute
-----------------------------------

In case the join on the second collection must be performed on a non-shard key
attribute, there is the option to specify a *smartJoinAttribute* for the collection.
@@ -252,8 +257,8 @@ The join can now be performed via the collection's *smartJoinAttribute*:
6 ReturnNode COOR 101 - RETURN doc1


Restricting smart joins to a single shard
-----------------------------------------
Restricting SmartJoins to a single shard
----------------------------------------

If a FILTER condition is used on one of the shard keys, the optimizer will also try
to restrict the queries to just the required shards:
@@ -277,7 +282,7 @@ to restrict the queries to just the required shards:
Limitations
-----------

The smart join optimization is currently triggered only for data selection queries,
The SmartJoin optimization is currently triggered only for data selection queries,
but not for any data-manipulation operations such as INSERT, UPDATE, REPLACE, REMOVE
or UPSERT, nor for traversals, subqueries or views.

@@ -289,5 +294,5 @@ It is restricted to be used with simple shard key attributes (such as `_key`, `p
but not with nested attributes (e.g. `name.first`). There should be exactly one shard
key attribute defined for each collection.

Finally, the smart join optimization requires that the collections are joined on their
Finally, the SmartJoin optimization requires that the collections are joined on their
shard key attributes (or smartJoinAttribute) using an equality comparison.
4 changes: 2 additions & 2 deletions _data/3.4-manual.yml
@@ -498,8 +498,8 @@
href: foxx-migrating2x-queries.html
- text: Satellite Collections
href: satellites.html
- text: Smart Joins
href: smart-joins.html
- text: SmartJoins
href: smartjoins.html
- subtitle: OPERATIONS
- text: Installation
href: installation.html
4 changes: 2 additions & 2 deletions _data/3.5-manual.yml
@@ -504,8 +504,8 @@
href: foxx-migrating2x-queries.html
- text: Satellite Collections
href: satellites.html
- text: Smart Joins
href: smart-joins.html
- text: SmartJoins
href: smartjoins.html
- subtitle: OPERATIONS
- text: Installation
href: installation.html