This repository was archived by the owner on Dec 13, 2023. It is now read-only.
Merged
8 changes: 3 additions & 5 deletions 3.8/analyzers.md
@@ -102,7 +102,7 @@ The currently implemented Analyzer types are:
normalization, stop-word filtering and edge _n_-gram generation
- `aql`: for running an AQL query to prepare tokens for the index
- `pipeline`: for chaining multiple Analyzers
{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
- `stopwords`: removes the specified tokens from the input
- `geojson`: breaks up a GeoJSON object into a set of indexable tokens
- `geopoint`: breaks up a JSON object describing a coordinate into a set of
indexable tokens
@@ -120,7 +120,7 @@ Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
[`text`](#text) | Yes | Yes | Yes | (Yes)
[`aql`](#aql) | (Yes) | (Yes) | (Yes) | (Yes)
[`pipeline`](#pipeline) | (Yes) | (Yes) | (Yes) | (Yes)
{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
[`stopwords`](#stopwords) | No | No | No | No
[`geojson`](#geojson) | – | – | – | –
[`geopoint`](#geopoint) | – | – | – | –

@@ -728,10 +728,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}

{% comment %}
### `stopwords`

<small>Introduced in: v3.8.0</small>
<small>Introduced in: v3.8.1</small>

An Analyzer capable of removing specified tokens from the input.

@@ -802,7 +801,6 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
@endDocuBlock analyzerPipelineStopwords
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
{% endcomment %}

### `geojson`

4 changes: 2 additions & 2 deletions 3.8/highlights.md
@@ -27,8 +27,8 @@ Version 3.8
[Geo](analyzers.html#geojson)
[Analyzers](analyzers.html#geopoint) and
[ArangoSearch Geo functions](aql/functions-arangosearch.html#geo-functions).
{% comment %}A new [**Stopwords Analyzer**](analyzers.html#stopwords) that
can be used standalone or in an Analyzer pipeline.{% endcomment %}
A new [**Stopwords Analyzer**](analyzers.html#stopwords) that
can be used standalone or in an Analyzer pipeline.

- A [**`WINDOW` operation**](aql/operations-window.html) for aggregations over
adjacent rows, value ranges or time windows.
154 changes: 136 additions & 18 deletions 3.9/analyzers.md
@@ -100,29 +100,34 @@ The currently implemented Analyzer types are:
- `ngram`: create _n_-grams from value with user-defined lengths
- `text`: tokenize into words, optionally with stemming,
normalization, stop-word filtering and edge _n_-gram generation
- `segmentation`: language-agnostic text tokenization, optionally with
normalization
- `aql`: for running an AQL query to prepare tokens for the index
- `pipeline`: for chaining multiple Analyzers
{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
- `stopwords`: removes the specified tokens from the input
- `collation`: to respect the alphabetic order of a language in range queries
- `geojson`: breaks up a GeoJSON object into a set of indexable tokens
- `geopoint`: breaks up a JSON object describing a coordinate into a set of
indexable tokens

Available normalizations are case conversion and accent removal
(conversion of characters with diacritical marks to the base characters).

Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
:-------------------------|:------------:|:--------:|:-------------:|:--------:
[`identity`](#identity) | No | No | No | No
[`delimiter`](#delimiter) | (Yes) | No | No | No
[`stem`](#stem) | No | Yes | No | No
[`norm`](#norm) | No | No | Yes | No
[`ngram`](#ngram) | No | No | No | Yes
[`text`](#text) | Yes | Yes | Yes | (Yes)
[`aql`](#aql) | (Yes) | (Yes) | (Yes) | (Yes)
[`pipeline`](#pipeline) | (Yes) | (Yes) | (Yes) | (Yes)
{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
[`geojson`](#geojson) | – | – | – | –
[`geopoint`](#geopoint) | – | – | – | –
Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
:------------------------------:|:------------:|:--------:|:-------------:|:--------:
[`identity`](#identity) | No | No | No | No
[`delimiter`](#delimiter) | (Yes) | No | No | No
[`stem`](#stem) | No | Yes | No | No
[`norm`](#norm) | No | No | Yes | No
[`ngram`](#ngram) | No | No | No | Yes
[`text`](#text) | Yes | Yes | Yes | (Yes)
[`segmentation`](#segmentation) | Yes | No | Yes | No
[`aql`](#aql) | (Yes) | (Yes) | (Yes) | (Yes)
[`pipeline`](#pipeline) | (Yes) | (Yes) | (Yes) | (Yes)
[`stopwords`](#stopwords) | No | No | No | No
[`collation`](#collation) | No | No | No | No
[`geojson`](#geojson) | – | – | – | –
[`geopoint`](#geopoint) | – | – | – | –

Analyzer Properties
-------------------
@@ -163,7 +168,6 @@ attributes:

- `delimiter` (string): the delimiting character(s)


**Examples**

Split input strings into tokens at hyphen-minus characters:
@@ -486,6 +490,58 @@ stemming disabled and `"the"` defined as stop-word to exclude it:
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}

### `collation`

<small>Introduced in: v3.9.0</small>

An Analyzer capable of converting the input into a set of language-specific
tokens. This makes comparisons follow the rules of the respective language,
most notably in range queries against Views.

The *properties* allowed for this Analyzer are an object with the following
attributes:

- `locale` (string): a locale in the format
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).

**Examples**

In Swedish, the letter `å` (note the small circle above the `a`) comes after
`z`. Other languages treat it like a regular `a`, putting it before `b`.
The example below creates two `collation` Analyzers, one with an English locale
(`en`) and one with a Swedish locale (`sv`). It then demonstrates the
difference in alphabetical order using a simple range query that returns
letters before `c`:

{% arangoshexample examplevar="examplevar" script="script" result="result" %}
@startDocuBlockInline analyzerCollation
@EXAMPLE_ARANGOSH_OUTPUT{analyzerCollation}
var analyzers = require("@arangodb/analyzers");
var en = analyzers.save("collation_en", "collation", { locale: "en.utf-8" }, []);
var sv = analyzers.save("collation_sv", "collation", { locale: "sv.utf-8" }, []);
var test = db._create("test");
| db.test.save([
| { text: "a" },
| { text: "å" },
| { text: "b" },
| { text: "z" },
]);
| var view = db._createView("view", "arangosearch",
{ links: { test: { analyzers: [ "collation_en", "collation_sv" ], includeAllFields: true }}});
~ db._query("FOR doc IN view OPTIONS { waitForSync: true } LIMIT 1 RETURN true");
db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_en')[0], 'collation_en') RETURN doc.text");
db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_sv')[0], 'collation_sv') RETURN doc.text");
~ db._dropView(view.name());
~ db._drop(test.name());
~ analyzers.remove(en.name);
~ analyzers.remove(sv.name);
@END_EXAMPLE_ARANGOSH_OUTPUT
@endDocuBlock analyzerCollation
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
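
Outside of arangosh, the same locale-dependent ordering can be sketched with
JavaScript's built-in `Intl.Collator`. This is an analogy for illustration
only, not ArangoDB's implementation, but it applies the same kind of ICU
language rules:

```javascript
// Illustration only: Intl.Collator sorts according to language rules.
// In Swedish, "å" sorts after "z"; in English it sorts next to "a".
const letters = ["a", "å", "b", "z"];
const en = new Intl.Collator("en");
const sv = new Intl.Collator("sv");

console.log([...letters].sort(en.compare)); // [ 'a', 'å', 'b', 'z' ]
console.log([...letters].sort(sv.compare)); // [ 'a', 'b', 'z', 'å' ]

// Counterpart of the range query `doc.text < "c"` under each locale:
console.log(letters.filter((l) => en.compare(l, "c") < 0)); // [ 'a', 'å', 'b' ]
console.log(letters.filter((l) => sv.compare(l, "c") < 0)); // [ 'a', 'b' ]
```

The filtered results mirror the two `SEARCH` queries above: the English
collation places `å` before `c`, the Swedish collation does not.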

### `aql`

<small>Introduced in: v3.8.0</small>
@@ -728,10 +784,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}

{% comment %}
### `stopwords`

<small>Introduced in: v3.8.0</small>
<small>Introduced in: v3.8.1</small>

An Analyzer capable of removing specified tokens from the input.

@@ -802,7 +857,70 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
@endDocuBlock analyzerPipelineStopwords
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
{% endcomment %}
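
Conceptually, the stopwords step is a set-difference over an already-tokenized
input. A minimal standalone sketch in plain JavaScript (illustration only, not
the Analyzer implementation, which operates on ArangoSearch's internal token
stream):

```javascript
// Minimal sketch of the stopwords concept: drop exact-match tokens from
// tokens that an upstream step (e.g. a `norm` Analyzer) already normalized.
function removeStopwords(tokens, stopwords) {
  const stopset = new Set(stopwords);
  return tokens.filter((token) => !stopset.has(token));
}

const tokens = ["fox", "and", "the", "dog", "and", "a", "theater"];
console.log(removeStopwords(tokens, ["and", "the"]));
// → [ 'fox', 'dog', 'a', 'theater' ]
```

Note that only exact matches are removed: `theater` survives even though it
contains the substring `the`, which is why stopword removal is usually paired
with prior case normalization, as in the pipeline example above.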

### `segmentation`

<small>Introduced in: v3.9.0</small>

An Analyzer capable of breaking up the input text into tokens in a
language-agnostic manner as per
[Unicode Standard Annex #29](https://unicode.org/reports/tr29){:target="_blank"},
making it suitable for mixed language strings. It can optionally preserve all
non-whitespace or all characters instead of keeping alphanumeric characters only,
as well as apply case conversion.

The *properties* allowed for this Analyzer are an object with the following
attributes:

- `break` (string, _optional_):
- `"all"`: return all tokens
- `"alpha"`: return tokens composed of alphanumeric characters only (default).
Alphanumeric characters are Unicode codepoints from the Letter and Number
categories, see
[Unicode Technical Note #36](https://www.unicode.org/notes/tn36/){:target="_blank"}.
- `"graphic"`: return tokens composed of non-whitespace characters only.
Note that the list of whitespace characters does not include line breaks:
- `U+0009` Character Tabulation
- `U+0020` Space
- `U+0085` Next Line
- `U+00A0` No-break Space
- `U+1680` Ogham Space Mark
- `U+2000` En Quad
- `U+2028` Line Separator
- `U+202F` Narrow No-break Space
- `U+205F` Medium Mathematical Space
- `U+3000` Ideographic Space
- `case` (string, _optional_):
- `"lower"` to convert to all lower-case characters (default)
- `"upper"` to convert to all upper-case characters
- `"none"` to not change character case

**Examples**

Create different `segmentation` Analyzers to show the behavior of the different
`break` options:

{% arangoshexample examplevar="examplevar" script="script" result="result" %}
@startDocuBlockInline analyzerSegmentationBreak
@EXAMPLE_ARANGOSH_OUTPUT{analyzerSegmentationBreak}
var analyzers = require("@arangodb/analyzers");
var all = analyzers.save("segment_all", "segmentation", { break: "all" }, []);
var alpha = analyzers.save("segment_alpha", "segmentation", { break: "alpha" }, []);
var graphic = analyzers.save("segment_graphic", "segmentation", { break: "graphic" }, []);
| db._query(`LET str = 'Test\twith An_EMAIL-address+123@example.org\n蝴蝶。\u2028бутерброд'
| RETURN {
| "all": TOKENS(str, 'segment_all'),
| "alpha": TOKENS(str, 'segment_alpha'),
| "graphic": TOKENS(str, 'segment_graphic'),
| }
`);
~ analyzers.remove(all.name);
~ analyzers.remove(alpha.name);
~ analyzers.remove(graphic.name);
@END_EXAMPLE_ARANGOSH_OUTPUT
@endDocuBlock analyzerSegmentationBreak
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
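
For comparison, JavaScript's built-in `Intl.Segmenter` (Node.js 16+)
implements the same UAX #29 word segmentation the `segmentation` Analyzer is
based on. This is an illustration only, not the Analyzer itself; its
`isWordLike` flag roughly corresponds to the `alpha` break option:

```javascript
// Illustration only: UAX #29 word segmentation via Intl.Segmenter.
const segmenter = new Intl.Segmenter("und", { granularity: "word" });
const input = "Test with An_EMAIL-address";

// Roughly the "all" break option (whitespace and punctuation segments kept):
const segments = [...segmenter.segment(input)].map((s) => s.segment);
console.log(segments);

// Roughly the "alpha" option combined with case: "lower"
// (word-like segments only, lower-cased):
const alpha = [...segmenter.segment(input)]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment.toLowerCase());
console.log(alpha);
```

As in the Analyzer's output, the underscore joins `An_EMAIL` into a single
token, while the hyphen and spaces act as boundaries.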

### `geojson`

47 changes: 47 additions & 0 deletions 3.9/generated/Examples/analyzerCollation.generated
@@ -0,0 +1,47 @@
arangosh&gt; <span class="hljs-keyword">var</span> analyzers = <span class="hljs-built_in">require</span>(<span class="hljs-string">&quot;@arangodb/analyzers&quot;</span>);
arangosh&gt; <span class="hljs-keyword">var</span> en = analyzers.save(<span class="hljs-string">&quot;collation_en&quot;</span>, <span class="hljs-string">&quot;collation&quot;</span>, { <span class="hljs-attr">locale</span>: <span class="hljs-string">&quot;en.utf-8&quot;</span> }, []);
arangosh&gt; <span class="hljs-keyword">var</span> sv = analyzers.save(<span class="hljs-string">&quot;collation_sv&quot;</span>, <span class="hljs-string">&quot;collation&quot;</span>, { <span class="hljs-attr">locale</span>: <span class="hljs-string">&quot;sv.utf-8&quot;</span> }, []);
arangosh&gt; <span class="hljs-keyword">var</span> test = db._create(<span class="hljs-string">&quot;test&quot;</span>);
arangosh&gt; db.test.save([
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;a&quot;</span> },
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;å&quot;</span> },
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;b&quot;</span> },
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;z&quot;</span> },
........&gt; ]);
[
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69994&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69994&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO---&quot;</span>
},
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69995&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69995&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--_&quot;</span>
},
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69996&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69996&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--A&quot;</span>
},
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69997&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69997&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--B&quot;</span>
}
]
arangosh&gt; <span class="hljs-keyword">var</span> view = db._createView(<span class="hljs-string">&quot;view&quot;</span>, <span class="hljs-string">&quot;arangosearch&quot;</span>,
........&gt; { <span class="hljs-attr">links</span>: { <span class="hljs-attr">test</span>: { <span class="hljs-attr">analyzers</span>: [ <span class="hljs-string">&quot;collation_en&quot;</span>, <span class="hljs-string">&quot;collation_sv&quot;</span> ], <span class="hljs-attr">includeAllFields</span>: <span class="hljs-literal">true</span> }}});
arangosh&gt; db._query(<span class="hljs-string">&quot;FOR doc IN view SEARCH ANALYZER(doc.text &lt; TOKENS(&#x27;c&#x27;, &#x27;collation_en&#x27;)[0], &#x27;collation_en&#x27;) RETURN doc.text&quot;</span>);
[
<span class="hljs-string">&quot;a&quot;</span>,
<span class="hljs-string">&quot;å&quot;</span>,
<span class="hljs-string">&quot;b&quot;</span>
]
[object ArangoQueryCursor, <span class="hljs-attr">count</span>: <span class="hljs-number">3</span>, <span class="hljs-attr">cached</span>: <span class="hljs-literal">false</span>, <span class="hljs-attr">hasMore</span>: <span class="hljs-literal">false</span>]
arangosh&gt; db._query(<span class="hljs-string">&quot;FOR doc IN view SEARCH ANALYZER(doc.text &lt; TOKENS(&#x27;c&#x27;, &#x27;collation_sv&#x27;)[0], &#x27;collation_sv&#x27;) RETURN doc.text&quot;</span>);
[
<span class="hljs-string">&quot;a&quot;</span>,
<span class="hljs-string">&quot;b&quot;</span>
]
[object ArangoQueryCursor, <span class="hljs-attr">count</span>: <span class="hljs-number">2</span>, <span class="hljs-attr">cached</span>: <span class="hljs-literal">false</span>, <span class="hljs-attr">hasMore</span>: <span class="hljs-literal">false</span>]