This repository was archived by the owner on Dec 13, 2023. It is now read-only.
Merged
8 changes: 3 additions & 5 deletions 3.8/analyzers.md
@@ -102,7 +102,7 @@ The currently implemented Analyzer types are:
normalization, stop-word filtering and edge _n_-gram generation
- `aql`: for running an AQL query to prepare tokens for the index
- `pipeline`: for chaining multiple Analyzers
{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
- `stopwords`: removes the specified tokens from the input
- `geojson`: breaks up a GeoJSON object into a set of indexable tokens
- `geopoint`: breaks up a JSON object describing a coordinate into a set of
indexable tokens
@@ -120,7 +120,7 @@ Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
[`text`](#text) | Yes | Yes | Yes | (Yes)
[`aql`](#aql) | (Yes) | (Yes) | (Yes) | (Yes)
[`pipeline`](#pipeline) | (Yes) | (Yes) | (Yes) | (Yes)
{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
[`stopwords`](#stopwords) | No | No | No | No
[`geojson`](#geojson) | – | – | – | –
[`geopoint`](#geopoint) | – | – | – | –

@@ -728,10 +728,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}

{% comment %}
### `stopwords`

<small>Introduced in: v3.8.0</small>
<small>Introduced in: v3.8.1</small>

An Analyzer capable of removing specified tokens from the input.

@@ -802,7 +801,6 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
@endDocuBlock analyzerPipelineStopwords
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
{% endcomment %}

### `geojson`

4 changes: 2 additions & 2 deletions 3.8/highlights.md
@@ -27,8 +27,8 @@ Version 3.8
[Geo](analyzers.html#geojson)
[Analyzers](analyzers.html#geopoint) and
[ArangoSearch Geo functions](aql/functions-arangosearch.html#geo-functions).
{% comment %}A new [**Stopwords Analyzer**](analyzers.html#stopwords) that
can be used standalone or in an Analyzer pipeline.{% endcomment %}
A new [**Stopwords Analyzer**](analyzers.html#stopwords) that
can be used standalone or in an Analyzer pipeline.

- A [**`WINDOW` operation**](aql/operations-window.html) for aggregations over
adjacent rows, value ranges or time windows.
154 changes: 136 additions & 18 deletions 3.9/analyzers.md
@@ -100,29 +100,34 @@ The currently implemented Analyzer types are:
- `ngram`: create _n_-grams from value with user-defined lengths
- `text`: tokenize into words, optionally with stemming,
normalization, stop-word filtering and edge _n_-gram generation
- `segmentation`: language-agnostic text tokenization, optionally with
normalization
- `aql`: for running an AQL query to prepare tokens for the index
- `pipeline`: for chaining multiple Analyzers
{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
- `stopwords`: removes the specified tokens from the input
- `collation`: to respect the alphabetic order of a language in range queries
- `geojson`: breaks up a GeoJSON object into a set of indexable tokens
- `geopoint`: breaks up a JSON object describing a coordinate into a set of
indexable tokens

Available normalizations are case conversion and accent removal
(conversion of characters with diacritical marks to the base characters).

Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
:-------------------------|:------------:|:--------:|:-------------:|:--------:
[`identity`](#identity) | No | No | No | No
[`delimiter`](#delimiter) | (Yes) | No | No | No
[`stem`](#stem) | No | Yes | No | No
[`norm`](#norm) | No | No | Yes | No
[`ngram`](#ngram) | No | No | No | Yes
[`text`](#text) | Yes | Yes | Yes | (Yes)
[`aql`](#aql) | (Yes) | (Yes) | (Yes) | (Yes)
[`pipeline`](#pipeline) | (Yes) | (Yes) | (Yes) | (Yes)
{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
[`geojson`](#geojson) | – | – | – | –
[`geopoint`](#geopoint) | – | – | – | –
Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
:------------------------------:|:------------:|:--------:|:-------------:|:--------:
[`identity`](#identity) | No | No | No | No
[`delimiter`](#delimiter) | (Yes) | No | No | No
[`stem`](#stem) | No | Yes | No | No
[`norm`](#norm) | No | No | Yes | No
[`ngram`](#ngram) | No | No | No | Yes
[`text`](#text) | Yes | Yes | Yes | (Yes)
[`segmentation`](#segmentation) | Yes | No | Yes | No
[`aql`](#aql) | (Yes) | (Yes) | (Yes) | (Yes)
[`pipeline`](#pipeline) | (Yes) | (Yes) | (Yes) | (Yes)
[`stopwords`](#stopwords) | No | No | No | No
[`collation`](#collation) | No | No | No | No
[`geojson`](#geojson) | – | – | – | –
[`geopoint`](#geopoint) | – | – | – | –

Analyzer Properties
-------------------
@@ -163,7 +168,6 @@ attributes:

- `delimiter` (string): the delimiting character(s)


**Examples**

Split input strings into tokens at hyphen-minus characters:
@@ -486,6 +490,58 @@ stemming disabled and `"the"` defined as stop-word to exclude it:
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}

### `collation`

<small>Introduced in: v3.9.0</small>

An Analyzer capable of converting the input into a set of language-specific
tokens. This makes comparisons follow the rules of the respective language,
most notably in range queries against Views.

The *properties* allowed for this Analyzer are an object with the following
attributes:

- `locale` (string): a locale in the format
`language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).

**Examples**

In Swedish, the letter `å` (note the small circle above the `a`) comes after
`z`. Other languages treat it like a regular `a`, putting it before `b`.
The example below creates two `collation` Analyzers, one with an English locale
(`en`) and one with a Swedish locale (`sv`). It then demonstrates the
difference in alphabetical order using a simple range query that returns
letters before `c`:

{% arangoshexample examplevar="examplevar" script="script" result="result" %}
@startDocuBlockInline analyzerCollation
@EXAMPLE_ARANGOSH_OUTPUT{analyzerCollation}
var analyzers = require("@arangodb/analyzers");
var en = analyzers.save("collation_en", "collation", { locale: "en.utf-8" }, []);
var sv = analyzers.save("collation_sv", "collation", { locale: "sv.utf-8" }, []);
var test = db._create("test");
| db.test.save([
| { text: "a" },
| { text: "å" },
| { text: "b" },
| { text: "z" },
]);
| var view = db._createView("view", "arangosearch",
{ links: { test: { analyzers: [ "collation_en", "collation_sv" ], includeAllFields: true }}});
~ db._query("FOR doc IN view OPTIONS { waitForSync: true } LIMIT 1 RETURN true");
db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_en')[0], 'collation_en') RETURN doc.text");
db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_sv')[0], 'collation_sv') RETURN doc.text");
~ db._dropView(view.name());
~ db._drop(test.name());
~ analyzers.remove(en.name);
~ analyzers.remove(sv.name);
@END_EXAMPLE_ARANGOSH_OUTPUT
@endDocuBlock analyzerCollation
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
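
Outside of arangosh, the same locale-dependent ordering can be sketched with
JavaScript's built-in `Intl.Collator`. This is an analogy for illustration
only, not ArangoDB's implementation, but it applies the same kind of ICU
language rules:

```javascript
// Illustration only: Intl.Collator sorts according to language rules.
// In Swedish, "å" sorts after "z"; in English it sorts next to "a".
const letters = ["a", "å", "b", "z"];
const en = new Intl.Collator("en");
const sv = new Intl.Collator("sv");

console.log([...letters].sort(en.compare)); // [ 'a', 'å', 'b', 'z' ]
console.log([...letters].sort(sv.compare)); // [ 'a', 'b', 'z', 'å' ]

// Counterpart of the range query `doc.text < "c"` under each locale:
console.log(letters.filter((l) => en.compare(l, "c") < 0)); // [ 'a', 'å', 'b' ]
console.log(letters.filter((l) => sv.compare(l, "c") < 0)); // [ 'a', 'b' ]
```

The filtered results mirror the two `SEARCH` queries above: the English
collation places `å` before `c`, the Swedish collation does not.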

### `aql`

<small>Introduced in: v3.8.0</small>
@@ -728,10 +784,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}

{% comment %}
### `stopwords`

<small>Introduced in: v3.8.0</small>
<small>Introduced in: v3.8.1</small>

An Analyzer capable of removing specified tokens from the input.

@@ -802,7 +857,70 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
@endDocuBlock analyzerPipelineStopwords
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
{% endcomment %}
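
Conceptually, the stopwords step is a set-difference over an already-tokenized
input. A minimal standalone sketch in plain JavaScript (illustration only, not
the Analyzer implementation, which operates on ArangoSearch's internal token
stream):

```javascript
// Minimal sketch of the stopwords concept: drop exact-match tokens from
// tokens that an upstream step (e.g. a `norm` Analyzer) already normalized.
function removeStopwords(tokens, stopwords) {
  const stopset = new Set(stopwords);
  return tokens.filter((token) => !stopset.has(token));
}

const tokens = ["fox", "and", "the", "dog", "and", "a", "theater"];
console.log(removeStopwords(tokens, ["and", "the"]));
// → [ 'fox', 'dog', 'a', 'theater' ]
```

Note that only exact matches are removed: `theater` survives even though it
contains the substring `the`, which is why stopword removal is usually paired
with prior case normalization, as in the pipeline example above.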

### `segmentation`

<small>Introduced in: v3.9.0</small>

An Analyzer capable of breaking up the input text into tokens in a
language-agnostic manner as per
[Unicode Standard Annex #29](https://unicode.org/reports/tr29){:target="_blank"},
making it suitable for mixed language strings. It can optionally preserve all
non-whitespace or all characters instead of keeping alphanumeric characters only,
as well as apply case conversion.

The *properties* allowed for this Analyzer are an object with the following
attributes:

- `break` (string, _optional_):
- `"all"`: return all tokens
- `"alpha"`: return tokens composed of alphanumeric characters only (default).
Alphanumeric characters are Unicode codepoints from the Letter and Number
categories, see
[Unicode Technical Note #36](https://www.unicode.org/notes/tn36/){:target="_blank"}.
- `"graphic"`: return tokens composed of non-whitespace characters only.
Note that the list of whitespace characters does not include line breaks:
- `U+0009` Character Tabulation
- `U+0020` Space
- `U+0085` Next Line
- `U+00A0` No-break Space
- `U+1680` Ogham Space Mark
- `U+2000` En Quad
- `U+2028` Line Separator
- `U+202F` Narrow No-break Space
- `U+205F` Medium Mathematical Space
- `U+3000` Ideographic Space
- `case` (string, _optional_):
- `"lower"` to convert to all lower-case characters (default)
- `"upper"` to convert to all upper-case characters
- `"none"` to not change character case

**Examples**

Create different `segmentation` Analyzers to show the behavior of the different
`break` options:

{% arangoshexample examplevar="examplevar" script="script" result="result" %}
@startDocuBlockInline analyzerSegmentationBreak
@EXAMPLE_ARANGOSH_OUTPUT{analyzerSegmentationBreak}
var analyzers = require("@arangodb/analyzers");
var all = analyzers.save("segment_all", "segmentation", { break: "all" }, []);
var alpha = analyzers.save("segment_alpha", "segmentation", { break: "alpha" }, []);
var graphic = analyzers.save("segment_graphic", "segmentation", { break: "graphic" }, []);
| db._query(`LET str = 'Test\twith An_EMAIL-address+123@example.org\n蝴蝶。\u2028бутерброд'
| RETURN {
| "all": TOKENS(str, 'segment_all'),
| "alpha": TOKENS(str, 'segment_alpha'),
| "graphic": TOKENS(str, 'segment_graphic'),
| }
`);
~ analyzers.remove(all.name);
~ analyzers.remove(alpha.name);
~ analyzers.remove(graphic.name);
@END_EXAMPLE_ARANGOSH_OUTPUT
@endDocuBlock analyzerSegmentationBreak
{% endarangoshexample %}
{% include arangoshexample.html id=examplevar script=script result=result %}
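
For comparison, JavaScript's built-in `Intl.Segmenter` (Node.js 16+)
implements the same UAX #29 word segmentation the `segmentation` Analyzer is
based on. This is an illustration only, not the Analyzer itself; its
`isWordLike` flag roughly corresponds to the `alpha` break option:

```javascript
// Illustration only: UAX #29 word segmentation via Intl.Segmenter.
const segmenter = new Intl.Segmenter("und", { granularity: "word" });
const input = "Test with An_EMAIL-address";

// Roughly the "all" break option (whitespace and punctuation segments kept):
const segments = [...segmenter.segment(input)].map((s) => s.segment);
console.log(segments);

// Roughly the "alpha" option combined with case: "lower"
// (word-like segments only, lower-cased):
const alpha = [...segmenter.segment(input)]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment.toLowerCase());
console.log(alpha);
```

As in the Analyzer's output, the underscore joins `An_EMAIL` into a single
token, while the hyphen and spaces act as boundaries.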

### `geojson`

47 changes: 47 additions & 0 deletions 3.9/generated/Examples/analyzerCollation.generated
@@ -0,0 +1,47 @@
arangosh&gt; <span class="hljs-keyword">var</span> analyzers = <span class="hljs-built_in">require</span>(<span class="hljs-string">&quot;@arangodb/analyzers&quot;</span>);
arangosh&gt; <span class="hljs-keyword">var</span> en = analyzers.save(<span class="hljs-string">&quot;collation_en&quot;</span>, <span class="hljs-string">&quot;collation&quot;</span>, { <span class="hljs-attr">locale</span>: <span class="hljs-string">&quot;en.utf-8&quot;</span> }, []);
arangosh&gt; <span class="hljs-keyword">var</span> sv = analyzers.save(<span class="hljs-string">&quot;collation_sv&quot;</span>, <span class="hljs-string">&quot;collation&quot;</span>, { <span class="hljs-attr">locale</span>: <span class="hljs-string">&quot;sv.utf-8&quot;</span> }, []);
arangosh&gt; <span class="hljs-keyword">var</span> test = db._create(<span class="hljs-string">&quot;test&quot;</span>);
arangosh&gt; db.test.save([
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;a&quot;</span> },
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;å&quot;</span> },
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;b&quot;</span> },
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;z&quot;</span> },
........&gt; ]);
[
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69994&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69994&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO---&quot;</span>
},
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69995&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69995&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--_&quot;</span>
},
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69996&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69996&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--A&quot;</span>
},
{
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69997&quot;</span>,
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69997&quot;</span>,
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--B&quot;</span>
}
]
arangosh&gt; <span class="hljs-keyword">var</span> view = db._createView(<span class="hljs-string">&quot;view&quot;</span>, <span class="hljs-string">&quot;arangosearch&quot;</span>,
........&gt; { <span class="hljs-attr">links</span>: { <span class="hljs-attr">test</span>: { <span class="hljs-attr">analyzers</span>: [ <span class="hljs-string">&quot;collation_en&quot;</span>, <span class="hljs-string">&quot;collation_sv&quot;</span> ], <span class="hljs-attr">includeAllFields</span>: <span class="hljs-literal">true</span> }}});
arangosh&gt; db._query(<span class="hljs-string">&quot;FOR doc IN view SEARCH ANALYZER(doc.text &lt; TOKENS(&#x27;c&#x27;, &#x27;collation_en&#x27;)[0], &#x27;collation_en&#x27;) RETURN doc.text&quot;</span>);
[
<span class="hljs-string">&quot;a&quot;</span>,
<span class="hljs-string">&quot;å&quot;</span>,
<span class="hljs-string">&quot;b&quot;</span>
]
[object ArangoQueryCursor, <span class="hljs-attr">count</span>: <span class="hljs-number">3</span>, <span class="hljs-attr">cached</span>: <span class="hljs-literal">false</span>, <span class="hljs-attr">hasMore</span>: <span class="hljs-literal">false</span>]
arangosh&gt; db._query(<span class="hljs-string">&quot;FOR doc IN view SEARCH ANALYZER(doc.text &lt; TOKENS(&#x27;c&#x27;, &#x27;collation_sv&#x27;)[0], &#x27;collation_sv&#x27;) RETURN doc.text&quot;</span>);
[
<span class="hljs-string">&quot;a&quot;</span>,
<span class="hljs-string">&quot;b&quot;</span>
]
[object ArangoQueryCursor, <span class="hljs-attr">count</span>: <span class="hljs-number">2</span>, <span class="hljs-attr">cached</span>: <span class="hljs-literal">false</span>, <span class="hljs-attr">hasMore</span>: <span class="hljs-literal">false</span>]