This repository was archived by the owner on Dec 13, 2023. It is now read-only.

Commit 9973da1

Add segmentation and collation Analyzers, re-add stopwords Analyzer (#751)
1 parent: 624a719

File tree

7 files changed: +278 −27 lines changed

3.8/analyzers.md

Lines changed: 3 additions & 5 deletions
@@ -102,7 +102,7 @@ The currently implemented Analyzer types are:
   normalization, stop-word filtering and edge _n_-gram generation
 - `aql`: for running AQL query to prepare tokens for index
 - `pipeline`: for chaining multiple Analyzers
-{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
+- `stopwords`: removes the specified tokens from the input
 - `geojson`: breaks up a GeoJSON object into a set of indexable tokens
 - `geopoint`: breaks up a JSON object describing a coordinate into a set of
   indexable tokens
@@ -120,7 +120,7 @@ Analyzer / Feature | Tokenization | Stemming | Normalization | _N_-grams
 [`text`](#text)           | Yes          | Yes      | Yes           | (Yes)
 [`aql`](#aql)             | (Yes)        | (Yes)    | (Yes)         | (Yes)
 [`pipeline`](#pipeline)   | (Yes)        | (Yes)    | (Yes)         | (Yes)
-{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
+[`stopwords`](#stopwords) | No           | No       | No            | No
 [`geojson`](#geojson)     | –            | –        | –             | –
 [`geopoint`](#geopoint)   | –            | –        | –             | –

@@ -728,10 +728,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
 
-{% comment %}
 ### `stopwords`
 
-<small>Introduced in: v3.8.0</small>
+<small>Introduced in: v3.8.1</small>
 
 An Analyzer capable of removing specified tokens from the input.
@@ -802,7 +801,6 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
 @endDocuBlock analyzerPipelineStopwords
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
-{% endcomment %}
 
 ### `geojson`

3.8/highlights.md

Lines changed: 2 additions & 2 deletions
@@ -27,8 +27,8 @@ Version 3.8
 [Geo](analyzers.html#geojson)
 [Analyzers](analyzers.html#geopoint) and
 [ArangoSearch Geo functions](aql/functions-arangosearch.html#geo-functions).
-{% comment %}A new [**Stopwords Analyzer**](analyzers.html#stopwords) that
-can be used standalone or in an Analyzer pipeline.{% endcomment %}
+A new [**Stopwords Analyzer**](analyzers.html#stopwords) that
+can be used standalone or in an Analyzer pipeline.
 
 - A [**`WINDOW` operation**](aql/operations-window.html) for aggregations over
   adjacent rows, value ranges or time windows.

3.9/analyzers.md

Lines changed: 136 additions & 18 deletions
@@ -100,29 +100,34 @@ The currently implemented Analyzer types are:
 - `ngram`: create _n_-grams from value with user-defined lengths
 - `text`: tokenize into words, optionally with stemming,
   normalization, stop-word filtering and edge _n_-gram generation
+- `segmentation`: language-agnostic text tokenization, optionally with
+  normalization
 - `aql`: for running AQL query to prepare tokens for index
 - `pipeline`: for chaining multiple Analyzers
-{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
+- `stopwords`: removes the specified tokens from the input
+- `collation`: to respect the alphabetic order of a language in range queries
 - `geojson`: breaks up a GeoJSON object into a set of indexable tokens
 - `geopoint`: breaks up a JSON object describing a coordinate into a set of
   indexable tokens
 
 Available normalizations are case conversion and accent removal
 (conversion of characters with diacritical marks to the base characters).
 
-Analyzer / Feature        | Tokenization | Stemming | Normalization | _N_-grams
-:-------------------------|:------------:|:--------:|:-------------:|:--------:
-[`identity`](#identity)   | No           | No       | No            | No
-[`delimiter`](#delimiter) | (Yes)        | No       | No            | No
-[`stem`](#stem)           | No           | Yes      | No            | No
-[`norm`](#norm)           | No           | No       | Yes           | No
-[`ngram`](#ngram)         | No           | No       | No            | Yes
-[`text`](#text)           | Yes          | Yes      | Yes           | (Yes)
-[`aql`](#aql)             | (Yes)        | (Yes)    | (Yes)         | (Yes)
-[`pipeline`](#pipeline)   | (Yes)        | (Yes)    | (Yes)         | (Yes)
-{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
-[`geojson`](#geojson)     | –            | –        | –             | –
-[`geopoint`](#geopoint)   | –            | –        | –             | –
+Analyzer / Feature              | Tokenization | Stemming | Normalization | _N_-grams
+:------------------------------:|:------------:|:--------:|:-------------:|:--------:
+[`identity`](#identity)         | No           | No       | No            | No
+[`delimiter`](#delimiter)       | (Yes)        | No       | No            | No
+[`stem`](#stem)                 | No           | Yes      | No            | No
+[`norm`](#norm)                 | No           | No       | Yes           | No
+[`ngram`](#ngram)               | No           | No       | No            | Yes
+[`text`](#text)                 | Yes          | Yes      | Yes           | (Yes)
+[`segmentation`](#segmentation) | Yes          | No       | Yes           | No
+[`aql`](#aql)                   | (Yes)        | (Yes)    | (Yes)         | (Yes)
+[`pipeline`](#pipeline)         | (Yes)        | (Yes)    | (Yes)         | (Yes)
+[`stopwords`](#stopwords)       | No           | No       | No            | No
+[`collation`](#collation)       | No           | No       | No            | No
+[`geojson`](#geojson)           | –            | –        | –             | –
+[`geopoint`](#geopoint)         | –            | –        | –             | –
 
 Analyzer Properties
 -------------------
@@ -163,7 +168,6 @@ attributes:
 
 - `delimiter` (string): the delimiting character(s)
 
-
 **Examples**
 
 Split input strings into tokens at hyphen-minus characters:
@@ -486,6 +490,58 @@ stemming disabled and `"the"` defined as stop-word to exclude it:
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
 
+### `collation`
+
+<small>Introduced in: v3.9.0</small>
+
+An Analyzer capable of converting the input into a set of language-specific
+tokens. This makes comparisons follow the rules of the respective language,
+most notably in range queries against Views.
+
+The *properties* allowed for this Analyzer are an object with the following
+attributes:
+
+- `locale` (string): a locale in the format
+  `language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
+  parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
+  meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
+
+**Examples**
+
+In Swedish, the letter `å` (note the small circle above the `a`) comes after
+`z`. Other languages treat it like a regular `a`, putting it before `b`.
+The example below creates two `collation` Analyzers, one with an English locale
+(`en`) and one with a Swedish locale (`sv`). It then demonstrates the
+difference in alphabetical order using a simple range query that returns
+letters before `c`:
+
+{% arangoshexample examplevar="examplevar" script="script" result="result" %}
+@startDocuBlockInline analyzerCollation
+@EXAMPLE_ARANGOSH_OUTPUT{analyzerCollation}
+var analyzers = require("@arangodb/analyzers");
+var en = analyzers.save("collation_en", "collation", { locale: "en.utf-8" }, []);
+var sv = analyzers.save("collation_sv", "collation", { locale: "sv.utf-8" }, []);
+var test = db._create("test");
+| db.test.save([
+|   { text: "a" },
+|   { text: "å" },
+|   { text: "b" },
+|   { text: "z" },
+]);
+| var view = db._createView("view", "arangosearch",
+  { links: { test: { analyzers: [ "collation_en", "collation_sv" ], includeAllFields: true }}});
+~ db._query("FOR doc IN view OPTIONS { waitForSync: true } LIMIT 1 RETURN true");
+db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_en')[0], 'collation_en') RETURN doc.text");
+db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_sv')[0], 'collation_sv') RETURN doc.text");
+~ db._dropView(view.name());
+~ db._drop(test.name());
+~ analyzers.remove(en.name);
+~ analyzers.remove(sv.name);
+@END_EXAMPLE_ARANGOSH_OUTPUT
+@endDocuBlock analyzerCollation
+{% endarangoshexample %}
+{% include arangoshexample.html id=examplevar script=script result=result %}
+
 ### `aql`
 
 <small>Introduced in: v3.8.0</small>
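As an aside, the English-versus-Swedish ordering difference that the `collation` hunk above documents can be reproduced outside ArangoDB with the standard ECMAScript `Intl.Collator` API. This is only an analogy to illustrate the locale behavior, not the Analyzer's implementation:

```javascript
// Illustration of locale-dependent ordering using the standard
// ECMAScript Internationalization API (an analogy, not ArangoDB code).
const texts = ["a", "å", "b", "z"];

const en = new Intl.Collator("en"); // English: "å" is treated like "a"
const sv = new Intl.Collator("sv"); // Swedish: "å" sorts after "z"

// Range condition corresponding to the example query `doc.text < 'c'`:
const beforeCEn = texts.filter((t) => en.compare(t, "c") < 0);
const beforeCSv = texts.filter((t) => sv.compare(t, "c") < 0);

console.log(beforeCEn); // [ 'a', 'å', 'b' ]
console.log(beforeCSv); // [ 'a', 'b' ]
```

The two filter results match the view query output in the commit's example: under the Swedish locale, `å` falls outside the range `< c` because it sits at the end of the alphabet.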
@@ -728,10 +784,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
 
-{% comment %}
 ### `stopwords`
 
-<small>Introduced in: v3.8.0</small>
+<small>Introduced in: v3.8.1</small>
 
 An Analyzer capable of removing specified tokens from the input.
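The core idea of the re-added `stopwords` Analyzer, dropping listed tokens from a token stream, can be sketched in a few lines of plain JavaScript. `removeStopwords` is a hypothetical helper for illustration only, not ArangoDB's implementation or API:

```javascript
// Minimal sketch of stop-word removal (illustrative only; the
// `stopwords` Analyzer itself is implemented inside ArangoDB).
function removeStopwords(tokens, stopwords) {
  const drop = new Set(stopwords);
  return tokens.filter((token) => !drop.has(token));
}

// Tokens as an upstream Analyzer (e.g. `text`) might produce them:
const tokens = ["the", "quick", "brown", "fox", "and", "the", "dog"];
const result = removeStopwords(tokens, ["and", "the"]);
console.log(result); // [ 'quick', 'brown', 'fox', 'dog' ]
```

As the commit's `pipeline` example shows, this kind of filter is typically run after tokenization and normalization, so the stopword list only needs to contain the normalized forms.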

@@ -802,7 +857,70 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
 @endDocuBlock analyzerPipelineStopwords
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
-{% endcomment %}
+
+### `segmentation`
+
+<small>Introduced in: v3.9.0</small>
+
+An Analyzer capable of breaking up the input text into tokens in a
+language-agnostic manner as per
+[Unicode Standard Annex #29](https://unicode.org/reports/tr29){:target="_blank"},
+making it suitable for mixed-language strings. It can optionally preserve all
+non-whitespace or all characters instead of keeping alphanumeric characters only,
+as well as apply case conversion.
+
+The *properties* allowed for this Analyzer are an object with the following
+attributes:
+
+- `break` (string, _optional_):
+  - `"all"`: return all tokens
+  - `"alpha"`: return tokens composed of alphanumeric characters only (default).
+    Alphanumeric characters are Unicode codepoints from the Letter and Number
+    categories, see
+    [Unicode Technical Note #36](https://www.unicode.org/notes/tn36/){:target="_blank"}.
+  - `"graphic"`: return tokens composed of non-whitespace characters only.
+    Note that the list of whitespace characters does not include line breaks:
+    - `U+0009` Character Tabulation
+    - `U+0020` Space
+    - `U+0085` Next Line
+    - `U+00A0` No-break Space
+    - `U+1680` Ogham Space Mark
+    - `U+2000` En Quad
+    - `U+2028` Line Separator
+    - `U+202F` Narrow No-break Space
+    - `U+205F` Medium Mathematical Space
+    - `U+3000` Ideographic Space
+- `case` (string, _optional_):
+  - `"lower"` to convert to all lower-case characters (default)
+  - `"upper"` to convert to all upper-case characters
+  - `"none"` to not change character case
+
+**Examples**
+
+Create different `segmentation` Analyzers to show the behavior of the different
+`break` options:
+
+{% arangoshexample examplevar="examplevar" script="script" result="result" %}
+@startDocuBlockInline analyzerSegmentationBreak
+@EXAMPLE_ARANGOSH_OUTPUT{analyzerSegmentationBreak}
+var analyzers = require("@arangodb/analyzers");
+var all = analyzers.save("segment_all", "segmentation", { break: "all" }, []);
+var alpha = analyzers.save("segment_alpha", "segmentation", { break: "alpha" }, []);
+var graphic = analyzers.save("segment_graphic", "segmentation", { break: "graphic" }, []);
+| db._query(`LET str = 'Test\twith An_EMAIL-address+123@example.org\n蝴蝶。\u2028бутерброд'
+| RETURN {
+|   "all": TOKENS(str, 'segment_all'),
+|   "alpha": TOKENS(str, 'segment_alpha'),
+|   "graphic": TOKENS(str, 'segment_graphic'),
+| }
+`);
+~ analyzers.remove(all.name);
+~ analyzers.remove(alpha.name);
+~ analyzers.remove(graphic.name);
+@END_EXAMPLE_ARANGOSH_OUTPUT
+@endDocuBlock analyzerSegmentationBreak
+{% endarangoshexample %}
+{% include arangoshexample.html id=examplevar script=script result=result %}
 
 ### `geojson`
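The UAX #29 word segmentation that the new `segmentation` Analyzer is documented to follow is also exposed in modern JavaScript as `Intl.Segmenter` (available in Node.js 16 and later). The snippet below is only a rough analogy: keeping word-like segments loosely corresponds to `break: "alpha"` with `case: "lower"`, and exact token boundaries may differ between ICU versions:

```javascript
// Rough analogy to the `segmentation` Analyzer using Intl.Segmenter,
// which also implements UAX #29 word segmentation (not ArangoDB code).
const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
const input = "Test\twith 蝴蝶。бутерброд";

// All segments, including whitespace and punctuation (like break: "all"):
const all = [...segmenter.segment(input)].map((s) => s.segment);

// Word-like segments only, roughly break: "alpha" with case: "lower":
const alpha = [...segmenter.segment(input)]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment.toLowerCase());

console.log(alpha);
```

Latin, Han, and Cyrillic runs are each segmented by the same locale-independent rules, which is what makes this approach suitable for mixed-language strings.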

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
arangosh&gt; <span class="hljs-keyword">var</span> analyzers = <span class="hljs-built_in">require</span>(<span class="hljs-string">&quot;@arangodb/analyzers&quot;</span>);
2+
arangosh&gt; <span class="hljs-keyword">var</span> en = analyzers.save(<span class="hljs-string">&quot;collation_en&quot;</span>, <span class="hljs-string">&quot;collation&quot;</span>, { <span class="hljs-attr">locale</span>: <span class="hljs-string">&quot;en.utf-8&quot;</span> }, []);
3+
arangosh&gt; <span class="hljs-keyword">var</span> sv = analyzers.save(<span class="hljs-string">&quot;collation_sv&quot;</span>, <span class="hljs-string">&quot;collation&quot;</span>, { <span class="hljs-attr">locale</span>: <span class="hljs-string">&quot;sv.utf-8&quot;</span> }, []);
4+
arangosh&gt; <span class="hljs-keyword">var</span> test = db._create(<span class="hljs-string">&quot;test&quot;</span>);
5+
arangosh&gt; db.test.save([
6+
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;a&quot;</span> },
7+
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;å&quot;</span> },
8+
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;b&quot;</span> },
9+
........&gt; { <span class="hljs-attr">text</span>: <span class="hljs-string">&quot;z&quot;</span> },
10+
........&gt; ]);
11+
[
12+
{
13+
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69994&quot;</span>,
14+
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69994&quot;</span>,
15+
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO---&quot;</span>
16+
},
17+
{
18+
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69995&quot;</span>,
19+
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69995&quot;</span>,
20+
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--_&quot;</span>
21+
},
22+
{
23+
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69996&quot;</span>,
24+
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69996&quot;</span>,
25+
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--A&quot;</span>
26+
},
27+
{
28+
<span class="hljs-string">&quot;_id&quot;</span> : <span class="hljs-string">&quot;test/69997&quot;</span>,
29+
<span class="hljs-string">&quot;_key&quot;</span> : <span class="hljs-string">&quot;69997&quot;</span>,
30+
<span class="hljs-string">&quot;_rev&quot;</span> : <span class="hljs-string">&quot;_c0M_WIO--B&quot;</span>
31+
}
32+
]
33+
arangosh&gt; <span class="hljs-keyword">var</span> view = db._createView(<span class="hljs-string">&quot;view&quot;</span>, <span class="hljs-string">&quot;arangosearch&quot;</span>,
34+
........&gt; { <span class="hljs-attr">links</span>: { <span class="hljs-attr">test</span>: { <span class="hljs-attr">analyzers</span>: [ <span class="hljs-string">&quot;collation_en&quot;</span>, <span class="hljs-string">&quot;collation_sv&quot;</span> ], <span class="hljs-attr">includeAllFields</span>: <span class="hljs-literal">true</span> }}});
35+
arangosh&gt; db._query(<span class="hljs-string">&quot;FOR doc IN view SEARCH ANALYZER(doc.text &lt; TOKENS(&#x27;c&#x27;, &#x27;collation_en&#x27;)[0], &#x27;collation_en&#x27;) RETURN doc.text&quot;</span>);
36+
[
37+
<span class="hljs-string">&quot;a&quot;</span>,
38+
<span class="hljs-string">&quot;å&quot;</span>,
39+
<span class="hljs-string">&quot;b&quot;</span>
40+
]
41+
[object ArangoQueryCursor, <span class="hljs-attr">count</span>: <span class="hljs-number">3</span>, <span class="hljs-attr">cached</span>: <span class="hljs-literal">false</span>, <span class="hljs-attr">hasMore</span>: <span class="hljs-literal">false</span>]
42+
arangosh&gt; db._query(<span class="hljs-string">&quot;FOR doc IN view SEARCH ANALYZER(doc.text &lt; TOKENS(&#x27;c&#x27;, &#x27;collation_sv&#x27;)[0], &#x27;collation_sv&#x27;) RETURN doc.text&quot;</span>);
43+
[
44+
<span class="hljs-string">&quot;a&quot;</span>,
45+
<span class="hljs-string">&quot;b&quot;</span>
46+
]
47+
[object ArangoQueryCursor, <span class="hljs-attr">count</span>: <span class="hljs-number">2</span>, <span class="hljs-attr">cached</span>: <span class="hljs-literal">false</span>, <span class="hljs-attr">hasMore</span>: <span class="hljs-literal">false</span>]
