@@ -100,29 +100,34 @@ The currently implemented Analyzer types are:
 - `ngram`: create _n_-grams from value with user-defined lengths
 - `text`: tokenize into words, optionally with stemming,
   normalization, stop-word filtering and edge _n_-gram generation
+- `segmentation`: language-agnostic text tokenization, optionally with
+  normalization
 - `aql`: for running an AQL query to prepare tokens for the index
 - `pipeline`: for chaining multiple Analyzers
-{%- comment %}- `stopwords`: removes the specified tokens from the input{% endcomment %}
+- `stopwords`: removes the specified tokens from the input
+- `collation`: to respect the alphabetic order of a language in range queries
 - `geojson`: breaks up a GeoJSON object into a set of indexable tokens
 - `geopoint`: breaks up a JSON object describing a coordinate into a set of
   indexable tokens
 
 Available normalizations are case conversion and accent removal
 (conversion of characters with diacritical marks to the base characters).
 
-Analyzer / Feature        | Tokenization | Stemming | Normalization | _N_-grams
-:-------------------------|:------------:|:--------:|:-------------:|:--------:
-[`identity`](#identity)   | No           | No       | No            | No
-[`delimiter`](#delimiter) | (Yes)        | No       | No            | No
-[`stem`](#stem)           | No           | Yes      | No            | No
-[`norm`](#norm)           | No           | No       | Yes           | No
-[`ngram`](#ngram)         | No           | No       | No            | Yes
-[`text`](#text)           | Yes          | Yes      | Yes           | (Yes)
-[`aql`](#aql)             | (Yes)        | (Yes)    | (Yes)         | (Yes)
-[`pipeline`](#pipeline)   | (Yes)        | (Yes)    | (Yes)         | (Yes)
-{%- comment %}[`stopwords`](#stopwords) | No | No | No | No{% endcomment %}
-[`geojson`](#geojson)     | –            | –        | –             | –
-[`geopoint`](#geopoint)   | –            | –        | –             | –
+Analyzer / Feature              | Tokenization | Stemming | Normalization | _N_-grams
+:------------------------------:|:------------:|:--------:|:-------------:|:--------:
+[`identity`](#identity)         | No           | No       | No            | No
+[`delimiter`](#delimiter)       | (Yes)        | No       | No            | No
+[`stem`](#stem)                 | No           | Yes      | No            | No
+[`norm`](#norm)                 | No           | No       | Yes           | No
+[`ngram`](#ngram)               | No           | No       | No            | Yes
+[`text`](#text)                 | Yes          | Yes      | Yes           | (Yes)
+[`segmentation`](#segmentation) | Yes          | No       | Yes           | No
+[`aql`](#aql)                   | (Yes)        | (Yes)    | (Yes)         | (Yes)
+[`pipeline`](#pipeline)         | (Yes)        | (Yes)    | (Yes)         | (Yes)
+[`stopwords`](#stopwords)       | No           | No       | No            | No
+[`collation`](#collation)       | No           | No       | No            | No
+[`geojson`](#geojson)           | –            | –        | –             | –
+[`geopoint`](#geopoint)         | –            | –        | –             | –
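The case conversion and accent removal normalizations summarized above can be sketched with standard Unicode tooling. The snippet below is an illustrative approximation in plain JavaScript, not ArangoDB's actual (ICU-based) implementation; the function name `normalizeToken` is made up for this sketch:

```javascript
// Illustrative sketch only: approximates "case conversion" and "accent
// removal" with standard Unicode normalization. Not ArangoDB's implementation.
function normalizeToken(token) {
  return token
    .normalize("NFD")        // decompose characters and diacritical marks
    .replace(/\p{M}/gu, "")  // strip combining marks (accent removal)
    .toLowerCase();          // case conversion
}

console.log(normalizeToken("Crème Brûlée")); // "creme brulee"
```

This mirrors what a `norm` Analyzer configured with accent removal and lower-casing produces conceptually: characters with diacritical marks collapse to their base characters.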
 
 Analyzer Properties
 -------------------
@@ -163,7 +168,6 @@ attributes:
 
 - `delimiter` (string): the delimiting character(s)
 
-
 **Examples**
 
 Split input strings into tokens at hyphen-minus characters:
@@ -486,6 +490,58 @@ stemming disabled and `"the"` defined as stop-word to exclude it:
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
 
+### `collation`
+
+<small>Introduced in: v3.9.0</small>
+
+An Analyzer capable of converting the input into a set of language-specific
+tokens. This makes comparisons follow the rules of the respective language,
+most notably in range queries against Views.
+
+The *properties* allowed for this Analyzer are an object with the following
+attributes:
+
+- `locale` (string): a locale in the format
+  `language[_COUNTRY][.encoding][@variant]` (square brackets denote optional
+  parts), e.g. `"de.utf-8"` or `"en_US.utf-8"`. Only UTF-8 encoding is
+  meaningful in ArangoDB. Also see [Supported Languages](#supported-languages).
+
+**Examples**
+
+In Swedish, the letter `å` (note the small circle above the `a`) comes after
+`z`. Other languages treat it like a regular `a`, putting it before `b`.
+The example below creates two `collation` Analyzers, one with an English locale
+(`en`) and one with a Swedish locale (`sv`). It then demonstrates the
+difference in alphabetical order using a simple range query that returns
+letters before `c`:
+
+{% arangoshexample examplevar="examplevar" script="script" result="result" %}
+    @startDocuBlockInline analyzerCollation
+    @EXAMPLE_ARANGOSH_OUTPUT{analyzerCollation}
+      var analyzers = require("@arangodb/analyzers");
+      var en = analyzers.save("collation_en", "collation", { locale: "en.utf-8" }, []);
+      var sv = analyzers.save("collation_sv", "collation", { locale: "sv.utf-8" }, []);
+      var test = db._create("test");
+    | db.test.save([
+    |   { text: "a" },
+    |   { text: "å" },
+    |   { text: "b" },
+    |   { text: "z" },
+    | ]);
+    | var view = db._createView("view", "arangosearch",
+        { links: { test: { analyzers: [ "collation_en", "collation_sv" ], includeAllFields: true }}});
+    ~ db._query("FOR doc IN view OPTIONS { waitForSync: true } LIMIT 1 RETURN true");
+      db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_en')[0], 'collation_en') RETURN doc.text");
+      db._query("FOR doc IN view SEARCH ANALYZER(doc.text < TOKENS('c', 'collation_sv')[0], 'collation_sv') RETURN doc.text");
+    ~ db._dropView(view.name());
+    ~ db._drop(test.name());
+    ~ analyzers.remove(en.name);
+    ~ analyzers.remove(sv.name);
+    @END_EXAMPLE_ARANGOSH_OUTPUT
+    @endDocuBlock analyzerCollation
+{% endarangoshexample %}
+{% include arangoshexample.html id=examplevar script=script result=result %}
+
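The language-dependent ordering shown in the collation example can also be observed outside of ArangoDB with JavaScript's standard `Intl.Collator` API. This sketch only illustrates the concept; it is unrelated to the server's ICU-based implementation:

```javascript
// Conceptual illustration of locale-aware ordering: "å" sorts after "z" in
// Swedish, but like an accented "a" (before "b") in English.
const letters = ["a", "å", "b", "z"];

const en = new Intl.Collator("en");
const sv = new Intl.Collator("sv");

console.log([...letters].sort(en.compare)); // [ 'a', 'å', 'b', 'z' ]
console.log([...letters].sort(sv.compare)); // [ 'a', 'b', 'z', 'å' ]
```

A range predicate like `doc.text < 'c'` therefore matches different documents depending on which collation locale the Analyzer was created with.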
 ### `aql`
 
 <small>Introduced in: v3.8.0</small>
@@ -728,10 +784,9 @@ Split at delimiting characters `,` and `;`, then stem the tokens:
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
 
-{% comment %}
 ### `stopwords`
 
-<small>Introduced in: v3.8.0</small>
+<small>Introduced in: v3.8.1</small>
 
 An Analyzer capable of removing specified tokens from the input.
 
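Conceptually, stop-word removal is just filtering a token stream against a fixed set. A minimal sketch in plain JavaScript (illustrative only, not ArangoDB's implementation; `removeStopwords` is a made-up helper name):

```javascript
// Minimal sketch of stop-word removal: drop every token that appears in a
// user-defined list. The real Analyzer applies this to the server-side
// token stream before indexing.
function removeStopwords(tokens, stopwords) {
  const set = new Set(stopwords); // Set gives O(1) membership checks
  return tokens.filter((t) => !set.has(t));
}

console.log(removeStopwords(["the", "quick", "brown", "fox"], ["the", "a"]));
// [ 'quick', 'brown', 'fox' ]
```

Because the comparison is exact, stop-words are typically defined to match the already-normalized form of the tokens (e.g. lower-case), as in the `pipeline` example that chains `norm` and `stopwords`.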
@@ -802,7 +857,70 @@ lower-case and base characters) and then discards the stopwords `and` and `the`:
     @endDocuBlock analyzerPipelineStopwords
 {% endarangoshexample %}
 {% include arangoshexample.html id=examplevar script=script result=result %}
-{% endcomment %}
+
+### `segmentation`
+
+<small>Introduced in: v3.9.0</small>
+
+An Analyzer capable of breaking up the input text into tokens in a
+language-agnostic manner as per
+[Unicode Standard Annex #29](https://unicode.org/reports/tr29){:target="_blank"},
+making it suitable for mixed-language strings. It can optionally preserve all
+non-whitespace or all characters instead of keeping alphanumeric characters only,
+as well as apply case conversion.
+
+The *properties* allowed for this Analyzer are an object with the following
+attributes:
+
+- `break`:
+  - `"all"`: return all tokens
+  - `"alpha"`: return tokens composed of alphanumeric characters only (default).
+    Alphanumeric characters are Unicode codepoints from the Letter and Number
+    categories, see
+    [Unicode Technical Note #36](https://www.unicode.org/notes/tn36/){:target="_blank"}.
+  - `"graphic"`: return tokens composed of non-whitespace characters only.
+    Note that the list of whitespace characters does not include line breaks:
+    - `U+0009` Character Tabulation
+    - `U+0020` Space
+    - `U+0085` Next Line
+    - `U+00A0` No-break Space
+    - `U+1680` Ogham Space Mark
+    - `U+2000` En Quad
+    - `U+2028` Line Separator
+    - `U+202F` Narrow No-break Space
+    - `U+205F` Medium Mathematical Space
+    - `U+3000` Ideographic Space
+- `case`:
+  - `"lower"` to convert to all lower-case characters (default)
+  - `"upper"` to convert to all upper-case characters
+  - `"none"` to not change character case
+
+**Examples**
+
+Create different `segmentation` Analyzers to show the behavior of the different
+`break` options:
+
+{% arangoshexample examplevar="examplevar" script="script" result="result" %}
+    @startDocuBlockInline analyzerSegmentationBreak
+    @EXAMPLE_ARANGOSH_OUTPUT{analyzerSegmentationBreak}
+      var analyzers = require("@arangodb/analyzers");
+      var all = analyzers.save("segment_all", "segmentation", { "break": "all" }, []);
+      var alpha = analyzers.save("segment_alpha", "segmentation", { "break": "alpha" }, []);
+      var graphic = analyzers.save("segment_graphic", "segmentation", { "break": "graphic" }, []);
+    | db._query(`LET str = 'Test\t with An_EMAIL-address+123@example.org\n 蝴蝶。\u2028 бутерброд'
+    | RETURN {
+    |   "all": TOKENS(str, 'segment_all'),
+    |   "alpha": TOKENS(str, 'segment_alpha'),
+    |   "graphic": TOKENS(str, 'segment_graphic'),
+    | }`);
+    ~ analyzers.remove(all.name);
+    ~ analyzers.remove(alpha.name);
+    ~ analyzers.remove(graphic.name);
+    @END_EXAMPLE_ARANGOSH_OUTPUT
+    @endDocuBlock analyzerSegmentationBreak
+{% endarangoshexample %}
+{% include arangoshexample.html id=examplevar script=script result=result %}
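UAX #29 word segmentation, which the `segmentation` Analyzer performs server-side, is also exposed in modern JavaScript through the standard `Intl.Segmenter` API (Node.js 16+). The sketch below is unrelated to ArangoDB's implementation; keeping only "word-like" segments roughly mimics `break: "alpha"`, and the helper name `segmentAlpha` is made up for this illustration:

```javascript
// Sketch of Unicode UAX #29 word segmentation via Intl.Segmenter.
// Filtering to word-like segments approximates the Analyzer's default
// `break: "alpha"` mode; keeping all non-space segments would be closer
// to `break: "graphic"`. Word-break rules are largely language-agnostic.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

function segmentAlpha(text) {
  return [...segmenter.segment(text)]
    .filter((s) => s.isWordLike)          // drop spaces and punctuation
    .map((s) => s.segment.toLowerCase()); // mimic the default case: "lower"
}

console.log(segmentAlpha("Test with Numbers 123!"));
// [ 'test', 'with', 'numbers', '123' ]
```

Numbers count as word-like because UAX #29 treats Letter and Number codepoints as word constituents, matching the `"alpha"` description above.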
 
 ### `geojson`
 