FIELD DATA TYPES by Bo Andersen - codingexplained.com
OUTLINE ➤ Core data types ➤ String, numeric, date, boolean, binary ➤ Complex data types ➤ Object, array, nested ➤ Geo data types ➤ Geo-point, Geo-shape ➤ Specialized data types ➤ IPv4, completion, token count, attachment
CORE DATA TYPES
STRING ➤ String field types accept string values ➤ Can be sub-divided into full text and keywords ➤ We will take a look at these next
STRING - FULL TEXT ➤ Typically used for text based relevance searches (e.g. search for products by name) ➤ Full text fields are analyzed ➤ Data is passed through an analyzer to convert the string into a list of individual terms, before being indexed ➤ This allows Elasticsearch to search for individual words within a full text field ➤ Full text fields are not used for sorting and are rarely used for aggregations
STRING - KEYWORDS ➤ Exact values such as tags, status, e-mail addresses, etc. ➤ Keywords fields are not analyzed ➤ The exact string value is added to the index as a single term ➤ Typically used for filtering ➤ E.g. find all products where status is "On Discount" ➤ Also often used for sorting and aggregations
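To make the difference between the two string flavours concrete, here is a minimal Python sketch. The `analyze_full_text` function is only a crude stand-in for an analyzer (lowercase, then split on non-alphanumeric characters); it is not how Elasticsearch's analyzers are actually implemented. The function names and the product name are made up for illustration.

```python
import re


def analyze_full_text(value):
    """Rough analyzer stand-in: lowercase, then split on non-alphanumerics."""
    return [term for term in re.split(r"[^0-9a-z]+", value.lower()) if term]


def index_keyword(value):
    """Keyword fields are not analyzed: the whole string becomes one term."""
    return [value]


# Full text: individual words become searchable terms.
print(analyze_full_text("Red Leather Jacket"))

# Keyword: only the exact value matches, which suits filtering.
print(index_keyword("On Discount"))
```

With terms like these, a search for "leather" can find the product by name, while a filter on status only matches the exact string "On Discount".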
NUMERIC ➤ Supports the following numeric types ➤ long (signed 64-bit integer) ➤ integer (signed 32-bit integer) ➤ short (signed 16-bit integer) ➤ byte (signed 8-bit integer) ➤ double (double-precision 64-bit floating point) ➤ float (single-precision 32-bit floating point)
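As a quick reference, the value range of each signed integer type follows directly from its bit width. The helper name `signed_range` below is just an illustration and not part of any Elasticsearch API.

```python
def signed_range(bits):
    """Return (min, max) for a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Bit widths as listed on the slide.
INTEGER_TYPES = {"long": 64, "integer": 32, "short": 16, "byte": 8}

for name, bits in INTEGER_TYPES.items():
    low, high = signed_range(bits)
    print(f"{name}: {low} .. {high}")
```

Picking the smallest type that fits the data (e.g. `byte` for values between -128 and 127) is the usual reason to care about these ranges.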
DATE ➤ Dates in Elasticsearch can be either ➤ Strings containing formatted dates ➤ E.g. 2016-01-01 or 2016/01/01 12:00:00 ➤ A long number representing milliseconds since the epoch ➤ An integer representing seconds since the epoch ➤ Internally stored as a long number representing milliseconds since the epoch
DATE - FORMATS ➤ Defaults to strict_date_optional_time||epoch_millis ➤ Dates with optional timestamps, which conform to the formats supported by strict_date_optional_time - or milliseconds since the epoch ➤ Examples ➤ 2016-01-01 (date only) ➤ 2016-01-01T12:00:00Z (date including time) ➤ 1410020500000 (milliseconds since the epoch) ➤ Multiple formats can be specified by separating them with the || separator ➤ E.g. yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis
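The idea of trying several formats in order, as with the `||` separator, can be sketched in plain Python. This is only an illustration: the patterns are Python `strptime` codes standing in for the Elasticsearch format strings, the function name `to_epoch_millis` is made up, and dates without a time zone are assumed to be UTC.

```python
from datetime import datetime, timezone

# Python strptime patterns standing in for "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd".
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def to_epoch_millis(value, formats=FORMATS):
    """Try each pattern in order; integers are treated as epoch milliseconds."""
    if isinstance(value, int):
        return value
    for pattern in formats:
        try:
            parsed = datetime.strptime(value, pattern)
        except ValueError:
            continue  # this pattern did not match, try the next one
        # Assumption: values without a zone are interpreted as UTC.
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    raise ValueError(f"no configured format matches {value!r}")


print(to_epoch_millis("2016-01-01"))
print(to_epoch_millis("2016-01-01 12:00:00"))
print(to_epoch_millis(1410020500000))
```

Whichever format matches, the result is the same kind of value: a long number of milliseconds since the epoch, which is how the date is stored internally.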
BOOLEAN ➤ Boolean fields accept true and false values as in JSON ➤ Can also accept strings and numbers which are interpreted as either true or false ➤ False values ➤ false, "false", "off", "no", "0", "" (empty string), 0, 0.0 ➤ True values ➤ Anything that is not false
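The rules above can be written down as a small Python function. This is a sketch of the interpretation described on the slide, not Elasticsearch's own code, and the name `to_boolean` is hypothetical.

```python
# String values that the slide lists as false.
FALSE_STRINGS = {"false", "off", "no", "0", ""}


def to_boolean(value):
    """Interpret a JSON value as a boolean using the rules listed on the slide."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value not in FALSE_STRINGS
    if isinstance(value, (int, float)):
        return value != 0  # 0 and 0.0 are false, any other number is true
    raise TypeError(f"unsupported value: {value!r}")


for sample in [False, "off", "", 0.0, "yes", 1, True]:
    print(repr(sample), "->", to_boolean(sample))
```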
BINARY ➤ A binary value as a Base64 encoded string ➤ E.g. aHR0cDovL2NvZGluZ2V4cGxhaW5lZC5jb20= ➤ Not searchable
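To see what such a value contains, the example string can be decoded with Python's standard `base64` module. This only illustrates the encoding; it does not involve Elasticsearch.

```python
import base64

encoded = "aHR0cDovL2NvZGluZ2V4cGxhaW5lZC5jb20="

# Decode the Base64 text back into the raw bytes it represents.
decoded = base64.b64decode(encoded)
print(decoded)

# Encoding the bytes again yields the string that would be sent in the document.
round_trip = base64.b64encode(decoded).decode("ascii")
print(round_trip == encoded)
```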
COMPLEX DATA TYPES
OBJECT ➤ JSON documents are hierarchical ➤ A document may contain inner objects, which in turn may contain inner objects ➤ In Elasticsearch, documents are indexed as flat lists of key-value pairs { "message": "Some text...", "customer.age": 26, "customer.address.city": "Copenhagen", "customer.address.country": "Denmark" }
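The flattening can be mimicked with a short recursive function. This is a sketch of the idea only; the function name `flatten` is made up, and the nested input document is reconstructed from the flat keys shown above.

```python
def flatten(document, prefix=""):
    """Turn nested dicts into a flat dict with dot-separated keys."""
    flat = {}
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the inner object, extending the dotted path.
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


document = {
    "message": "Some text...",
    "customer": {
        "age": 26,
        "address": {"city": "Copenhagen", "country": "Denmark"},
    },
}

print(flatten(document))
```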
ARRAY ➤ Elasticsearch does not have a dedicated array type ➤ Any field can contain zero or more values by default ➤ All values in an array must be of the same data type ➤ When adding a field dynamically, the first value in the array determines the field type ➤ Examples ➤ Array of strings: ["Elasticsearch", "rocks"] ➤ Array of integers: [1, 2] ➤ Array of arrays: [1, [2, 3]] - equivalent of [1, 2, 3] ➤ Array of objects: [{ "name": "Andy", "age": 26 }, { "name": "Brenda", "age": 32 }]
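The "array of arrays" case is the least obvious one, so here is a small Python sketch of why [1, [2, 3]] is treated like [1, 2, 3]. The helper name `flatten_values` is hypothetical and only illustrates the idea.

```python
def flatten_values(values):
    """Collapse nested lists into one flat list of values, preserving order."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten_values(value))
        else:
            flat.append(value)
    return flat


print(flatten_values([1, [2, 3]]))
print(flatten_values(["Elasticsearch", "rocks"]))
```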
ARRAY - OBJECTS ➤ Arrays of objects do not work as you would expect ➤ You cannot query each object independently of the other objects in the array ➤ Lucene has no concept of inner objects ➤ Elasticsearch flattens object hierarchies into a list of field names and values, so this document: { "users": [{ "name": "Andy", "age": 26 }, { "name": "Brenda", "age": 32 }] } is stored similar to this: { "users.name": ["Andy", "Brenda"], "users.age": [26, 32] } ➤ The association between "Andy" and 26 is lost ➤ A search for a user named "Andy" who is 32 years old would incorrectly match this document! ➤ If you need to be able to query each object on its own, then you must use the nested data type
NESTED ➤ If you need to index arrays of objects and to maintain the independence of each object in the array, you should use the nested data type ➤ Internally, nested objects index each object in the array as a separate hidden document ➤ Each nested object can be queried independently of the others, with a nested query ➤ A nested query is executed against the nested objects as if they were indexed as separate documents (internally, this is actually the case)
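The lost association is easy to reproduce in a few lines of Python. The sketch below imitates the flattening and then runs a naive "all conditions must match" check against the flattened fields. The helper names are made up, and the matching logic is a simplification of what a real query does.

```python
def flatten_objects(field, objects):
    """Merge an array of objects into multi-valued, dot-named fields."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_all(flat, conditions):
    """True if every condition value occurs somewhere in its flattened field."""
    return all(value in flat.get(path, []) for path, value in conditions.items())


users = [{"name": "Andy", "age": 26}, {"name": "Brenda", "age": 32}]
flat = flatten_objects("users", users)
print(flat)

# No single user is named Andy AND aged 32, yet the flattened document matches.
print(matches_all(flat, {"users.name": "Andy", "users.age": 32}))
```

The second check succeeds because "Andy" and 32 are both present in the document, just not on the same user.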
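Conceptually, a nested query checks the conditions against one object at a time. The following Python sketch shows that per-object evaluation. It is an illustration of the behaviour only, not of how hidden documents are implemented, and `nested_match` is a hypothetical name.

```python
def nested_match(objects, conditions):
    """True if at least one object satisfies all conditions on its own."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )


users = [{"name": "Andy", "age": 26}, {"name": "Brenda", "age": 32}]

# Andy really is 26, so this matches.
print(nested_match(users, {"name": "Andy", "age": 26}))

# Nobody is both named Andy and aged 32, so this does not match.
print(nested_match(users, {"name": "Andy", "age": 32}))
```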
GEO DATA TYPES
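GEO-POINT ➤ Latitude-longitude pairs ➤ Used for geographical operations on documents (searching, sorting, ...) ➤ A geo-point can be expressed in four ways ➤ 1. As an object with lat and lon keys: { "location": { "lat": 33.5206608, "lon": -86.8024900 } } ➤ 2. As a string in "lat,lon" order: { "location": "33.5206608,-86.8024900" } ➤ 3. As a geohash: { "location": "drm3btev3e86" } ➤ 4. As an array in [lon, lat] order: { "location": [-86.8024900,33.5206608] }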
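The ordering of latitude and longitude differs between the formats, which is easy to get wrong. The Python sketch below normalizes the object, string, and array formats to a (lat, lon) tuple. The function name `to_lat_lon` is made up, and the geohash format is deliberately left out because decoding it needs more code than fits here.

```python
def to_lat_lon(location):
    """Normalize a geo-point value to a (lat, lon) tuple of floats."""
    if isinstance(location, dict):
        return float(location["lat"]), float(location["lon"])
    if isinstance(location, str):
        # String format is "lat,lon" (geohash strings are not handled here).
        lat, lon = location.split(",")
        return float(lat), float(lon)
    if isinstance(location, (list, tuple)):
        # Array format is [lon, lat], i.e. the opposite order of the string.
        lon, lat = location
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo-point value: {location!r}")


print(to_lat_lon({"lat": 33.5206608, "lon": -86.8024900}))
print(to_lat_lon("33.5206608,-86.8024900"))
print(to_lat_lon([-86.8024900, 33.5206608]))
```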
GEO-SHAPE ➤ Geo shapes such as rectangles and polygons ➤ Should be used when either the data being indexed or the queries being executed contain shapes other than just points ➤ LineString ➤ Array of two or more positions (array of arrays). Straight line in the case of two points ➤ Polygon ➤ An array of arrays, where each array contains points ➤ The first and last points in the outer array must be the same (to close the polygon) ➤ ...
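The structural rules for LineString and Polygon can be checked with a few lines of Python. This is a sketch under some assumptions: the shapes are written as GeoJSON-style dictionaries, the coordinates are invented, and the minimum of four positions per polygon ring comes from the GeoJSON specification rather than from the slide.

```python
def is_valid_linestring(positions):
    """A LineString needs two or more positions."""
    return len(positions) >= 2


def is_closed_ring(ring):
    """A polygon ring must start and end on the same position."""
    return len(ring) >= 4 and ring[0] == ring[-1]


# Hypothetical shapes; positions are [lon, lat] pairs.
line = {"type": "linestring", "coordinates": [[-77.03, 38.89], [-77.00, 38.88]]}
polygon = {
    "type": "polygon",
    "coordinates": [
        [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]
    ],
}

print(is_valid_linestring(line["coordinates"]))
print(all(is_closed_ring(ring) for ring in polygon["coordinates"]))
```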
SPECIALIZED DATA TYPES
IPV4 ➤ Used to map IPv4 addresses ➤ Internally, values are indexed as long values
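An IPv4 address is just a 32-bit number written in dotted notation, which is why it fits in a long value. Python's standard `ipaddress` module can show the conversion; the function name and the sample address are only for illustration.

```python
import ipaddress


def ipv4_to_long(address):
    """Convert a dotted IPv4 address to its numeric value."""
    return int(ipaddress.IPv4Address(address))


print(ipv4_to_long("192.168.1.1"))  # 3232235777
```

Storing addresses as numbers is what makes range-style lookups (for example, all addresses within a subnet) a simple numeric comparison.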
COMPLETION ➤ The completion suggester is a so-called prefix suggester ➤ It does not do spell correction, but enables basic auto-complete functionality ➤ Useful for providing the user with suggestions while searching, e.g. like on Google ➤ Stores an FST (Finite State Transducer) as part of the index ➤ Allows for very fast loads and executions ➤ You don't have to worry about this - just know when to use this type
TOKEN COUNT ➤ An integer field which accepts string values ➤ The string values are analyzed, and the number of tokens is indexed ➤ Example ➤ A name property could have a length field of the type token_count ➤ Then, a search query could be executed to find persons whose name contains X tokens (split by space, for instance)
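To get a feel for what "prefix suggester" means, here is a small Python sketch. It uses a sorted list and binary search instead of an FST, so it only imitates the behaviour and not the data structure. The class name and the sample phrases are made up.

```python
from bisect import bisect_left


class PrefixSuggester:
    """Toy prefix suggester backed by a sorted list (not an FST)."""

    def __init__(self, phrases):
        self._phrases = sorted(phrase.lower() for phrase in phrases)

    def suggest(self, prefix, size=5):
        """Return up to `size` phrases that start with the given prefix."""
        prefix = prefix.lower()
        start = bisect_left(self._phrases, prefix)
        results = []
        for phrase in self._phrases[start:]:
            if not phrase.startswith(prefix) or len(results) == size:
                break
            results.append(phrase)
        return results


suggester = PrefixSuggester(["Elasticsearch", "Elastic Stack", "Logstash", "Kibana"])
print(suggester.suggest("elas"))

# A misspelled prefix yields nothing: there is no spell correction.
print(suggester.suggest("elsa"))
```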
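The example can be sketched in Python, assuming the simplest possible analysis: splitting on whitespace. The function name `token_count` mirrors the type name but is otherwise hypothetical, as are the sample names.

```python
def token_count(value):
    """Count tokens using whitespace splitting as a stand-in for an analyzer."""
    return len(value.split())


persons = ["Anna Maria Jensen", "Peter Hansen", "Madonna"]

# Find persons whose name consists of exactly two tokens.
two_token_names = [name for name in persons if token_count(name) == 2]
print(two_token_names)
```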
ATTACHMENT ➤ Lets Elasticsearch index attachments in common formats ➤ E.g. PDF, XLS, PPT, ... ➤ Attachment content is stored as a Base64 encoded string ➤ This functionality is available as a plugin that must be installed ➤ sudo /path/to/elasticsearch/bin/plugin install mapper-attachments ➤ Must be installed on every node of a cluster ➤ Nodes must be restarted after the installation
THANK YOU FOR WATCHING!
