Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and Cassandra Paul Brebner instaclustr.com Technology Evangelist ApacheCon Geospatial Track 2020 © Instaclustr Pty Limited, 2020
Instaclustr Overview ©Instaclustr Pty Limited, 2020
Overview • “Location” in the news in 2020 • Anomaly Detection: baseline throughput • Spatial Anomaly Detection problem • Solutions: location representation and querying/indexing o Bounding boxes and secondary indexes o Geohashes o Lucene index o Results o 3D © 2020 Instaclustr Pty Limited
In the News John Conway Legendary Polymath Passed away this year © 2020 Instaclustr Pty Limited
Game of Life Next state of each cell depends on state of immediate neighbors © 2020 Instaclustr Pty Limited
Simple rules but complex patterns Game of Life © 2020 Instaclustr Pty Limited
Also In the News Social distancing and COVID-19 tracing Uncle Ron’s social distancing 3000 invention Or COVIDSAFE App? © 2020 Instaclustr Pty Limited
Uncle Ron’s social distancing 3000 invention Also In the News And “planet-killer” asteroid missed the Earth in April (16x moon orbit) © 2020 Instaclustr Pty Limited
Previously… KAFKA, CASSANDRA AND KUBERNETES AT SCALE: REAL- TIME ANOMALY DETECTION ON 19 BILLION EVENTS A DAY
Anomaly Detection Spot the difference At speed (< 1s RT) and scale (high throughput, lots of data) © 2020 Instaclustr Pty Limited
How Does It Work? • CUSUM (Cumulative Sum Control Chart) • Statistical analysis of historical data • Data for a single variable/key at a time • Potentially billions of keys © 2020 Instaclustr Pty Limited
Pipeline Design • Kafka, Cassandra and Kubernetes Clusters – Kafka handles streaming, Cassandra for data storage, Kubernetes for application scaling • Efficient Cassandra Data writes and reads with key, a unique “account ID” or similar © 2020 Instaclustr Pty Limited Kubernetes
Cassandra Data Model • Events are timeseries • Id is Partition Key - Time is clustering key (order) • Read gets most recent 50 values for id, very fast create table event_stream ( id bigint, time timestamp, value double, primary key (id, time) ) with clustering order by (time desc); select value from event_stream where id=314159265 limit 50; © 2020 Instaclustr Pty Limited
Cluster details – CPU cores Cassandra 384 + Application pipeline (workers) 118 + Kafka 72 = 574 Cores Total 384 118 72 0 100 200 300 400 500 600 700 Cores Cores Used 574 Total Cassandra Workers Kafka
Baseline Throughput 19 Billion Anomaly Checks/Day = 100% 0 20 40 60 80 100 120 Baseline (single transaction ID) Normalised (%) © 2020 Instaclustr Pty Limited
Harder Problem: Spot the Differences in Space Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Douglas Adams, The Hitchhiker’s Guide to the Galaxy © 2020 Instaclustr Pty Limited
Spatial Anomalies Many and varied © 2020 Instaclustr Pty Limited
Real Example: John Snow No, not this one! © 2020 Instaclustr Pty Limited
John Snow’s 1854 Cholera Map • Death’s per household + location • Used to identify a polluted pump (X) • Some outliers—brewers drank beer not water! X © 2020 Instaclustr Pty Limited
But… First you have to know where you are: Location To usefully represent location need: • Coordinate system • Map • Scale © 2020 Instaclustr Pty Limited
Better • <lat, long> coordinates • Scale • Interesting locations, “bulk of treasure here” © 2020 Instaclustr Pty Limited
Geospatial Anomaly Detection South Atlantic Geomagnetic Anomaly New problem… • Rather than a single ID, events now have a location (and a value) • The problem now is to o find the nearest 50 events to each new event o Quickly (< 1s RT) • Can’t make any assumptions about geospatial properties of events o including location, density or distribution – i.e. where, or how many o Need to search from smallest to increasingly larger areas o E.g. South Atlantic Geomagnetic Anomaly is BIG • Uber (ApacheCon 2019 talk) uses similar • technologies to o forecast demand o Increase area until they have sufficient data for predictions • Can we use <lat, long> as Cassandra partition key? o Yes, compound partition keys are allowed. o But can only select the exact locations. © 2020 Instaclustr Pty Limited
How to Compute Nearness To compute distance between locations you need coordinate system e.g. Mercator map Flat earth, distortion nearer poles © 2020 Instaclustr Pty Limited
World is (approx.) Spherical Calculation of distance between two latitudinal/longitudinal points is non-trivial © 2020 Instaclustr Pty Limited
Bounding Box Approximation of distance using inequalities © 2020 Instaclustr Pty Limited
Bounding Boxes and Cassandra? • Use ”country” partition key, Lat/long/time clustering keys • But can’t run the query with multiple inequalities CREATE TABLE latlong ( country text, lat double, long double, time timestamp, PRIMARY KEY (country, lat, long, time) ) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC); select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50; InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)" © 2020 Instaclustr Pty Limited
Secondary Indexes to the Rescue? Secondary Indexes o create index i1 on latlong (lat); o create index i2 on latlong (long); • But same restrictions as clustering columns. SASI - SSTable Attached Secondary Index • Supports more complex queries more efficiently o create custom index i1 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex'; o create custom index i2 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex’; • select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering; • “allow filtering” may be inefficient (if many rows have to be retrieved prior to filtering) and isn’t suitable for production. • But SASI docs say o even though “allow filtering” must be used with 2 or more column inequalities, there is actually no filtering taking place © 2020 Instaclustr Pty Limited
Results Very slow (< 1%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI © 2020 Instaclustr Pty Limited
Geohashes to the Rescue? • Divide maps into named and hierarchical areas • We’ve been something similar already: “country” partition key E.g. plate tectonics © 2020 Instaclustr Pty Limited
Geohashes Rectangular areas Variable length base-32 string Single char regions 5,000km x 5,000km Each extra letter gives 32 sub-areas 8 chars is 40mx20m En/de-code lat/long to/from geohash But: Edges cases, non-linear near poles © 2020 Instaclustr Pty Limited
Some Geohashes Are Words “ketchup” is in Africa © 2020 Instaclustr Pty Limited
Some Geohashes Are Words 153mx153m153m x 153m © 2020 Instaclustr Pty Limited
“Trump” Is in Kazakhstan! Not to scale 5kmx5km © 2020 Instaclustr Pty Limited
Modifications for Geohashes • Lat/long encoded as geohash • Geohash is new key • Geohash used to query Cassandra © 2020 Instaclustr Pty Limited
Geohashes and Cassandra In theory Geohashes work well for database indexes CREATE TABLE geohash1to8 ( geohash1 text, time timestamp, geohash2 text, geohash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (hash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE INDEX i8 ON geohash1to8 (geohash8); CREATE INDEX i7 ON geohash1to8 (geohash7); CREATE INDEX i6 ON geohash1to8 (geohash6); CREATE INDEX i5 ON geohash1to8 (geohash5); CREATE INDEX i4 ON geohash1to8 (geohash4); CREATE INDEX i3 ON geohash1to8 (geohash3); CREATE INDEX i2 ON geohash1to8 (geohash2); Option 1: Multiple Indexed Geohash Columns © 2020 Instaclustr Pty Limited
• Query from smallest to largest areas select * from geohash1to8 where geohash1=’e’ and geohash7=’everywh’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash6=’everyw’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash5=’every’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash4=’ever’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash3=’eve’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash2=’ev’ limit 50; select * from geohash1to8 where geohash1=’e’ limit 50; Tradeoffs? Multiple secondary columns/indexes, multiple queries, accuracy and number of queries depends on spatial distribution and density • Stop when 50 rows found © 2020 Instaclustr Pty Limited
Results Slightly better Option 1 = 10% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 © 2020 Instaclustr Pty Limited
Geohashes and Cassandra Denormalization is “Normal” in Cassandra Create 8 tables, one for each geohash length CREATE TABLE geohash1 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); … CREATE TABLE geohash8 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); Option 2: Denormalized Multiple Tables © 2020 Instaclustr Pty Limited
• Select from smallest to largest areas using corresponding table select * from geohash8 where geohash=’everywhe’ limit 50; select * from geohash7 where geohash=’everywh’ limit 50; select * from geohash6 where geohash=’everyw’ limit 50; select * from geohash5 where geohash=’every’ limit 50; select * from geohash4 where geohash=’ever’ limit 50; select * from geohash3 where geohash=’eve’ limit 50; select * from geohash2 where geohash=’ev’ limit 50; select * from geohash1 where geohash=’e’ limit 50; Tradeoffs? Multiple tables and writes, multiple queries © 2020 Instaclustr Pty Limited
Results Getting better Option 2 = 20% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 © 2020 Instaclustr Pty Limited
Geohashes and Cassandra Similar to Option 1 but using clustering columns CREATE TABLE geohash1to8_clustering ( geohash1 text, time timestamp, geohash2 text, gephash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (geohash1, geohash2, geohash3, geohash4, geohash5, geohash6, geohash7, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC, geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC, geohash8 DESC, time DESC); Option 3: Clustering Column(s) © 2020 Instaclustr Pty Limited
How Do Clustering Columns Work? • Clustering columns are good for modelling and efficient querying of hierarchical/nested data • Query must include higher level columns with equality operator, ranges are only allowed on last column in query, lower level columns don’t have to be included. For example: o select * from geohash1to8_clustering where geohash1=’e’ and geohash2=’ev’ and geohash3 >= ’ev0’ and geohash3 <= ‘evz’ limit 50; • But why have multiple clustering columns when one is actually enough… Good for Hierarchical Data © 2020 Instaclustr Pty Limited
Better: Single Geohash Clustering Column Geohash8 and time are clustering keys CREATE TABLE geohash_clustering ( geohash1 text, time timestamp, geohash8 text, lat double, long double, PRIMARY KEY (geohash1, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC); © 2020 Instaclustr Pty Limited
Inequality Range Query • With decreasing length geohashes • Stop when result has 50 rows select * from geohash_clustering where geohash1=’e’ and geohash8=’everywhe’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everywh0’ and geohash8 <=’everywhz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everyw0’ and geohash8 <=’everywz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’every0’ and geohash8 <=’everyz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ever0’ and geohash8 <=’everz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’eve0’ and geohash8 <=’evez’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ev0’ and geohash8 <=’evz’ limit 50; select * from geohash_clustering where geohash1=’e’ limit 50; © 2020 Instaclustr Pty Limited
Geohash Results Option 3 is best = 34% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 © 2020 Instaclustr Pty Limited
Issues? ■ Cardinality for partition key ● should be > 100,000 ● >= 4 character geohash ■ Unbounded partitions are bad ● May need composite partition key in production ● e.g. extra time bucket (hour, day, etc) ■ Space vs time ● could have different sized buckets for different sized spaces ● E.g. bigger areas with more frequent events may need shorter time buckets to limit size ● This may depend on the space-time scales of underlying systems/processes ● E.g. Spatial and temporal scales of oceanographic processes (left)
Another Option: Cassandra Apache Lucene Index Plugin A Concordance © 2020 Instaclustr Pty Limited
Cassandra Lucene Index Plugin • The Cassandra Lucene Index is a plugin for Apache Cassandra: o That extends its index functionality to provide near real-time search, including full-text search capabilities and free multivariable, geospatial and bitemporal search o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. • Instaclustr now supports the plugin o Optional add-on to managed Cassandra service o And code support https://github.com/instaclustr/cassandra-lucene-index • How does this help for Geospatial queries? o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search, geo shape search, multiple distance units, geo transformations, and complex geo shapes. © 2020 Instaclustr Pty Limited
Cassandra Table and Lucene Indexes • Geopoint Example • Under the hood indexing is done using a tree structure with geohashes (configurable precision) CREATE TABLE latlong_lucene ( geohash1 text, value double, time timestamp, latitude double, longitude double, Primary key (geohash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE CUSTOM INDEX latlong_index ON latlong_lucene () USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds': '1', 'schema': '{ fields: { geohash1: {type: "string"}, value: {type: "double"}, time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"}, place: {type: "geo_point", latitude: "latitude", longitude: "longitude"} }' };
Search Options • Sort • Sophisticated but complex semantics (see the docs) SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ sort: [ {field: "place", type: "geo_distance", latitude: " + <lat> + ", longitude: " + <long> + "}, {field: "time", reverse: true} ] }') and geohash1=<geohash> limit 50; © 2020 Instaclustr Pty Limited
Search Options • Building Box Filter • Need to compute box corners SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_bbox", field: "place", min_latitude: " + <minLat> + ", max_latitude: " + <maxLat> + ", min_longitude: " + <minLon> + ", max_longitude: " + <maxLon> + " }}’) limit 50; © 2020 Instaclustr Pty Limited
Search Options • Geo Distance Filter SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_distance", field: "place", latitude: " + <lat> + ", longitude: " + <long> + ", max_distance: " <distance> + "km" } }') and geohash1=' + <hash1> + ' limit 50; © 2020 Instaclustr Pty Limited
Search Options: Prefix Filter Prefix search is useful for searching larger areas over a single geohash column as you can search for a substring SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: [ {type: "prefix", field: "geohash1", value: <geohash>} ] }') limit 50 Similar to inequality over clustering column © 2020 Instaclustr Pty Limited
Lucene Results Options = 2-25% Best is prefix filter (25%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash © 2020 Instaclustr Pty Limited
Overall Geohash options are faster (25%, 34%) 34% -> 6.5 Billion Anomaly checks/day 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash G e o h a s h G e o h a s h © 2020 Instaclustr Pty Limited
Overall Geohash options are faster (25%, 34%) Lucene bounded box/geo distance most accurate but only 5% of baseline performance 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash L u c e n e L u c e n e © 2020 Instaclustr Pty Limited
3D (Up and Down) Who needs it? © 2020 Instaclustr Pty Limited
Location, Altitude and Volume • 3D Geohashes represent 2D location, altitude, and volume • A 3D geohash is a cube © 2020 Instaclustr Pty Limited
Application? 3D Drone Proximity Detection
Proximity rules > 50m from people and property >150m from congested areas > 1000m from airports > 5000m from exclusion zones Just happen to correspond to different length 3D geohashes,
3D Geohashes Works with all the geohash index options Therefore reasonably fast to compute 3D proximity More accurate slower options can be improved with bigger Cassandra clusters © 2020 Instaclustr Pty Limited 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h
• Demo 3D Geohash java code o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762 o produces valid 3D geohashes for altitudes from 13km below sea level to geostationary satellite orbit o Can be used for any of the geohash options More Information? © 2020 Instaclustr Pty Limited
More Information? © 2020 Instaclustr Pty Limited • Blogs: https://www.instaclustr.com/paul-brebner/ • Try us out! Free trial at Instaclustr.com
©Instaclustr Pty Limited, 2020
www.instaclustr.com info@instaclustr.com @instaclustr THANK YOU!

Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)

  • 1.
    Massively Scalable Real-time GeospatialData Processing with Apache Kafka and Cassandra Paul Brebner instaclustr.com Technology Evangelist ApacheCon Geospatial Track 2020 © Instaclustr Pty Limited, 2020
  • 2.
  • 3.
    Overview • “Location” inthe news in 2020 • Anomaly Detection: baseline throughput • Spatial Anomaly Detection problem • Solutions: location representation and querying/indexing o Bounding boxes and secondary indexes o Geohashes o Lucene index o Results o 3D © 2020 Instaclustr Pty Limited
  • 4.
    In the News John Conway LegendaryPolymath Passed away this year © 2020 Instaclustr Pty Limited
  • 5.
    Game of Life Next stateof each cell depends on state of immediate neighbors © 2020 Instaclustr Pty Limited
  • 6.
    Simple rules butcomplex patterns Game of Life © 2020 Instaclustr Pty Limited
  • 7.
    Also In the News Socialdistancing and COVID-19 tracing Uncle Ron’s social distancing 3000 invention Or COVIDSAFE App? © 2020 Instaclustr Pty Limited
  • 8.
    Uncle Ron’s socialdistancing 3000 invention Also In the News And “planet-killer” asteroid missed the Earth in April (16x moon orbit) © 2020 Instaclustr Pty Limited
  • 9.
    Previously… KAFKA, CASSANDRA AND KUBERNETES AT SCALE:REAL- TIME ANOMALY DETECTION ON 19 BILLION EVENTS A DAY
  • 10.
    Anomaly Detection Spot the difference Atspeed (< 1s RT) and scale (high throughput, lots of data) © 2020 Instaclustr Pty Limited
  • 11.
    How Does It Work? • CUSUM(Cumulative Sum Control Chart) • Statistical analysis of historical data • Data for a single variable/key at a time • Potentially billions of keys © 2020 Instaclustr Pty Limited
  • 12.
    Pipeline Design • Kafka, Cassandraand Kubernetes Clusters – Kafka handles streaming, Cassandra for data storage, Kubernetes for application scaling • Efficient Cassandra Data writes and reads with key, a unique “account ID” or similar © 2020 Instaclustr Pty Limited Kubernetes
  • 13.
    Cassandra Data Model • Events aretimeseries • Id is Partition Key - Time is clustering key (order) • Read gets most recent 50 values for id, very fast create table event_stream ( id bigint, time timestamp, value double, primary key (id, time) ) with clustering order by (time desc); select value from event_stream where id=314159265 limit 50; © 2020 Instaclustr Pty Limited
  • 14.
    Cluster details – CPU cores Cassandra384 + Application pipeline (workers) 118 + Kafka 72 = 574 Cores Total 384 118 72 0 100 200 300 400 500 600 700 Cores Cores Used 574 Total Cassandra Workers Kafka
  • 15.
    Baseline Throughput 19 Billion AnomalyChecks/Day = 100% 0 20 40 60 80 100 120 Baseline (single transaction ID) Normalised (%) © 2020 Instaclustr Pty Limited
  • 16.
    Harder Problem: Spot the Differences in Space Spaceis big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Douglas Adams, The Hitchhiker’s Guide to the Galaxy © 2020 Instaclustr Pty Limited
  • 17.
    Spatial Anomalies Many and varied ©2020 Instaclustr Pty Limited
  • 18.
    Real Example: John Snow No, notthis one! © 2020 Instaclustr Pty Limited
  • 19.
    John Snow’s 1854 CholeraMap • Death’s per household + location • Used to identify a polluted pump (X) • Some outliers—brewers drank beer not water! X © 2020 Instaclustr Pty Limited
  • 20.
    But… First you haveto know where you are: Location To usefully represent location need: • Coordinate system • Map • Scale © 2020 Instaclustr Pty Limited
  • 21.
    Better • <lat, long>coordinates • Scale • Interesting locations, “bulk of treasure here” © 2020 Instaclustr Pty Limited
  • 22.
    Geospatial Anomaly Detection SouthAtlantic Geomagnetic Anomaly New problem… • Rather than a single ID, events now have a location (and a value) • The problem now is to o find the nearest 50 events to each new event o Quickly (< 1s RT) • Can’t make any assumptions about geospatial properties of events o including location, density or distribution – i.e. where, or how many o Need to search from smallest to increasingly larger areas o E.g. South Atlantic Geomagnetic Anomaly is BIG • Uber (ApacheCon 2019 talk) uses similar • technologies to o forecast demand o Increase area until they have sufficient data for predictions • Can we use <lat, long> as Cassandra partition key? o Yes, compound partition keys are allowed. o But can only select the exact locations. © 2020 Instaclustr Pty Limited
  • 23.
    How to ComputeNearness To compute distance between locations you need coordinate system e.g. Mercator map Flat earth, distortion nearer poles © 2020 Instaclustr Pty Limited
  • 24.
    World is (approx.)Spherical Calculation of distance between two latitudinal/longitudinal points is non-trivial © 2020 Instaclustr Pty Limited
  • 25.
    Bounding Box Approximation ofdistance using inequalities © 2020 Instaclustr Pty Limited
  • 26.
    Bounding Boxes andCassandra? • Use ”country” partition key, Lat/long/time clustering keys • But can’t run the query with multiple inequalities CREATE TABLE latlong ( country text, lat double, long double, time timestamp, PRIMARY KEY (country, lat, long, time) ) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC); select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50; InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)" © 2020 Instaclustr Pty Limited
  • 27.
    Secondary Indexes to the Rescue? Secondary Indexes ocreate index i1 on latlong (lat); o create index i2 on latlong (long); • But same restrictions as clustering columns. SASI - SSTable Attached Secondary Index • Supports more complex queries more efficiently o create custom index i1 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex'; o create custom index i2 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex’; • select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering; • “allow filtering” may be inefficient (if many rows have to be retrieved prior to filtering) and isn’t suitable for production. • But SASI docs say o even though “allow filtering” must be used with 2 or more column inequalities, there is actually no filtering taking place © 2020 Instaclustr Pty Limited
  • 28.
    Results Very slow (<1%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI © 2020 Instaclustr Pty Limited
  • 29.
    Geohashes to the Rescue? • Dividemaps into named and hierarchical areas • We’ve been something similar already: “country” partition key E.g. plate tectonics © 2020 Instaclustr Pty Limited
  • 30.
    Geohashes Rectangular areas Variable lengthbase-32 string Single char regions 5,000km x 5,000km Each extra letter gives 32 sub-areas 8 chars is 40mx20m En/de-code lat/long to/from geohash But: Edges cases, non-linear near poles © 2020 Instaclustr Pty Limited
  • 31.
    Some Geohashes Are Words “ketchup” isin Africa © 2020 Instaclustr Pty Limited
  • 32.
    Some Geohashes Are Words 153mx153m153m x153m © 2020 Instaclustr Pty Limited
  • 33.
    “Trump” Is in Kazakhstan! Not toscale 5kmx5km © 2020 Instaclustr Pty Limited
  • 34.
    Modifications for Geohashes •Lat/long encoded as geohash • Geohash is new key • Geohash used to query Cassandra © 2020 Instaclustr Pty Limited
  • 35.
    Geohashes and Cassandra Intheory Geohashes work well for database indexes CREATE TABLE geohash1to8 ( geohash1 text, time timestamp, geohash2 text, geohash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (hash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE INDEX i8 ON geohash1to8 (geohash8); CREATE INDEX i7 ON geohash1to8 (geohash7); CREATE INDEX i6 ON geohash1to8 (geohash6); CREATE INDEX i5 ON geohash1to8 (geohash5); CREATE INDEX i4 ON geohash1to8 (geohash4); CREATE INDEX i3 ON geohash1to8 (geohash3); CREATE INDEX i2 ON geohash1to8 (geohash2); Option 1: Multiple Indexed Geohash Columns © 2020 Instaclustr Pty Limited
  • 36.
    • Query fromsmallest to largest areas select * from geohash1to8 where geohash1=’e’ and geohash7=’everywh’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash6=’everyw’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash5=’every’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash4=’ever’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash3=’eve’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash2=’ev’ limit 50; select * from geohash1to8 where geohash1=’e’ limit 50; Tradeoffs? Multiple secondary columns/indexes, multiple queries, accuracy and number of queries depends on spatial distribution and density • Stop when 50 rows found © 2020 Instaclustr Pty Limited
  • 37.
    Results Slightly better Option 1= 10% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 © 2020 Instaclustr Pty Limited
  • 38.
    Geohashes and Cassandra Denormalizationis “Normal” in Cassandra Create 8 tables, one for each geohash length CREATE TABLE geohash1 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); … CREATE TABLE geohash8 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); Option 2: Denormalized Multiple Tables © 2020 Instaclustr Pty Limited
  • 39.
    • Select fromsmallest to largest areas using corresponding table select * from geohash8 where geohash=’everywhe’ limit 50; select * from geohash7 where geohash=’everywh’ limit 50; select * from geohash6 where geohash=’everyw’ limit 50; select * from geohash5 where geohash=’every’ limit 50; select * from geohash4 where geohash=’ever’ limit 50; select * from geohash3 where geohash=’eve’ limit 50; select * from geohash2 where geohash=’ev’ limit 50; select * from geohash1 where geohash=’e’ limit 50; Tradeoffs? Multiple tables and writes, multiple queries © 2020 Instaclustr Pty Limited
  • 40.
    Results Getting better Option 2= 20% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 © 2020 Instaclustr Pty Limited
  • 41.
    Geohashes and Cassandra Similarto Option 1 but using clustering columns CREATE TABLE geohash1to8_clustering ( geohash1 text, time timestamp, geohash2 text, gephash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (geohash1, geohash2, geohash3, geohash4, geohash5, geohash6, geohash7, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC, geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC, geohash8 DESC, time DESC); Option 3: Clustering Column(s) © 2020 Instaclustr Pty Limited
  • 42.
    How Do ClusteringColumns Work? • Clustering columns are good for modelling and efficient querying of hierarchical/nested data • Query must include higher level columns with equality operator, ranges are only allowed on last column in query, lower level columns don’t have to be included. For example: o select * from geohash1to8_clustering where geohash1=’e’ and geohash2=’ev’ and geohash3 >= ’ev0’ and geohash3 <= ‘evz’ limit 50; • But why have multiple clustering columns when one is actually enough… Good for Hierarchical Data © 2020 Instaclustr Pty Limited
  • 43.
    Better: Single Geohash ClusteringColumn Geohash8 and time are clustering keys CREATE TABLE geohash_clustering ( geohash1 text, time timestamp, geohash8 text, lat double, long double, PRIMARY KEY (geohash1, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC); © 2020 Instaclustr Pty Limited
  • 44.
    Inequality Range Query •With decreasing length geohashes • Stop when result has 50 rows select * from geohash_clustering where geohash1=’e’ and geohash8=’everywhe’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everywh0’ and geohash8 <=’everywhz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everyw0’ and geohash8 <=’everywz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’every0’ and geohash8 <=’everyz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ever0’ and geohash8 <=’everz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’eve0’ and geohash8 <=’evez’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ev0’ and geohash8 <=’evz’ limit 50; select * from geohash_clustering where geohash1=’e’ limit 50; © 2020 Instaclustr Pty Limited
  • 45.
    Geohash Results Option 3 isbest = 34% 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 © 2020 Instaclustr Pty Limited
  • 46.
    Issues? ■ Cardinality forpartition key ● should be > 100,000 ● >= 4 character geohash ■ Unbounded partitions are bad ● May need composite partition key in production ● e.g. extra time bucket (hour, day, etc) ■ Space vs time ● could have different sized buckets for different sized spaces ● E.g. bigger areas with more frequent events may need shorter time buckets to limit size ● This may depend on the space-time scales of underlying systems/processes ● E.g. Spatial and temporal scales of oceanographic processes (left)
  • 47.
  • 48.
    Cassandra Lucene IndexPlugin • The Cassandra Lucene Index is a plugin for Apache Cassandra: o That extends its index functionality to provide near real-time search, including full-text search capabilities and free multivariable, geospatial and bitemporal search o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. • Instaclustr now supports the plugin o Optional add-on to managed Cassandra service o And code support https://github.com/instaclustr/cassandra-lucene-index • How does this help for Geospatial queries? o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search, geo shape search, multiple distance units, geo transformations, and complex geo shapes. © 2020 Instaclustr Pty Limited
  • 49.
    Cassandra Table and LuceneIndexes • Geopoint Example • Under the hood indexing is done using a tree structure with geohashes (configurable precision) CREATE TABLE latlong_lucene ( geohash1 text, value double, time timestamp, latitude double, longitude double, Primary key (geohash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE CUSTOM INDEX latlong_index ON latlong_lucene () USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds': '1', 'schema': '{ fields: { geohash1: {type: "string"}, value: {type: "double"}, time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"}, place: {type: "geo_point", latitude: "latitude", longitude: "longitude"} }' };
  • 50.
    Search Options • Sort •Sophisticated but complex semantics (see the docs) SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ sort: [ {field: "place", type: "geo_distance", latitude: " + <lat> + ", longitude: " + <long> + "}, {field: "time", reverse: true} ] }') and geohash1=<geohash> limit 50; © 2020 Instaclustr Pty Limited
  • 51.
    Search Options • BuildingBox Filter • Need to compute box corners SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_bbox", field: "place", min_latitude: " + <minLat> + ", max_latitude: " + <maxLat> + ", min_longitude: " + <minLon> + ", max_longitude: " + <maxLon> + " }}’) limit 50; © 2020 Instaclustr Pty Limited
  • 52.
    Search Options • GeoDistance Filter SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_distance", field: "place", latitude: " + <lat> + ", longitude: " + <long> + ", max_distance: " <distance> + "km" } }') and geohash1=' + <hash1> + ' limit 50; © 2020 Instaclustr Pty Limited
  • 53.
    Search Options: PrefixFilter Prefix search is useful for searching larger areas over a single geohash column as you can search for a substring SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: [ {type: "prefix", field: "geohash1", value: <geohash>} ] }') limit 50 Similar to inequality over clustering column © 2020 Instaclustr Pty Limited
  • 54.
    Lucene Results Options = 2-25% Bestis prefix filter (25%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash © 2020 Instaclustr Pty Limited
  • 55.
    Overall Geohash options are faster(25%, 34%) 34% -> 6.5 Billion Anomaly checks/day 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash G e o h a s h G e o h a s h © 2020 Instaclustr Pty Limited
  • 56.
    Overall Geohash options are faster(25%, 34%) Lucene bounded box/geo distance most accurate but only 5% of baseline performance 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash L u c e n e L u c e n e © 2020 Instaclustr Pty Limited
  • 57.
    3D (Up and Down) Who needsit? © 2020 Instaclustr Pty Limited
  • 58.
    Location, Altitude andVolume • 3D Geohashes represent 2D location, altitude, and volume • A 3D geohash is a cube © 2020 Instaclustr Pty Limited
  • 59.
  • 60.
    Proximity rules > 50m frompeople and property >150m from congested areas > 1000m from airports > 5000m from exclusion zones Just happen to correspond to different length 3D geohashes,
  • 61.
    3D Geohashes Works withall the geohash index options Therefore reasonably fast to compute 3D proximity More accurate slower options can be improved with bigger Cassandra clusters © 2020 Instaclustr Pty Limited 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h
  • 62.
    • Demo 3DGeohash java code o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762 o produces valid 3D geohashes for altitudes from 13km below sea level to geostationary satellite orbit o Can be used for any of the geohash options More Information? © 2020 Instaclustr Pty Limited
  • 63.
    More Information? © 2020Instaclustr Pty Limited • Blogs: https://www.instaclustr.com/paul-brebner/ • Try us out! Free trial at Instaclustr.com
  • 64.
  • 65.