Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)

Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and Cassandra Paul Brebner instaclustr.com Technology Evangelist ApacheCon Geospatial Track 2020 © Instaclustr Pty Limited, 2020

Instaclustr Overview ©Instaclustr Pty Limited, 2020

Overview • “Location” in the news in 2020 • Anomaly Detection: baseline throughput • Spatial Anomaly Detection problem • Solutions: location representation and querying/indexing o Bounding boxes and secondary indexes o Geohashes o Lucene index o Results o 3D © 2020 Instaclustr Pty Limited

Previously… KAFKA, CASSANDRA AND KUBERNETES AT SCALE: REAL- TIME ANOMALY DETECTION ON 19 BILLION EVENTS A DAY

How Does It Work? • CUSUM (Cumulative Sum Control Chart) • Statistical analysis of historical data • Data for a single variable/key at a time • Potentially billions of keys © 2020 Instaclustr Pty Limited

Pipeline Design • Kafka, Cassandra and Kubernetes Clusters – Kafka handles streaming, Cassandra for data storage, Kubernetes for application scaling • Efficient Cassandra Data writes and reads with key, a unique “account ID” or similar © 2020 Instaclustr Pty Limited Kubernetes

Cassandra Data Model • Events are timeseries • Id is Partition Key - Time is clustering key (order) • Read gets most recent 50 values for id, very fast create table event_stream ( id bigint, time timestamp, value double, primary key (id, time) ) with clustering order by (time desc); select value from event_stream where id=314159265 limit 50; © 2020 Instaclustr Pty Limited

Cluster details – CPU cores Cassandra 384 + Application pipeline (workers) 118 + Kafka 72 = 574 Cores Total 384 118 72 0 100 200 300 400 500 600 700 Cores Cores Used 574 Total Cassandra Workers Kafka

Harder Problem: Spot the Differences in Space Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Douglas Adams, The Hitchhiker’s Guide to the Galaxy © 2020 Instaclustr Pty Limited

Geospatial Anomaly Detection South Atlantic Geomagnetic Anomaly New problem… • Rather than a single ID, events now have a location (and a value) • The problem now is to o find the nearest 50 events to each new event o Quickly (< 1s RT) • Can’t make any assumptions about geospatial properties of events o including location, density or distribution – i.e. where, or how many o Need to search from smallest to increasingly larger areas o E.g. South Atlantic Geomagnetic Anomaly is BIG • Uber (ApacheCon 2019 talk) uses similar • technologies to o forecast demand o Increase area until they have sufficient data for predictions • Can we use <lat, long> as Cassandra partition key? o Yes, compound partition keys are allowed. o But can only select the exact locations. © 2020 Instaclustr Pty Limited

Bounding Boxes and Cassandra? • Use ”country” partition key, Lat/long/time clustering keys • But can’t run the query with multiple inequalities CREATE TABLE latlong ( country text, lat double, long double, time timestamp, PRIMARY KEY (country, lat, long, time) ) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC); select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50; InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)" © 2020 Instaclustr Pty Limited

Secondary Indexes to the Rescue? Secondary Indexes o create index i1 on latlong (lat); o create index i2 on latlong (long); • But same restrictions as clustering columns. SASI - SSTable Attached Secondary Index • Supports more complex queries more efficiently o create custom index i1 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex'; o create custom index i2 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex’; • select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering; • “allow filtering” may be inefficient (if many rows have to be retrieved prior to filtering) and isn’t suitable for production. • But SASI docs say o even though “allow filtering” must be used with 2 or more column inequalities, there is actually no filtering taking place © 2020 Instaclustr Pty Limited

Geohashes Rectangular areas Variable length base-32 string Single char regions 5,000km x 5,000km Each extra letter gives 32 sub-areas 8 chars is 40mx20m En/de-code lat/long to/from geohash But: Edges cases, non-linear near poles © 2020 Instaclustr Pty Limited

Geohashes and Cassandra In theory Geohashes work well for database indexes CREATE TABLE geohash1to8 ( geohash1 text, time timestamp, geohash2 text, geohash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (hash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE INDEX i8 ON geohash1to8 (geohash8); CREATE INDEX i7 ON geohash1to8 (geohash7); CREATE INDEX i6 ON geohash1to8 (geohash6); CREATE INDEX i5 ON geohash1to8 (geohash5); CREATE INDEX i4 ON geohash1to8 (geohash4); CREATE INDEX i3 ON geohash1to8 (geohash3); CREATE INDEX i2 ON geohash1to8 (geohash2); Option 1: Multiple Indexed Geohash Columns © 2020 Instaclustr Pty Limited

• Query from smallest to largest areas select * from geohash1to8 where geohash1=’e’ and geohash7=’everywh’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash6=’everyw’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash5=’every’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash4=’ever’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash3=’eve’ limit 50; select * from geohash1to8 where geohash1=’e’ and geohash2=’ev’ limit 50; select * from geohash1to8 where geohash1=’e’ limit 50; Tradeoffs? Multiple secondary columns/indexes, multiple queries, accuracy and number of queries depends on spatial distribution and density • Stop when 50 rows found © 2020 Instaclustr Pty Limited

Geohashes and Cassandra Denormalization is “Normal” in Cassandra Create 8 tables, one for each geohash length CREATE TABLE geohash1 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); … CREATE TABLE geohash8 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); Option 2: Denormalized Multiple Tables © 2020 Instaclustr Pty Limited

• Select from smallest to largest areas using corresponding table select * from geohash8 where geohash=’everywhe’ limit 50; select * from geohash7 where geohash=’everywh’ limit 50; select * from geohash6 where geohash=’everyw’ limit 50; select * from geohash5 where geohash=’every’ limit 50; select * from geohash4 where geohash=’ever’ limit 50; select * from geohash3 where geohash=’eve’ limit 50; select * from geohash2 where geohash=’ev’ limit 50; select * from geohash1 where geohash=’e’ limit 50; Tradeoffs? Multiple tables and writes, multiple queries © 2020 Instaclustr Pty Limited

Geohashes and Cassandra Similar to Option 1 but using clustering columns CREATE TABLE geohash1to8_clustering ( geohash1 text, time timestamp, geohash2 text, gephash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (geohash1, geohash2, geohash3, geohash4, geohash5, geohash6, geohash7, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC, geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC, geohash8 DESC, time DESC); Option 3: Clustering Column(s) © 2020 Instaclustr Pty Limited

How Do Clustering Columns Work? • Clustering columns are good for modelling and efficient querying of hierarchical/nested data • Query must include higher level columns with equality operator, ranges are only allowed on last column in query, lower level columns don’t have to be included. For example: o select * from geohash1to8_clustering where geohash1=’e’ and geohash2=’ev’ and geohash3 >= ’ev0’ and geohash3 <= ‘evz’ limit 50; • But why have multiple clustering columns when one is actually enough… Good for Hierarchical Data © 2020 Instaclustr Pty Limited

Better: Single Geohash Clustering Column Geohash8 and time are clustering keys CREATE TABLE geohash_clustering ( geohash1 text, time timestamp, geohash8 text, lat double, long double, PRIMARY KEY (geohash1, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC); © 2020 Instaclustr Pty Limited

Inequality Range Query • With decreasing length geohashes • Stop when result has 50 rows select * from geohash_clustering where geohash1=’e’ and geohash8=’everywhe’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everywh0’ and geohash8 <=’everywhz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’everyw0’ and geohash8 <=’everywz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’every0’ and geohash8 <=’everyz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ever0’ and geohash8 <=’everz’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’eve0’ and geohash8 <=’evez’ limit 50; select * from geohash_clustering where geohash1=’e’ and geohash8>=’ev0’ and geohash8 <=’evz’ limit 50; select * from geohash_clustering where geohash1=’e’ limit 50; © 2020 Instaclustr Pty Limited

Issues? ■ Cardinality for partition key ● should be > 100,000 ● >= 4 character geohash ■ Unbounded partitions are bad ● May need composite partition key in production ● e.g. extra time bucket (hour, day, etc) ■ Space vs time ● could have different sized buckets for different sized spaces ● E.g. bigger areas with more frequent events may need shorter time buckets to limit size ● This may depend on the space-time scales of underlying systems/processes ● E.g. Spatial and temporal scales of oceanographic processes (left)

Cassandra Lucene Index Plugin • The Cassandra Lucene Index is a plugin for Apache Cassandra: o That extends its index functionality to provide near real-time search, including full-text search capabilities and free multivariable, geospatial and bitemporal search o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. • Instaclustr now supports the plugin o Optional add-on to managed Cassandra service o And code support https://github.com/instaclustr/cassandra-lucene-index • How does this help for Geospatial queries? o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search, geo shape search, multiple distance units, geo transformations, and complex geo shapes. © 2020 Instaclustr Pty Limited

Cassandra Table and Lucene Indexes • Geopoint Example • Under the hood indexing is done using a tree structure with geohashes (configurable precision) CREATE TABLE latlong_lucene ( geohash1 text, value double, time timestamp, latitude double, longitude double, Primary key (geohash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE CUSTOM INDEX latlong_index ON latlong_lucene () USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds': '1', 'schema': '{ fields: { geohash1: {type: "string"}, value: {type: "double"}, time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"}, place: {type: "geo_point", latitude: "latitude", longitude: "longitude"} }' };

Search Options • Sort • Sophisticated but complex semantics (see the docs) SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ sort: [ {field: "place", type: "geo_distance", latitude: " + <lat> + ", longitude: " + <long> + "}, {field: "time", reverse: true} ] }') and geohash1=<geohash> limit 50; © 2020 Instaclustr Pty Limited

Search Options • Building Box Filter • Need to compute box corners SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_bbox", field: "place", min_latitude: " + <minLat> + ", max_latitude: " + <maxLat> + ", min_longitude: " + <minLon> + ", max_longitude: " + <maxLon> + " }}’) limit 50; © 2020 Instaclustr Pty Limited

Search Options • Geo Distance Filter SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_distance", field: "place", latitude: " + <lat> + ", longitude: " + <long> + ", max_distance: " <distance> + "km" } }') and geohash1=' + <hash1> + ' limit 50; © 2020 Instaclustr Pty Limited

Search Options: Prefix Filter Prefix search is useful for searching larger areas over a single geohash column as you can search for a substring SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: [ {type: "prefix", field: "geohash1", value: <geohash>} ] }') limit 50 Similar to inequality over clustering column © 2020 Instaclustr Pty Limited

Lucene Results Options = 2-25% Best is prefix filter (25%) 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash © 2020 Instaclustr Pty Limited

Overall Geohash options are faster (25%, 34%) 34% -> 6.5 Billion Anomaly checks/day 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash G e o h a s h G e o h a s h © 2020 Instaclustr Pty Limited

Overall Geohash options are faster (25%, 34%) Lucene bounded box/geo distance most accurate but only 5% of baseline performance 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash L u c e n e L u c e n e © 2020 Instaclustr Pty Limited

Application? 3D Drone Proximity Detection

Proximity rules > 50m from people and property >150m from congested areas > 1000m from airports > 5000m from exclusion zones Just happen to correspond to different length 3D geohashes,

3D Geohashes Works with all the geohash index options Therefore reasonably fast to compute 3D proximity More accurate slower options can be improved with bigger Cassandra clusters © 2020 Instaclustr Pty Limited 0 20 40 60 80 100 120 Normalised (%) Baseline (single transaction ID) SASI Geohash Option 1 Geohash Option 2 Geohash Option 3 Lucene sort Lucene filter bounded box Lucene filter geo distance Lucene filter prefix over geohash 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h 3 D G e o h a s h

• Demo 3D Geohash java code o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762 o produces valid 3D geohashes for altitudes from 13km below sea level to geostationary satellite orbit o Can be used for any of the geohash options More Information? © 2020 Instaclustr Pty Limited

www.instaclustr.com info@instaclustr.com @instaclustr THANK YOU!

Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)

More Related Content

What's hot

Similar to Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)

More from Paul Brebner

Recently uploaded

Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)