Analyzing log data with Apache Spark William Benton Red Hat Emerging Technology
BACKGROUND
Challenges of log data
Challenges of log data

SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS
WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour

(charted as hourly CRIT counts per host, 11:00 through 18:00)
Challenges of log data (ca. 2000): a handful of services — postgres, httpd, syslog — each emitting a modest stream of INFO, WARN, DEBUG, and CRIT messages, plus httpd GETs and POSTs (and the occasional 404).
Challenges of log data (ca. 2016): dozens of interleaved services — postgres, httpd, syslog, CouchDB, Django, haproxy, k8s, Cassandra, nginx, Rails, redis, and more — each emitting its own stream of INFO, WARN, DEBUG, and CRIT messages and HTTP requests (404s, 500s, …).
Challenges of log data: How many services are generating logs in your datacenter today?
DATA INGEST
Collecting log data

collecting: ingesting live log data via rsyslog, logstash, or fluentd
normalizing: reconciling log record metadata across sources
warehousing: storing normalized records in ES indices
analysis: caching warehoused data as Parquet files on a Gluster volume local to the Spark cluster
Schema mediation: a common core — timestamp, level, host, IP addresses, message, &c. — plus rsyslog-style metadata, like app name, facility, &c.
Exploring structured data

logs.select("level").distinct
    .as[String].collect
// debug, notice, emerg, err, warning, crit, info, severe, alert

logs.groupBy($"level", $"rsyslog.app_name")
    .agg(count("level").as("total"))
    .orderBy($"total".desc)
    .show
// info  kubelet      17933574
// info  kube-proxy   10961117
// err   journal       6867921
// info  systemd       5184475
// …

(The equivalent logs.select("level").distinct.map { case Row(s: String) => s }.collect also works — but any class used with Dataset encoders must be declared outside the REPL!)
FEATURE ENGINEERING
From log records to vectors

What does it mean for two sets of categorical features to be similar?

red    -> 000        pancakes    -> 10000
green  -> 010        waffles     -> 01000
blue   -> 100        aebliskiver -> 00100
orange -> 001        omelets     -> 00001
                     bacon       -> 00000
                     hash browns -> 00010

red pancakes   -> 00010000
orange waffles -> 00101000
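The concatenated one-hot encodings above can be sketched in a few lines of Python (a minimal illustration, not the talk's actual pipeline; the category orderings are chosen to reproduce the bit patterns on the slide, with "red" and "bacon" as all-zeros reference levels):

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector.
    Values absent from `categories` encode as all zeros."""
    return [1 if c == value else 0 for c in categories]

colors = ["blue", "green", "orange"]  # "red" is the all-zeros reference level
foods = ["pancakes", "waffles", "aebliskiver", "hash browns", "omelets"]  # "bacon" -> 00000

# Concatenating the per-feature encodings yields one vector per record
red_pancakes = one_hot("red", colors) + one_hot("pancakes", foods)      # 00010000
orange_waffles = one_hot("orange", colors) + one_hot("waffles", foods)  # 00101000
```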
Similarity and distance

squared Euclidean distance:  (q − p) • (q − p)
Manhattan distance:          Σᵢ₌₁..ₙ |pᵢ − qᵢ|
cosine similarity:           (p • q) / (‖p‖ ‖q‖)
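These three measures can be written as runnable functions (a minimal sketch using only the standard library):

```python
import math

def euclidean_sq(p, q):
    """Squared Euclidean distance: (q - p) . (q - p)."""
    return sum((qi - pi) ** 2 for pi, qi in zip(p, q))

def manhattan(p, q):
    """Manhattan distance: sum of |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def cosine_similarity(p, q):
    """Cosine similarity: (p . q) / (|p| |q|)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norms = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norms
```

Note that cosine similarity ignores magnitude: parallel vectors score 1.0 no matter their lengths, which matters when comparing hosts with very different log volumes.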
Other interesting features: the mix, rate, and timing of log messages (INFO, WARN, DEBUG, …) arriving at each host (host01, host02, host03) over time are features too.
Other interesting features

“Great food, great service, a must-visit!”
“Our whole table got gastroenteritis.”
“This place is so wonderful that it has ruined all other tacos for me and my family.”
Other interesting features

INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
INFO: Phoenix datacenter is on fire; may not rise from ashes.

See https://links.freevariable.com/nlp-logs/ for more!
VISUALIZING STRUCTURE and FINDING OUTLIERS
Multidimensional data

[4,7]                              (2 dimensions)
[2,3,5]                            (3 dimensions)
[7,1,6,5,12,8,9,2,2,4,7,11,6,1,5]  (15 dimensions)
A linear approach: PCA

0 0 0 1 1 0 1 0 1 0
0 0 1 0 0 0 1 1 0 0
1 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 1
0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 1 0 1 1 0
0 0 1 0 1 0 1 0 0 0
0 1 0 0 0 1 0 0 1 1
0 0 0 0 1 0 0 1 0 1
1 1 0 0 0 0 0 0 0 1
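Projecting a binary feature matrix like this onto its principal components can be sketched with NumPy's SVD (a minimal illustration under the usual center-then-project recipe; the random matrix stands in for the one on the slide):

```python
import numpy as np

def pca_project(X, k=2):
    """Project rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # coordinates in the top-k subspace

rng = np.random.default_rng(0)
X = (rng.random((10, 10)) < 0.3).astype(float)    # a random 10x10 binary matrix
Y = pca_project(X, k=2)                           # 10 points in 2-d, ready to plot
```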
Tree-based approaches
Tree-based approaches: a small decision tree splitting on feature predicates (if orange / if !orange, then if red / if !red, then if !gray), with yes/no leaves.
Self-organizing maps
Finding outliers with SOMs
Outliers in log data

best-match similarities: 0.95  0.97  0.92  0.94  0.89  0.91  0.93  0.96  …and one record at 0.37

An outlier is any record whose best match was at least 4σ below the mean.
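The 4σ rule is easy to apply once you have best-match scores (a toy sketch; the scores here are synthetic, and the pool of typical scores is made large enough that a single bad record doesn't inflate σ):

```python
import statistics

def outlier_flags(scores, k=4.0):
    """Flag records whose best-match similarity is at least k standard
    deviations below the mean."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [s < mu - k * sigma for s in scores]

# 50 typical best-match scores and one much worse one
scores = [0.93] * 50 + [0.37]
flags = outlier_flags(scores)   # only the 0.37 record is flagged
```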
Out of 310 million log records, we identified 0.0012% as outliers.
Thirty most extreme outliers

10  Can not communicate with power supply 2.
 9  Power supply 2 failed.
 8  Power supply redundancy is lost.
 1  Drive A is removed.
 1  Can not communicate with power supply 1.
 1  Power supply 1 failed.
SOM TRAINING in SPARK
On-line SOM training

while t < iterations:
  for ex in examples:
    t = t + 1
    if t == iterations: break
    bestMatch = closest(som[t], ex)
    for (unit, wt) in neighborhood(bestMatch, sigma(t)):
      # at each step, we update each unit by adding its value from the
      # previous step to the example that we considered, scaled by a
      # learning factor and the distance from this unit to its best match
      som[t+1][unit] = som[t][unit] + ex * alpha(t) * wt
On-line SOM training: sensitive to learning rate, sensitive to example order, and not parallel.
Batch SOM training

for t in (1 to iterations):
  state = newState()
  for ex in examples:
    bestMatch = closest(som[t-1], ex)
    hood = neighborhood(bestMatch, sigma(t))
    # update the state of every cell in the neighborhood of the best
    # matching unit, weighting by distance
    state.matches += ex * hood
    # keep track of the distance weights we've seen for a weighted average
    state.hoods += hood
  # since we can easily merge multiple states, we can train in parallel
  # across many examples
  som[t] = newSOM(state.matches / state.hoods)
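The batch pseudocode above can be made runnable with NumPy (a minimal sketch, assuming a Gaussian neighborhood function and a linearly shrinking sigma — concrete choices the slides leave open):

```python
import numpy as np

def train_batch_som(examples, grid=(5, 5), iterations=10, seed=0):
    """Minimal batch SOM: each iteration recomputes every unit as the
    neighborhood-weighted average of the examples that best match it."""
    rng = np.random.default_rng(seed)
    units, dim = grid[0] * grid[1], examples.shape[1]
    som = rng.random((units, dim))
    # grid coordinates of each unit, for neighborhood distances
    coords = np.array([(i, j) for i in range(grid[0])
                              for j in range(grid[1])], dtype=float)
    for t in range(1, iterations + 1):
        sigma = max(grid) * (1.0 - t / (iterations + 1))  # shrinking neighborhood
        matches = np.zeros_like(som)
        hoods = np.zeros(units)
        for ex in examples:
            best = np.argmin(((som - ex) ** 2).sum(axis=1))
            d2 = ((coords - coords[best]) ** 2).sum(axis=1)
            hood = np.exp(-d2 / (2 * sigma ** 2))   # Gaussian distance weights
            matches += np.outer(hood, ex)
            hoods += hood
        som = matches / hoods[:, None]              # weighted average per unit
    return som

examples = np.vstack([np.zeros((20, 3)), np.ones((20, 3))])
som = train_batch_som(examples)   # 25 units, each a 3-d prototype
```

Because each unit ends up as a weighted average of examples, the trained prototypes always stay inside the convex hull of the data.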
Batch SOM training: partial states computed on each partition are merged over all partitions.
driver (using aggregate) ← workers: with aggregate, every partition's partial state is sent to the driver to be merged there. What if you have a 3 MB model and 2,048 partitions?
driver (using treeAggregate) ← workers: with treeAggregate, partial states are first combined in a tree of intermediate aggregations on the workers, so the driver receives only a few already-merged states.
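The difference between the two communication patterns can be simulated in plain Python (a toy model of flat vs. tree aggregation, not Spark's API; `fanout` plays the role of treeAggregate's branching factor):

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Flat aggregate: every partition's partial state goes straight
    to the driver, which merges all of them itself."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def tree_aggregate(partitions, zero, seq_op, comb_op, fanout=2):
    """Tree aggregate: partial states are merged in rounds of `fanout`,
    so the driver sees only the final few states."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    while len(partials) > 1:
        partials = [reduce(comb_op, partials[i:i + fanout])
                    for i in range(0, len(partials), fanout)]
    return partials[0]

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
add = lambda a, b: a + b
assert aggregate(parts, 0, add, add) == tree_aggregate(parts, 0, add, add) == 36
```

Both return the same answer; what changes is who does the merging — with 2,048 partitions and a 3 MB state, flat aggregation makes the driver receive and merge ~6 GB, while the tree spreads that work across the workers.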
SHARING MODELS BEYOND SPARK
Sharing models

case class FrozenModel(entries: Array[Double], /* ... */ )

class Model(private var entries: breeze.linalg.DenseVector[Double],
            /* ... lots of (possibly) mutable state ... */ )
    extends java.io.Serializable {
  // lots of implementation details here
  def freeze: FrozenModel = // ...
}

object Model {
  def thaw(im: FrozenModel): Model = // ...
}
Sharing models

import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.{read => jread, write => jwrite}

implicit val formats = Serialization.formats(NoTypeHints)

def toJson(m: Model): String = jwrite(m.freeze)

def fromJson(json: String): Try[Model] =
  Try { Model.thaw(jread[FrozenModel](json)) }

Also consider how you’ll share feature encoders and other parts of your learning pipeline!
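The same freeze/thaw pattern carries over to any consumer language. Here is a hypothetical Python mirror of it (the `FrozenModel`/`Model` names just echo the Scala sketch; this is an illustration of the pattern, not code from the talk):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class FrozenModel:
    """Plain, immutable record holding only what's needed to rebuild the model."""
    entries: list

class Model:
    def __init__(self, entries):
        self.entries = list(entries)      # stand-in for mutable internal state

    def freeze(self):
        return FrozenModel(entries=list(self.entries))

    @staticmethod
    def thaw(frozen):
        return Model(frozen.entries)

def to_json(m):
    return json.dumps(asdict(m.freeze()))

def from_json(s):
    return Model.thaw(FrozenModel(**json.loads(s)))

m2 = from_json(to_json(Model([1.0, 2.5])))   # round-trips through plain JSON
```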
PRACTICAL MATTERS
Spark and Elasticsearch: data locality is an issue, and caching is even more important than when running from local storage. If your data are write-once, consider exporting ES indices to Parquet files and analyzing those instead.
Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.
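Defensive schema mediation can be as simple as funnelling every raw record through one normalizing function before analysis. A sketch in Python (the field names and defaults here are illustrative, not the talk's actual schema):

```python
def normalize_record(raw):
    """Mediate one raw log record into a fixed schema, supplying explicit
    defaults for missing or null fields instead of propagating None."""
    return {
        "timestamp": raw.get("timestamp") or "1970-01-01T00:00:00Z",
        "level": (raw.get("level") or "unknown").lower(),
        "host": raw.get("host") or "unknown-host",
        "message": raw.get("message") or "",
    }

rec = normalize_record({"level": "INFO", "host": None})
# downstream code can now rely on every field being present and non-null
```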
Memory and partitioning Large JVM heaps can lead to appalling GC pauses and executor timeouts. Use multiple JVMs or off-heap storage (in Spark 2.0!) Tree aggregation can save you both memory and execution time by partially aggregating at worker nodes.
Interoperability Avoid brittle or language-specific model serializers when sharing models with non-Spark environments. JSON is imperfect but ubiquitous. However, json4s will serialize case classes for free! See also SPARK-13944, merged recently into 2.0.
Feature engineering Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models. Design your feature engineering pipeline so you can translate feature vectors back to factor values.
@willb • willb@redhat.com
 https://chapeau.freevariable.com THANKS!
