The document discusses analyzing log data using Apache Spark. It covers challenges with log data like schema mediation and feature engineering to transform log records into vectors. It also discusses visualizing the structured data using dimensionality reduction techniques like PCA and self-organizing maps to find outliers. The document provides examples of analyzing log data to identify the most frequent error levels and applications generating logs.
Challenges of log data
    SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
    FROM LOGS
    WHERE level='CRIT' AND msg LIKE '%failure%'
    GROUP BY hostname, hour
[Figure: the same query shown against an hourly timeline, 11:00 through 18:00]
Challenges of log data (ca. 2000)
[Figure: three log sources (postgres, httpd, syslog) emitting a handful of INFO, WARN, CRIT, and DEBUG messages and GET/POST requests, including one GET (404)]
Challenges of log data (ca. 2016)
[Figure: many more log sources (postgres, httpd, syslog, CouchDB, Django, haproxy, k8s, Cassandra, nginx, Rails, redis), each emitting its own mix of INFO, WARN, CRIT, and DEBUG messages and HTTP requests such as GET (404) and PUT (500)]
8.
Challenges of logdatapostgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN postgres httpd syslog INFO WARN GET (404) CRIT INFO GET GET GET POST INFO INFO INFO WARN CouchDB httpd Django INFO 
CRITINFO GET POST INFO INFO INFO WARN haproxy k8s INFO INFO WARN CRITDEBUG WARN WARN INFO INFOINFO INFO Cassandra nginx Rails INFO CRIT INFO GET POST PUT POST INFO INFO INFOWARN INFO redis INFO CRIT INFOINFO PUT (500)httpd syslog GET PUT INFO INFO INFOWARN How many services are generating logs in your datacenter today?
Collecting log data
collecting: Ingesting live log data via rsyslog, logstash, fluentd
normalizing: Reconciling log record metadata across sources
warehousing: Storing normalized records in ES indices
analysis: Caching warehoused data as Parquet files on a Gluster volume local to the Spark cluster
Exploring structured data
    logs
      .select("level").distinct
      .map { case Row(s: String) => s }
      .collect

    debug, notice, emerg, err, warning, crit, info, severe, alert

    logs
      .groupBy($"level", $"rsyslog.app_name")
      .agg(count("level").as("total"))
      .orderBy($"total".desc)
      .show

    info  kubelet     17933574
    info  kube-proxy  10961117
    err   journal      6867921
    info  systemd      5184475
    …
Exploring structured data
The distinct levels can also be collected through the Dataset API:
    logs
      .select("level").distinct
      .as[String].collect

    debug, notice, emerg, err, warning, crit, info, severe, alert
This class must be declared outside the REPL!
From log records to vectors
What does it mean for two sets of categorical features to be similar?
    red     -> 000
    green   -> 010
    blue    -> 100
    orange  -> 001

    pancakes     -> 10000
    waffles      -> 01000
    aebleskiver  -> 00100
    omelets      -> 00001
    bacon        -> 00000
    hash browns  -> 00010
Concatenating the two codes gives one vector per record:
    red pancakes    -> 00010000
    orange waffles  -> 00101000
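The encoding on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the talk's actual pipeline: the bit order is inferred from the codes shown on the slide, the names `COLOR_BITS`, `FOOD_BITS`, `one_hot`, `encode`, and `hamming` are hypothetical, and Hamming distance is only one assumed way to answer the similarity question.

```python
# Levels that get their own bit, in the bit order implied by the slide.
# Any other value (the reference level) encodes as all zeros.
COLOR_BITS = ["blue", "green", "orange"]  # red -> 000
FOOD_BITS = ["pancakes", "waffles", "aebleskiver", "hash browns", "omelets"]  # bacon -> 00000


def one_hot(value, bits):
    """Return a bit string with a 1 in the position of `value`, if it has one."""
    return "".join("1" if value == level else "0" for level in bits)


def encode(color, food):
    """Concatenate the per-feature codes into one vector per record."""
    return one_hot(color, COLOR_BITS) + one_hot(food, FOOD_BITS)


def hamming(a, b):
    """Count the positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


red_pancakes = encode("red", "pancakes")
orange_waffles = encode("orange", "waffles")
```

With this layout, two records that share a feature value differ in fewer bit positions than two records that share none, which is one simple notion of similarity for categorical features.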
Other interesting features
[Figure: the stream of log levels (mostly INFO, with occasional WARN and DEBUG) emitted over time by each of host01, host02, and host03]
Other interesting features
Great food, great service, a must-visit!
Our whole table got gastroenteritis.
This place is so wonderful that it has ruined all other tacos for me and my family.
Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
INFO: Phoenix datacenter is on fire; may not rise from ashes.
See https://links.freevariable.com/nlp-logs/ for more!
Tree-based approaches
[Figure: an ensemble of decision trees whose nodes test conditions such as "if orange" / "if !orange", "if red" / "if !red", and "if !gray", with each leaf answering yes or no]
Outliers in log data
An outlier is any record whose best match was at least 4σ below the mean.
[Figure: example best-match scores: 0.95, 0.97, 0.92, 0.37, 0.94, 0.89, 0.91, 0.93, 0.96]
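The outlier rule stated on this slide can be sketched as a threshold on best-match scores. This is an illustration under assumptions, not the talk's implementation: the function name `find_outliers` and the parameter `k` are hypothetical, and the population standard deviation is an assumed choice (the slide does not say which estimator was used).

```python
import statistics


def find_outliers(scores, k=4.0):
    """Return the indices of scores at least k standard deviations below the mean."""
    mean = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # assumption: population standard deviation
    if sigma == 0:
        return []  # all scores identical, so nothing stands out
    threshold = mean - k * sigma
    return [i for i, s in enumerate(scores) if s <= threshold]


# Hypothetical scores: one hundred good matches and a single poor one.
scores = [0.92, 0.94] * 50 + [0.37]
flagged = find_outliers(scores)
```

The threshold only becomes meaningful over a large population of records. With just the nine scores shown in the figure, the low value inflates the standard deviation so much that nothing falls 4σ below the mean.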
Out of 310 million log records, we identified 0.0012% as outliers.
Thirty most extreme outliers
    10  Can not communicate with power supply 2.
     9  Power supply 2 failed.
     8  Power supply redundancy is lost.
     1  Drive A is removed.
     1  Can not communicate with power supply 1.
     1  Power supply 1 failed.
On-line SOM training
    while t < iterations:
        for ex in examples:
            t = t + 1
            if t == iterations:
                break
            bestMatch = closest(som_t, ex)
            for (unit, wt) in neighborhood(bestMatch, sigma(t)):
                som_(t+1)[unit] = som_t[unit] + ex * alpha(t) * wt
At each step, we update each unit by adding its value from the previous step to the example that we considered, scaled by a learning factor and the distance from this unit to its best match.
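The on-line loop can be made runnable with a small one-dimensional map. This is a sketch under stated assumptions rather than the implementation from the talk: the Gaussian neighborhood, the decay schedules for `sigma` and `alpha`, and the helpers `init_som` and `quantization_error` are all hypothetical. It also swaps the slide's simplified update for the textbook Kohonen step, which moves a unit toward the example by `(ex - unit) * alpha(t) * wt`.

```python
import math
import random


def closest(som, ex):
    """Index of the unit whose weights are nearest to ex (squared Euclidean distance)."""
    return min(range(len(som)),
               key=lambda u: sum((w - x) ** 2 for w, x in zip(som[u], ex)))


def neighborhood(best, sigma, n_units):
    """Yield (unit, weight) pairs; the weight falls off with grid distance from best."""
    for unit in range(n_units):
        yield unit, math.exp(-((unit - best) ** 2) / (2 * sigma ** 2))


def init_som(n_units, dim, seed=0):
    """Random initial map on a 1-D grid (an assumed initialization)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_units)]


def train_online(som, examples, iterations=200):
    som = [list(unit) for unit in som]  # train a copy, leave the input untouched
    n_units = len(som)
    # Hypothetical decay schedules; the slides do not specify them.
    sigma = lambda t: max(0.5, (n_units / 2) * (1 - t / iterations))
    alpha = lambda t: 0.5 * (1 - t / iterations)
    t = 0
    while t < iterations:
        for ex in examples:
            t += 1
            if t == iterations:
                break
            best = closest(som, ex)
            for unit, wt in neighborhood(best, sigma(t), n_units):
                # Textbook step: move the old value toward the example.
                som[unit] = [w + (x - w) * alpha(t) * wt
                             for w, x in zip(som[unit], ex)]
    return som


def quantization_error(som, examples):
    """Mean distance from each example to its best-matching unit."""
    return sum(math.dist(som[closest(som, ex)], ex) for ex in examples) / len(examples)
```

Because every update depends on the map produced by the previous example, this loop is inherently sequential.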
Batch SOM training
    for t in (1 to iterations):
        state = newState()
        for ex in examples:
            bestMatch = closest(som_(t-1), ex)
            hood = neighborhood(bestMatch, sigma(t))
            state.matches += ex * hood
            state.hoods += hood
        som_t = newSOM(state.matches / state.hoods)
For each example, update the state of every cell in the neighborhood of the best matching unit, weighting by distance.
Keep track of the distance weights we’ve seen for a weighted average.
Since we can easily merge multiple states, we can train in parallel across many examples.
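The mergeable state is what makes the batch variant parallel, and a small sketch can show why. Everything here is an assumption for illustration: the map is a one-dimensional grid, the neighborhood is Gaussian, and `State`, `hood_weights`, and `batch_step` are hypothetical names. Plain Python lists stand in for Spark partitions.

```python
import math


def closest(som, ex):
    """Index of the unit whose weights are nearest to ex (squared Euclidean distance)."""
    return min(range(len(som)),
               key=lambda u: sum((w - x) ** 2 for w, x in zip(som[u], ex)))


def hood_weights(best, sigma, n_units):
    """Gaussian weight of every unit relative to the best-matching unit."""
    return [math.exp(-((u - best) ** 2) / (2 * sigma ** 2)) for u in range(n_units)]


class State:
    """Running weighted sums of examples (matches) and of weights (hoods)."""

    def __init__(self, n_units, dim):
        self.matches = [[0.0] * dim for _ in range(n_units)]
        self.hoods = [0.0] * n_units

    def add(self, ex, hood):
        for u, wt in enumerate(hood):
            self.hoods[u] += wt
            for d, x in enumerate(ex):
                self.matches[u][d] += x * wt
        return self

    def merge(self, other):
        for u in range(len(self.hoods)):
            self.hoods[u] += other.hoods[u]
            for d in range(len(self.matches[u])):
                self.matches[u][d] += other.matches[u][d]
        return self


def batch_step(som, partitions, sigma):
    """One batch iteration; assumes at least one example across all partitions."""
    n_units, dim = len(som), len(som[0])
    total = State(n_units, dim)
    for part in partitions:  # each partition could be handled by a different worker
        partial = State(n_units, dim)
        for ex in part:
            partial.add(ex, hood_weights(closest(som, ex), sigma, n_units))
        total.merge(partial)
    # New unit = weighted average of the examples that fell in its neighborhood.
    return [[m / h for m in row] for row, h in zip(total.matches, total.hoods)]
```

Since every example is matched against the same map from the previous iteration, the partial states are plain sums, so splitting the examples across partitions and merging gives the same map as a single pass.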
Spark and ElasticSearch
Data locality is an issue, and caching is even more important than when running from local storage.
If your data are write-once, consider exporting ES indices to Parquet files and analyzing those instead.
Structured queries in Spark
Always program defensively: mediate schemas, explicitly convert null values, etc.
Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.
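The advice about mediating schemas and converting null values can be illustrated outside Spark with plain dictionaries. This is only a sketch: the field aliases, the defaults, and the function `normalize` are hypothetical and would have to be replaced with whatever your own log sources emit.

```python
# Hypothetical aliases: each canonical field lists the source keys that may carry it.
ALIASES = {
    "host": ["host", "hostname"],
    "level": ["level", "severity"],
    "msg": ["msg", "message"],
}

# Hypothetical explicit replacements for missing or null values.
DEFAULTS = {"host": "unknown", "level": "unknown", "msg": ""}


def normalize(record):
    """Map a raw record onto the canonical schema, replacing nulls explicitly."""
    out = {}
    for field, keys in ALIASES.items():
        # Take the first alias that is present and not null.
        value = next((record[k] for k in keys if record.get(k) is not None), None)
        out[field] = DEFAULTS[field] if value is None else str(value)
    return out


raw = {"hostname": "host01", "severity": None, "message": "disk failure"}
clean = normalize(raw)
```

Downstream code can then rely on every record having the same keys and no null values, instead of checking for them at each use.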
Memory and partitioning
Large JVM heaps can lead to appalling GC pauses and executor timeouts. Use multiple JVMs or off-heap storage (in Spark 2.0!).
Tree aggregation can save you both memory and execution time by partially aggregating at worker nodes.
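The idea behind tree aggregation can be shown without a cluster. The following is a conceptual sketch in plain Python, not Spark's implementation: `tree_aggregate` and `fanin` are hypothetical names, and lists stand in for partitions. Each partition is folded locally first, and the partial results are then combined in rounds instead of all at once on the driver.

```python
from functools import reduce
from operator import add


def tree_aggregate(partitions, zero, seq_op, comb_op, fanin=2):
    """Fold each partition locally, then combine partial results in rounds.

    `zero` should be an immutable value (or seq_op must not mutate it),
    because the same object seeds every partition.
    """
    if fanin < 2:
        raise ValueError("fanin must be at least 2")
    # Step 1: every "worker" reduces its own partition to one partial result.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: merge groups of `fanin` partials per round until one remains.
    while len(partials) > 1:
        partials = [reduce(comb_op, partials[i:i + fanin])
                    for i in range(0, len(partials), fanin)]
    return partials[0] if partials else zero


total = tree_aggregate([[1, 2], [3], [4, 5, 6], []], 0, add, add)
```

In each round no single node has to hold more than `fanin` partial results, which is where the memory saving comes from when the partials are large, such as SOM training states.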
Interoperability
Avoid brittle or language-specific model serializers when sharing models with non-Spark environments.
JSON is imperfect but ubiquitous. However, json4s will serialize case classes for free!
See also SPARK-13944, merged recently into 2.0.
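As an illustration of the JSON route, a trained map can be written out as a plain document that any environment can read back. The layout below (`xdim`, `ydim`, `units`) and the function names are assumptions made for this sketch, not the format used in the talk.

```python
import json


def som_to_json(units, xdim, ydim):
    """Serialize a map as a language-neutral JSON document (hypothetical layout)."""
    return json.dumps({"xdim": xdim, "ydim": ydim, "units": units})


def som_from_json(text):
    """Read the document back into (units, xdim, ydim)."""
    doc = json.loads(text)
    return doc["units"], doc["xdim"], doc["ydim"]


units = [[0.25, 0.5], [0.75, 0.125]]
restored, xdim, ydim = som_from_json(som_to_json(units, 2, 1))
```

Nothing in the document ties it to Python, so a consumer in another language only needs a JSON parser and the agreed field names.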
Feature engineering
Favor feature engineering effort over complex or novel learning algorithms.
Prefer approaches that train interpretable models.
Design your feature engineering pipeline so you can translate feature vectors back to factor values.
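Translating vectors back to factor values is straightforward when the encoder keeps its layout as data. A minimal sketch, assuming two hypothetical factors whose reference levels encode as all zeros; `FACTORS`, `encode`, and `decode` are illustrative names only.

```python
# Hypothetical layout: (factor name, levels that own a bit, reference level).
FACTORS = [
    ("color", ["blue", "green", "orange"], "red"),
    ("food", ["pancakes", "waffles", "aebleskiver", "hash browns", "omelets"], "bacon"),
]


def encode(record):
    """Concatenate one indicator block per factor."""
    vec = []
    for name, levels, _ref in FACTORS:
        vec.extend(1 if record[name] == level else 0 for level in levels)
    return vec


def decode(vec):
    """Walk the same layout to recover the factor values from a vector."""
    record, pos = {}, 0
    for name, levels, ref in FACTORS:
        bits = vec[pos:pos + len(levels)]
        record[name] = levels[bits.index(1)] if 1 in bits else ref
        pos += len(levels)
    return record


original = {"color": "orange", "food": "waffles"}
vector = encode(original)
```

Note that in this sketch a value outside the known levels encodes the same way as the reference level, so a real pipeline would need an explicit slot for unknown values to stay reversible.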