Mongo Analytics – Learn aggregation by example Exploratory Analytics and Visualization using Flight Data www.jsonstudio.com
Analyzing Flight Data • JSON data imported from CSV downloaded from www.transtats.bts.gov • Sample document for a flight: { "_id": { "$oid": "534205f61c479f6149a92709" }, "YEAR": 2013, "QUARTER": 1, "MONTH": 1, "DAY_OF_MONTH": 18, "DAY_OF_WEEK": 5, "FL_DATE": "2013-01-18”, "UNIQUE_CARRIER": "DL”, "AIRLINE_ID": 19790, "CARRIER": "DL", "TAIL_NUM": "N325US”, "FL_NUM": 1497, "ORIGIN_AIRPORT_ID": 14100, "ORIGIN_AIRPORT_SEQ_ID": 1410002, "ORIGIN_CITY_MARKET_ID": 34100, "ORIGIN": "PHL", "ORIGIN_CITY_NAME": "Philadelphia, PA", "ORIGIN_STATE_ABR": "PA”, "ORIGIN_STATE_FIPS": 42, "DEST_AIRPORT_ID": 13487, "DEST_AIRPORT_SEQ_ID": 1348702, "DEST_CITY_MARKET_ID": 31650, "DEST": "MSP", "DEST_CITY_NAME": "Minneapolis, MN", "DEST_STATE_ABR": "MN", "DEST_STATE_FIPS": 27, "DEST_STATE_NM": "Minnesota", "DEST_WAC": 63, "CRS_DEP_TIME": 805, "DEP_TIME": 758, "DEP_DELAY": -7, "DEP_DELAY_NEW": 0, "DEP_DEL15": 0, "DEP_DELAY_GROUP": -1, "DEP_TIME_BLK": "0800-0859", "TAXI_OUT": 24, "WHEELS_OFF": 822, "WHEELS_ON": 958, "TAXI_IN": 4, "CRS_ARR_TIME": 1015, "ARR_TIME": 1002, "ARR_DELAY": -13, "ARR_DELAY_NEW": 0, "ARR_DEL15": 0, "ARR_DELAY_GROUP": -1, "ARR_TIME_BLK": "1000-1059", "CANCELLED": 0, "CANCELLATION_CODE": "", "DIVERTED": 0, "CRS_ELAPSED_TIME": 190, "ACTUAL_ELAPSED_TIME": 184, "AIR_TIME": 156, "FLIGHTS": 1, "DISTANCE": 980, "DISTANCE_GROUP": 4, "CARRIER_DELAY": "", "WEATHER_DELAY": "", "NAS_DELAY": "", "SECURITY_DELAY": "", "LATE_AIRCRAFT_DELAY": "", "FIRST_DEP_TIME": "", "TOTAL_ADD_GTIME": "", "LONGEST_ADD_GTIME": "", "": "" } • We will build aggregation pipelines and visualize data using JSON Studio (www.jsonstudio.com) • You will fall madly in love with the Aggregation Framework and it’s capabilities
MongoDB aggregation steps/stages • Grouping • Matching/filtering • Projection • Sorting • Unwind • Limit, skip • Added in 2.6 – Out – Redact
Who are the largest carriers?
Some Carrier Stats { "$group": { "_id": { "CARRIER": "$CARRIER" }, "_avg_DEP_DELAY": { "$avg": "$DEP_DELAY" }, "_avg_ARR_DELAY": { "$avg": "$ARR_DELAY" }, "_avg_DISTANCE_GROUP": { "$avg": "$DISTANCE_GROUP" }, "_avg_TAXI_IN": { "$avg": "$TAXI_IN" }, "_avg_TAXI_OUT": { "$avg": "$TAXI_OUT" } } } { "_id": { "CARRIER": "9E" }, "_avg_DEP_DELAY": 8.45451754385965, "_avg_ARR_DELAY": 3.3237368838726744, "_avg_DISTANCE_GROUP": 2.2188688815622624, "_avg_TAXI_IN": 7.082464246424642, "_avg_TAXI_OUT": 20.558167120639663 }
Which airports have the most cancellations?
Which carriers are most at fault for cancellations?
Arrival delays by distance
Delays by distance by carrier
Delays by distance by carrier – long haul only
Words of caution (courtesy of David Weisman)
Words of caution (courtesy of David Weisman)
What to do? “Touch” the data – e.g. Histograms
Words of caution (courtesy of David Weisman)
Words of caution (courtesy of David Weisman)
Order Does Matter http://docs.mongodb.org/manual/core/aggregation-pipeline-optimization/
An example for $unwind Count how many airports each carrier lands in { "_id": { "$oid": "5383623b7bfb8767e2e9ca1f" }, "iata": "00M", "airport": "Thigpen ", "city": "Bay Springs", "state": "MS", "country": "USA", "lat": 31.95376472, "long": -89.23450472, "carriers": [ "AA", "UA", "DL", "BA" ] } … [ { "_id": { "$oid": "5383623b7bfb8767e2e9ca1f" }, "iata": "00M", "airport": "Thigpen ", "city": "Bay Springs", "state": "MS", "country": "USA", "lat": 31.95376472, "long": -89.23450472, "carriers": "AA" }, { "_id": { "$oid": "542217ffc026b858b47a6640" }, "iata": "00M", "airport": "Thigpen ", "city": "Bay Springs", "state": "MS", "country": "USA", "lat": 31.95376472, "long": -89.23450472, "carriers": "UA" } … ] [ { "_id": { "carriers": "BA" }, "count": 10 }, { "_id": { "carriers": "DL" }, "count": 10 } … ] airports2 $unwind $group
Hub airports – try1
Hub airports – try2
Hub airports – try 3 { $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } }, { $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", "count": 1 } }, { $match: { "count": { $gte: "$$hub_threshold" } } }, { $group: { _id: { airport: "$airport" }, airlines: { $sum: 1 }, flights: { $sum: "$count" }, avg_airline: { $avg: "$count" }, max_airline: { $max: "$count" } } }, { $project: { "airlines": 1, "flights": 1, "avg_airline": 1, "max_airline": 1, "avg_no_max": { $divide: [ { $subtract: [ "$flights", "$max_airline" ] }, "$airlines" ] } } }, { $sort: { "flights": -1 } }
Hub airports
From-to Insensitive { $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN", DEST: "$DEST" }, count: { $sum: 1 } } }, { $match: { "count": { $gt: "$$count_threshold" } } }, { $project: { _id_UNIQUE_CARRIER: "$_id.UNIQUE_CARRIER", "count": 1, rroute: { $cond: [ { $lt: [ { $cmp: [ "$_id.ORIGIN", "$_id.DEST" ] }, 0 ] }, { $concat: [ "$_id.ORIGIN", "$_id.DEST" ] }, { $concat: [ "$_id.DEST", "$_id.ORIGIN" ] } ] } } }, { $group: { _id: { _id_UNIQUE_CARRIER: "$_id_UNIQUE_CARRIER", rroute: "$rroute" }, _sum_count: { $sum: "$count" } } }
Hub visualization (using routes – from/to, $$count=1, origin treemap)
Using “R” for Advanced Analytics • Using a MongoDB driver for “R” • Using the JSON Studio Gateway (including using aggregation output) install.packages("jSonarR") library(’jSonarR') con2 <- sonarR::new.SonarConnection('https://localhost:8443', 'localhost', 'flights', port=47017, username="ron", pwd=”<pwd>”) nyc_by_day <- sonarR::sonarAgg(con2, 'delays_by_day', 'NYCFlights', colClasses=c(X_avg_AirTime='numeric', X_avg_ArrDelay='numeric',X_avg_DepDelay='numeric')) lm.out = lm(nyc_by_day$X_sum_ArrDelay ~ nyc_by_day$X_sum_AirTime) MongoDB
Recommendation engine example: jsonstudio.com
NYC Flights – Quiz Questions • Of the three airports, who has the most flights? – Nyc1 • Who has the most cancellations and highest cancellation ratio? – Nyc2 • Taxi in/out times? – Nyc3 • What about delays? – Nyc4 • How do delays differ by month? – Nyc5 + nyc5 – (summer vs. winter / bubble size vs. y-axis) • What about weather delays only? Which months are worse? Are the three airports equivalent? – Nyc7 + nyc7 • Where can I fly to if I work for Boeing and am very loyal (and on which aicraft)? – Nyc8 + map
www.jsonstudio.com (download – presentation and eval copy) Discount code: MUGTX* (* Good for 1 month after event) ron@jsonar.com
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and Visualization Using Flight Data

MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and Visualization Using Flight Data

  • 1.
    Mongo Analytics – Learn aggregation by example Exploratory Analytics and Visualization using Flight Data www.jsonstudio.com
  • 2.
    Analyzing Flight Data • JSON data imported from CSV downloaded from www.transtats.bts.gov • Sample document for a flight: { "_id": { "$oid": "534205f61c479f6149a92709" }, "YEAR": 2013, "QUARTER": 1, "MONTH": 1, "DAY_OF_MONTH": 18, "DAY_OF_WEEK": 5, "FL_DATE": "2013-01-18”, "UNIQUE_CARRIER": "DL”, "AIRLINE_ID": 19790, "CARRIER": "DL", "TAIL_NUM": "N325US”, "FL_NUM": 1497, "ORIGIN_AIRPORT_ID": 14100, "ORIGIN_AIRPORT_SEQ_ID": 1410002, "ORIGIN_CITY_MARKET_ID": 34100, "ORIGIN": "PHL", "ORIGIN_CITY_NAME": "Philadelphia, PA", "ORIGIN_STATE_ABR": "PA”, "ORIGIN_STATE_FIPS": 42, "DEST_AIRPORT_ID": 13487, "DEST_AIRPORT_SEQ_ID": 1348702, "DEST_CITY_MARKET_ID": 31650, "DEST": "MSP", "DEST_CITY_NAME": "Minneapolis, MN", "DEST_STATE_ABR": "MN", "DEST_STATE_FIPS": 27, "DEST_STATE_NM": "Minnesota", "DEST_WAC": 63, "CRS_DEP_TIME": 805, "DEP_TIME": 758, "DEP_DELAY": -7, "DEP_DELAY_NEW": 0, "DEP_DEL15": 0, "DEP_DELAY_GROUP": -1, "DEP_TIME_BLK": "0800-0859", "TAXI_OUT": 24, "WHEELS_OFF": 822, "WHEELS_ON": 958, "TAXI_IN": 4, "CRS_ARR_TIME": 1015, "ARR_TIME": 1002, "ARR_DELAY": -13, "ARR_DELAY_NEW": 0, "ARR_DEL15": 0, "ARR_DELAY_GROUP": -1, "ARR_TIME_BLK": "1000-1059", "CANCELLED": 0, "CANCELLATION_CODE": "", "DIVERTED": 0, "CRS_ELAPSED_TIME": 190, "ACTUAL_ELAPSED_TIME": 184, "AIR_TIME": 156, "FLIGHTS": 1, "DISTANCE": 980, "DISTANCE_GROUP": 4, "CARRIER_DELAY": "", "WEATHER_DELAY": "", "NAS_DELAY": "", "SECURITY_DELAY": "", "LATE_AIRCRAFT_DELAY": "", "FIRST_DEP_TIME": "", "TOTAL_ADD_GTIME": "", "LONGEST_ADD_GTIME": "", "": "" } • We will build aggregation pipelines and visualize data using JSON Studio (www.jsonstudio.com) • You will fall madly in love with the Aggregation Framework and it’s capabilities
  • 3.
    MongoDB aggregation steps/stages • Grouping • Matching/filtering • Projection • Sorting • Unwind • Limit, skip • Added in 2.6 – Out – Redact
  • 4.
    Who are thelargest carriers?
  • 5.
    Some Carrier Stats{ "$group": { "_id": { "CARRIER": "$CARRIER" }, "_avg_DEP_DELAY": { "$avg": "$DEP_DELAY" }, "_avg_ARR_DELAY": { "$avg": "$ARR_DELAY" }, "_avg_DISTANCE_GROUP": { "$avg": "$DISTANCE_GROUP" }, "_avg_TAXI_IN": { "$avg": "$TAXI_IN" }, "_avg_TAXI_OUT": { "$avg": "$TAXI_OUT" } } } { "_id": { "CARRIER": "9E" }, "_avg_DEP_DELAY": 8.45451754385965, "_avg_ARR_DELAY": 3.3237368838726744, "_avg_DISTANCE_GROUP": 2.2188688815622624, "_avg_TAXI_IN": 7.082464246424642, "_avg_TAXI_OUT": 20.558167120639663 }
  • 6.
    Which airports havethe most cancellations?
  • 7.
    Which carriers aremost at fault for cancellations?
  • 8.
  • 9.
  • 10.
    Delays by distanceby carrier – long haul only
  • 11.
    Words of caution(courtesy of David Weisman)
  • 12.
    Words of caution(courtesy of David Weisman)
  • 13.
    What to do? “Touch” the data – e.g. Histograms
  • 14.
    Words of caution(courtesy of David Weisman)
  • 15.
    Words of caution(courtesy of David Weisman)
  • 16.
    Order Does Matter http://docs.mongodb.org/manual/core/aggregation-pipeline-optimization/
  • 17.
    An example for$unwind Count how many airports each carrier lands in { "_id": { "$oid": "5383623b7bfb8767e2e9ca1f" }, "iata": "00M", "airport": "Thigpen ", "city": "Bay Springs", "state": "MS", "country": "USA", "lat": 31.95376472, "long": -89.23450472, "carriers": [ "AA", "UA", "DL", "BA" ] } … [ { "_id": { "$oid": "5383623b7bfb8767e2e9ca1f" }, "iata": "00M", "airport": "Thigpen ", "city": "Bay Springs", "state": "MS", "country": "USA", "lat": 31.95376472, "long": -89.23450472, "carriers": "AA" }, { "_id": { "$oid": "542217ffc026b858b47a6640" }, "iata": "00M", "airport": "Thigpen ", "city": "Bay Springs", "state": "MS", "country": "USA", "lat": 31.95376472, "long": -89.23450472, "carriers": "UA" } … ] [ { "_id": { "carriers": "BA" }, "count": 10 }, { "_id": { "carriers": "DL" }, "count": 10 } … ] airports2 $unwind $group
  • 18.
  • 19.
  • 20.
    Hub airports –try 3 { $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } }, { $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", "count": 1 } }, { $match: { "count": { $gte: "$$hub_threshold" } } }, { $group: { _id: { airport: "$airport" }, airlines: { $sum: 1 }, flights: { $sum: "$count" }, avg_airline: { $avg: "$count" }, max_airline: { $max: "$count" } } }, { $project: { "airlines": 1, "flights": 1, "avg_airline": 1, "max_airline": 1, "avg_no_max": { $divide: [ { $subtract: [ "$flights", "$max_airline" ] }, "$airlines" ] } } }, { $sort: { "flights": -1 } }
  • 21.
  • 22.
    From-to Insensitive {$group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN", DEST: "$DEST" }, count: { $sum: 1 } } }, { $match: { "count": { $gt: "$$count_threshold" } } }, { $project: { _id_UNIQUE_CARRIER: "$_id.UNIQUE_CARRIER", "count": 1, rroute: { $cond: [ { $lt: [ { $cmp: [ "$_id.ORIGIN", "$_id.DEST" ] }, 0 ] }, { $concat: [ "$_id.ORIGIN", "$_id.DEST" ] }, { $concat: [ "$_id.DEST", "$_id.ORIGIN" ] } ] } } }, { $group: { _id: { _id_UNIQUE_CARRIER: "$_id_UNIQUE_CARRIER", rroute: "$rroute" }, _sum_count: { $sum: "$count" } } }
  • 23.
    Hub visualization (usingroutes – from/to, $$count=1, origin treemap)
  • 24.
    Using “R” forAdvanced Analytics • Using a MongoDB driver for “R” • Using the JSON Studio Gateway (including using aggregation output) install.packages("jSonarR") library(’jSonarR') con2 <- sonarR::new.SonarConnection('https://localhost:8443', 'localhost', 'flights', port=47017, username="ron", pwd=”<pwd>”) nyc_by_day <- sonarR::sonarAgg(con2, 'delays_by_day', 'NYCFlights', colClasses=c(X_avg_AirTime='numeric', X_avg_ArrDelay='numeric',X_avg_DepDelay='numeric')) lm.out = lm(nyc_by_day$X_sum_ArrDelay ~ nyc_by_day$X_sum_AirTime) MongoDB
  • 25.
  • 26.
    NYC Flights –Quiz Questions • Of the three airports, who has the most flights? – Nyc1 • Who has the most cancellations and highest cancellation ratio? – Nyc2 • Taxi in/out times? – Nyc3 • What about delays? – Nyc4 • How do delays differ by month? – Nyc5 + nyc5 – (summer vs. winter / bubble size vs. y-axis) • What about weather delays only? Which months are worse? Are the three airports equivalent? – Nyc7 + nyc7 • Where can I fly to if I work for Boeing and am very loyal (and on which aicraft)? – Nyc8 + map
  • 27.
    www.jsonstudio.com (download –presentation and eval copy) Discount code: MUGTX* (* Good for 1 month after event) ron@jsonar.com