The Age of Big Data: Kayvan Tirdad

The document discusses the rise of big data and its characteristics. It notes that the volume of data being generated is exploding, doubling every few years. Big data is characterized by its volume, velocity, and variety. Examples of big data are given from science, business, entertainment, and medicine. The importance of big data is discussed in terms of job growth and its usage by companies and organizations. Challenges in analyzing big data are presented.

Uploaded by

Hari Sridharan

The Age of Big Data

Kayvan Tirdad
Tirdad@Yorku.ca

Contents
1. Introduction: Explosion in Quantity of Data
2. Big Data Characteristics
3. Cost Problem (example)
4. Importance of Big Data
5. Usage Example in Big Data
6. Some Challenges in Big Data
7. Other Aspects of Big Data
8. Implementation of Big Data
9. Zettabyte Horizon
10. Book Review

Introduction: Explosion in Quantity of Data

1946 to 2012: ENIAC x 6,000,000 =

- LHC: 40 TB/s
- Airbus A380: 1 billion lines of code; each engine generates 10 TB every 30 minutes; 640 TB per flight
- Twitter: generates approximately 12 TB of data per day
- New York Stock Exchange: 1 TB of data every day
- Storage capacity has doubled roughly every three years since the 1980s

Introduction: Explosion in Quantity of Data

Our Data-driven World

- Science: databases from astronomy, genomics, environmental data, transportation data, ...
- Humanities and Social Sciences: scanned books, historical documents, social interactions data, new technology like GPS, ...
- Business & Commerce: corporate sales, stock market transactions, census, airline traffic, ...
- Entertainment: Internet images, Hollywood movies, MP3 files, ...
- Medicine: MRI & CT scans, patient records, ...

Introduction: Explosion in Quantity of Data

Our Data-driven World - Fish and Oceans of Data

What do we do with this amount of data?

Ignore

Big Data Characteristics

How big is Big Data?

- What is big today may not be big tomorrow
- Any data that can challenge our current technology in some manner can be considered Big Data:
- Volume
- Communication
- Speed of generation
- Meaningful analysis

Big Data Vectors (3Vs)

"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (Gartner, 2012)

Big Data Characteristics

Big Data Vectors (3Vs)

- High volume: the amount of data
- High velocity: the speed of collecting, acquiring, generating, or processing data
- High variety: different data types such as audio, video, and image data (mostly unstructured data)

Cost Problem (example)

Cost of processing 1 petabyte of data with 1000 nodes?

- 1 PB = 10^15 B = 1 million gigabytes = 1 thousand terabytes
- 9 hours for each node to process 500 GB at a rate of 15 MB/s
- 15 * 60 * 60 * 9 = 486,000 MB ~ 500 GB
- 1000 nodes * 9 h * $0.34 = $3,060 for a single run
- 1 PB on one node: 1,000,000 GB / 500 GB = 2000 runs; 2000 * 9 h = 18,000 h; 18,000 h / 24 = 750 days
- The cost for 1000 cloud nodes, each processing 1 PB: 2000 * $3,060 = $6,120,000
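
The arithmetic on this slide can be re-traced with a short script. This is only a sketch: the constant and function names are made up for illustration, the $0.34 figure is read as a price per node per hour (the slide does not state the unit), and the 500 GB per run is the slide's own rounding of 486,000 MB.

```python
# Figures quoted on the slide; all names are illustrative, not from the slide.
RATE_MB_PER_S = 15           # per-node processing rate
HOURS_PER_RUN = 9            # length of one run on one node
PRICE_PER_NODE_HOUR = 0.34   # dollars; assumed to be per node per hour
NODES = 1000
GB_PER_PB = 1_000_000        # decimal units: 1 PB = 10**15 B


def mb_per_run():
    """Data one node processes in a single 9-hour run, in MB."""
    return RATE_MB_PER_S * 60 * 60 * HOURS_PER_RUN


def runs_per_petabyte(gb_per_run=500):
    """Number of 9-hour runs one node needs for 1 PB (slide rounds to 500 GB per run)."""
    return GB_PER_PB // gb_per_run


def single_node_days():
    """Wall-clock days for one node to work through 1 PB."""
    return runs_per_petabyte() * HOURS_PER_RUN / 24


def cost_per_run():
    """Dollar cost of one 9-hour run on all nodes."""
    return NODES * HOURS_PER_RUN * PRICE_PER_NODE_HOUR


def total_cost():
    """Dollar cost when each of the 1000 nodes processes a full petabyte."""
    return runs_per_petabyte() * cost_per_run()


if __name__ == "__main__":
    print(f"MB per run:        {mb_per_run():,}")
    print(f"Runs per PB:       {runs_per_petabyte():,}")
    print(f"Days on one node:  {single_node_days():,.0f}")
    print(f"Cost per run:      ${cost_per_run():,.0f}")
    print(f"Total cost:        ${total_cost():,.0f}")
```

Changing `gb_per_run` to the unrounded 486 GB shows how much of the final figure depends on the slide's rounding.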

Importance of Big Data

- Government
- In 2012, the Obama administration announced the Big Data Research and Development Initiative
- 84 different big data programs spread across six departments
- Private Sector
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data
- Facebook handles 40 billion photos from its user base
- Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide
- Science
- The Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days
- Large Hadron Collider: 13 petabytes of data produced in 2010
- Medical computation, like decoding the human genome
- Social science revolution
- New way of science (microscope example)

Importance of Big Data

Jobs
- The U.S. could face a shortage by 2018 of 140,000 to 190,000 people with "deep analytical talent" and of 1.5 million people capable of analyzing data in ways that enable business decisions (McKinsey & Co)
- The Big Data industry is worth more than $100 billion and is growing at almost 10% a year (roughly twice as fast as the software business)

Technology Players in This Field

- Oracle: Exadata
- Microsoft: HDInsight Server
- IBM: Netezza

Usage Example in Big Data

- Moneyball: The Art of Winning an Unfair Game
- The Oakland Athletics baseball team and its general manager, Billy Beane
- The Oakland A's front office took advantage of more analytical gauges of player performance to field a team that could compete successfully against richer competitors in MLB
- Oakland had approximately $41 million in salary; the New York Yankees had $125 million in payroll that same season. Oakland was forced to find players undervalued by the market
- Moneyball had a huge impact on other teams in MLB, and there is a Moneyball movie!

Usage Example of Big Data

US 2012 Election

- Predictive modeling
- mybarackobama.com
- Drive traffic to other campaign sites: Facebook page (33 million "likes"), YouTube channel (240,000 subscribers and 246 million page views)
- A contest to dine with Sarah Jessica Parker
- Every single night, the team ran 66,000 computer simulations
- Reddit!
- Amazon Web Services
- Data mining for individualized ad targeting
- Orca big-data app
- YouTube channel (23,700 subscribers and 26 million page views)
- Ace of Spades HQ

Usage Example in Big Data

Data analysis predictions for the US 2012 election

- Drew Linzer, June 2012: 332 for Obama, 206 for Romney, while the media continued reporting the race as very tight
- Nate Silver, FiveThirtyEight blog: predicted Obama had an 86% chance of winning; predicted all 50 states correctly
- Sam Wang, Princeton Election Consortium: put the probability of Obama's re-election at more than 98%

Some Challenges in Big Data

Big Data Integration is Multidisciplinary
- Less than 10% of the Big Data world is genuinely relational
- Meaningful data integration in the real, messy, schema-less and complex Big Data world of databases and the semantic web, using multidisciplinary and multi-technology methods

The Billion Triple Challenge
- The Web of Data contains 31 billion RDF triples, of which 446 million are RDF links: 13 billion government data, 6 billion geographic data, 4.6 billion publication and media data, 3 billion life science data (BTC 2011, Sindice 2011)

The Linked Open Data Ripper
- Mapping, Ranking, Visualization, Key Matching, Snappiness

Demonstrate the Value of Semantics: let data integration drive DBMS technology
- Large volumes of heterogeneous data, like linked data and RDF

Other Aspects of Big Data

Six Provocations for Big Data
1- Automating Research Changes the Definition of Knowledge
2- Claims to Objectivity and Accuracy are Misleading
3- Bigger Data are Not Always Better Data
4- Not All Data are Equivalent
5- Just Because it is Accessible Doesn't Make it Ethical
6- Limited Access to Big Data Creates New Digital Divides

Other Aspects of Big Data

Five Big Questions about Big Data:
1- What happens in a world of radical transparency, with data widely available?
2- If you could test all your decisions, how would that change the way you compete?
3- How would your business change if you used big data for widespread, real-time customization?
4- How can big data augment or even replace management?
5- Could you create a new business model based on data?

Implementation of Big Data

Platforms for Large-scale Data Analysis

Parallel DBMS technologies
- Proposed in the late eighties
- Matured over the last two decades
- Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises

MapReduce
- Pioneered by Google
- Popularized by Yahoo! (Hadoop)

Implementation of Big Data

Overview:

Parallel DBMS technologies
- Popularly used for more than two decades
- Research projects: Gamma, Grace, ...
- Commercial: multi-billion dollar industry, but access to only a privileged few
- Relational data model
- Indexing
- Familiar SQL interface
- Advanced query optimization
- Well understood and studied

MapReduce
- Data-parallel programming model
- An associated parallel and distributed implementation for commodity clusters
- Pioneered by Google: processes 20 PB of data per day
- Popularized by open-source Hadoop: used by Yahoo!, Facebook, Amazon, and the list is growing

Implementation of Big Data

MapReduce

Raw input: <key, value> -> MAP -> <K1, V1>, <K2, V2>, <K3, V3> -> REDUCE
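
The flow from raw key/value input through MAP to REDUCE can be illustrated with a single-process word count. This is a minimal sketch of the programming model only, not Hadoop's or Google's implementation; every function name here is hypothetical, and the grouping step between map and reduce (usually called the shuffle) is written out explicitly.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input <key, value> and collect the intermediate pairs."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    return intermediate


def shuffle(pairs):
    """Group intermediate values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(_doc_id, text):
    """Map: emit <word, 1> for every word in a document."""
    return [(word.lower(), 1) for word in text.split()]


def word_reduce(_word, counts):
    """Reduce: sum the counts emitted for one word."""
    return sum(counts)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a list of <key, value> records."""
    return reduce_phase(shuffle(map_phase(records, word_map)), word_reduce)


if __name__ == "__main__":
    docs = [("d1", "big data big"), ("d2", "data")]
    print(run_job(docs))
```

Only `word_map` and `word_reduce` are specific to word counting; swapping in a different pair of functions gives a different job over the same three stages.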

Implementation of Big Data

MapReduce Advantages

Automatic parallelization:
- Depending on the size of the raw input data, instantiate multiple MAP tasks
- Similarly, depending on the number of intermediate <key, value> partitions, instantiate multiple REDUCE tasks

Run-time (completely transparent to the programmer/analyst/user):
- Data partitioning
- Task scheduling
- Handling machine failures
- Managing inter-machine communication
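
To make the run-time's role more concrete, the sketch below splits the input into chunks (one MAP task per chunk), assigns intermediate keys to a fixed number of partitions (one REDUCE task per partition), and lets a thread pool schedule the tasks. It is an assumption-laden toy: the names, the chunk size, the CRC32-based partitioner, and the use of threads on one machine are all illustrative, and failure handling and inter-machine communication are left out.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(records, chunk_size):
    """Data partitioning: one chunk of input lines per MAP task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_task(chunk):
    """One MAP task: emit <word, 1> for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def partition_of(key, num_partitions):
    """Pick the REDUCE partition for a key (deterministic hash, illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reduce_task(pairs):
    """One REDUCE task: sum the counts for every key in its partition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(records, chunk_size=2, num_partitions=3, workers=4):
    """Schedule MAP tasks, route their output to partitions, then schedule REDUCE tasks."""
    chunks = split_input(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_task, chunks))
        partitions = [[] for _ in range(num_partitions)]
        for pairs in mapped:
            for key, value in pairs:
                partitions[partition_of(key, num_partitions)].append((key, value))
        reduced = list(pool.map(reduce_task, partitions))
    result = {}
    for part in reduced:
        result.update(part)  # keys never overlap across partitions
    return result


if __name__ == "__main__":
    lines = ["big data", "big volume", "data velocity", "big"]
    print(run(lines))
```

The caller only supplies the input and, optionally, the sizes; how many tasks exist and where each key is reduced is decided inside `run`, which is the sense in which the slide calls the run-time transparent.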

Implementation of Big Data

MapReduce vs Parallel DBMS

- Schema support: Parallel DBMS: yes; MapReduce: not out of the box
- Indexing: Parallel DBMS: yes; MapReduce: not out of the box
- Programming model: Parallel DBMS: declarative (SQL); MapReduce: imperative (C/C++, Java, ...), with extensions through Pig and Hive
- Optimizations (compression, query optimization): Parallel DBMS: yes; MapReduce: not out of the box
- Flexibility: Parallel DBMS: not out of the box; MapReduce: yes
- Fault tolerance: Parallel DBMS: coarse-grained techniques; MapReduce: yes
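
The programming-model contrast in this comparison (declarative SQL versus imperative code) can be shown on a tiny aggregation. The sketch uses Python's built-in `sqlite3` module purely as a stand-in for a SQL engine; it is not a parallel DBMS, and the table, column names, and sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (region, amount)
SALES = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]


def totals_declarative(rows):
    """Declarative: state the result wanted and let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def totals_imperative(rows):
    """Imperative: spell out the loop and the accumulator by hand."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


if __name__ == "__main__":
    print(totals_declarative(SALES))
    print(totals_imperative(SALES))
```

Both functions return the same totals; the difference is who decides how the grouping is carried out, which is also where a DBMS gets its room for query optimization.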

Zettabyte Horizon
- As of 2009, the entire World Wide Web was estimated to contain close to 500 exabytes. This is half a zettabyte.
- The total amount of global data is expected to grow to 2.7 zettabytes during 2012. This is 48% up from 2011.

2012 to 2020: x50
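
A few lines of arithmetic show what these figures imply. This is a rough sketch: reading "x50" as a 50-fold increase in global data between 2012 and 2020, applied to the 2.7 ZB figure, is an assumption, and the function names are illustrative.

```python
EB_PER_ZB = 1000             # 1 zettabyte = 1000 exabytes (decimal units)
WEB_2009_EB = 500            # slide: size of the World Wide Web in 2009
GLOBAL_2012_ZB = 2.7         # slide: expected global data during 2012
GROWTH_2011_TO_2012 = 0.48   # slide: 48% up from 2011


def eb_to_zb(exabytes):
    """Convert exabytes to zettabytes."""
    return exabytes / EB_PER_ZB


def implied_2011_zb():
    """Global data in 2011 implied by 2.7 ZB being 48% higher."""
    return GLOBAL_2012_ZB / (1 + GROWTH_2011_TO_2012)


def projected_2020_zb(factor=50):
    """2020 volume if the x50 factor applies to the 2012 figure (assumption)."""
    return GLOBAL_2012_ZB * factor


def implied_annual_growth(factor=50, years=8):
    """Constant yearly growth rate that compounds to `factor` over `years`."""
    return factor ** (1 / years) - 1


if __name__ == "__main__":
    print(f"Web in 2009:          {eb_to_zb(WEB_2009_EB)} ZB")
    print(f"Implied 2011 total:   {implied_2011_zb():.2f} ZB")
    print(f"Projected 2020 total: {projected_2020_zb():.0f} ZB")
    print(f"Implied yearly rate:  {implied_annual_growth():.0%}")
```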

Wrap Up

Book Review

The Fourth Paradigm Data-Intensive Scientific Discovery

Tony Hey, Stewart Tansley and Kristin Tolle, Microsoft Press, 2009

References
1. B. Brown, M. Chui and J. Manyika, Are you ready for the era of Big Data? McKinsey Quarterly, Oct 2011, McKinsey Global Institute
2. C. Bizer, P. Boncz, M. L. Brodie and O. Erling, The Meaningful Use of Big Data: Four Perspectives, Four Challenges, SIGMOD Record Vol. 40, No. 4, December 2011
3. D. Boyd and K. Crawford, Six Provocations for Big Data, A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, September 2011, Oxford Internet Institute
4. D. Agrawal, S. Das and A. El Abbadi, Big Data and Cloud Computing: Current State and Future Opportunities, EDBT 2011, Uppsala, Sweden
5. D. Agrawal, S. Das and A. El Abbadi, Big Data and Cloud Computing: New Wine or Just New Bottles? VLDB 2010, Vol. 3, No. 2
6. F. J. Alexander, A. Hoisie and A. Szalay, Big Data, IEEE Computing in Science and Engineering, 2011
7. O. Trelles, P. Prins, M. Snir and R. C. Jansen, Big Data, but are we ready? Nature Reviews, Feb 2011
8. K. Bakhshi, Considerations for Big Data: Architecture and Approach, Aerospace Conference, 2012 IEEE
9. S. Lohr, The Age of Big Data, The New York Times, February 2012
10. M. Nielsen, A guide to the day of big data, Nature, vol. 462, December 2009

Kayvan Tirdad
