Big Data Analytics and Hadoop
Dr. M.V. Padmavati
Bhilai Institute of Technology, Durg
Data scientist is the most promising job of 2019
Big Data
What is Big data?
Big data is an all-encompassing term for any collection of data sets so large and complex that they become difficult to process using traditional data processing applications.
Key Challenges:
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental research
• Internet Search
Dimensions to Big Data
• Initially, big data was described by three dimensions: Volume, Variety and Velocity.
• These are also called the characteristics of big data or the 3 V's of big data.
• A 4th V (Veracity) was added afterwards.
Volume (Scale)
• Data volume is increasing exponentially: a 44x increase was projected from 2009 to 2020, from 0.8 zettabytes to 35 ZB.
• 1 TB = 1024 GB
• 1 PetaByte (roughly the 5th power of 1000, 10^15 bytes) = 1024 TB
• 1 ExaByte (roughly the 6th power of 1000, 10^18 bytes) = 1024 PB
• 1 ZettaByte = 1024 EB
• 1 YottaByte = 1024 ZB
• Big Data is a collection of huge volumes of data.
Velocity (Speed)
• Data is being generated fast and needs to be processed fast.
• Requires online data analytics
• Late decisions mean missed opportunities
• Examples:
• E-Promotions: based on your current location, your purchase history and what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitor your activities and body → any abnormal measurements require immediate reaction
Variety (Complexity)
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Data can be either static or streaming
Veracity (Uncertainty)
• Veracity refers to the trustworthiness of the data.
• It covers inconsistency and uncertainty in the data.
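The unit ladder and the growth figure on the Volume slide can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `convert` and the `UNITS` list are illustrative, and it follows the slide's convention that each unit is 1024 of the previous one.

```python
# Units in ascending order; each step is a factor of 1024 (the slide's convention).
UNITS = ["GB", "TB", "PB", "EB", "ZB", "YB"]

def convert(value, from_unit, to_unit):
    """Convert a value between two units of the ladder above."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * (1024 ** steps)

# Growth figure from the slide: 0.8 ZB in 2009, multiplied by 44.
projected_zb = 0.8 * 44

print(f"Projected volume: {projected_zb:.1f} ZB")
print(f"1 PB = {convert(1, 'PB', 'GB'):,} GB")
print(f"1024 GB = {convert(1024, 'GB', 'TB')} TB")
```

The product 0.8 × 44 comes to about 35 ZB, which matches the rounded figure quoted on the slide.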
4 Vs of Big Data
Types of Data
Data is categorized as:
1. Structured Data
2. Semi-Structured Data
3. Unstructured Data
Generally, big data consists mostly of unstructured data.
1. Structured Data:
• Uploads neatly into a relational database
Types of Data - Unstructured Data
3. Unstructured Data
• Today more than 80% of the data generated is unstructured.
• Examples:
• Satellite images, social media data, mobile data, photographs and video: this includes security, surveillance, and traffic video
• Website content: this comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
Types of Data - Semi-structured Data
• Semi-structured data has some organizational properties that make it easier to analyze.
• Examples of semi-structured data formats:
• CSV (Comma Separated Values)
• XML (Extensible Markup Language)
• JSON (JavaScript Object Notation)
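As a small illustration of what "some organizational properties" means in practice, the sketch below parses two JSON records that do not share a fixed set of fields. The records, the field names and the `flatten` helper are all hypothetical, chosen only for the example.

```python
import json

# Hypothetical records: both are valid JSON, but they do not share a fixed schema.
raw_records = [
    '{"user": "asha", "age": 31, "tags": ["sports", "music"]}',
    '{"user": "ravi", "location": {"city": "Durg"}}',
]

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys so records can be compared."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

rows = [flatten(json.loads(line)) for line in raw_records]

# The union of keys shows the structure that exists even without a fixed schema.
all_fields = sorted({field for row in rows for field in row})
print(all_fields)
```

The keys give each record enough structure to be queried, yet no table definition was needed up front, which is what separates this from structured data in a relational database.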
Big data Analytics
What is Big data Analytics?
• “It is the art of finding patterns and insights in large sets of data that allow you to make better decisions or learn things you couldn’t otherwise learn.”
• It makes use of statistics, AI, data mining, machine learning, pattern recognition, natural language processing, etc.
Why Big data Analytics?
Reasons and benefits of big data analytics:
• Timely: gain instant insights from diverse data sources
• Better analytics: improvement of business performance through real-time analytics
• Vast data: big data technologies manage huge amounts of data
• Insights: can provide better insights with the help of unstructured and semi-structured data
• Decision making: helps mitigate risk and make smart decisions through proper risk analysis
Types of Analytics
BIG DATA ANALYTICAL TOOLS
• Apache Hadoop
• Apache Spark
• Apache Storm
• Presto (Facebook)
• Hydra
• Google BigQuery
• Statwing
• Pentaho
• Flink
• OpenRefine
• Kaggle
• Windows Azure
Applications of Big Data Analytics for Humanity
Big data in Healthcare
• Customer relationship management
• Electronic Health Record
Big data in Healthcare
• Big data reduces the cost of treatment since there is less chance of performing unnecessary diagnosis.
• It helps in predicting outbreaks of epidemics and in deciding what preventive measures to take.
• It helps avoid preventable diseases by detecting them in their early stages.
• Patients can be provided with evidence-based medicine, identified and prescribed after researching past medical results.
Big data in Insurance
• Customer behavior is analyzed and predicted through data derived from social media, GPS-enabled devices and CCTV footage.
• In claims management, predictive analytics from big data has been used to offer faster service and fraud detection.
• Through massive data from digital channels and social media, real-time monitoring of claims throughout the claims cycle has been used to provide insights.
• SBI Life makes use of big data analytics.
Big data in Education
• The University of Tasmania, an Australian university with over 26,000 students, has deployed a Learning and Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, and the overall progress of a student over time.
• It is also used to measure teachers' effectiveness to ensure a good experience for both students and teachers.
• Click patterns are also being used to detect boredom.
• Adaptive learning (customized learning): enterprises produce digital courses that use big-data-fuelled predictive analytics to identify what a learner is learning and which components of a lesson plan suit them best in that situation.
Big data in Media and Entertainment
• The media and entertainment industry is facing new business models for the way it creates, markets and distributes its content. This is happening because consumers now search for content and expect to access it anywhere, any time, on any device.
• Big data provides actionable points of information about millions of individuals. Publishing environments now tailor advertisements and content to appeal to consumers. These insights are gathered through various data-mining activities.
Big data applications benefit the media and entertainment industry by:
• Predicting what the audience wants
• Scheduling optimization
• Increasing acquisition and retention
• Ad targeting
• Content monetization and new product development
Big data in Various Other Fields
• Crime Prediction and Prevention: police departments can leverage advanced, real-time analytics to provide actionable intelligence that can be used to understand criminal behaviour, identify crime/incident patterns, and uncover location-based threats.
• Weather Forecasting: the NOAA (National Oceanic and Atmospheric Administration) gathers data every minute of every day from land, sea, and space-based sensors. Every day, NOAA uses big data to analyze and extract value from over 20 terabytes of data.
• Tax Compliance: big data applications can be used by tax organizations to analyze both unstructured and structured data from a variety of sources in order to identify suspicious behavior and multiple identities. This helps in identifying tax fraud.
• Big Data Contributions to Transportation:
• Route planning to reduce users' wait times.
• Congestion management by predicting traffic conditions: using big data, real-time estimation of congestion and traffic patterns is now possible. For example, people use Google Maps to locate the least traffic-prone routes.
• Safety level of traffic: using real-time processing of big data and predictive analysis to identify accident-prone areas can help reduce accidents and increase the safety level of traffic.
Why should I Learn Big Data Analytics?
Role of Mathematicians in Big data
• Data science is the marriage of statistics and computer science. We need:
• Probability
• Statistics
• Distributed Optimization
• Algebra
• Calculus
How Physicists can use Big data
• Astrophysics
• Quantum Computing
• Electrical grid analytics
• Simulation of complex systems
• Internet of Things
How Bio People can use Big data
• The human genome contains roughly 3 billion DNA base pairs and about 20,000 genes.
• The genetic information acquired globally about patients and diseases will enable health-care providers to offer individual-specific, tailor-made medicines.
• Smart agriculture using IoT
• DNA-sequence data contain insights for the development of (a) superior, disease-resistant and high-yielding crop varieties that are resistant to climate change, and (b) drugs for cancer, HIV, or new strains of influenza.
For Commerce People
• Supply chain analytics
• Retail analytics
• Manufacturing analytics
• Bank analytics
• HR analytics
• Sales analytics
• Recommender systems
Apache Hadoop
APACHE HADOOP
• Hadoop is an open source framework developed by Doug Cutting in 2006 and managed by the Apache Software Foundation.
• The project was named Hadoop after the yellow toy elephant of Doug Cutting's son.
• The framework is written in Java and allows storage and processing of large volumes of data on a cluster of commodity hardware.
• The Apache Hadoop project actively supports multiple projects intended to extend Hadoop's capabilities and make it easier to use.
Traditional Systems Vs Big data Systems
Traditional Systems
• Schema-On-Write
• Use shared storage
• Cost of proprietary hardware
• Bring data to the programs
Hadoop Data Systems
• Schema-On-Read
• Use the Hadoop Distributed File System (HDFS)
• Local storage on commodity hardware
• Bring programs to the data
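The schema-on-read idea from the comparison above can be sketched in a few lines: the raw data is stored as-is, and each reader applies its own column types when it loads the data. The sample data, the column names and the `read_with_schema` helper are hypothetical and only illustrate the principle; they are not a Hadoop API.

```python
import csv
import io

# Hypothetical raw file as it might sit in storage: no schema was enforced on write.
RAW = "id,amount,city\n1,250.50,Durg\n2,not_available,Bhilai\n"

def read_with_schema(raw_text, schema):
    """Apply column types at read time; values that do not fit become None."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        typed = {}
        for column, caster in schema.items():
            try:
                typed[column] = caster(record[column])
            except ValueError:
                typed[column] = None
        rows.append(typed)
    return rows

# Two readers project different schemas onto the same stored bytes.
billing_view = read_with_schema(RAW, {"id": int, "amount": float})
location_view = read_with_schema(RAW, {"id": int, "city": str})
print(billing_view)
print(location_view)
```

With schema-on-write, the second row would have been rejected at load time. Here it is stored anyway, and each reader decides how to treat the value that does not fit.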
HADOOP ECOSYSTEM
HADOOP COMPONENTS
HDFS (Hadoop Distributed File System)
• It is the storage layer of Hadoop. It follows the master-slave pattern.
• In HDFS, the NameNode acts as the master and stores the metadata about the files and blocks held on the DataNodes.
• A DataNode acts as a slave: it stores the actual data on its local disk, and the DataNodes carry out the actual work on the data in parallel.
MapReduce
• It is the data processing layer of Hadoop.
• It processes huge amounts of data in parallel by dividing the submitted job into a set of independent tasks.
• It contains four tasks: map, shuffle, sort and reduce.
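The split between NameNode metadata and DataNode storage can be pictured with a toy model. This is a sketch under stated assumptions, not HDFS code: the 128 MB block size and the three copies per block are values assumed for the example, the node names are made up, and the round-robin placement stands in for the rack-aware policy a real cluster would use.

```python
import math
from itertools import cycle

BLOCK_SIZE_MB = 128   # assumed block size for this sketch
REPLICATION = 3       # assumed number of copies per block

def plan_blocks(file_name, size_mb, data_nodes):
    """Return NameNode-style metadata: block id -> DataNodes holding a copy."""
    n_blocks = math.ceil(size_mb / BLOCK_SIZE_MB)
    nodes = cycle(data_nodes)
    metadata = {}
    for i in range(n_blocks):
        # Round-robin placement; a real cluster uses a rack-aware policy instead.
        metadata[f"{file_name}_blk{i}"] = [next(nodes) for _ in range(REPLICATION)]
    return metadata

# Hypothetical 300 MB file on a four-node cluster.
namenode_metadata = plan_blocks("logs.txt", 300, ["dn1", "dn2", "dn3", "dn4"])
for block, holders in namenode_metadata.items():
    print(block, "->", holders)
```

The dictionary is all the master needs to hold: which blocks make up a file and where their copies live. The data itself stays on the DataNodes.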
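The four steps named above (map, shuffle, sort, reduce) can be sketched as a word count in plain Python. This runs in a single process and only mirrors the data flow; the function names and the sample lines are illustrative and are not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

lines = ["big data needs big storage", "hadoop stores big data"]

# Each line could be mapped on a different node, since the tasks are independent.
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)

# Sort: order the keys before they are handed to the reducers.
word_counts = dict(reduce_phase(key, grouped[key]) for key in sorted(grouped))
print(word_counts)
```

Because every `map_phase` call depends only on its own line and every `reduce_phase` call only on its own key, both steps can be spread across many machines, which is the parallelism the slide refers to.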
HBase and Hive
• Hive and HBase both work with data stored in Hadoop, but in different ways.
• RDBMS professionals love Apache Hive as they can simply map HDFS files to Hive tables and query the data.
• HBase is a NoSQL database used for real-time read and write access to data, whereas Hive is not strictly a database but a MapReduce-based SQL engine that runs on top of Hadoop.
• In short, HBase is a database and Hive is a SQL engine for batch processing of big data.
• Other NoSQL databases are MongoDB, Cassandra, etc.
Pig
• It is a high-level scripting language.
• It enables writing complex data processing operations in Hadoop using Pig Latin.
Sqoop
• It is a data transfer tool designed to move huge volumes of data between Hadoop and RDBMS.
Mahout
• A library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
Flume
• It is a reliable system for collecting large amounts of log data from many different sources in real time.
Oozie
• It is a workflow scheduler system that is used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
ZooKeeper
• ZooKeeper is a high-performance coordination service for distributed applications. It provides a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
FEATURES OF HADOOP
• No expensive hardware is required
• Supports a large cluster of 100 to 1000 nodes
• More computing power and storage
• Parallel processing of data
• Distributed data
• Data replication
• Automatic failover management
• Data locality optimization
• Supports heterogeneous clusters
• Scalability