Understanding your Data Data Analytics Lifecycle and Machine Learning Dr. Abzetdin ADAMOV Director, Center for Data Analytics Research School of IT & Engineering ADA University aadamov@ada.edu.az
Content • Why now? • Data Analytics Lifecycle • Data Acquisition • Data Repository • Data Preprocessing • Data Analytics and Machine Learning • Data Visualization • Data Governance
BIG DATA AT ADA
4th Big Data Day Baku 2018
Computing Facilities at CeDSRT Characteristics of computing cluster: Processing Cores: 102 Memory: 1,568 TB Storage: 136 TB
AAdamov, CeDAR, ADA University
Student Research - SDP Topics
1. Development of Lexical and Morphological Analysis System
2. Effective Installation of Multi-node Cluster based on Hadoop 3.0
3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with Dementia
4. Statistical Analysis and Data Visualization of DTS Data
5. Development of N-gram Model
6. Development of Semantic Similarity System
7. Development of Sentiment Analysis System
8. Personalized Offers and Customer Retention Platform in Banking
9. Data Retrieval, Storage and Manipulation of DTS Data
10. Network Security and IDS using Machine Learning
11. Development of a Spell Correction System
WHY NOW?
Where Data Comes From
Data is produced by:
• People: Social Media, Public Web, Smartphones, …
• Organizations (Employer): OLTP, OLAP, BI, …
• Machines: IoT, Satellites, Vehicles, Science, …
Modern Data Sources
→ Internet of Anything (IoAT)
• Wind Turbines, Oil Rigs, Cars
• Weather Stations, Smart Grids
• RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
• Twitter, Facebook, Snapchat, YouTube
• Clickstream, Ads, User Engagement
• Payments: PayPal, Venmo
Addressing Data – Digital Universe
[Chart: Digital Universe growth over time, 1995–2018, measured in zettabytes]
Addressing Data – Hard Disk Capacity
[Chart: hard drive capacity growth over time, 1991–2017, measured in gigabytes]
Addressing Data – Storage Cost
[Chart: data storage cost per gigabyte, 1980–2020, falling from roughly $1,200,000 to $0.002 per GB]
Computation Power CPU and GPU
[Chart: CPU vs. GPU computation power in GFLOPS, 2001–2018]
Data Growth vs. Processing Power
Addressing Data – Transfer Rate
[Chart: hard drive data transfer rate, 1991–2016, measured in MB/sec]
DATA ANALYTICS LIFECYCLE
Data Analytics Life-Cycle
Data Acquisition → Data Repository → Data Processing → Data Analytics / ML → Data Visualization
• Data Acquisition: Web Crawling, Data Mining, Information Retrieval, …
• Data Repository: Hadoop HDFS, Microsoft Azure, Amazon EC2, Warehouse
• Data Processing: ETL, Parsing, Indexing, Searching, Ranking, NLP, …
• Data Analytics / ML: Statistical Analysis, Machine Learning, R Programming, Python, RapidMiner, Weka, …
Big Data Management involves the Data Science and Data Engineering areas for implementing Data Mining techniques.
DATA ACQUISITION
Data Acquisition Techniques 1. Operational Systems 2. Data Warehouses and Data Marts 3. Online Analytical Processing (OLAP) / BI 4. Web Crawling 5. Data Brokers (Commercial Data) 6. Open Data Sources 7. Experimental Data Collection 8. Online Surveys
Data Acquisition Considerations • Business Needs • Data Standards (ISO, ITIS, FGDC, ISDM) • Accuracy Requirements • Currency of Data • Time Constraints • Format (CSV, XLS, XML, JSON, …) • Cost
DATA REPOSITORY
Traditional Approach in Data Management
[Diagram: a 3 TB file exceeds a 1 TB hard drive; DATA is held in STORAGE in 1 TB units and shuttled to a single processor for PROCESSING: raw data in, processed data out]
The “Big Data” Problem
Problem → A single machine cannot process or even store all the data!
Solution → Distribute data over large clusters
Difficulty → How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?
Addressing Data
• Standard hard drive data transfer speed: 60–100 MB/sec
• Solid State Drive (SSD): 250–500 MB/sec
• Hard drive capacity is growing RAPIDLY (4–60 TB)
• Online data growth: doubles every 18 months
• Processing speed: relatively the same growth
• Hard drive transfer speed is relatively FLAT
Moving data IN and OUT of disk is the bottleneck.
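The bottleneck above is easy to quantify. A back-of-the-envelope sketch in Python, using the slide's illustrative figure of 100 MB/sec for a standard hard drive; the 10 TB workload and the 100-node cluster are assumptions for illustration:

```python
# Back-of-the-envelope arithmetic for the disk bottleneck.
# 100 MB/sec is the slide's figure for a standard hard drive;
# the 10 TB workload and 100-node cluster are illustrative assumptions.

def read_time_hours(data_tb, mb_per_sec, machines=1):
    """Hours needed to scan data_tb terabytes at mb_per_sec per machine,
    assuming the data is split evenly across `machines` nodes."""
    total_mb = data_tb * 1024 * 1024
    return total_mb / (mb_per_sec * machines) / 3600

single = read_time_hours(10, 100)                 # one disk: ~29 hours
cluster = read_time_hours(10, 100, machines=100)  # 100 nodes: ~17 minutes
print(f"1 machine: {single:.1f} h, 100 machines: {cluster * 60:.1f} min")
```

Disks got bigger much faster than they got faster, so the only way to scan a large data set in reasonable time is to read from many disks in parallel.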
BIG DATA and TRADITIONAL SYSTEMS
Hadoop Input/Output Model
[Diagram: the CLIENT writes file NEWS.txt (512 MB), divided into four 128 MB blocks A, B, C, D, into HDFS]
Hadoop reads/writes blocks sequentially, not in parallel, which is why Hadoop by itself does not change I/O performance significantly. The SOLUTION is the data striping technique…
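The block arithmetic behind the diagram can be sketched as follows (a toy helper, not Hadoop's actual API; the 128 MB default block size is taken from the slide):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size used on the slide

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the list of block sizes a file is divided into in HDFS."""
    n = math.ceil(file_size_mb / block_size_mb)
    blocks = [block_size_mb] * (n - 1)
    blocks.append(file_size_mb - block_size_mb * (n - 1))  # last, possibly smaller, block
    return blocks

# The 512 MB NEWS.txt from the slide becomes four full blocks (A, B, C, D):
print(split_into_blocks(512))   # [128, 128, 128, 128]
# A 300 MB file becomes two full blocks plus a 44 MB tail:
print(split_into_blocks(300))   # [128, 128, 44]
```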
Distributed Architecture of HDFS
[Diagram: four racks, each holding four DataNodes behind a switch. The CLIENT asks the NAMENODE where to write file ADA.txt (blocks A, B, C, D); the NameNode answers with three DataNode locations per block: A – DN32, 11, 14; B – DN01, 22, 23; C – DN12, 02, 04; D – DN34, 12, 14]
Big Data and Virtualization
[Diagram comparing three stacks: Traditional Architecture (Apps on one Operating System on one piece of HARDWARE), Virtualized Architecture (a HYPERVISOR on one piece of HARDWARE hosting several OS + App stacks), and Distributed Architecture (HADOOP HDFS + YARN spanning several HARDWARE + OS nodes, each running Apps)]
BIG DOES NOT MEAN SLOW
DATA PREPROCESSING Good decisions require Good Data
Why Data Preprocessing? • Data in the real world is dirty • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • noisy: containing errors or outliers • inconsistent: containing discrepancies in codes or names • No quality data, no quality Analytics results! • Quality decisions must be based on quality data • Data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility
Major Tasks in Data Preprocessing • Data Cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data Integration • Integration of multiple databases, data cubes, or files • Data Transformation • Normalization and aggregation • Data Reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data Discretization • Part of data reduction but with particular importance, especially for numerical data
Data Cleaning and Transformation • Data Cleaning Tasks: • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Data Transformation Tasks: • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Normalization: scaled to fall within a small, specified range • Generalization: concept hierarchy climbing
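Two of the tasks above, filling in missing values and normalizing to a small specified range, can be sketched in plain Python. This is a minimal illustration: real pipelines would typically use a library such as pandas, and mean imputation is only one of several filling strategies.

```python
from statistics import mean

def fill_missing(values):
    """Fill in missing values (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def min_max_normalize(values, lo=0.0, hi=1.0):
    """Scale values to fall within the small, specified range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

raw = [10.0, None, 30.0, 40.0]          # one missing attribute value
clean = fill_missing(raw)               # None replaced by the mean, ~26.67
scaled = min_max_normalize(clean)       # all values now in [0, 1]
print(clean)
print(scaled)
```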
Data Reduction Strategies • Why data reduction? • Warehouse may store terabytes of data • Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction • Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Data reduction strategies • Data Compression • Sampling • Data cube aggregation • Dimensionality reduction • Numerosity reduction • Discretization and concept hierarchy generation
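Sampling, one of the reduction strategies above, can be sketched with reservoir sampling, a standard technique for drawing a uniform random sample from data too large to hold in memory (a stdlib-only sketch; the fixed seed is an assumption for reproducibility):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Reservoir sampling: a uniform random sample of k items from a
    stream of unknown length, holding only k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# A 10-element sample standing in for a terabyte-scale table:
sample = reservoir_sample(range(1_000_000), k=10)
print(sample)
```

The reduced sample can then be analyzed in place of the full data set, trading a little accuracy for a large saving in time and memory.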
DATA ANALYTICS You can't manage what you can't measure
Skills Requirements for Data Analytics
[Venn diagram: Data Analytics sits at the intersection of Statistics, Business Domain and Computer Science]
Meaning of Statistics
• The word statistics is used in either of two senses: commonly, to refer to data; more precisely, to refer to the principles and methods of handling numerical data.
• Statistics is defined as a branch of mathematics that deals with the collection, analysis and interpretation of numerical information.
• Statistics changes numbers into information: deciding how to collect data efficiently, using data to give information, to answer questions and to make decisions.
• Statistics is the science of learning from data.
Kinds of Data • Quantitative – Data that is numerical, counted, or compared • Demographic data • Answers to closed-ended survey items • Attendance data • Scores on standardized instruments • Qualitative – Narratives, logs, experience • Interviews • Open-ended survey items • Categories
Statistical Measures • Measure of central tendency • Mean • Median • Mode • Measure of variation • Range • Variance and standard deviation • Interquartile range • Proportion, Percentage • Ratio, Rate
Statistical Analytics in R • mean(), max(), min() • median() • var(), sd() • cor() • quantile() • summary() • hist() • plot()
Mean, Median, Quantile
vec <- c(5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9)
mean(vec)     # 6.83
median(vec)   # 8
vec <- c(5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 1000)
mean(vec)     # 89.41  -- the outlier drags the mean up
median(vec)   # 8      -- the median is unaffected
quantile(vec)
#   0%    25%    50%    75%   100%
# 2.00   4.50   8.00   9.25 1000.00
quantile(vec, probs = c(0, .75, 1))
#   0%    75%   100%
# 2.00   9.25 1000.00
quantile(vec, probs = seq(0, 1, .1))
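For readers working in Python rather than R, the same measures are available in the standard library's statistics module. A rough counterpart to the R snippet above; method='inclusive' reproduces R's default quantile interpolation (type 7):

```python
from statistics import mean, median, stdev, quantiles

vec = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 1000]

print(mean(vec))      # 89.4166... -- dragged up by the 1000 outlier
print(median(vec))    # 8.0 -- robust against the outlier
# method='inclusive' matches R's default quantile() behaviour (type 7):
print(quantiles(vec, n=4, method='inclusive'))   # [4.5, 8.0, 9.25]
print(stdev(vec))     # standard deviation, the counterpart of sd()
```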
Correlation
mtcars[, c("mpg", "cyl", "disp", "wt")]    # first rows of the data
#                    mpg cyl  disp    wt
# Mazda RX4         21.0   6 160.0 2.620
# Mazda RX4 Wag     21.0   6 160.0 2.875
# Datsun 710        22.8   4 108.0 2.320
# Hornet 4 Drive    21.4   6 258.0 3.215
# Hornet Sportabout 18.7   8 360.0 3.440
# Valiant           18.1   6 225.0 3.460
# Duster 360        14.3   8 360.0 3.570
# Merc 240D         24.4   4 146.7 3.190
# Merc 230          22.8   4 140.8 3.150
cor(mtcars[, c("mpg", "cyl", "disp", "wt")])
#             mpg        cyl       disp         wt
# mpg   1.0000000 -0.8521620 -0.8475514 -0.8676594
# cyl  -0.8521620  1.0000000  0.9020329  0.7824958
# disp -0.8475514  0.9020329  1.0000000  0.8879799
# wt   -0.8676594  0.7824958  0.8879799  1.0000000
Summary Function
summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
#      mpg             cyl             disp             wt
# Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :1.513
# 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:2.581
# Median :19.20   Median :6.000   Median :196.3   Median :3.325
# Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :3.217
# 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:3.610
# Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :5.424
DATA VISUALIZATION
Anscombe Quartet
       I            II           III           IV
   x     y      x     y      x     y      x     y
 10.0  8.04  10.0  9.14  10.0  7.46   8.0  6.58
  8.0  6.95   8.0  8.14   8.0  6.77   8.0  5.76
 13.0  7.58  13.0  8.74  13.0 12.74   8.0  7.71
  9.0  8.81   9.0  8.77   9.0  7.11   8.0  8.84
 11.0  8.33  11.0  9.26  11.0  7.81   8.0  8.47
 14.0  9.96  14.0  8.10  14.0  8.84   8.0  7.04
  6.0  7.24   6.0  6.13   6.0  6.08   8.0  5.25
  4.0  4.26   4.0  3.10   4.0  5.39  19.0 12.50
 12.0 10.84  12.0  9.13  12.0  8.15   8.0  5.56
  7.0  4.82   7.0  7.26   7.0  6.42   8.0  7.91
  5.0  5.68   5.0  4.74   5.0  5.73   8.0  6.89
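The point of the quartet can be checked numerically: all four data sets share almost exactly the same mean, variance and correlation, yet look completely different when plotted. A stdlib-only Python check, with Pearson's r computed from its definition:

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation coefficient, computed from the definition."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# The four data sets from the table (sets I-III share the same x values):
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
for x, y in datasets:
    print(f"mean(y)={mean(y):.2f}  var(y)={variance(y):.2f}  r={pearson(x, y):.3f}")
# every set: mean(y) ~ 7.50, var(y) ~ 4.12, r ~ 0.816
```

Identical summary statistics, radically different shapes: this is why visualization belongs in the analytics lifecycle and why numbers alone are not enough.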
Visualization
MACHINE LEARNING
What is Machine Learning?
• “Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge.”
• “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
• Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
Why Machine Learning? • No human experts • industrial/manufacturing control • mass spectrometer analysis, drug design, astronomic discovery • Black-box human expertise • face/handwriting/speech recognition • driving a car, flying a plane • Rapidly changing phenomena • credit scoring, financial modeling • diagnosis, fraud detection • Need for customization/personalization • personalized news reader • movie/book recommendation
Machine Learning Algorithm Categories
• Supervised Learning algorithms: Logistic Regression, Neural Networks, Support Vector Machines (SVMs), Naive Bayes classifiers, Random Forests
• Unsupervised Learning algorithms: K-means, Hierarchical clustering
• Semi-supervised Learning algorithms
• Reinforcement Learning algorithms (e.g. self-driving cars)
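As a concrete unsupervised example, k-means can be sketched in a few lines for one-dimensional data (a toy version with naive initialization, for illustration only; production code would use a library implementation):

```python
from statistics import mean

def kmeans_1d(points, k=2, iters=20):
    """Toy 1-D k-means: group points into k clusters using only the data,
    no labels -- the defining trait of unsupervised learning."""
    centroids = sorted(points)[:k]              # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point to nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]        # two obvious groups
centroids, clusters = kmeans_1d(points)
print(centroids)                                # ~[1.0, 9.07]
```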
ML vs Traditional Programming
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Machine Learning in Python
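The code from the original slide is not preserved in this export. As a placeholder, here is a minimal supervised-learning sketch in plain Python: a 1-nearest-neighbour classifier that illustrates the "Data + Output in, Program out" idea. The fruit data and the helper names are invented for illustration.

```python
def predict_1nn(train, label_of, query):
    """1-nearest-neighbour: label a query point with the label of its
    closest training example (supervised learning in its simplest form)."""
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
    return label_of[nearest]

# Labelled training data: fruit described by (weight_g, diameter_cm)
training = {(150, 7.0): "apple", (170, 7.5): "apple",
            (120, 6.0): "orange", (110, 5.8): "orange"}
points, labels = list(training), training

print(predict_1nn(points, labels, (160, 7.2)))   # apple
print(predict_1nn(points, labels, (115, 5.9)))   # orange
```

In practice one would reach for a library such as scikit-learn, but the workflow is the same: fit on labelled examples, then predict on unseen data.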
DATA GOVERNANCE
Data Governance Metrics • Digital Culture • Naming Standard • Professional Terms and Abbreviations • Data Model, Documentation and Relationship • Data Quality Rules and Metrics • Hierarchy of Data Artifacts / Entities • Classify your Data: • Master Data, Transactional Data, Reference Data
DATA is the NEW OIL! But do you have the capacity to refine it?
Information is the oil of the 21st century, and Analytics is the Combustion Engine
Q & A ? Dr. Abzetdin Adamov Email me at: aadamov@ada.edu.az Follow me at: @ Link to me at: www.linkedin.com/in/adamov Visit my blog at: aadamov.wordpress.com
