CHAPTER 8 Data-Intensive Computing
What is data-intensive computing? • Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data, ranging from hundreds of megabytes (MB) to petabytes (PB)
Characterizing data-intensive computations • Data-intensive applications not only deal with huge volumes of data but, very often, also exhibit compute-intensive properties • Datasets are commonly persisted in several formats and distributed across different locations.
Challenges ahead • Scalable algorithms that can search and process massive datasets • New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources • Advances in high-performance computing platforms aimed at providing better support for accessing in-memory multiterabyte data structures • High-performance, highly reliable, petascale distributed file systems
Challenges ahead • Data signature-generation techniques for data reduction and rapid processing • New approaches to software mobility for delivering algorithms that are able to move the computation to where the data are located • Specialized hybrid interconnection architectures that provide better support for filtering multigigabyte datastreams coming from high-speed networks and scientific instruments • Flexible and high-performance software integration techniques
Historical perspective • Data-intensive computing evolved together with storage and networking technologies, algorithms, and infrastructure software • The early age: high-speed wide-area networking • Data grids • Data clouds and “Big Data” • Databases and data-intensive computing
The early age: high-speed wide-area networking • In 1989, the first experiments in high-speed networking as a support for remote visualization of scientific data led the way • Two years later, the potential of using high-speed wide-area networks for enabling high-speed, TCP/IP-based distributed applications was demonstrated at Supercomputing 1991 • Another important milestone was set with the Clipper project
Data grids • Scientific experiments and simulations demand huge computational power and storage facilities • A data grid provides services that help users discover, transfer, and manipulate large datasets stored in distributed repositories • Data grids offer two main functionalities: high-performance and reliable file transfer for moving large amounts of data, and scalable replica discovery and management for easy access to distributed datasets
Data grids exhibit specific characteristics and introduce new challenges • Massive datasets • Shared data collections • Unified namespace • Access restrictions
Data clouds and “Big Data” • Big Data problems are not confined to scientific computing: companies whose business revolves around searching, online advertising, and social media face them as well • It is critical for such companies to efficiently analyze these huge datasets because they constitute a precious source of information about their customers • Log analysis is an example of such a data-intensive operation
Data clouds and “Big Data” Cloud technologies support data-intensive computing in several ways: • By providing a large number of compute instances on demand • By providing a storage system optimized for keeping large blobs of data • By providing frameworks and programming APIs for processing and managing large amounts of data
Databases and data-intensive computing • Distributed databases, collections of logically interrelated data stored at different sites of a computer network, are a precursor of today’s data-intensive technologies
Technologies for data-intensive computing Data-intensive computing concerns the development of applications that are mainly focused on processing large quantities of data.
Storage systems • Growing popularity of Big Data • Growing importance of data analytics in the business chain • Presence of data in several forms, not only structured • New approaches and technologies for computing
Storage systems • High-performance distributed file systems and storage clouds – Lustre – IBM General Parallel File System (GPFS) – Google File System (GFS) – Sector – Amazon Simple Storage Service (S3)
• NoSQL systems – Apache CouchDB and MongoDB – Amazon Dynamo – Google Bigtable – Hadoop HBase
High-performance distributed file systems and storage clouds • Lustre • The Lustre file system is a massively parallel distributed file system that covers needs ranging from a small workgroup cluster to a large-scale computing cluster • The file system is used by several of the Top 500 supercomputing systems • Lustre is designed to provide access to petabytes (PB) of storage and to serve thousands of clients with an I/O throughput of hundreds of gigabytes per second (GB/s)
High-performance distributed file systems and storage clouds • IBM General Parallel File System (GPFS) • GPFS is a high-performance distributed file system developed by IBM • it provides support for the RS/6000 supercomputer and Linux computing clusters • GPFS is built on the concept of shared disks • GPFS distributes the metadata of the entire file system and provides transparent access
High-performance distributed file systems and storage clouds • Google File System (GFS) • GFS is the storage infrastructure supporting distributed applications in Google’s computing cloud • The system has been designed to be a fault-tolerant, highly available, distributed file system built on commodity hardware and standard Linux operating systems • It is optimized for large files • Workloads primarily consist of two kinds of reads: large streaming reads and small random reads
High-performance distributed file systems and storage clouds • Sector • Sector is a storage cloud that supports the execution of data-intensive applications • it is deployed on commodity hardware across a wide-area network • Compared to other file systems, Sector does not partition a file into blocks but replicates entire files on multiple nodes • The system’s architecture is composed of four types of nodes: a security server, one or more master nodes, slave nodes, and client machines
High-performance distributed file systems and storage clouds • Amazon Simple Storage Service (S3) • Amazon S3 is the online storage service provided by Amazon • The system is designed to support high availability, reliability, scalability, and virtually infinite storage • It offers a flat storage space organized into buckets • Each bucket can store multiple objects, each identified by a unique key; objects are identified by unique URLs and exposed through HTTP
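For example, an object stored with key logs/app.log in a bucket named my-bucket would typically be addressable as http://my-bucket.s3.amazonaws.com/logs/app.log (the bucket and key names here are purely illustrative).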
NoSQL systems • Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, Terrastore). • Graphs (AllegroGraph, Neo4j, FlockDB, Cerebrum). • Multivalue databases (OpenQM, Rocket U2, OpenInsight). • Object databases (ObjectStore, JADE, ZODB). • Tabular stores (Google BigTable, Hadoop HBase, Hypertable). • Tuple stores (Apache River).
NoSQL systems • Apache CouchDB and MongoDB – document stores – schema-less – expose a RESTful interface and represent data in JSON format – allow querying and indexing data by using the MapReduce programming model – use JavaScript as the base language for data querying and manipulation rather than SQL
NoSQL systems • Amazon Dynamo – The main goal of Dynamo is to provide an incrementally scalable and highly available storage system – it serves 10 million requests per day – objects are stored and retrieved through a unique identifier (key)
NoSQL systems • Google Bigtable – scales up to petabytes of data across thousands of servers – Bigtable provides storage support for several Google applications – Bigtable’s key design goals are wide applicability, scalability, high performance, and high availability – Bigtable organizes data storage in tables whose rows are distributed over the distributed file system supporting the middleware
NoSQL systems • Apache Cassandra – manages large amounts of structured data spread across many commodity servers – Cassandra was initially developed by Facebook – Currently, it provides storage support for several very large Web applications such as Facebook, Digg, and Twitter – it is defined as a second-generation distributed database built around the column-family data model
NoSQL systems • Hadoop HBase – a distributed database built on top of the Hadoop distributed programming platform – HBase is designed by taking inspiration from Google Bigtable – its main goal is to offer real-time read/write operations for tables with billions of rows and millions of columns by leveraging clusters of commodity hardware
Programming platforms • Platforms for data-intensive computing must process large quantities of information • and provide runtime systems able to efficiently manage huge volumes of data • Database management systems based on the relational model have proved unsuccessful for this purpose • Data are often unstructured or semistructured • and come as files of large size, or as a huge number of medium-sized files, rather than rows in a database • Distributed workflows have also been used to express such computations
The MapReduce programming model • expresses computations in terms of two functions: map and reduce • Google introduced MapReduce for processing large quantities of data • Data transfer and management are completely handled by the distributed storage infrastructure
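Schematically, the two functions can be summarized by the following signatures (a conceptual sketch, independent of any specific framework):

    map(k1, v1)          --> list(<k2, v2>)
    reduce(k2, list(v2)) --> list(v2)

The map function transforms each input key-value pair into a list of intermediate pairs; the runtime groups the intermediate values by key and hands each group to reduce, which aggregates them into the final output.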
Examples of MapReduce • Distributed grep – recognition of patterns within text streams • Count of URL-access frequency – the map emits <URL, 1> key-value pairs; the reduce produces <URL, total-count> • Reverse Web-link graph – <target, source> pairs, reduced to <target, list(source)> • Term vector per host • Word counting
Examples of MapReduce • Inverted index – <word, document-id> pairs, reduced to <word, list(document-id)> • Distributed sort • Statistical algorithms such as Support Vector Machines (SVM), Linear Regression (LR), Naive Bayes (NB), and Neural Networks (NN)
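As a worked illustration of the URL-access-frequency example above (the URLs and counts are made up):

    Input requests:  www.a.com  www.b.com  www.a.com
    map emits:       <www.a.com, 1>  <www.b.com, 1>  <www.a.com, 1>
    reduce emits:    <www.a.com, 2>  <www.b.com, 1>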
• Any computation organized in two major stages can be represented in terms of a MapReduce computation: – Analysis • operates directly on the input data file • embarrassingly parallel – Aggregation • operates on the intermediate results of the previous stage • aimed at aggregating, summing, and/or elaborating them to present the data in their final form
Variations and extensions of MapReduce • MapReduce constitutes a simplified model for processing large quantities of data • the model can be applied to several different problem scenarios • Variations and extensions aim at enlarging the MapReduce application space and providing developers with an easier interface for designing distributed algorithms
Frameworks • Hadoop • Pig • Hive • Map-Reduce-Merge • Twister
Hadoop • Apache Hadoop is an open-source software framework for distributed storage and processing of big-data datasets using the MapReduce programming model • Initially developed and supported by Yahoo! • Yahoo!’s deployment reportedly spans 40,000 machines and more than 300,000 cores • http://hadoop.apache.org/
Pig • a platform for the analysis of large datasets • it offers a high-level language for expressing data analysis programs • Developers express their data analysis programs in a textual language called Pig Latin, which is compiled into MapReduce jobs • https://pig.apache.org/
Hive • provides a data warehouse infrastructure on top of Hadoop MapReduce • It provides tools for easy data summarization, querying, and analysis in the style of a classical data warehouse • https://hive.apache.org/
Map-Reduce-Merge • Map-Reduce-Merge is an extension of the MapReduce model that introduces a third phase, merge • the merge phase combines data already partitioned and sorted (or hashed) by the map and reduce modules
Twister • extension of the MapReduce model that allows the creation of iterative executions of MapReduce jobs. • 1. Configure Map • 2. Configure Reduce • 3. While Condition Holds True Do – a. Run MapReduce – b. Apply Combine Operation to Result – c. Update Condition • 4. Close
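The iterative structure above can be sketched as a small C# program; all helpers here are hypothetical stubs introduced for illustration (the real Twister API is Java-based and differs):

    using System;

    public static class IterativeMapReduceSketch
    {
        // Stub helpers standing in for a real MapReduce runtime.
        static double RunMapReduce(double data) => data / 2;   // one MapReduce round
        static double Combine(double data) => data;            // combine operation
        static bool Converged(double data) => data < 0.01;     // termination test

        public static void Main()
        {
            double result = 1.0;               // 1.-2. configure map/reduce (elided)
            while (!Converged(result))         // 3. iterate while the condition holds
            {
                result = RunMapReduce(result); //    a. run one MapReduce round
                result = Combine(result);      //    b. apply the combine operation
            }                                  //    c. the condition is re-evaluated
            Console.WriteLine(result);         // 4. close and report the result
        }
    }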
Alternatives to MapReduce • Sphere • All-Pairs • DryadLINQ
Sphere • Sphere is the processing engine of the Sector Distributed File System (SDFS) • Sphere implements the stream processing model (Single Program, Multiple Data) through user-defined functions (UDFs) • it is built on top of Sector’s API for data access • UDFs are expressed in terms of programs that read and write streams • A Sphere client sends a request for processing to the master node, which returns the list of available slaves; the client then chooses the slaves on which to execute Sphere processes
All-Pairs • an abstraction for a common class of data-intensive computations, such as comparing all elements of two datasets against each other (biometrics is a typical use case) • The execution proceeds in four stages: (1) model the system; (2) distribute the data; (3) dispatch batch jobs; and (4) clean up the system
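Compactly, the abstraction can be stated as All-Pairs(set A, set B, function F), which produces a matrix M such that M[i,j] = F(A[i], B[j]); in the biometrics case, F is a comparison function between two biometric samples, such as iris or face images.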
DryadLINQ • a Microsoft Research project that investigates programming models for writing parallel and distributed programs • these scale from a small cluster to a large datacenter • It automatically parallelizes the execution of applications without requiring the developer to know about distributed and parallel programming
Aneka MapReduce programming • Aneka supports the development of MapReduce applications on top of its middleware • Applications are expressed through the Mapper and Reducer classes of the Aneka MapReduce APIs
• Three classes are of importance for application development: – Mapper<K,V> – Reducer<K,V> – MapReduceApplication<M,R> – The submission and execution of a MapReduce job is performed through the class MapReduceApplication<M,R>
• Map Function APIs • IMapInput<K,V> provides access to the input key-value pair on which the map operation is performed
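A minimal sketch of a word-counter mapper against these APIs follows; it is modeled on the class names listed above, and the exact namespace and member signatures in the Aneka distribution may differ:

    using Aneka.MapReduce;   // assumed namespace for the Aneka MapReduce APIs

    public class WordCounterMapper : Mapper<long, string>
    {
        // Map is invoked once per input pair: the key is the position of
        // the line in the file, the value is the line of text itself.
        protected override void Map(IMapInput<long, string> input)
        {
            string[] words = input.Value.Split(' ');
            foreach (string word in words)
            {
                this.Emit(word, 1);   // emit <word, 1> for each occurrence
            }
        }
    }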
Reduce Function APIs • Reduce(IReduceInputEnumerator<V> input) • the reduce operation is applied to a collection of values that are mapped to the same key
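The matching reducer sums the values collected for each key; as before, this is a sketch based on the API names above, not verified against the Aneka distribution:

    public class WordCounterReducer : Reducer<string, int>
    {
        // Reduce receives an enumerator over all the values that were
        // mapped to a single key and emits the aggregated result.
        protected override void Reduce(IReduceInputEnumerator<int> input)
        {
            int sum = 0;
            while (input.MoveNext())
            {
                sum += input.Current;   // accumulate the <word, 1> counts
            }
            this.Emit(sum);             // total occurrences for this word
        }
    }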
• MapReduceApplication<M,R> • jobs are submitted through the InvokeAndWait method inherited from ApplicationBase<M,R> • the WordCounterMapper and WordCounterReducer classes implement the word-counter example
• The parameters that can be controlled – Partitions – Attempts – SynchReduce – IsInputReady – FetchResults – LogFile
• WordCounter job: the driver program
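A possible driver is sketched below; the namespaces, the configuration-loading call, and the configuration file name are assumptions for illustration:

    using Aneka.Entity;      // assumed namespace
    using Aneka.MapReduce;   // assumed namespace

    public class Program
    {
        public static void Main(string[] args)
        {
            // Load the Aneka cloud settings (file name is illustrative).
            Configuration conf = Configuration.GetConfiguration("conf.xml");

            var app = new MapReduceApplication<WordCounterMapper,
                                               WordCounterReducer>(
                "WordCounter", conf);

            // InvokeAndWait submits the job and blocks until it completes.
            app.InvokeAndWait();
        }
    }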
Runtime support • Task scheduling – handled by the MapReduceScheduler class • Task execution – handled by the MapReduceExecutor
Distributed file system support • Aneka’s default Storage Service supports the other programming models • The MapReduce model does not leverage the default Storage Service for storage and data transfer • it uses a distributed file system implementation instead • its requirements for file management are significantly different with respect to the other models • Distributed file system implementations guarantee high availability and better efficiency
• Aneka provides the capability of interfacing with different storage implementations • Retrieving the location of files and file chunks • Accessing a file by means of a stream • the classes SeqReader and SeqWriter support reading and writing files
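A hypothetical usage sketch of SeqReader follows; every member name here is an assumption introduced for illustration, not the documented Aneka API:

    // Hypothetical sketch: member names are assumed, not verified.
    SeqReader reader = new SeqReader(filePath);
    while (reader.HasNext())               // assumed iteration test
    {
        object key = reader.NextKey();     // assumed accessors for the
        object value = reader.NextValue(); // current key-value pair
        // ... process the <key, value> pair ...
    }
    reader.Close();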
Example application • MapReduce is a very useful model for processing large quantities of data • which are often semistructured, such as logs or Web pages • the example that follows parses the logs produced by the Aneka container
Parsing Aneka logs • Aneka components produce a lot of information that is stored in the form of log files • each log line has the format: DD MMM YY hh:mm:ss level - message
Mapper design and implementation
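A possible mapper for the log format above, keyed on the log level so that the reducer can compute the distribution of messages by level; this is a sketch under the same assumptions as the word-counter example, and it assumes well-formed lines:

    public class LogParsingMapper : Mapper<long, string>
    {
        // Each value is one log line: "DD MMM YY hh:mm:ss level - message".
        protected override void Map(IMapInput<long, string> input)
        {
            string line = input.Value;
            // The timestamp occupies the first 18 characters
            // ("DD MMM YY hh:mm:ss"); the level token follows it.
            string rest = line.Substring(18).TrimStart();
            string level = rest.Substring(0, rest.IndexOf(' '));
            this.Emit(level, 1);   // emit <level, 1>
        }
    }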
Reducer design and implementation
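The reducer is structurally identical to the WordCounterReducer sketched earlier: it enumerates the values associated with each log level and emits their sum, which yields the number of log messages per level.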
Driver program
Result
