1 NoSQL: Not Only SQL
2 [Image: Digital, Social & Mobile Worldwide 2015 statistics]
3 Source: http://wearesocial.net/blog/2015/01/digital-social-mobile-worldwide-2015/
4 THE 3 Vs o VELOCITY o VOLUME o VARIETY
6 RELATIONAL DATABASE MANAGEMENT SYSTEM Relational Model - data is represented in terms of tuples (rows). Key Concepts o Table - a collection of data elements organized in terms of rows and columns o Field - a column in a table designed to maintain specific information about every record in the table o Record - a horizontal entity that represents a set of related data o Column - a vertical entity containing values of a particular type
7 RELATIONAL DATABASE MANAGEMENT SYSTEM INTEGRITY RULES o Entity Integrity o Domain Integrity o Referential Integrity o User-Defined Integrity
8 RELATIONAL DATABASE MANAGEMENT SYSTEM Pros: o Supports a simple data structure o Limits redundancy o Better integrity o Offers logical database independence o Supports one-off (ad hoc) queries using SQL o Better backup & recovery procedures. Cons: o Poor representation of the real world o Difficult to represent hierarchies o Difficult to represent complex data types
9 RDBMS VS NOSQL o RDBMS: scale up / NoSQL: scale out o RDBMS: handles structured data / NoSQL: handles semi-structured and unstructured data o RDBMS: atomic transactions / NoSQL: eventual consistency o RDBMS: object-relational impedance mismatch / NoSQL: maps naturally to the object model o RDBMS: strict schema / NoSQL: schema-less
10 DISTRIBUTED SYSTEMS A distributed database system consists of loosely coupled sites that share no physical components. Homogeneous DDBMS: all sites have identical software, are aware of each other and work cooperatively in processing user requests. Heterogeneous DDBMS: different sites may use different schemas and software, and provide limited facilities for cooperation in transaction processing.
11 DISTRIBUTED SYSTEMS Sharding: split the data among multiple machines while ensuring that data is always accessed from the correct place. Replication: multiple instances of the database that each mirror all the data of the others.
12 WHY NOSQL The global NoSQL market is forecast to reach $3.4 billion in 2020, representing a compound annual growth rate (CAGR) of 21% for the period 2015 – 2020. Sources: http://www.technologies.org/?p=102 , http://www.marketresearchmedia.com/?p=568
13 BIG USERS
14 BIG DATA
15 THE INTERNET OF THINGS
16 CLOUD COMPUTING
17 FLEXIBLE DATA MODEL
18 SCALABILITY AND PERFORMANCE
19 WHAT IS ACID? o Atomicity: a transaction is all or nothing o Consistency: only valid data is written to the database o Isolation: pretend all transactions are happening serially and the data is correct o Durability: what you write is what you get
20 CAP THEOREM o Consistency: all clients always have the same view of the data o Availability: each client can always read and write o Partition Tolerance: the system works well despite physical network partitions. You can have at most two of these properties in any shared data system.
21 AN ALTERNATIVE TO ACID IS BASE o Basically Available: the system seems to work all the time o Soft State: it doesn't have to be consistent all the time o Eventual Consistency: it becomes consistent at some later time
22 NOSQL DATABASE CATEGORIES o Key Value Store o Document Store o Wide Column Store o Graph Databases
23 KEY VALUE STORE - OVERVIEW o Most basic type of NoSQL database and the basis for the other three o Schema-free o Stores data as key-value pairs o Key-value stores can be used as collections, dictionaries, associative arrays etc. Example DBs: Redis, Project Voldemort, Amazon DynamoDB. Example record (Key: Value) - Row_Id: 100, First_Name: Saman, Last_Name: Silva, Address: 123, Galle Rd, Beruwala, Last_Order: 2001
24 WIDE COLUMN STORE - OVERVIEW o Stores data in a columnar format o Semi-schematic o Allows key-value pairs to be stored o Each key (super column) is associated with multiple attributes o Stores data in column-specific files. Example DBs: Apache HBase, Cassandra, Bigtable, Hadoop. Example (Super_Column -> Sub_Column Key: Value) - Super_Column: Name { First_Name: Saman, Last_Name: Silva }, Super_Column: Address { No: 125, Road: Galle Rd, City: Beruwala }
25 DOCUMENT STORE - OVERVIEW o Everything is stored in a document o Schema-free o Data is stored inside documents in JSON or BSON format o A document is a key-value collection. Example DBs: MongoDB, CouchDB. Example - Database: Customers { Document_Id: 100, First_Name: Saman, Last_Name: Silva, Address: { Number: 125, Road: Galle Rd, City: Beruwala }, Order: { Most_Recent: 2001 } }; Database: Orders { Document_Id: 2001, Price: Rs 450, Item1: 1001, Item2: 1002 }, { Document_Id: 2002, Price: Rs 750, Item1: 1003, Item2: 1001 }
26 GRAPH DATABASE - OVERVIEW o A collection of nodes & edges o A node represents an entity & an edge represents a connection between two nodes o Stores data in a graph o Within nodes, data is stored as Key: Value pairs o Mostly used in social network applications such as Facebook and Twitter o Example DBs: Neo4j, Titan. Example: nodes (Name: Shelan), (Name: Hansa) and (WorkPlace: Virtusa) connected by WORKS_IN and IS_FRIEND_OF edges
27 KEY VALUE STORE o Most basic NoSQL database type o Stores data as a dictionary or hash o Dictionaries contain a collection of objects or records o Works quite differently from an RDBMS
28 KEY VALUE STORE Example database with Customer and Order collections. Customer: { Row_Id: 100, First_Name: Saman, Last_Name: Silva, Address: 123, Galle Rd, Beruwala, Last_Order: 2001 }, { Row_Id: 101, First_Name: Nuwan, Last_Name: Perera, Address: 1/2, Galle Rd, Kalutara, Last_Order: 2002 }. Order: { Row_Id: 2001, Price: Rs 450, Item1: 1001, Item2: 1003, Item3: 1005 }, { Row_Id: 2002, Price: Rs 750, Item1: 1001, Item2: 1002, Item3: 1003 }
29 WHEN TO USE KEY VALUE STORE o Caching: quickly storing and retrieving data o Queuing: some K/V stores support lists, sets, queues and more o Distributing information and tasks o Keeping live information
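The caching use case above can be illustrated with a minimal Python sketch using the redis-py client (assumed installed via pip install redis) against a Redis server on localhost:6379; the get_customer function and key layout are hypothetical.

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def get_customer(customer_id):
    cache_key = f"customer:{customer_id}"              # hypothetical key layout
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)                      # cache hit: skip the slow lookup
    customer = {"id": customer_id, "name": "Saman"}    # stand-in for an expensive DB/API call
    r.setex(cache_key, 60, json.dumps(customer))       # keep the cached copy for 60 seconds
    return customer

print(get_customer(100))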
30 ADVANTAGES OF KEY VALUE STORE o Supports horizontal scaling o High performance o Schema-less data store o Works differently from an RDBMS o Flexible, and more closely follows modern concepts like OOP o Provides the basic K/V concept underlying the other three major NoSQL DB types
31 REDIS – KEY-VALUE STORE DATABASE o Open source, advanced key-value store o 3 main specialties: holds its database entirely in memory, has a relatively rich set of data types, can replicate data to any number of slaves o 2 types of persistence: RDB persistence and AOF persistence o 5 data types. http://www.redis.io http://redis.io/download
32 REDIS FEATURES o Exceptionally fast o Supports rich data types o Operations are atomic o Multi-utility tool
33 REDIS DATA TYPES o Strings o Lists o Sets o Sorted Sets o Hashes
34 REDIS - STRING >SET stringvalue "This is a String Value" >OK >GET stringvalue >"This is a String Value"
35 REDIS - LISTS >LPUSH customer Hansa >(integer)1 >LPUSH customer Hasangi >(integer)2 >RPUSH customer Rajith >(integer)3 >LPUSH customer Hasangi >(integer)4 >RPUSH customer Hijas >(integer)5 >LRANGE customer 0 4 1) "Hasangi" 2) "Hasangi" 3) "Hansa" 4) "Rajith" 5) "Hijas"
36 REDIS - SETS >SADD customer Hansa >(integer)1 >SADD customer Hasangi >(integer)1 >SADD customer Rajith >(integer)1 >SADD customer Hasangi >(integer)0 >SADD customer Hijas >(integer)1 >SMEMBERS customer 1) "Hijas" 2) "Rajith" 3) "Hasangi" 4) "Hansa"
37 REDIS – SORTED SETS >ZADD customer 1 Hasangi >(integer)1 >ZADD customer 3 Rajith >(integer)1 >ZADD customer 4 Shelan >(integer)1 >ZADD customer 2 Hijas >(integer)1 >ZADD customer 0 Hansa >(integer)1 >ZRANGE customer 0 4 1) "Hansa" 2) "Hasangi" 3) "Hijas" 4) "Rajith" 5) "Shelan"
38 REDIS - HASHES >HMSET customer:1 name "Shelan" address "Beruwala" >OK >HMSET customer:2 name "Rajith" address "Homagama" >OK >HGETALL customer:1 1) "name" 2) "Shelan" 3) "address" 4) "Beruwala" >HGETALL customer:2 1) "name" 2) "Rajith" 3) "address" 4) "Homagama"
39 REDIS – PUB/SUB o Publishers publish messages to named channels (e.g. "RedisChat") o Subscribers subscribe to one or more channels and receive every message published to them o A channel can have many publishers and many subscribers
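A hedged sketch of the pub/sub flow above using the redis-py client; the channel name "RedisChat" follows the slide, and in practice the subscriber would normally run in a separate process from the publisher.

import redis

r = redis.Redis(host="localhost", port=6379)

pubsub = r.pubsub()
pubsub.subscribe("RedisChat")                            # this connection acts as the subscriber

r.publish("RedisChat", "Hi, I'm a RedisChat publisher")  # any other client could publish too

for message in pubsub.listen():                          # first yields the subscribe confirmation
    if message["type"] == "message":
        print(message["data"])                           # b"Hi, I'm a RedisChat publisher"
        break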
40 REDIS – TRANSACTIONS o Execute a group of commands in a single step o Has 2 properties: all commands in a transaction are sequentially executed as a single isolated operation, and a Redis transaction is also atomic. Example: >SET accountA 100 >OK >SET accountB 100 >OK >GET accountA >"100" >GET accountB >"100" >MULTI >INCRBY accountA -50 >QUEUED >INCRBY accountB 50 >QUEUED >EXEC >(integer)50 >(integer)150 >GET accountA >"50" >GET accountB >"150"
41 REDIS – DISK PERSISTENCE RDB Persistence: o Point-in-time snapshots of the whole dataset o Compact, ideal for regular backups/archiving o Multiple save points available o Faster restarts compared to AOF o Very good for disaster recovery. AOF Persistence: o Writes every command like a tape (append-only log) o Gets rewritten when it grows too big o Can be easily parsed & edited o AOF files are bigger than RDB files o Slower than RDB
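Both persistence modes can also be exercised from a client. The snippet below is illustrative only (redis-py against a local server): it flips the appendonly setting at runtime, asks the server for an RDB snapshot, and requests an AOF rewrite.

import redis

r = redis.Redis(host="localhost", port=6379)

r.config_set("appendonly", "yes")    # enable AOF persistence without restarting the server
r.bgsave()                           # request a point-in-time RDB snapshot in the background
r.bgrewriteaof()                     # request a rewrite/compaction of the AOF log
print(r.config_get("save"), r.config_get("appendonly"))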
42 REDIS – REPLICATION o Uses asynchronous replication o A master can have multiple slaves o Slaves accept connections from other slaves o Non-blocking on both the master and slave side o Redis Sentinel provides high availability: automatic failover, monitoring, notification and acting as a configuration provider
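A small sketch of reading and writing through Redis Sentinel with redis-py, assuming a Sentinel instance on localhost:26379 that monitors a master group named "mymaster" (both are placeholder values).

from redis.sentinel import Sentinel

sentinel = Sentinel([("localhost", 26379)], socket_timeout=0.5)

master = sentinel.master_for("mymaster", socket_timeout=0.5)   # writes go to the current master
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # reads can be served by a slave

master.set("greeting", "hello")
print(replica.get("greeting"))   # may briefly lag, since Redis replication is asynchronous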
43 WIDE-COLUMN STORE DATABASES o Stores data as sections of columns rather than as rows o Can hold very large numbers of dynamic columns o The benefit of storing data in columns is fast search/access and data aggregation o Well suited to data warehouses and customer relationship management (CRM) systems o A wide variety of companies and organizations use Hadoop for both research and production
44 HADOOP o It is not a single piece of software; it is a framework of tools o The objective is to run applications on big data o An open-source set of tools distributed under the Apache license o A distributed file system (HDFS) o An environment to run Map-Reduce tasks – typically in batch mode o A NoSQL database – HBase o A real-time query engine (Impala)
45 HADOOP'S APPROACH Big data is broken into pieces; a computation runs on each piece in parallel, and the partial results are combined into a final result.
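The split-compute-combine idea is the classic MapReduce word count. The sketch below is purely conceptual Python, not tied to any particular Hadoop API: the mapper emits (word, 1) pairs, Hadoop would sort them between the phases, and the reducer combines the counts per word.

from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1                                   # emit (word, 1) for every word

def reducer(pairs):
    # pairs arrive grouped by key (Hadoop sorts between the map and reduce phases)
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data is broken into pieces", "pieces are processed in parallel"]
    for word, total in reducer(mapper(text)):
        print(word, total)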
46 HADOOP ARCHITECTURE o MapReduce o File System (HDFS) o Related projects (set of Hadoop tools): Ambari, Cassandra, HBase, Mahout, Spark, ZooKeeper
47 HADOOP DISTRIBUTED MODEL The master computer/s run the Job Tracker and the Name Node; the slave computers (commodity hardware) each run a Task Tracker and a Data Node.
49 HADOOP DATA ACCESS An application talks to the Name Node / Job Tracker on the master, which directs the work to the Data Nodes and Task Trackers on the slave computers.
50 HADOOP DATA FAULT TOLERANCE Data blocks are replicated across several Data Nodes, so the failure of an individual slave computer does not lose data or stop the job.
51 HOW HADOOP SOLVES BIG DATA CHALLENGES OF PROGRAMMERS Hadoop takes care of o file location o managing failures o breaking computations into pieces o scaling, so programmers can focus on writing scale-free programs.
52 SCALABILITY [Chart: processing speed vs. number of computers in a master/slave cluster, and the associated cost]
53 HBASE An open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. Features o Linear and modular scalability o Strictly consistent reads and writes o Automatic and configurable sharding of tables o Automatic failover support between Region Servers o Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables o Easy to use Java API for client access o Block cache and Bloom filters for real-time queries
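One common way to reach HBase from Python is the HappyBase client, which talks to the HBase Thrift gateway; the snippet below is a sketch that assumes a Thrift server on localhost:9090 and uses made-up table and column names.

import happybase

connection = happybase.Connection("localhost", port=9090)
connection.create_table("customer", {"info": dict()})      # one column family named "info"

table = connection.table("customer")
table.put(b"row-100", {b"info:first_name": b"Saman", b"info:last_name": b"Silva"})

row = table.row(b"row-100")
print(row[b"info:first_name"])                              # b'Saman'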
54 GRAPH DATABASES What is a Graph? [Example: nodes such as Shelan, Hansa, Hijaz and Hasangi connected by FOLLOWS relationships, e.g. following @Hansa and #nosql]
55 GRAPH DATABASES What is a Graph Database? A database that uses graph structures to represent & store data. Key features o Excellent at dealing with relationships o High performance o Flexible o Query language support. Example: (Rajith { Name: Rajith, City: Kottawa, Married: false }) -[:WORKS_FOR { Since: 2014/11/24 }]-> (Virtusa { Name: Virtusa, City: Colombo })
56 GRAPH DATABASES Graph databases vs relational databases o Relational: tables / Graph: nodes o Relational: schema with nullables / Graph: no schema o Relational: relationships via foreign keys / Graph: relationships are first-class citizens o Relational: related data fetched with joins / Graph: related data fetched with pattern matching
57 NEO4J o ACID graph DB o Written in Java o Enterprise features o Scales to billions of entities o REST API
58 NEO4J What is Cypher? o Graph query language o Declarative o Pattern matching o Clauses
59 NEO4J Cypher basic syntax: (a)-[r]->(b), where (a) and (b) are nodes and [r] is the relationship between them
60 NEO4J - CYPHER Node with properties: (a { name: "rajith", born: 1989 }). Relationship with properties: (a)-[:WORKED_IN { roles: ["ASE"] }]->(b). Labels: (a:Person { name: "rajith" })
61 NEO4J - CYPHER Querying with Cypher: MATCH (a)-->(b) RETURN a, b; MATCH (a)-[r]->(b) RETURN a.name, type(r); Using clauses: MATCH (a:Person) WHERE a.name = "rajith" RETURN a;
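The same Cypher can be run from an application; here is a minimal sketch using the official Neo4j Python driver, where the bolt URI and credentials are placeholders.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run("CREATE (a:Person {name: 'rajith', born: 1989})")
    result = session.run(
        "MATCH (a:Person) WHERE a.name = $name RETURN a.name AS name, a.born AS born",
        name="rajith",
    )
    for record in result:
        print(record["name"], record["born"])

driver.close()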
62 DOCUMENT STORE o A collection of documents o Data in this model is stored inside documents o A document is a key-value collection where the key allows access to its value o Documents are not typically forced to have a schema and are therefore flexible and easy to change o Documents are stored in collections in order to group different kinds of data o Documents can contain many different key-value pairs, key-array pairs, or even nested documents o Usually uses a JSON (BSON)-like interchange model, so application logic can be written easily
63 WHAT IS MONGODB? o A scalable, high-performance, open-source, document-oriented database written in C++ o Built for speed o Rich document-based queries for easy readability o Full index support for high performance o Replication and failover for high availability o Auto-sharding for easy scalability o Map/Reduce for aggregation
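A minimal PyMongo sketch (assuming pip install pymongo and a mongod on localhost:27017); the database, collection and field names mirror the examples on the surrounding slides but are otherwise arbitrary.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

db.user.insert_one({"first": "John", "last": "Doe", "age": 39,
                    "interests": ["Reading", "Mountain Biking"]})

print(db.user.find_one({"age": 39}))    # query by example, as in the comparison slide below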
64 KEYWORDS COMPARISON RDBMS -> MongoDB: Database -> Database; Table, View -> Collection; Row -> Document (JSON, BSON); Column -> Field; Index -> Index; Join -> Embedded Document; Foreign Key -> Reference; Partition -> Shard. Example: > db.user.findOne({age:39}) { "_id" : ObjectId("5114e0bd42…"), "first" : "John", "last" : "Doe", "age" : 39, "interests" : [ "Reading", "Mountain Biking" ], "favorites": { "color": "Blue", "sport": "Soccer" } }
65 MONGODB ADVANCED FEATURES o Replication o Indexing o Aggregation o Sharding o Capped Collections
66 REPLICATION o Replication is the process of synchronizing data across multiple servers o Replication provides redundancy and increases data availability o The minimum replica set in MongoDB consists of a primary DB, a secondary DB and an arbiter
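Connecting an application to a replica set is mostly a matter of the connection string; in this PyMongo sketch the host names and the replica-set name rs0 are placeholders, and the driver discovers the current primary and follows failovers automatically.

from pymongo import MongoClient

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"    # allow reads from secondaries
)
db = client["shop"]
db.user.insert_one({"first": "Nuwan"})                     # writes always go to the primary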
67 AUTOMATIC FAILOVER
68 INDEXING o Indexes support the efficient execution of queries in MongoDB o MongoDB can use an index to limit the number of documents it must inspect o Indexes use a B-tree data structure o An index can be created with the "ensureIndex" method: >db.COLLECTION_NAME.ensureIndex({KEY:1}) o Key is the name of the field on which you want to create the index o 1 is for ascending order o -1 is for descending order
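The same indexes can be created from a driver; note that the mongo shell's ensureIndex() shown above was later renamed createIndex(), which PyMongo exposes as create_index(). Field names below are illustrative.

from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient("mongodb://localhost:27017")["shop"]

db.user.create_index([("age", ASCENDING)])                        # single-field index
db.user.create_index([("last", ASCENDING), ("age", DESCENDING)])  # compound index
print(db.user.index_information())                                # includes the default _id index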
69 WITHOUT INDEXING The server has to read every document in the collection to find the result.
70 WITH INDEXING The server uses the index to locate matching documents without scanning the whole collection.
71 INDEX TYPES o Default _id Index o Single Field Index o Compound Index o Multikey Index o Geo Index o Text Index o Hashed Index
72 AGGREGATIONS Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations. Aggregation concepts o Aggregation Pipelines o Map-Reduce o Single Purpose Aggregation Operations
73 AGGREGATION PIPELINES The pipeline provides efficient data aggregation using native operations within MongoDB, and is the preferred method for data aggregation in MongoDB.
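A small pipeline example with PyMongo: $match filters the input documents, $group computes a per-customer total, and $sort orders the output. The collection and field names are assumptions based on the earlier Orders example.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

pipeline = [
    {"$match": {"status": "delivered"}},                               # stage 1: filter
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$price"}}},  # stage 2: aggregate
    {"$sort": {"total": -1}},                                          # stage 3: order results
]
for doc in db.orders.aggregate(pipeline):
    print(doc)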
74 MAP-REDUCE MongoDB also provides map-reduce operations to perform aggregation.
75 SINGLE PURPOSE AGGREGATION OPERATIONS MongoDB provides special-purpose database commands. All of these operations aggregate documents from a single collection. Common aggregation operations are: o returning a count of matching documents o returning the distinct values for a field o grouping data based on the values of a field
76 SHARDING Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.
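A hedged sketch of turning sharding on through the admin commands; this would normally be run against a mongos router rather than a plain mongod, and the database, collection and shard-key names are assumptions.

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

client.admin.command("enableSharding", "shop")                # shard the "shop" database
client.admin.command("shardCollection", "shop.orders",
                     key={"customer_id": "hashed"})           # distribute by hashed shard key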
77 CAPPED COLLECTIONS o Capped collections are fixed-size, circular collections that preserve insertion order to support high performance for create, read and delete operations o Capped collections restrict updates to documents if the update would increase the document size o Capped collections are best for storing log information, cache data or other high-volume data
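Creating and using a capped collection from PyMongo; the collection name and the 1 MB / 1000-document limits below are arbitrary example values.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

log = db.create_collection("request_log", capped=True, size=1024 * 1024, max=1000)
log.insert_one({"path": "/orders", "status": 200})            # oldest entries age out at the cap
print("request_log" in db.list_collection_names())            # True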
78 NOSQL DATABASE CATEGORIES o Key Value Store o Document Store o Wide Column Store o Graph Databases
79 NOSQL DATABASES SUMMARY
Database model - HBase: wide column store; MongoDB: document store; Neo4j: graph DBMS; Redis: key-value store
Initial release - HBase: 2008; MongoDB: 2009; Neo4j: 2007; Redis: 2009
License - all four: open source
DBaaS - all four: no
Implementation language - HBase: Java; MongoDB: C++; Neo4j: Java; Redis: C
Server operating systems - HBase: Linux, Unix, Windows; MongoDB: Linux, OS X, Solaris, Windows; Neo4j: Linux, OS X, Windows; Redis: BSD, Linux, OS X, Windows
Data scheme - all four: schema-free
Source: http://db-engines.com/en/system/HBase%3BMongoDB%3BNeo4j%3BRedis
80 NOSQL DATABASES SUMMARY
Secondary indexes - HBase: no; MongoDB: yes; Neo4j: yes; Redis: no
SQL - all four: no
APIs and other access methods - HBase: Java API, RESTful HTTP, Thrift; MongoDB: proprietary protocol using JSON; Neo4j: Cypher query language, Java API, RESTful HTTP; Redis: proprietary protocol
Supported programming languages - HBase: C, C#, C++, Groovy, Java, PHP, Python, Scala; MongoDB: Actionscript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MatLab, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala, Smalltalk; Neo4j: .Net, Clojure, Go, Groovy, Java, JavaScript, Perl, PHP, Python, Ruby, Scala; Redis: C, C#, C++, Clojure, Dart, Erlang, Go, Haskell, Java, JavaScript, Lisp, Lua, Objective-C, Perl, PHP, Python, Ruby, Scala, Smalltalk, Tcl
Source: http://db-engines.com/en/system/HBase%3BMongoDB%3BNeo4j%3BRedis
81 NOSQL DATABASES SUMMARY
Triggers - HBase: yes; MongoDB: no; Neo4j: yes; Redis: no
Partitioning methods - HBase: sharding; MongoDB: sharding; Neo4j: none; Redis: sharding
Replication methods - HBase: selectable replication factor; MongoDB: master-slave replication; Neo4j: master-slave replication; Redis: master-slave replication
MapReduce - HBase: yes; MongoDB: yes; Neo4j: no; Redis: no
Consistency concepts - HBase: immediate consistency; MongoDB: eventual consistency, immediate consistency; Neo4j: eventual consistency configurable in a High Availability cluster setup, immediate consistency; Redis: eventual consistency
Source: http://db-engines.com/en/system/HBase%3BMongoDB%3BNeo4j%3BRedis
82 NOSQL DATABASES SUMMARY
Foreign keys - HBase: no; MongoDB: no; Neo4j: yes; Redis: no
Transaction concepts - HBase: no; MongoDB: no; Neo4j: ACID; Redis: optimistic locking
Concurrency - all four: yes
Durability - all four: yes
In-memory capabilities - Redis: yes (not listed for the others)
User concepts - HBase: Access Control Lists (ACL); MongoDB: access rights for users and roles; Neo4j: no; Redis: very simple password-based access control
Source: http://db-engines.com/en/system/HBase%3BMongoDB%3BNeo4j%3BRedis
83 THANK YOU

Editor's Notes

  • #7 A Relational Database management System(RDBMS) is a database management system based on relational model introduced by E.F Codd.  Many popular databases currently in use are based on the relational database model. The data in RDBMS is stored in database objects called tables. The table is a collection of related data entries and it consists of columns and rows. table is the most common and simplest form of data storage in a relational database. A field is a column in a table that is designed to maintain specific information about every record in the table. A record, also called a row of data, is each individual entry that exists in a table. record is a horizontal entity in a table that represents set of related data. A column is a vertical entity in a table that contains all information associated with a specific field in a table. a column is a set of value of a particular type
  • #8 Entity Integrity: There are no duplicate rows in a table. the rows in a relational table should all be distinct. Domain Integrity: Enforces valid entries for a given column by restricting the type, the format, or the range of values. column values must not be repeating groups or arrays Referential integrity: Rows cannot be deleted, which are used by other records. User-Defined Integrity: Enforces some specific business rules that do not fall into entity, domain or referential integrity. the concept of a null value- A blank is considered equal to another blank, a zero is equal to another zero, but two null values are not considered equal.
  • #13 The NoSQL market is expected to grow 21 percent annually and reach 3.4 billion US dollars in 2020. Why is this growth expected? Because developing NoSQL applications at Facebook, Twitter and in biotechnology, defense, image processing and many more fields has proved successful. NoSQL is moving in to become a major player in the database marketplace.
  • #14 NoSQL supports big users. In the early days, 10,000 concurrent users was an extreme case, but now apps must support millions of different users a day, around the globe, 24 hours a day, 365 days a year. Supporting large numbers of concurrent users is important, but because app usage requirements are hard to predict, it is just as important to dynamically support rapidly growing numbers of concurrent users. With relational technologies, many application developers find it difficult, or even impossible, to get the dynamic scalability and level of scale they need while also maintaining the performance users demand. Only NoSQL can help to achieve this target.
  • #15 NOSQL also supports Big Data. You can see according to the graph, the usage of structured and semi-structured data usage has increased with time. Explosive growth in internet usage, in addition to the use of mobile and social apps, and machine-to-machine communications, has introduced new data types. However, capturing and using big data requires a very different type of database. Unfortunately, the rigidly defined schema-based approach used by relational databases makes it impossible to quickly incorporate new types of data and is a poor fit for unstructured and semi-structured data. NOSQL provides a much more flexible data model that better maps to an applications data organization.
  • #16 Today 20 billion devices are connected to the Internet: for example, smart phones, tablets, home appliances, and devices in cars, hospitals, warehouses and more. These devices collect data on environment, location, movement, temperature and so on. Innovative enterprises rely on NoSQL technology to scale concurrent data access to millions of connected devices and systems, store billions of data points, and meet their performance requirements.
  • #17 Today, most new applications run in a public, private, or hybrid cloud, support large numbers of users, and use a three-tier internet architecture. In the cloud, a load balancer directs the incoming traffic to a scale-out tier of web/application servers that process the logic of the application. NoSQL databases are built from the ground up to be distributed, scale-out technologies and are therefore a better fit with the highly distributed nature of the three-tier internet architecture.
  • #18 Relational and NOSQL data models are very different. The relational model takes data and separates it into many interrelated tables that contain rows and columns. You can store a JSON document in NOSQL which might take all the data stored in 20 tables of a relational database. Another major difference is that relational technologies have rigid schemas. NOSQL has no strict schema like relational database. The format of the data being inserted can be changed at any time, without application disruption.
  • #19 There are two options for dealing with increased concurrent users and volume of data: scale the database up or scale it out. A relational database has limitations in scaling up. To support more concurrent users and store more data, relational databases require a bigger and more expensive server with more CPUs, memory, and disk storage. At some point, the capacity of even the biggest server can be outstripped and the relational database cannot scale further. Scaling out the database tier with NoSQL provides an easier, linear, and cost-effective approach to database scaling. As the number of concurrent users grows, simply add additional low-cost, commodity servers to your cluster. There's no need to modify the application, since the application always sees a single (distributed) database.
  • #20 A transaction is a logical unit that is independently executed for data retrieval or update. ACID is a set of properties that apply specifically to database transactions and guarantee that they are processed reliably. Let's examine the ACID requirements for a database transaction system in more detail. Atomicity means either all of the tasks within a transaction are performed or none are (the all-or-none rule). Consistency means the transaction meets all rules defined by the system at all times; the transaction does not violate those rules and the database must remain in a consistent state at the beginning and end of a transaction, with no half-completed transactions. Isolation: no transaction has access to any other transaction that is in an intermediate or unfinished state; each transaction is independent. Finally, durability means that once a transaction is complete it will persist: the completed transaction will survive system failure, power loss and other types of system breakdowns.
  • #21 The CAP Theorem, also known as Brewer's Theorem, says that there are three essential system requirements for the successful design, implementation and deployment of applications in distributed computing systems: Consistency, Availability and Partition Tolerance. Consistency means that each client always has the same view of the data (the same idea as consistency in ACID). High availability means that all clients can always read and write. Partition tolerance means the system will continue to work unless there is a total network failure; a few nodes can fail and the system keeps going. Attaining all three is not possible, however: it turns out you can have at most two of these three characteristics.
  • #22 The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem. The types of large systems based on CAP aren't ACID they are BASE. Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc.  Let's review BASE standards:  Basically Available: This constraint states that the system does guarantee the availability of the data as regards CAP Theorem; there will be a response to any request. But, that response could still be ‘failure’ to obtain the requested data or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account. Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to ‘eventual consistency,’ thus the state of the system is always ‘soft.’ Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one. The BASE model isn't appropriate for every situation, but it is certainly a flexible alternative to the ACID model for databases that don't require strict adherence to a relational model.
  • #23 Key Value Store: a global collection of Key:Value pairs (e.g. Name is the key, "Saman" is the value). Schema-free: every record can have different keys. The most common category and the basis for the other three NoSQL database categories. Examples: Redis, Amazon SimpleDB, Project Voldemort, Riak, Windows Azure. Document Store: similar to key/value, but the major difference is that the value is a document. Flexible schema / schema-free – any number of fields can be added; values (documents) are stored as JSON or BSON. Wide Column Store: each key (super column) is associated with multiple attributes. Semi-schematic, not schema-free – we need to specify groups of columns (known as column families); data is stored in column-specific files. Graph databases: a collection of nodes and edges, where each node represents an entity and each edge represents a connection or relationship between two nodes; data is stored as a graph. Other categories include multi-model databases, object databases, and unresolved/uncategorized systems.
  • #24 The most basic NoSQL database category and the basis for the other three major categories. Schema-free: allows developers to store schema-less data (every record can have different keys). The database stores data as key-value pairs; each key is unique and the value can be a string, JSON, a BLOB (binary large object), etc. Key-value stores can be used as collections, dictionaries, associative arrays and so on. For example, suppose we have a sales database with customer and order collections, each with unique rows. Here we have one row with key 100 and its key-value pairs: first name, last name, address, and a last order that points to another record – but there is no explicit relation between customers and orders.
  • #25 Stores data in a columnar format where the columns are treated individually. Wide column stores have tables, but the tables do not belong to a database – there is no such thing as a database here. Tables have rows, and rows have super columns and columns within them. The super columns are defined when the tables are defined; in this example, Name and Address.
  • #26 Everything is stored in a document; we can say a document store is a collection of documents. Schema-free: documents are not typically forced to have a schema and are therefore flexible and easy to change. Instead of containing rows, collections contain documents, but conceptually a document is similar to a row and still holds key-value pairs inside. The small difference is that the value of a key can itself be a document, or can point to another document in another database. For example, the customer document with id 100 has an address key whose value is itself a document, and an orders key whose value points to document 2001 in the Orders database.
  • #28 Key / Value Store KV can be considered the most basic and backbone implementation of NoSQL. This is designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash. Stores data as hash table. each Key is unique, key may be strings, hashes, lists, sets, sorted sets Value can be string, JSON, BLOB (basic large object) etc. These type of databases work by matching keys with values, similar to a dictionary. There is no structure nor relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g.Name) and provide a matching value (e.g. ”Saman”) which can later be retrieved the same way by supplying the key. Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record, and is used to quickly find the data within the database. KV stores work in a very different fashion than the better known relational databases (RDB). RDBs pre-define the data structure in the database as a series of tables containing fields with well defined data types. Exposing the data types to the database program allows it to apply a number of optimizations. In contrast, key-value systems treat the data as a single opaque collection which may have different fields for every record. This offers considerable flexibility and more closely follows modern concepts like object-oriented programming. Some popular key / value based data stores are: Redis: In-memory K/V store with optional persistence. Riak: Highly distributed, replicated K/V store. Memcached / MemcacheDB: Distributed memory based K/V store.
  • #30 Key/value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic information after performing, for example, a CPU- and memory-intensive computation. They are extremely performant, efficient and usually easily scalable. When to use: Caching: quickly storing data for (sometimes frequent) future use. Queuing: some K/V stores (e.g. Redis) support lists, sets, queues and more. Distributing information / tasks: they can be used to implement Pub/Sub. Keeping live information: applications which need to keep a state can use K/V stores easily.
  • #31 One of the biggest benefit for most NoSQL solutions, including Key Value Stores, would be horizontal scaling. We all know that horizontal scaling and SQL Server, while it’s possible, does not play well. Typically if you need more from SQL Server you scale vertically, which can be costly. Key / Value data stores are highly performant, easy to work with and they usually scale well. Another benefit for Key Value stores is a lack of schema, this allows for changing the data structure as needed, thus being a bit more flexible. Whereas with SQL Server altering a table could result in stored procedures, functions, views, etc… needing updates, which take time and a DBA resource. Because optional values are not represented by placeholders as in most RDBs, key-value stores often use far less memory to store the same database, which can lead to large performance gains in certain workloads. key-value systems treat the data as a single opaque collection which may have different fields for every record. This offers considerable flexibility and more closely follows modern concepts like object-oriented programming. The key value stores are typically written in some type of programming language, commonly Java. This gives the application developer the freedom to store data how they see fit, in a schema-less data store. A subclass of the key-value store is the document-oriented database, which offers additional tools that use the metadata in the data to provide a richer key-value database that more closely matches the use patterns of RDBM systems. Some graph databases are also key-value stores internally, adding the concept of the relationships (pointers) between records as a first class data type. Key Value stores support “Eventual Consistency”, if a feature in your application doesn’t need to fully support ACID, then may not be a significant draw back.
  • #32 Redis is an open source, advanced key-value store and a serious solution for building high-performance, scalable web applications. Redis has three main peculiarities that set it apart from much of its competition: Redis holds its database entirely in memory, using the disk only for persistence. Redis has a relatively rich set of data types when compared to many key-value data stores. Redis can replicate data to any number of slaves – Redis Replication Redis Persist in 2 ways RDB Persistence AOF(Append Only File) Persistence Now Redis is quite a bit different than other noSQL databases. Besides just being different than relational databases, like SQL server. You may be familiar with document databases like Ravendb or Mongodb. And while they are certainly good choices for noSQL databases, they operate quite a bitdifferently than Redis does. With document databases, like Ravendb or Mongodb. The focus is on creating documents which are persisted to disk and can be indexed. Just like relational tables are indexed in SQL server or Oracle. Redis on the other hand stores its data using keys, and the data it stores can be in the form of different data structures, not just a document. The data is also stored in memory with persistence as a secondary consideration. And there is no indexing of any kind. You can, of course, implement your own indexes by creating them as additional data. But Redis does not do any of that for you. This can be a bit of a shock to you, some developers that are use to being able to query a database. After all, isn't that what databases are for? Databases like SQL server and Oracle allow you to query the database using SQL. Databases like RavenDB and MongoDB, allow you to query the data using indexes you create ahead of time or on the fly. But Redis only lets you get data by specifying a key. At first, this may seem like a ludicrous tradeoff to make. Why would you want to give up the ability to query your data? And it's true, in some case, using Redis will not make any sense at all, but you'll probably find that where Redis is appropriate. Although you have to do a little bit of extra work in designing your data, and working out how to access that data. It will be extremely fast with very little overhead, and so that's really the advantage, and the consideration that you need totake into account when deciding whether or not to use Redis.
  • #33 Exceptionally fast: Redis can perform about 110,000 SETs per second and about 81,000 GETs per second. Rich data types: Redis natively supports most of the data types developers already know, such as lists, sets, sorted sets and hashes, which makes it easy to see which problem is handled best by which data type. Atomic operations: all Redis operations are atomic, which ensures that if two clients access the server concurrently, both will see the updated value. Multi-utility tool: Redis can be used in a number of use cases such as caching, messaging queues (Redis natively supports publish/subscribe), and any short-lived data in your application, such as web application sessions, web page hit counts, etc.
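For example, the page-hit-counter and short-lived-session use cases mentioned above can be sketched with an atomic increment and key expiry; the key names and the 30-minute timeout are illustrative assumptions:

    import redis

    r = redis.Redis()

    # Atomic increment: safe even when many clients update the counter concurrently.
    r.incr("hits:/home")

    # Short-lived data: store a session value that Redis expires after 30 minutes.
    r.setex("session:abc123", 1800, "user_id=100")

    print(r.get("hits:/home"), r.ttl("session:abc123"))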
  • #34 Redis supports five core data types. It also supports Bitmaps and HyperLogLogs, which are data types built on the String base type but with their own semantics.
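A minimal sketch of the Bitmap and HyperLogLog commands built on the String type (the key names and dates are made up):

    import redis

    r = redis.Redis()

    # Bitmap: mark that user id 42 was active today (one bit per user id).
    r.setbit("active:2015-01-01", 42, 1)
    print(r.bitcount("active:2015-01-01"))   # number of active users

    # HyperLogLog: approximate count of distinct visitors.
    r.pfadd("visitors", "ip1", "ip2", "ip1")
    print(r.pfcount("visitors"))             # roughly 2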
  • #35 Strings – a Redis String is a sequence of bytes. Strings are binary safe, meaning they have a known length not determined by any special terminating characters, and a single string can store anything up to 512 megabytes.
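Being binary safe means any byte sequence, even a small image, can be stored under a key; a quick sketch with made-up key names:

    import redis

    r = redis.Redis()

    # Any byte sequence up to 512 MB is a valid value, not just text.
    r.set("avatar:100", b"\x89PNG\r\n\x1a\n")   # start of a binary payload
    r.append("note:100", "hello ")
    r.append("note:100", "world")

    print(r.strlen("note:100"))    # 11
    print(r.get("note:100"))       # b'hello world'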
  • #36 Lists – Redis Lists are simply lists of strings, sorted by insertion order. You can add elements to a Redis List at the head or at the tail. The maximum length of a list is 2^32 - 1 elements (more than 4 billion elements per list). Lists are internally maintained as linked lists and are ideal for queues, stacks, top-N lists, recent news and timelines.
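A small sketch of a "recent news" timeline kept as a Redis List; the list name and the trim size of 100 are assumptions:

    import redis

    r = redis.Redis()

    # Push the newest headline onto the head of the list...
    r.lpush("news:recent", "Headline 3")
    # ...and keep only the 100 most recent items.
    r.ltrim("news:recent", 0, 99)

    # Read the latest 10 headlines, newest first.
    print(r.lrange("news:recent", 0, 9))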
  • #37 Sets – Redis Sets are an unordered collection of strings. In Redis you can add, remove, and test for the existence of members in O(1) time. In the example above, Hasangi is added twice, but because of the uniqueness property of sets it is stored only once. The maximum number of members in a set is 2^32 - 1 (4,294,967,295, more than 4 billion members per set). Sample usage: tracking unique IPs, tagging.
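Tracking unique IPs with a Redis Set might look like the following sketch (the key name is illustrative); adding the same member twice has no effect:

    import redis

    r = redis.Redis()

    r.sadd("unique_ips", "10.0.0.1")
    r.sadd("unique_ips", "10.0.0.1")   # duplicate, ignored
    r.sadd("unique_ips", "10.0.0.2")

    print(r.scard("unique_ips"))                   # 2 distinct members
    print(r.sismember("unique_ips", "10.0.0.1"))   # True, O(1) membership test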
  • #38 Sorted Sets – Redis Sorted Sets are, similarly to Redis Sets, non-repeating collections of strings. The difference is that every member is associated with a score, which is used to keep the set ordered from the smallest to the greatest score. Members are unique, but scores may repeat. Sample usage: leaderboards, most page views, sorting by a given age, and range queries over friends, comments and likes.
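A leaderboard is the classic Sorted Set example; a rough sketch with redis-py (the key and member names are made up):

    import redis

    r = redis.Redis()

    # Each member gets a score; the set stays ordered by score.
    r.zadd("leaderboard", {"alice": 1500, "hasangi": 2100, "nuwan": 1800})
    r.zincrby("leaderboard", 50, "alice")   # atomically bump a score

    # Top 3 players, highest score first, with their scores.
    print(r.zrevrange("leaderboard", 0, 2, withscores=True))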
  • #40 Redis pub/sub implements a messaging system in which senders (called publishers) send messages while receivers (subscribers) receive them. The link over which messages are transferred is called a channel. In Redis a client can subscribe to any number of channels, and a subscriber receives messages from all the clients (publishers) that publish to a particular channel.
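A minimal publish/subscribe sketch with redis-py (the channel name is an assumption); in a real application the subscriber would run in its own process or thread:

    import redis

    r = redis.Redis()

    # Subscriber side: register interest in a channel.
    p = r.pubsub()
    p.subscribe("news")

    # Publisher side: any client can publish to the channel.
    r.publish("news", "Redis 3.0 released")

    # Poll for messages (the first message is the subscribe confirmation).
    for _ in range(2):
        print(p.get_message(timeout=1))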
  • #41 Redis transactions allow the execution of a group of commands in a single step. Transactions have two properties: all commands in a transaction are executed sequentially as a single isolated operation, so it is not possible for a request issued by another client to be served in the middle of the execution of a Redis transaction; and a Redis transaction is atomic, meaning either all of the commands are processed or none are.
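With redis-py a MULTI/EXEC transaction is expressed as a pipeline (the key names are illustrative); the queued commands are sent and executed as one isolated unit:

    import redis

    r = redis.Redis()

    # transaction=True wraps the queued commands in MULTI ... EXEC.
    pipe = r.pipeline(transaction=True)
    pipe.set("account:1:balance", 100)
    pipe.incrby("account:1:balance", -30)
    pipe.get("account:1:balance")
    print(pipe.execute())   # results of all three commands, applied atomically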
  • #42 Redis Persistence. Redis provides a range of persistence options: RDB persistence performs point-in-time snapshots of your dataset at specified intervals, while AOF persistence logs every write operation received by the server; the log is replayed at server startup to reconstruct the original dataset. Commands are logged using the same format as the Redis protocol itself, in an append-only fashion, and Redis can rewrite the log in the background when it gets too big. You can also disable persistence entirely if you want your data to exist only as long as the server is running. It is possible to combine AOF and RDB in the same instance; in that case, when Redis restarts the AOF file is used to reconstruct the original dataset, since it is guaranteed to be the most complete. The most important thing to understand is the trade-offs between RDB and AOF persistence.
RDB advantages: RDB is a very compact single-file point-in-time representation of your Redis data. RDB files are perfect for backups: for instance, you may archive your RDB files every hour for the latest 24 hours and save a snapshot every day for 30 days, so you can easily restore different versions of the data set in case of disaster. RDB is very good for disaster recovery, being a single compact file that can be transferred to remote data centers or to Amazon S3 (possibly encrypted). RDB maximizes Redis performance, since the only work the Redis parent process needs to do in order to persist is fork a child that does all the rest; the parent instance never performs disk I/O or the like. RDB also allows faster restarts with big datasets compared to AOF.
RDB disadvantages: RDB is not good if you need to minimize the chance of data loss when Redis stops working (for example after a power outage). You can configure different save points at which an RDB is produced (for instance after at least five minutes and 100 writes against the data set, and you can have multiple save points), but you will usually create a snapshot every five minutes or more, so if Redis stops working without a correct shutdown you should be prepared to lose the latest minutes of data. RDB needs to fork() often in order to persist to disk using a child process; fork() can be time consuming if the dataset is big, and may cause Redis to stop serving clients for some milliseconds, or even for a second if the dataset is very big and CPU performance is not great. AOF also needs to fork(), but you can tune how often to rewrite the log without any trade-off on durability.
AOF advantages: with AOF, Redis is much more durable. You can choose between different fsync policies: no fsync at all, fsync every second, or fsync at every query. With the default policy of fsync every second, write performance is still great (fsync is performed in a background thread, and the main thread tries hard to perform writes when no fsync is in progress), yet you can lose at most one second worth of writes. The AOF log is append only, so there are no seeks and no corruption problems if there is a power outage; even if the log ends with a half-written command for some reason (disk full or other reasons), the redis-check-aof tool can fix it easily. Redis can automatically rewrite the AOF in the background when it gets too big: the rewrite is completely safe because, while Redis continues appending to the old file, a completely new one is produced with the minimal set of operations needed to create the current data set, and once this second file is ready Redis switches the two and starts appending to the new one. The AOF contains a log of all operations, one after the other, in an easy-to-understand and easy-to-parse format, and you can even export it easily. For instance, even if you flushed everything by mistake with a FLUSHALL command, as long as no rewrite of the log was performed in the meantime you can still save your data set by stopping the server, removing the latest command, and restarting Redis.
AOF disadvantages: AOF files are usually bigger than the equivalent RDB files for the same dataset. AOF can be slower than RDB depending on the exact fsync policy; in general, with fsync set to every second performance is still very high, and with fsync disabled it should be exactly as fast as RDB even under high load, but RDB is able to provide stronger guarantees about maximum latency even under a huge write load. In the past there were rare bugs in specific commands (for instance one involving blocking commands like BRPOPLPUSH) that caused the AOF produced to not reproduce exactly the same dataset on reloading. These bugs are rare, and the test suite automatically creates random complex datasets and reloads them to check that everything is OK, but this kind of bug is almost impossible with RDB persistence. To make the point clearer: the Redis AOF works by incrementally updating an existing state, like MySQL or MongoDB do, while RDB snapshotting creates everything from scratch again and again, which is conceptually more robust. Note, however, that 1) every time the AOF is rewritten by Redis it is recreated from scratch starting from the actual data contained in the data set, making resistance to bugs stronger compared to an always-appending AOF file (or one rewritten by reading the old AOF instead of the data in memory), and 2) there has never been a single user report of an AOF corruption detected in the real world.
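A rough sketch of inspecting and adjusting these persistence settings at runtime through redis-py; the save-point values are arbitrary examples, and production settings would normally live in redis.conf:

    import redis

    r = redis.Redis()

    # RDB: snapshot if at least 100 keys changed within 300 seconds (example values).
    r.config_set("save", "300 100")
    r.bgsave()              # trigger a background RDB snapshot now

    # AOF: turn on the append-only file and compact it in the background.
    r.config_set("appendonly", "yes")
    r.bgrewriteaof()

    print(r.config_get("save"), r.config_get("appendonly"))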
  • #43 Redis replication is a very simple to use and configure master-slave replication that allows slave Redis servers to be exact copies of master servers. Some very important facts about Redis replication: Redis uses asynchronous replication; starting with Redis 2.8, however, slaves periodically acknowledge the amount of data processed from the replication stream. A master can have multiple slaves. Slaves are able to accept connections from other slaves: aside from connecting a number of slaves to the same master, slaves can also be connected to other slaves in a graph-like structure. Redis replication is non-blocking on the master side, meaning the master continues to handle queries while one or more slaves perform their initial synchronization. Replication is also largely non-blocking on the slave side: while the slave performs the initial synchronization it can handle queries using the old version of the dataset, assuming you configured Redis to do so in redis.conf; otherwise, you can configure Redis slaves to return an error to clients while the replication stream is down. However, after the initial sync the old dataset must be deleted and the new one loaded, and the slave will block incoming connections during this brief window. Replication can be used both for scalability, to have multiple slaves for read-only queries (for example, heavy SORT operations can be offloaded to slaves), and simply for data redundancy. http://blog.concretesolutions.com.br/2013/03/redis-parte-2/ http://redis.io/topics/sentinel http://redis.io/topics/replication Redis Sentinel provides high availability for Redis: in practical terms, using Sentinel you can create a Redis deployment that resists certain kinds of failure without human intervention. Sentinel also provides other collateral tasks such as monitoring and notifications, and acts as a configuration provider for clients. This is the full list of Sentinel capabilities at a macroscopic level (the big picture): Monitoring – Sentinel constantly checks whether your master and slave instances are working as expected. Notification – Sentinel can notify the system administrator, or other computer programs via an API, that something is wrong with one of the monitored Redis instances. Automatic failover – if a master is not working as expected, Sentinel can start a failover process in which a slave is promoted to master, the other slaves are reconfigured to use the new master, and the applications using the Redis server are informed of the new address to use when connecting. Configuration provider – Sentinel acts as a source of authority for client service discovery: clients connect to Sentinels to ask for the address of the current Redis master responsible for a given service, and if a failover occurs, Sentinels report the new address.
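A client talking to a Sentinel-managed deployment might look like the following sketch; the Sentinel address and the master name "mymaster" are assumptions. Sentinel tells the client which node is currently the master:

    from redis.sentinel import Sentinel

    # Ask Sentinel (not Redis itself) where the master and replicas are.
    sentinel = Sentinel([("localhost", 26379)], socket_timeout=0.5)

    master = sentinel.master_for("mymaster", socket_timeout=0.5)
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

    master.set("greeting", "hello")   # writes go to the master
    print(replica.get("greeting"))    # reads can be served by a replica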
  • #44 The important difference here is that columns are created for each row rather than being predefined by the table structure.
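The idea can be illustrated with a plain dictionary of dictionaries: each row key maps to its own set of column names, so two rows need not share the same columns (all names below are made up):

    # Toy wide-column layout: rows carry their own columns.
    rows = {
        "user:100": {"name": "Alice", "email": "alice@example.com"},
        "user:101": {"name": "Bob", "last_login": "2015-01-01", "country": "LK"},
    }

    # Adding a new "column" touches only one row, not a table schema.
    rows["user:100"]["phone"] = "+94 11 1234567"
    print(sorted(rows["user:101"].keys()))   # columns differ per row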
  • #45 Map-Reduce - An algorithm for efficiently processing large amounts of data in parallel
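The idea can be sketched in a few lines of plain Python: a map step that emits key-value pairs, a shuffle that groups them by key, and a reduce step that combines each group. This is a toy word count, not a distributed implementation:

    from collections import defaultdict

    documents = ["big data", "big users", "big data"]

    # Map: emit (word, 1) for every word (done in parallel in a real system).
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: combine each group into a single result.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)   # {'big': 3, 'data': 2, 'users': 1}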
  • #47 Ambari™: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: a data serialization system.
Cassandra™: a scalable multi-master database with no single point of failure.
Chukwa™: a data collection system for managing large distributed systems.
HBase™: a scalable, distributed database that supports structured data storage for large tables.
Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: a scalable machine learning and data mining library.
Pig™: a high-level data-flow language and execution framework for parallel computation.
Spark™: a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: a high-performance coordination service for distributed applications.
  • #54 Map-Reduce - An algorithm for efficiently processing large amounts of data in parallel
  • #59 Cypher is a query language designed specifically for the Neo4j graph database, and it is still under active development. Cypher is declarative: you specify what you need to retrieve, not how Neo4j should retrieve it. Cypher uses patterns to match data in the database and works with clauses such as WHERE and ORDER BY.
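A small sketch of running a Cypher query from Python with the official neo4j driver; the connection URI, credentials, node labels, relationship type and property names are all assumptions:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Declarative pattern: describe what to match, not how to traverse it.
    query = (
        "MATCH (p:Person)-[:FRIEND_OF]->(f:Person) "
        "WHERE p.name = $name "
        "RETURN f.name ORDER BY f.name"
    )

    with driver.session() as session:
        for record in session.run(query, name="Alice"):
            print(record["f.name"])

    driver.close()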
  • #63 A Document Store is a type of NoSQL database that stores collections of documents; the data model lives inside the documents themselves. Documents are not typically forced to have a schema and are therefore flexible and easy to change. Documents can contain many different key-value pairs, key-array pairs, or even nested documents. They usually use a JSON-like (BSON) interchange model, so application logic can be written easily.
  • #64 MongoDB is an open source document database written in C++. Data is stored in an open, JSON-like binary format (BSON), which keeps the data easy to read, allows server-side operations on the data, and makes it easy to build tools that manipulate it. Full index support gives high performance. MongoDB provides automatic failover and replication, which give high availability, and it scales horizontally through sharding, which gives easy scalability. It also provides an aggregation framework, which makes it easy to handle large amounts of data.
  • #65 A database is a physical container for collections; each database gets its own set of files on the file system, and a single MongoDB server typically hosts multiple databases. A collection is a group of MongoDB documents and is the equivalent of an RDBMS table. A document is a set of key-value pairs; the slide shows an example of a document.
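A minimal sketch of a database, a collection, and a document using pymongo; the server address, database name, collection name and field values are assumptions:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]              # database: physical container for collections
    customers = db["customers"]      # collection: the RDBMS-table equivalent

    # Document: a set of key-value pairs, including nested values.
    doc = {
        "first_name": "Alice",
        "last_name": "Perera",
        "address": {"street": "1 Main St", "city": "Colombo"},
        "last_order": 2001,
    }
    print(customers.insert_one(doc).inserted_id)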
  • #66 MongoDB has a number of advanced features of its own. We now review these features one by one.
  • #67 Replication is the process of synchronizing data across multiple servers. It provides redundancy and increases data availability, because keeping multiple copies of the data on different database servers protects the database from the loss of a single server. Replication also allows you to recover from hardware failures and service interruptions, and a copy can be dedicated to disaster recovery, reporting, or backup. With a single database server, a crash means all data is lost; if you have a backup you can restore it, but that is the traditional approach to fail safety. MongoDB instead supports a concept called a replica set to achieve replication. A replica set is a group of mongod instances that host the same data set; it generally contains at least three nodes: one primary node, one or more secondary nodes, and optionally an arbiter node. All data replicates from the primary to the secondary nodes. 1) Primary node: a replica set can have only one primary instance, which receives all write operations, so any client writing data must be connected to the primary. 2) Secondary nodes: read-only copies of the database. There can be many secondaries, which improves scalability because reads can be spread across the replicas rather than hitting a single server. 3) Arbiter node: an arbiter does not hold a copy of the data set and cannot become primary; replica sets may include an arbiter to add a vote in elections for the primary, and it can run on a smaller machine because it does not need much hardware.
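From the application side, connecting to a replica set is mostly a connection-string option; a sketch with pymongo in which the host names, the replica-set name "rs0", and the database/collection names are assumptions:

    from pymongo import MongoClient, ReadPreference

    # List any members; the driver discovers the rest and routes writes to the primary.
    client = MongoClient(
        "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017",
        replicaSet="rs0",
    )

    db = client["shop"]
    db["orders"].insert_one({"cust_id": "A1", "amount": 50})   # goes to the primary

    # Reads can optionally be sent to secondaries via a read preference.
    secondary_db = client.get_database("shop", read_preference=ReadPreference.SECONDARY_PREFERRED)
    print(secondary_db["orders"].count_documents({}))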
  • #68 At some point the primary may fail; one of the secondaries then takes over and becomes the new primary. This is great because mongod supports automatic recovery from a crash of the primary. If a secondary fails, it is not a big deal: the primary is still available and, depending on the application, there may be many other secondaries as well, so there is no data loss and little loss of functionality. When the primary fails, one of the secondaries takes over, but since there can be multiple secondaries, which one becomes primary? MongoDB holds an election: a node needs a simple majority (more than 50% of the votes) to become the new primary. The arbiter does not store data but takes part in this election by casting a vote.
  • #69 MongoDB handles large volumes of data, and indexes speed up queries by limiting the number of documents that have to be scanned. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form; they use a B-tree structure. You can create an index on a field with the ensureIndex method (called createIndex in newer MongoDB versions): the key is the name of the field you want to index, and 1 means ascending order (use -1 for descending). ensureIndex can also be passed multiple fields to create an index on multiple fields. The index stores the value of a specific field or set of fields, ordered by the value of the field; this ordering supports efficient equality matches and range-based query operations, and MongoDB can also return sorted results by using the ordering in the index.
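With pymongo the equivalent call is create_index (ensureIndex is the legacy shell name); the collection and field names below are illustrative:

    import pymongo
    from pymongo import MongoClient

    coll = MongoClient()["shop"]["customers"]

    # Single-field index, ascending (1); use -1 / DESCENDING for descending order.
    coll.create_index([("last_name", pymongo.ASCENDING)])

    # Compound index on multiple fields.
    coll.create_index([("last_name", pymongo.ASCENDING), ("last_order", pymongo.DESCENDING)])

    print(list(coll.index_information()))   # includes the default _id index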
  • #70 MongoDB without indexing: consider a collection named "foo" in which you want to find all documents where the value of field x is 10. What does the server have to do to find them? It has to scan each and every document and check whether field x equals 10, which is a very wasteful operation. Without indexes, MongoDB must perform a full collection scan; the solution is to use an index.
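You can see the difference with explain(): without an index the winning plan is a full collection scan (COLLSCAN); after creating an index it contains an index scan (IXSCAN) instead. A rough sketch, reusing the hypothetical collection "foo" and field "x" from the example above:

    from pymongo import MongoClient

    foo = MongoClient()["test"]["foo"]
    foo.insert_many([{"x": i} for i in range(1000)])

    # No index yet: the winning plan is a COLLSCAN over every document.
    print(foo.find({"x": 10}).explain()["queryPlanner"]["winningPlan"])

    foo.create_index("x")

    # With the index: the winning plan now uses an IXSCAN stage.
    print(foo.find({"x": 10}).explain()["queryPlanner"]["winningPlan"])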
  • #71 Fundamentally, indexes in MongoDB are similar to indexes in other database systems. MongoDB defines indexes at the collection level and supports indexes on any field or sub-field of the documents in a MongoDB collection.
  • #72 MongoDB provides a number of different index types to support specific types of data and queries (see the sketch after this list).
1) Default _id index: all MongoDB collections have an index on the _id field that exists by default. If the application does not specify a value for _id, the driver or mongod creates an _id field with an ObjectId value. The _id index is unique and prevents clients from inserting two documents with the same value for the _id field.
2) Single-field index: in addition to the MongoDB-defined _id index, MongoDB supports user-defined ascending/descending indexes on a single field of a document.
3) Compound index: MongoDB also supports user-defined indexes on multiple fields. The order of the fields listed in a compound index matters: the index sorts first by the first field and then, within each value of that field, by the next field.
4) Multikey index: if you index a field that holds an array value, MongoDB creates a multikey index on that field. Multikey indexes allow queries to select documents that contain arrays by matching on one or more elements of the arrays. MongoDB automatically determines whether to create a multikey index when the indexed field contains an array value; you do not need to specify the multikey type explicitly.
5) Geospatial index: supports efficient queries on geospatial coordinate data.
6) Text index: supports searching for string content in a collection.
7) Hashed index: to support hash-based sharding, MongoDB provides a hashed index type, which indexes the hash of a field's value. These indexes have a more random distribution of values along their range, but only support equality matches and cannot support range-based queries.
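A sketch of creating several of these index types with pymongo; the collection and field names are made up, and the geospatial and text examples assume appropriately shaped data:

    import pymongo
    from pymongo import MongoClient

    coll = MongoClient()["shop"]["places"]

    coll.create_index([("name", pymongo.ASCENDING)])            # single field
    coll.create_index([("city", pymongo.ASCENDING),
                       ("rating", pymongo.DESCENDING)])         # compound
    coll.create_index([("tags", pymongo.ASCENDING)])            # multikey if "tags" is an array
    coll.create_index([("location", pymongo.GEOSPHERE)])        # geospatial (2dsphere)
    coll.create_index([("description", pymongo.TEXT)])          # text search
    coll.create_index([("user_id", pymongo.HASHED)])            # hashed, for hashed sharding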
  • #73 Aggregations are operations that process data records and return computed results. Aggregation operations group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. MongoDB provides a rich set of aggregation operations: like queries, they take collections of documents as input and return results in the form of one or more documents. There are three aggregation approaches: 1) the aggregation pipeline, 2) map-reduce, and 3) single-purpose aggregation operations.
  • #74 The pipeline provides efficient data aggregation using native operations within MongoDB and is the preferred method for data aggregation in MongoDB. Documents enter a multi-stage pipeline that transforms them into an aggregated result: the first stage takes documents as input, processes them, and produces output documents that become the input to the next stage, and so on. Possible stages in the aggregation framework include $project, $match, $group, $sort, $skip, $limit and $unwind. The example has two stages, $match and $group: the $match stage filters on the status field equal to "A", and its output documents become the input to the $group stage, which groups by cust_id and sums amount into total.
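The example just described translates directly into a pymongo aggregate() call; the database and collection names are assumptions, while the field names follow the example:

    from pymongo import MongoClient

    orders = MongoClient()["shop"]["orders"]

    pipeline = [
        {"$match": {"status": "A"}},                                    # stage 1: filter
        {"$group": {"_id": "$cust_id", "total": {"$sum": "$amount"}}},  # stage 2: group and sum
    ]

    for doc in orders.aggregate(pipeline):
        print(doc)   # e.g. {'_id': 'A1', 'total': 75}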
  • #75 MongoDB also provides map-reduce operations to perform aggregation. In general, map-reduce operations have two phases, map and reduce; optionally, map-reduce can have a finalize stage that makes final modifications to the result. Map-reduce uses custom JavaScript functions to perform the map and reduce operations, as well as the optional finalize operation. The main options are: map – a JavaScript function that maps a value to a key and emits key-value pairs; reduce – a JavaScript function that reduces or groups all the documents having the same key; out – specifies the location of the map-reduce result; query – specifies optional selection criteria for selecting documents; sort – specifies optional sort criteria; limit – specifies the optional maximum number of documents to return. In the example there is an orders collection: the query selects documents whose status field equals "A", the map step emits cust_id as the key and amount as the value, the reduce step returns the sum of the amount array, and the result is stored in order_totals.
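A sketch of running the same map-reduce from pymongo via the mapReduce database command (map-reduce has since been deprecated in newer MongoDB releases in favour of the aggregation pipeline); the database name is an assumption, and the map and reduce bodies are JavaScript passed as bson.Code:

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["shop"]

    map_fn = Code("function () { emit(this.cust_id, this.amount); }")
    reduce_fn = Code("function (key, values) { return Array.sum(values); }")

    db.command(
        "mapReduce", "orders",
        map=map_fn, reduce=reduce_fn,
        query={"status": "A"},
        out="order_totals",
    )
    print(list(db["order_totals"].find()))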
  • #76 MongoDB also provides special-purpose database commands for common aggregation operations: returning a count of matching documents, returning the distinct values of a field, and grouping data based on the values of a field. All of these operations aggregate documents from a single collection.
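In pymongo these single-purpose operations are exposed directly on the collection (names reuse the hypothetical orders collection above):

    from pymongo import MongoClient

    orders = MongoClient()["shop"]["orders"]

    print(orders.count_documents({"status": "A"}))   # count of matching documents
    print(orders.distinct("cust_id"))                # distinct values of a field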
  • #77 Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high-throughput operations. As the size of the data increases, a single machine may not be able to store all of it or provide acceptable read and write throughput; sharding solves this problem with horizontal scaling, since more machines can be added to support data growth and the demands of read and write operations. Shards store the data; they provide high availability and data consistency, and in a production environment each shard is a separate replica set. Config servers store the cluster's metadata, which contains a mapping of the cluster's data set to the shards. Query routers (mongos) are the interface between client applications and the cluster: they process operations, target them at the appropriate shards, and return results to the clients.
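Once mongos, the config servers, and the shards are running, enabling sharding comes down to a pair of admin commands sent through mongos; a hedged sketch with pymongo in which the mongos address, database, collection and shard-key names are all assumptions:

    from pymongo import MongoClient

    # Connect to a mongos query router, not to an individual shard.
    client = MongoClient("mongodb://mongos.example.com:27017")

    client.admin.command("enableSharding", "shop")
    client.admin.command(
        "shardCollection", "shop.orders",
        key={"cust_id": "hashed"},      # hashed shard key spreads writes across shards
    )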
  • #78 Capped collections are fixed-size, circular collections that give high performance for create, read, and delete operations. Circular means that when the fixed size allocated to the collection is exhausted, the oldest documents are deleted automatically without any explicit command. Capped collections restrict updates that would increase a document's size, and they are best suited to storing log information.
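Creating a capped collection for logs with pymongo might look like the following sketch; the collection name, size and max values are example assumptions:

    from pymongo import MongoClient

    db = MongoClient()["shop"]

    # 1 MB circular buffer holding at most 5000 log entries; the oldest entries
    # are removed automatically once the size limit is reached.
    log = db.create_collection("app_log", capped=True, size=1024 * 1024, max=5000)

    log.insert_one({"level": "INFO", "msg": "server started"})
    print(db["app_log"].options())   # shows {'capped': True, 'size': ..., 'max': ...}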