H I M A N C H A L I T E C H L E A D D B E @ I N M O B I PostgreSQL as NoSQL
What’s on the plate today ?  What is the requirement for Schemaless database ?  Data type and volume that has to be handled.  How can we use PostgreSQL to store Schemaless data ?  Performance of PostgreSQL Schemaless option.
What is Schemaless data ?  It doesn’t mean unstructured, but it’s structured over each document  Each Document is a pair of key-values in a hierarchical structure of arrays  The application needs to be aware about this structure and handle it if it’s not getting as expected  Modern Schemaless DBs don’t use SQL and hence are termed NoSQL DB. (confusing term)
Requirement for Schemaless data  Introduction of new reporting data for merchant where dimension is not fixed  Currently there are 15 known fields, where data varies from 5-15 for different merchants  In future , chances of new dimension addition  Data size to start with is 1 million per hour
Issue with current schema DB  Fixed schema definition  Too many null values  Adding new column will need schema changes  May be updates in old rows will be required So, should we move to NoSQL ??
If PostgreSQL can solve it !!  3 different ways provided by PostgreSQL for Document types  XML  hstore  JSON
hstore : Smart contrib module  A hierarchical storage type for PostgreSQL  Key Value store with ACID compliance  Maps String Keys to String Values or other hstore values  Rich in its own functions like..  h -> “a”  h?->”a”  h@>”a->2”  Indexing available : GiST and GIN  Indexes whole hierarchy , not just a key  Expression indexes are also supported
hstore continues…. CREATE TABLE my_store ( id character varying(1024) NOT NULL, doc hstore, CONSTRAINT my_store_pkey PRIMARY KEY (id) ); CREATE INDEX my_store_doc_idx_gist ON my_store USING gist(doc) ; SELECT doc -> ‘text’ as merchant, doc -> ‘created_at’ as created_at FROM my_store WHERE doc @> ‘created_at=>23/12/2013’; SELECT doc -> ‘text’ as merchant, doc -> ‘created_at’ as created_at FROM my_store WHERE doc @> ‘is_active=>:t’ AND doc ? ‘has_address’ ORDER BY doc -> ‘created_at’ DESC; SELECT doc -> ‘text’ as merchant, doc -> ‘created_at’ as created_at FROM my_store WHERE doc @> ‘is_active=>:t’ AND doc ?| ARRAY[‘has_address’, ‘has_payoption’] ;
JSON  Storage type of famous NoSQL DBs like MongoDB , CouchDB  PostgreSQL introduced JSON type in 9.2; Stored as text but with validation for JSON (--To be fixed in 9.4)  Function like row_to_json , array_to_json  But JSON Production and Processing absent  9.3 Came with many new features  to_json(any)  Json_agg(record)  Many Extraction Functions and operators
JSON continues…  JSONB being introduced in 9.4  JSONB Indexing  Tables can be unlogged for further performance increase at the cost of reliability
hstore and json  hstore_to_json(hstore)  hstore_to_json_loose(hstore)
Some numbers..  Data used : id number, other fields text  Record 2 Million , average 64bytes each  Read CSV File, parse into appropriate format , Insert into DB  Write Speed in Records/Sec  hstore : 4600 r/s  hstore(GiST) : 3000 r/s  hstore (GIN) : 1700 r/s  JSON : 4600 r/s  MongoDB : 4000 r/s • DB Size (in MB table/indexes)  hstore : 300/0  hstore(GiST) : 300/71  hstore(GIN) : 300/700  JSON : 300/60  MongoDB : 1800/200
Write Speed 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Records/Second Records/Second
Data Size 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Index(MB) Table(MB)
 Select query on primary key  Fetch Time in Milliseconds  hstore : 320  hstore(GiST) : 190  hstore(GIN) : 180  JSON : 20  MongoDB : 40  Select query on Name (filter for 100 names)  Fetch Time in Milliseconds  hstore : 350  hstore(GiST) : 150  Hstore(GIN) : 140  JSON : 10000  MongoDB : 450
Select Query Fetch on PK 0 50 100 150 200 250 300 350 MilliSeconds
Select Query Fetch on Text 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 MilliSeconds MilliSeconds
Some conclusions..  PostgreSQL can be used as schemaless DB  PostgreSQL’s relational data storage is very efficient.  Build indexes using expressions on common used fields  In hstore GiST index is much more efficient than GIN in our case  GiST/GIN accelerates every field not just the Primary Key  GIN indexes are best for static data because lookups are faster.  For dynamic data, GiST indexes are faster to update.  Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000.
So what you got in PostgreSQL  Use the features of relational DB in your Schemaless world !  Use Constraints  Use Transactions  Use Indexing  Do Joins on keys  And with all these get NoSQL(Schemaless data) requirement fulfilled as well.  Save the new DB migration cost and time  ---  As these are PostgreSQL specific features, migration to other RDBMS not possible.
Questions??? No offence to NoSQL DBs  Thank You !! himamahi09@gmail.com

PostgreSQL as NoSQL

  • 1.
    H I MA N C H A L I T E C H L E A D D B E @ I N M O B I PostgreSQL as NoSQL
  • 2.
    What’s on theplate today ?  What is the requirement for Schemaless database ?  Data type and volume that has to be handled.  How can we use PostgreSQL to store Schemaless data ?  Performance of PostgreSQL Schemaless option.
  • 3.
    What is Schemalessdata ?  It doesn’t mean unstructured, but it’s structured over each document  Each Document is a pair of key-values in a hierarchical structure of arrays  The application needs to be aware about this structure and handle it if it’s not getting as expected  Modern Schemaless DBs don’t use SQL and hence are termed NoSQL DB. (confusing term)
  • 4.
    Requirement for Schemalessdata  Introduction of new reporting data for merchant where dimension is not fixed  Currently there are 15 known fields, where data varies from 5-15 for different merchants  In future , chances of new dimension addition  Data size to start with is 1 million per hour
  • 5.
    Issue with currentschema DB  Fixed schema definition  Too many null values  Adding new column will need schema changes  May be updates in old rows will be required So, should we move to NoSQL ??
  • 6.
    If PostgreSQL cansolve it !!  3 different ways provided by PostgreSQL for Document types  XML  hstore  JSON
  • 7.
    hstore : Smartcontrib module  A hierarchical storage type for PostgreSQL  Key Value store with ACID compliance  Maps String Keys to String Values or other hstore values  Rich in its own functions like..  h -> “a”  h?->”a”  h@>”a->2”  Indexing available : GiST and GIN  Indexes whole hierarchy , not just a key  Expression indexes are also supported
  • 8.
    hstore continues…. CREATE TABLEmy_store ( id character varying(1024) NOT NULL, doc hstore, CONSTRAINT my_store_pkey PRIMARY KEY (id) ); CREATE INDEX my_store_doc_idx_gist ON my_store USING gist(doc) ; SELECT doc -> ‘text’ as merchant, doc -> ‘created_at’ as created_at FROM my_store WHERE doc @> ‘created_at=>23/12/2013’; SELECT doc -> ‘text’ as merchant, doc -> ‘created_at’ as created_at FROM my_store WHERE doc @> ‘is_active=>:t’ AND doc ? ‘has_address’ ORDER BY doc -> ‘created_at’ DESC; SELECT doc -> ‘text’ as merchant, doc -> ‘created_at’ as created_at FROM my_store WHERE doc @> ‘is_active=>:t’ AND doc ?| ARRAY[‘has_address’, ‘has_payoption’] ;
  • 9.
    JSON  Storage typeof famous NoSQL DBs like MongoDB , CouchDB  PostgreSQL introduced JSON type in 9.2; Stored as text but with validation for JSON (--To be fixed in 9.4)  Function like row_to_json , array_to_json  But JSON Production and Processing absent  9.3 Came with many new features  to_json(any)  Json_agg(record)  Many Extraction Functions and operators
  • 10.
    JSON continues…  JSONBbeing introduced in 9.4  JSONB Indexing  Tables can be unlogged for further performance increase at the cost of reliability
  • 11.
    hstore and json hstore_to_json(hstore)  hstore_to_json_loose(hstore)
  • 12.
    Some numbers..  Dataused : id number, other fields text  Record 2 Million , average 64bytes each  Read CSV File, parse into appropriate format , Insert into DB  Write Speed in Records/Sec  hstore : 4600 r/s  hstore(GiST) : 3000 r/s  hstore (GIN) : 1700 r/s  JSON : 4600 r/s  MongoDB : 4000 r/s • DB Size (in MB table/indexes)  hstore : 300/0  hstore(GiST) : 300/71  hstore(GIN) : 300/700  JSON : 300/60  MongoDB : 1800/200
  • 13.
  • 14.
  • 15.
     Select queryon primary key  Fetch Time in Milliseconds  hstore : 320  hstore(GiST) : 190  hstore(GIN) : 180  JSON : 20  MongoDB : 40  Select query on Name (filter for 100 names)  Fetch Time in Milliseconds  hstore : 350  hstore(GiST) : 150  Hstore(GIN) : 140  JSON : 10000  MongoDB : 450
  • 16.
    Select Query Fetchon PK 0 50 100 150 200 250 300 350 MilliSeconds
  • 17.
    Select Query Fetchon Text 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 MilliSeconds MilliSeconds
  • 18.
    Some conclusions..  PostgreSQLcan be used as schemaless DB  PostgreSQL’s relational data storage is very efficient.  Build indexes using expressions on common used fields  In hstore GiST index is much more efficient than GIN in our case  GiST/GIN accelerates every field not just the Primary Key  GIN indexes are best for static data because lookups are faster.  For dynamic data, GiST indexes are faster to update.  Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000.
  • 19.
    So what yougot in PostgreSQL  Use the features of relational DB in your Schemaless world !  Use Constraints  Use Transactions  Use Indexing  Do Joins on keys  And with all these get NoSQL(Schemaless data) requirement fulfilled as well.  Save the new DB migration cost and time  ---  As these are PostgreSQL specific features, migration to other RDBMS not possible.
  • 20.
    Questions??? No offence toNoSQL DBs  Thank You !! himamahi09@gmail.com