CosmosDB for DBAs & Developers

Cosmos DB for DBAs & DEVs Niko Neugebauer – Consultant @ OH22

Niko Neugebauer
Speaker: Niko speaks regularly at events such as PASS Summit, SQLRally, SQLBits, and SQLSaturday events around the world.
Professional Focus
• Data Platform (especially from Microsoft)
• Columnstore Blogger (110+) at http://www.nikoport.com/columnstore
• Creator of CISL – Columnstore Indexes Script Library (https://github.com/NikoNeugebauer/CSIL)
Community
• Led the first international SQLSaturday
• PASS User Group Leader
• TUGA Non-Profit Association Leader
• /in/webcaravela/ • @NikoNeugebauer


CAP Theorem – old wisdom: pick just 2! • Consistency • Availability • Partition tolerance

Agenda • What is CosmosDB ? • Why CosmosDB ? • How CosmosDB ? • Use CosmosDB • CosmosDB for Developers • CosmosDB for DBAs

What is CosmosDB • Azure Cosmos DB is Microsoft's globally distributed, multi-model database. • With the click of a button, Azure Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure's geographic regions. • It offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs), something no other database service can offer.

Data Models in CosmosDB • Database engine operates on atom-record-sequence based type system. All data models translated to A-R-S • API and wire protocols supported via extensible modules Currently supported data models: • Documents, Graphs, Key-Value, Column-Value

API (30-11-2017) • DocumentDB API • SQL-like API • MongoDB API • Table API • Graph API (TinkerPop, Gremlin/Groovy) • Cassandra API • Spark • Geospatial support • more will be coming!

A word on Table API vs Azure Table Storage comparison
• Latency: Table Storage – fast; Cosmos Table API – single-digit millisecond latency
• Throughput: Table Storage – variable, scalable up to 20,000 operations/sec; Cosmos Table API – highly scalable with dedicated reserved throughput per table, up to 10 million operations/sec
• Global Distribution: Table Storage – single region; Cosmos Table API – turnkey global distribution
• Indexing: Table Storage – only a primary index on PartitionKey and RowKey; Cosmos Table API – automatic and complete indexing on all properties, no index management (LOL)
• Query: Table Storage – query execution uses the index for the primary key, and scans otherwise; Cosmos Table API – queries can take advantage of automatic indexing on properties for fast query times
• Consistency: Table Storage – strong in the primary region, eventual in the secondary region; Cosmos Table API – 5 well-defined consistency levels

Partitioning • Implemented on the Tenant-level (Collection, Graph, Table) • A resource partition is a resource-governed primitive, which is limited to a subset of keys. • Capable of doing Splits, Merges, etc from the Partitions

Partitioning Best Practices • Select a PartitionKey for the best data distribution • Use a location-aware partition key for the best access locality • Select a PartitionKey which can be a transaction scope • Don’t use timestamps for write-heavy workloads. Use time ranges (hour, day, week, month, year) for even data distribution.
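The time-range advice above can be sketched in Python. This is a minimal illustration rather than Cosmos DB SDK code: the helper name `time_bucket_partition_key` and the bucket formats are hypothetical, and week buckets are left out for brevity. It only shows a raw timestamp being coarsened to a time range before it is used as a partition key value.

```python
from datetime import datetime, timezone

# Hypothetical bucket formats; the names and layouts are illustrative assumptions.
_FORMATS = {
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
    "year": "%Y",
}


def time_bucket_partition_key(ts: datetime, granularity: str = "hour") -> str:
    """Coarsen a timestamp to a time-range bucket usable as a partition key value."""
    if granularity not in _FORMATS:
        raise ValueError(f"unsupported granularity: {granularity}")
    # Normalise to UTC so the same instant always lands in the same bucket.
    return ts.astimezone(timezone.utc).strftime(_FORMATS[granularity])


if __name__ == "__main__":
    event_time = datetime(2017, 11, 30, 14, 25, 7, tzinfo=timezone.utc)
    print(time_bucket_partition_key(event_time, "hour"))
    print(time_bucket_partition_key(event_time, "day"))
```

Which granularity fits depends on the write rate: the bucket should be coarse enough that a single key value does not change on every write, and fine enough that no single bucket receives most of the traffic.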

Why create CosmosDB? • Traditional relational databases were designed in the 70s-80s • Data is growing (petabytes, exabytes, etc.) • Think about Internet-scale and distributed systems • Provide API choices Think about: • Availability • Performance • Costs

CosmosDB: the focus on performance
Latency percentile: Reads (1KB) / Indexed Writes (1KB)
• 50th: < 2ms / < 6ms
• 99th: < 10ms / < 15ms
▪ Globally distributed with reads and writes served from/to local region
▪ Write-optimised, latch-free engine designed for SSD
▪ Synchronous/Asynchronous automatic indexing

Azure Cosmos DB • Azure Cosmos DB is fully schema agnostic. • Uses JSON to describe the supported data models • Automatic indexing of all ingested content • Resource Governed, write-optimised engine • Online Index operations

Core pieces of CosmosDB Architecture • Global distribution • Resource Governance • Schema-agnostic service

Consistency Levels (and there are 5 of them): • You pick a stronger consistency level, like strong or bounded staleness, for your account because a critical path in your e-commerce/LOB application needs the guarantee • But for some less-critical operations (like a reporting dashboard query), you would choose a weaker consistency level because it consumes only half the throughput. • The current offering for the Consistency levels is: Strong / Bounded Staleness / Session / Consistent Prefix / Eventual

Consistency Levels in 1 Picture:

Default Consistency Levels: • Strong - Linearizability. Reads are guaranteed to return the most recent version of an item. • Bounded Staleness - Consistent Prefix. Reads lag behind writes by at most k prefixes or a t interval • Session - Consistent Prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads in your geographical location. • Consistent Prefix - Updates returned are some prefix of all the updates, with no gaps. If you applied sequential transactions, the previous ones are available on request. • Eventual - Out of order reads
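The "prefix with no gaps" and "lag by k" definitions above can be made concrete with a small Python sketch. It is a toy model over in-memory lists, not how Cosmos DB implements replication; the function names `is_consistent_prefix` and `within_staleness_bound` are hypothetical, and only the version-count (k) bound is modelled, not the time (t) bound.

```python
from typing import Sequence


def is_consistent_prefix(writes: Sequence[str], observed: Sequence[str]) -> bool:
    """True if `observed` is a gap-free, in-order prefix of the committed `writes`."""
    return len(observed) <= len(writes) and list(writes[: len(observed)]) == list(observed)


def within_staleness_bound(latest_version: int, read_version: int, k: int) -> bool:
    """True if a read lags the latest write by at most k versions (toy model)."""
    return 0 <= latest_version - read_version <= k


if __name__ == "__main__":
    committed = ["A", "B", "C"]
    print(is_consistent_prefix(committed, ["A", "B"]))  # a valid prefix
    print(is_consistent_prefix(committed, ["A", "C"]))  # has a gap: B is missing
    print(within_staleness_bound(latest_version=10, read_version=8, k=2))
```

Under this model, an eventual read may return something like `["A", "C"]` or `["B", "A"]`, which the prefix check rejects.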

Indexing & Consistency Levels:
• Indexing Mode: Consistent. Reads: select from strong, bounded staleness, session, consistent prefix, or eventual. Queries: select from strong, bounded staleness, session, or eventual.
• Indexing Mode: Lazy. Reads: select from strong, bounded staleness, session, consistent prefix, or eventual. Queries: eventual.
• Indexing Mode: None. Reads: select from strong, bounded staleness, session, consistent prefix, or eventual. Queries: eventual.

Throughput • RU – Request Unit • % Memory / % CPU / % IOPS just like for Azure SQLDB • READ / INSERT / UPSERT / DELETE / QUERY - operations • QUERY = Scans + Index Lookups + Query Complexity + Instruction Cost • Everything is calculated by Azure ML

Throughput • RU – Request Unit • 400 RU/sec – 10,000 RU/sec (Collections) • 2,500 RU/sec – Unlimited? RU/sec (Partitioned Collections) • Min Increase / Decrease is 100 RU/sec
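Because throughput changes in steps of 100 RU/sec with a floor of 400 RU/sec for a single collection, an estimated requirement has to be rounded up before it can be provisioned. The sketch below shows that arithmetic only. The function name `provisionable_ru` is hypothetical, the limits are the ones quoted on the slide (they may differ in the current service), and the upper limit is deliberately not enforced.

```python
import math

MIN_RU = 400   # lower bound for a single collection, as quoted on the slide
RU_STEP = 100  # minimum increase/decrease, as quoted on the slide


def provisionable_ru(required_ru: float, minimum: int = MIN_RU, step: int = RU_STEP) -> int:
    """Round an estimated RU/sec requirement up to the next provisionable value."""
    if required_ru < 0:
        raise ValueError("required_ru must be non-negative")
    rounded = math.ceil(required_ru / step) * step
    return max(minimum, rounded)


if __name__ == "__main__":
    for estimate in (120, 401, 1000, 2449.5):
        print(estimate, "->", provisionable_ru(estimate))
```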

Scaling Cosmos DB Up & Out • Scale Up – Increase the number of RUs • Scale Out – Increase the number of partitions for your collections/graphs/tables

Stored Procs, User-Defined Functions, Triggers, etc • Server-side JavaScript programming • Procedural Logic • Atomic Transactions • Batching • Pre-Compilation • Encapsulation

Triggers (validation and Node.JS registration)

Stored Procedures using Javascript API DO NOT!

Azure Functions Are supported 

Real Life Problems • Data Quality (Data Types Casting, Missing Connections) • Complex Questions (joins)

CosmosDB • Introduction (Availability (Ring 0), Consistency, 5 9s, PaaS, Scaling) • Blah • Stored Procedures • UDFs • Triggers

At the Data Centre • Solid State Drives storage (SSD) • Fusion IO 160GB Drives • Fast Private Network Connections

Azure CosmosDB Data Migration Tool • Allows you to migrate your data into the CosmosDB • Supports a range of the sources • Does not support GraphDB ... yet

CosmosDB Query Playground • https://www.documentdb.com/sql/demo

Try CosmosDB for free (need an Azure account): • https://azure.microsoft.com/en-us/try/cosmosdb/

CosmosDB in Azure Storage Explorer

Azure Cosmos DB Emulator Software requirements: • Windows Server 2012 R2, Windows Server 2016, or Windows 10 Minimum Hardware requirements: • 2 GB RAM • 10 GB available hard disk space

CosmosDB: DBAs DBA as in DCT = Data Care Taker

Indexing Policy Modes • Consistent – follows the same consistency level as specified for the point reads (i.e. strong, bounded staleness, session or eventual). The index is updated synchronously as part of the document update. The workload target is “write quickly, query immediately”. • Lazy - To allow maximum document ingestion throughput, an Azure Cosmos DB collection can be configured with lazy consistency, meaning queries are eventually consistent. The index is updated asynchronously when an Azure Cosmos DB collection is quiescent. • None - A collection marked with index mode of “None” has no index associated with it. This is commonly used if Azure Cosmos DB is utilized as a key-value storage and documents are accessed only by their ID property.

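Indexing Policy Modes
Query consistency by account consistency level and indexing mode:
• Strong: Consistent mode – Strong; Lazy mode – Eventual
• Bounded Staleness: Consistent mode – Bounded Staleness; Lazy mode – Eventual
• Session: Consistent mode – Session; Lazy mode – Eventual
• Eventual: Consistent mode – Eventual; Lazy mode – Eventual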

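Indexing Policy Modes with EnableScanInQuery
Query consistency by account consistency level and indexing mode:
• Strong: Consistent mode – Strong; Lazy mode – Eventual; None mode – Strong
• Bounded Staleness: Consistent mode – Bounded Staleness; Lazy mode – Eventual; None mode – Bounded Staleness
• Session: Consistent mode – Session; Lazy mode – Eventual; None mode – Session
• Eventual: Consistent mode – Eventual; Lazy mode – Eventual; None mode – Eventual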

Indexing Paths (Path: Description)
• / : Default path for the collection. Recursive.
• /name/? : Hash or Range indexes for predicates and sorts
• /name/* : Index path for all paths under the specified label (multiple levels down)
• /name/[]/prop/? : Index path required to serve iteration and JOIN queries against arrays of objects like [{prop: "a"}, {prop: "b"}]

Indexes Types, Kinds & Precisions DataTypes: • String • Number • Point • Polygon • LineString

Indexes Types, Kinds & Precisions Index Types: • Hash – Hash Indexes, think Hekaton (Hash Indexes). Supports equality and JOIN queries; for most queries the default precision of 3 bytes is sufficient. DataType can be String or Number. • Range – Range Indexes, think Hekaton (BW-Tree). Supports equality & range queries (<, >, <=, >=, !=) and ORDER BY queries. DataType can be String or Number. • Spatial – Spatial indexes for Points, Polygons & LineStrings. Supports efficient spatial queries (within & distance).

Indexes Precision Lets you trade off between index storage overhead and query performance. For numbers, Microsoft recommends using the default precision -1 (“maximum”). Notice that numbers are 8 bytes in JSON. Picking a smaller precision (1-7) means collisions and hence more RU consumption. For String ranges, which can be of arbitrary length, the index precision can impact the performance of range search queries and impact storage. The precision can be specified between 1 and 100. Important: if you need sorting on the results (ORDER BY), you must specify the precision of 100.

Indexes Inclusion / Exclusion
{
  "includedPaths": [
    {
      "path": "/mainContent/*",
      "indexes": [
        { "kind": "Hash", "dataType": "String", "precision": 20 }
      ]
    }
  ],
  "excludedPaths": [
    { "path": "/nonIndexedContent/*" }
  ]
}
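The same include/exclude structure can be assembled programmatically before it is sent to the service. The following is a minimal sketch using plain Python dictionaries and the standard `json` module. The helper names `string_hash_path` and `build_indexing_policy` are hypothetical and not part of any Cosmos DB SDK; only the field names are taken from the slide.

```python
import json
from typing import Any, Dict, List


def string_hash_path(path: str, precision: int) -> Dict[str, Any]:
    """Build one includedPaths entry with a Hash index on String values."""
    return {
        "path": path,
        "indexes": [{"kind": "Hash", "dataType": "String", "precision": precision}],
    }


def build_indexing_policy(included: List[Dict[str, Any]], excluded_paths: List[str]) -> Dict[str, Any]:
    """Combine included path entries and excluded path strings into one policy document."""
    return {
        "includedPaths": included,
        "excludedPaths": [{"path": p} for p in excluded_paths],
    }


if __name__ == "__main__":
    policy = build_indexing_policy(
        [string_hash_path("/mainContent/*", 20)],
        ["/nonIndexedContent/*"],
    )
    print(json.dumps(policy, indent=2))
```

Building the policy from small helpers keeps the quoting and nesting consistent when many paths are involved.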

Indexing Policy Changes – What for ? • When importing bulk data using the lazy indexing mode for faster writes, then switching to consistent indexing for regular operation. • When reducing the throughput for writes as well as the storage space used, by hand-selecting the properties to be indexed and changing them over time, or by varying the index precision of individual properties. • When using new indexing features on your current DocumentDB collections, like Order By and string range queries, which require the newly introduced string range index kind.

Indexing Policy Changes - how ?

Backup for DBAs: • Every 4 hours (approx.) a backup is taken (to Azure BLOB Storage) • At least 2 backups are stored at all times • If you lost your data, you need to contact Azure Support within 8 hours • Backup retention: 30 days for deleted partitions/databases • If you want to maintain your own snapshots, you can use the export to JSON option in the Azure Cosmos DB Data Migration tool to schedule additional backups.

Backup for DBAs – read carefully: • As soon as corruption is detected, the user should delete the corrupted container (collection/graph/table) so that backups are protected from being overwritten with corrupted data. Source: https://docs.microsoft.com/en-us/azure/cosmosdb/online-backup-and-restore

Backup for DBAs – the alternative: • Extract JSON files of your databases/collections/graphs with the help of the Azure Cosmos DB Data Migration Tool

Global Distribution aka Geo-Replication aka Regional Failover

Manual Failover Scenarios:
• Follow the clock model: If your applications have predictable traffic patterns based on the time of the day, you can periodically change the write status to the most active geographic region based on the time of the day.
• Service update: Certain globally distributed application deployments may involve rerouting traffic to a different region via Traffic Manager during a planned service update. Such deployments can use manual failover to keep the write status in the region that will have active traffic during the service update window.
• Business Continuity and Disaster Recovery (BCDR) and High Availability and Disaster Recovery (HADR) drills: Most enterprise applications include business continuity tests as part of their development and release process. BCDR and HADR testing is often an important step in compliance certifications and in guaranteeing service availability in the case of regional outages. You can test the BCDR readiness of your applications that use Cosmos DB for storage by triggering a manual failover of your Cosmos DB account and/or adding and removing a region dynamically.

Global Distribution aka Geo-Replication aka Regional Failover • Configuration • First, deploy your application in multiple regions • To ensure low-latency access from every region in which your application is deployed, configure the corresponding preferred regions list for each region via one of the supported SDKs.
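The per-region preferred list described above can be sketched as follows. This is plain Python with no SDK calls: the function name `preferred_regions` is hypothetical, the region names are only examples, and the fallback order is assumed to be supplied by you. Each deployed copy of the application would pass its resulting list to the preferred-locations setting of whichever SDK it uses.

```python
from typing import List


def preferred_regions(app_region: str, account_regions: List[str]) -> List[str]:
    """Put the application's own region first; keep the other account regions as ordered fallbacks."""
    if app_region not in account_regions:
        raise ValueError(f"{app_region} is not a region of this account")
    return [app_region] + [r for r in account_regions if r != app_region]


if __name__ == "__main__":
    # Example regions only; use the regions your account is actually replicated to.
    regions = ["West Europe", "East US", "Southeast Asia"]
    for deployed_in in regions:
        print(deployed_in, "->", preferred_regions(deployed_in, regions))
```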

GraphDB • Based on Apache TinkerPop (open source) • Supporting Gremlin & Groovy (How much?) languages

GraphDB - possibilities • Querying across graph collections - not supported right now • Duplicate Edges detection • Duplicate Vertex detection • Betweenness Centrality • Eigenvector (PageRank) • Recommendation (as Products in SSAS) • ...

GraphDB Gremlin querying
• g.V().count(); // Count of vertices (documents)
• g.V().hasLabel('person').has('age', gt(40)); // People aged over 40
• g.V().hasLabel('person').values('firstName'); // List people's first names
Under the hood, the query
• g.V().hasLabel('Azure')
transforms into
• {"query":"SELECT N_2 FROM Node N_2 WHERE (IS_DEFINED(N_2._isEdge) = false AND (N_2.label = 'Azure'))"}

GraphDB Migrations • Neo4J: https://github.com/bsherwin/neo2cosmos • Migration Tool (soon)

Data Migration Tool: • https://www.microsoft.com/en-us/download/details.aspx?id=46436

Limitations: • Returning large amounts of data • No support for GROUP BY (SQL API)

PowerBI • Via Spark - https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuring-Power-BI-Direct-Query-to-Azure-Cosmos-DB-via-Apache-Spark-(HDI)

Geospatial • Working with geospatial and GeoJSON location data in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmosdb/geospatial • Azure Cosmos DB: Expanded geospatial support, including automatic indexing of Polygon and LineString objects: https://azure.microsoft.com/en-us/updates/documentdb-expanded-geospatial-support-including-automatic-indexing-of-polygons-and-lines/

CosmosDB Links • Data Migration Tool: https://www.microsoft.com/en-us/download/details.aspx?id=46436 • Tunable data consistency levels in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels • Use the Azure Cosmos DB Emulator for local development and testing: https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator • Indexing Policies: https://docs.microsoft.com/en-us/azure/cosmos-db/indexing-policies

CosmosDB Links • Gremlin Console: http://tinkerpop.apache.org/docs/current/tutorials/the-gremlin-console/
