Scaling the Web: Databases & NoSQL Richard Schneeman Wed Nov 10 @schneems works for @Gowalla 2011
whoami • @Schneems • BSME with Honors from Georgia Tech • 5 + years experience Ruby & Rails • Work for @Gowalla • Rails 3.1 contributor : ) • 3 + years technical teaching
Traffic
Compounding Traffic ex. Wikipedia
Compounding Traffic ex. Wikipedia
Gowalla
Gowalla • 50 best websites NYTimes 2010 • Founded 2009 @ SXSW • 1 million+ Users • Undisclosed Visitors • Loves/highlights/comments/stories/guides • Facebook/Foursquare/Twitter integration • iphone/android/web apps • public API
Gowalla Backend • Ruby on Rails • Uses the Ruby Language • Rails is the Framework
The Web is Data • Username => String • Birthday => Int/ Int/ Int • Blog Post => Text • Image => Binary-file/blob Data needs to be stored to be useful
Database
Gowalla Database • PostgreSQL • Relational (RDBMS) • Open Source • Competitor to MySQL • ACID compliant • Running on a Dedicated Managed Server
Need for Speed • Throughput: • The number of operations per minute that can be performed • Pure Speed: • How long an individual operation takes.
Potential Problems • Hardware • Slow Network • Slow hard-drive • Insufficient CPU • Insufficient Ram • Software • too many Reads • too many Writes
Scaling Up versus Out • Scale Up: • More CPU, Bigger HD, More Ram etc. • Scale Out: • More machines • More machines • More machines • ...
Scale Up • Bigger faster machine • More Ram • More CPU • Bigger ethernet bus • ... • Moores Law • Diminishing returns
Scale Out • Forget Moores law... • Add more nodes • Master/ Slave Database • Sharding
Master/Slave Write Master DB Copy Slave DB Slave DB Slave DB Slave DB Read
Master & Slave +/- • Pro • Increased read speed • Takes read load off of master • Allows us to Join across all tables • Con • Doesn’t buy increased write throughput • Single Point of Failure in Master Node
Sharding Write Users in Users in Users in Users in USA Europe Asia Africa Read
Sharding +/- • Pro • Increased Write & Read throughput • No Single Point of failure • Individual features can fail • Con • Cannot Join queries between shards
What is a Database? • Relational Database Managment System (RDBMS) • Stores Data Using Schema • A.C.I.D. compliant • Atomic • Consistent • Isolated • Durable
RDBMS • Relational • Matches data on common characteristics in data • Enables “Join” & “Union” queries • Makes data modular
Relational +/- • Pros • Data is modular • Highly flexible data layout • Cons • Getting desired data can be tricky • Over modularization leads to many join queries • Trade off performance for search-ability
Schema Storage • Blueprint for data storage • Break data into tables/columns/rows • Give data types to your data • Integer • String • Text • Boolean • ...
Schema +/- • Pros • Regularize our data • Helps keep data consistent • Converts to programming “types” easily • Cons • Must seperatly manage schema • Adding columns & indexes to existing large tables can be painful & slow
ACID • Properties that guarante a reliably transaction are processed database • Atomic • Consistent • Isolated • Durable
ACID • Atomic • Any database Transaction is all or nothing. • If one part of the transaction fails it all fails “An Incomplete Transaction Cannot Exist”
ACID • Consistent • Any transaction will take the another from one consistent state to database “Only Consistent data is allowed to be written”
ACID • Isolated • No transaction should be able to interfere with another transaction “the same field cannot be updated by two sources at the exact same time” } a = 0 a += 1 a = ?? a += 2
ACID • Durable • Onceway that a transaction Is committed it will stay “Save it once, read it forever”
What is a Database? • RDBMS • Relational • Flexible • Has a schema • Most likely ACID compliant • Typically fast under low load or when optimized
What is SQL? • Structured Query Language • The language databases speak • Based on relational algebra • Insert • Query • Update • Delete “SELECT Company, Country FROM Customers WHERE Country = 'USA' ”
Why people <3 SQL • Relational algebra is powerful • SQL is proven • well understood • well documented
Why people </3 SQL • Relational algebra Is hard • Different databases support different SQL syntax • Yet another programming language to learn
SQL != Database • SQL is used to talk to a RDBMS (database) • SQL is not a RDBMS
What is NoSQL? Not A Relational Database
RDBMS
Types of NoSQL • Distributed Systems • Document Store • Graph Database • Key-Value Store • Eventually Consistent Systems Mix And Match ↑
Key Value Stores • Non Relational • Typically No Schema • Map one Key (a string) to a Value (some object) Example: Redis
Key Value Example redis = Redis.new redis.set(“foo”, “bar”) redis.get(“foo”) >> “bar”
Key Value Example redis = Redis.new Key Value redis.set(“foo”, “bar”) Key redis.get(“foo”) Value >> “bar”
Key Value • Like a databse that can only ever use primary Key (id) YES select * from users where id = ‘3’; NO select * from users where name = ‘schneems’;
NoSQL @ Gowalla • Redis (key-value store) • Store “Likes” & Analytics • Memcache (key-value store) • Cache Database results • Cassandra • (eventually consistent, with-schema, key value store) • Store “feeds” or “timelines” • Solr (search index)
Memcache • Key-Value Store • Open Source • Distributed • In memory (ram) only • fast, but volatile • Not ACID • Memory object caching system
Memcache Example memcache = Memcache.new memcache.set(“foo”, “bar”) memcache.get(“foo”) >> “bar”
Memcache • Can store whole objects memcache = Memcache.new user = User.where(:username => “schneems”) memcache.set(“user:3”, user) user_from_cache = memcache.get(“user:3”) user_from_cache == user >> true user_from_cache.username >> “Schneems”
Memcache @ Gowalla • Cache Common Queries • Decreases Load on DB (postgres) • Enables higher throughput from DB • Faster response than DB • Users see quicker page load time
What to Cache? • Objects that change infrequently • users • spots (places) • etc. • Expensive(ish) sql queries • Friend ids for users • User ids for people visiting spots • etc.
Memcache Distributed A C B
Memcache Distributed Easily add more nodes A D B C
Memcache <3’s DB • We use them Together • If memcache doesn’t have a value • Fetch from the database • Set the key from database • Hard • Cache Invalidation : (
Redis • Key Value Store • Open Source • Not Distributed (yet) • Extremely Quick • “Data structure server”
Redis Example, again redis = Redis.new redis.set(“foo”, “bar”) redis.get(“foo”) >> “bar”
Redis - Has Data Types • Strings • Hashes • Lists • Sets • Sorted Sets
Redis Example, sets redis = Redis.new redis.sadd(“foo”, “bar”) redis.members(“foo”) >> [“bar”] redis.sadd(“foo”, “fly”) redis.members(“foo”) >> [“bar”, “fly”]
Redis => Likeable • Very Fast response • ~ 50 queries per page view • ~ 1 ms per query • http://github.com/Gowalla/likeable
Cassandra • Open Source • Distributed • Key Value Store • Eventually Consistent • Sortof not ACID • Uses A Schema • ColumnFamilies
Cassandra Distributed Eventual Consistency A D Copied To Extra Nodes ... Eventually Data In B C
Cassandra { @ Gowalla Activity Feeds
Cassandra @ Gowalla • Chronologic • http://github.com/Gowalla/chronologic
Should I use NoSQL?
Which One?
Pick the right tool
Tradeoffs • Every Data store has them • Know your data store • Strengths • Weaknesses
NoSQL vs. RDBMS • No Magic Bullet • Use Both!!! • Model data in a datastore you understand • Switch to when/if you need to • Understand Your Options
Questions? Richard Schneeman @schneems works for @Gowalla

Scaling the Web: Databases & NoSQL

  • 1.
    Scaling the Web: Databases& NoSQL Richard Schneeman Wed Nov 10 @schneems works for @Gowalla 2011
  • 2.
    whoami • @Schneems • BSMEwith Honors from Georgia Tech • 5 + years experience Ruby & Rails • Work for @Gowalla • Rails 3.1 contributor : ) • 3 + years technical teaching
  • 3.
  • 4.
    Compounding Traffic ex. Wikipedia
  • 5.
    Compounding Traffic ex. Wikipedia
  • 6.
  • 7.
    Gowalla • 50 bestwebsites NYTimes 2010 • Founded 2009 @ SXSW • 1 million+ Users • Undisclosed Visitors • Loves/highlights/comments/stories/guides • Facebook/Foursquare/Twitter integration • iphone/android/web apps • public API
  • 9.
    Gowalla Backend • Rubyon Rails • Uses the Ruby Language • Rails is the Framework
  • 10.
    The Web isData • Username => String • Birthday => Int/ Int/ Int • Blog Post => Text • Image => Binary-file/blob Data needs to be stored to be useful
  • 11.
  • 12.
    Gowalla Database • PostgreSQL • Relational (RDBMS) • Open Source • Competitor to MySQL • ACID compliant • Running on a Dedicated Managed Server
  • 13.
    Need for Speed •Throughput: • The number of operations per minute that can be performed • Pure Speed: • How long an individual operation takes.
  • 14.
    Potential Problems • Hardware • Slow Network • Slow hard-drive • Insufficient CPU • Insufficient Ram • Software • too many Reads • too many Writes
  • 15.
    Scaling Up versusOut • Scale Up: • More CPU, Bigger HD, More Ram etc. • Scale Out: • More machines • More machines • More machines • ...
  • 16.
    Scale Up • Biggerfaster machine • More Ram • More CPU • Bigger ethernet bus • ... • Moores Law • Diminishing returns
  • 17.
    Scale Out • ForgetMoores law... • Add more nodes • Master/ Slave Database • Sharding
  • 18.
    Master/Slave Write Master DB Copy Slave DB Slave DB Slave DB Slave DB Read
  • 19.
    Master & Slave+/- • Pro • Increased read speed • Takes read load off of master • Allows us to Join across all tables • Con • Doesn’t buy increased write throughput • Single Point of Failure in Master Node
  • 20.
    Sharding Write Users in Users in Users in Users in USA Europe Asia Africa Read
  • 21.
    Sharding +/- • Pro • Increased Write & Read throughput • No Single Point of failure • Individual features can fail • Con • Cannot Join queries between shards
  • 22.
    What is aDatabase? • Relational Database Managment System (RDBMS) • Stores Data Using Schema • A.C.I.D. compliant • Atomic • Consistent • Isolated • Durable
  • 23.
    RDBMS • Relational • Matches data on common characteristics in data • Enables “Join” & “Union” queries • Makes data modular
  • 24.
    Relational +/- • Pros • Data is modular • Highly flexible data layout • Cons • Getting desired data can be tricky • Over modularization leads to many join queries • Trade off performance for search-ability
  • 25.
    Schema Storage • Blueprintfor data storage • Break data into tables/columns/rows • Give data types to your data • Integer • String • Text • Boolean • ...
  • 26.
    Schema +/- • Pros • Regularize our data • Helps keep data consistent • Converts to programming “types” easily • Cons • Must seperatly manage schema • Adding columns & indexes to existing large tables can be painful & slow
  • 27.
    ACID • Properties thatguarante a reliably transaction are processed database • Atomic • Consistent • Isolated • Durable
  • 28.
    ACID • Atomic • Anydatabase Transaction is all or nothing. • If one part of the transaction fails it all fails “An Incomplete Transaction Cannot Exist”
  • 29.
    ACID • Consistent • Anytransaction will take the another from one consistent state to database “Only Consistent data is allowed to be written”
  • 30.
    ACID • Isolated • Notransaction should be able to interfere with another transaction “the same field cannot be updated by two sources at the exact same time” } a = 0 a += 1 a = ?? a += 2
  • 31.
    ACID • Durable • Onceway that a transaction Is committed it will stay “Save it once, read it forever”
  • 32.
    What is aDatabase? • RDBMS • Relational • Flexible • Has a schema • Most likely ACID compliant • Typically fast under low load or when optimized
  • 33.
    What is SQL? • Structured Query Language • The language databases speak • Based on relational algebra • Insert • Query • Update • Delete “SELECT Company, Country FROM Customers WHERE Country = 'USA' ”
  • 34.
    Why people <3SQL • Relational algebra is powerful • SQL is proven • well understood • well documented
  • 35.
    Why people </3SQL • Relational algebra Is hard • Different databases support different SQL syntax • Yet another programming language to learn
  • 36.
    SQL != Database •SQL is used to talk to a RDBMS (database) • SQL is not a RDBMS
  • 37.
    What is NoSQL? Not A Relational Database
  • 38.
  • 39.
    Types of NoSQL •Distributed Systems • Document Store • Graph Database • Key-Value Store • Eventually Consistent Systems Mix And Match ↑
  • 40.
    Key Value Stores •Non Relational • Typically No Schema • Map one Key (a string) to a Value (some object) Example: Redis
  • 41.
    Key Value Example redis= Redis.new redis.set(“foo”, “bar”) redis.get(“foo”) >> “bar”
  • 42.
    Key Value Example redis= Redis.new Key Value redis.set(“foo”, “bar”) Key redis.get(“foo”) Value >> “bar”
  • 43.
    Key Value • Like a databse that can only ever use primary Key (id) YES select * from users where id = ‘3’; NO select * from users where name = ‘schneems’;
  • 44.
    NoSQL @ Gowalla •Redis (key-value store) • Store “Likes” & Analytics • Memcache (key-value store) • Cache Database results • Cassandra • (eventually consistent, with-schema, key value store) • Store “feeds” or “timelines” • Solr (search index)
  • 45.
    Memcache • Key-Value Store •Open Source • Distributed • In memory (ram) only • fast, but volatile • Not ACID • Memory object caching system
  • 46.
    Memcache Example memcache =Memcache.new memcache.set(“foo”, “bar”) memcache.get(“foo”) >> “bar”
  • 47.
    Memcache •Can store whole objects memcache = Memcache.new user = User.where(:username => “schneems”) memcache.set(“user:3”, user) user_from_cache = memcache.get(“user:3”) user_from_cache == user >> true user_from_cache.username >> “Schneems”
  • 48.
    Memcache @ Gowalla •Cache Common Queries • Decreases Load on DB (postgres) • Enables higher throughput from DB • Faster response than DB • Users see quicker page load time
  • 49.
    What to Cache? •Objects that change infrequently • users • spots (places) • etc. • Expensive(ish) sql queries • Friend ids for users • User ids for people visiting spots • etc.
  • 50.
  • 51.
    Memcache Distributed Easily add more nodes A D B C
  • 52.
    Memcache <3’s DB •We use them Together • If memcache doesn’t have a value • Fetch from the database • Set the key from database • Hard • Cache Invalidation : (
  • 53.
    Redis • Key ValueStore • Open Source • Not Distributed (yet) • Extremely Quick • “Data structure server”
  • 54.
    Redis Example, again redis= Redis.new redis.set(“foo”, “bar”) redis.get(“foo”) >> “bar”
  • 55.
    Redis - HasData Types • Strings • Hashes • Lists • Sets • Sorted Sets
  • 56.
    Redis Example, sets redis= Redis.new redis.sadd(“foo”, “bar”) redis.members(“foo”) >> [“bar”] redis.sadd(“foo”, “fly”) redis.members(“foo”) >> [“bar”, “fly”]
  • 57.
    Redis => Likeable •Very Fast response • ~ 50 queries per page view • ~ 1 ms per query • http://github.com/Gowalla/likeable
  • 58.
    Cassandra • Open Source •Distributed • Key Value Store • Eventually Consistent • Sortof not ACID • Uses A Schema • ColumnFamilies
  • 59.
    Cassandra Distributed Eventual Consistency A D Copied To Extra Nodes ... Eventually Data In B C
  • 60.
    Cassandra { @ Gowalla Activity Feeds
  • 61.
    Cassandra @ Gowalla •Chronologic • http://github.com/Gowalla/chronologic
  • 62.
  • 63.
  • 64.
  • 65.
    Tradeoffs • Every Datastore has them • Know your data store • Strengths • Weaknesses
  • 66.
    NoSQL vs. RDBMS •No Magic Bullet • Use Both!!! • Model data in a datastore you understand • Switch to when/if you need to • Understand Your Options
  • 67.