Rapid, Scalable Web Development with MongoDB, Ming, and Python

Rick Copeland @rick446 [email_address]

- NoSQL at SourceForge
- Rewriting Consume
- Introducing Ming
- Allura: Open-Sourcing Open Source
- Zarkov: MongoDB-based (near) real-time analytics

SF.net “BlackOps”: FossFor.us
- User Editable!
- Web 2.0! (ish)
- Not Ugly!

Moving to NoSQL
- FossFor.us used CouchDB (NoSQL)
- “Just adding new fields was trivial, and was happening all the time” (Mark Ramm)
- Scaling up to the level of SF.net needs research
  - CouchDB
  - MongoDB
  - Tokyo Cabinet/Tyrant
  - Cassandra
  - ... and others

What we were looking for
- Performance: how does a single node perform?
- Scalability: needs to support simple replication
- Ability to handle complex data and queries
- Ease of development

Rewriting “Consume”
- Most traffic on SF.net hits 3 types of pages:
  - Project Summary
  - File Browser
  - Download
- Pages are read-mostly, with infrequent updates from the “Develop” side of sf.net
- The original goal was 1 MongoDB document per project
- Release data was later split out, because some projects have lots of releases
- Periodic updates arrive via RSS and AMQP from “Develop”

Deployment Architecture
[Diagram: Load Balancer / Proxy; four web servers, each running Apache mod_wsgi / TG 2.0 with a MongoDB Slave; Master DB Server running the MongoDB Master; Gobble Server; Develop]

Deployment Architecture (revised)
[Diagram: Load Balancer / Proxy; four web servers running Apache mod_wsgi / TG 2.0, with no MongoDB Slaves; Master DB Server running the MongoDB Master; Gobble Server; Develop]
- Scalability is good
- Single-node performance is good, too

Ming: an “Object-Document Mapper?”
- Your data has a schema
  - Your database can define and enforce it
  - It can live in your application (as with MongoDB)
  - Nice to have the schema defined in one place in the code
- Sometimes you need a “migration”
  - Changing the structure/meaning of fields
  - Adding indexes, particularly unique indexes
  - Sometimes lazy, sometimes eager
- “Unit of work”: queuing up all your updates can be handy
- Python dicts are nice; objects are nicer

Ming Concepts
- Inspired by SQLAlchemy
- Group of collection objects with schemas defined
- Group of classes to which you map your collections
- Use collection-level operations for performance
- Use class-level operations for abstraction
- Convenience methods for loading/saving objects and ensuring indexes are created
- Migrations
- Unit of Work: great for web applications
- MIM (“Mongo in Memory”): nice for unit tests
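To make the “Unit of Work” idea concrete, here is a minimal pure-Python sketch. It is not Ming's implementation: the `UnitOfWork` class and its `save` callable are hypothetical names used only to show how queued updates collapse into a single flush.

```python
class UnitOfWork:
    """Collects modified documents and writes them out in one flush."""

    def __init__(self, save):
        self._save = save    # callable that persists one document (assumed)
        self._dirty = {}     # id(doc) -> doc, so each document is saved once

    def mark_dirty(self, doc):
        """Queue a document for the next flush."""
        self._dirty[id(doc)] = doc

    def flush(self):
        """Persist every queued document, clear the queue, return the count."""
        count = 0
        for doc in list(self._dirty.values()):
            self._save(doc)
            count += 1
        self._dirty.clear()
        return count


saved = []                            # stands in for the database
uow = UnitOfWork(save=saved.append)
page = {'_id': 1, 'title': 'Home'}
uow.mark_dirty(page)
page['title'] = 'Home (edited)'
uow.mark_dirty(page)                  # second change to the same document
flushed = uow.flush()                 # one write at the end of the request
```

In a web application the flush would typically happen once, at the end of each request, which is why the pattern fits that setting well.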

Ming Example

    from ming import collection, Field, schema
    from ming.orm import (mapper, Mapper, RelationProperty,
                          ForeignIdProperty)

    # `session` and `ormsession` are assumed to be configured elsewhere.

    # Collection-level schemas
    WikiDoc = collection(
        'wiki_page', session,
        Field('_id', schema.ObjectId()),
        Field('title', str, index=True),
        Field('text', str))

    CommentDoc = collection(
        'comment', session,
        Field('_id', schema.ObjectId()),
        Field('page_id', schema.ObjectId(), index=True),
        Field('text', str))

    # Plain classes that the collections are mapped onto
    class WikiPage(object): pass
    class Comment(object): pass

    ormsession.mapper(WikiPage, WikiDoc, properties=dict(
        comments=RelationProperty('Comment')))
    ormsession.mapper(Comment, CommentDoc, properties=dict(
        page_id=ForeignIdProperty('WikiPage'),
        page=RelationProperty('WikiPage')))

    Mapper.compile_all()

Python / MongoDB Taking Over…. Allura
[Diagram: Load Balancer / Proxy; four web servers running Apache mod_wsgi / TG 2.0; Master DB Server running the MongoDB Master; Gobble Server; Develop]
- Scalability is good
- Single-node performance is good, too

Allura Architecture
- Web-facing App Server
- Task Daemon
- SMTP Server
- FUSE Filesystem (repository hosting)

Allura Threaded Discussions

    MessageDoc = collection(
        'message', project_doc_session,
        Field('_id', str, if_missing=h.gen_message_id),
        Field('slug', str, if_missing=h.nonce),
        Field('full_slug', str),
        Field('parent_id', str),
        # … (remaining fields elided on the slide)
    )

- _id: use an email Message-ID compatible key
- slug: threaded path of random 4-digit hex numbers, prefixed by the parent's slug (e.g. dead/beef/f00d → dead/beef → dead)
- full_slug: slug interspersed with ISO-formatted message datetime (20110627…dead/20110627…beef….)
- Easy queries for hierarchical data
  - Find all descendants of a message: slug prefix search “dead/.*”
  - Sort messages by thread, then by date: full_slug sort
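The slug scheme above can be sketched in plain Python. This is an illustration under assumptions, not Allura's code: `nonce`, `make_slugs` and `descendants_query` are hypothetical helpers, and the `YYYYmmddHHMMSS:part` layout of each `full_slug` segment is a guess at the format the slide abbreviates.

```python
import re
import secrets
from datetime import datetime, timezone


def nonce(length=4):
    """Random hex string such as 'beef' (stand-in for the slide's h.nonce)."""
    return secrets.token_hex(length // 2)


def make_slugs(parent=None, now=None):
    """Return (slug, full_slug) for a new message under an optional parent."""
    now = now or datetime.now(timezone.utc)
    part = nonce()
    stamped = now.strftime('%Y%m%d%H%M%S') + ':' + part  # assumed layout
    if parent is None:
        return part, stamped
    return parent['slug'] + '/' + part, parent['full_slug'] + '/' + stamped


def descendants_query(slug):
    """MongoDB filter matching every message below `slug` in the thread."""
    return {'slug': {'$regex': '^' + re.escape(slug) + '/'}}


root_slug, root_full = make_slugs(now=datetime(2011, 6, 27, 12, 0, 0))
root = {'slug': root_slug, 'full_slug': root_full}
reply_slug, reply_full = make_slugs(root, now=datetime(2011, 6, 27, 12, 5, 0))

# Sorting by full_slug puts a reply directly after its parent.
thread_order = sorted([reply_full, root_full])
```

Because the regular expression is anchored at the start of the string, MongoDB can serve the descendant search from an index on `slug`, which is what makes the prefix design cheap.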

MonQ: Async Queueing in MongoDB

    from datetime import datetime

    states = ('ready', 'busy', 'error', 'complete')
    result_types = ('keep', 'forget')

    MonQTaskDoc = collection(
        'monq_task', main_doc_session,
        Field('_id', schema.ObjectId()),
        Field('state', schema.OneOf(*states)),
        Field('result_type', schema.OneOf(*result_types)),
        Field('time_queue', datetime),
        Field('time_start', datetime),
        Field('time_stop', datetime),
        # dotted path to function
        Field('task_name', str),
        # worker process name: “locks” the task
        Field('process', str),
        Field('context', dict(
            project_id=schema.ObjectId(),
            app_config_id=schema.ObjectId(),
            user_id=schema.ObjectId())),
        Field('args', list),
        Field('kwargs', {None: None}),
        Field('result', None, if_missing=None))
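One way a worker might “lock” a task using the `state` and `process` fields is an atomic find-and-update. The sketch below is an assumption about how such a queue could be driven, not Allura's actual worker code; `claim_spec` and `claim_task` are hypothetical names, and `tasks` is expected to be a pymongo collection.

```python
from datetime import datetime, timezone


def claim_spec(worker_name, now=None):
    """Build (filter, update, sort) for atomically claiming one ready task."""
    now = now or datetime.now(timezone.utc)
    task_filter = {'state': 'ready'}
    update = {'$set': {'state': 'busy',
                       'process': worker_name,
                       'time_start': now}}
    sort = [('time_queue', 1)]   # oldest queued task first
    return task_filter, update, sort


def claim_task(tasks, worker_name):
    """Claim one task from a pymongo collection, or return None if idle."""
    task_filter, update, sort = claim_spec(worker_name)
    # find_one_and_update runs as a single atomic operation on the server,
    # so two workers cannot both move the same task from 'ready' to 'busy'.
    return tasks.find_one_and_update(task_filter, update, sort=sort)
```

By default pymongo returns the document as it was before the update, so a caller that needs the new `state` or `time_start` would have to ask for the updated document explicitly.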

Repository Cache Objects
- On commit to a repo (Hg, SVN, or Git)
  - Build commit graph in MongoDB for new commits
  - Build auxiliary structures
    - tree structure, including all trees in a commit & last commit to modify
    - linear commit runs (useful for generating history)
    - commit difference summary (must be computed in Hg and Git)
  - Note references to other artifacts and commits
- Repo browser uses cached structure to serve pages
[Diagram labels: Commit, CommitRun, LastCommit, Tree, Trees, DiffInfo]

Repository Cache Lessons Learned
- Using MongoDB to represent graph structures (commit graph, commit trees) requires careful query planning. Pointer-chasing is no fun!
- Sometimes Ming validation and ORM overhead can be prohibitively expensive: time to drop down a layer.
- Benchmarking and profiling are your friends, as are queries like {'_id': {'$in': […]}} for returning multiple objects
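The `$in` advice can be sketched as a small batching helper that replaces one query per object with one query per chunk of ids. The helper names and the chunk size of 500 are illustrative assumptions; `coll` is expected to behave like a pymongo collection.

```python
def chunked(ids, size):
    """Yield successive lists of at most `size` ids."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def in_queries(ids, size=500):
    """One {'_id': {'$in': [...]}} filter per chunk instead of one per id."""
    return [{'_id': {'$in': chunk}} for chunk in chunked(ids, size)]


def fetch_by_ids(coll, ids, size=500):
    """Load the documents for `ids`, keyed by _id, in a few round trips."""
    found = {}
    for query in in_queries(ids, size):
        for doc in coll.find(query):
            found[doc['_id']] = doc
    return found
```

Walking a commit graph this way means fetching all parents of the current frontier in one query rather than chasing each pointer separately.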

And now, for something completely different…
- Business: we need more visibility into what users are doing
- Low overhead
- Near real-time
- Unified view of lots of systems
  - Python
  - PHP
  - Perl

Introducing Zarkov
- Asynchronous TCP server for event logging with gevent
- Turn OFF “safe” writes, turn OFF Ming validation (or do it in the client)
- Incrementally calculate aggregate stats based on event log using mapreduce with {'out': 'reduce'}
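The incremental aggregation idea can be simulated in plain Python. With MongoDB's map-reduce `reduce` output action, the reduce function is applied again to merge each new batch with the values already stored. The sketch below mimics that behaviour in memory; the event fields `type` and `timestamp` and all function names are hypothetical, and this is not Zarkov's code.

```python
from collections import Counter


def map_event(event):
    """Emit (key, value) pairs: here, one count per (type, day)."""
    yield (event['type'], event['timestamp'][:10]), 1


def reduce_values(values):
    """Associative and commutative, so it can re-reduce its own output."""
    return sum(values)


def aggregate_batch(events, totals):
    """Fold a batch of new events into `totals`, as a 'reduce' output would."""
    batch = Counter()
    for event in events:
        for key, value in map_event(event):
            batch[key] = reduce_values([batch[key], value])
    for key, value in batch.items():
        # Re-reduce the stored aggregate together with the new batch result.
        totals[key] = reduce_values([totals.get(key, 0), value])
    return totals


totals = {}
aggregate_batch([{'type': 'download', 'timestamp': '2011-06-27T12:00:00'},
                 {'type': 'download', 'timestamp': '2011-06-27T12:01:00'}],
                totals)
aggregate_batch([{'type': 'download', 'timestamp': '2011-06-27T13:00:00'}],
                totals)
```

Only events newer than the last run need to be scanned on each pass, which is what keeps the statistics close to real time.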

What We Liked
- Performance, performance, performance: easily handle 90% of SF.net traffic from 1 DB server, 4 web servers
- Dynamic schema allows fast schema evolution in development, making many migrations unnecessary
- Replication is easy, making scalability and backups easy
- Query Language: you mean I can have performance without map-reduce?
- GridFS

Pitfalls
- Too-large documents
  - Store less per document
  - Return only a few fields
- Ignoring indexing
  - Watch your server log; bad queries show up there
- Too much denormalization
  - Try to use an index if all you need is a backref
  - Stale data is a tricky problem
- Using many databases when one will do
- Using too many queries
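“Return only a few fields” translates to passing a projection with the query. A minimal sketch, assuming a pymongo-style `projects` collection; the field names `shortname`, `name`, `summary` and `download_count` are invented for illustration.

```python
def load_summary(projects, shortname):
    """Fetch only the fields a summary page needs, not the whole document."""
    projection = {'name': 1, 'summary': 1, 'download_count': 1}
    return projects.find_one({'shortname': shortname}, projection)
```

The server then sends back just those fields (plus `_id`), which matters most when documents have grown large.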

Open Source
- Ming: http://sf.net/projects/merciless/ (MIT License)
- Allura: http://sf.net/p/allura/ (Apache License)
- Zarkov: http://sf.net/p/zarkov/ (Apache License)

Future Work
- mongos
- New Allura Tools
- Migrating legacy SF.net projects to Allura
- Continue to optimize stats & analytics (Zarkov and others)
- Better APIs to access your project data

Rick Copeland @rick446 [email_address]
