Building A directed graph with mongodbMongoSF 5/24/2011By Tony Tam @fehguy
Who is wordnikWord + Meaning Discovery EngineClustered Application built with:Scala/Java/JettyOnly way in is via REST19M API calls/day @ 7ms/query averagePhysical servers72GB RAM, 8 core4.3TB DASWe’re MongoDB users for ~1.5 yrsUsed in master/slave14B documents in MongoDB
Why a graph for wordsTechnique to model network relationshipsProperties are dynamicLinks are “arbitrary”Runtime performanceAnswers in < 5ms/requestRouting functions based on goals“find most likely word for X”“find more common form of Y”
Why a graph for wordsMisspellings, abbreviations, texting, Twitter
More about graphsDifferent types of GraphsDecisions have huge impact on design + implementationNodes (vertices)String and numeric propertiesEdges (links)Finite set of labeled edge types (~30)Multiple target nodes per edgeEach potentially different weightDirected, non-symmetrical
Why build on Mongodb?Word Graph is core to WordnikMany ways to build a graphDedicated graph DBsRelational DBsMongoDB Document StorageUber-flexibleSuccessfully routes in < 5msLong runway for scale-outLimit storage infrastructure componentsEasy to implement
Wordnik graph data modelNodes_id field holds name, object typeIndex at no extra costArbitrary number of propertiesOnly two datatypes for us, String, DoubleNode type info in node ID (_id)na_corpusCount => Doublesa_source => String
Wordnik graph data modelEdgesDestination(s)WeightLink PropertiesStored in Mongo ArraysArray size is app limitedUse $push, $pop
Access to mongoMongo Access via DAO layerLimit queries to ones that work“well”ALL queries use indexFind Node “cat” of type “word”:db.node.findOne({_id:"cat|word"})Find Edge types for above:db.edge.find({_id:/^cat\|word\|/},{_id:1})Serialization/deserialization Done “the old-fashioned way”BasicDBObject, BasicDBList faster than mappers for our use case
Query efficiencyMax execution time is f (ahops)
Routing, traversals, functionsTypically find path from A to BRoutes have costsLow cost or high probabilityOur use case is atypicalLinkedIn vs. MapsNot from A to BMore like “from A with 3 hops”This matters!
Performance + Scaling
Performance + scalingQuery by index onlyUse regex syntax in restricted fashionStarts with onlyNo look behindCase sensitiveBoring? Fast?Sharding is a no-brainerWhat about ObjectId()?
Performance + scalingHorizontal? Vertical? Both? And when?Separate collections by edge type/object typeIncreases storage needsCollections all have padding, 30 collections => ~30x paddingShardingUse slick, built-in Mongo shardingRoll your own based on your dataWhat does Wordnik do?Neither! (yet)30M Nodes, 50M EdgesOne collection for nodesOne collection for edges
Performance + scalingSelecting a shard keyDone in application logic based on OUR dataDepends on what you need
End resultSolves Wordnik Graph infrastructure needsStore Word nodes with UGC, corpus, structured, analytical dataBatch fetch Edges @ > 50k/secondFind Edge + endpoints in 80mS Powers our…Word SelectionCanonicalizationMisspelling“Did you mean” logicClassification + Matching Engine
ExamplesMisspellingsAbbreviationsLemmatization
ExamplesTerm normalizationFind similar wordsMeaning normalizationFind “more common” form
examplesApplied Word GraphRecall:“Computers are stupid”English is complexClustering + classification algorithms:Stink without consistent data“The” => “the” (duh)“geese” => “goose” (ok)Stink when they’re slowGraph + Clustering/ClassificationJust add data
MongoDB makes a Great graph back-endSee more about Wordnik APIs:http://developer.wordnik.comFurther ReadingMigrating from MySQL to MongoDBhttp://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnikMaintaining your MongoDB Installationhttp://www.slideshare.net/fehguy/mongo-sv-tony-tamSource CodeMapping Benchmarkhttps://github.com/fehguy/mongodb-benchmark-toolsWordnik OSS Tools https://github.com/wordnik/wordnik-oss
MongoDB makes a Great graph back-endQuestions?

Building a Directed Graph with MongoDB

  • 1.
    Building A directedgraph with mongodbMongoSF 5/24/2011By Tony Tam @fehguy
  • 2.
    Who is wordnikWord+ Meaning Discovery EngineClustered Application built with:Scala/Java/JettyOnly way in is via REST19M API calls/day @ 7ms/query averagePhysical servers72GB RAM, 8 core4.3TB DASWe’re MongoDB users for ~1.5 yrsUsed in master/slave14B documents in MongoDB
  • 3.
    Why a graphfor wordsTechnique to model network relationshipsProperties are dynamicLinks are “arbitrary”Runtime performanceAnswers in < 5ms/requestRouting functions based on goals“find most likely word for X”“find more common form of Y”
  • 4.
    Why a graphfor wordsMisspellings, abbreviations, texting, Twitter
  • 5.
    More about graphsDifferenttypes of GraphsDecisions have huge impact on design + implementationNodes (vertices)String and numeric propertiesEdges (links)Finite set of labeled edge types (~30)Multiple target nodes per edgeEach potentially different weightDirected, non-symmetrical
  • 6.
    Why build onMongodb?Word Graph is core to WordnikMany ways to build a graphDedicated graph DBsRelational DBsMongoDB Document StorageUber-flexibleSuccessfully routes in < 5msLong runway for scale-outLimit storage infrastructure componentsEasy to implement
  • 7.
    Wordnik graph datamodelNodes_id field holds name, object typeIndex at no extra costArbitrary number of propertiesOnly two datatypes for us, String, DoubleNode type info in node ID (_id)na_corpusCount => Doublesa_source => String
  • 8.
    Wordnik graph datamodelEdgesDestination(s)WeightLink PropertiesStored in Mongo ArraysArray size is app limitedUse $push, $pop
  • 9.
    Access to mongoMongoAccess via DAO layerLimit queries to ones that work“well”ALL queries use indexFind Node “cat” of type “word”:db.node.findOne({_id:"cat|word"})Find Edge types for above:db.edge.find({_id:/^cat\|word\|/},{_id:1})Serialization/deserialization Done “the old-fashioned way”BasicDBObject, BasicDBList faster than mappers for our use case
  • 10.
  • 11.
    Routing, traversals, functionsTypicallyfind path from A to BRoutes have costsLow cost or high probabilityOur use case is atypicalLinkedIn vs. MapsNot from A to BMore like “from A with 3 hops”This matters!
  • 12.
  • 13.
    Performance + scalingQueryby index onlyUse regex syntax in restricted fashionStarts with onlyNo look behindCase sensitiveBoring? Fast?Sharding is a no-brainerWhat about ObjectId()?
  • 14.
    Performance + scalingHorizontal?Vertical? Both? And when?Separate collections by edge type/object typeIncreases storage needsCollections all have padding, 30 collections => ~30x paddingShardingUse slick, built-in Mongo shardingRoll your own based on your dataWhat does Wordnik do?Neither! (yet)30M Nodes, 50M EdgesOne collection for nodesOne collection for edges
  • 15.
    Performance + scalingSelectinga shard keyDone in application logic based on OUR dataDepends on what you need
  • 16.
    End resultSolves WordnikGraph infrastructure needsStore Word nodes with UGC, corpus, structured, analytical dataBatch fetch Edges @ > 50k/secondFind Edge + endpoints in 80mS Powers our…Word SelectionCanonicalizationMisspelling“Did you mean” logicClassification + Matching Engine
  • 17.
  • 18.
    ExamplesTerm normalizationFind similarwordsMeaning normalizationFind “more common” form
  • 19.
    examplesApplied Word GraphRecall:“Computersare stupid”English is complexClustering + classification algorithms:Stink without consistent data“The” => “the” (duh)“geese” => “goose” (ok)Stink when they’re slowGraph + Clustering/ClassificationJust add data
  • 20.
    MongoDB makes aGreat graph back-endSee more about Wordnik APIs:http://developer.wordnik.comFurther ReadingMigrating from MySQL to MongoDBhttp://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnikMaintaining your MongoDB Installationhttp://www.slideshare.net/fehguy/mongo-sv-tony-tamSource CodeMapping Benchmarkhttps://github.com/fehguy/mongodb-benchmark-toolsWordnik OSS Tools https://github.com/wordnik/wordnik-oss
  • 21.
    MongoDB makes aGreat graph back-endQuestions?