Building a Large Scale SEO/SEM Application with Apache Solr Rahul Jain Freelance Big-data/Search Consultant @rahuldausa dynamicrahul2020@gmail.com
About Me… • Freelance Big-data/Search Consultant based out of Hyderabad, India • Provide Consulting services and solutions for Solr, Elasticsearch and other Big data solutions (Apache Hadoop and Spark) • Organizer of two Meetup groups in Hyderabad • Hyderabad Apache Solr/Lucene • Big Data Hyderabad
What I am going to talk about
Share our experience of working on Search in this application…
• What issues we faced and the lessons learned
• How we do database import and batch indexing
• Techniques to scale and improve search latency
• The system architecture
• Some tips for tuning Solr
• Q/A
What does the Application do
§ Keyword Research and Competitor Analysis tool for SEO (Search Engine Optimization) and SEM (Search Engine Marketing) professionals.
§ End users search for a keyword or a domain and get all insights about it.
§ Aggregates data for the top 50 results of Google and Bing across 3 countries for 80 million+ keywords.
§ Provides key metrics like keywords, CPM (cost per mille), CPC (cost per click), competitors' details, etc.
(Diagram: Data Collection from web crawling, ad network APIs and databases, feeding Data Processing & Aggregation.)
*All trademarks and logos belong to their respective owners.
Technology Stack
High level Architecture
Components: Load Balancer (HAProxy), Web Server Farm running the PHP app (Nginx), App Server (Tomcat), Managed Cache Cluster (Redis), Database (MySQL), Cluster Manager (custom, built on ZooKeeper), and Apache Solr search heads in front of the Solr servers.
Search Head:
• A Solr server which does not contain any data.
• Makes a distributed search request and aggregates the search results.
• Also works as a load balancer for search queries.
(Diagram: numbered request flow 1-8 — the app server does an ID lookup in the cache, fetches the cluster mapping for the relevant month's cluster, and then queries the search head.)
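A search head is, in effect, a plain Solr core queried with an explicit shards list; it fans the query out and merges the results. A minimal SolrJ sketch of that idea, assuming hypothetical host and core names (not the actual deployment):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchHeadExample {
  public static void main(String[] args) throws Exception {
    // The search head holds no data of its own; it distributes the query
    // to the listed shard cores and aggregates the results.
    HttpSolrServer searchHead = new HttpSolrServer("http://searchhead-us:8983/solr/keywords");

    SolrQuery q = new SolrQuery("keyword_text:shoes");
    // Hypothetical shard cores holding the January/US slice of the "keywords" table.
    q.set("shards", "solr1:8983/solr/keywords_jan_us_s1,"
                  + "solr2:8983/solr/keywords_jan_us_s2");
    q.setRows(20);

    QueryResponse rsp = searchHead.query(q);
    System.out.println("hits=" + rsp.getResults().getNumFound());
  }
}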
Search - Key challenges
§ After processing we have ~40 billion records every month in the MySQL database, including:
§ 80+ million keywords
§ 110+ million domains
§ 1 billion+ URLs
§ Multiple keywords for a single URL, and vice versa
§ Multiple tables, varying in size from 50 million to 12 billion records
§ Search is a core functionality, so all records (selected fields) must be indexed in Solr
§ Page load time (including all 24 widgets, max) < 1.5 sec (critical)
§ But… we need to load this data only once a month for all countries, so we can do batch indexing, and as this data never changes, we can apply caching.
Making Data Import and Batch Indexing Faster
Data Import from MySQL to Solr
• Solr's DataImportHandler is awesome, but it quickly becomes pretty slow for large volumes.
• We wrote a custom Data Importer that reads (pulls) documents from the database and pushes them asynchronously into Solr.
How it works:
• Rather than using the database's LIMIT clause, it queries by range of IDs (primary key), e.g. "select * from table t where ID >= 1 and ID <= 2000", then "select * from table t where ID >= 2001 and ID <= 4000", and so on.
• The Importer fetches 10 batches at a time from the MySQL database, each having 2k records. Each call is stateless.
• It combines these database batches into a bigger batch (10k documents) and flushes it to the selected Solr servers asynchronously, in a round-robin fashion.
Downside:
• The table must have a primary key (with an index), and creating that index in the database can be slow.
• If the IDs are not sequential (e.g. a gap between ID 5000 and 6000), this approach requires a larger number of calls.
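A minimal sketch of the approach, not the actual importer: table, column and host names, the max ID, and thread counts are all illustrative; only the pattern (primary-key range queries instead of LIMIT/OFFSET, and asynchronous round-robin flushes to Solr) follows the slide.

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RangeImporter {
  static final int DB_BATCH = 2000;     // rows per range query
  static final int SOLR_BATCH = 10000;  // docs per flush to Solr

  public static void main(String[] args) throws Exception {
    Connection con = DriverManager.getConnection("jdbc:mysql://db/keywords", "user", "pass");
    // Hypothetical target cores; flushed to in round-robin order.
    HttpSolrServer[] solr = {
        new HttpSolrServer("http://solr1:8983/solr/core1"),
        new HttpSolrServer("http://solr2:8983/solr/core1")};
    ExecutorService pool = Executors.newFixedThreadPool(4);

    List<SolrInputDocument> buffer = new ArrayList<>();
    int target = 0;
    long maxId = 12_000_000L; // assumed known max primary key
    for (long start = 1; start <= maxId; start += DB_BATCH) {
      // Stateless range query on the primary key (no LIMIT/OFFSET scan).
      PreparedStatement ps = con.prepareStatement(
          "SELECT id, keyword, cpc FROM keyword_stats WHERE id >= ? AND id < ?");
      ps.setLong(1, start);
      ps.setLong(2, start + DB_BATCH);
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rs.getLong("id"));
        doc.addField("keyword", rs.getString("keyword"));
        doc.addField("cpc", rs.getDouble("cpc"));
        buffer.add(doc);
      }
      rs.close(); ps.close();

      if (buffer.size() >= SOLR_BATCH) {
        final List<SolrInputDocument> batch = new ArrayList<>(buffer);
        final HttpSolrServer server = solr[target++ % solr.length]; // round-robin target
        pool.submit(() -> { server.add(batch); return null; });     // async flush, no commit here
        buffer.clear();
      }
    }
    if (!buffer.isEmpty()) solr[target % solr.length].add(buffer);  // flush the tail
    pool.shutdown();
  }
}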
Batch Indexing
Indexing All tables into a Single Big Index
• All tables in the same index, distributed over multiple Solr cores and Solr servers (Java processes).
• Commit on every 120 million records or every 15 minutes, whichever comes earlier.
• Disabled soft-commit and updates (overwrite=false), as each call to addDocument calls updateDocument under the hood.
• But still… indexing was slow (due to being sequential for all tables) and we had to stop it after 2 days.
• Search was also awfully slow (on the order of minutes).
(Screenshot: a bunch of ~100 shards; query times shown are from cache, after warm-up.)
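A rough SolrJ sketch of this loading pattern, assuming the thresholds from the slide; the class and variable names are illustrative, and adding with overwrite=false simply skips the uniqueKey update path while commits are issued manually on a count/time threshold instead of via soft commit:

import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class BulkAdd {
  static final long COMMIT_EVERY_DOCS = 120_000_000L;    // 120M records
  static final long COMMIT_EVERY_MS   = 15 * 60 * 1000L; // or 15 minutes, whichever is earlier
  long sinceCommit = 0;
  long lastCommit = System.currentTimeMillis();

  void addBatch(HttpSolrServer solr, List<SolrInputDocument> docs) throws Exception {
    UpdateRequest req = new UpdateRequest();
    for (SolrInputDocument d : docs) {
      req.add(d, false);   // overwrite=false: pure append, no delete/update lookup
    }
    req.process(solr);

    sinceCommit += docs.size();
    if (sinceCommit >= COMMIT_EVERY_DOCS
        || System.currentTimeMillis() - lastCommit >= COMMIT_EVERY_MS) {
      solr.commit();       // hard commit only; no soft commits during batch load
      sinceCommit = 0;
      lastCommit = System.currentTimeMillis();
    }
  }
}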
Creating a Distributed Index for each table
How many shards?
• Each table has a varying number of records, from 50 million to 12 billion.
• If we choose 100 million per shard (core), then for 12 billion records we need to query 120 shards: awfully slow.
• On the other side, if we choose 500 million/shard, a table with 500 million records will have only 1 shard: high latency, high memory usage (heap), and no distributed search*.
• Hybrid approach: determine the number of shards based on the max number of records in the table.
• Did benchmarking to find the sweet spot for max documents (records) per shard with the most optimal search latency.
• Commit at the end of each table.
Max number of records in table → max number of shards (cores) allowed:
• < 100 million → 1
• 100-300 million → 2
• < 500 million → 4
• < 1 billion → 6
• 1-5 billion → 8
• > 5 billion → 16
* Distributed search improves latency but may not always be faster, as search latency is limited by the time taken by the last shard to respond.
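A small helper expressing the table above. The thresholds are taken directly from the slide; they came out of the team's benchmarking for their data, so treat the numbers as data, not a general rule.

public class ShardPlanner {
  /** Max number of shards (cores) allowed for a table, by its record count. */
  static int shardsFor(long records) {
    long M = 1_000_000L, B = 1_000_000_000L;
    if (records < 100 * M) return 1;
    if (records < 300 * M) return 2;
    if (records < 500 * M) return 4;
    if (records < 1 * B)   return 6;
    if (records <= 5 * B)  return 8;
    return 16;                               // > 5 billion records
  }

  public static void main(String[] args) {
    System.out.println(shardsFor(12_000_000_000L)); // the biggest table -> 16 shards
  }
}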
It worked fine but one day suddenly…
java.lang.OutOfMemoryError: Java heap space
• All Solr servers crashed.
• We restarted them, but they kept crashing randomly every other day.
• Took a heap dump and realized it was due to the Field Cache.
• Found a solution: DocValues, and have never had this issue again to date.
Doc Values (Life Saver)
• Disk-based field data, a.k.a. DocValues
• Document-to-value mapping built at index time
• Stores field values on disk (or in memory) in a column-stride fashion
• Better compression than the Field Cache
• A replacement for the Field Cache (though not completely)
• Quite suitable for custom scoring, sorting and faceting
References:
• Old article (but good): http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
• https://cwiki.apache.org/confluence/display/solr/DocValues
• http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/doc-values.html
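DocValues are enabled per field in the schema; a minimal schema.xml sketch, with hypothetical field names and the stock Solr 4.x trie/string field types assumed:

<!-- schema.xml: column-stride, on-disk field data used for sorting, faceting and custom scoring -->
<field name="cpc"     type="tdouble" indexed="true" stored="false" docValues="true"/>
<field name="country" type="string"  indexed="true" stored="false" docValues="true"/>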
Scaling and Making Search Faster…
Partitioning
• 3-level partitioning: by Month, Country and Table name.
• Each month has its own cluster and a Cluster Manager.
• Latency and throughput are a tradeoff; you can't have both at the same time.
(Diagram: Internet → Load Balancer → Web server farm → App servers. The app server fetches the cluster mapping from the Master Cluster Manager and makes a request to the Search Head with the respective Solr cores for that Country, Month and Table. There is one Search Head per country (US, UK, AU). Each month has its own cluster, e.g. Cluster 1 for Jan and Cluster 2 for Feb, spanning multiple Solr nodes, each with an Active and a Passive Cluster Manager kept in real-time sync. 1 user = 24 UI widgets = 24 Ajax requests = 41 search requests.)
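Conceptually, the cluster mapping is a lookup from (month, country, table) to the set of live cores behind a search head. A toy sketch of that resolution step; the keys, core URLs and in-memory map are made up for illustration, whereas the real mapping lives in ZooKeeper via the custom cluster manager:

import java.util.*;

public class ClusterMapping {
  // month:country:table -> Solr cores holding that partition (hypothetical entries)
  static final Map<String, List<String>> MAPPING = new HashMap<>();
  static {
    MAPPING.put("2014-01:US:keywords", Arrays.asList(
        "solr1:8983/solr/keywords_2014_01_us_s1",
        "solr2:8983/solr/keywords_2014_01_us_s2"));
  }

  /** Resolve the shards list the app server passes to the country's search head. */
  static String shardsParam(String month, String country, String table) {
    List<String> cores = MAPPING.get(month + ":" + country + ":" + table);
    return String.join(",", cores);
  }

  public static void main(String[] args) {
    System.out.println(shardsParam("2014-01", "US", "keywords"));
  }
}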
Index Optimization Strategy
• Running optimization on ~200+ Solr cores is very, very time consuming.
• Solr cores with a bigger index size (~70GB) have 2500+ segments, due to a higher merge factor while indexing.
• It can't be run in parallel on all cores of a single machine, as it is heavily dependent on CPU and disk IO.
• Optimizing large segments down to a very small number is very time consuming and can take up to 3x the index size on disk.
• On the other side, a small number of segments improves performance drastically, so we need a balance.
• In our observation, for our data the optimization process takes ~42-46 seconds per 1GB, and we need to do it for 4.6TB (across all boxes), the total Solr index size for a single month.
Optimizing a Solr core down to a very small number of segments takes a huge amount of time, so we do it iteratively.
Algorithm:
• Choose max 3 cores per machine to optimize in parallel; start with the smallest index.
• From the index size and the number of docs, determine the max segments allowed.
• Reduce the segment count to 0.90x the current count in each run, until the target is reached.
(Diagram: a Solr Optimizer runs against a Staging Cluster Manager, fetching the cluster mapping (list of all cores); once optimization and cache warm-up are done, it pushes the cluster mapping to the Production Cluster Manager, making all indices live. Chart: current segments vs. segments after optimization.)
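A rough sketch of the iterative optimize loop using SolrJ's optimize call with a target segment count. The 0.90 reduction factor is from the slide; the core URL, the starting segment count and the final target are hypothetical (in practice the current segment count would be read from the core admin / Luke handler).

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class IterativeOptimizer {
  /** Shrink the core's segment count by ~10% per pass until the target is reached. */
  static void optimizeIteratively(HttpSolrServer core, int currentSegments,
                                  int maxSegmentsAllowed) throws Exception {
    int segments = currentSegments;
    while (segments > maxSegmentsAllowed) {
      segments = Math.max(maxSegmentsAllowed, (int) (segments * 0.90));
      // waitFlush=true, waitSearcher=true, maxSegments = target for this pass
      core.optimize(true, true, segments);
    }
  }

  public static void main(String[] args) throws Exception {
    HttpSolrServer core = new HttpSolrServer("http://solr1:8983/solr/keywords_s1");
    optimizeIteratively(core, 2500, 16);
  }
}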
Finally, after optimization and cache warm-up… a shard looks like this. (Screenshot: max segments after optimization.)
External Caching
• In distributed search, a repeated query request still hits all Solr servers, even though the result is served from Solr's cache. This increases search latency and lowers throughput.
• Solution: cache the most frequently accessed query results in the app layer (LRU-based eviction).
• We use Redis for caching.
• All complex aggregation query results, once fetched from multiple Solr servers, are served from the cache on subsequent requests.
Why Redis…
• Advanced in-memory key-value store
• Insanely fast
• Response time on the order of 5-10ms
• Provides cache behavior (set, get) with advanced data structures like hashes, lists, sets, sorted sets, bitmaps, etc.
• http://redis.io/
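The app-layer cache is a straightforward look-aside pattern. A minimal Jedis sketch, assuming an illustrative key scheme and TTL and a crude string serialization of the results; eviction itself would be handled by Redis with a maxmemory policy such as allkeys-lru:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import redis.clients.jedis.Jedis;

public class CachedSearch {
  static String search(Jedis redis, HttpSolrServer searchHead, SolrQuery q) throws Exception {
    String key = "q:" + q.toString();          // cache key derived from the full query string
    String cached = redis.get(key);
    if (cached != null) {
      return cached;                           // served from Redis; Solr is not touched
    }
    String result = searchHead.query(q).getResults().toString();
    redis.setex(key, 24 * 3600, result);       // keep for a day; LRU eviction on memory pressure
    return result;
  }
}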
Hardware
• We use bare-metal, dedicated servers for Solr for the reasons below:
1. Performance gain (with virtual servers, performance dropped by ~18-20%)
2. Better value of computing power per $ spent
• 2.6GHz, 32 core (4x8 core), 384GB RAM, 6TB SAS 15k (RAID10)
• 2.6GHz, 16 core (2x8 core), 192GB RAM, 4TB SAS 15k (RAID10)
• Since the index size is 4.6TB/month, we want to cache more data in the disk cache with bigger RAM.
SSD vs SAS
1. SSD: peak indexing rate (MySQL to Solr): 330k docs/sec (each doc: ~100-125 bytes)
2. SAS 15k: 182k docs/sec (dropped by ~45%)
3. SAS 15k is considerably cheaper than SSD for bigger hard disks.
4. We are using SAS 15k as it is cost-effective, but have plans to move to SSD in the future.
Conclusion: Key takeaways
General:
• Understand the characteristics of the data and partition it well.
Cache:
§ Spend time analyzing cache usage and tune the caches; serving from cache is 10x-50x faster.
§ Always use Filter Queries (fq) wherever possible, as they improve performance thanks to the filter cache.
GC:
§ Keep your JVM heap size at a lower value (proportional to the machine's RAM), leaving enough RAM for the kernel (page cache), as a bigger heap will lead to frequent GC. 4GB to 8GB of heap is quite a good range, but we use 12GB/16GB.
§ Keep an eye on garbage collection (GC) logs, especially on Full GC.
Tuning params:
§ Don't use soft commit if you don't need it, especially during batch loading.
§ Always explore tuning Solr for high performance: ramBufferSizeMB, mergeFactor, HttpShardHandler's various configurations.
§ Use hashes in Redis to minimize memory usage.
Read the whole experience in more detail: http://rahuldausa.wordpress.com/2014/05/16/real-time-search-on-40-billion-records-month-with-solr/
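For the filter-query tip: moving the invariant parts of a query into fq lets Solr reuse the filter cache across requests. A quick SolrJ illustration, with hypothetical field names:

import org.apache.solr.client.solrj.SolrQuery;

public class FilterQueryExample {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("keyword:\"running shoes\"");  // the scored part of the query
    // Invariant constraints go into fq: cached in the filter cache, no scoring cost.
    q.addFilterQuery("country:US");
    q.addFilterQuery("month:2014-01");
    System.out.println(q);
  }
}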
Thank you! Twitter: @rahuldausa dynamicrahul2020@gmail.com http://www.linkedin.com/in/rahuldausa
