Israel Ekpo | Walt Disney Parks and Resorts Online Building Intelligent Search Applications with Solr 1.4.1 and PHP 5
About the Presenter  Husband (beautiful wife June)  Father (handsome son Joshua)  Sr. Software Engineer at Walt Disney Parks and Resorts Online  Resides in Orlando, FL  Open Source Contributor to Apache Solr / Apache Lucene Projects  Author of Apache Solr PECL Extension  Email : iekpo@php.net  Twitter : @israelekpo  Website : http://www.israelekpo.com
Summary  Why Create Search Applications?  What Solr is  What Solr is not  Why choose Apache Solr?  What Features Solr Has to Offer in Current Release 1.4.1  Taking Advantage of these features and more  Using Apache Solr via PHP 5  How do we make Search Applications Intelligent?  Additional Topics (Nutch, Local Solr, Bitwise Filtering, Plugins)  Links to slides, sample codes  Upcoming Features  Where to get help with Apache Solr
Reasons to Create Search Applications  Users have been spoiled by Google and other search engines.  Users are used to navigation using a search box.  If unable to locate information immediately, they assume it's not there.  Less patience when attempting to look for information  Certain customer service tools may not have relevant navigation  Information needs to be located immediately
Reasons to Create Search Applications  Reduce the amount of time it takes to locate information  Make products and services more accessible to customers  Increase time spent by visitors on web application  Save time, Save money  Increase employee efficiency (CRM applications)  Improve user experience on web application  Increase revenue and increase profit
About Solr  Solr is written in Java  Standalone full text Search Server within servlet container  Uses the information retrieval library Lucene at its Core  Joined Apache Incubator in January 2006  First major release in December 2006  Latest stable release is 1.4.1 - June 2010  Next major release 4.0  http://lucene.apache.org/solr/
What Solr is NOT  Solr is not a wrapper around Lucene  Solr is not a RDBMS
What version should I use? I am confused … The current version of Apache Solr is 1.4.1 Why is there both a 1.5 and a 3.x anyway ? Not to mention a 4.x ?  1.5 is pre lucene/solr merge (very unlikely to ever be released)  3.1 is the next lucene/solr point release (3x branch in svn)  4.0 is the next major release (trunk in svn)
Why Choose Apache Solr? • Solr is FREE and Open Source (released under Apache License) • Advanced Full Text Search Capabilities absent in an RDBMS • Optimized for High Volume Web Traffic • Standards Based Open Interfaces - XML, JSON and HTTP • Comprehensive HTML Administration Interfaces • Server statistics exposed over JMX for monitoring • Scalability - Efficient Replication to other Solr Search Servers • Flexible and Adaptable with XML configuration • Extensible Plugin Architecture • Because a Voice in my head told me to • All of the above
Features in Apache Solr 1.4.1  A Real Data Schema - Numeric Types, Dynamic Fields, Unique Keys  Hit Highlighting, Spelling Suggestions, Auto Suggests  Faceted Search and Filtering  Advanced, Configurable Text Analysis  Highly Configurable and User Extensible Caching  External Configuration via XML  An Administration Interface  Monitorable Logging  Fast Incremental Updates and Index Replication  Highly Scalable Distributed search - sharded index across multiple hosts  XML, CSV/delimited-text update formats  Rich Document Indexing - PDF, Word, HTML using Apache Tika  Multiple search indices
Setting Up and Getting Started Setup Instructions will be posted here later today http://www.israelekpo.com/works Download Links http://www.apache.org/dyn/closer.cgi/lucene/solr/ http://tomcat.apache.org/download-60.cgi http://pecl.php.net/package/solr Documentation and Helpful Information http://us.php.net/solr http://lucene.apache.org/solr/tutorial.html http://wiki.apache.org/solr/
Important Directories
conf/ This directory is mandatory and must contain your solrconfig.xml and schema.xml. Any other optional configuration files would also be kept here.
data/ This directory is the default location where Solr will keep your index, and is used by the replication scripts dealing with snapshots. You can override this location in solrconfig.xml. Solr will create this directory if it does not already exist.
lib/ This directory is optional. If it exists, Solr will load any JARs found in this directory and use them to resolve any "plugins" specified in your solrconfig.xml or schema.xml (e.g. Analyzers, Request Handlers, etc.).
bin/ This directory is optional. It is the default location used for keeping the replication scripts.
solrconfig.xml – handlers, plugins Defines data directory for index Overrides directory for external libs Defines and configures request handlers Defines and enables response writers Defines parameters for autowarming Used to register additional plugins
synonyms.txt Used for token or keyword substitution during indexing or queries
Source(s) => replacement(s)
colour => color
cheque => check
car, boat, truck => vehicle
dude, guy => man
trousers => pants
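To make the effect of these rules concrete, here is a minimal sketch in plain PHP that applies explicit "source => replacement" rules to a list of tokens. It only imitates the outcome of the mapping; it is not how Solr's synonym filter is implemented, and the helper names parseSynonymRules() and applySynonyms() are hypothetical.

```php
<?php
// Illustration only: mimics the effect of explicit "source => replacement"
// rules from synonyms.txt. This is NOT Solr's own synonym filter.

/**
 * Parse lines such as "car, boat, truck => vehicle" into a lookup table.
 * parseSynonymRules() is a hypothetical helper name.
 */
function parseSynonymRules(array $lines)
{
    $map = array();
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip blank lines, comments and lines without an explicit mapping.
        if ($line === '' || $line[0] === '#' || strpos($line, '=>') === false) {
            continue;
        }
        list($left, $right) = explode('=>', $line, 2);
        $replacements = array_map('trim', explode(',', $right));
        foreach (explode(',', $left) as $source) {
            $map[strtolower(trim($source))] = $replacements;
        }
    }
    return $map;
}

/** Replace each token that has a rule; leave all other tokens untouched. */
function applySynonyms(array $tokens, array $map)
{
    $out = array();
    foreach ($tokens as $token) {
        $key = strtolower($token);
        if (isset($map[$key])) {
            $out = array_merge($out, $map[$key]);
        } else {
            $out[] = $token;
        }
    }
    return $out;
}

$rules = parseSynonymRules(array(
    'colour => color',
    'car, boat, truck => vehicle',
    'trousers => pants',
));
$tokens = applySynonyms(array('red', 'colour', 'truck'), $rules);
// $tokens is now array('red', 'color', 'vehicle')
```

Because the same substitution runs on both indexed text and query text, a search for "truck" and a document containing "car" can meet on the shared token "vehicle".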
protwords.txt Used to protect certain words from stemming
sports
news
fighter
schema.xml – data types, fields
schema.xml – data types, fields
schema.xml – data types, fields
schema.xml – data types, fields
Field Options By Use Case
Adding data to index
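The slide's own example is not preserved in this transcript, so the following is only a sketch of the XML update message that Solr accepts on its update handler. It builds the message in plain PHP without the PECL extension; the function name buildAddXml() and the field names id, name and cat are assumptions for illustration and must match fields declared in your schema.xml.

```php
<?php
// Sketch: build a Solr XML <add> message from plain PHP arrays.
// buildAddXml() is a hypothetical helper; the field names are assumptions.

function buildAddXml(array $docs)
{
    $xml = '<add>';
    foreach ($docs as $fields) {
        $xml .= '<doc>';
        foreach ($fields as $name => $values) {
            // A multi-valued field is sent as one <field> element per value.
            foreach ((array) $values as $value) {
                $xml .= '<field name="' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '">'
                      . htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8')
                      . '</field>';
            }
        }
        $xml .= '</doc>';
    }
    return $xml . '</add>';
}

$xml = buildAddXml(array(
    array('id' => 'SKU-1', 'name' => 'Fast & Free', 'cat' => array('books', 'sale')),
));
```

The resulting string would be sent as the body of an HTTP POST to the update handler (for example http://localhost:8983/solr/update, an assumed host and port) with a text/xml content type, followed by a `<commit/>` message so the new documents become searchable.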
Solr Query Syntax Solr uses and extends the Lucene query syntax (a superset of Lucene syntax)
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
http://wiki.apache.org/solr/SolrQuerySyntax
+ Required, (no prefix) Optional, - Prohibited
Booleans
Free AND Fast
+Free +Fast
+title:Fast AND -body:dollars
Range Queries
[* TO 500]
{300 TO *}
cost:[* TO 299.99}
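One practical detail when sending this syntax over HTTP is that the query string must be URL-encoded: a raw "+" in a URL is decoded as a space, so the required-term operator would be lost. The sketch below, which assumes a Solr instance at http://localhost:8983/solr and uses a hypothetical helper named buildSelectUrl(), shows the encoding with PHP's built-in http_build_query().

```php
<?php
// Sketch: build a URL for Solr's select handler with a correctly encoded query.
// buildSelectUrl() and the base URL are assumptions for illustration.

function buildSelectUrl($baseUrl, $q, $rows = 10)
{
    $params = array(
        'q'    => $q,      // Lucene/Solr query syntax, unencoded
        'rows' => $rows,   // number of results to return
        'wt'   => 'json',  // response writer
    );
    // http_build_query() percent-encodes "+", ":" and brackets for us.
    return rtrim($baseUrl, '/') . '/select?' . http_build_query($params, '', '&');
}

$url = buildSelectUrl('http://localhost:8983/solr', '+title:Fast -body:dollars');
// http://localhost:8983/solr/select?q=%2Btitle%3AFast+-body%3Adollars&rows=10&wt=json

$rangeUrl = buildSelectUrl('http://localhost:8983/solr', 'cost:[* TO 500]', 5);
```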
Removing Data from Index
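The slide's example is not preserved here. As a sketch, Solr's XML update format supports deleting by unique key and deleting by query; the helpers below build both messages in plain PHP. The function names and the sample id and query are assumptions for illustration.

```php
<?php
// Sketch: build Solr XML <delete> messages.
// buildDeleteByIdXml() and buildDeleteByQueryXml() are hypothetical helpers.

/** Delete the single document whose unique key equals $id. */
function buildDeleteByIdXml($id)
{
    return '<delete><id>' . htmlspecialchars((string) $id, ENT_QUOTES, 'UTF-8') . '</id></delete>';
}

/** Delete every document matching a query written in Solr query syntax. */
function buildDeleteByQueryXml($query)
{
    return '<delete><query>' . htmlspecialchars($query, ENT_QUOTES, 'UTF-8') . '</query></delete>';
}

$byId    = buildDeleteByIdXml('SKU-1');
$byQuery = buildDeleteByQueryXml('cat:"books & music"');
```

As with adds, these messages are posted to the update handler, and the deletion only becomes visible to searches after a commit.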
Administrative Interface
Text Analysis (muy importante)
Text Analysis
Text Analysis
Text Analysis
Taking advantage of features Spellchecker
Taking advantage of features Spellchecker
Taking advantage of features Spell Checker
Taking advantage of features Spell Checker
Taking advantage of features Spell checker (behind the scenes)
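The behind-the-scenes slide content is not preserved in this transcript. As a rough sketch, a spellcheck request in Solr 1.4 is an ordinary search request with the spellcheck parameters switched on, sent to a request handler that has the spellcheck component configured. The handler path /spell, the base URL and the helper name buildSpellcheckUrl() are assumptions here.

```php
<?php
// Sketch: build a spellcheck request URL.
// Assumes a request handler at /spell with the spellcheck component enabled
// in solrconfig.xml; buildSpellcheckUrl() is a hypothetical helper.

function buildSpellcheckUrl($baseUrl, $userInput, $maxSuggestions = 5)
{
    $params = array(
        'q'                  => $userInput,
        'spellcheck'         => 'true',          // turn the component on
        'spellcheck.count'   => $maxSuggestions, // suggestions per misspelled term
        'spellcheck.collate' => 'true',          // also return a rewritten query
        'wt'                 => 'json',
    );
    return rtrim($baseUrl, '/') . '/spell?' . http_build_query($params, '', '&');
}

$url = buildSpellcheckUrl('http://localhost:8983/solr', 'serch kewyords');
```

The collated query in the response is what a "Did you mean ...?" link would typically resubmit.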
Taking advantage of features Reasons to Use Spellchecker • Alerts user of possible mistaeks in search kewyords • Helps user in finding what they are looking for even when they don’t know how to spell it • Helps in suggesting alternate queries that could provide better results
Taking advantage of features Autosuggest aka Auto complete
Taking advantage of features Auto suggest aka auto complete
Taking advantage of features Auto suggest behind the scenes http://docs.jquery.com/UI/Autocomplete
Taking advantage of features Auto suggest behind the scenes
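The implementation shown on these slides is not preserved. One way to back an autocomplete widget in Solr 1.4 is the terms component, which returns indexed terms starting with a given prefix. The sketch below builds such a request and converts a flat list of alternating terms and counts into the JSON array of strings that jQuery UI Autocomplete accepts. The /terms handler path, the field name "name", the flat list layout and both helper names are assumptions.

```php
<?php
// Sketch: prefix-based suggestions via the terms component.
// Assumes a /terms request handler and an indexed field called "name".

function buildSuggestUrl($baseUrl, $field, $prefix, $limit = 10)
{
    $params = array(
        'terms'        => 'true',
        'terms.fl'     => $field,
        // Assumes the field is lowercased at index time.
        'terms.prefix' => strtolower($prefix),
        'terms.limit'  => $limit,
        'wt'           => 'json',
    );
    return rtrim($baseUrl, '/') . '/terms?' . http_build_query($params, '', '&');
}

/**
 * Convert a flat list (term, count, term, count, ...) into a list of terms.
 * The flat layout is an assumption about the decoded response.
 */
function flatTermsToSuggestions(array $flat)
{
    $suggestions = array();
    for ($i = 0; $i + 1 < count($flat); $i += 2) {
        $suggestions[] = $flat[$i];
    }
    return $suggestions;
}

$url  = buildSuggestUrl('http://localhost:8983/solr', 'name', 'Di');
$json = json_encode(flatTermsToSuggestions(array('disney', 12, 'dining', 7)));
// $json can be returned to the browser as the autocomplete source.
```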
Taking advantage of features Why use auto suggest? • Helps users to finish their thoughts or complete search phrase • Helps reduce number of “no matches found” experiences • Provides “mind-reader” experience to certain users. • May propose alternate search phrases that are more useful
Taking advantage of features Hit Highlighting
Taking advantage of features Hit Highlighting
Taking advantage of features Hit Highlighting (behind the scenes)
Taking advantage of features Hit Highlighting (behind the scenes)
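The slide content for this step is not preserved, so here is a sketch of the two halves of highlighting: asking Solr for highlighted snippets, and looking up the snippet for one document in the decoded JSON response. The base URL, the field name "title", the sample response and both helper names are assumptions for illustration.

```php
<?php
// Sketch: request highlighting and read a snippet from the response.

function buildHighlightUrl($baseUrl, $q, $field)
{
    $params = array(
        'q'              => $q,
        'hl'             => 'true',   // enable highlighting
        'hl.fl'          => $field,   // field(s) to build snippets from
        'hl.simple.pre'  => '<em>',   // markup placed before each hit
        'hl.simple.post' => '</em>',  // markup placed after each hit
        'wt'             => 'json',
    );
    return rtrim($baseUrl, '/') . '/select?' . http_build_query($params, '', '&');
}

/** Return the first snippet for a document and field, or a fallback text. */
function snippetFor(array $response, $docId, $field, $fallback)
{
    if (isset($response['highlighting'][$docId][$field][0])) {
        return $response['highlighting'][$docId][$field][0];
    }
    return $fallback;
}

$url = buildHighlightUrl('http://localhost:8983/solr', 'fast', 'title');

// Hand-written sample response, not real server output.
$response = json_decode(
    '{"highlighting":{"doc1":{"title":["<em>Fast</em> shipping"]},"doc2":{}}}',
    true
);
$hit  = snippetFor($response, 'doc1', 'title', 'Fast shipping');
$miss = snippetFor($response, 'doc2', 'title', 'Plain title');
```

The fallback matters because a document can match on a field that was not highlighted, in which case its highlighting entry is empty.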
Taking advantage of features Why Use Hit Highlighting? Makes it easier for users to parse returned results Improves overall search experience
Taking advantage of features Facets and Filter Queries Makes it easier for users to parse returned results Improves overall search experience
Taking advantage of features Facets and Filter Queries
Taking advantage of features Facets and Filter Queries
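The slide examples are not preserved here. A point that is easy to trip over in PHP is that Solr expects multi-valued parameters such as fq and facet.field to be repeated (fq=...&fq=...), whereas http_build_query() would emit fq[0]=...&fq[1]=.... The sketch below uses a small hypothetical helper, buildSolrQueryString(), that repeats the parameter name instead; the field names cat, manu and inStock are assumptions.

```php
<?php
// Sketch: a faceted request with filter queries.
// buildSolrQueryString() is a hypothetical helper that repeats a parameter
// name once per value, which is the form Solr expects.

function buildSolrQueryString(array $params)
{
    $pairs = array();
    foreach ($params as $name => $values) {
        foreach ((array) $values as $value) {
            $pairs[] = urlencode($name) . '=' . urlencode((string) $value);
        }
    }
    return implode('&', $pairs);
}

$queryString = buildSolrQueryString(array(
    'q'              => 'camera',
    // Each fq narrows the result set without affecting relevance scoring.
    'fq'             => array('cat:electronics', 'inStock:true'),
    'facet'          => 'true',
    // Ask for hit counts per category and per manufacturer.
    'facet.field'    => array('cat', 'manu'),
    'facet.mincount' => 1,
));
// q=camera&fq=cat%3Aelectronics&fq=inStock%3Atrue&facet=true
//   &facet.field=cat&facet.field=manu&facet.mincount=1  (one line in practice)
```

When a user clicks a facet value in the interface, the application would typically add one more fq entry for that value and reissue the request.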
Taking advantage of features Why Facets and Filter Queries? Allows users to narrow down humongous result set Creates visual classification or categorization of result set Gives user an idea of number of hits per category Improves overall search experience
Additional Topics Using Nutch and Solr – crawling and indexing intranet sites http://nutch.apache.org/ Local Solr – filtering based on proximity https://issues.apache.org/jira/browse/SOLR-773 Bitwise Filtering on Integer Fields https://issues.apache.org/jira/browse/SOLR-1913 Using Different Response Writer for PHP http://us.php.net/manual/en/solrclient.setresponsewriter.php https://issues.apache.org/jira/browse/SOLR-1967
Upcoming Features Apache Solr • Local Solr • Results Grouping • Field Collapsing PECL Extension • Ability to Send Custom Requests to Custom URLS other than select, update, terms etc. • Ability to add files (pdf, office documents etc) • Windows version of latest releases. • Ensuring that SolrQuery::getFields(), SolrQuery::getFacets() et al returns an array consistently. • Lowering Libxml version to 2.6.16
Where to get help Solr Wiki http://wiki.apache.org/solr/ Solr Mailing Lists solr-user@lucene.apache.org (send message) solr-user-subscribe@lucene.apache.org (subscribe) PECL Extension Documentation on PHP.net http://www.php.net/solr Additional Resources http://wiki.apache.org/solr/SolrResources Articles on LucidImagination.com – lots of articles from experts in the community.
Questions and Answers
WDPRO is Hiring Walt Disney Parks and Resorts Online is hiring for the following positions http://wdpro.jobs • Software Architects • Software Engineers • Web Developers • Automation Engineers • System Engineers • Release Engineers • Quality Assurance Engineers • Business Analysts • Project Managers • Technology Managers Email your resume -> iekpo@php.net mail("iekpo@php.net", "I am interested! Hook me up!", "resume, contact info");
Feedback and Links Attendee Evaluation and Comments http://joind.in/2261 Link to Slides http://slidesha.re/bAXNF3 Sample Code (will be posted later) http://www.israelekpo.com/works Email iekpo@php.net
