Discovering python search engine José Manuel Ortega - Pycones 2017
Agenda ● Introduction to search engines ● ElasticSearch, Whoosh, django-haystack ● ElasticSearch example ● Other solutions & tools ● Conclusions
Search engines
Search engines ● Document based ● A document is the unit of searching in a full-text search system. ● A document can be a JSON object or a Python dictionary
Core concepts
Core concepts ● Index: named collection of documents that have similar characteristics (like a database) ● Type: logical partition of an index that contains documents with common fields (like a table) ● Document: basic unit of information (like a row) ● Mapping: field properties (data type, token extraction). Includes information about how fields are stored in the index
Core concepts ● Relevance: the algorithms used to rank the results based on the query ● Corpus: the collection of all documents in the index ● Segments: sharded data storing the inverted index. They allow searching the index in an efficient way
Inverted index ● It is the heart of the search engine ● Each inverted index stores the positions and document IDs of the terms it contains
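To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not how Lucene or Whoosh store their data: the tokenizer is a naive lowercase-and-split, and the names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Return the sorted IDs of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


docs = {
    1: "Python search engine",
    2: "Search with Python and Whoosh",
}
index = build_inverted_index(docs)
print(search(index, "python"))  # [1, 2]
print(index["search"])          # {1: [1], 2: [0]}
```

Looking up a term is a single dictionary access, which is why the structure makes searching efficient compared with scanning every document.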
ElasticSearch
● Open source search server based on Apache Lucene ● Written in Java ● Cross-platform ● Communication with the search server is done through an HTTP REST API ● curl -X<GET|POST|PUT|DELETE> http://localhost:9200/<index>/<type_document>/<id>
● You can add a document without creating an index first ● ElasticSearch will create the index, mapping type and fields automatically ● ElasticSearch will infer the data types based on the document's data
Metadata Fields ● Each document has metadata associated with it ● _index: allows matching documents based on their indexes ● _type: type of the document ● _id: document ID (not indexed) ● _uid: _type + _id (indexed) ● _source: contains the JSON passed when the index or document was created (not indexed) ● _version
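The REST layout in the curl line can also be reproduced from Python's standard library. The sketch below only builds the request and does not send it; the node address, the index and type names, and the helper `build_request` are assumptions for illustration. The typed URL follows the slide, and newer Elasticsearch releases dropped mapping types, so the path may differ on a current server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl line


def build_request(method, index, doc_type, doc_id, body=None):
    """Build (but do not send) an HTTP request for a single document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


req = build_request("PUT", "myindex", "mydocument", 1, {"title": "PyConES"})
print(req.get_method(), req.full_url)
# Sending it requires a running server: urllib.request.urlopen(req)
```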
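As a rough illustration of type inference, the sketch below guesses a field type from each JSON value. It is a deliberately simplified assumption and not ElasticSearch's actual dynamic mapping logic, which also handles dates, arrays, nested objects and more; `infer_field_type` and `infer_mapping` are hypothetical names.

```python
def infer_field_type(value):
    """Guess a mapping type from a JSON value (simplified illustration)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def infer_mapping(document):
    """Return {field: guessed_type} for a flat document."""
    return {field: infer_field_type(value) for field, value in document.items()}


doc = {"title": "Whoosh", "year": 2017, "rating": 4.5, "published": True}
print(infer_mapping(doc))
```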
ElasticSearch vs Relational DB
Creating an Index curl -XPUT 'localhost:9200/myindex' -d '{ "settings": {..}, "mappings": {..} }'
Searching a document curl -XGET 'localhost:9200/myindex/mydocument/_search?q=elasticSearch' curl -XGET 'localhost:9200/myindex/mydocument/_search?pretty' -d '{ "query": { "match": { "_all": "elasticSearch" } } }' Query DSL
Searching a document ● Search can get much more complex ○ Multiple terms ○ Multi-match (match query on specific fields) ○ Bool (true, false) ○ Range ○ RegExp ○ GeoPoint, GeoShapes
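Since a Query DSL body is just JSON, it can be assembled as a Python dictionary. The sketch below builds a match, a multi-match and a bool query that combines a match with a range filter. The field names `title`, `content` and `year` and the helper names are hypothetical.

```python
import json


def match_query(field, text):
    """Full-text match on a single field."""
    return {"query": {"match": {field: text}}}


def multi_match_query(text, fields):
    """The same text matched against several fields."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


def bool_query(field, text, range_field, gte):
    """Full-text match that is additionally filtered by a numeric range."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "filter": [{"range": {range_field: {"gte": gte}}}],
            }
        }
    }


body = bool_query("title", "elasticsearch", "year", 2015)
print(json.dumps(body, indent=2))
```

The resulting JSON is what would be passed after `-d` in a curl call to `_search`.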
ElasticSearch python client ● The official low-level client is elasticsearch-py ○ pip install elasticsearch
ElasticSearch-py API
ElasticSearch-py API
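A minimal sketch of how the client might be used for indexing and searching. The helpers take the client as an argument, so `es` is assumed to be an `elasticsearch.Elasticsearch` instance; the `title` field and the helper names are hypothetical. Keyword arguments have changed between client releases (older ones also require `doc_type`, newer ones prefer `document=` over `body=`), so check them against the installed version.

```python
def index_document(es, index, doc_id, document):
    """Store one document under the given ID."""
    return es.index(index=index, id=doc_id, body=document)


def search_title(es, index, text):
    """Run a match query on the (hypothetical) `title` field."""
    body = {"query": {"match": {"title": text}}}
    response = es.search(index=index, body=body)
    # Each hit carries the stored JSON under the `_source` metadata field.
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Usage against a running server (assumed to listen on localhost:9200):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   index_document(es, "myindex", 1, {"title": "Discovering search engines"})
#   print(search_title(es, "myindex", "search"))
```

Passing the client in keeps the helpers easy to exercise with a stand-in object when no server is available.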
Geo queries ● Elasticsearch supports two types of geo fields ○ geo_point (lat, lon) ○ geo_shape (points, lines, polygons) ● Perform geographical searches ○ Finding points of interest near GPS coordinates
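A geographical search could look like the sketch below: a mapping fragment that declares a `geo_point` field and a `geo_distance` filter around a coordinate. The field name `location`, the coordinates and the helper names are hypothetical.

```python
def geo_point_mapping(field):
    """Mapping fragment declaring `field` as a geo_point."""
    return {"properties": {field: {"type": "geo_point"}}}


def geo_distance_query(field, lat, lon, distance="10km"):
    """Filter documents whose point lies within `distance` of (lat, lon)."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Example coordinates chosen only for illustration.
query = geo_distance_query("location", 40.4168, -3.7038, distance="5km")
print(query)
```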
Whoosh
● Pure-python full-text indexing and searching library ● Library of classes and functions for indexing text and then searching the index. ● It allows you to develop custom search engines for your content. ● Mainly focused on index and search definition using schemas ● Python 2.5 and Python 3
Schema
Create index and insert document
Searching single field
Searching multiple fields
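The four Whoosh steps above (schema, index creation, single-field search, multi-field search) can be sketched in one short function. This is an illustrative sketch rather than the talk's original code: the field names and sample documents are hypothetical, and an in-memory `RamStorage` is used so that nothing is written to disk.

```python
def whoosh_demo():
    """Build a tiny in-memory index and run two searches against it."""
    # Imports are deferred so this module still loads where Whoosh is absent.
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.filedb.filestore import RamStorage
    from whoosh.qparser import MultifieldParser, QueryParser

    # Schema: which fields exist and whether their values are stored.
    schema = Schema(
        path=ID(stored=True, unique=True),
        title=TEXT(stored=True),
        content=TEXT,
    )
    ix = RamStorage().create_index(schema)

    # Insert two documents and commit them to the index.
    writer = ix.writer()
    writer.add_document(path="/a", title="Whoosh intro",
                        content="pure python search library")
    writer.add_document(path="/b", title="Search servers",
                        content="elasticsearch runs on lucene")
    writer.commit()

    with ix.searcher() as searcher:
        # Single field: parse the query against `content` only.
        single = QueryParser("content", ix.schema).parse("python")
        single_hits = [hit["path"] for hit in searcher.search(single)]
        # Multiple fields: the term may match in `title` or in `content`.
        multi = MultifieldParser(["title", "content"], ix.schema).parse("search")
        multi_hits = sorted(hit["path"] for hit in searcher.search(multi))
    return single_hits, multi_hits
```

Calling `whoosh_demo()` after `pip install whoosh` returns the paths matched by each search. For a persistent index, `whoosh.index.create_in` with a directory would replace the in-memory storage.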
Django-haystack
● Multiple backends (you have a Solr & a Whoosh index, or a master Solr & a slave Solr, etc.) ● An Elasticsearch backend ● Big query improvements ● Geospatial search (Solr & Elasticsearch only) ● The addition of Signal Processors for better control ● Input types for improved control over queries ● Rich Content Extraction in Solr
● Create the index ○ Run ./manage.py rebuild_index to create the new search index. ● Update the index ○ ./manage.py update_index will add new entries to the index. ○ ./manage.py rebuild_index will recreate the index from scratch.
● Pros: ○ Easy to set up ○ Looks like the Django ORM but for searches ○ Search engine independent ○ Supports 4 engines (Elasticsearch, Solr, Xapian, Whoosh) ● Cons: ○ Poor SearchQuerySet API ○ Difficult to manage stop words ○ Loses performance because of the extra layer ○ Django Model based
Other solutions
Other solutions ● https://xapian.org ● https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/search/ ● https://www.postgresql.org/docs/9.6/static/textsearch.html
pysolr
Other tools
References ● http://elasticsearch-py.readthedocs.io/en/master/ ● https://whoosh.readthedocs.io/en/latest ● http://django-haystack.readthedocs.io/en/master/ ● http://solr-vs-elasticsearch.com/ ● https://wiki.apache.org/solr/SolPython ● https://github.com/django-haystack/pysolr
Thanks! jmortega.github.io @jmortegac
