Java Search Engine Framework
Solutions
•  Regular expressions (can be slow and memory hungry)
•  Lucene (full-text search engine library)
•  Solr (standalone full-text search server)
•  SolrJ (Java client for Solr)
Regular expression
•  (what it is) a sequence of symbols (hence a string) that identifies a set of strings
•  (what it does) defines a function that takes a string as input and returns a yes/no value, depending on whether or not the string follows a given pattern.
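The yes/no function described above can be sketched with java.util.regex. The class name RegexPredicate, the helper followsPattern, and the sample pattern and inputs are illustrative assumptions, not part of any library.

```java
import java.util.regex.Pattern;

class RegexPredicate {
    // Returns true when the whole input follows the pattern, false otherwise.
    static boolean followsPattern(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "eur.*usd" = "eur", any run of characters, then "usd".
        System.out.println(followsPattern("eur.*usd", "eur&usd"));
        System.out.println(followsPattern("eur.*usd", "gbpusd"));
    }
}
```

Note that matches() tests the entire string, whereas find() looks for the pattern anywhere inside it.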
Regular expression (example)
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // "eur", any run of characters, then "usd"
    Pattern p = Pattern.compile("eur.*usd");
    Matcher m = p.matcher(
        "In quel ramo del lago di eUr&uSd".toLowerCase()
    );
    if (m.find()) { // found! But where in the string?
    }
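The question left open in the example (where in the string the match occurs) can be answered with Matcher.start() and Matcher.end(). The following is a minimal sketch: the class RegexLocate and the helper firstMatch are hypothetical names, the pattern "eur.*usd" is an assumption, and Pattern.CASE_INSENSITIVE is used in place of toLowerCase() so that the offsets refer to the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLocate {
    // Returns {start, end} of the first match, or null when nothing matches.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        String text = "In quel ramo del lago di eUr&uSd";
        int[] span = firstMatch("eur.*usd", text);
        if (span != null) {
            // end is exclusive, so substring(start, end) is the matched text
            System.out.println(span[0] + ".." + span[1] + " -> " + text.substring(span[0], span[1]));
        }
    }
}
```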
Lucene
•  Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
•  Apache Software Foundation
•  Stable release: 4.3.0 / May 6, 2013
•  Development status: active
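As a rough intuition for what a full-text index buys over scanning every string, the sketch below builds a toy inverted index (a map from each term to the ids of the documents containing it) using only the JDK. It is not Lucene code and leaves out everything that makes Lucene fast and full-featured (analysis chains, scoring, on-disk segments); the class name ToyInvertedIndex and its methods are invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lowercases the text, splits it on non-alphanumeric characters, records each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Looks up a single term without rescanning any document text.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "EURUSD moved higher");
        index.add(2, "GBPUSD moved lower");
        System.out.println(index.search("moved"));
        System.out.println(index.search("eurusd"));
    }
}
```

The lookup cost depends on the number of distinct terms rather than on the total amount of text, which is the basic reason an index scales better than running a regular expression over every document.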
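Lucene (example)
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // KeywordAnalyzer treats each field value as a single token.
    // Alternative that tokenizes text: new StandardAnalyzer(Version.LUCENE_43)
    Analyzer analyzer = new KeywordAnalyzer();
    Directory index = new RAMDirectory(); // in-memory index
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
    IndexWriter w = new IndexWriter(index, config);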
Lucene (example 2)
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;

    // Adds one document; each StringField is indexed as a single, un-tokenized term and stored.
    private void addDoc(long time, String value, String flag) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("time", String.valueOf(time), Field.Store.YES));
        doc.add(new StringField("value", value, Field.Store.YES));
        doc.add(new StringField("flag", flag, Field.Store.YES));
        w.addDocument(doc);
    }

    w.commit(); // run once at the end of the batch
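Lucene (example 3)
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
    // Parser with a single default field. Alternative with several default fields:
    // new MultiFieldQueryParser(Version.LUCENE_43, new String[] {"time", "value", "flag"}, analyzer)
    QueryParser queryParser = new QueryParser(
        Version.LUCENE_43,
        "value",
        analyzer);
    // Field names are case-sensitive: the field was indexed as "value".
    TopDocs hits = searcher.search(queryParser.parse("value:(+eurusd)"), 50);
    System.out.println(hits.totalHits);
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println(doc.toString());
    }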
Solr
•  Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty.
•  Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
•  Apache Software Foundation
•  Stable release: 4.3.0 / May 6, 2013
•  Development status: active
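Because the API is plain HTTP, a query is just a URL. The sketch below only builds such a URL with the JDK and does not contact a server. The base URL and the core name collection1 are assumptions about a local installation, and the class SolrSelectUrl is a hypothetical helper; q, rows and wt are standard Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrSelectUrl {
    // Builds a /select URL asking for up to `rows` results in JSON format.
    static String build(String coreUrl, String query, int rows) throws UnsupportedEncodingException {
        return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) throws Exception {
        // Assumed local core; adjust to the actual installation.
        System.out.println(build("http://localhost:8983/solr/collection1", "name:doc1", 10));
    }
}
```

Any HTTP client, in any language, can fetch the resulting URL and parse the JSON response.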
SolrJ
•  SolrJ is a Java client for accessing Solr.
•  It offers a Java interface to add, update, and query the Solr index.
•  Version: SolrJ is released together with Solr and shares its version number (the examples here use the 4.x API).
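SolrJ (example)
    import java.util.ArrayList;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
    server.deleteByQuery("*:*"); // CAUTION: deletes everything!

    SolrInputDocument doc1 = new SolrInputDocument();
    doc1.addField("id", 23425);
    doc1.addField("name", "doc1");
    doc1.addField("price", 100980);

    SolrInputDocument doc2 = new SolrInputDocument();
    doc2.addField("id", 63432);
    doc2.addField("name", "doc2");
    doc2.addField("price", 205345);

    // Send both documents in one request, then make them searchable.
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    docs.add(doc1);
    docs.add(doc2);
    server.add(docs);
    server.commit();

    // Both clauses are required: name ending in "c1" and price equal to 100980.
    SolrQuery query = new SolrQuery();
    query.setQuery("+name:*c1 +price:100980");
    QueryResponse rsp = server.query(query);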
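SolrJ (example, continued)
    import java.util.List;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    // Read the results as generic documents...
    SolrDocumentList docsr = rsp.getResults();
    for (SolrDocument document : docsr) {
        Object id = document.getFieldValue("id");
        System.out.println(id);
    }

    // ...or map them onto annotated beans (see the Product class).
    List<Product> products = rsp.getBeans(Product.class);
    for (Product product : products) {
        String id = product.getId();
        System.out.println(id);
    }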
SolrJ (Product class)
    import org.apache.solr.client.solrj.beans.Field;

    public class Product {
        private String id;

        public String getId() {
            return id;
        }

        // Maps the Solr field "id" onto this setter.
        @Field("id")
        public void setId(String id) {
            this.id = id;
        }

        // …the same for price and name attributes.
    }
SolrJ (file indexing)
    import java.io.File;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public static void indexPdfWithSolrJ(String fileName, String solrId) throws Exception {
        String urlString = "http://localhost:8983/solr";
        SolrServer solr = new HttpSolrServer(urlString);
        // Send the PDF to the extracting request handler.
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        up.addFile(new File(fileName), "application/pdf");
        up.setParam("literal.id", solrId); // value for the document's id field
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // commit, waitFlush, waitSearcher
        solr.request(up);
        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println(rsp);
    }
References
•  Lucene & Solr: http://lucene.apache.org/solr/
•  SolrJ: http://wiki.apache.org/solr/Solrj
•  Tika: http://tika.apache.org/
