Archive for January, 2009

Hibernate Search Part 2 – Document Search

January 19th, 2009

This is a continuation of my previous post, Hibernate Search Part 1 – Database Search.

In that post I mentioned another cool feature that can be implemented using Hibernate Search: the ability to search inside other file types, such as Microsoft Word documents, using Apache POI. However, the internetz were light on information on how to do this. I did find one tutorial on JavaWorld.com.

However, despite this tutorial being well written, I found it overly complicated for an “Introduction” type post. This could be because it contains a lot of Spring, and I know next to nothing about Spring. I did get it working by stripping out the key parts and including them in my own simple example.

I have a Documents class in which I store a file name, a file type and a Blob containing the contents of an MS Word document.

package com.jsf.entities;

import com.jsf.util.WordDocHandlerBridge;
import java.io.Serializable;
import javax.persistence.*;
import org.hibernate.search.annotations.*;
import static org.jboss.seam.ScopeType.SESSION;
import org.jboss.seam.annotations.*;

@Entity
@Name("documentsdetails")
@Table(name = "DOCUMENTS")
@Scope(SESSION)
@SequenceGenerator(name = "DOCUMENT_SEQUENCE_GENERATOR", sequenceName = "DOCUMENTS_S")
@Indexed(index = "indexes/documents")
public class Documents implements Serializable {
    public Documents() {
    }

    @Id
    @DocumentId
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "DOCUMENT_SEQUENCE_GENERATOR")
    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    @Field(index = Index.TOKENIZED)
    public String getFileName() {
        return fileName;
    }

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public String getFileType() {
        return fileType;
    }

    public void setFileType(String fileType) {
        this.fileType = fileType;
    }

    @Field(name = "fileData", index = Index.TOKENIZED, store = Store.YES)
    @FieldBridge(impl = WordDocHandlerBridge.class)
    @Lob
    @Basic(fetch = FetchType.EAGER)
    public byte[] getFileData() {
        return fileData;
    }

    public void setFileData(byte[] fileData) {
        this.fileData = fileData;
    }

    private int id;
    private String fileName;
    private String fileType;
    private byte[] fileData;
}

Documents.java is my documents entity bean. The only real difference here compared to the entity bean from my previous post is the inclusion of @FieldBridge(impl = WordDocHandlerBridge.class) on getFileData. This tells Hibernate Search how to handle the byte[] containing the contents of the Word document. The bridge class itself is a copy of the one from JavaWorld.com.

package com.jsf.util;

import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.hibernate.search.bridge.StringBridge;

public class WordDocHandlerBridge implements StringBridge {

    public String objectToString(Object arg0) {
        ByteArrayInputStream bais = new ByteArrayInputStream((byte[]) arg0);
        StringBuilder _result = new StringBuilder();
        try {
            POIFSFileSystem poiStream = new POIFSFileSystem(bais);
            HWPFDocument doc = new HWPFDocument(poiStream);
            Range range = doc.getRange();
            int np = range.numParagraphs();
            // Append the text of every paragraph, separated by spaces
            for (int i = 0; i < np; i++) {
                _result.append(range.getParagraph(i).text());
                _result.append(" ");
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return _result.toString();
    }
}

This takes the byte[] and converts it to a String, which is what gets indexed and searched. This is a copy of the class given in the JavaWorld.com example.

Finally, I have a session bean which creates the Documents objects, does the indexing and does the search.

package com.jsf.controllers.documents;

import com.jsf.entities.Documents;
import java.io.*;
import java.util.List;
import javax.ejb.*;
import javax.persistence.*;
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.hibernate.search.jpa.*;
import static org.jboss.seam.ScopeType.CONVERSATION;
import org.jboss.seam.annotations.*;

@Stateful
@Scope(CONVERSATION)
@Name("documentBean")
@Local(value = DocumentController.class)
public class DocumentBean implements DocumentController {
    static Logger log = Logger.getLogger(DocumentBean.class.getName());

    @PersistenceContext
    private EntityManager em;

    @In
    private FullTextEntityManager entityManager;

    @SuppressWarnings("unchecked")
    public void uploadDocument() {
        Documents doc = new Documents();
        doc.setFileName("pew.doc");
        doc.setFileType("Doc");
        File aFile = new File("C:/example1.doc");
        FileInputStream inFile = null;
        try {
            // Create input stream from file
            inFile = new FileInputStream(aFile);
            // Convert to byte array; readFully keeps reading until the array is full
            byte[] theBytes = new byte[(int) aFile.length()];
            new DataInputStream(inFile).readFully(theBytes);
            doc.setFileData(theBytes);
            // Persist doc
            em.merge(doc);
            // Index documents
            FullTextEntityManager fullTextEntityManager = Search.createFullTextEntityManager(entityManager);
            // Typed list, so the for-each loop below compiles
            List<Documents> docs = entityManager.createQuery("select c from Documents as c").getResultList();
            for (Documents doc1 : docs) {
                fullTextEntityManager.index(doc1);
            }
        } catch (Exception e11) {
            e11.printStackTrace();
        } finally {
            // Always release the file handle
            if (inFile != null) {
                try {
                    inFile.close();
                } catch (IOException ignored) {
                }
            }
        }
        doSearch();
    }

    public void doSearch() {
        FullTextEntityManager fullTextEntityManager = Search.createFullTextEntityManager(entityManager);
        // Search on fileData or fileName; the names must match the indexed
        // field names, which default to the bean property names
        String[] productFields = {"fileData", "fileName"};
        MultiFieldQueryParser parser = new MultiFieldQueryParser(productFields, new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true);
        org.apache.lucene.search.Query luceneQuery;
        try {
            // Search for the String "Pew" in all the indexed documents
            luceneQuery = parser.parse("Pew");
            List resultsList = fullTextEntityManager.createFullTextQuery(luceneQuery, Documents.class)
                    .getResultList();
            // List of Documents which contain Pew in the name or data
            log.debug("Results Found " + resultsList.size());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Remove
    @Destroy
    public void destroy() {
    }
}

My file C:/example1.doc is a simple MS Word file which contains the phrase “Pew Pew Pew”.

Running the uploadDocument method will create and index a new Documents object. Running the search will then return a list of Documents objects which contain the word “Pew” in the body.
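One detail of the upload step is worth a closer look: turning a stream into a byte[]. A single read() call is allowed to return fewer bytes than requested, and available() is only an estimate, so a loop that reads until end of stream is the safer pattern. The following is a minimal sketch using only the JDK; the class name StreamBytes and the method readFully are my own illustrative names, not part of the example project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the attached example code.
public class StreamBytes {

    // Reads the stream until end of stream and returns everything that was read.
    // It does not rely on available() or on a single read() filling the buffer.
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the FileInputStream used in the post
        InputStream in = new ByteArrayInputStream("Pew Pew Pew".getBytes("UTF-8"));
        try {
            byte[] copy = readFully(in);
            System.out.println(copy.length + " bytes read");
        } finally {
            in.close();
        }
    }
}
```

In the session bean, the result of a call like readFully(new FileInputStream(aFile)) could be passed straight to setFileData.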

@In
private FullTextEntityManager entityManager;

is a new component which needs to be added in order to index a set of documents. To make it available for injection, minor changes need to be made to persistence.xml and components.xml.

In practice, Apache POI’s Word document handling seems a bit fragile. According to their site, it only supports Word files made in versions of Office up to and including 2003; Office 2007 is not supported. It also doesn’t like OpenOffice Word files: it gives a strange error about properties not being set. I am not sure of the current status of the project itself, as they say they are currently lacking a developer for the Word handling components.

Attached is a zip file containing Documents.java, WordDocHandlerBridge.java and DocumentBean.java. Also included are 2 sample Word files and a copy of my persistence.xml and components.xml.
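Given those limits, it can help to look at the first few bytes of an upload before handing it to POI. Word files up to 2003 are OLE2 compound documents and start with the signature D0 CF 11 E0 A1 B1 1A E1, while Office 2007 .docx files are ZIP archives and start with "PK". Below is a rough sketch using only the JDK; WordFormatSniffer and its methods are hypothetical names of mine, and a matching signature only says the container looks right, not that POI will parse the file (an OpenOffice-saved .doc would still pass this check).

```java
import java.util.Arrays;

// Hypothetical pre-check, not part of the attached example code.
public class WordFormatSniffer {

    // Signature of an OLE2 compound document (the container used by Word 97-2003)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (the container used by .docx)
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // True if data begins with exactly the bytes in prefix
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Candidate for the HWPF-based bridge
    public static boolean looksLikeOle2(byte[] data) {
        return startsWith(data, OLE2_MAGIC);
    }

    // Probably an Office 2007 (or other ZIP-based) file, which HWPF cannot read
    public static boolean looksLikeZip(byte[] data) {
        return startsWith(data, ZIP_MAGIC);
    }

    public static void main(String[] args) {
        byte[] fakeDocx = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println("OLE2? " + looksLikeOle2(fakeDocx));
        System.out.println("ZIP?  " + looksLikeZip(fakeDocx));
    }
}
```

A check like this could run in the upload method, so unsupported files are rejected with a clear message instead of failing later during indexing.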
Example Code

Pew

Hibernate, Seam, Tutorial

Hibernate Search Part 1 – Database Search

January 19th, 2009

Hibernate Search is a project which allows you to execute full text queries against a database using the Apache Lucene API.

“Hibernate Search brings the power of full text search engines to the persistence domain model and Hibernate experience, through transparent configuration (Hibernate Annotations) and a common API.”

As part of my ongoing JBoss Seam research I was asked to investigate how we could run “proper” queries against a number of fields in a database, with the additional possibility of ranking results in order of relevance. Hibernate Search does this almost out of the box.

I was going to write another “Hello World” tutorial showing how to do this, until I found an excellent tutorial by Mike Desjardins.

Mike’s tutorial is as simple and straightforward as the one I would have created, with the exception that mine was going to be about books and not cheese.

OK, his is more original, but I made a logo I was going to use for mine. Well, OK, I used MS Paint on someone else’s great image (sorry, whoever you are, I couldn’t find a link to credit you).

Honest Jordans Discount Books Banner

Mike doesn’t use Seam in his example, but it’s obvious how it could be applied: just stick the contents of the servlet in a session bean and you’re done.

Like I said, it is also possible to implement a ranking system based on the importance of one field over another. This is done with the @Boost annotation, which takes a floating point value as a parameter and multiplies the ranking of an indexed field by that amount. For example (and I’m going to borrow Mike’s code here):


@Entity @Table(name="ORIGIN")
public class Origin {

    @Id @Column(name="origin_id")
    private Integer id;

    // Matches on the origin name count 5 times as much
    @Field(index=Index.TOKENIZED)
    @Boost(5f)
    @Basic @Column(name="name")
    private String name;

    @Version @Column(name="version")
    private Integer version;

    // Accessors omitted
}

@Entity @Table(name="MILK")
public class Milk {

    @Id @Column(name="milk_id")
    private Integer id;

    // Matches on the milk name count twice as much
    @Field(index=Index.TOKENIZED)
    @Boost(2f)
    @Basic @Column(name="name")
    private String name;

    @Version @Column(name="version")
    private Integer version;

    // Accessors omitted
}

As you can see, I have applied the @Boost annotation to these entities: the name of the place of Origin is given a boost value of 5 and the name of the Milk is given a boost value of 2.

Now assume the user entered “Scottish” into the search field and this returned 2 values: an Origin with the name “Scottish” and a Milk with the name “Scottish”. As the Origin has the boost of 5, it scores higher and will be returned first, allowing the results to be displayed by ranking.
This is how it is possible to search for basic database objects using Hibernate Search.
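To make the effect of the boost values concrete, here is a deliberately simplified model of the “Scottish” example in plain Java. It assumes both hits have the same base score and that the final score is just base score times boost; Lucene’s real scoring involves more factors, and the class BoostRankingSketch and its Hit type are illustrative names of mine, not Hibernate Search or Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Simplified model only: real Lucene scores depend on more than the boost.
public class BoostRankingSketch {

    public static class Hit {
        public final String entity;
        public final float baseScore;
        public final float boost;

        public Hit(String entity, float baseScore, float boost) {
            this.entity = entity;
            this.baseScore = baseScore;
            this.boost = boost;
        }

        // Assumed scoring rule for this sketch: the boost multiplies the base score
        public float score() {
            return baseScore * boost;
        }
    }

    // Returns a new list ordered from highest to lowest boosted score
    public static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        Collections.sort(sorted, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score(), a.score());
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit("Milk", 1.0f, 2f));   // name boosted by 2
        hits.add(new Hit("Origin", 1.0f, 5f)); // name boosted by 5
        for (Hit hit : rank(hits)) {
            System.out.println(hit.entity + " " + hit.score());
        }
    }
}
```

With equal base scores the Origin hit sorts ahead of the Milk hit, which is the ordering described above.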

In my next post I am going to talk about how it is possible to search a Microsoft Word document using Hibernate Search and Apache POI.

A note on setting up Hibernate Search: I had some problems getting the class path set up. The distribution I downloaded seemed to be missing some of the required lib files; however, I found I did have all the correct files in the lib directory of my Seam 2.1.0 deployment.

Hibernate, Seam, Tutorial

Seam,EJB3,Oracle Code Example .zip

January 5th, 2009

In my post JBoss Seam, JSF,Oracle and EJB3 Hello World I gave a step by step example of how to set up a basic project using all of the aforementioned technologies.

I also provided code examples for each file you would need. However, over the Christmas holidays I discovered that the plug-in I use to display code did not correctly format XML files. I don’t know why, it just didn’t. As I was on holiday I didn’t have access to the machine which had my source code on it, so it had to go uncorrected till today.

I have taken all the examples I give in the post, zipped them up and uploaded them. You can download the zip file at http://jdick.co.uk/blog/code/JSFExample.zip. This is only the source code; there are no supporting jars included, so you have to go find those yourself from the sources I mention in the post JBoss Seam, JSF,Oracle and EJB3 Hello World.

So download, use, learn, enjoy, and please feel free to comment if you find I have screwed something up and I’ll do my best to fix it.

EJB3, JSF, Seam, Source, Tutorial