This is a continuation of my previous post, Hibernate Search Part 1 – Database Search.
I mentioned that another cool feature that can be implemented using Hibernate Search is the ability to search inside other file types, such as Microsoft Word documents, using Apache POI. However, the internetz were light on information on how to do this. I did find one tutorial on JavaWorld.com.
However, despite this tutorial being well written, I found it overly complicated for an “Introduction” type post. This could be due to the fact that it contains a lot of Spring and I know next to nothing about Spring. I did get it working by stripping out the key parts and including them in my own simple example.
I have a Documents class in which I store a filename, a file type and a Blob containing the contents of an MS Word document.
package com.jsf.entities;
import com.jsf.util.WordDocHandlerBridge;
import java.io.Serializable;
import javax.persistence.*;
import org.hibernate.search.annotations.*;
import static org.jboss.seam.ScopeType.SESSION;
import org.jboss.seam.annotations.*;
@Entity
@Name("documentsdetails")
@Table(name = "DOCUMENTS")
@Scope(SESSION)
@SequenceGenerator(name = "DOCUMENT_SEQUENCE_GENERATOR", sequenceName = "DOCUMENTS_S")
@Indexed(index = "indexes/documents")
public class Documents implements Serializable {
public Documents() {
}
@Id
@DocumentId
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "DOCUMENT_SEQUENCE_GENERATOR")
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
@Field(index = Index.TOKENIZED)
public String getFileName() {
return fileName;
}
public void setFileName(String fileName) {
this.fileName = fileName;
}
public String getFileType() {
return fileType;
}
public void setFileType(String fileType) {
this.fileType = fileType;
}
@Field(name = "fileData", index = Index.TOKENIZED, store = Store.YES)
@FieldBridge(impl = WordDocHandlerBridge.class)
@Lob
@Basic(fetch = FetchType.EAGER)
public byte[] getFileData() {
return fileData;
}
public void setFileData(byte[] fileData) {
this.fileData = fileData;
}
private int id;
private String fileName;
private String fileType;
private byte[] fileData;
}
Documents.java: my documents entity bean. The only real difference here compared to the entity bean from my previous post is the inclusion of @FieldBridge(impl = WordDocHandlerBridge.class) on getFileData, which tells Hibernate Search how to handle the byte[] containing the contents of the Word document. The bridge class itself is a copy of the one from JavaWorld.com.
package com.jsf.util;
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.hibernate.search.bridge.StringBridge;
public class WordDocHandlerBridge implements StringBridge {
public String objectToString(Object arg0) {
ByteArrayInputStream bais = new ByteArrayInputStream((byte[]) arg0);
StringBuilder _result = new StringBuilder();
try {
POIFSFileSystem poiStream = new POIFSFileSystem(bais);
HWPFDocument doc = new org.apache.poi.hwpf.HWPFDocument(poiStream);
Range range = doc.getRange();
int np = range.numParagraphs();
for (int i = 0; i < np; i++) {
_result.append(range.getParagraph(i).text());
_result.append(" ");
}
} catch (IOException ex) {
ex.printStackTrace();
}
return _result.toString();
}
}
This takes the byte[] and converts it to a String, which is what gets indexed and searched. It is a copy of the class given in the JavaWorld.com example.
Finally, I have a session bean which creates the Documents entities, does the indexing and runs the search.
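One thing the bridge above does not guard against is a null or unreadable value: new ByteArrayInputStream(null) throws a NullPointerException, and as far as I can tell POI can also throw unchecked exceptions on files it does not understand. Below is a rough sketch of a more defensive shape for the same conversion. It uses plain Java only; SafeTextBridge and its extractor parameter are names I made up for illustration (they are not part of Hibernate Search or POI), and a simple UTF-8 decoder stands in for the POI text extraction.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical stand-in for a string bridge; NOT a Hibernate Search class.
final class SafeTextBridge {
    // The extractor is an assumption: in the real bridge this would be the POI code.
    private final Function<byte[], String> extractor;

    SafeTextBridge(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    // Same shape as objectToString(Object): always returns a String, never throws.
    String objectToString(Object value) {
        if (!(value instanceof byte[])) {
            return ""; // null or unexpected type: nothing to index
        }
        byte[] bytes = (byte[]) value;
        if (bytes.length == 0) {
            return "";
        }
        try {
            String text = extractor.apply(bytes);
            return text == null ? "" : text.trim();
        } catch (RuntimeException ex) {
            return ""; // unreadable document: index it as empty text
        }
    }

    public static void main(String[] args) {
        SafeTextBridge bridge =
                new SafeTextBridge(b -> new String(b, StandardCharsets.UTF_8));
        byte[] data = "Pew Pew Pew ".getBytes(StandardCharsets.UTF_8);
        System.out.println(bridge.objectToString(data));
        System.out.println(bridge.objectToString(null).isEmpty());
    }
}
```

Whether an unreadable document should be indexed as empty text or should fail loudly is a design choice; for a demo I would rather have the upload succeed and the document simply not match any search.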
package com.jsf.controllers.documents;
import com.jsf.entities.Documents;
import java.io.*;
import java.util.List;
import javax.ejb.*;
import javax.persistence.*;
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.hibernate.search.jpa.*;
import static org.jboss.seam.ScopeType.CONVERSATION;
import org.jboss.seam.annotations.*;
@Stateful
@Scope(CONVERSATION)
@Name("documentBean")
@Local(value = DocumentController.class)
public class DocumentBean implements DocumentController {
static Logger log = Logger.getLogger(DocumentBean.class.getName());
@PersistenceContext
private EntityManager em;
@In
private FullTextEntityManager entityManager;
public void uploadDocument() {
Documents doc = new Documents();
doc.setFileName("pew.doc");
doc.setFileType("Doc");
File aFile = new File("C:/example1.doc");
try {
//Create Input Stream from File
DataInputStream inFile = new DataInputStream(new FileInputStream(aFile));
//Convert to byte Array; readFully blocks until the whole array has been filled
byte[] theBytes = new byte[(int) aFile.length()];
inFile.readFully(theBytes);
inFile.close();
doc.setFileData(theBytes);
//Persist Doc
em.merge(doc);
//Index Documents
FullTextEntityManager fullTextEntityManager = Search.createFullTextEntityManager(entityManager);
//Typed list so the for-each loop below compiles
List<Documents> docs = entityManager.createQuery("select c from Documents as c").getResultList();
for (Documents doc1 : docs) {
fullTextEntityManager.index(doc1);
}
} catch (Exception e11) {
e11.printStackTrace();
}
doSearch();
}
public void doSearch() {
FullTextEntityManager fullTextEntityManager =
org.hibernate.search.jpa.Search.createFullTextEntityManager(entityManager);
//Do Search on fileData or fileName (index field names are case-sensitive and default to the property name)
String[] productFields = {"fileData", "fileName"};
MultiFieldQueryParser parser = new MultiFieldQueryParser(productFields, new StandardAnalyzer());
parser.setAllowLeadingWildcard(true);
org.apache.lucene.search.Query luceneQuery;
try {
//Search for the String "Pew" in all the indexed documents
luceneQuery = parser.parse("Pew");
List resultsList = fullTextEntityManager.createFullTextQuery(luceneQuery, Documents.class)
.getResultList();
// Return list of Documents which contain Pew in the Name or Data
log.debug("Results Found " + resultsList.size());
} catch (Exception e) {
e.printStackTrace();
}
}
@Remove
@Destroy
public void destroy() {
}
}
My file C:/example1.doc is a simple MS Word file which contains the phrase “Pew Pew Pew”
Running the uploadDocument method will create and index a new Documents object; running the search will then return a list of Documents objects which contain the word “Pew” in the body.
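Reading the whole file into a byte[] is the part of uploadDocument that is easiest to get subtly wrong, because a single read() call is allowed to return fewer bytes than you asked for. As a rough sketch, here is a small helper that keeps reading until the stream is exhausted. StreamBytes and readFully are names I picked for this example, and the try-with-resources in main assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the attached example code.
final class StreamBytes {
    // Reads until end of stream, however many read() calls that takes.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = "Pew Pew Pew".getBytes(StandardCharsets.UTF_8);
        // A ByteArrayInputStream stands in for the FileInputStream here.
        try (InputStream in = new ByteArrayInputStream(source)) {
            System.out.println(readFully(in).length); // prints 11
        }
    }
}
```

In the bean, the result of readFully(new FileInputStream(aFile)) would be what gets passed to doc.setFileData.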
@In
private FullTextEntityManager entityManager;
is a new component which needs to be added in order to index a set of documents. To get it, minor changes need to be made to persistence.xml and components.xml.
In practice, Apache POI's Word document handling seems a bit fragile. According to their site, it only supports Word files made in versions of Office up to and including 2003; Office 2007 is not supported. It also doesn't like OpenOffice Word files: it gives a strange error about properties not being set. I am not sure of the current status of the project itself, as they say they are currently lacking a developer for the Word handling components.
Attached is a zip file containing Documents.java, WordDocHandlerBridge.java and DocumentBean.java. Also included are 2 sample Word files and a copy of my persistence.xml and components.xml.
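Given those limits, one cheap precaution is to look at the first few bytes of the upload before handing it to POI. Word 97-2003 .doc files are OLE2 compound documents and start with a fixed 8-byte signature, while Office 2007 .docx files (and OpenOffice .odt files) are ZIP archives starting with "PK". The sketch below is my own illustration rather than anything from POI; WordFormatSniffer and its Format enum are made-up names. Note that the OLE2 signature is shared by old Excel and PowerPoint files too, so a match only means "worth trying", not "definitely a Word document".

```java
// Hypothetical helper for illustration; not part of Apache POI.
final class WordFormatSniffer {
    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature at the start of a ZIP archive (.docx, .odt).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    static Format sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipLike = { 'P', 'K', 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipLike)); // prints ZIP
    }
}
```

A bridge could call sniff first and only attempt the POI extraction when the result is OLE2, returning an empty string for anything else.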
Example Code
