Making your app searchable using self merge-joins ~ Cloud World

A picture is worth a thousand words, and as developers, you know that a working code snippet can be worth even more. The developers at scisurfer.com have agreed to share a few of their code snippets with you today. The snippets outline how they full text index their content and make it easily searchable for their users.

---

Many applications can benefit from full text search. Using Brett Slatkin's presentation at Google I/O, the implementation is pretty straight-forward. The following article gives you a practical introduction of how to implement full text search on GAE. The code is GAE/J + JDO only, but the concepts can be easily converted into Python or JPA.

Goals

Develop a guestbook example (much like the one shipped with the SDK), but with searchable text

The full text search should be fuzzy, within some reasonable limits

Some things before we start:

Self merge-joins and list properties: You can query an entity efficiently based on list properties via self merge-joins. We will not talk about that in detail, but you should watch Brett Slatkin's excellent talk at Google I/O '09 about the topic. It should answer most of your questions: Google I/O 2009 - Building Scalable, Complex Apps on App Engine.

Full Text Search (FTS): FTS is a really huge topic, and it can be done in a myriad of different ways. Check out wikipedia for a primer: http://en.wikipedia.org/wiki/Full_text_search

The art of stemming: One of the most basic things done to enable some form of inexact search is called "stemming". It's the reduction of words towards their basic form. http://en.wikipedia.org/wiki/Stemming

The project

The whole project is available on Google Code http://code.google.com/p/guestbook-example-appengine-full-text-search.

A live demo is available at http://guestbook-example-fts.appspot.com/

The project - walk-through: indexing

guestbook.jsp: This file is the first file that is loaded. You can enter new entries into the guestbook (GuestBookEntry) or search all guestbook entries.

GuestBookEntry.java: A simple JDO file with persistent fields. However, there is one special field: fts. It is a Set of Strings. The set will be filled with the terms that allow for a full text search. If you inspect the constructor, you will see a call which is responsible for making this GuestBookEntry searchable:
```
SearchJanitor.updateFTSStuffForGuestBookEntry(this);
```

SearchJanitor.java: The updateFTSStuffForGuestBookEntry method gets a GuestBookEntry and chops it into single words using:
```
SearchJanitorUtils.getTokensForIndexingOrQuery(...);
```

SearchJanitorUtils.java: The getTokensForIndexingOrQuery(...) method uses Apache Lucene and Lucene Snowball to extract words from the given string. But it does more: The Lucene Snowball stemmer reduces the words to the basic form which enables fuzzy search. A search for Kids or Kid will return the same results. kid (lowercase) and Kids will also return the same results.

Indexing summary: So far, we have an entity (GuestBookEntry) that will be filled with a set of Strings generated from it's content. So far, so good. But the real search component is missing.

The project - walk-through: searching

search.jsp: This file gets a parameter "search" and presents results for that search. It does that by consulting:
```
List searchResults =
    SearchJanitor.searchGuestBookEntries(searchString, pm);
```

SearchJanitor.java : The searchGuestBookEntries method does all the magic. It again chops the search string into single, stemmed words (using the SearchJanitorUtils) and constructs a query that searches for all these Strings in the fts field of entity GuestBookEntry.

StringBuffer queryBuffer = new StringBuffer();
queryBuffer.append("SELECT FROM " +
    GuestBookEntry.class.getName() + " WHERE ");

Set queryTokens = SearchJanitorUtils
    .getTokensForIndexingOrQuery(queryString,
        MAXIMUM_NUMBER_OF_WORDS_TO_SEARCH);

List parametersForSearch = new ArrayList(queryTokens);

StringBuffer declareParametersBuffer = new StringBuffer();
int parameterCounter = 0;

while (parameterCouter < queryTokens.size()) {
  queryBuffer.append("fts == param" + parameterCounter);
  declareParametersBuffer.append("String param" + parameterCounter);

  if (parameterCounter + 1 < queryTokens.size()) {
    queryBuffer.append(" && ");
    declareParametersBuffer.append(", ");
  }

  parameterCounter++;
}
    
Query query = pm.newQuery(queryBuffer.toString());
query.declareParameters(declareParametersBuffer.toString());
List result = (List) query
    .executeWithArray(parametersForSearch.toArray());

Searching summary: We have a search.jsp that uses the same stemming as in the indexing part to translate a string into a searchable set of strings. This set of strings is then in turn queried against the datastore (in the form of self merge-joins on one field). Mission accomplished: the guestbook application can now do full text search, including alternate "fuzzy" spellings of given words.

Limitations of the approach

1MB limit on entities. You cannot store more than 1MB in one entity. You can work around this limitation by generating more than one entity or by creating relation index entities as described in Brett Slatkin's presentation.

Number of search terms is limited (to roughly 5). You cannot search for a unlimited number of query terms, as you are doing a (potentially costly) self merge-join. But in most cases the results the user will get with a limited set of search terms will be fine.

If you get too many results this approach will not work. You have to make sure you are searching in a subset of the data with less than ~200 results. That's up to you. If you do not have many entities you are querying this error will never show up. If you have thousands of entries you should make sure you are only getting subsets. E.g. by only retrieving the "best" results (for whatever your application's secret sauce is). Alternatively, you can achieve this by only looking at results from a particular day or other small window.

Where to go from here

Use reference entities to avoid serialization costs (as outlined in Brett Slatkin's talk)

Add key-only queries for more efficient searches http://gae-java-persistence.blogspot.com/2009/10/keys-only-queries.html

Add memcache support for faster queries http://code.google.com/appengine/docs/java/memcache/overview.html

Add some secret sauce that enables you to "rank" results

Precompute date and sub-timestamps to allow you to search for them. Don't search for date ranges, which don't work with merge-join, but instead search for exact values, like 2010-04 (all entries in April 2010)

About us

This approach has been successfully applied in our GAE project http://scisurfer.com/ (scientific knowledge management). We added some secret sauce of our own for ranking, but in general, this method works (and scales) really well for us.

Please feel free to contact us about this post, or our project anytime at raphael.andre.bauer@gmail.com.

Thanks!

Nico Güttler, Dominic Jansen, Raphael Bauer (http://corporate.scisurfer.com/)

Posted by Fred Sauer, App Engine Team

Cloud World

Monday, 26 April 2010