Cloud World

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Monday, 26 April 2010

Making your app searchable using self merge-joins

Posted on 11:06 by Unknown
A picture is worth a thousand words, and as developers, you know that a working code snippet can be worth even more. The developers at scisurfer.com have agreed to share a few of their code snippets with you today. The snippets outline how they full text index their content and make it easily searchable for their users.



---



Many applications can benefit from full text search. Using Brett Slatkin's presentation at Google I/O, the implementation is pretty straight-forward. The following article gives you a practical introduction of how to implement full text search on GAE. The code is GAE/J + JDO only, but the concepts can be easily converted into Python or JPA.




Goals



  • Develop a guestbook example (much like the one shipped with the SDK), but with searchable text

  • The full text search should be fuzzy, within some reasonable limits



Some things before we start:



  • Self merge-joins and list properties: You can query an entity efficiently based on list properties via self merge-joins. We will not talk about that in detail, but you should watch Brett Slatkin's excellent talk at Google I/O '09 about the topic. It should answer most of your questions: Google I/O 2009 - Building Scalable, Complex Apps on App Engine.

  • Full Text Search (FTS): FTS is a really huge topic, and it can be done in a myriad of different ways. Check out wikipedia for a primer: http://en.wikipedia.org/wiki/Full_text_search

  • The art of stemming: One of the most basic things done to enable some form of inexact search is called "stemming". It's the reduction of words towards their basic form. http://en.wikipedia.org/wiki/Stemming



The project


The whole project is available on Google Code http://code.google.com/p/guestbook-example-appengine-full-text-search.



A live demo is available at http://guestbook-example-fts.appspot.com/




The project - walk-through: indexing



  • guestbook.jsp: This file is the first file that is loaded. You can enter new entries into the guestbook (GuestBookEntry) or search all guestbook entries.

  • GuestBookEntry.java: A simple JDO file with persistent fields. However, there is one special field: fts. It is a Set of Strings. The set will be filled with the terms that allow for a full text search. If you inspect the constructor, you will see a call which is responsible for making this GuestBookEntry searchable:



    SearchJanitor.updateFTSStuffForGuestBookEntry(this);




  • SearchJanitor.java: The updateFTSStuffForGuestBookEntry method gets a GuestBookEntry and chops it into single words using:



    SearchJanitorUtils.getTokensForIndexingOrQuery(...);




  • SearchJanitorUtils.java: The getTokensForIndexingOrQuery(...) method uses Apache Lucene and Lucene Snowball to extract words from the given string. But it does more: The Lucene Snowball stemmer reduces the words to the basic form which enables fuzzy search. A search for Kids or Kid will return the same results. kid (lowercase) and Kids will also return the same results.




Indexing summary: So far, we have an entity (GuestBookEntry) that will be filled with a set of Strings generated from it's content. So far, so good. But the real search component is missing.


The project - walk-through: searching



  • search.jsp: This file gets a parameter "search" and presents results for that search. It does that by consulting:



    List searchResults =
    SearchJanitor.searchGuestBookEntries(searchString, pm);




  • SearchJanitor.java : The searchGuestBookEntries method does all the magic. It again chops the search string into single, stemmed words (using the SearchJanitorUtils) and constructs a query that searches for all these Strings in the fts field of entity GuestBookEntry.


StringBuffer queryBuffer = new StringBuffer();
queryBuffer.append("SELECT FROM " +
GuestBookEntry.class.getName() + " WHERE ");

Set queryTokens = SearchJanitorUtils
.getTokensForIndexingOrQuery(queryString,
MAXIMUM_NUMBER_OF_WORDS_TO_SEARCH);

List parametersForSearch = new ArrayList(queryTokens);

StringBuffer declareParametersBuffer = new StringBuffer();
int parameterCounter = 0;

while (parameterCouter < queryTokens.size()) {
queryBuffer.append("fts == param" + parameterCounter);
declareParametersBuffer.append("String param" + parameterCounter);

if (parameterCounter + 1 < queryTokens.size()) {
queryBuffer.append(" && ");
declareParametersBuffer.append(", ");
}

parameterCounter++;
}

Query query = pm.newQuery(queryBuffer.toString());
query.declareParameters(declareParametersBuffer.toString());
List result = (List) query
.executeWithArray(parametersForSearch.toArray());





Searching summary: We have a search.jsp that uses the same stemming as in the indexing part to translate a string into a searchable set of strings. This set of strings is then in turn queried against the datastore (in the form of self merge-joins on one field). Mission accomplished: the guestbook application can now do full text search, including alternate "fuzzy" spellings of given words.




Limitations of the approach



  • 1MB limit on entities. You cannot store more than 1MB in one entity. You can work around this limitation by generating more than one entity or by creating relation index entities as described in Brett Slatkin's presentation.

  • Number of search terms is limited (to roughly 5). You cannot search for a unlimited number of query terms, as you are doing a (potentially costly) self merge-join. But in most cases the results the user will get with a limited set of search terms will be fine.

  • If you get too many results this approach will not work. You have to make sure you are searching in a subset of the data with less than ~200 results. That's up to you. If you do not have many entities you are querying this error will never show up. If you have thousands of entries you should make sure you are only getting subsets. E.g. by only retrieving the "best" results (for whatever your application's secret sauce is). Alternatively, you can achieve this by only looking at results from a particular day or other small window.



Where to go from here



  • Use reference entities to avoid serialization costs (as outlined in Brett Slatkin's talk)

  • Add key-only queries for more efficient searches http://gae-java-persistence.blogspot.com/2009/10/keys-only-queries.html

  • Add memcache support for faster queries http://code.google.com/appengine/docs/java/memcache/overview.html

  • Add some secret sauce that enables you to "rank" results

  • Precompute date and sub-timestamps to allow you to search for them. Don't search for date ranges, which don't work with merge-join, but instead search for exact values, like 2010-04 (all entries in April 2010)



About us


This approach has been successfully applied in our GAE project http://scisurfer.com/ (scientific knowledge management). We added some secret sauce of our own for ranking, but in general, this method works (and scales) really well for us.



Please feel free to contact us about this post, or our project anytime at raphael.andre.bauer@gmail.com.



Thanks!

Nico Güttler, Dominic Jansen, Raphael Bauer (http://corporate.scisurfer.com/)





Posted by Fred Sauer, App Engine Team
Email ThisBlogThis!Share to XShare to Facebook
Posted in | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Bridging Mobile Backend as a Service to Enterprise Systems with Google App Engine and Kinvey
    The following post was contributed by Ivan Stoyanov , VP of Engineering for Kinvey, a mobile Backend as a Service provider and Google Cloud ...
  • Tutorial: Adding a cloud backend to your application with Android Studio
    Android Studio lets you easily add a cloud backend to your application, right from your IDE. A backend allows you to implement functionality...
  • 2013 Year in review: topping 100,000 requests-per-second
    2013 was a busy year for Google Cloud Platform. Watch this space: each day, a different Googler who works on Cloud Platform will be sharing ...
  • Easy Performance Profiling with Appstats
    Since App Engine debuted 2 years ago, we’ve written extensively about best practices for writing scalable apps on App Engine. We make writ...
  • TweetDeck and Google App Engine: A Match Made in the Cloud
    I'm Reza and work in London, UK for a startup called TweetDeck . Our vision is to develop the best tools to manage and filter real time ...
  • Scaling with the Kindle Fire
    Today’s blog post comes to us from Greg Bayer of Pulse , a popular news reading application for iPhone, iPad and Android devices. Pulse has ...
  • Who's at Google I/O: Mojo Helpdesk
    This post is part of Who's at Google I/O , a series of guest blog posts written by developers who are appearing in the Developer Sandbox...
  • A Day in the Cloud, new articles on scaling, and fresh open source projects for App Engine
    The latest release of Python SDK 1.2.3, which introduced the Task Queue API and integrated support for Django 1.0, may have received a lot ...
  • SendGrid gives App Engine developers a simple way of sending transactional email
    Today’s guest post is from Adam DuVander, Developer Communications Director at SendGrid. SendGrid is a cloud-based email service that deliv...
  • Qubole helps you run Hadoop on Google Compute Engine
    This guest post comes form Praveen Seluka, Software Engineer at Qubole, a leading provider of Hadoop-as-a-service.  Qubole is a leading pr...

Categories

  • 1.1.2
  • agile
  • android
  • Announcements
  • api
  • app engine
  • appengine
  • batch
  • bicycle
  • bigquery
  • canoe
  • casestudy
  • cloud
  • Cloud Datastore
  • cloud endpoints
  • cloud sql
  • cloud storage
  • cloud-storage
  • community
  • Compute Engine
  • conferences
  • customer
  • datastore
  • delete
  • developer days
  • developer-insights
  • devfests
  • django
  • email
  • entity group
  • events
  • getting started
  • google
  • googlenew
  • gps
  • green
  • Guest Blog
  • hadoop
  • html5
  • index
  • io2010
  • IO2013
  • java
  • kaazing
  • location
  • mapreduce
  • norex
  • open source
  • partner
  • payment
  • paypal
  • pipeline
  • put
  • python
  • rental
  • research project
  • solutions
  • support
  • sustainability
  • taskqueue
  • technical
  • toolkit
  • twilio
  • video
  • websockets
  • workflows

Blog Archive

  • ►  2013 (143)
    • ►  December (33)
    • ►  November (15)
    • ►  October (17)
    • ►  September (13)
    • ►  August (4)
    • ►  July (15)
    • ►  June (12)
    • ►  May (15)
    • ►  April (4)
    • ►  March (4)
    • ►  February (9)
    • ►  January (2)
  • ►  2012 (43)
    • ►  December (2)
    • ►  November (2)
    • ►  October (8)
    • ►  September (2)
    • ►  August (3)
    • ►  July (4)
    • ►  June (2)
    • ►  May (3)
    • ►  April (4)
    • ►  March (5)
    • ►  February (3)
    • ►  January (5)
  • ►  2011 (46)
    • ►  December (3)
    • ►  November (4)
    • ►  October (4)
    • ►  September (5)
    • ►  August (3)
    • ►  July (4)
    • ►  June (3)
    • ►  May (8)
    • ►  April (2)
    • ►  March (5)
    • ►  February (3)
    • ►  January (2)
  • ▼  2010 (38)
    • ►  December (2)
    • ►  October (2)
    • ►  September (1)
    • ►  August (5)
    • ►  July (5)
    • ►  June (6)
    • ►  May (3)
    • ▼  April (5)
      • Making your app searchable using self merge-joins
      • Games on App Engine: An interview with Jay Kyburz,...
      • App Engine SDK 1.3.3 Released
      • Happy Birthday
      • TweetDeck and Google App Engine: A Match Made in t...
    • ►  March (5)
    • ►  February (2)
    • ►  January (2)
  • ►  2009 (47)
    • ►  December (4)
    • ►  November (3)
    • ►  October (6)
    • ►  September (5)
    • ►  August (3)
    • ►  July (3)
    • ►  June (4)
    • ►  May (3)
    • ►  April (5)
    • ►  March (3)
    • ►  February (7)
    • ►  January (1)
  • ►  2008 (46)
    • ►  December (4)
    • ►  November (3)
    • ►  October (10)
    • ►  September (5)
    • ►  August (6)
    • ►  July (4)
    • ►  June (2)
    • ►  May (5)
    • ►  April (7)
Powered by Blogger.

About Me

Unknown
View my complete profile