Chomp Technology Blog

Further Fine-Tuning MySQL Full-Text Search

Posted in Uncategorized by Miker on January 14, 2010

When the Chomp iPhone went out to the public for the first time a few days ago one of the first surprises for the team was how frequently users were using the search function. We assumed that folks would be browsing for apps and generally shy away from searching unless they needed to. There were a few common cases that we missed where searching for apps is pretty essential, bad us… But we’ve done a quick scramble and gotten search to a really good place despite still being far from perfect. The search that we have running now is running on top of a MySQL full text index. We’re going to need to swap to a real search index as some point most likely, something like Solr/Lucene or Sphinx (we’re looking for folks familiar with tools like those, so please let us know if you’re interested in joining the team).

In the meantime though it was just me pretty much by my lonesome, really wanting to get MySQL to be able to at least do a good enough job to buy me a few weeks worth of time. There’s an article in the MySQL docs about fine-tuning MySQL for full text searches, which was a great starting point. Turns out there were a few different issues contributing to the plain old vanilla out-of-the-box MySQL full text searches not performing the way we wanted them to.

The first major issue was the way in which MySQL tokenizes the field being used for the index. The really simple settings were things like tuning the stopword list (which actually we’ve removed completely now, we’re taking care of the problem by sanitizing the row we have the index on before we insert into it) and turning the minimum word length down to two characters (so that apps like “Oz” get indexed at all, and so that you can find “Oz Weather” without it getting buried in all the weather apps out there). The more complex bit however was getting punctuation treated as word matter instead of whitespace. There are a bunch of apps where the punctuation is significant, such as “Ping!” or “m:Vampire”. The easiest way to do so (without recompiling MySQL) is to define a new collation and modify the character set in the process. This post, linked from the bottom of the MySQL fine-tuning post, is a great practical example. Docs are hard to find, and the configuration is a bit raw, but once you understand what’s going on it’s pretty trivial to tell the system to use characters like :, $, and % as word characters.  One useful debugging tip there is to dump the index info using the myisam_ftdump utility to get a list of tokens from the index. Really helped when I thought I had the right set of characters in my modified character set but had made a mistake.

The second major issue was sorting by relevance coming out of the match. Because folks tend to use subwords pretty frequently (must be because of the small keyboard, we see lots of stuff like ‘facebo’, which I assume maps to Facebook) we use boolean mode match()es, but we still want to sort by relevance. The common suggestion there is to do a select from the table using a boolean match() and then select the match() of the text against the query to use as a column to sort by. The suggestion is in the second comment down on the MySQL full text search functions page. Not bad, but also far from perfect. I haven’t dug into the details of how that function operates, but it certainly wasn’t doing what we wanted it to. For instance if someone used to search for “facebook” they would get back a whole bunch of other apps ahead of the facebook app itself. Stuff like “Facebook It!” for example, which I certainly consider to match the term “Facebook” less well than the exact text. What we were looking for was a measure of how far the application name was from the input term and sort by that. What we wanted was the Levenshtein Distance, but unfortunately MySQL doesn’t have that. We could suck all the values out and then compute the distances on the web front end – too nasty for large sets though. Fortunately there’s a Levenshtein Distance User Defined Function for MySQL that did the trick quite nicely. Now the whole thing is fully offloaded to the database and we can hold off making additions to the infrastructure till we have some time to really think through which option we want to go with.

Add in a little bit of text munging before we push the text into the index and we have a version that works with prefix strings for both punctuation bearing and punctuation free version. So now if you search for ‘m:vampire’ or ‘mvampire’ or even ‘mvamp’ on chomp you get back the m:Vampire app as the first hit. The nice thing about having that relevance hack defined as a UDF is that we can retweek it as necessary to generally provide a different kind of relevance. For us it was LD, and that gets us far enough. But if you want to use boolean matches on a full text search and your relevance function is just about anything else that can be evaluated as a pairwise comparison between the query and each individual row this is a decent way to get MySQL sorting things the way you see fit. No recompiling MySQL core required.

Advertisement

5 Responses

Subscribe to comments with RSS.

  1. Joe said, on January 25, 2010 at 5:13 pm

    I am seeing more sub-words in from queries done on full pc also. I suspect is has to do with Google search predictive text / suggested searches and itunes auto complete features. You start typing facebo and it shows a list possible completions in either place. I don’t know if it is feasible but adding a feature to Chomp where you have the 150,597 US apps cached to do auto complete may be a solution that could make queries faster by having specific name to key reference to us for lookup and at the same time increase user experience. The down side would be the memory requirements on device.

  2. Slavic said, on May 7, 2010 at 6:38 pm

    Miker, thanks for sharing with us your valuable experience. I wonder though, how did you make it possible to search subwords such as “facebo” using Mysql FULLTEXT search?

  3. Miker said, on May 20, 2010 at 4:22 pm

    @Slavic – Boolean matches allow for wildcards, so we end up with a query that looks something like this:

    SELECT MATCH(‘Content’) AGAINST (‘facebo’) as Relevance FROM table WHERE MATCH
    (‘Content’) AGAINST(‘facebo*’ IN BOOLEAN MODE) HAVING Relevance > 0.2 ORDER BY Relevance DESC;

    The ‘IN BOOLEAN MODE’ version allows for matching against the substring – which pulls the actual rows. And the MATCH(‘Content’) AGAINST (‘facebo’) calculates the relevance so that we can sort rows.

  4. hadi said, on September 6, 2010 at 6:56 pm

    Hi,

    I try your syntax and the output I get is only Relevance. Is that the purpose? to get relevance number not anything about facebo*

    thank you.

  5. Miker said, on September 7, 2010 at 8:18 pm

    The query was just a sample of how to pull relevance so that you can use it in a sort. You’ll have to include another column in the select, but that depends what you’re trying to search across (for it it’s an ID and not the text of the column).


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.