Saturday, July 19, 2008

Slow Sphinx Indexing

Tejus and I are building a Ruby on Rails site that needs both structured and unstructured search. Hamed suggested that we use Ultrasphinx, a Rails plugin that provides an interface to the Sphinx search backend.

I got everything downloaded and compiled, and had figured out how to debug the nil:NilClass errors that Ultrasphinx's configuration mini-language was generating, and then when I went to build the index for our database of seven documents... it seemed to hang. I was patient though, and let it run in the background for 10 minutes. This might be acceptable on a huge database, but... it was clear that something was wrong.

Several hours of debugging led me to the root cause: sphinx was assuming that the primary keys of the indexed table were sequential, and was creating a query for every 5000 rows between the min and max id of that table. With an auto_increment primary key, this is a valid assumption, but our data was being loaded by an ActiveRecord fixture which was generating random primary keys, so the range between min and max was nearly a billion, thus the number of queries was in the hundreds of thousands, all but seven of them returning nothing.

The solution of course, is to put explicit id's on your fixtures.

2 comments:

Unknown said...

wow, you just saved me like two weeks of tooth-grinding pain. thanks

vjt said...

You saved me, also, from a lot of head-banging and enabling mysql log and ... ;). Thanks. :-).

A side note, with the thinking-sphinx plugin, you can use:

production:
  sql_range_step: 16777216

in your config/sphinx.yml to have it included in the generated config file.

Nice catch! :)

~Marcello