Sphinx search / thinking_sphinx with German umlauts

April 17, 2012

I figured out how to get Sphinx / thinking_sphinx to handle searches containing German umlauts and to match those umlauts case-insensitively.

Compile Sphinx with libstemmer

First, you need to compile Sphinx with libstemmer.

Download Sphinx source code:

$ wget http://sphinxsearch.com/files/sphinx-0.9.9.tar.gz
$ tar xvzf sphinx-0.9.9.tar.gz
$ cd sphinx-0.9.9

Download libstemmer and unpack it inside the Sphinx directory:

$ wget http://snowball.tartarus.org/dist/libstemmer_c.tgz
$ tar xvzf libstemmer_c.tgz  # run this inside sphinx-0.9.9

Then build Sphinx:

$ ./configure --prefix=/usr/local/sphinx --disable-debug --disable-dependency-tracking --with-libstemmer --with-pgsql --with-mysql
$ make
$ sudo make install

Add /usr/local/sphinx/bin to your PATH variable.

Set Up charset_table

Next, configure charset_table in Sphinx. Here is a snippet from my sphinx.yml, which thinking_sphinx uses to generate the Sphinx configuration files.

development:
  port: 9312
  searchd_log_file: "log/searchd.development.log"
  query_log_file: "log/searchd.query.development.log"
  morphology: "libstemmer_de"
  charset_type: 'utf-8'
  charset_table: "0..9, A..Z->a..z, a..z, U+C4->U+E4, U+D6->U+F6, U+DC->U+FC, U+E4, U+F6, U+FC, U+DF"
...

Add similar entries for your other Rails environments such as test and production.
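It is easy to forget one of the settings when copying them between environments. The following is a minimal sketch of a check, assuming sphinx.yml is plain YAML keyed by environment name. The helper `missing_sphinx_settings`, the `REQUIRED_KEYS` list, and the inline sample are hypothetical names made up for illustration and are not part of thinking_sphinx.

```ruby
require "yaml"

# Settings from the snippet above that each environment should carry.
# (Hypothetical list, chosen for this illustration.)
REQUIRED_KEYS = %w[morphology charset_type charset_table].freeze

# Return a Hash of environment name => missing keys, listing only
# environments that lack at least one of the required settings.
def missing_sphinx_settings(yaml_text, environments)
  config = YAML.safe_load(yaml_text) || {}
  environments.each_with_object({}) do |env, report|
    missing = REQUIRED_KEYS - (config[env] || {}).keys
    report[env] = missing unless missing.empty?
  end
end

# Inline sample standing in for a real sphinx.yml.
SAMPLE = <<~YAML
  development:
    morphology: "libstemmer_de"
    charset_type: "utf-8"
    charset_table: "0..9, A..Z->a..z, a..z"
  test:
    morphology: "libstemmer_de"
YAML

p missing_sphinx_settings(SAMPLE, %w[development test production])
```

With this sample, development is complete, test lacks charset_type and charset_table, and production lacks all three settings.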

The key here is charset_table. It declares which characters can make up a word and how uppercase characters are folded to lowercase. My configuration says that, besides digits and ASCII letters, the German umlauts (ä, ö, ü) and the eszett (ß) count as word characters, and that Ä, Ö and Ü are indexed as ä, ö and ü.
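To make the effect of that charset_table more concrete, here is a small Ruby model of the folding it describes. This is only a sketch of the idea and not how Sphinx is implemented: it does not call Sphinx, it ignores the libstemmer_de stemming step, and it assumes that any character missing from the table acts as a word separator. The helper names (`parse_charset_table`, `fold`, `tokens`) are made up for this example.

```ruby
# The charset_table string from the sphinx.yml snippet above.
CHARSET_TABLE = "0..9, A..Z->a..z, a..z, U+C4->U+E4, U+D6->U+F6, " \
                "U+DC->U+FC, U+E4, U+F6, U+FC, U+DF"

# Convert one token such as "A", "9" or "U+C4" into a codepoint.
def codepoint(token)
  token.start_with?("U+") ? token[2..-1].to_i(16) : token.ord
end

# Expand "a..z" into all codepoints of the range; a single token
# becomes a one-element array.
def expand(spec)
  first, last = spec.split("..").map { |t| codepoint(t) }
  (first..(last || first)).to_a
end

# Build a Hash of accepted codepoint => codepoint it is folded to.
def parse_charset_table(table)
  table.split(",").map(&:strip).each_with_object({}) do |entry, map|
    from, to = entry.split("->").map { |side| expand(side) }
    to ||= from # entries without "->" map to themselves
    from.zip(to) { |src, dst| map[src] = dst }
  end
end

# Fold every known character; replace unknown characters with a space.
def fold(text, map)
  text.each_char.map do |ch|
    map.key?(ch.ord) ? map[ch.ord].chr(Encoding::UTF_8) : " "
  end.join
end

# Split the folded text into words.
def tokens(text, map)
  fold(text, map).split
end

MAP = parse_charset_table(CHARSET_TABLE)
p tokens("ÄRGER über Öl, Straße!", MAP)
```

In this model, "ÄRGER" and "ärger" fold to the same token, which is what makes the search case-insensitive for umlauts. A character that is not listed, such as é, would split a word in two, so extend the table if your data contains other accented letters.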