Sphinx search / thinking_sphinx with German umlauts
I figured out how to get Sphinx / thinking_sphinx to handle searches with German umlauts and to match them case-insensitively.
Compile Sphinx with libstemmer
First, you need to compile Sphinx with libstemmer.
Download Sphinx source code:
$ wget http://sphinxsearch.com/files/sphinx-0.9.9.tar.gz
$ tar xvzf sphinx-0.9.9.tar.gz
$ cd sphinx-0.9.9
Download libstemmer and unpack it inside the Sphinx directory:
$ wget http://snowball.tartarus.org/dist/libstemmer_c.tgz
$ tar xvzf libstemmer_c.tgz   # run this inside sphinx-0.9.9
Then build Sphinx:
$ ./configure --prefix=/usr/local/sphinx --disable-debug --disable-dependency-tracking --with-libstemmer --with-pgsql --with-mysql
$ make
$ sudo make install
Add /usr/local/sphinx/bin to your PATH variable.
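One way to do that, as a minimal sketch assuming a Bourne-style shell such as bash and the /usr/local/sphinx prefix from the configure step above (the SPHINX_BIN variable name is only for illustration):

```shell
# Prepend the Sphinx binaries to PATH for the current shell session.
# Assumes the --prefix=/usr/local/sphinx used in the configure step.
SPHINX_BIN=/usr/local/sphinx/bin
case ":$PATH:" in
  *":$SPHINX_BIN:"*) ;;            # already on PATH, nothing to do
  *) PATH="$SPHINX_BIN:$PATH" ;;   # otherwise put it in front
esac
export PATH
```

To make the change permanent, the same lines can go into your shell startup file (for example ~/.bashrc or ~/.profile, depending on your setup). Afterwards, running "which searchd" should point at the freshly installed binary.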
Set Up charset_table
Then you configure the charset_table in Sphinx. Here is a snippet from my sphinx.yml; thinking_sphinx uses it to generate the Sphinx configuration files.
development:
  port: 9312
  searchd_log_file: "log/searchd.development.log"
  query_log_file: "log/searchd.query.development.log"
  morphology: "libstemmer_de"
  charset_type: 'utf-8'
  charset_table: "0..9, A..Z->a..z, a..z, U+C4->U+E4, U+D6->U+F6, U+DC->U+FC, U+E4, U+F6, U+FC, U+DF"
  ...
Add similar entries for your other Rails environments such as test and production.
The key here is charset_table. It declares which characters can make up a word. My configuration says that, besides digits and ASCII letters, the German umlauts and the eszett (ß, U+DF) count as word characters, and it specifies how uppercase characters are folded to lowercase. For example, U+C4->U+E4 maps Ä to ä, so a search for "ÄPFEL" matches "äpfel".