#85: Colons and special characters in Ultrasphinx queries

When I use Ultrasphinx, a Ruby API to the Sphinx full-text search engine, queries containing certain special characters, such as colons (:), at signs (@), and other reserved characters, throw an Ultrasphinx::UsageError exception in my query interface.

For example, a search for "http://foo.com" reports:

    Ultrasphinx::UsageError: index delta,main: query error: no field 'http' found in schema

Different characters cause different variants of the "query error: no field ... found in schema" problem. Just silencing the error or stripping the colons is unsatisfying. I am interested in what is causing this problem, and in how I can reliably escape these characters.

Two different problems causing the Ultrasphinx UsageError exception

In the Sphinx extended query syntax, there are currently (as of this post) 9 special characters:

    ( ) | - ! @ ~ " &

Each of them can be escaped by prepending the classic escape character, the backslash (\). The PHP API for Sphinx provides a function called EscapeString(), which performs this escaping on query strings for you. I can't find an equivalent in Ruby's Ultrasphinx, so I had to reimplement it. Trivial enough.
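A reimplementation along those lines can be sketched as follows. This is my own port, not code shipped with Ultrasphinx; the module and method names are made up for illustration. It escapes the backslash itself plus the nine special characters above:

```ruby
# A minimal Ruby port of the PHP API's EscapeString().
# SphinxEscape and escape_string are hypothetical names, not part of
# Ultrasphinx or the Sphinx distribution.
module SphinxEscape
  # Matches a backslash or any of the nine Sphinx extended-syntax
  # special characters: ( ) | - ! @ ~ " &
  SPECIAL = /[\\()|\-!@~"&]/

  def self.escape_string(query)
    # The block form of gsub returns its result literally, so characters
    # like & are not misread as backreferences in the replacement.
    query.gsub(SPECIAL) { |char| "\\#{char}" }
  end
end
```

Example: SphinxEscape.escape_string('hello @world (test)') yields 'hello \@world \(test\)'.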

The one special case is the colon (:) character. It is not part of the Sphinx query language, so looking for it there won't help. Ultrasphinx (as of the version current at the time of this post) hard-codes this character as the delimiter of field operators. That is, given "foo:bar", it will always attempt to reparse it as "@foo bar", which in the Sphinx query language means: search for the string "bar" in the indexed database field foo.

In ultrasphinx/lib/ultrasphinx/search/parser.rb:

    def token_stream_to_hash(token_stream)
      token_hash = Hash.new([])
      token_stream.map do |operator, content|
        # Remove some spaces
        content.gsub!(/^"\s+|\s+"$/, '"')
        # Convert fields into sphinx style, reformat the stream object
        if content =~ /(^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]*)+\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
          # XXX hack, its somewhat common to search for URLs. be sure to add
          # " @, /," in the charset_type of the US config to search on all
          # URLs and email addresses, and add:
          #   prefix_fields = url, domain
          # to your US config
          token_hash[nil] += [[operator, content]]
        elsif content =~ /(.*?):(.*)/
          token_hash[$1] += [[operator, $2]]
        else
          token_hash[nil] += [[operator, content]]
        end
      end
      token_hash
    end

The line "elsif content =~ /(.*?):(.*)/" performs this hard-coded split. As you can see, no amount of backslash escaping on your end is going to fix this. Since people usually don't index colons anyway, this rarely matters in practice.
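A quick standalone demonstration (outside Ultrasphinx) of why "http://foo.com" produces the "no field 'http'" error: the URL hack above fails to match that short hostname, so the query falls through to the hard-coded split, which treats everything before the first colon as a field name.

```ruby
# Reproduce the parser's hard-coded field split on a URL query.
query = 'http://foo.com'
field = term = nil
if query =~ /(.*?):(.*)/
  field, term = $1, $2
end
# field is now "http" and term is "//foo.com", so Sphinx is asked to
# search the nonexistent field "http" and raises the schema error.
```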

The naive approach is to remove the elsif clause and require the standard Sphinx syntax ("@foo bar") for advanced field-based ("foo:bar") queries. However, there is no telling what other key functionality relies on this behavior, and you are essentially lobotomizing the alternate advanced query syntax. The more intelligent way is to implement escaping, probably via backslash. I leave that as an exercise for the reader (please post your solution if you have one).
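As a starting point for that exercise, here is one possible sketch of backslash escaping for the colon. This is my own patch idea, not code from Ultrasphinx: the split regex is changed to ignore colons preceded by a backslash, and the escape is stripped before the term is passed on.

```ruby
# Hypothetical replacement for the parser's hard-coded colon split.
# A query writes "http\://foo.com" to mean a literal colon.
def split_field_operator(content)
  # (?<!\\) is a negative lookbehind: match a colon only when it is
  # not immediately preceded by a backslash.
  if content =~ /\A(.*?)(?<!\\):(.*)\z/m
    [$1, $2.gsub('\\:', ':')]   # field query, e.g. "foo:bar"
  else
    [nil, content.gsub('\\:', ':')]  # plain term; unescape any \:
  end
end
```

With this in place, "foo:bar" still splits into the field "foo" and term "bar", while "http\://foo.com" is treated as a single literal term.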


Sphinx - Free open-source SQL full-text search engine

http://www.sphinxsearch.com/docs/current.html#api-reference - found by yliu on February 19, 2010, 10:01 PM UTC: there's a function in the original PHP API to Sphinx called EscapeString()