Solved!
on February 19, 2010, 09:59 PM UTC — from
yliu ( 2,373 views )
When I use UltraSphinx, the Ruby API to the Sphinx full-text search engine, queries with certain special characters such as colons (:), at signs (@), and other reserved characters throw an Ultrasphinx::UsageError exception in my query interface.
For example, a search for "http://foo.com" reports:
Ultrasphinx::UsageError: index delta,main: query error: no field 'http' found in schema
Different characters produce different variants of this "query error: no field ... found in schema" problem. Simply silencing the error or stripping out the colons is unsatisfying. I want to know what is causing this problem, and how I can reliably escape these characters.
- yliu on February 19, 2010, 10:17 PM UTC
In Sphinx extended query syntax, there are currently (as of this post) 9 special characters:
( ) | - ! @ ~ " &
Each of them can be escaped by prepending the classic escape character, the backslash ('\'). The PHP API for Sphinx has an implementation called EscapeString(), which performs this escaping on query strings for you. I can't seem to find an equivalent in Ruby's Ultrasphinx, so I had to reimplement it. Trivial enough.
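A Ruby reimplementation can be sketched along these lines (the function name is my own, and the character set is taken from the list above; check the list against your Sphinx version before using it):

```ruby
# Escape Sphinx extended-query special characters by prefixing each
# with a backslash. A single gsub over a character class handles the
# backslash itself in the same pass, so nothing gets double-escaped.
def escape_sphinx_query(string)
  string.gsub(/[\\()|\-!@~"&]/) { |char| "\\#{char}" }
end

escape_sphinx_query('hello (world) @foo')  # => "hello \\(world\\) \\@foo"
```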
The one special case is the colon (:) character. It's not part of the Sphinx query language, so looking for it there won't help. Ultrasphinx (as of the version current at the time of this post) hard-codes this character as the delimiter for field operators. That is, given "foo:bar", it will always attempt to reparse it into "@foo bar", which in the Sphinx query language means to search for the string "bar" in the indexed database field foo.
In ultrasphinx/lib/ultrasphinx/search/parser.rb:
def token_stream_to_hash(token_stream)
  token_hash = Hash.new([])
  token_stream.map do |operator, content|
    # Remove some spaces
    content.gsub!(/^"\s+|\s+"$/, '"')
    # Convert fields into sphinx style, reformat the stream object
    if content =~ /(^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]*)+\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
      # XXX hack, it's somewhat common to search for URLs. be sure to add
      # " @, /," in the charset_type of the US config to search on all
      # URLs and email addresses, and add:
      #   prefix_fields = url, domain
      # to your US config
      token_hash[nil] += [[operator, content]]
    elsif content =~ /(.*?):(.*)/
      token_hash[$1] += [[operator, $2]]
    else
      token_hash[nil] += [[operator, content]]
    end
  end
  token_hash
end
The line "elsif content =~ /(.*?):(.*)/" performs this hard-coded split. As you can see, no amount of backslash escaping on your end is going to fix this. Since people don't usually index colons anyway, it mostly doesn't matter.
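You can verify the split in irb. The lazy `(.*?)` stops at the first colon, which is exactly where the bogus field name in the error message comes from:

```ruby
content = "http://foo.com"
content =~ /(.*?):(.*)/
$1  # => "http"
$2  # => "//foo.com"
# Ultrasphinx then rewrites the token as "@http //foo.com", and Sphinx
# complains that there is no field 'http' in the schema.
```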
The naive approach is to take out the elsif clause and require the standard Sphinx syntax ("@foo bar") for advanced field-based ("foo:bar") queries. However, there is no telling what other key functionality relies on this behavior, and you're basically lobotomizing the alternate advanced query syntax. The more intelligent fix would be to implement escaping, probably via backslash. I leave that as an exercise for the reader (please post your solution if you have one).
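For what it's worth, one possible shape for that exercise is sketched below. This is my own untested idea, not a patch from Ultrasphinx: treat "\:" as a literal colon, split only on unescaped colons, and strip the escapes from tokens that opted out of field syntax. The helper name split_field_token is hypothetical.

```ruby
# Split a token into [field, content], honoring a "\:" escape for
# literal colons. A sketch of what the elsif branch in
# token_stream_to_hash could call instead of splitting unconditionally.
def split_field_token(content)
  if content =~ /\A((?:[^\\:]|\\.)+?):(.*)\z/m
    # Unescaped colon found: left side is a field name, as before.
    [$1, $2]
  else
    # No field syntax; turn any escaped "\:" back into a plain colon.
    [nil, content.gsub('\\:', ':')]
  end
end

split_field_token('title:ruby')       # => ["title", "ruby"]
split_field_token('http\://foo.com')  # => [nil, "http://foo.com"]
```

The caller would then index token_hash by the returned field (nil for plain content), preserving the existing "foo:bar" behavior while giving users a way to search for literal colons.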