Solution for: #85: Colons and special characters in Ultrasphinx queries
Two different problems causing Ultrasphinx query UsageError exception
- yliu on February 19, 2010, 10:17 PM UTC
In Sphinx extended query syntax, there are currently (as of this post) 9 special characters:
( ) | - ! @ ~ " &
Each of them can be escaped by prepending the classic escape character, backslash ('\'). The PHP API for Sphinx has a function called EscapeString(), which does this escaping for query strings. I couldn't find an equivalent in Ultrasphinx (Ruby), so I had to reimplement it. Trivial enough.
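For reference, a minimal sketch of what such a reimplementation might look like. The helper name escape_sphinx_query is made up for illustration and is not part of Ultrasphinx, and the character set is simply the nine characters listed above (the backslash itself is not escaped here, since it is not in that list).

```ruby
# The nine special characters listed in this post, as a character class.
SPHINX_SPECIAL_CHARS = /[()|\-!@~"&]/

# Hypothetical helper (not an Ultrasphinx method): prefix every special
# character in the query with a backslash.
def escape_sphinx_query(query)
  # The block form avoids the backslash handling that gsub applies to
  # replacement strings.
  query.gsub(SPHINX_SPECIAL_CHARS) { |ch| "\\#{ch}" }
end

puts escape_sphinx_query('rock & roll (live)')
# prints: rock \& roll \(live\)
```

Run this on the user's raw input before handing it to the search, and note that it does nothing about colons, which are a separate problem.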
The one special case is the colon (:) character. It's not in the Sphinx query language, so looking for it there won't help. Ultrasphinx (as of the version current at time of this post) hard-codes this character as a delimiter of field operators. That is, given "foo:bar", it will always attempt to reparse it into "@foo bar", which in Sphinx query language means to search for the string "bar" in the indexed database field foo.
In ultrasphinx/lib/ultrasphinx/search/parser.rb:
    def token_stream_to_hash(token_stream)
      token_hash = Hash.new([])
      token_stream.map do |operator, content|
        # Remove some spaces
        content.gsub!(/^"\s+|\s+"$/, '"')
        # Convert fields into sphinx style, reformat the stream object
        if content =~ /(^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]*)+. [a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
          # XXX hack, its somewhat common to search for URLs. be sure to add
          # " @, /," in the charset_type of the US config to search on all
          # URLs and email addresses, and add:
          # prefix_fields = url, domain
          # to your US config
          token_hash[nil] += [[operator, content]]
        elsif content =~ /(.*?):(.*)/
          token_hash[$1] += [[operator, $2]]
        else
          token_hash[nil] += [[operator, content]]
        end
      end
      token_hash
    end
The line "elsif content =~ /(.*?):(.*)/" performs this hard-coded split. As you can see, no amount of backslash escaping on your end is going to fix this, because the regex splits on the first colon no matter what precedes it. Since people usually don't index colons anyway, it usually doesn't matter.
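To see the split in isolation, here is a small sketch that applies the same pattern outside the parser. The helper split_like_parser is hypothetical and only mirrors what the elsif branch does with $1 and $2.

```ruby
# Same pattern as the elsif branch in parser.rb.
FIELD_SPLIT = /(.*?):(.*)/

# Hypothetical helper for illustration: returns [field, value] as the
# elsif branch would see them, or nil when the content has no colon.
def split_like_parser(content)
  m = FIELD_SPLIT.match(content)
  m && [m[1], m[2]]
end

p split_like_parser('foo:bar')   # ["foo", "bar"]
p split_like_parser('12:30')     # ["12", "30"]
p split_like_parser('foo\:bar')  # ["foo\\", "bar"]
p split_like_parser('no colon')  # nil
```

The third case is the interesting one: the backslash just ends up as part of the field name, so the query is still treated as a field search.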
The naive approach is to take out the elsif clause and use standard Sphinx syntax ("@foo bar") in place of the advanced field-based ("foo:bar") queries. However, there is no telling what other key functionality relies on this behavior, and you would basically be lobotomizing the alternate advanced query syntax. The more intelligent way would be to implement escaping, probably via backslash. I leave that as an exercise to the reader (please post your solution if you have one).
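One possible starting point for the backslash approach, as an untested-against-Ultrasphinx sketch: split only on a colon that is not preceded by a backslash, then turn "\:" back into ":". The names split_field and unescape_colons are made up here, the negative lookbehind needs Ruby 1.9 or later, and I am assuming the backslash survives whatever tokenizing happens before this point in the parser.

```ruby
# Match a field:value pair only when the colon is NOT preceded by a backslash.
UNESCAPED_COLON = /\A(.*?)(?<!\\):(.*)\z/m

# Hypothetical helper: replace each escaped colon with a plain colon.
def unescape_colons(text)
  text.gsub('\:', ':')
end

# Hypothetical replacement for the elsif test: returns [field, value],
# with field nil when the content holds no unescaped colon.
def split_field(content)
  if (m = UNESCAPED_COLON.match(content))
    [m[1], unescape_colons(m[2])]
  else
    [nil, unescape_colons(content)]
  end
end

p split_field('foo:bar')      # ["foo", "bar"]
p split_field('foo\:bar')     # [nil, "foo:bar"]
p split_field('title:a\:b')   # ["title", "a:b"]
```

This does not handle an escaped backslash directly before a colon ("\\:"), and the colon still has to be in your Sphinx charset for the unescaped term to match anything.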
References used:
Sphinx - Free open-source SQL full-text search engine
( http://www.sphinxsearch.com/docs/current.html#api-reference ) - found by yliu on February 19, 2010, 10:01 PM UTC