
Solution for: #85: Colons and special characters in Ultrasphinx queries

Two different problems causing Ultrasphinx query UsageError exception

In Sphinx extended query syntax, there are currently (as of this post) 9 special characters:
( ) | - ! @ ~ " &
Each of them can be escaped by prepending the classic escape character, backslash ('\'). The PHP API for Sphinx has a method called EscapeString(), which does this escaping for query strings for you. I couldn't find an equivalent in Ruby's Ultrasphinx, so I had to reimplement it. Trivial enough.
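A minimal sketch of such a reimplementation, for illustration. The name escape_sphinx_query is hypothetical (it is not part of Ultrasphinx), and it assumes the special characters are exactly the nine listed above:

```ruby
# Hypothetical helper, not part of Ultrasphinx: backslash-escape the nine
# special characters of the Sphinx extended query syntax listed above.
SPHINX_SPECIAL_CHARS = /[()|\-!@~"&]/

def escape_sphinx_query(query)
  # The block form of gsub avoids surprises with backslashes in
  # replacement strings.
  query.gsub(SPHINX_SPECIAL_CHARS) { |char| "\\#{char}" }
end

puts escape_sphinx_query('foo@bar (baz)')  # prints foo\@bar \(baz\)
```

Note that the colon is deliberately not in this list, for the reason explained next.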

The one special case is the colon (:) character. It is not a special character in the Sphinx query language, so looking for it there won't help. Ultrasphinx (as of the version current at the time of this post) hard-codes this character as the delimiter for field operators. That is, given "foo:bar", it will always attempt to reparse it into "@foo bar", which in the Sphinx query language means to search for the string "bar" in the indexed database field foo.

In ultrasphinx/lib/ultrasphinx/search/parser.rb:

    def token_stream_to_hash(token_stream)
      token_hash = Hash.new([])
      token_stream.map do |operator, content|
        # Remove some spaces
        content.gsub!(/^"\s+|\s+"$/, '"')
        # Convert fields into sphinx style, reformat the stream object
        if content =~ /(^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]*)+. [a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
          # XXX hack, its somewhat common to search for URLs.  be sure to add
          # " @, /," in the charset_type of the US config to search on all
          # URLs and email addresses, and add:
          # prefix_fields = url, domain
          # to your US config
          token_hash[nil] += [[operator, content]]
        elsif content =~ /(.*?):(.*)/
          token_hash[$1] += [[operator, $2]]
        else
          token_hash[nil] += [[operator, content]]
        end
      end
      token_hash
    end


The line "elsif content =~ /(.*?):(.*)/" performs this hard-coded split. As you can see, no amount of backslash escaping on your end is going to fix this, because the pattern matches any colon whether or not a backslash precedes it. Since people usually don't index colons anyway, it doesn't really matter.
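To see the split in isolation, here is a small standalone sketch that applies the same pattern outside the parser. The method name split_like_parser is made up for this illustration:

```ruby
# Standalone illustration: apply the parser's field pattern to one token
# and return [field, query], or [nil, content] when there is no colon.
def split_like_parser(content)
  if content =~ /(.*?):(.*)/
    [$1, $2]
  else
    [nil, content]
  end
end

p split_like_parser('title:ruby')   # field "title", query "ruby"
p split_like_parser('title\:ruby')  # the backslash just ends up in the field name
p split_like_parser('10:30')        # "10" is treated as a field name
```

In the second call the backslash does nothing to protect the colon: the token is still split, and the "field" becomes title followed by a stray backslash.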

The naive approach to the problem is to take out the elsif clause and use standard Sphinx syntax ("@foo bar") instead of the field-based ("foo:bar") syntax for advanced queries. However, there is no telling what other key functionality relies on this behavior, and you would basically be lobotomizing the alternate advanced query syntax. The more intelligent way would be to implement escaping, probably via backslash. I leave that as an exercise for the reader (please post your solution if you have one).
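For what it's worth, one possible shape for the backslash escaping is sketched below. This is untested against Ultrasphinx itself: split_field and UNESCAPED_COLON are hypothetical names, and the sketch assumes a Ruby whose regex engine supports lookbehind (1.9 or later). The idea is to split only on the first colon that is not preceded by a backslash, then turn any "\:" in the query part back into a plain colon.

```ruby
# Hypothetical sketch, not Ultrasphinx code: split on the first colon that
# is NOT preceded by a backslash.
UNESCAPED_COLON = /\A(.*?)(?<!\\):(.*)\z/m

def split_field(content)
  if (match = UNESCAPED_COLON.match(content))
    field, query = match[1], match[2]
  else
    field, query = nil, content
  end
  # Unescape "\:" so the query text contains a literal colon again.
  [field, query.gsub('\:', ':')]
end

p split_field('title:ruby')    # field "title", query "ruby"
p split_field('10\:30')        # no field, query "10:30"
p split_field('title:10\:30')  # field "title", query "10:30"
```

The pair it returns could then feed token_hash in place of $1 and $2 in the elsif branch. A known gap: an escaped backslash directly before a colon ("\\:") is still treated as an escaped colon, so a more careful version would need a small tokenizer rather than a single regex.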