SQL database system
Searching multiword text fields

Text fields such as titles, descriptions, and remarks are common in databases. Users generally prefer to search multiword content by entering a simple, intuitive search request, as they might with Google or eBay. For instance, when a user enters child asthma into a search application, they expect a list of results with entries containing both child and asthma at the top. Top priority goes to entries that contain child asthma literally, followed by entries that contain the words child and asthma regardless of position or order. Entries containing only one or the other are shown further down the list, or not at all. When a user enters "blue ribbon beer" they expect to see only entries that contain that phrase exactly.

shsql provides this capability via the CONTAINS comparison operator, along with word indexes or combinedword indexes.

shsql text fields are by default limited to 255 characters in length.
shsql does not support "memo" fields or text fields of unlimited length.
A suggested strategy for accommodating larger blocks of text is to store the text in individual files,
then store the filename in the database. This approach is used by quisp.

CONTAINS

    WHERE authors CONTAINS 'smith'
    WHERE authors CONTAINS 'smith jones'
    WHERE title CONTAINS '"gene expression" tumor'

shsql provides the CONTAINS operator for the WHERE clause, for finding a word (or several words) in a database field that has multiword content (e.g. titles, descriptions, lists). CONTAINS allows the user's search request to be inserted directly into the query WHERE clause without any preprocessing. It also ranks result rows by how well they match the user's request.

Rows that do not match at all are rejected. For rows that match at least minimally, a scoring metric is generated and placed into a field called _matchscore, which ranges from 0 (best) to 19 (worst). The scoring takes into account the number of search words present, as well as word order and position. This field may be used to order the result rows, or may be used further to the right in the WHERE clause.

All matching is case insensitive. Words are delimited by any combination of whitespace or punctuation characters (this applies to both the requested words and the words in the database field).

Words enclosed within double quotes (") are considered a phrase (see the third example above), and are treated as a single search term which must match exactly. Non-quoted words are each an individual search term to which open-ended matching is applied. For example, 'casey at the bat' CONTAINS 'case' would be true (with a slightly worse score), but 'casey at the bat' CONTAINS '"case"' would be false, because of the double quotes. Open-ended matching is not used for very short search terms (1 or 2 characters in length); these must match exactly. A single word may be enclosed in double quotes to force it to match exactly.

If CONTAINS is used more than once in a WHERE clause expression, the _matchscores are summed, up to a maximum of 99.

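The matching rules above can be illustrated with a short Python sketch. This is not shsql's implementation: the function names are hypothetical, "open-ended matching" is assumed here to mean that the search term matches the beginning of a word, and a row is assumed to match when at least one search term is found. The _matchscore calculation is not reproduced, since its details are not given.

```python
import re

def split_words(text):
    # Case-insensitive; any run of whitespace or punctuation is a delimiter.
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]

def parse_request(request):
    # Double-quoted spans become phrases; everything else is loose words.
    phrases = [split_words(p) for p in re.findall(r'"([^"]*)"', request)]
    loose = split_words(re.sub(r'"[^"]*"', " ", request))
    return [p for p in phrases if p], loose

def word_hit(term, words):
    # Terms of 1 or 2 characters must match a whole word exactly.
    if len(term) <= 2:
        return term in words
    # Assumption: open-ended matching is treated as a prefix match.
    return any(w.startswith(term) for w in words)

def phrase_hit(phrase, words):
    # A phrase must appear as consecutive whole words, in order.
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def contains(field_value, request):
    # Assumption: one matching search term is enough for a minimal match.
    words = split_words(field_value)
    phrases, loose = parse_request(request)
    return (any(phrase_hit(p, words) for p in phrases)
            or any(word_hit(t, words) for t in loose))
```

With these definitions, contains('casey at the bat', 'case') is true, while contains('casey at the bat', '"case"') is false, mirroring the example in the text.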
CONTAINS examples:

    select title, _matchscore from journalcits where title contains "retina" order by _matchscore

    select authorlist, _matchscore from journalcits where authorlist contains "smith, jones, greene" and _matchscore < 5

Word indexes

A word index is a type of index where every unique word in a multiword field has its own index entry, for fast word-based searching on fields that contain titles, descriptions, or lists of values. Words are delimited by any combination of whitespace and punctuation characters.

To create a word index, use an SQL command like this:

    create index type=word on auctionitems ( desc )

Very common words (such as "and" and "the" in English) can be omitted for better search efficiency by setting up a "common words" file. Common words should be inserted into a file, one word per line. See the sqlexampledb for an example English common words file. Then, set the dbcommonwordsfile attribute in your project config file.

Note: unless CONTAINS is used, queries that attempt to compare a word-indexed field against a multiword
constant will not be successful.
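
As a rough illustration of the idea behind a word index, the following Python sketch builds an in-memory mapping from each word to the rows that contain it, skipping common words. It is a conceptual model only, not how shsql stores its indexes; the function names, the sample rows, and the COMMON_WORDS set (standing in for a common words file) are all hypothetical, and only whole-word lookups are modeled.

```python
import re
from collections import defaultdict

# Stand-in for the contents of a common words file (hypothetical sample).
COMMON_WORDS = {"and", "the", "of", "a"}

def split_words(text):
    # Case-insensitive; any run of whitespace or punctuation is a delimiter.
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]

def build_word_index(rows, field, common_words=COMMON_WORDS):
    # Every unique word in the field gets an entry listing the rows it occurs in.
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for word in split_words(row[field]):
            if word not in common_words:
                index[word].add(row_id)
    return index

def lookup(index, request, common_words=COMMON_WORDS):
    # Row ids whose field holds at least one requested (non-common) word.
    hits = set()
    for word in split_words(request):
        if word not in common_words:
            hits |= index.get(word, set())
    return hits

auctionitems = [
    {"desc": "Antique oak table and chairs"},
    {"desc": "Oak bookshelf, the tall kind"},
    {"desc": "Blue ribbon beer sign"},
]
desc_index = build_word_index(auctionitems, "desc")
```

In this model, a request made up only of common words finds nothing, because those words never enter the index.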

Combinedword indexes

A combinedword index is similar to a word index, except that words from several fields are combined into one index. For situations where a table contains several multiword fields that will usually be searched simultaneously for the same search words (as is often the case with search engine applications), a combinedword index can be used instead of several word indexes, for better search efficiency.

For example, suppose you have a table holding information on journal articles with fields title, authors, and keywords, and you have a search engine application that allows a user to type in one or several words. All three of the fields need to be searched. A query to search the table would be something like this:

    select * from journalcits
    where title contains searchwords OR authors contains searchwords OR keywords contains searchwords

If each field had its own word index, three index lookups would be required, since there are three OR terms. With a combinedword index (which contains words from all three fields) only one index lookup is needed.

A combinedword index is associated by name with one field, called the primary field. Additional represented fields are called secondary fields. In the above example, the combinedword index's primary field is title, and its secondary fields are authors and keywords.

When issuing a query, the primary fieldname must be specified in the leftmost comparison in the WHERE clause. The indexing mechanism detects that the index is of combinedword type, and cancels index lookups for the other OR terms. For this reason, the following restriction applies: when a combinedword lookup is OR'ed with other terms, the other terms must involve secondary fields. Otherwise, retrievals may be incomplete, since index lookups for rightward OR terms will not be done.

Searches involving only the primary field, or the primary field and a subset of the secondary fields, will use the combined index and will still give correct results. Searches involving only one or more secondary fields will not interact with the combinedword index, but may have their own separate indexes.

When CREATEing a combinedword index, the first fieldname mentioned in the CREATE command is the primary field; the remaining mentioned fields are the secondary fields. Thus, only one combinedword index can be created by a CREATE INDEX command. For example:
Note: unless CONTAINS is used, queries that attempt to compare a combinedword-indexed field against a multiword
constant will not be successful.
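
To show why a combined index needs only one lookup where separate word indexes need three, here is a minimal Python sketch. It is a conceptual model under stated assumptions, not shsql's index format: the function names and the sample journalcits rows are hypothetical, and only whole-word lookups are modeled.

```python
import re
from collections import defaultdict

def split_words(text):
    # Case-insensitive; any run of whitespace or punctuation is a delimiter.
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]

def build_combined_index(rows, fields):
    # Words from all listed fields share a single word -> row ids mapping.
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for field in fields:
            for word in split_words(row[field]):
                index[word].add(row_id)
    return index

def search(index, request):
    # One pass over one index covers every represented field.
    hits = set()
    for word in split_words(request):
        hits |= index.get(word, set())
    # A set holds each row once, even if the word occurs in several fields.
    return sorted(hits)

journalcits = [
    {"title": "Asthma in children", "authors": "Smith, Jones", "keywords": "pediatrics, asthma"},
    {"title": "Gene expression", "authors": "Greene", "keywords": "tumor, asthma"},
    {"title": "Retina imaging", "authors": "Jones", "keywords": "ophthalmology"},
]
# The first field listed plays the role of the primary field.
combined = build_combined_index(journalcits, ["title", "authors", "keywords"])
```

Here search(combined, "asthma") finds rows whose title or keywords mention the word, and each row is reported once even when the word appears in more than one of its fields.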

Notes

If a query is eligible for indexing, SELECT DISTINCT is automatically done whenever OR or CONTAINS is used.

When a word or combinedword index exists for a field, searches that use = or like will work with a single word or NULL, but not with multiple words. Use CONTAINS to find multiple words.

Searches that involve only "common words" will not find anything.

Numeric fields and numeric comparisons cannot be used with word or combinedword indexes.

Copyright Steve Grubb