Basics

The purpose of locus is to select from a database the list of documents you are looking for. The result depends on several inputs.
  1. The words you're looking up. This simple criterion is perfectly adequate as long as relatively few documents satisfy it - that is, if you know some rare words to look up (e.g. the name of the product you want). Such simple queries can be specified as command-line parameters for grazer.

  2. If you don't know any rare words, you must search for a pattern of common ones. That can be as simple as searching for a whole name (a query for "Buffalo Bill" becomes much more specific when you require that both words be present and that the first immediately precede the second) or as complicated as extracting characteristic words and their patterns from known documents of the kind you're interested in and searching for these (truth be told, the performance of these complicated approaches is not what I would like it to be, but I'm working on it :-) ). Conformance to patterns is quantified by a hardcoded set of metrics and customized by assigning relative weights to these metrics (e.g. when searching for "Buffalo Bill", you would assign positive weights to the existence of all words in a document, to their order and to their locality, and zero to everything else). Weights should be numbers from the interval [0, 1] whose sum is 1. A set of weights (with one additional parameter) defines a soft operator, which maps a list of words to a list of documents containing them, sorted by their relevance. The default soft operator is unnamed and defined in locus.opt (if it is not defined there, grazer uses built-in defaults for its attributes); it is the one applied to queries from the command line. You can define additional, named soft operators in the user-defined objects file.

  3. Many documents stored in databases have a general structure - for example, e-mail messages contain the author's name, subject, date etc. If you describe this structure before indexing them, you can restrict your queries to document parts (e.g. search for messages whose subject contains "engineering").

  4. Of course, you may want to search for more than one pattern in one query (for example not only for "Buffalo Bill", but for "William Cody" as well). You can do that by composing soft operators in the query file with relational operators - the standard & and |. Composition of relevance follows the rules of fuzzy arithmetic.
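
To make the weight constraint and the composition of relevance more concrete, here is a minimal Python sketch. It is not grazer's code. The metric names and the function names are hypothetical, and the choice of minimum for & and maximum for | is only one common convention of fuzzy logic, assumed here because the exact rule is not spelled out above.

```python
import math

def check_weights(weights):
    """Verify that each weight lies in [0, 1] and that the weights sum to 1."""
    if any(not 0.0 <= w <= 1.0 for w in weights.values()):
        raise ValueError("each weight must lie in [0, 1]")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")

def fuzzy_and(a, b):
    """a & b: assumed to be the minimum; a document must appear in both results."""
    return {doc: min(a[doc], b[doc]) for doc in a.keys() & b.keys()}

def fuzzy_or(a, b):
    """a | b: assumed to be the maximum; a missing document counts as relevance 0."""
    return {doc: max(a.get(doc, 0.0), b.get(doc, 0.0)) for doc in a.keys() | b.keys()}

# Hypothetical metric names for the "Buffalo Bill" case.
weights = {"existence": 0.5, "order": 0.25, "locality": 0.25}
check_weights(weights)

# Made-up relevance values, as two soft operators might return them.
buffalo_bill = {"doc1": 0.9, "doc2": 0.4}
william_cody = {"doc2": 0.7, "doc3": 0.6}

either = fuzzy_or(buffalo_bill, william_cody)   # doc1: 0.9, doc2: 0.7, doc3: 0.6
both = fuzzy_and(buffalo_bill, william_cody)    # doc2: 0.4
```

Under these assumptions, | never lowers a document's relevance and & never raises it, which matches the intuitive reading of "or" and "and".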

Soft operators

A soft operator is applied to a sorted list of mutually different words (if they are not mutually different, grazer displays a warning and discards the duplicates). Each word is first converted to lowercase (for the default value of early_case_conversion) and stemmed (if stemming is enabled - by default it is not).
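
The word preparation described above can be sketched in Python roughly as follows. This is an illustration, not grazer's implementation: the function name is hypothetical, early_case_conversion is treated as a simple boolean, duplicates are detected after case conversion, and stemming is left out because it is disabled by default.

```python
import warnings

def normalize_words(words, early_case_conversion=True):
    """Lowercase the words, drop duplicates with a warning, return a sorted list."""
    if early_case_conversion:
        words = [w.lower() for w in words]
    unique = set(words)
    if len(unique) != len(words):
        # Mirrors the warning grazer is said to display for duplicate words.
        warnings.warn("duplicate words in query were dropped")
    return sorted(unique)

print(normalize_words(["Buffalo", "Bill"]))   # ['bill', 'buffalo']
```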

grazer retrieves all the documents (in a specified interval) containing at least one word from the query. The documents found are sorted by their relevance, and those with relevance lower than a threshold (the value of the operator's relevance_threshold parameter) are dropped. A document's relevance is computed as follows:
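
Independently of how relevance is scored, the thresholding and ordering step can be sketched in Python like this. The function name and the dict-based input are assumptions made for illustration, as is the descending sort order; only the name relevance_threshold comes from grazer.

```python
def filter_and_rank(relevance, relevance_threshold):
    """Drop documents below the threshold and sort the rest, best match first."""
    kept = [(doc, r) for doc, r in relevance.items() if r >= relevance_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Made-up relevance values for three documents.
hits = filter_and_rank({"a": 0.2, "b": 0.8, "c": 0.5}, relevance_threshold=0.5)
print(hits)   # [('b', 0.8), ('c', 0.5)]
```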

Query file

The query file contains an expression composed of terms joined by the operators '&' and '|' and grouped by parentheses. The terms of this expression are soft operators, either named or unnamed. An unnamed operator is just a list of words - for example
	Buffalo Bill | William Cody
is legal and equivalent to
	(Buffalo Bill) | (William Cody)
Named operators must have parameters in parentheses - for example
	name(Buffalo Bill)
Named operators can be restricted to certain document parts by listing these parts in square brackets after the operator's name - for example
	name[from](Linus)
The word "plain" refers to the default part - for example
	name[title plain](ISO)
Operator parameters containing characters other than letters must be in double quotes - for example
	name("O'Brien")
A quoted operator parameter containing whitespace characters is broken into multiple parameters - for example
	"o la la"
is equivalent to (the duplicate "la" being dropped)
	o la
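
As an illustration of the term syntax above, the following Python sketch takes apart a single named-operator term. It is not grazer's parser: the function name and the regular expression are hypothetical, and quoting is handled with Python's shlex module, whose shell-style rules are only assumed to be close enough to grazer's for these examples.

```python
import re
import shlex

# name, optional [parts], then (parameters)
TERM = re.compile(r"^(\w+)(?:\[([^\]]*)\])?\((.*)\)$")

def parse_named_term(text):
    """Split 'name[parts](params)' into (name, list of parts, list of words)."""
    match = TERM.match(text.strip())
    if match is None:
        raise ValueError(f"not a named operator term: {text!r}")
    name, parts, params = match.groups()
    words = []
    for param in shlex.split(params):   # respects double quotes
        words.extend(param.split())     # a quoted parameter with whitespace yields several words
    return name, (parts or "").split(), words

print(parse_named_term('name[title plain](ISO)'))   # ('name', ['title', 'plain'], ['ISO'])
print(parse_named_term('name("o la la")'))          # ('name', [], ['o', 'la', 'la'])
```

The duplicate "la" in the second result would then be discarded when the soft operator is applied.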

Note that only relational operators can be applied recursively - soft operators can't.

