glimpse supports a large variety of patterns, including simple strings, strings with classes of characters, sets of strings, wild cards, and regular expressions (see LIMITATIONS).
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.
Grep understands two different versions of regular expression syntax: ``basic'' and ``extended.'' In GNU grep, there is no difference in available functionality using either syntax. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards.
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.
A list of characters enclosed by [ and ] matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit. A range of characters may be specified by giving the first and last characters, separated by a hyphen. Finally, certain named classes of characters are predefined. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z], except the latter form depends upon the POSIX locale and the ASCII character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.) Most metacharacters lose their special meaning inside lists. To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.
The period . matches any single character. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum]].
The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word.
A regular expression may be followed by one of several repetition operators:
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
The backreference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.
In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { metacharacter, and some egrep implementations support \{ instead, so portable scripts should avoid { in egrep patterns and should use [{] to match a literal {.
GNU egrep attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification. For example, the shell command egrep '{1' searches for the two-character string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an extension, but portable scripts should avoid it.
The index of glimpse is word based. A pattern that contains more than one word cannot be found in the index. The way glimpse overcomes this weakness is by splitting any multi-word pattern into its set of words and looking for all of them in the index. For example, glimpse 'linear programming' will first consult the index to find all files containing both linear and programming, and then apply agrep to find the combined pattern. This is usually an effective solution, but it can be slow for cases where both words are very common, but their combination is not.
As was mentioned in the section on PATTERNS above, some characters serve as meta characters for glimpse and need to be preceded by '\' to search for them. The most common examples are the characters '.' (which stands for a wild card), and '*' (the Kleene closure). So, "glimpse ab.de" will match abcde, but "glimpse ab\.de" will not, and "glimpse ab*de" will not match ab*de, but "glimpse ab\*de" will. The meta character - is translated automatically to a hypen unless it appears between [] (in which case it denotes a range of characters).
The index of glimpse stores all patterns in lower case. When glimpse searches the index it first converts all patterns to lower case, finds the appropriate files, and then searches the actual files using the original patterns. So, for example, glimpse ABCXYZ will first find all files containing abcxyz in any combination of lower and upper cases, and then searches these files directly, so only the right cases will be found. One problem with this approach is discovering misspellings that are caused by wrong cases. For example, glimpse -B abcXYZ will first search the index for the best match to abcxyz (because the pattern is converted to lower case); it will find that there are matches with no errors, and will go to those files to search them directly, this time with the original upper cases. If the closest match is, say AbcXYZ, glimpse may miss it, because it doesn't expect an error. Another problem is speed. If you search for "ATT", it will look at the index for "att". Unless you use -w to match the whole word, glimpse may have to search all files containing, for example, "Seattle" which has "att" in it.
There is no size limit for simple patterns and simple patterns within Boolean expressions. More complicated patterns, such as regular expressions, are currently limited to approximately 30 characters. Lines are limited to 1024 characters. Records are limited to 48K, and may be truncated if they are larger than that. The limit of record length can be changed by modifying the parameter Max_record in agrep.h.