regopt.c File Reference

The CL_Regex object, and the CL Regular Expression Optimiser.

#include "globals.h"
#include "regopt.h"


Detailed Description

The CL_Regex object, and the CL Regular Expression Optimiser.

This is the CL front-end to POSIX regular expressions, with CL semantics. Most notably, CL regexes always match the entire string and NOT substrings.

Note that the optimiser is handled automatically by the CL_Regex object.

All variables / functions containing "regopt" are internal to this module and are not exported in the CL API.

Optimisation is done by means of "grains". The grain array in a CL_Regex object is a list of short strings. Any string which will match the regex must contain at least one of these. Thus, the grains provide a quick way of filtering out strings that definitely WON'T match, and avoiding a time-wasting call to the POSIX regex matching function.
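
The filtering idea described above can be illustrated with a minimal sketch. It is not the code in regopt.c: the function name sketch_may_match() is hypothetical, and a plain strstr() scan is used here in place of the Boyer-Moore search that the module builds a jump table for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the CL API: returns 1 if `str` contains at
 * least one of the `n_grains` grains, 0 otherwise. A result of 0 means the
 * string cannot match, so the POSIX matcher need not be called at all. */
int sketch_may_match(const char *str, const char *const *grains, size_t n_grains)
{
    size_t i;

    assert(str != NULL && grains != NULL);
    for (i = 0; i < n_grains; i++) {
        if (strstr(str, grains[i]) != NULL)
            return 1;   /* grain found: the full regex match is still needed */
    }
    return 0;           /* no grain present: definitely no match */
}
```

With the grains "walk" and "talk", the string "sidewalks" passes the filter and must still be checked by the regex, whereas "running" is rejected without any regex call.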

While a regex is being optimised, the grains are stored in non-exported global variables in this module. Subsequently they are transferred to members of the CL_Regex object with which they are associated. The use of global variables and a fixed-size buffer for grains is partly due to historical reasons, but it does also serve to reduce memory allocation overhead.


Function Documentation

void cl_delete_regex ( CL_Regex  rx  ) 

Deletes a CL_Regex object.

Parameters:
rx The CL_Regex to delete.

References _CL_Regex::buffer, cl_free, _CL_Regex::grain, _CL_Regex::grains, and _CL_Regex::iso_string.

Referenced by collect_matching_ids(), free_booltree(), and free_environment().

CL_Regex cl_new_regex ( char *  regex,
int  flags,
CorpusCharset  charset 
)

Creates a new CL_Regex object (i.e. a regular expression buffer).

The regular expression is preprocessed according to the flags, and anchored to the start and end of the string. (That is, ^ is added to the start, $ to the end.)

Then the resulting regex is compiled (using POSIX compilation) and optimised. Currently the character set parameter is ignored and assumed to be Latin-1.

Parameters:
regex String containing the regular expression
flags IGNORE_CASE, or IGNORE_DIAC, or both, or 0.
charset The character set of the regex. Currently ignored.
Returns:
The new CL_Regex object, or NULL in case of error.

References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::buffer, CDA_EBADREGEX, CDA_OK, cderrno, _CL_Regex::charset, cl_debug, cl_free, cl_malloc(), cl_regex_error, cl_regopt_analyse(), cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, cl_regopt_jumptable, cl_strdup(), cl_string_canonical(), cl_string_latex2iso(), _CL_Regex::flags, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, IGNORE_CASE, IGNORE_DIAC, _CL_Regex::iso_string, _CL_Regex::jumptable, and MAX_LINE_LENGTH.

Referenced by add_key(), collect_matching_ids(), do_flagged_string(), and do_XMLTag().
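
The anchoring and compilation steps of cl_new_regex() can be sketched with the POSIX regex API alone. This is an illustration under stated assumptions, not the function's actual body: the helper names are hypothetical, the case/diacritic preprocessing is omitted, and wrapping the regex in parentheses (so that a top-level '|' stays inside the anchors) is an assumption that the description above does not confirm.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: wraps `regex` as ^(regex)$ and compiles it with POSIX
 * extended syntax. Returns 0 on success, a regcomp() error code otherwise. */
int sketch_compile_anchored(regex_t *preg, const char *regex)
{
    size_t len;
    char *anchored;
    int rc;

    assert(preg != NULL && regex != NULL);
    len = strlen(regex) + 5;            /* "^(" + regex + ")$" + NUL */
    anchored = malloc(len);
    if (anchored == NULL)
        return REG_ESPACE;
    snprintf(anchored, len, "^(%s)$", regex);
    rc = regcomp(preg, anchored, REG_EXTENDED | REG_NOSUB);
    free(anchored);
    return rc;
}

/* Returns 1 if the whole of `str` matches the compiled regex, 0 otherwise. */
int sketch_match_whole(const regex_t *preg, const char *str)
{
    assert(preg != NULL && str != NULL);
    return regexec(preg, str, 0, NULL, 0) == 0;
}
```

Because of the anchors, a regex such as ab+ compiled this way accepts "abbb" but rejects both "xabbb" and "abbbx". A regex_t that compiled successfully should be released with regfree().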

int cl_regex_match ( CL_Regex  rx,
char *  str 
)

Matches a regular expression against a string.

The regular expression contained in the CL_Regex is compared to the string. No settings or flags are passed to this function; rather, the settings that rx was created with are used.

Parameters:
rx The regular expression to match.
str The string to compare the regex to.
Returns:
Boolean: true if the regex matched, otherwise false.

References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::buffer, cl_string_canonical(), _CL_Regex::flags, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::iso_string, and _CL_Regex::jumptable.

Referenced by eval_bool(), eval_constraint(), main(), and matchfirstpattern().
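
The members referenced above suggest that matching consults the grains and the anchor flags before the compiled buffer is used. The sketch below shows one plausible form of such a pre-check. It is an assumption, not the documented behaviour: struct sketch_grains is a hypothetical stand-in for the grain-related members of a CL_Regex, and the way the anchor flags are interpreted here is inferred from their one-line descriptions only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the grain-related members of a CL_Regex. */
struct sketch_grains {
    const char *const *grain;  /* equal-length grain strings */
    size_t grains;             /* number of grains */
    size_t grain_len;          /* common length of all grains */
    int anchor_start;          /* grains must occur at the start of the string */
    int anchor_end;            /* grains must occur at the end of the string */
};

/* Returns 0 if `str` certainly cannot match, 1 if the regex must be run. */
int sketch_grain_check(const struct sketch_grains *g, const char *str)
{
    size_t len, i;

    assert(g != NULL && str != NULL);
    if (g->grains == 0)
        return 1;                      /* not optimised: always run the regex */
    len = strlen(str);
    if (len < g->grain_len)
        return 0;                      /* too short to contain any grain */
    for (i = 0; i < g->grains; i++) {
        const char *grain = g->grain[i];
        if (g->anchor_start) {
            if (memcmp(str, grain, g->grain_len) == 0)
                return 1;              /* grain found at the very start */
        } else if (g->anchor_end) {
            if (memcmp(str + len - g->grain_len, grain, g->grain_len) == 0)
                return 1;              /* grain found at the very end */
        } else if (strstr(str, grain) != NULL) {
            return 1;                  /* grain found anywhere */
        }
    }
    return 0;
}
```

An anchored check needs only one comparison per grain, which is why recording the anchoring of grains is worthwhile.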

int cl_regex_optimised ( CL_Regex  rx  ) 

Finds the level of optimisation of a CL_Regex.

This function returns the approximate level of optimisation, computed from the ratio of grain length to number of grains (0 = no grains, ergo not optimised at all).

Parameters:
rx The CL_Regex to check.
Returns:
0 if rx is not optimised; otherwise an integer indicating optimisation level.

References _CL_Regex::grain_len, and _CL_Regex::grains.

Referenced by collect_matching_ids().
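
As an illustration of a level derived from the ratio of grain length to number of grains, consider the sketch below. The exact formula used by cl_regex_optimised() is not given in this documentation, so the scaling here (a plain integer ratio with a minimum of 1) is purely an assumption, as is the function name.

```c
#include <assert.h>

/* Hypothetical scoring: 0 when there are no grains, otherwise the integer
 * ratio grain_len / grains, but never less than 1 so that any optimised
 * regex reports a non-zero level. */
int sketch_optimisation_level(int grains, int grain_len)
{
    int level;

    assert(grains >= 0 && grain_len >= 0);
    if (grains == 0)
        return 0;                       /* no grains: not optimised at all */
    level = grain_len / grains;         /* longer and fewer grains score higher */
    return (level > 0) ? level : 1;
}
```

The intuition is that a few long grains filter out far more strings than many short ones.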

int cl_regopt_analyse ( char *  regex  ) 

Analyses a regular expression and tries to find the best set of grains.

Part of the regex optimiser. This function tries to extract a set of grains from the regular expression passed in the regex argument. These grains are then used by the CL regex matcher and cl_regex2id() for faster regular expression search.

If successful, this function returns true and stores the grains in the optimiser's global variables (from which they should be copied to a CL_Regex object's corresponding members).

Usage: optimised = cl_regopt_analyse(regex_string);

This is a non-exported function.

Parameters:
regex String containing the regex to optimise.
Returns:
Boolean: true = ok, false = couldn't optimise regex.

References buf, cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, local_grain_data, make_jump_table(), read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().

Referenced by cl_new_regex().

int is_safe_char ( unsigned char  c  ) 

Is the given character a 'safe' character which will only match itself in a regex?

Parameters:
c The character
Returns:
True for non-special characters; false for special characters.

Referenced by read_grain(), and read_matchall().
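
A predicate of this kind is naturally written as a whitelist, so that any character not known to be harmless is treated as special. The sketch below is hypothetical: the exact set of characters accepted by is_safe_char() is not listed in this documentation, and the set chosen here (ASCII letters, digits and a few punctuation marks) is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical whitelist: returns 1 for characters assumed to match only
 * themselves in a regex, 0 for everything else (including NUL and all
 * regex metacharacters). */
int sketch_is_safe_char(unsigned char c)
{
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
        return 1;
    /* strchr() would also find the terminating NUL, so exclude it first */
    if (c != '\0' && strchr("-_,:;!", c) != NULL)
        return 1;
    return 0;
}
```

Erring on the side of "unsafe" only costs some optimisation, whereas wrongly treating a metacharacter as safe would produce incorrect grains.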

void make_jump_table ( void   ) 

Computes a jump table for Boyer-Moore searches.

Unlike the textbook version, this jump table includes the last character of each grain (in order to avoid running the string comparing loops every time).

A non-exported function.

References cl_debug, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, and cl_regopt_jumptable.

Referenced by cl_regopt_analyse().
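
A jump table of this kind, for several grains of equal length, can be sketched as follows. This is a simplified illustration, not the code of make_jump_table(): the function names and the explicit parameters (the real function takes none and works on the module's global variables) are assumptions. Each entry holds the smallest distance from an occurrence of the character in any grain to the end of that grain, so the last character of a grain gets the value 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_TABLE_SIZE 256

/* Fills jump[] for a set of grains that all have length grain_len.
 * jump[c] == grain_len : c occurs in no grain, skip a whole window
 * jump[c] == 0         : c ends some grain, the window must be compared */
void sketch_make_jump_table(int jump[SKETCH_TABLE_SIZE],
                            const char *const *grains, size_t n_grains,
                            size_t grain_len)
{
    size_t g, k;
    int c;

    assert(jump != NULL && grains != NULL && grain_len > 0);
    for (c = 0; c < SKETCH_TABLE_SIZE; c++)
        jump[c] = (int) grain_len;
    for (g = 0; g < n_grains; g++) {
        for (k = 0; k < grain_len; k++) {
            unsigned char ch = (unsigned char) grains[g][k];
            int dist = (int) (grain_len - 1 - k);
            if (dist < jump[ch])
                jump[ch] = dist;
        }
    }
}

/* Returns 1 if any grain occurs in str, 0 otherwise. */
int sketch_find_grain(const char *str, const int jump[SKETCH_TABLE_SIZE],
                      const char *const *grains, size_t n_grains,
                      size_t grain_len)
{
    size_t len, g;
    size_t end = grain_len;     /* one past the last character of the window */

    assert(str != NULL && jump != NULL && grains != NULL && grain_len > 0);
    len = strlen(str);
    while (end <= len) {
        int j = jump[(unsigned char) str[end - 1]];
        if (j > 0) {
            end += (size_t) j;  /* no grain can end here: jump ahead */
            continue;
        }
        for (g = 0; g < n_grains; g++) {
            if (memcmp(str + end - grain_len, grains[g], grain_len) == 0)
                return 1;
        }
        end++;
    }
    return 0;
}
```

Because entries for grain-final characters are 0, the comparison loop runs only when the window ends in a character that actually ends some grain. The table is indexed with an unsigned char so that characters above 127 do not produce negative indices.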

char* read_disjunction ( char *  mark,
int *  align_start,
int *  align_end 
)

Finds grains in a disjunction group - part of the CL Regex Optimiser.

This function finds grains in a disjunction group within a regular expression; the grains are then stored in the grain_buffer.

The first argument, mark, must point to the '(' at the beginning of the disjunction group.

The booleans align_start and align_end are set to true if the grains from *all* alternatives are anchored at the start or end of the disjunction group, respectively.

This is a non-exported function.

Parameters:
mark Pointer to the disjunction group (see also function description).
align_start See function description.
align_end See function description.
Returns:
A pointer to the first character after the disjunction group iff the parse succeeded; otherwise the original pointer in the mark argument.

References buf, grain_buffer, grain_buffer_grains, local_grain_data, MAX_GRAINS, read_grain(), and read_wildcard().

Referenced by cl_regopt_analyse().
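
The parsing contract (advance past the group on success, return the original pointer on failure) can be shown with a much-simplified sketch. Unlike read_disjunction(), it accepts only alternatives that are pure alphanumeric literals, writes to a caller-supplied array rather than grain_buffer, and does not compute align_start or align_end; in this restricted case every grain spans its whole alternative, so both would be true. All names and limits are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_GRAINS    8
#define SKETCH_MAX_GRAIN_LEN 32

static int sketch_is_literal(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

/* Parses "(lit|lit|...)" starting at mark. On success, stores one grain per
 * alternative, sets *n_grains and returns a pointer to the character after
 * ')'. On any failure, sets *n_grains to 0 and returns mark unchanged. */
const char *sketch_read_disjunction(const char *mark,
                                    char grains[SKETCH_MAX_GRAINS][SKETCH_MAX_GRAIN_LEN],
                                    int *n_grains)
{
    const char *p = mark;
    int n = 0;

    assert(mark != NULL && grains != NULL && n_grains != NULL);
    *n_grains = 0;
    if (*p != '(')
        return mark;
    p++;
    for (;;) {
        size_t len = 0;
        if (n >= SKETCH_MAX_GRAINS)
            return mark;                    /* too many alternatives */
        while (sketch_is_literal(*p)) {
            if (len + 1 >= SKETCH_MAX_GRAIN_LEN)
                return mark;                /* alternative too long */
            grains[n][len++] = *p++;
        }
        if (len == 0)
            return mark;                    /* empty or non-literal alternative */
        grains[n][len] = '\0';
        n++;
        if (*p == ')')
            break;
        if (*p != '|')
            return mark;                    /* something this sketch cannot handle */
        p++;
    }
    *n_grains = n;
    return p + 1;
}
```

Returning the unchanged mark on failure lets a caller try a different reader at the same position without any extra bookkeeping.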

char* read_grain ( char *  mark  ) 

Reads in a grain from a regex - part of the CL Regex Optimiser.

A grain is a string of safe symbols not followed by ?, *, or {..}. This function finds the longest grain it can starting at the point in the regex indicated by mark; backslash-escaped characters are allowed but the backslashes must be stripped by the caller.

Parameters:
mark Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the grain it has read in (or the original "mark" pointer if no grain is found).

References is_safe_char().

Referenced by cl_regopt_analyse(), and read_disjunction().
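
The rule that a grain must not be followed by ?, * or {..} means that the last symbol has to be given up when such a modifier follows it, since the modifier applies to that symbol. A minimal sketch of this logic is shown below. The names are hypothetical, the set of safe characters is reduced to ASCII letters and digits, and treating a backslash followed by any non-alphanumeric character as an escaped literal is an assumption.

```c
#include <assert.h>
#include <stddef.h>

static int sketch_grain_safe(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

/* Returns a pointer to the first character after the longest grain starting
 * at mark, or mark itself if there is no grain. Backslashes are left in
 * place; stripping them is up to the caller. */
const char *sketch_read_grain(const char *mark)
{
    const char *p = mark;
    const char *last = mark;    /* start of the most recently accepted symbol */

    assert(mark != NULL);
    for (;;) {
        if (sketch_grain_safe(*p)) {
            last = p;
            p++;
        } else if (*p == '\\' && p[1] != '\0' && !sketch_grain_safe(p[1])) {
            last = p;           /* escaped punctuation counts as one symbol */
            p += 2;
        } else {
            break;
        }
    }
    if (p > mark && (*p == '?' || *p == '*' || *p == '{'))
        p = last;               /* the modifier makes the last symbol optional or repeated */
    return p;
}
```

For the input abc*d this yields the grain "ab": the c is dropped because the following * allows it to occur zero times.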

char* read_kleene ( char *  mark  ) 

Reads in a repetition marker - part of the CL Regex Optimiser.

This function reads in a Kleene star (asterisk), ?, +, or the general repetition modifier {n,n}; it returns a pointer to the first character after the repetition modifier it has found.

Parameters:
mark Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the star or other modifier it has read in (or the original "mark" pointer if a repetition modifier was not read).

Referenced by cl_regopt_analyse(), and read_wildcard().
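
Recognising a repetition modifier amounts to a few character tests. The sketch below is an illustration with a hypothetical name, not the body of read_kleene(); in particular, the accepted forms of the brace modifier ({n}, {n,} and {n,m}, with at least one leading digit) are an assumption.

```c
#include <assert.h>
#include <stddef.h>

static int sketch_is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns a pointer to the first character after a repetition modifier
 * (?, *, + or a brace expression) at mark, or mark if there is none. */
const char *sketch_read_kleene(const char *mark)
{
    const char *p = mark;

    assert(mark != NULL);
    if (*p == '?' || *p == '*' || *p == '+')
        return p + 1;
    if (*p != '{' || !sketch_is_digit(p[1]))
        return mark;
    p++;
    while (sketch_is_digit(*p))
        p++;
    if (*p == ',') {
        p++;
        while (sketch_is_digit(*p))
            p++;
    }
    return (*p == '}') ? p + 1 : mark;  /* unterminated brace: nothing read */
}
```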

char* read_matchall ( char *  mark  ) 

Reads in a matchall (dot wildcard) or safe character - part of the CL Regex Optimiser.

This function reads in a matchall, any safe character, or a reasonably safe-looking character class.

Parameters:
mark Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the character (class) it has read in (or the original "mark" pointer if something suitable was not read).

References is_safe_char().

Referenced by read_wildcard().

char* read_wildcard ( char *  mark  ) 

Reads in a wildcard - part of the CL Regex Optimiser.

This function reads in a wildcard segment matching an arbitrary substring (but without a '|' symbol); it returns a pointer to the first character after the wildcard segment.

Parameters:
mark Pointer to location in the regex string from which to read.
Returns:
Pointer to the first character after the wildcard segment (or the original "mark" pointer if a wildcard was not read).

References read_kleene(), and read_matchall().

Referenced by cl_regopt_analyse(), and read_disjunction().
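
Since read_wildcard() references only read_matchall() and read_kleene(), one plausible composition is a single matchall element followed by an optional repetition modifier. The sketch below shows that composition with simplified, hypothetical stand-ins for both helpers. The composition itself, the reduced set of safe characters, and the definition of a "safe-looking" character class (letters, digits and '-', optionally negated) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

static int sketch_wc_safe(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

static int sketch_wc_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Stand-in for read_kleene(): skips ?, *, + or a brace expression. */
const char *sketch_wc_kleene(const char *mark)
{
    const char *p = mark;

    if (*p == '?' || *p == '*' || *p == '+')
        return p + 1;
    if (*p != '{' || !sketch_wc_digit(p[1]))
        return mark;
    p++;
    while (sketch_wc_digit(*p))
        p++;
    if (*p == ',') {
        p++;
        while (sketch_wc_digit(*p))
            p++;
    }
    return (*p == '}') ? p + 1 : mark;
}

/* Stand-in for read_matchall(): skips '.', a safe character, or a simple
 * character class such as [a-z] or [^0-9]. */
const char *sketch_wc_matchall(const char *mark)
{
    const char *p = mark;

    if (*p == '.' || sketch_wc_safe(*p))
        return p + 1;
    if (*p != '[')
        return mark;
    p++;
    if (*p == '^')
        p++;
    if (!sketch_wc_safe(*p))
        return mark;                    /* class must start with a safe character */
    while (sketch_wc_safe(*p) || *p == '-')
        p++;
    return (*p == ']') ? p + 1 : mark;
}

/* One matchall element plus an optional repetition modifier. */
const char *sketch_read_wildcard(const char *mark)
{
    const char *p;

    assert(mark != NULL);
    p = sketch_wc_matchall(mark);
    if (p == mark)
        return mark;                    /* nothing suitable at this position */
    return sketch_wc_kleene(p);         /* returns p itself if no modifier follows */
}
```

A segment such as .* or [a-z]+ is skipped as a whole, while a '(' or '|' at mark leaves the pointer unchanged.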

void update_grain_buffer ( int  front_aligned,
int  anchored 
)

Updates the public grain buffer -- part of the CL Regex Optimiser.

This function copies the local grains to the public buffer, if they are better than the set of grains currently there.

A non-exported function.

Parameters:
front_aligned Boolean: if true, grain strings are aligned on the left when they are reduced to equal lengths.
anchored Boolean: if true, the grains are anchored at beginning or end of string, depending on front_aligned.

References buf, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, MAX_LINE_LENGTH, and public_grain_data.

Referenced by cl_regopt_analyse().
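
The reduction of grains to equal lengths mentioned for front_aligned can be sketched as follows. Only this one step is illustrated: the criterion by which update_grain_buffer() decides that a new set of grains is "better" is not stated in this documentation and is left out. The function name and the in-place truncation are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncates every grain in place to the length of the shortest grain and
 * returns that common length. If front_aligned is non-zero the leading
 * characters are kept, otherwise the trailing characters are kept. */
size_t sketch_equalise_grains(char **grains, size_t n_grains, int front_aligned)
{
    size_t i, len, min_len;

    assert(grains != NULL);
    if (n_grains == 0)
        return 0;
    min_len = strlen(grains[0]);
    for (i = 1; i < n_grains; i++) {
        len = strlen(grains[i]);
        if (len < min_len)
            min_len = len;
    }
    for (i = 0; i < n_grains; i++) {
        len = strlen(grains[i]);
        if (!front_aligned)             /* keep the tail: shift it to the front */
            memmove(grains[i], grains[i] + (len - min_len), min_len);
        grains[i][min_len] = '\0';
    }
    return min_len;
}
```

Shortening a grain never invalidates it, because any string that contains "walked" also contains "wal" and "ked". Keeping the prefix preserves anchoring at the start of the string, and keeping the suffix preserves anchoring at the end.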


Variable Documentation

char cl_regex_error[MAX_LINE_LENGTH]

The error message from (POSIX) regex compilation is placed in this buffer if cl_new_regex() fails.

Referenced by cl_new_regex(), and collect_matching_ids().

Boolean: whether grains are anchored at end of string.

Referenced by cl_new_regex(), cl_regopt_analyse(), and update_grain_buffer().

Boolean: whether grains are anchored at beginning of string.

Referenced by cl_new_regex(), cl_regopt_analyse(), and update_grain_buffer().

char* cl_regopt_grain[MAX_GRAINS]

List of 'grains' (any matching string must contain one of these).

Referenced by cl_new_regex(), cl_regopt_analyse(), make_jump_table(), and update_grain_buffer().

The common grain length: all the grains have the same length.

Referenced by cl_new_regex(), cl_regopt_analyse(), make_jump_table(), and update_grain_buffer().

A jump table for the Boyer-Moore search algorithm; use an unsigned char as index.

See also:
make_jump_table

Referenced by cl_new_regex(), and make_jump_table().

char* grain_buffer[MAX_GRAINS]

Intermediate buffer for grains.

When a regex is parsed, grains for each segment are written to this intermediate buffer; if the new set of grains is better than the current one, it is copied to the global variables.

Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().

The number of grains currently in the intermediate buffer.

See also:
grain_buffer

Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().

char local_grain_data[MAX_LINE_LENGTH]

A buffer for grain strings.

See also:
public_grain_data

Referenced by cl_regopt_analyse(), and read_disjunction().

char public_grain_data[MAX_LINE_LENGTH]

A buffer for grain strings.

See also:
local_grain_data

Referenced by update_grain_buffer().


Generated on Sun Feb 28 18:08:04 2010 for CWB by  doxygen 1.6.1