The CL_Regex object, and the CL Regular Expression Optimiser.
#include "globals.h"
#include "regopt.h"
This is the CL front-end to POSIX regular expressions with CL semantics (most notably: CL regexes always match the entire string and NOT substrings.)
Note that the optimiser is handled automatically by the CL_Regex object.
All variables / functions containing "regopt" are internal to this module and are not exported in the CL API.
Optimisation is done by means of "grains". The grain array in a CL_Regex object is a list of short strings. Any string which will match the regex must contain at least one of these. Thus, the grains provide a quick way of filtering out strings that definitely WON'T match, and avoiding a time-wasting call to the POSIX regex matching function.
While a regex is being optimised, the grains are stored in non-exported global variables in this module. Subsequently they are transferred to members of the CL_Regex object with which they are associated. The use of global variables and a fixed-size buffer for grains is partly due to historical reasons, but it also serves to reduce memory allocation overhead.
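The filtering idea described above can be illustrated with a minimal sketch. It is not the module's actual code: the names sketch_has_grain and sketch_match are hypothetical, and a plain strstr() scan stands in for the Boyer-Moore search the optimiser uses. It assumes the regex has already been compiled and anchored with POSIX regcomp().

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: returns 1 if str contains at least one grain. */
static int sketch_has_grain(const char *str, const char *const *grains, int n_grains)
{
    for (int i = 0; i < n_grains; i++)
        if (strstr(str, grains[i]) != NULL)
            return 1;
    return 0;
}

/* Returns 1 on match, 0 otherwise. rx must be a compiled, anchored regex.
 * With n_grains == 0 the regex is unoptimised and is always run. */
int sketch_match(const regex_t *rx, const char *const *grains, int n_grains,
                 const char *str)
{
    if (n_grains > 0 && !sketch_has_grain(str, grains, n_grains))
        return 0;  /* no grain present: the regex cannot match */
    return regexec(rx, str, 0, NULL, 0) == 0;
}
```

The grain check can only reject a string; a string that passes it still goes through the full regex match.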
void cl_delete_regex(CL_Regex rx)
Deletes a CL_Regex object.
rx: The CL_Regex to delete.
References _CL_Regex::buffer, cl_free, _CL_Regex::grain, _CL_Regex::grains, and _CL_Regex::iso_string.
Referenced by collect_matching_ids(), free_booltree(), and free_environment().
CL_Regex cl_new_regex(char *regex, int flags, CorpusCharset charset)
Creates a new CL_Regex object (i.e. a regular expression buffer).
The regular expression is preprocessed according to the flags, and anchored to the start and end of the string. (That is, ^ is added to the start, $ to the end.)
Then the resulting regex is compiled (using POSIX compilation) and optimised. Currently the character set parameter is ignored; Latin-1 is assumed.
regex: String containing the regular expression.
flags: IGNORE_CASE, or IGNORE_DIAC, or both, or 0.
charset: The character set of the regex. Currently ignored.
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::buffer, CDA_EBADREGEX, CDA_OK, cderrno, _CL_Regex::charset, cl_debug, cl_free, cl_malloc(), cl_regex_error, cl_regopt_analyse(), cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, cl_regopt_jumptable, cl_strdup(), cl_string_canonical(), cl_string_latex2iso(), _CL_Regex::flags, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, IGNORE_CASE, IGNORE_DIAC, _CL_Regex::iso_string, _CL_Regex::jumptable, and MAX_LINE_LENGTH.
Referenced by add_key(), collect_matching_ids(), do_flagged_string(), and do_XMLTag().
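The anchor-then-compile step of cl_new_regex() might look roughly like the sketch below. Several things here are assumptions rather than facts from this documentation: the function name sketch_compile_anchored is hypothetical, the wrapping parentheses are added so that a top-level | stays inside the anchors, and REG_EXTENDED | REG_NOSUB is a guess at suitable POSIX flags. Flag preprocessing (IGNORE_CASE, IGNORE_DIAC) and grain analysis are left out.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Compiles regex so that it must match the whole string.
 * Returns 0 on success; on failure returns non-zero and writes a
 * message into errbuf (compare cl_regex_error in this module). */
int sketch_compile_anchored(regex_t *out, const char *regex,
                            char *errbuf, size_t errlen)
{
    size_t len = strlen(regex) + 5;  /* "^(" + regex + ")$" + NUL */
    char *anchored = malloc(len);
    if (anchored == NULL) {
        snprintf(errbuf, errlen, "out of memory");
        return -1;
    }
    snprintf(anchored, len, "^(%s)$", regex);

    int rc = regcomp(out, anchored, REG_EXTENDED | REG_NOSUB);
    if (rc != 0)
        regerror(rc, out, errbuf, errlen);  /* human-readable POSIX error text */
    free(anchored);
    return rc;
}
```

A successfully compiled regex_t should eventually be released with regfree().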
int cl_regex_match(CL_Regex rx, char *str)
Matches a regular expression against a string.
The regular expression contained in the CL_Regex is compared to the string. No settings or flags are passed to this function; rather, the settings that rx was created with are used.
rx: The regular expression to match.
str: The string to compare the regex to.
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::buffer, cl_string_canonical(), _CL_Regex::flags, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::iso_string, and _CL_Regex::jumptable.
Referenced by eval_bool(), eval_constraint(), main(), and matchfirstpattern().
int cl_regex_optimised(CL_Regex rx)
Finds the level of optimisation of a CL_Regex.
This function returns the approximate level of optimisation, computed from the ratio of grain length to number of grains (0 = no grains, ergo not optimised at all).
rx: The CL_Regex to check.
References _CL_Regex::grain_len, and _CL_Regex::grains.
Referenced by collect_matching_ids().
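One way such a level could be derived from the two values is sketched below. The documentation only states that the level depends on the ratio of grain length to number of grains and that 0 means no grains; the scaling factor of 3 and the floor of 1 are illustrative assumptions, and sketch_optimisation_level is a hypothetical name.

```c
/* Longer grains and fewer grains both make the prefilter more selective,
 * so the level grows with grain_len and shrinks with grains. */
int sketch_optimisation_level(int grains, int grain_len)
{
    if (grains <= 0)
        return 0;                          /* no grains: not optimised at all */
    int level = (3 * grain_len) / grains;  /* assumed scaling factor */
    return (level >= 1) ? level + 1 : 1;   /* any grains give at least level 1 */
}
```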
int cl_regopt_analyse(char *regex)
Analyses a regular expression and tries to find the best set of grains.
Part of the regex optimiser. This function tries to extract a set of grains from the regular expression it is given (regex_string in the usage line below). These grains are then used by the CL regex matcher and cl_regex2id() for faster regular expression search.
If successful, this function returns True and stores the grains in the optimiser's global variables (from which they should be copied to a CL_Regex object's corresponding members).
Usage: optimised = cl_regopt_analyse(regex_string);
This is a non-exported function.
regex: String containing the regex to optimise.
References buf, cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, local_grain_data, make_jump_table(), read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().
Referenced by cl_new_regex().
int is_safe_char(unsigned char c)
Is the given character a 'safe' character which will only match itself in a regex?
c: The character to test.
Referenced by read_grain(), and read_matchall().
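A conservative version of such a test is sketched below. The exact set of characters the module treats as safe is not given in this documentation, so the choice here (ASCII letters and digits, plus all bytes from 128 upwards on the assumption of an 8-bit charset such as Latin-1) is an assumption, as is the name sketch_is_safe_char.

```c
/* Returns 1 if c can only match itself in a POSIX regex, else 0.
 * Deliberately conservative: all ASCII punctuation is rejected, even
 * characters that are not regex metacharacters. */
int sketch_is_safe_char(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9'))
        return 1;
    if (c >= 128)
        return 1;  /* assumed: high bytes are never regex metacharacters */
    return 0;
}
```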
void make_jump_table(void)
Computes a jump table for Boyer-Moore searches.
Unlike the textbook version, this jumptable includes the last character of each grain (in order to avoid running the string comparing loops every time).
A non-exported function.
References cl_debug, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, and cl_regopt_jumptable.
Referenced by cl_regopt_analyse().
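The jump table idea can be sketched as follows, assuming all grains share one length (as the grain_len member suggests). The functions take their data as parameters instead of using the module's globals, and the names sketch_make_jump_table and sketch_find_grain are hypothetical. A table entry of 0 means the byte is the last character of at least one grain, so only then are the grains actually compared.

```c
#include <string.h>

/* Fills jump[256] for n grains that all have length len (len >= 1).
 * jump[ch] is the smallest distance of ch from the end of any grain,
 * or len if ch occurs in no grain. */
void sketch_make_jump_table(int jump[256], const char *const *grains,
                            int n, int len)
{
    for (int ch = 0; ch < 256; ch++) {
        int best = len;
        for (int k = 0; k < n; k++)
            for (int j = 0; j < best; j++)  /* j = distance from last byte */
                if ((unsigned char)grains[k][len - 1 - j] == ch) {
                    best = j;
                    break;
                }
        jump[ch] = best;
    }
}

/* Returns 1 if str contains any of the grains, 0 otherwise. */
int sketch_find_grain(const char *str, const int jump[256],
                      const char *const *grains, int n, int len)
{
    if (len <= 0)
        return 0;
    size_t slen = strlen(str);
    size_t pos = 0;  /* start of the current window */
    while (pos + (size_t)len <= slen) {
        int j = jump[(unsigned char)str[pos + len - 1]];
        if (j > 0) {           /* window cannot end a grain here: skip ahead */
            pos += (size_t)j;
            continue;
        }
        for (int k = 0; k < n; k++)
            if (memcmp(str + pos, grains[k], (size_t)len) == 0)
                return 1;
        pos++;
    }
    return 0;
}
```

For the grains "cat" and "dog", the table holds 0 for 't' and 'g', 1 for 'a' and 'o', 2 for 'c' and 'd', and 3 for every other byte.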
char* read_disjunction(char *mark, int *align_start, int *align_end)
Finds grains in a disjunction group - part of the CL Regex Optimiser.
This function finds grains in a disjunction group within a regular expression; the grains are then stored in the grain_buffer.
The first argument, mark, must point to the '(' at the beginning of the disjunction group.
The booleans align_start and align_end are set to true if the grains from *all* alternatives are anchored at the start or end of the disjunction group, respectively.
This is a non-exported function.
mark: Pointer to the disjunction group (see also function description).
align_start: See function description.
align_end: See function description.
References buf, grain_buffer, grain_buffer_grains, local_grain_data, MAX_GRAINS, read_grain(), and read_wildcard().
Referenced by cl_regopt_analyse().
char* read_grain(char *mark)
Reads in a grain from a regex - part of the CL Regex Optimiser.
A grain is a string of safe symbols not followed by ?, *, or {..}. This function finds the longest grain it can starting at the point in the regex indicated by mark; backslash-escaped characters are allowed but the backslashes must be stripped by the caller.
mark: Pointer to location in the regex string from which to read.
References is_safe_char().
Referenced by cl_regopt_analyse(), and read_disjunction().
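The scanning rule for grains can be sketched as below. This is an illustration under assumptions, not the module's code: sketch_read_grain and sketch_safe are hypothetical names, "safe" is narrowed to ASCII letters and digits, only escaped punctuation is accepted after a backslash, and returning mark unchanged when no grain is found is an assumed convention.

```c
#include <ctype.h>

/* Stand-in for is_safe_char(): ASCII letters and digits only (assumption). */
static int sketch_safe(unsigned char c)
{
    return isalnum(c) != 0;
}

/* Returns a pointer just past the longest grain starting at mark,
 * or mark itself if no grain starts there. */
const char *sketch_read_grain(const char *mark)
{
    const char *p = mark;
    const char *last = mark;  /* start of the most recently consumed symbol */

    for (;;) {
        if (sketch_safe((unsigned char)*p)) {
            last = p;
            p += 1;
        } else if (*p == '\\' && ispunct((unsigned char)p[1])) {
            last = p;         /* escaped literal; caller strips the backslash */
            p += 2;
        } else {
            break;
        }
    }
    /* A following ?, * or {..} makes the last symbol optional or repeated,
     * so it cannot belong to the grain: give it back. */
    if (p > mark && (*p == '?' || *p == '*' || *p == '{'))
        p = last;
    return p;
}
```

For the input "abc*d" the sketch stops after "ab", because the c is governed by the star; a following + leaves the last symbol in the grain, since + requires at least one occurrence.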
char* read_kleene(char *mark)
Reads in a repetition marker - part of the CL Regex Optimiser.
This function reads in a Kleene star (asterisk), ?, +, or the general repetition modifier {n,n}; it returns a pointer to the first character after the repetition modifier it has found.
mark: Pointer to location in the regex string from which to read.
Referenced by cl_regopt_analyse(), and read_wildcard().
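A reader for repetition markers along these lines could be sketched as follows. The name sketch_read_kleene is hypothetical, and two behaviours are assumptions: mark is returned unchanged when no modifier is present, and the contents of {..} are only loosely checked (digits and commas, at least one character).

```c
#include <ctype.h>

/* Returns a pointer to the first character after a repetition modifier
 * at mark, or mark itself if there is no well-formed modifier. */
const char *sketch_read_kleene(const char *mark)
{
    const char *p = mark;

    if (*p == '?' || *p == '*' || *p == '+')
        return p + 1;
    if (*p == '{') {
        p++;
        while (isdigit((unsigned char)*p) || *p == ',')
            p++;
        if (*p == '}' && p > mark + 1)  /* reject an empty "{}" */
            return p + 1;
    }
    return mark;
}
```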
char* read_matchall(char *mark)
Reads in a matchall (dot wildcard) or safe character - part of the CL Regex Optimiser.
This function reads in a matchall, any safe character, or a reasonably safe-looking character class.
mark: Pointer to location in the regex string from which to read.
References is_safe_char().
Referenced by read_wildcard().
char* read_wildcard(char *mark)
Reads in a wildcard - part of the CL Regex Optimiser.
This function reads in a wildcard segment matching an arbitrary substring (but without a '|' symbol); it returns a pointer to the first character after the wildcard segment.
mark: Pointer to location in the regex string from which to read.
References read_kleene(), and read_matchall().
Referenced by cl_regopt_analyse(), and read_disjunction().
void update_grain_buffer(int front_aligned, int anchored)
Updates the public grain buffer -- part of the CL Regex Optimiser.
This function copies the local grains to the public buffer, if they are better than the set of grains currently there.
A non-exported function.
front_aligned: Boolean: if true, grain strings are aligned on the left when they are reduced to equal lengths.
anchored: Boolean: if true, the grains are anchored at beginning or end of string, depending on front_aligned.
References buf, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, MAX_LINE_LENGTH, and public_grain_data.
Referenced by cl_regopt_analyse().
char cl_regex_error[MAX_LINE_LENGTH]
The error message from (POSIX) regex compilation is placed in this buffer if cl_new_regex() fails.
Referenced by cl_new_regex(), and collect_matching_ids().
cl_regopt_anchor_end
Boolean: whether grains are anchored at the end of the string.
Referenced by cl_new_regex(), cl_regopt_analyse(), and update_grain_buffer().
cl_regopt_anchor_start
Boolean: whether grains are anchored at the beginning of the string.
Referenced by cl_new_regex(), cl_regopt_analyse(), and update_grain_buffer().
char* cl_regopt_grain[MAX_GRAINS]
List of 'grains' (any matching string must contain one of these).
Referenced by cl_new_regex(), cl_regopt_analyse(), make_jump_table(), and update_grain_buffer().
cl_regopt_grain_len
The length of the grains (all the grains have the same length).
Referenced by cl_new_regex(), cl_regopt_analyse(), make_jump_table(), and update_grain_buffer().
int cl_regopt_grains
Number of grains.
Referenced by cl_new_regex(), cl_regopt_analyse(), make_jump_table(), and update_grain_buffer().
int cl_regopt_jumptable[256]
A jump table for the Boyer-Moore search algorithm; use an _unsigned_ char as index.
Referenced by cl_new_regex(), and make_jump_table().
char* grain_buffer[MAX_GRAINS]
Intermediate buffer for grains.
When a regex is parsed, grains for each segment are written to this intermediate buffer; if the new set of grains is better than the current one, it is copied to the global variables.
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
int grain_buffer_grains = 0
The number of grains currently in the intermediate buffer.
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
char local_grain_data[MAX_LINE_LENGTH]
A buffer for grain strings.
Referenced by cl_regopt_analyse(), and read_disjunction().
char public_grain_data[MAX_LINE_LENGTH]