CWB
#include <ctype.h>
#include <math.h>
#include <stdarg.h>
#include <sys/types.h>
#include <time.h>
#include <dirent.h>
#include <errno.h>
#include <sys/stat.h>
#include "../cl/globals.h"
#include "../cl/macros.h"
#include "../cl/storage.h"
#include "../cl/lexhash.h"
#include "../cl/endian.h"
#include "../cl/attributes.h"
#include <sys/time.h>
#define DEFAULT_INFILE_EXTENSION ".vrt" |
Normal extension for CWB input text files.
(.gz may be added to this if the file is compressed.)
Referenced by encode_scan_directory(), and encode_usage().
#define FIELDSEPS "\t\n" |
Default string containing the characters that can function as field separators.
#define MAX_INPUT_LINE_LENGTH 65536 |
Input buffer size.
If we have XML tags with attributes, input lines can become pretty long (but there's basically just a single buffer)
Referenced by encode_get_input_line(), and main().
#define MAXRANGES 1024 |
Maximum number of s-attributes; also the maximum number of p-attributes (this limit could be lifted by reimplementing the arrays as a linked list).
Referenced by encode_parse_options(), and range_declare().
#define POS_CORPUS "%s" SUBDIR_SEP_STRING "%s.corpus" |
CL naming convention for P-attribute Corpus files.
Referenced by wattr_declare().
#define POS_LEX "%s" SUBDIR_SEP_STRING "%s.lexicon" |
CL naming convention for P-attribute Lexicon files.
Referenced by wattr_declare().
#define POS_LEXIDX "%s" SUBDIR_SEP_STRING "%s.lexicon.idx" |
CL naming convention for P-attribute Lexicon-index files.
Referenced by wattr_declare().
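The naming-convention macros above are printf-style format strings that take a data directory and an attribute name. A minimal sketch of how such a format might be expanded into a path follows; `build_attr_path` is a hypothetical helper, and the value `"/"` for SUBDIR_SEP_STRING is an assumption for POSIX systems (the real definition comes from the CL headers).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: "/" is the separator; the real SUBDIR_SEP_STRING is defined by the CL. */
#define SUBDIR_SEP_STRING "/"
#define POS_CORPUS "%s" SUBDIR_SEP_STRING "%s.corpus"
#define POS_LEX    "%s" SUBDIR_SEP_STRING "%s.lexicon"
#define POS_LEXIDX "%s" SUBDIR_SEP_STRING "%s.lexicon.idx"

/* Hypothetical helper: expands one of the formats above into out.
 * Returns 0 on success, -1 if the path does not fit into the buffer. */
static int build_attr_path(char *out, size_t outsize, const char *fmt,
                           const char *directory, const char *attr_name)
{
    assert(out != NULL && fmt != NULL);
    int n = snprintf(out, outsize, fmt, directory, attr_name);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

Calling `build_attr_path(buf, sizeof buf, POS_LEX, "/data/demo", "word")` would fill `buf` with `/data/demo/word.lexicon` under these assumptions. The STRUC_* formats for S-attributes follow the same two-argument pattern.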
#define REP_CHECK_LEXHASH_SIZE 1000 |
Number of buckets of the lexhashes used for detecting duplicate errors (undeclared element and attribute names in XML tags).
Referenced by main(), and range_declare().
#define STRUC_AVS "%s" SUBDIR_SEP_STRING "%s.avs" |
CL naming convention for S-attribute AVS files.
Referenced by range_declare().
#define STRUC_AVX "%s" SUBDIR_SEP_STRING "%s.avx" |
CL naming convention for S-attribute AVX files.
Referenced by range_declare().
#define STRUC_RNG "%s" SUBDIR_SEP_STRING "%s.rng" |
CL naming convention for S-attribute RNG files.
Referenced by range_declare().
#define UMASK 0644 |
User privileges of new files (octal format)
#define UNDEF_VALUE "__UNDEF__" |
Default string used as the value of P-attributes when a value is missing, i.e. if a tab-delimited field is empty.
Range object: represents an S-attribute being encoded, and holds some information about the currently-being-processed instance of that S-attribute.
TODO should probably be called an SAttr
void encode_add_wattr_line | ( | char * | str | ) |
Processes a token data line.
str | A string containing the line to process. |
References cl_free, cl_lexhash_add(), cl_lexhash_id(), cl_make_set(), CL_MAX_LINE_LENGTH, cl_strdup(), cl_xml_entity_decode(), encode_error(), encode_print_input_lineno(), encode_strtok(), field_separators, _cl_lexhash_entry::id, NwriteInt(), WAttr::position, silent, strip_blanks, undef_value, wattr_ptr, and xml_aware.
Referenced by main().
void encode_error | ( | char * | format, |
... | |||
) |
Prints an error message to STDERR, automatically adding a message on the location of the error in the corpus.
Then exits the program.
format | Format-specifying string of the error message. |
... | Additional arguments, Rprintf-style. |
References current_input_file, encode_print_input_lineno(), and input_line.
Referenced by encode_add_wattr_line(), encode_generate_registry_file(), encode_get_input_line(), encode_parse_options(), encode_scan_directory(), main(), range_close(), range_declare(), and wattr_declare().
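The pattern described for encode_error() (a variadic, format-string message followed by location information) can be sketched as below. This is not the CWB implementation: `format_located_error` is a hypothetical name, the message layout is an assumption, and the sketch writes into a buffer instead of printing to STDERR and exiting, so that it can be tested in isolation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: formats the caller's message, then appends the input
 * location. The " [file, line N]" layout is an assumption, not CWB's output.
 * Returns 0 on success, -1 if the result does not fit into out. */
static int format_located_error(char *out, size_t outsize,
                                const char *file_name, int line_no,
                                const char *format, ...)
{
    assert(out != NULL && outsize > 0 && format != NULL);

    va_list ap;
    va_start(ap, format);
    int n = vsnprintf(out, outsize, format, ap);   /* the user-supplied message */
    va_end(ap);
    if (n < 0 || (size_t)n >= outsize)
        return -1;

    size_t rest = outsize - (size_t)n;
    int m = snprintf(out + n, rest, " [%s, line %d]",
                     file_name ? file_name : "stdin", line_no);
    return (m >= 0 && (size_t)m < rest) ? 0 : -1;
}
```

A caller that wants the behaviour described above would then write the buffer to `stderr` and call `exit()` with a non-zero status.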
void encode_generate_registry_file | ( | char * | registry_file | ) |
Writes a registry file for the corpus that has been encoded.
Part of cwb-encode; not a library function.
registry_file | String containing the path of the file to write. |
References cl_free, cl_id_tolower(), cl_id_toupper(), cl_id_validate(), cl_malloc(), cl_path_registry_quote(), cl_strdup(), corpus_character_set, debug, directory, encode_error(), INFOFILE_DEFAULT_NAME, range_print_registry_line(), range_ptr, SUBDIR_SEPARATOR, and wattr_ptr.
Referenced by main().
int encode_get_input_line | ( | char * | buffer, |
int | bufsize | ||
) |
Reads one input line into the specified buffer (either from stdin, or from one or more input files).
The input files are not passed to the function, but are taken from the program's global variables.
This function returns False when the last input file has been completely read, and automatically closes files.
If the line that is read is not valid according to the character set specified for the corpus, then an error will be printed and the program shut down.
buffer | Where to load the line to. Assumed to be MAX_INPUT_LINE_LENGTH long. |
bufsize | Not currently used, but should be MAX_INPUT_LINE_LENGTH in case of future use! |
References CL_MAX_LINE_LENGTH, cl_strcpy(), cl_string_canonical(), cl_string_list_get(), cl_string_validate_encoding(), cl_string_zap_controls(), clean_strings, corpus_character_set, current_input_file, current_input_file_name, encode_error(), encoding_charset, input_fd, input_file_is_pipe, input_line, MAX_INPUT_LINE_LENGTH, nr_input_files, and utf8.
Referenced by main().
void encode_parse_options | ( | int | argc, |
char ** | argv | ||
) |
Parses program options and sets global variables.
References cl_charset_name_canonical(), cl_delete_string_list(), cl_string_list_append(), cl_string_list_get(), cl_string_list_size(), clean_strings, corpus_character_set, debug, DEFAULT_ATT_NAME, directory, encode_error(), encode_scan_directory(), encode_usage(), MAXRANGES, progname, range_declare(), range_find(), range_ptr, registry_file, silent, skip_empty_lines, strip_blanks, undef_value, verbose, wattr_declare(), wattr_find(), wattr_ptr, and xml_aware.
Referenced by main().
void encode_print_input_lineno | ( | void | ) |
Prints the input line number (and input filename, if applicable) on STDERR, for error messages and warnings.
References current_input_file_name, input_line, and nr_input_files.
Referenced by encode_add_wattr_line(), encode_error(), main(), range_close(), and range_open().
void encode_print_time | ( | FILE * | stream, |
char * | msg | ||
) |
Prints a message plus the current time to the specified file/stream.
stream | Stream to print to. |
msg | Message to incorporate into the string that is printed. |
Referenced by main().
cl_string_list encode_scan_directory | ( | char * | dir | ) |
Get a list of files in a given directory.
This function only lists files with .vrt or .vrt.gz extensions, and only files identified by POSIX stat() as "regular".
(Note that .vrt is dependent on DEFAULT_INFILE_EXTENSION.)
dir | Path of directory to look in. |
References cl_free, cl_malloc(), cl_new_string_list(), cl_string_list_append(), cl_string_list_qsort(), DEFAULT_INFILE_EXTENSION, encode_error(), and SUBDIR_SEPARATOR.
Referenced by encode_parse_options().
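The filtering rule described for encode_scan_directory() (accept only names ending in .vrt or .vrt.gz that stat() reports as regular files) might look roughly like the sketch below. The helper names are hypothetical, the `"/"` separator is an assumption standing in for SUBDIR_SEPARATOR, and the sketch only counts matching files rather than building a sorted cl_string_list.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define DEFAULT_INFILE_EXTENSION ".vrt"

/* Returns 1 if name ends in suffix, else 0. */
static int ends_with(const char *name, const char *suffix)
{
    size_t n = strlen(name), m = strlen(suffix);
    return n >= m && strcmp(name + n - m, suffix) == 0;
}

/* Returns 1 for names ending in ".vrt" or ".vrt.gz", else 0. */
static int has_input_extension(const char *name)
{
    assert(name != NULL);
    return ends_with(name, DEFAULT_INFILE_EXTENSION)
        || ends_with(name, DEFAULT_INFILE_EXTENSION ".gz");
}

/* Hypothetical sketch: counts regular input files directly inside dir.
 * Returns -1 if the directory cannot be opened. */
static int count_input_files(const char *dir)
{
    DIR *dp = opendir(dir);
    if (dp == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dp)) != NULL) {
        char path[4096];
        struct stat st;
        int n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;                               /* skip over-long paths */
        if (has_input_extension(entry->d_name)
            && stat(path, &st) == 0 && S_ISREG(st.st_mode))
            count++;                                /* regular .vrt / .vrt.gz file */
    }
    closedir(dp);
    return count;
}
```

Checking the extension before calling stat() avoids a system call for every unrelated directory entry.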
char* encode_strtok | ( | register char * | s, |
register const char * | delim | ||
) |
A replacement for the strtok() function which doesn't skip empty fields.
s | The string to split. |
delim | Delimiters to use in splitting. |
References last.
Referenced by encode_add_wattr_line().
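To illustrate why a strtok() replacement is needed: standard strtok() treats a run of delimiters as one, so an empty tab-delimited field silently disappears and later fields shift position. The sketch below shows one way to keep empty fields; `field_tok` is a hypothetical name and this is not the CWB source, only a minimal version of the same idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: like strtok(), but every single delimiter ends a field,
 * so "a\t\tb" yields "a", "" and "b". Pass the string on the first call and
 * NULL on subsequent calls. The input string is modified in place. */
static char *field_tok(char *s, const char *delim)
{
    static char *next = NULL;   /* start of the next field; NULL when exhausted */

    assert(delim != NULL);
    if (s != NULL)
        next = s;
    if (next == NULL)
        return NULL;

    char *field = next;
    size_t len = strcspn(field, delim);   /* length up to the first delimiter */
    if (field[len] != '\0') {
        field[len] = '\0';                /* terminate this field */
        next = field + len + 1;           /* resume right after the delimiter */
    } else {
        next = NULL;                      /* this was the last field */
    }
    return field;
}
```

With the input `"dog\t\tNN"` and the delimiters `"\t\n"`, successive calls return `"dog"`, `""` and `"NN"`. A caller can then substitute a placeholder such as UNDEF_VALUE for the empty field.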
void encode_usage | ( | void | ) |
Prints a usage message and exits the program.
References DEFAULT_INFILE_EXTENSION, progname, undef_value, and VERSION.
Referenced by encode_parse_options().
int main | ( | int | argc, |
char ** | argv | ||
) |
Main function for cwb-encode.
As well as the entry point to the program, this contains the main loop for each line of the corpus to be encoded.
The string of each line is sent to one of a number of different functions, depending on what is found in that string!
argc | Number of command-line arguments. |
argv | Command-line arguments. |
References _Range::automatic, _Range::avs, _Range::avx, buf, cl_charset_from_name(), cl_lexhash_add(), cl_lexhash_freq(), cl_new_lexhash(), cl_new_string_list(), cl_set_debug_level(), cl_string_list_get(), cl_string_list_size(), cl_xml_is_name_char, COMMA_SEP_THOUSANDS_CONVSPEC, corpus_character_set, debug, _Range::element_drop_count, encode_add_wattr_line(), encode_error(), encode_generate_registry_file(), encode_get_input_line(), encode_parse_options(), encode_print_input_lineno(), encode_print_time(), encoding_charset, _Range::fd, input_line, _Range::is_open, line, MAX_INPUT_LINE_LENGTH, _Range::max_recursion, _Range::name, nr_input_files, _Range::null_attribute, progname, range_close(), range_find(), range_open(), range_ptr, _Range::recursion_level, registry_file, REP_CHECK_LEXHASH_SIZE, silent, skip_empty_lines, _Range::store_values, strip_blanks, verbose, wattr_ptr, and xml_aware.
void range_close | ( | Range * | rng, |
int | end_pos | ||
) |
Closes a currently open instance of an S-attribute.
rng | Pointer to the S-attribute to close. |
end_pos | The corpus position at which this instance closes. |
References _Range::annot, _Range::avs, _Range::avx, cl_free, cl_lexhash_add(), cl_lexhash_find(), CL_MAX_LINE_LENGTH, cl_strdup(), cl_string_list_get(), cl_string_list_size(), _cl_lexhash_entry::data, _Range::el_attributes, _Range::el_atts_list, encode_error(), encode_print_input_lineno(), _Range::fd, _Range::has_children, _cl_lexhash_entry::_cl_lexhash_entry_data::integer, _Range::is_open, _Range::lh, _Range::max_recursion, _Range::name, _Range::null_attribute, _Range::num, NwriteInt(), _Range::offset, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, _Range::recursion_children, _Range::recursion_level, silent, _Range::start_pos, and _Range::store_values.
Referenced by main(), and range_open().
Range* range_declare | ( | char * | name, |
char * | directory, | ||
int | store_values, | ||
int | null_attribute | ||
) |
Creates a Range object to store a specified s-attribute (and, if appropriate, does the same for children-attributes).
The new Range object is placed in a global variable, but a pointer is also returned. So you can ignore the return value or not, as you prefer.
This is the function where the command-line formalism for defining s-attributes is defined.
name | The string from the user specifying the name of this attribute, recursion and any "attributes" of this XML element - e.g. "text:0+id" |
directory | The directory where the CWB data files will go. |
store_values | boolean: indicates whether this s-attribute was specified with -V (true) or -S (false) when the program was invoked. |
null_attribute | boolean: this is a null attribute, i.e. an XML element to be ignored. |
References _Range::annot, _Range::automatic, _Range::avs, _Range::avx, buf, cl_calloc(), cl_free, cl_lexhash_add(), cl_lexhash_id(), CL_MAX_LINE_LENGTH, cl_new_lexhash(), cl_new_string_list(), cl_strcpy(), cl_strdup(), cl_string_list_append(), _cl_lexhash_entry::data, debug, _Range::dir, _Range::el_attributes, _Range::el_atts_list, _Range::el_undeclared_attributes, _Range::element_drop_count, encode_error(), _Range::fd, _Range::feature_set, _Range::has_children, _Range::in_registry, _Range::is_open, _Range::lh, _Range::max_recursion, MAXRANGES, _Range::name, _Range::null_attribute, _Range::num, _Range::offset, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, range_ptr, _Range::recursion_children, _Range::recursion_level, REP_CHECK_LEXHASH_SIZE, _Range::start_pos, _Range::store_values, STRUC_AVS, STRUC_AVX, and STRUC_RNG.
Referenced by encode_parse_options().
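As an illustration of the declaration string taken by the name parameter, the sketch below parses a specification of the form suggested by the example "text:0+id": an element name, an optional ":N" recursion part, and any number of "+attribute" parts. The exact grammar is an assumption inferred from that single example, and `SAttrSpec`, `MAX_EL_ATTRS` and `parse_sattr_spec` are hypothetical names, not part of CWB.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MAX_EL_ATTRS 8                  /* hypothetical limit for this sketch */

typedef struct {                        /* hypothetical type, not CWB's Range */
    char name[64];
    int  max_recursion;                 /* 0 when no ":N" part is given */
    int  n_attrs;
    char attrs[MAX_EL_ATTRS][64];
} SAttrSpec;

/* Parses "name[:N][+attr]..." into spec.
 * Returns 0 on success, -1 on malformed input. */
static int parse_sattr_spec(const char *decl, SAttrSpec *spec)
{
    assert(decl != NULL && spec != NULL);
    memset(spec, 0, sizeof *spec);      /* also zero-terminates all strings */

    size_t name_len = strcspn(decl, ":+");
    if (name_len == 0 || name_len >= sizeof spec->name)
        return -1;
    memcpy(spec->name, decl, name_len);

    const char *p = decl + name_len;
    if (*p == ':') {                    /* optional recursion part */
        char *end;
        long depth = strtol(p + 1, &end, 10);
        if (end == p + 1 || depth < 0 || depth > INT_MAX)
            return -1;
        spec->max_recursion = (int)depth;
        p = end;
    }
    while (*p == '+') {                 /* zero or more element attributes */
        size_t len = strcspn(p + 1, "+");
        if (len == 0 || len >= sizeof spec->attrs[0]
            || spec->n_attrs == MAX_EL_ATTRS)
            return -1;
        memcpy(spec->attrs[spec->n_attrs++], p + 1, len);
        p += 1 + len;
    }
    return *p == '\0' ? 0 : -1;         /* reject trailing garbage */
}
```

For "text:0+id" this yields the name `text`, a recursion value of 0 and one element attribute, `id`.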
int range_find | ( | char * | name | ) |
Gets the index (in the ranges array) of a specified S-attribute.
name | The S-attribute to search for. |
References range_ptr.
Referenced by encode_parse_options(), and main().
void range_open | ( | Range * | rng, |
int | start_pos, | ||
char * | annot | ||
) |
Opens an instance of the given S-attribute.
If rng has element attribute children, range_open() will modify the annotation string (otherwise it is left untouched).
rng | The S-attribute to open. |
start_pos | The corpus position at which this instance begins. |
annot | The annotation string (the XML element's att-val pairs). |
References _Range::annot, cl_free, cl_lexhash_add(), cl_lexhash_find(), cl_lexhash_freq(), cl_make_set(), cl_strdup(), cl_string_list_get(), cl_string_list_size(), cl_xml_entity_decode(), cl_xml_is_name_char, _cl_lexhash_entry::data, _Range::el_attributes, _Range::el_atts_list, _Range::el_undeclared_attributes, _Range::element_drop_count, encode_print_input_lineno(), _Range::feature_set, _Range::has_children, _cl_lexhash_entry::_cl_lexhash_entry_data::integer, _Range::is_open, line, _Range::max_recursion, _Range::name, _Range::null_attribute, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, range_close(), _Range::recursion_children, _Range::recursion_level, silent, _Range::start_pos, _Range::store_values, and strip_blanks.
Referenced by main().
void range_print_registry_line | ( | Range * | rng, |
FILE * | fd, | ||
int | print_comment | ||
) |
Prints registry lines for a given s-attribute, and its children, if any, to the specified file handle.
rng | The s-attribute in question. |
fd | Stream for the registry file to write the line to. |
print_comment | Boolean: if true, a comment on the original XML tags is printed. |
References cl_lexhash_find(), cl_string_list_get(), cl_string_list_size(), _cl_lexhash_entry::data, _Range::el_attributes, _Range::el_atts_list, _Range::has_children, _Range::in_registry, _Range::max_recursion, _Range::name, _Range::null_attribute, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, _Range::recursion_children, and _Range::store_values.
Referenced by encode_generate_registry_file().
int wattr_declare | ( | char * | name, |
char * | directory, | ||
int | nr_buckets | ||
) |
Sets up a new p-attribute, including opening corpus, lex and index file handles.
Note: corpus_fd is a binary file, lex_fd is a text file(*), and lexidx_fd is a binary file.
(*) But lexicon items are delimited by '\0' not by '\n'.
name | Identifier string of the p-attribute |
directory | Directory in which CWB data files are to be created. |
nr_buckets | Number of buckets in the lexhash of the new p-attribute (value passed to cl_new_lexhash() ) |
References CL_MAX_LINE_LENGTH, cl_new_lexhash(), cl_strdup(), DEFAULT_ATT_NAME, encode_error(), WAttr::feature_set, WAttr::lh, WAttr::name, POS_CORPUS, POS_LEX, POS_LEXIDX, WAttr::position, SUBDIR_SEPARATOR, and wattr_ptr.
Referenced by encode_parse_options().
int wattr_find | ( | char * | name | ) |
Gets the index (in wattrs) of the P-attribute with the given name.
name | The P-attribute to search for. |
References wattr_ptr.
Referenced by encode_parse_options().
int clean_strings = 0 |
clean up input strings by replacing invalid bytes with '?' (except for UTF8 encoding)
Referenced by encode_get_input_line(), and encode_parse_options().
char* corpus_character_set = "latin1" |
character set label that is inserted into the registry file
Referenced by encode_generate_registry_file(), encode_get_input_line(), encode_parse_options(), and main().
int current_input_file = 0 |
index of input file currently being processed
Referenced by encode_error(), and encode_get_input_line().
char* current_input_file_name = NULL |
filename of current input file, for error messages
Referenced by encode_get_input_line(), and encode_print_input_lineno().
int debug = 0 |
debug mode on or off?
char* directory = NULL |
corpus data directory (no longer defaults to current directory)
Referenced by encode_generate_registry_file(), encode_parse_options(), and sencode_parse_options().
a charset object to be generated from corpus_character_set
Referenced by encode_get_input_line(), and main().
char* field_separators = FIELDSEPS |
string containing the characters that can function as field separators
Referenced by encode_add_wattr_line().
FILE* input_fd = NULL |
file handle for current input file (or pipe) (text mode!)
int input_file_is_pipe = 0 |
so we can properly close input_fd using either fclose() or pclose()
Referenced by encode_get_input_line().
cl_string_list input_files = NULL |
list of input files (-f option(s))
int input_line = 0 |
input line number (reset for each new file) for error messages
Referenced by encode_error(), encode_get_input_line(), encode_print_input_lineno(), load_macro_file(), and main().
int line = 0 |
corpus position currently being encoded (i.e. cpos of _next_ token)
Referenced by alignshow_goodbye(), alignshow_print_next_region(), alignshow_skip_next_region(), compose_kwic_line(), do_undump(), evaluate_subset(), evaluate_target(), FreeConcordanceLine(), html_print_output(), latex_print_output(), load_macro_file(), main(), preprocess_input_line(), PrintAttributes(), range_open(), sgml_print_output(), and SortExternally().
int nr_input_files = 0 |
number of input files (length of list after option processing)
Referenced by encode_get_input_line(), encode_print_input_lineno(), and main().
char* progname = NULL |
name of the currently running program
int range_ptr = 0 |
Referenced by encode_generate_registry_file(), encode_parse_options(), main(), range_declare(), and range_find().
char* registry_file = NULL |
if set, auto-generate registry file named {registry_file}, listing declared attributes
Referenced by encode_parse_options(), and main().
int silent = 0 |
hide messages
int skip_empty_lines = 0 |
skip empty lines when encoding?
int strip_blanks = 0 |
strip leading and trailing blanks from input and token annotations
cl_lexhash undeclared_sattrs = NULL |
lookup hash for undeclared s-attributes and s-attributes declared with -S that have annotations (which will be ignored), so warnings are issued only once
char* undef_value = UNDEF_VALUE |
string used as the value of P-attributes when a value is missing, i.e. if a tab-delimited field is empty
Referenced by encode_add_wattr_line(), encode_parse_options(), and encode_usage().
int verbose = 0 |
show progress (this is _not_ the opposite of silent!)
int wattr_ptr = 0 |
Referenced by encode_add_wattr_line(), encode_generate_registry_file(), encode_parse_options(), main(), wattr_declare(), and wattr_find().
int xml_aware = 0 |
substitute XML entities in p-attributes & ignore <? and <! lines