CWB
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
#include "../cl/globals.h"
#include "../cl/endian.h"
#include "../cl/macros.h"
#include "../cl/storage.h"
#include "../cl/lexhash.h"
cwb-s-encode adds an s-attribute to an existing corpus.
Input: a list of regions (on stdin or in the file specified as the first argument to the program), one per line, in the following format:
start TAB end [ TAB annotation ]
start = corpus position of first token in region (integer as text)
end = corpus position of last token in region (integer as text)
annotation = annotation text (only if s-attribute was specified with -V)
Output: file att.rng (plus att.avs, att.avx for -V attributes) where att is the specified attribute name.
#define RNG_AVS "%s" SUBDIR_SEP_STRING "%s.avs"
printf format string for path of attribute values of a given structural attribute
Referenced by sencode_open_files().
#define RNG_AVX "%s" SUBDIR_SEP_STRING "%s.avx"
printf format string for path of attribute value index of a given structural attribute
Referenced by sencode_open_files().
#define RNG_RNG "%s" SUBDIR_SEP_STRING "%s.rng"
printf format string for path of file storing ranges of given structural attribute
Referenced by sencode_open_files().
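The three format strings above each take a data directory and an attribute name. A minimal sketch of how such a path could be built follows; it assumes SUBDIR_SEP_STRING is "/" (the real value comes from the CL headers), and build_path() is a hypothetical helper, not a CWB function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: "/" as separator; the real definition lives in the CL headers. */
#define SUBDIR_SEP_STRING "/"
#define RNG_RNG "%s" SUBDIR_SEP_STRING "%s.rng"
#define RNG_AVS "%s" SUBDIR_SEP_STRING "%s.avs"
#define RNG_AVX "%s" SUBDIR_SEP_STRING "%s.avx"

/* Hypothetical helper: formats one path into buf.
 * Returns 0 on success, -1 if the result did not fit into buf. */
static int build_path(char *buf, size_t size, const char *fmt,
                      const char *dir, const char *name)
{
  int n = snprintf(buf, size, fmt, dir, name);
  return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With these assumptions, build_path(buf, sizeof buf, RNG_RNG, "data", "s") fills buf with "data/s.rng". Checking the snprintf() return value guards against silently truncated paths.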
#define UMASK 0644
The "structure list" data type is used for 'adding' regions (-a).
SL is a really bad name; should be "RegionList".
In this case, all existing regions are read into an ordered, bidirectional list; new regions are inserted into that list (overlaps are automatically resolved in favour of the 'earlier' region; if start point is identical, the longer region is retained). Only once the entire input has been read is the data actually encoded and stored on disk.
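The overlap rule described above can be sketched as a small decision function. Region and overlap_winner() are hypothetical names used only for illustration; the handling of two identical regions (the first argument is kept) is an assumption, since the description does not cover that case.

```c
#include <assert.h>

/* Hypothetical region type: inclusive corpus positions. */
typedef struct {
  int start;
  int end;
} Region;

/* Two inclusive ranges overlap if each one starts before the other ends. */
static int regions_overlap(Region a, Region b)
{
  return a.start <= b.end && b.start <= a.end;
}

/* Returns 0 if a is retained, 1 if b is retained,
 * and -1 if the regions do not overlap (both can be kept). */
static int overlap_winner(Region a, Region b)
{
  if (!regions_overlap(a, b))
    return -1;
  if (a.start != b.start)
    return (a.start < b.start) ? 0 : 1;  /* the earlier region wins */
  return (b.end > a.end) ? 1 : 0;        /* same start: the longer one wins;
                                            exact tie keeps a (assumption) */
}
```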
int main(int argc, char **argv)
Main function for cwb-s-encode.
argc: Number of command-line arguments.
argv: Command-line arguments.
References add_to_existing, _SL::annot, ATT_STRUC, buf, cl_free, CL_MAX_LINE_LENGTH, cl_max_struc(), cl_new_attribute, cl_struc2cpos(), cl_struc2str(), cl_struc_values(), debug, _SL::end, in_memory, input_line, SencodeRange::last_cpos, SencodeRange::name, progname, sencode_check_set(), sencode_close_files(), sencode_parse_line(), sencode_parse_options(), sencode_write_region(), set_att, set_none, silent, SL_insert(), SL_next(), SL_rewind(), _SL::start, SencodeRange::store_values, and text_fd.
char* sencode_check_set(char *annot)
Changes an annotation string to standard set attribute syntax.
On first call, the function checks whether annotations are already given in standard '|'-delimited form; otherwise we assume we are using whitespace to split.
The return string may have been newly allocated (i.e. caller must use & free the returned value).
If there are syntax errors, returns NULL.
annot: The annotation string to check.
References _SL::annot, cl_free, cl_make_set(), set_any, set_att, set_none, set_regular, set_syntax_strict, and set_whitespace.
Referenced by main().
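To illustrate the two input syntaxes, the sketch below tests for the '|'-delimited form and converts a whitespace-separated annotation into it. Both functions are hypothetical stand-ins, not the CL function cl_make_set(); sorting the members and dropping duplicates is an assumption about what the normalised form looks like.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparison for an array of C strings (plain byte order). */
static int cmp_str(const void *a, const void *b)
{
  return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Returns 1 if annot starts and ends with '|', i.e. looks like "|a|b|". */
static int looks_like_set(const char *annot)
{
  size_t len = strlen(annot);
  return len > 0 && annot[0] == '|' && annot[len - 1] == '|';
}

/* Hypothetical helper: turns "b a c" into a newly allocated "|a|b|c|".
 * Returns NULL if memory allocation fails; the caller frees the result. */
static char *whitespace_to_set(const char *annot)
{
  size_t len = strlen(annot), n = 0, i;
  char *copy = malloc(len + 1);
  char **items = malloc((len / 2 + 1) * sizeof *items);
  char *out = malloc(len + 3);   /* enough for '|', all members, NUL */
  char *tok, *p;

  if (copy == NULL || items == NULL || out == NULL) {
    free(copy); free(items); free(out);
    return NULL;
  }
  memcpy(copy, annot, len + 1);
  for (tok = strtok(copy, " \t"); tok != NULL; tok = strtok(NULL, " \t"))
    items[n++] = tok;
  qsort(items, n, sizeof *items, cmp_str);

  p = out;
  *p++ = '|';
  for (i = 0; i < n; i++) {
    size_t l;
    if (i > 0 && strcmp(items[i], items[i - 1]) == 0)
      continue;                  /* skip duplicates (assumption) */
    l = strlen(items[i]);
    memcpy(p, items[i], l);
    p += l;
    *p++ = '|';
  }
  *p = '\0';
  free(copy);
  free(items);
  return out;
}
```

An empty annotation yields "|" in this sketch.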
void sencode_close_files(void)
Close the disk files for the s-attribute being encoded.
References SencodeRange::avs, SencodeRange::avx, SencodeRange::fd, and SencodeRange::ready.
Referenced by main().
void sencode_declare_new_satt(char *name, char *directory, int store_values)
Initialises the "new_satt" variable for the s-attribute to be encoded, and sets name/directory.
References SencodeRange::avs, SencodeRange::avx, cl_strdup(), SencodeRange::dir, SencodeRange::fd, SencodeRange::last_cpos, SencodeRange::name, SencodeRange::num, SencodeRange::offset, SencodeRange::ready, and SencodeRange::store_values.
Referenced by sencode_parse_options().
void sencode_open_files(void)
Open disk files for the s-attribute being encoded (must have been declared first).
References SencodeRange::avs, SencodeRange::avx, buf, CL_MAX_LINE_LENGTH, SencodeRange::dir, SencodeRange::fd, SencodeRange::name, SencodeRange::ready, RNG_AVS, RNG_AVX, RNG_RNG, and SencodeRange::store_values.
Referenced by sencode_write_region().
int sencode_parse_line(char *line, int *start, int *end, char **annot)
Parses one line of input to cwb-s-encode.
Usage:
ok = sencode_parse_line(char *line, int *start, int *end, char **annot);
Expects standard TAB-separated format; first two fields must be numbers, optional third field is returned in annot - if not present, annot is set to NULL.
line: The line to be parsed.
start: Location for the start cpos.
end: Location for the end cpos.
annot: Location for the annotation string.
References cl_free, and cl_strdup().
Referenced by main().
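A minimal sketch of parsing the TAB-separated format with the standard library follows. It is not the actual sencode_parse_line(): parse_field() and parse_region_line() are hypothetical names, and stripping the trailing line break from the annotation is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Parse one decimal integer at s; *endp is set to the first unparsed char.
 * Returns 1 on success, 0 if no number is found or it does not fit an int. */
static int parse_field(const char *s, char **endp, int *value)
{
  long v;
  *endp = (char *)s;
  if (!isdigit((unsigned char)*s) && *s != '-')
    return 0;                       /* reject leading blanks and garbage */
  errno = 0;
  v = strtol(s, endp, 10);
  if (*endp == s || errno == ERANGE || v < INT_MIN || v > INT_MAX)
    return 0;
  *value = (int)v;
  return 1;
}

/* Returns 1 if the line is well-formed, 0 otherwise.
 * *annot is NULL if there is no third field, else a malloc'ed copy. */
static int parse_region_line(const char *line, int *start, int *end,
                             char **annot)
{
  char *p = NULL;
  size_t len;

  *annot = NULL;
  if (!parse_field(line, &p, start) || *p != '\t')
    return 0;
  if (!parse_field(p + 1, &p, end))
    return 0;
  if (*p == '\t') {                 /* optional annotation field */
    p++;
    len = strcspn(p, "\r\n");       /* drop the line break (assumption) */
    *annot = malloc(len + 1);
    if (*annot == NULL)
      return 0;
    memcpy(*annot, p, len);
    (*annot)[len] = '\0';
    return 1;
  }
  return (*p == '\0' || *p == '\n' || *p == '\r');
}
```

As with the documented function, the caller owns the annotation string and must free it.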
void sencode_parse_options(int argc, char **argv)
Parse options and set global variables.
References add_to_existing, cl_new_corpus(), debug, directory, in_memory, SencodeRange::name, registry, sencode_declare_new_satt(), sencode_usage(), set_any, set_att, set_syntax_strict, silent, strip_blanks_in_values, and text_fd.
Referenced by main().
void sencode_usage(void)
Prints the usage message and exits.
References progname, and VERSION.
Referenced by sencode_parse_options().
void sencode_write_region(int start, int end, char *annot)
Write data about a region to disk files (as defined in global variable new_satt).
References SencodeRange::avs, SencodeRange::avx, cl_lexhash_add(), cl_lexhash_find(), cl_new_lexhash(), _cl_lexhash_entry::data, _SL::end, SencodeRange::fd, _cl_lexhash_entry::id, _cl_lexhash_entry::_cl_lexhash_entry_data::integer, SencodeRange::last_cpos, SencodeRange::num, NwriteInt(), SencodeRange::offset, SencodeRange::ready, sencode_open_files(), and SencodeRange::store_values.
Referenced by main().
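The reference to NwriteInt() suggests that integers are written in network (big-endian) byte order. The sketch below shows one way to append a start/end pair in that order. It assumes 32-bit values and a simple record of two consecutive integers; put_be32() and write_region() are hypothetical names, and annotation handling (.avs/.avx) is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Store a 32-bit value most significant byte first (network byte order). */
static void put_be32(unsigned char out[4], int value)
{
  unsigned long v = (unsigned long)value & 0xFFFFFFFFUL;
  out[0] = (unsigned char)((v >> 24) & 0xFF);
  out[1] = (unsigned char)((v >> 16) & 0xFF);
  out[2] = (unsigned char)((v >> 8) & 0xFF);
  out[3] = (unsigned char)(v & 0xFF);
}

/* Hypothetical stand-in for appending one region to an open range file.
 * Returns 1 if all 8 bytes were written, 0 otherwise. */
static int write_region(FILE *fd, int start, int end)
{
  unsigned char rec[8];
  put_be32(rec, start);
  put_be32(rec + 4, end);
  return fwrite(rec, 1, sizeof rec, fd) == sizeof rec;
}
```

Writing the bytes explicitly keeps the file layout independent of the byte order of the machine doing the encoding.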
void SL_delete(SL item)
Deletes a region from the list; updates SL_Point if it happened to point at item.
References _SL::annot, cl_free, _SL::next, and _SL::prev.
Referenced by SL_insert().
void SL_insert(int start, int end, char *annot)
Inserts an item into the global structure list.
It adds a new region to the list: its start point, its end point, its annotation.
Combines SL_seek(), SL_insert_after_point() and ambiguity resolution.
References _SL::end, _SL::next, SL_delete(), SL_insert_after_point(), SL_seek(), and _SL::start.
Referenced by main().
SL SL_insert_after_point(int start, int end, char *annot)
Inserts region [start, end, annot] after SL_Point; performs no overlap or position checking.
References _SL::annot, cl_malloc(), cl_strdup(), _SL::end, _SL::next, _SL::prev, SL_Point, _SL::start, and StructureList.
Referenced by SL_insert().
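Insertion after a given point in a bidirectional list can be sketched as follows. Node, insert_after() and free_list() are hypothetical names; unlike the documented function, the sketch takes the list head and the insertion point as explicit arguments instead of using the globals StructureList and SL_Point. A NULL point means "insert at the start of the list".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
  int start, end;
  char *annot;                 /* private copy of the annotation, or NULL */
  struct Node *prev, *next;
} Node;

/* Portable replacement for strdup(); returns NULL if malloc() fails. */
static char *copy_string(const char *s)
{
  size_t n = strlen(s) + 1;
  char *c = malloc(n);
  if (c != NULL)
    memcpy(c, s, n);
  return c;
}

/* Inserts a new node after point (or at the head if point is NULL).
 * No overlap or ordering checks. Returns the new node, or NULL on failure. */
static Node *insert_after(Node **head, Node *point,
                          int start, int end, const char *annot)
{
  Node *item = malloc(sizeof *item);
  if (item == NULL)
    return NULL;
  item->start = start;
  item->end = end;
  item->annot = (annot != NULL) ? copy_string(annot) : NULL;
  item->prev = point;
  item->next = (point != NULL) ? point->next : *head;
  if (item->next != NULL)
    item->next->prev = item;
  if (point != NULL)
    point->next = item;
  else
    *head = item;
  return item;
}

/* Releases every node and its annotation. */
static void free_list(Node *head)
{
  while (head != NULL) {
    Node *next = head->next;
    free(head->annot);
    free(head);
    head = next;
  }
}
```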
SL SL_next(void)
void SL_rewind(void)
Rewind the index-pointer to the start of the global structure list.
References StructureList.
Referenced by main().
SL SL_seek(int cpos)
Find region containing (or preceding) cpos; NULL = start of list; sets SL_Point to returned value.
References _SL::end, _SL::next, _SL::prev, SL_Point, _SL::start, and StructureList.
Referenced by SL_insert().
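A linear search that starts from a remembered position might look like the sketch below. Node and seek_region() are hypothetical names; the sketch assumes the list is sorted by start position and returns the last region starting at or before cpos, which is the region containing cpos if there is one and the closest preceding region otherwise.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
  int start, end;
  struct Node *prev, *next;
} Node;

/* head:  first node of the list (may be NULL for an empty list)
 * point: node to start searching from, or NULL to start at head
 * Returns NULL if cpos lies before every region in the list. */
static Node *seek_region(Node *head, Node *point, int cpos)
{
  Node *p = (point != NULL) ? point : head;

  /* walk backwards while the current region starts after cpos */
  while (p != NULL && p->start > cpos)
    p = p->prev;
  if (p == NULL)
    return NULL;
  /* walk forwards while the next region still starts at or before cpos */
  while (p->next != NULL && p->next->start <= cpos)
    p = p->next;
  return p;
}
```

Starting from the previous result keeps the search short when the input is mostly in corpus order.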
int add_to_existing = 0
add to existing attribute: implies in_memory; existing regions are automatically inserted at startup
Referenced by main(), and sencode_parse_options().
corpus we're working on; at the moment, this is only required for add_to_existing
int debug = 0
debug mode on/off
int in_memory = 0
create list of regions in memory (allowing non-linear input), then write to disk
Referenced by main(), and sencode_parse_options().
cl_lexhash LH = NULL
Lexhash used when writing regions, to avoid multiple copies of annotations (-m mode)
Global (and only) instance of the cwb-s-encode SencodeRange object.
Contains information on the new s-attribute being coded.
char* progname = NULL
enum { ... } set_att
Type of feature-set attribute.
Initial value: not a feature set. Changes to set_any once we know we are dealing with a feature set. Changes to set_regular or set_whitespace once we know which format of f.s. it is.
Referenced by main(), sencode_check_set(), and sencode_parse_options().
int set_syntax_strict = 0
check that set attributes are always given in the same syntax
Referenced by sencode_check_set(), and sencode_parse_options().
int silent = 0
avoid messages in -M / -a modes
pointer into global list; NULL = start of list; linear search starts from SL_Point
Referenced by SL_insert_after_point(), SL_next(), and SL_seek().
int strip_blanks_in_values = 0
Referenced by sencode_parse_options().
SL StructureList = NULL
(single) global list
Referenced by SL_insert_after_point(), SL_rewind(), and SL_seek().
FILE* text_fd = NULL
stream handle for file to read from.
Referenced by main(), and sencode_parse_options().