cwb-scan-corpus.c File Reference

#include "../cl/globals.h"
#include "../cl/cl.h"



Define Documentation

#define DEFAULT_BUCKETS   1000000

use 1 million buckets by default

Referenced by main().

#define MAX_N   32

maximum value of N (makes life a little easier)

Referenced by main().


Typedef Documentation

typedef struct _hash_entry * HashEntry

Structure representing hash entries.

See also:
Hash

Function Documentation

void add_key ( char *  key  )

int find_prime ( int  n  )

Finds a prime number.

Returns the smallest prime >= n.

Parameters:
n Lower bound for the generated prime.
Returns:
The smallest prime number greater than or equal to n, or 0 if no such prime fits within the range of a signed int.

References is_prime().
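
The search described above can be sketched as follows. This is a minimal illustration, not the CWB implementation: the names sketch_is_prime() and sketch_find_prime() are hypothetical, and trial division is an assumed method, since this page does not say how is_prime() tests its argument.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical sketch: primality by trial division with odd divisors.
 * The loop condition d <= n / d avoids overflowing d * d. */
int sketch_is_prime(int n)
{
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (int d = 3; d <= n / d; d += 2) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}

/* Hypothetical sketch: smallest prime >= n, or 0 if the search
 * would run past INT_MAX without finding one. */
int sketch_find_prime(int n)
{
    if (n < 2)
        n = 2;
    for (;;) {
        if (sketch_is_prime(n))
            return n;
        if (n == INT_MAX)
            return 0;
        n++;
    }
}

/* Usage: a prime argument is returned unchanged, a composite one
 * is rounded up to the next prime. */
void sketch_find_prime_demo(void)
{
    assert(sketch_find_prime(13) == 13);
    assert(sketch_find_prime(10) == 11);
}
```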

int get_next_range ( int *  start,
int *  end 
)

Reads the next range of corpus positions.

The ranges of corpus positions are taken either from global settings (-s, -e) or from a specified file (-R).

Parameters:
start Where to put the start of the next range.
end Where to put the end of the next range.
Returns:
FALSE after last range, TRUE otherwise

References global_end, global_start, and ranges_fh.

Referenced by main().
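
One way to picture the two sources of ranges is the sketch below. It is an illustration under stated assumptions, not the CWB code: the sketch_ variables stand in for global_start, global_end and ranges_fh, the ranges file is assumed to hold two whitespace-separated integers per range, and the "already delivered" flag is a hypothetical device for ending the iteration when no file is used.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the globals documented on this page. */
int sketch_global_start = 0;
int sketch_global_end = -1;
FILE *sketch_ranges_fh = NULL;

/* Hypothetical flag: set once the single global range was handed out. */
int sketch_global_range_done = 0;

/* Returns 1 and fills *start / *end while ranges remain, 0 afterwards. */
int sketch_get_next_range(int *start, int *end)
{
    assert(start != NULL && end != NULL);

    if (sketch_ranges_fh != NULL) {
        /* Assumed file format: "<start> <end>" pairs of corpus positions. */
        return fscanf(sketch_ranges_fh, "%d %d", start, end) == 2;
    }

    /* No ranges file: deliver the global range exactly once. */
    if (sketch_global_range_done)
        return 0;
    *start = sketch_global_start;
    *end = sketch_global_end;
    sketch_global_range_done = 1;
    return 1;
}
```

A caller would typically loop with `while (sketch_get_next_range(&start, &end)) { ... }` and scan the corpus positions from start to end on each pass.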

void hash_add ( int *  tuple,
int  f 
)

Inserts an N-tuple into the global hash.

If the N-tuple is already in the hash, its count is incremented by f and no new entry is inserted.

Parameters:
tuple The tuple to add (array of ints).
f The frequency of the tuple.

References cl_malloc(), _hash_entry::freq, Hash, hash_find(), _Hash::K, _hash_entry::next, _Hash::table, and _hash_entry::tuple.

Referenced by main().

HashEntry hash_find ( int *  tuple,
int *  R_index 
)

Finds an N-tuple in the global hash.

Parameters:
tuple The tuple to search for.
R_index The index of the bucket containing the located HashEntry.
Returns:
The located HashEntry.

References _Hash::buckets, Hash, hash_index(), _Hash::K, _hash_entry::next, _Hash::table, _hash_entry::tuple, and tuples_eq().

Referenced by hash_add().
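
The interplay of hash_add(), hash_find(), hash_index() and tuples_eq() can be sketched as a chained hash table. Everything below is an assumption made for illustration: the SketchHash and SketchEntry layouts, the multiply-by-33 index function, and the convention that a failed lookup returns NULL are not taken from the CWB source, which this page does not reproduce.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry: one node of a bucket's linked list. */
typedef struct sketch_entry {
    struct sketch_entry *next;
    int freq;
    int tuple[];                 /* N ints, allocated with the entry */
} SketchEntry;

/* Hypothetical table: N is the tuple size, buckets the table length. */
typedef struct {
    int N;
    unsigned int buckets;
    SketchEntry **table;
} SketchHash;

/* Assumed index function: simple polynomial hash over the ints. */
unsigned int sketch_hash_index(int N, const int *tuple)
{
    unsigned int h = 0;
    for (int i = 0; i < N; i++)
        h = h * 33u + (unsigned int) tuple[i];
    return h;
}

/* True if both tuples hold the same N ints. */
int sketch_tuples_eq(int N, const int *t1, const int *t2)
{
    return memcmp(t1, t2, (size_t) N * sizeof(int)) == 0;
}

/* Walks one bucket; reports the bucket index through r_index if given.
 * Returning NULL for a missing tuple is an assumption of this sketch. */
SketchEntry *sketch_hash_find(SketchHash *h, const int *tuple,
                              unsigned int *r_index)
{
    unsigned int idx = sketch_hash_index(h->N, tuple) % h->buckets;
    if (r_index != NULL)
        *r_index = idx;
    for (SketchEntry *e = h->table[idx]; e != NULL; e = e->next) {
        if (sketch_tuples_eq(h->N, e->tuple, tuple))
            return e;
    }
    return NULL;
}

/* Adds f to an existing entry, or prepends a new entry to its bucket. */
void sketch_hash_add(SketchHash *h, const int *tuple, int f)
{
    unsigned int idx;
    SketchEntry *e = sketch_hash_find(h, tuple, &idx);
    if (e != NULL) {
        e->freq += f;
        return;
    }
    e = malloc(sizeof *e + (size_t) h->N * sizeof(int));
    assert(e != NULL);
    memcpy(e->tuple, tuple, (size_t) h->N * sizeof(int));
    e->freq = f;
    e->next = h->table[idx];
    h->table[idx] = e;
}
```

Having the lookup report the bucket index lets the insertion path reuse it instead of hashing the tuple a second time, which is one plausible reason for the R_index parameter of hash_find().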

unsigned int hash_index ( int  N,
int *  tuple 
)

Computes a hash index for an N-tuple of ints.

Parameters:
N Size of the tuple.
tuple The tuple itself: an array of ints.
Returns:
The hash index calculated.

Referenced by hash_find().

int is_letter ( unsigned char  c  ) 

Checks whether a character is a letter.

Parameters:
c The character to check.
Returns:
Boolean.

Referenced by is_regular().

int is_prime ( int  n  ) 

Checks whether a number is prime.

Returns True iff n is a prime.

Parameters:
n number to check
Returns:
Boolean

int is_regular ( char *  s  )

Checks whether a token is regular.

A token is "regular" if it contains only letters, digits and dashes (with no dash at the end).

"Regularity" is used as a filter on the corpus iff the -C option is specified.

Character encoding: Latin-1.

Parameters:
s String containing the token to check.
Returns:
True if the token is regular, otherwise false.

References is_letter().

Referenced by add_key(), and main().
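
The regularity rule can be sketched as below. The function names are hypothetical and the code is not the CWB implementation. Two points are assumptions of this sketch: the letter test treats the Latin-1 code points 0xC0 to 0xFF as letters except the multiplication sign (0xD7) and the division sign (0xF7), and the empty string is treated as not regular, a case this page does not address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Latin-1 letter test: ASCII letters plus the accented
 * range 0xC0..0xFF, excluding the two arithmetic signs in that range. */
int sketch_is_letter(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

/* Hypothetical regularity test: only letters, digits and dashes,
 * and the last character must not be a dash. */
int sketch_is_regular(const char *s)
{
    size_t len = 0;

    assert(s != NULL);
    for (; s[len] != '\0'; len++) {
        unsigned char c = (unsigned char) s[len];
        int is_digit = (c >= '0' && c <= '9');
        if (!sketch_is_letter(c) && !is_digit && c != '-')
            return 0;
    }
    /* Assumption: an empty token is rejected. */
    return len > 0 && s[len - 1] != '-';
}
```

The cast to unsigned char matters on platforms where plain char is signed, since Latin-1 bytes above 0x7F would otherwise compare as negative values.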

int main ( int  argc,
char *  argv[] 
)

int parse_options ( int  argc,
char *  argv[] 
)

Parses the command-line options of the program.

Parameters:
argc argc from main()
argv argv from main()
Returns:
The value of global optind after the function has run.

References _Hash::buckets, check_words, frequency_att, frequency_threshold, global_end, global_start, Hash, output_file, quiet, ranges_file, reg_dir, and usage().

int tuples_eq ( int  N,
int *  t1,
int *  t2 
)

Compares two N-tuples for equality.

Parameters:
N Size of the tuple.
t1 First tuple (array of ints of size N).
t2 Second tuple (array of ints of size N).
Returns:
Boolean: true if all ints are identical, otherwise false.

Referenced by hash_find().

void usage ( void   ) 

Prints a usage message and exits the program.

References progname.


Variable Documentation

corpus we're working on

Referenced by regex2dfa(), and WriteStates().

int check_words = 0

if set, accept only 'regular' words in frequency counts

Referenced by add_key(), main(), and parse_options().

char* corpname = NULL

corpus name (command-line)

Referenced by add_key(), and main().

char* frequency_att = NULL

p-attribute with frequency entries for corpus rows (when abusing corpus as frequency database)

Referenced by main(), and parse_options().

frequency threshold for result table (-f option)

Referenced by main(), and parse_options().

int global_end = -1

end scanning at this cpos (set up in main() unless changed with the -e switch)

See also:
global_start

Referenced by get_next_range(), main(), and parse_options().

int global_start = 0

start scanning at this cpos (defaults to start of corpus)

Referenced by get_next_range(), main(), and parse_options().

struct _Hash Hash

A specialised hash for computing frequency distributions over tuples of lexicon IDs.

Referenced by add_key(), hash_add(), hash_find(), LookUp(), main(), MakeExp(), and parse_options().

char* output_file = NULL

output file name (-o option)

char* progname = NULL

name of this program (from shell command)

int quiet = 0

if set, don't show progress information on stderr

Referenced by cqp_parse_file(), main(), and parse_options().

FILE* ranges_fh = NULL

file handle corresponding to ranges_file

Referenced by get_next_range(), and main().

char* ranges_file = NULL

file with ranges to scan (pairs of corpus positions)

Referenced by main(), and parse_options().

char* reg_dir = NULL

registry directory (NULL -> use default)

Referenced by main(), and parse_options().


Generated on Sun Feb 28 18:08:04 2010 for CWB by  doxygen 1.6.1