CWB
ngram-hash.c File Reference
#include "globals.h"
#include "macros.h"
#include "lexhash.h"
#include "ngram-hash.h"
#include <math.h>

Data Structures

struct  _cl_ngram_hash
 TODO: consider alternative hash functions (see cl/lexhash.h)
 

Macros

#define DEFAULT_NR_OF_BUCKETS   250000
 Defines the default number of buckets in an n-gram hash.

#define DEFAULT_FILLRATE_LIMIT(n)   5.0
 Default parameters for auto-growing the table of buckets (see cl_ngram_hash_auto_grow_fillrate for details).

#define DEFAULT_FILLRATE_TARGET(n)   1.0
 keep memory overhead for bucket table below 50%

#define MAX_BUCKETS   1000000007
 Maximum number of buckets n-gram hash will try to allocate when auto-growing.

#define MAX_ENTRIES   2147483647
 Maximum number of entries that can be stored in the n-gram hash.
 

Functions

unsigned int hash_ngram (int N, int *tuple)
 Computes 32-bit hash value for n-gram.

cl_ngram_hash cl_new_ngram_hash (int N, int buckets)
 Creates a new cl_ngram_hash object.

void cl_delete_ngram_hash (cl_ngram_hash hash)
 Deletes a cl_ngram_hash object.

void cl_ngram_hash_auto_grow (cl_ngram_hash hash, int flag)
 Turns a cl_ngram_hash's ability to auto-grow on or off.

void cl_ngram_hash_auto_grow_fillrate (cl_ngram_hash hash, double limit, double target)
 Configure auto-grow parameters.

int cl_ngram_hash_check_grow (cl_ngram_hash hash)
 Grows a ngram_hash table, increasing the number of buckets, if necessary.

cl_ngram_hash_entry cl_ngram_hash_find_i (cl_ngram_hash hash, int *ngram, unsigned int *ret_offset)
 Finds the entry corresponding to a particular n-gram in a cl_ngram_hash.

cl_ngram_hash_entry cl_ngram_hash_find (cl_ngram_hash hash, int *ngram)
 Finds the entry corresponding to a particular n-gram within a cl_ngram_hash.

cl_ngram_hash_entry cl_ngram_hash_add (cl_ngram_hash hash, int *ngram, unsigned int f)
 Adds an n-gram to a cl_ngram_hash table.

int cl_ngram_hash_freq (cl_ngram_hash hash, int *ngram)
 Gets the frequency of a particular n-gram within a cl_ngram_hash.

int cl_ngram_hash_del (cl_ngram_hash hash, int *ngram)
 Deletes an n-gram from a hash.

int cl_ngram_hash_size (cl_ngram_hash hash)
 Gets the number of distinct n-grams stored in a cl_ngram_hash.

cl_ngram_hash_entry * cl_ngram_hash_get_entries (cl_ngram_hash hash, int *ret_size)
 Get an array of all entries in an n-gram hash.

void cl_ngram_hash_iterator_reset (cl_ngram_hash hash)
 Iterate over all entries in an n-gram hash.

cl_ngram_hash_entry cl_ngram_hash_iterator_next (cl_ngram_hash hash)
 Iterate over all entries in an n-gram hash.

int * cl_ngram_hash_stats (cl_ngram_hash hash, int max_n)
 Compute statistics on bucket fill rates (for debugging and optimization).

void cl_ngram_hash_print_stats (cl_ngram_hash hash, int max_n)
 Display statistics on bucket fill rates (for debugging and optimization).
 

Macro Definition Documentation

#define DEFAULT_FILLRATE_LIMIT (   n)    5.0

Default parameters for auto-growing the table of buckets.

See also
cl_ngram_hash_auto_grow_fillrate for details.

Referenced by cl_new_ngram_hash().

#define DEFAULT_FILLRATE_TARGET (   n)    1.0

keep memory overhead for bucket table below 50%

Referenced by cl_new_ngram_hash().

#define DEFAULT_NR_OF_BUCKETS   250000

Defines the default number of buckets in an n-gram hash.

Referenced by cl_new_ngram_hash().

#define MAX_BUCKETS   1000000007

Maximum number of buckets n-gram hash will try to allocate when auto-growing.

1 billion (incremented to next prime number)

Referenced by cl_ngram_hash_check_grow().
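
The remark "incremented to next prime number" can be checked with a short trial-division sketch. The helpers `is_prime` and `next_prime` below are hypothetical names written only for this illustration; they are not the library's find_prime() and make no claim about how it works.

```c
#include <stdbool.h>

/* Trial-division primality test; adequate for values around 10^9. */
static bool is_prime(long long n)
{
  if (n < 2)
    return false;
  for (long long d = 2; d * d <= n; d++)
    if (n % d == 0)
      return false;
  return true;
}

/* Smallest prime that is greater than or equal to n. */
static long long next_prime(long long n)
{
  while (!is_prime(n))
    n++;
  return n;
}
```

Calling `next_prime(1000000000LL)` yields 1000000007, the value of MAX_BUCKETS.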

#define MAX_ENTRIES   2147483647

Maximum number of entries that can be stored in the n-gram hash.

2^31 - 1

Referenced by cl_ngram_hash_add().

Function Documentation

void cl_delete_ngram_hash ( cl_ngram_hash  hash)

Deletes a cl_ngram_hash object.

This deletes all the entries in all the buckets in the ngram_hash, plus the cl_ngram_hash itself.

Parameters
hash  The cl_ngram_hash to delete.

References _cl_ngram_hash::buckets, cl_free, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally().

cl_ngram_hash cl_new_ngram_hash ( int  N,
int  buckets 
)

Creates a new cl_ngram_hash object.

Parameters
N  N-gram size
buckets  The number of buckets in the newly-created cl_ngram_hash; set to 0 to use the default number of buckets.
Returns
The new cl_ngram_hash.

References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_calloc(), cl_malloc(), DEFAULT_FILLRATE_LIMIT, DEFAULT_FILLRATE_TARGET, DEFAULT_NR_OF_BUCKETS, _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash::fillrate_target, find_prime(), _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, _cl_ngram_hash::N, and _cl_ngram_hash::table.

Referenced by cl_ngram_hash_check_grow(), ComputeGroupInternally(), and main().

cl_ngram_hash_entry cl_ngram_hash_add ( cl_ngram_hash  hash,
int *  ngram,
unsigned int  f 
)

Adds an n-gram to a cl_ngram_hash table.

If the n-gram is already in the hash, its frequency count is increased by the specified value f.

Otherwise, a new entry is created and its frequency count is set to f. The n-gram is embedded in the new hash entry, so the original array does not need to be kept in memory.

Parameters
hash  The hash table to add to.
ngram  The n-gram to add.
f  Frequency count of the n-gram.
Returns
A pointer to a (new or existing) entry

References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_malloc(), cl_ngram_hash_check_grow(), cl_ngram_hash_find_i(), _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash_entry::freq, MAX_ENTRIES, _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally(), and main().
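
The add-or-increment behaviour described above can be sketched with a toy chained hash table. Everything in this block is an assumption made for illustration: the `toy_*` names, the struct layout, and in particular `toy_hash_ngram`, which is a placeholder and not the algorithm used by hash_ngram(). Allocation failures are not handled, to keep the sketch short.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for the library's types; all names here are hypothetical. */
typedef struct toy_entry {
  struct toy_entry *next;   /* next entry in the same bucket */
  unsigned int freq;        /* frequency count */
  int ngram[];              /* the N integers, stored inside the entry */
} toy_entry;

typedef struct {
  toy_entry **table;        /* array of bucket heads */
  unsigned int buckets;
  int N;
  int entries;
} toy_hash;

/* Placeholder hash function, NOT the one used by hash_ngram(). */
static unsigned int toy_hash_ngram(int N, const int *tuple)
{
  unsigned int h = 5381u;
  for (int i = 0; i < N; i++)
    h = h * 33u + (unsigned int) tuple[i];
  return h;
}

static toy_hash *toy_new(int N, unsigned int buckets)
{
  toy_hash *h = malloc(sizeof *h);
  h->table = calloc(buckets, sizeof *h->table);
  h->buckets = buckets;
  h->N = N;
  h->entries = 0;
  return h;
}

/* Look up an n-gram; optionally report its bucket index via ret_offset. */
static toy_entry *toy_find(toy_hash *h, const int *ngram, unsigned int *ret_offset)
{
  unsigned int offset = toy_hash_ngram(h->N, ngram) % h->buckets;
  if (ret_offset)
    *ret_offset = offset;
  for (toy_entry *e = h->table[offset]; e; e = e->next)
    if (memcmp(e->ngram, ngram, (size_t) h->N * sizeof(int)) == 0)
      return e;
  return NULL;
}

/* Add f to an existing entry, or create a new entry with frequency f. */
static toy_entry *toy_add(toy_hash *h, const int *ngram, unsigned int f)
{
  unsigned int offset;
  toy_entry *e = toy_find(h, ngram, &offset);
  if (e) {
    e->freq += f;
    return e;
  }
  e = malloc(sizeof *e + (size_t) h->N * sizeof(int));
  memcpy(e->ngram, ngram, (size_t) h->N * sizeof(int)); /* copy, so the caller's array is not needed later */
  e->freq = f;
  e->next = h->table[offset];   /* push onto the front of the bucket's chain */
  h->table[offset] = e;
  h->entries++;
  return e;
}
```

Adding the same n-gram twice with f=2 and f=3 returns the same entry both times, leaves its frequency at 5, and keeps the entry count at 1.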

void cl_ngram_hash_auto_grow ( cl_ngram_hash  hash,
int  flag 
)

Turns a cl_ngram_hash's ability to auto-grow on or off.

When this setting is switched on, the ngram_hash will grow automatically to avoid performance degradation.

Note that auto-grow is switched on by default.

See also
cl_ngram_hash_auto_grow_fillrate, cl_ngram_hash_check_grow
Parameters
hash  The hash that will be affected.
flag  New value for autogrow setting: boolean where true is on and false is off.

References _cl_ngram_hash::auto_grow.

Referenced by main().

void cl_ngram_hash_auto_grow_fillrate ( cl_ngram_hash  hash,
double  limit,
double  target 
)

Configure auto-grow parameters.

These settings are only relevant if auto-growing is enabled.

The decision to expand the bucket table of a ngram_hash is based on its fill rate, i.e. the average number of entries in each bucket. Under normal circumstances, this value corresponds to the average number of comparisons required to insert a new entry into the hash (locating an existing value should require roughly half as many comparisons).

Auto-growing is triggered if the fill rate exceeds a specified limit. The new number of buckets is chosen so that the fill rate after expansion corresponds to the specified target value.

The two fill rate parameters represent a trade-off between memory overhead (8 bytes for each bucket) and performance (average number of entries that have to be checked for each hash access), which depends crucially on the value of N (i.e. n-gram size).

For N=1, a bucket table with low fill rate incurs a substantial memory overhead, which may even exceed the storage required for the entries themselves. For large N, the relative memory overhead is much smaller, while checking the list of entries in a bucket becomes more expensive (N integer comparisons for each item).

Note that the ratio limit / target determines how often the bucket table has to be reallocated; it should not be smaller than 4.0.

A reasonable value for the fill rate limit seems to be around 5.0; if speed is crucial, N is relatively large, and memory footprint isn't a concern, smaller values down to 2.0 might be chosen. The target fill rate should not be set too low for small N. If N=1, a target fill rate of 0.5 results in 100% memory overhead after expansion of the bucket table (16 bytes per entry vs. 8 bytes each for twice as many buckets as there are entries).

When working on very large data sets, it is recommended to disable auto-grow and initialise the n-gram hash with a sufficiently large number of buckets.

See also
cl_ngram_hash_auto_grow, cl_ngram_hash_check_grow
Parameters
hash  The hash that will be affected.
limit  Fill rate limit, which triggers expansion of the n-gram hash
target  Target fill rate after expansion (determines new number of buckets)

References _cl_ngram_hash::fillrate_limit, and _cl_ngram_hash::fillrate_target.
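
The memory-overhead arithmetic in the description can be written out as a small worked example. This is only a sketch: `bucket_overhead` is a hypothetical helper, the 8 bytes per bucket and the 16 bytes per entry for N=1 are taken from the text above, and the entry size for other values of N is left as an input because the text does not state it.

```c
/* Size of one bucket slot in bytes, as stated in the description. */
#define BUCKET_BYTES 8.0

/*
 * Relative memory overhead of the bucket table: bytes spent on buckets
 * divided by bytes spent on entries. At fill rate r there are 1/r
 * buckets per entry, so the overhead is (BUCKET_BYTES / r) / entry_bytes.
 */
static double bucket_overhead(double fill_rate, double entry_bytes)
{
  return (BUCKET_BYTES / fill_rate) / entry_bytes;
}
```

For N=1 (16 bytes per entry), `bucket_overhead(0.5, 16.0)` gives 1.0, i.e. the 100% overhead mentioned above, while the default target of 1.0 gives 0.5, i.e. 50%.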

int cl_ngram_hash_check_grow ( cl_ngram_hash  hash)

Grows a ngram_hash table, increasing the number of buckets, if necessary.

This function is called after inserting a new entry into the n-gram hash. It checks whether the current fill rate exceeds the specified limit. If this is the case, and auto_grow is enabled, then the hash is expanded by increasing the number of buckets, such that the new average fill rate corresponds to the specified target value. This gives the hash better performance and makes it capable of absorbing more keys.

If the bucket table would be expanded to more than MAX_BUCKETS entries, auto-grow is automatically disabled for this ngram_hash.

Note: this function also implements the hashing algorithm and must be consistent with cl_ngram_hash_find_i().

Usage: expanded = cl_ngram_hash_check_grow(cl_ngram_hash hash);

This is a non-exported function.

See also
cl_ngram_hash_auto_grow, cl_ngram_hash_auto_grow_fillrate
Parameters
hash  The cl_ngram_hash to autogrow.
Returns
Always 0.

References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_debug, cl_free, cl_new_ngram_hash(), cl_ngram_hash_print_stats(), _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash::fillrate_target, hash_ngram(), MAX_BUCKETS, _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.

Referenced by cl_ngram_hash_add().
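
The growth decision described above can be sketched as a pure function of the entry count, the bucket count and the two fill rate parameters. `grow_target` is a hypothetical helper for illustration only: the real function also redistributes the entries into the new table, and the reference list suggests it obtains the new table from cl_new_ngram_hash(), which may adjust the bucket count; neither step is modelled here.

```c
#define MAX_BUCKETS 1000000007 /* value documented in this file */

/*
 * Returns the bucket count to grow to, or 0 if the table should stay as
 * it is (fill rate within the limit, or the new size would exceed
 * MAX_BUCKETS, in which case the caller would disable auto-grow).
 */
static long long grow_target(long long entries, long long buckets,
                             double limit, double target)
{
  double fill_rate = (double) entries / (double) buckets;
  if (fill_rate <= limit)
    return 0;                       /* limit not exceeded: nothing to do */
  double wanted = (double) entries / target;
  if (wanted > (double) MAX_BUCKETS)
    return 0;                       /* too large for the bucket table */
  return (long long) wanted;
}
```

With the defaults (250000 buckets, limit 5.0, target 1.0), the table would first grow when it holds 1250001 entries, and the new bucket count would be about the same as the entry count.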

int cl_ngram_hash_del ( cl_ngram_hash  hash,
int *  ngram 
)

Deletes an n-gram from a hash.

The entry corresponding to the specified n-gram is removed from the cl_ngram_hash. If the n-gram is not in the hash to begin with, no action is taken.

Parameters
hash  The hash to alter.
ngram  The n-gram to remove.
Returns
The frequency of the deleted entry (0 if not found).

References cl_free, cl_ngram_hash_find_i(), _cl_ngram_hash::entries, _cl_ngram_hash_entry::freq, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.
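
Removing an entry from a bucket amounts to unlinking it from a singly linked chain. The sketch below shows one common way to do that with a pointer-to-pointer walk; `del_entry` and `chain_delete` are hypothetical names, a single int key stands in for the n-gram, and the library's own code may be organised differently.

```c
#include <stdlib.h>

/* Hypothetical chain node; `key` stands in for the n-gram. */
typedef struct del_entry {
  struct del_entry *next;
  int key;
  int freq;
} del_entry;

/* Unlink and free the entry with the given key; return its frequency, or 0 if absent. */
static int chain_delete(del_entry **head, int key)
{
  for (del_entry **link = head; *link; link = &(*link)->next) {
    if ((*link)->key == key) {
      del_entry *victim = *link;
      int f = victim->freq;
      *link = victim->next;   /* bypass the victim in the chain */
      free(victim);
      return f;
    }
  }
  return 0;                   /* not found: no action taken */
}
```

Walking with a pointer to the link, rather than to the node, avoids a special case for deleting the first entry of a bucket.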

cl_ngram_hash_entry cl_ngram_hash_find ( cl_ngram_hash  hash,
int *  ngram 
)

Finds the entry corresponding to a particular n-gram within a cl_ngram_hash.

This function is basically a wrapper around the internal function cl_ngram_hash_find_i.

See also
cl_ngram_hash_find_i
Parameters
hash  The hash to search.
ngram  The n-gram to look for.
Returns
The entry that is found (or NULL if the n-gram is not in the hash).

References cl_ngram_hash_find_i().

cl_ngram_hash_entry cl_ngram_hash_find_i ( cl_ngram_hash  hash,
int *  ngram,
unsigned int *  ret_offset 
)

Finds the entry corresponding to a particular n-gram in a cl_ngram_hash.

This function is the same as cl_ngram_hash_find(), but *ret_offset is set to the hashtable offset computed for the n-gram (i.e. the index of the bucket within the hashtable), unless ret_offset == NULL.

Note that this function hides the hashing algorithm details from the rest of the n-gram hash implementation (except cl_ngram_hash_check_grow, which re-implements the hashing algorithm for performance reasons).

Usage: entry = cl_ngram_hash_find_i(cl_ngram_hash hash, int *ngram, unsigned int *ret_offset);

This is a non-exported function.

Parameters
hash  The hash to search.
ngram  The n-gram to look for.
ret_offset  This integer address will be filled with the n-gram's hashtable offset (can be NULL, in which case it is ignored).
Returns
The entry that is found (or NULL if the n-gram is not in the hash).

References _cl_ngram_hash::buckets, hash_ngram(), _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.

Referenced by cl_ngram_hash_add(), cl_ngram_hash_del(), cl_ngram_hash_find(), and cl_ngram_hash_freq().

int cl_ngram_hash_freq ( cl_ngram_hash  hash,
int *  ngram 
)

Gets the frequency of a particular n-gram within a cl_ngram_hash.

Parameters
hash  The hash to look in.
ngram  The n-gram to look for.
Returns
The frequency of that n-gram, or 0 if it is not in the hash

References cl_ngram_hash_find_i(), and _cl_ngram_hash_entry::freq.

Referenced by ComputeGroupInternally().

cl_ngram_hash_entry* cl_ngram_hash_get_entries ( cl_ngram_hash  hash,
int *  ret_size 
)

Get an array of all entries in an n-gram hash.

This function returns a newly allocated array of cl_ngram_hash_entry pointers enumerating all entries of the hash in an unspecified order.

Parameters
hash  The n-gram hash to operate on.
ret_size  If not NULL, the number of entries in the returned array will be stored in this location.

References _cl_ngram_hash::buckets, cl_malloc(), _cl_ngram_hash::entries, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

cl_ngram_hash_entry cl_ngram_hash_iterator_next ( cl_ngram_hash  hash)

Iterate over all entries in an n-gram hash.

Note that there is only a single iterator for each cl_ngram_hash object, so different parts of the application code must not try to iterate through the hash at the same time.

This function returns the next entry from the hash, or NULL if there are no more entries. Keep in mind that the hash is traversed in an unspecified order.

Parameters
hash  The n-gram hash to iterate over.

References _cl_ngram_hash::buckets, _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally(), and main().

void cl_ngram_hash_iterator_reset ( cl_ngram_hash  hash)

Iterate over all entries in an n-gram hash.

Simple iterator for the entries of an n-gram hash.

Note that there is only a single iterator for each cl_ngram_hash object, so different parts of the application code must not try to iterate through the hash at the same time.

This function resets the iterator to the start of the hash.

Parameters
hash  The n-gram hash to iterate over.

References _cl_ngram_hash::buckets, _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally(), and main().
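
The single-iterator design (one cursor stored in the hash object itself) can be sketched as follows. The `it_*` names and the miniature struct are hypothetical; only the field names iter_bucket and iter_point are borrowed from the reference lists above, and how the library actually uses them is an assumption.

```c
#include <stddef.h>

/* Hypothetical miniature of the hash layout, just enough to show iteration. */
typedef struct it_entry {
  struct it_entry *next;
  int id;
} it_entry;

typedef struct {
  it_entry **table;      /* array of bucket heads */
  int buckets;
  int iter_bucket;       /* bucket the iterator is currently in */
  it_entry *iter_point;  /* next entry to return, or NULL */
} it_hash;

/* Rewind the (single) iterator to the start of the table. */
static void it_reset(it_hash *h)
{
  h->iter_bucket = -1;
  h->iter_point = NULL;
}

/* Return the next entry, or NULL once every bucket has been visited. */
static it_entry *it_next(it_hash *h)
{
  while (h->iter_point == NULL) {
    if (++h->iter_bucket >= h->buckets) {
      h->iter_bucket = h->buckets;  /* stay at the end on repeated calls */
      return NULL;
    }
    h->iter_point = h->table[h->iter_bucket];
  }
  it_entry *e = h->iter_point;
  h->iter_point = e->next;
  return e;
}
```

Because the cursor lives inside the hash object, a nested loop over the same hash would reset or advance the outer loop's position, which is why concurrent iteration is ruled out above.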

void cl_ngram_hash_print_stats ( cl_ngram_hash  hash,
int  max_n 
)

Display statistics on bucket fill rates (for debugging and optimization).

This function prints a table showing the distribution of bucket sizes, i.e. how many buckets contain a given number of keys. The table is printed to STDERR, like all debugging output in CWB.

Parameters
hash  The n-gram hash.
max_n  Count buckets with up to max_n entries.

References _cl_ngram_hash::buckets, cl_free, cl_ngram_hash_stats(), and _cl_ngram_hash::entries.

Referenced by cl_ngram_hash_check_grow(), and main().

int cl_ngram_hash_size ( cl_ngram_hash  hash)

Gets the number of distinct n-grams stored in a cl_ngram_hash.

This returns the total number of entries in all the buckets in the whole hash table.

Parameters
hash  The hash to size up.

References _cl_ngram_hash::entries.

Referenced by ComputeGroupInternally(), and main().

int* cl_ngram_hash_stats ( cl_ngram_hash  hash,
int  max_n 
)

Compute statistics on bucket fill rates (for debugging and optimization).

Statistics on bucket fill rates for debugging purposes.

This function returns an allocated integer array of length max_n + 1, whose i-th entry specifies the number of buckets containing i keys. For i == 0, this is the number of empty buckets. The last entry (i == max_n) is the cumulative number of buckets containing i or more entries.

Parameters
hash  The n-gram hash.
max_n  Count buckets with up to max_n entries.

References _cl_ngram_hash::buckets, cl_calloc(), _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

Referenced by cl_ngram_hash_print_stats().
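
The layout of the returned array (max_n + 1 counters, with the last one cumulative) can be illustrated with a small sketch. `bucket_stats` is a hypothetical helper that takes precomputed bucket lengths as input, whereas the library function walks the hash's own bucket chains.

```c
#include <stdlib.h>

/*
 * bucket_len[b] is the number of entries in bucket b (hypothetical input).
 * Returns a calloc'ed array of max_n + 1 counters, or NULL on allocation
 * failure; the caller frees it.
 */
static int *bucket_stats(const int *bucket_len, int buckets, int max_n)
{
  int *stats = calloc((size_t) max_n + 1, sizeof *stats);
  if (!stats)
    return NULL;
  for (int b = 0; b < buckets; b++) {
    int n = bucket_len[b];
    if (n > max_n)
      n = max_n;          /* last slot collects all larger buckets */
    stats[n]++;
  }
  return stats;
}
```

For bucket lengths {0, 1, 1, 3, 5} and max_n = 3, the result is {1, 2, 0, 2}: one empty bucket, two buckets with one entry, none with two, and two with three or more.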

unsigned int hash_ngram ( int  N,
int *  tuple 
)

Computes 32-bit hash value for n-gram.

Referenced by cl_ngram_hash_check_grow(), and cl_ngram_hash_find_i().