CWB
#include "globals.h"
#include "macros.h"
#include "lexhash.h"
#include "ngram-hash.h"
#include <math.h>
Data Structures
struct _cl_ngram_hash
TODO: consider alternative hash functions (see cl/lexhash.h)
Macros
#define DEFAULT_NR_OF_BUCKETS 250000
Defines the default number of buckets in an n-gram hash.
#define DEFAULT_FILLRATE_LIMIT(n) 5.0
Default fill rate limit for auto-growing the table of buckets.
#define DEFAULT_FILLRATE_TARGET(n) 1.0
Default fill rate target for auto-growing; keeps memory overhead for the bucket table below 50%.
#define MAX_BUCKETS 1000000007
Maximum number of buckets the n-gram hash will try to allocate when auto-growing.
#define MAX_ENTRIES 2147483647
Maximum number of entries that can be stored in the n-gram hash.
Functions
unsigned int hash_ngram(int N, int *tuple)
Computes 32bit hash value for n-gram.
cl_ngram_hash cl_new_ngram_hash(int N, int buckets)
Creates a new cl_ngram_hash object.
void cl_delete_ngram_hash(cl_ngram_hash hash)
Deletes a cl_ngram_hash object.
void cl_ngram_hash_auto_grow(cl_ngram_hash hash, int flag)
Turns a cl_ngram_hash's ability to auto-grow on or off.
void cl_ngram_hash_auto_grow_fillrate(cl_ngram_hash hash, double limit, double target)
Configure auto-grow parameters.
int cl_ngram_hash_check_grow(cl_ngram_hash hash)
Grows a ngram_hash table, increasing the number of buckets, if necessary.
cl_ngram_hash_entry cl_ngram_hash_find_i(cl_ngram_hash hash, int *ngram, unsigned int *ret_offset)
Finds the entry corresponding to a particular n-gram in a cl_ngram_hash.
cl_ngram_hash_entry cl_ngram_hash_find(cl_ngram_hash hash, int *ngram)
Finds the entry corresponding to a particular n-gram within a cl_ngram_hash.
cl_ngram_hash_entry cl_ngram_hash_add(cl_ngram_hash hash, int *ngram, unsigned int f)
Adds an n-gram to a cl_ngram_hash table.
int cl_ngram_hash_freq(cl_ngram_hash hash, int *ngram)
Gets the frequency of a particular n-gram within a cl_ngram_hash.
int cl_ngram_hash_del(cl_ngram_hash hash, int *ngram)
Deletes an n-gram from a hash.
int cl_ngram_hash_size(cl_ngram_hash hash)
Gets the number of distinct n-grams stored in a cl_ngram_hash.
cl_ngram_hash_entry * cl_ngram_hash_get_entries(cl_ngram_hash hash, int *ret_size)
Get an array of all entries in an n-gram hash.
void cl_ngram_hash_iterator_reset(cl_ngram_hash hash)
Iterate over all entries in an n-gram hash.
cl_ngram_hash_entry cl_ngram_hash_iterator_next(cl_ngram_hash hash)
Iterate over all entries in an n-gram hash.
int * cl_ngram_hash_stats(cl_ngram_hash hash, int max_n)
Compute statistics on bucket fill rates (for debugging and optimization).
void cl_ngram_hash_print_stats(cl_ngram_hash hash, int max_n)
Display statistics on bucket fill rates (for debugging and optimization).
#define DEFAULT_FILLRATE_LIMIT(n) 5.0
Default fill rate limit for auto-growing the table of buckets.
Referenced by cl_new_ngram_hash().
#define DEFAULT_FILLRATE_TARGET(n) 1.0
Default fill rate target for auto-growing; keeps memory overhead for the bucket table below 50%.
Referenced by cl_new_ngram_hash().
#define DEFAULT_NR_OF_BUCKETS 250000
Defines the default number of buckets in an n-gram hash.
Referenced by cl_new_ngram_hash().
#define MAX_BUCKETS 1000000007
Maximum number of buckets n-gram hash will try to allocate when auto-growing.
1 billion (incremented to next prime number)
Referenced by cl_ngram_hash_check_grow().
#define MAX_ENTRIES 2147483647
Maximum number of entries that can be stored in the n-gram hash.
2^31 - 1
Referenced by cl_ngram_hash_add().
void cl_delete_ngram_hash(cl_ngram_hash hash)
Deletes a cl_ngram_hash object.
This deletes all the entries in all the buckets in the ngram_hash, plus the cl_ngram_hash itself.
hash | The cl_ngram_hash to delete. |
References _cl_ngram_hash::buckets, cl_free, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.
Referenced by ComputeGroupInternally().
cl_ngram_hash cl_new_ngram_hash(int N, int buckets)
Creates a new cl_ngram_hash object.
N | N-gram size |
buckets | The number of buckets in the newly-created cl_ngram_hash; set to 0 to use the default number of buckets. |
References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_calloc(), cl_malloc(), DEFAULT_FILLRATE_LIMIT, DEFAULT_FILLRATE_TARGET, DEFAULT_NR_OF_BUCKETS, _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash::fillrate_target, find_prime(), _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, _cl_ngram_hash::N, and _cl_ngram_hash::table.
Referenced by cl_ngram_hash_check_grow(), ComputeGroupInternally(), and main().
cl_ngram_hash_entry cl_ngram_hash_add(cl_ngram_hash hash, int *ngram, unsigned int f)
Adds an n-gram to a cl_ngram_hash table.
If the n-gram is already in the hash, its frequency count is increased by the specified value f.
Otherwise, a new entry is created and its frequency count is set to f. The n-gram is embedded in the new hash entry, so the original array does not need to be kept in memory.
hash | The hash table to add to. |
ngram | The n-gram to add. |
f | Frequency count of the n-gram. |
References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_malloc(), cl_ngram_hash_check_grow(), cl_ngram_hash_find_i(), _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash_entry::freq, MAX_ENTRIES, _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.
Referenced by ComputeGroupInternally(), and main().
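The add semantics described above (accumulate the frequency of a known n-gram, otherwise create an entry that embeds its own copy of the key) can be sketched with a small stand-alone chained hash table. This is only an illustration under stated assumptions: the toy_* types and functions are hypothetical, the entry layout is guessed from the member names referenced above, and the placeholder hash is not the one CWB uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy types: the field names mirror the members referenced above, but the
 * layout itself is an assumption made for this sketch. */
typedef struct toy_entry {
  struct toy_entry *next;   /* next entry in the same bucket */
  unsigned int freq;        /* accumulated frequency count */
  int ngram[];              /* the N integers are stored inside the entry */
} toy_entry;

typedef struct {
  toy_entry **table;        /* array of bucket heads */
  unsigned int buckets;
  unsigned int entries;
  int N;
} toy_hash;

/* Placeholder hash, NOT the function used by CWB. */
static unsigned int toy_hash_ngram(int N, const int *tuple)
{
  unsigned int h = 5381u;
  for (int i = 0; i < N; i++)
    h = h * 33u + (unsigned int)tuple[i];
  return h;
}

toy_hash *toy_new(int N, unsigned int buckets)
{
  toy_hash *h;
  assert(N > 0 && buckets > 0);
  h = malloc(sizeof *h);
  if (h == NULL)
    return NULL;
  h->table = calloc(buckets, sizeof *h->table);
  if (h->table == NULL) {
    free(h);
    return NULL;
  }
  h->buckets = buckets;
  h->entries = 0;
  h->N = N;
  return h;
}

toy_entry *toy_add(toy_hash *h, const int *ngram, unsigned int f)
{
  size_t key_bytes;
  unsigned int offset;
  toy_entry *e;
  assert(h != NULL && ngram != NULL);
  key_bytes = (size_t)h->N * sizeof(int);
  offset = toy_hash_ngram(h->N, ngram) % h->buckets;
  for (e = h->table[offset]; e != NULL; e = e->next) {
    if (memcmp(e->ngram, ngram, key_bytes) == 0) {
      e->freq += f;                      /* known n-gram: accumulate */
      return e;
    }
  }
  e = malloc(sizeof *e + key_bytes);
  if (e == NULL)
    return NULL;
  memcpy(e->ngram, ngram, key_bytes);    /* embed a private copy of the key */
  e->freq = f;
  e->next = h->table[offset];            /* prepend to the bucket chain */
  h->table[offset] = e;
  h->entries++;
  return e;
}

void toy_delete(toy_hash *h)
{
  if (h == NULL)
    return;
  for (unsigned int i = 0; i < h->buckets; i++) {
    toy_entry *e = h->table[i];
    while (e != NULL) {
      toy_entry *next = e->next;
      free(e);
      e = next;
    }
  }
  free(h->table);
  free(h);
}
```

Because the key is copied into the entry, the caller may reuse or overwrite its own array immediately after toy_add() returns.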
void cl_ngram_hash_auto_grow(cl_ngram_hash hash, int flag)
Turns a cl_ngram_hash's ability to auto-grow on or off.
When this setting is switched on, the ngram_hash will grow automatically to avoid performance degradation.
Note the default value for this setting is SWITCHED ON.
hash | The hash that will be affected. |
flag | New value for autogrow setting: boolean where true is on and false is off. |
References _cl_ngram_hash::auto_grow.
Referenced by main().
void cl_ngram_hash_auto_grow_fillrate(cl_ngram_hash hash, double limit, double target)
Configure auto-grow parameters.
These settings are only relevant if auto-growing is enabled.
The decision to expand the bucket table of a ngram_hash is based on its fill rate, i.e. the average number of entries in each bucket. Under normal circumstances, this value corresponds to the average number of comparisons required to insert a new entry into the hash (locating an existing value should require roughly half as many comparisons).
Auto-growing is triggered if the fill rate exceeds a specified limit. The new number of buckets is chosen so that the fill rate after expansion corresponds to the specified target value.
The two fill rate parameters represent a trade-off between memory overhead (8 bytes for each bucket) and performance (average number of entries that have to be checked for each hash access), which depends crucially on the value of N (i.e. n-gram size).
For N=1, a bucket table with low fill rate incurs a substantial memory overhead, which may even exceed the storage required for the entries themselves. For large N, the relative memory overhead is much smaller, while checking the list of entries in a bucket becomes more expensive (N integer comparisons for each item).
Note that the ratio limit / target determines how often the bucket table has to be reallocated; it should not be smaller than 4.0.
A reasonable value for the fill rate limit seems to be around 5.0; if speed is crucial, N is relatively large, and memory footprint isn't a concern, smaller values down to 2.0 might be chosen. The target fill rate should not be set too low for small N. If N=1, a target fill rate of 0.5 results in 100% memory overhead after expansion of the bucket table (16 bytes per entry vs. 8 bytes each for twice as many buckets as there are entries).
When working on very large data sets, it is recommended to disable auto-grow and initialise the n-gram hash with a sufficiently large number of buckets.
hash | The hash that will be affected. |
limit | Fill rate limit, which triggers expansion of the n-gram hash |
target | Target fill rate after expansion (determines new number of buckets) |
References _cl_ngram_hash::fillrate_limit, and _cl_ngram_hash::fillrate_target.
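The memory trade-off discussed above can be checked with a little arithmetic. The sketch below computes the relative overhead of the bucket table at a given fill rate. The function name bucket_overhead is hypothetical, and the entry size formula (12 + 4*N bytes) is an assumption chosen to reproduce the 16 bytes per entry quoted for N=1; the 8 bytes per bucket come from the text.

```c
#include <assert.h>

/* Relative memory overhead of the bucket table: bytes spent on bucket slots
 * divided by bytes spent on entries, at a given fill rate
 * (fill rate = entries per bucket). */
double bucket_overhead(int N, double fill_rate)
{
  const double bucket_bytes = 8.0;            /* from the text: 8 bytes per bucket */
  /* ASSUMPTION: 8 (next pointer) + 4 (freq) + 4*N (key); yields 16 for N=1 */
  const double entry_bytes = 12.0 + 4.0 * N;
  assert(N > 0 && fill_rate > 0.0);
  return bucket_bytes / (fill_rate * entry_bytes);
}
```

Under these assumptions, bucket_overhead(1, 0.5) evaluates to 1.0 (the 100% overhead mentioned above), bucket_overhead(1, 1.0) to 0.5, and bucket_overhead(5, 1.0) to 0.25, which illustrates why low fill rates are cheaper for larger N.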
int cl_ngram_hash_check_grow(cl_ngram_hash hash)
Grows an ngram_hash table, increasing the number of buckets, if necessary.
This function is called after inserting a new entry into the n-gram hash. It checks whether the current fill rate exceeds the specified limit. If this is the case, and auto_grow is enabled, then the hash is expanded by increasing the number of buckets, such that the new average fill rate corresponds to the specified target value. This gives the hash better performance and makes it capable of absorbing more keys.
If the bucket table would be expanded to more than MAX_BUCKETS entries, auto-grow is automatically disabled for this ngram_hash.
Note: this function also implements the hashing algorithm and must be consistent with cl_ngram_hash_find_i().
Usage: expanded = cl_ngram_hash_check_grow(cl_ngram_hash hash);
This is a non-exported function.
hash | The cl_ngram_hash to autogrow. |
References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_debug, cl_free, cl_new_ngram_hash(), cl_ngram_hash_print_stats(), _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash::fillrate_target, hash_ngram(), MAX_BUCKETS, _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.
Referenced by cl_ngram_hash_add().
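The grow decision described above can be sketched as a pure function. This is a simplified illustration, not the CWB implementation: the name sketch_grow_target is hypothetical, the sketch does not round the new size up to a prime, it does not rehash anything, and what exactly happens to the table once the cap is hit (here: no growth at all) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_BUCKETS 1000000007   /* same value as MAX_BUCKETS above */

/* Returns the bucket count to grow to, or 0 if the table should stay as is.
 * Clears *auto_grow if the requested size would exceed the cap. */
unsigned int sketch_grow_target(unsigned int entries, unsigned int buckets,
                                double limit, double target, int *auto_grow)
{
  double fill_rate, wanted;
  assert(buckets > 0 && limit > 0.0 && target > 0.0 && auto_grow != NULL);
  if (!*auto_grow)
    return 0;                            /* auto-grow switched off */
  fill_rate = (double)entries / (double)buckets;
  if (fill_rate <= limit)
    return 0;                            /* still within the limit */
  wanted = (double)entries / target;     /* buckets needed to reach the target */
  if (wanted > (double)SKETCH_MAX_BUCKETS) {
    *auto_grow = 0;                      /* too large: disable auto-grow */
    return 0;
  }
  return (unsigned int)wanted;
}
```

With the default parameters (limit 5.0, target 1.0), a table of 250 buckets holding 1500 entries has a fill rate of 6.0 and would be asked to grow to 1500 buckets.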
int cl_ngram_hash_del(cl_ngram_hash hash, int *ngram)
Deletes an n-gram from a hash.
The entry corresponding to the specified n-gram is removed from the cl_ngram_hash. If the n-gram is not in the hash to begin with, no action is taken.
hash | The hash to alter. |
ngram | The n-gram to remove. |
References cl_free, cl_ngram_hash_find_i(), _cl_ngram_hash::entries, _cl_ngram_hash_entry::freq, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.
cl_ngram_hash_entry cl_ngram_hash_find(cl_ngram_hash hash, int *ngram)
Finds the entry corresponding to a particular n-gram within a cl_ngram_hash.
This function is basically a wrapper around the internal function cl_ngram_hash_find_i.
hash | The hash to search. |
ngram | The n-gram to look for. |
References cl_ngram_hash_find_i().
cl_ngram_hash_entry cl_ngram_hash_find_i(cl_ngram_hash hash, int *ngram, unsigned int *ret_offset)
Finds the entry corresponding to a particular n-gram in a cl_ngram_hash.
This function is the same as cl_ngram_hash_find(), but *ret_offset is set to the hashtable offset computed for the n-gram (i.e. the index of the bucket within the hashtable), unless ret_offset == NULL.
Note that this function hides the hashing algorithm details from the rest of the n-gram hash implementation (except cl_ngram_hash_check_grow(), which re-implements the hashing algorithm for performance reasons).
Usage: entry = cl_ngram_hash_find_i(cl_ngram_hash hash, int *ngram, unsigned int *ret_offset);
This is a non-exported function.
hash | The hash to search. |
ngram | The ngram to look for. |
ret_offset | This integer address will be filled with the n-gram's hashtable offset (can be NULL, in which case it is ignored). |
References _cl_ngram_hash::buckets, hash_ngram(), _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.
Referenced by cl_ngram_hash_add(), cl_ngram_hash_del(), cl_ngram_hash_find(), and cl_ngram_hash_freq().
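A lookup that also reports the bucket offset might look roughly like the sketch below. It is a stand-alone illustration with hypothetical toy_* names, a fixed N of 2 to keep the types short, and a placeholder hash that is not the one CWB uses; only the overall shape (hash, reduce modulo the bucket count, walk the chain comparing N integers) follows the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy types for illustration; the layout is an assumption. */
typedef struct toy_entry {
  struct toy_entry *next;
  unsigned int freq;
  int ngram[2];             /* fixed N = 2 to keep the sketch short */
} toy_entry;

typedef struct {
  toy_entry **table;        /* array of bucket heads */
  unsigned int buckets;
  int N;
} toy_hash;

/* Placeholder hash, NOT the function used by CWB. */
static unsigned int toy_hash_ngram(int N, const int *tuple)
{
  unsigned int h = 5381u;
  for (int i = 0; i < N; i++)
    h = h * 33u + (unsigned int)tuple[i];
  return h;
}

/* Returns the matching entry or NULL; stores the bucket index in
 * *ret_offset unless ret_offset is NULL. */
toy_entry *toy_find_i(const toy_hash *hash, const int *ngram,
                      unsigned int *ret_offset)
{
  unsigned int offset;
  toy_entry *e;
  assert(hash != NULL && hash->buckets > 0 && ngram != NULL);
  offset = toy_hash_ngram(hash->N, ngram) % hash->buckets;
  if (ret_offset != NULL)
    *ret_offset = offset;   /* reported even when the n-gram is absent */
  for (e = hash->table[offset]; e != NULL; e = e->next)
    if (memcmp(e->ngram, ngram, (size_t)hash->N * sizeof(int)) == 0)
      return e;
  return NULL;
}
```

Reporting the offset even on a miss lets an insert routine link a new entry into the right bucket without hashing the key a second time.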
int cl_ngram_hash_freq(cl_ngram_hash hash, int *ngram)
Gets the frequency of a particular n-gram within a cl_ngram_hash.
hash | The hash to look in. |
ngram | The ngram to look for. |
References cl_ngram_hash_find_i(), and _cl_ngram_hash_entry::freq.
Referenced by ComputeGroupInternally().
cl_ngram_hash_entry* cl_ngram_hash_get_entries(cl_ngram_hash hash, int *ret_size)
Get an array of all entries in an n-gram hash.
This function returns a newly allocated array of cl_ngram_hash_entry pointers enumerating all entries of the hash in an unspecified order.
hash | The n-gram hash to operate on. |
ret_size | If not NULL, the number of entries in the returned array will be stored in this location. |
References _cl_ngram_hash::buckets, cl_malloc(), _cl_ngram_hash::entries, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.
cl_ngram_hash_entry cl_ngram_hash_iterator_next(cl_ngram_hash hash)
Iterate over all entries in an n-gram hash.
Note that there is only a single iterator for each cl_ngram_hash object, so different parts of the application code must not try to iterate through the hash at the same time.
This function returns the next entry from the hash, or NULL if there are no more entries. Keep in mind that the hash is traversed in an unspecified order.
hash | The n-gram hash to iterate over. |
References _cl_ngram_hash::buckets, _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.
Referenced by ComputeGroupInternally(), and main().
void cl_ngram_hash_iterator_reset(cl_ngram_hash hash)
Iterate over all entries in an n-gram hash.
Simple iterator for the entries of an n-gram hash.
Note that there is only a single iterator for each cl_ngram_hash object, so different parts of the application code must not try to iterate through the hash at the same time.
This function resets the iterator to the start of the hash.
hash | The n-gram hash to iterate over. |
References _cl_ngram_hash::buckets, _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, and _cl_ngram_hash::table.
Referenced by ComputeGroupInternally(), and main().
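A single built-in iterator of the kind described above can be sketched with two pieces of state stored in the hash itself. The toy_* names and the id field are hypothetical; the state variables are named after the iter_bucket and iter_point members referenced above, but how CWB actually uses them is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_entry {
  struct toy_entry *next;
  int id;                     /* payload used only for the illustration */
} toy_entry;

typedef struct {
  toy_entry **table;          /* array of bucket heads */
  unsigned int buckets;
  unsigned int iter_bucket;   /* bucket the iterator is currently in */
  toy_entry *iter_point;      /* next entry to hand out, or NULL */
} toy_hash;

/* Rewind the iterator to the first bucket. */
void toy_iterator_reset(toy_hash *h)
{
  assert(h != NULL && h->buckets > 0);
  h->iter_bucket = 0;
  h->iter_point = h->table[0];
}

/* Return the next entry, or NULL once every bucket has been visited. */
toy_entry *toy_iterator_next(toy_hash *h)
{
  toy_entry *e;
  assert(h != NULL);
  /* skip forward over empty or exhausted buckets */
  while (h->iter_point == NULL && h->iter_bucket + 1 < h->buckets) {
    h->iter_bucket++;
    h->iter_point = h->table[h->iter_bucket];
  }
  e = h->iter_point;
  if (e != NULL)
    h->iter_point = e->next;  /* remember where to continue */
  return e;
}
```

Since the position lives inside the hash object rather than in a separate iterator handle, two interleaved loops over the same hash would disturb each other, which is the restriction noted above.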
void cl_ngram_hash_print_stats(cl_ngram_hash hash, int max_n)
Display statistics on bucket fill rates (for debugging and optimization).
This function prints a table showing the distribution of bucket sizes, i.e. how many buckets contain a given number of keys. The table will be printed to STDERR, like all debugging output in CWB.
hash | The n-gram hash. |
max_n | Count buckets with up to max_n entries. |
References _cl_ngram_hash::buckets, cl_free, cl_ngram_hash_stats(), and _cl_ngram_hash::entries.
Referenced by cl_ngram_hash_check_grow(), and main().
int cl_ngram_hash_size(cl_ngram_hash hash)
Gets the number of distinct n-grams stored in a cl_ngram_hash.
This returns the total number of entries in all the buckets in the whole hash table.
hash | The hash to size up. |
References _cl_ngram_hash::entries.
Referenced by ComputeGroupInternally(), and main().
int* cl_ngram_hash_stats(cl_ngram_hash hash, int max_n)
Compute statistics on bucket fill rates (for debugging and optimization).
This function returns an allocated integer array of length max_n + 1, whose i-th entry specifies the number of buckets containing i keys. For i == 0, this is the number of empty buckets. The last entry (i == max_n) is the cumulative number of buckets containing i or more entries.
hash | The n-gram hash. |
max_n | Count buckets with up to max_n entries. |
References _cl_ngram_hash::buckets, cl_calloc(), _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.
Referenced by cl_ngram_hash_print_stats().
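The histogram described above, including the cumulative last bin, can be sketched as follows. The names toy_entry and toy_bucket_stats are hypothetical, and the sketch takes the bucket table directly instead of a cl_ngram_hash object.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_entry {
  struct toy_entry *next;
} toy_entry;

/* Returns a newly allocated array of max_n + 1 counters, where element i
 * is the number of buckets holding exactly i entries and the last element
 * counts buckets with max_n or more entries. Returns NULL if the
 * allocation fails; the caller releases the array with free(). */
int *toy_bucket_stats(toy_entry *const *table, unsigned int buckets, int max_n)
{
  int *stats;
  assert(table != NULL && max_n >= 0);
  stats = calloc((size_t)max_n + 1, sizeof *stats);
  if (stats == NULL)
    return NULL;
  for (unsigned int i = 0; i < buckets; i++) {
    int len = 0;
    for (const toy_entry *e = table[i]; e != NULL; e = e->next)
      len++;                   /* length of this bucket's chain */
    if (len > max_n)
      len = max_n;             /* longer chains share the last bin */
    stats[len]++;
  }
  return stats;
}
```

A table with many buckets in the last bin suggests that the fill rate limit is too high or that the hash function distributes the keys poorly.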
unsigned int hash_ngram(int N, int *tuple)
Computes 32bit hash value for n-gram.
Referenced by cl_ngram_hash_check_grow(), and cl_ngram_hash_find_i().
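The page does not say which algorithm hash_ngram() uses. Purely as an illustration of what a 32-bit hash over an integer tuple can look like, the sketch below applies FNV-1a to the bytes of each integer. This is a swapped-in technique, not the CWB function; the names sketch_hash_ngram and sketch_bucket_offset are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for hash_ngram(): 32-bit FNV-1a over the bytes of
 * each integer in the tuple. NOT the function CWB actually uses. */
uint32_t sketch_hash_ngram(int N, const int *tuple)
{
  uint32_t h = 2166136261u;              /* FNV-1a 32-bit offset basis */
  assert(N > 0 && tuple != NULL);
  for (int i = 0; i < N; i++) {
    uint32_t v = (uint32_t)tuple[i];
    for (int b = 0; b < 4; b++) {        /* low 4 bytes, least significant first */
      h ^= (v >> (8 * b)) & 0xffu;
      h *= 16777619u;                    /* FNV 32-bit prime */
    }
  }
  return h;
}

/* Bucket index that a table with `buckets` slots would derive from the hash. */
uint32_t sketch_bucket_offset(int N, const int *tuple, uint32_t buckets)
{
  assert(buckets > 0);
  return sketch_hash_ngram(N, tuple) % buckets;
}
```

Whatever function is used, the lookup code and the rehashing code must agree on it, since both compute bucket offsets from the same hash value.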