Machine learning, often referred to as text categorization, is the most widespread approach to automated classification of text: the characteristics of pre-defined categories are learnt from intellectually categorized documents. However, intellectually categorized documents are not available in many subject areas, for many types of documents, or for many user groups. For example, the standard text classification benchmark today is the Reuters RCV1 collection [14], which has about 100 classes and 800,000 documents; this implies that a text categorization task needs some 8,000 training and testing documents per class. Another problem is that the resulting classifier works only for the document collection on parts of which it has been trained. In addition, [20] argues that the most serious problem in text categorization evaluations is the lack of standard data collections, and shows how some versions of the same collection have a strong impact on performance, whereas other versions do not.
In document clustering, the categories are produced automatically rather than pre-defined: both the cluster labels and the relationships between them are automatically generated. Labelling the clusters is a major research problem, and relationships between categories, such as equivalence, related-term and hierarchical relationships, are even more difficult to derive automatically ([18], p. 168). "Automatically-derived structures often result in heterogeneous criteria for category membership and can be difficult to understand" [5]. Clusters also change as new documents are added to the collection.
In string-to-string matching, matching is conducted between a controlled vocabulary and the text of the documents to be classified. This approach does not require training documents. Usually weighting schemes are applied to indicate the degree to which a term from a document to be classified is significant for the document's topicality. The importance of controlled vocabularies such as thesauri for automated classification has been recognized in recent research: [4] used a thesaurus to improve the performance of a k-NN classifier and improved precision by about 14% without degrading recall; [15] showed that information from a subject-specific thesaurus improved key-phrase extraction by more than 1.5 times in F1, precision and recall; and [6] demonstrated that subject ontologies can help improve word sense disambiguation.
Thus, the approach to automated subject classification chosen for the crawler was string matching. Apart from the fact that no training documents are required, a major motivation for this approach was to re-use the intellectual effort that has gone into creating such a controlled vocabulary. Vocabulary control in thesauri is achieved in several ways, of which the following are beneficial for automated classification:
The algorithm searches documents to be classified for terms from the Ei thesaurus and classification scheme. To this end, a term list is created, containing class captions, the various thesaurus terms, and the classes which those terms and captions denote. The term list consists of triplets: a term (single word, Boolean term or phrase), the class which the term designates or maps to, and a weight. Boolean terms consist of words that must all be present, but in any order and at any distance from each other. Boolean terms are not explicitly part of the Ei thesaurus, so they had to be created in a pre-processing step: they are taken to be those terms which contain any of the following strings: 'and', 'vs.' (short for versus), ',' (comma), ';' (semi-colon, separating different concepts in class names), '(' (parenthesis, indicating the context of a homonym), ':' (colon, indicating a more specific description of the previous term in a class name), and '--' (double dash, indicating a heading-subheading relationship). Upper-case words from the Ei thesaurus and classification scheme are left in upper case in the term list, on the assumption that they are acronyms; all other words containing at least one lower-case letter are converted to lower case. Geographical names are excluded on the grounds that they are not engineering-specific in any sense.
The following is an excerpt from the Ei thesaurus and classification scheme, from which the term-list excerpt further below was created.
From the classification scheme (captions):
931.2 Physical Properties of Gases, Liquids and Solids
...
942.1 Electric and Electronic Instruments
...
943.2 Mechanical Variables Measurements

From the thesaurus:

TM Amperometric sensors
UF Sensors--Amperometric measurements
MC 942.1

TM Angle measurement
UF Angular measurement
UF Mechanical variables measurement--Angles
BT Spatial variables measurement
RT Micrometers
MC 943.2

TM Anisotropy
NT Magnetic anisotropy
MC 931.2

TM stands for the preferred term, UF for synonym, BT for broader term, RT for related term, NT for narrower term; MC represents the main class. Below is an excerpt from one term list, based on the above examples:
1: electric @and electronic instruments=942.1
1: mechanical variables measurements=943.2
1: physical properties of gases @and liquids @and solids=931.2
1: amperometric sensors=942.1
1: sensors @and amperometric measurements=942.1
1: angle measurement=943.2
1: angular measurement=943.2
1: mechanical variables measurement @and angles=943.2
1: spatial variables measurement=943.2
1: micrometers=943.2
1: anisotropy=931.2
1: magnetic anisotropy=931.2
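To make the pre-processing step described above concrete, the following is a minimal sketch (not the actual implementation) of how a thesaurus entry could be turned into a term-list triplet, with Boolean terms built from the listed separator strings and upper-case words kept as assumed acronyms. Function and variable names are illustrative.

```python
import re

# Separators that turn a thesaurus term or caption into a Boolean term:
# 'and', 'vs.', comma, semi-colon, parenthesis, colon and double dash.
BOOLEAN_SEPARATORS = re.compile(r"\band\b|vs\.|[,;():]|--")

def normalise_word(word):
    # Keep assumed acronyms (fully upper-case words) as they are; lower-case the rest.
    return word if word.isupper() else word.lower()

def make_term_entry(raw_term, class_code, weight=1):
    """Return a (term, class, weight) triplet for the term list.

    A term containing any of the separator strings becomes a Boolean term,
    here represented by joining its parts with '@and'.
    """
    parts = [p.strip() for p in BOOLEAN_SEPARATORS.split(raw_term) if p.strip()]
    words = [" ".join(normalise_word(w) for w in part.split()) for part in parts]
    term = " @and ".join(words) if len(words) > 1 else words[0]
    return term, class_code, weight

# Example: the synonym 'Sensors--Amperometric measurements' (main class 942.1)
# from the excerpt above becomes a Boolean term.
print(make_term_entry("Sensors--Amperometric measurements", "942.1"))
# ('sensors @and amperometric measurements', '942.1', 1)
```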
The algorithm looks for strings from a given term list in the document to be classified; if a string (e.g. 'magnetic anisotropy' from the above list) is found, the class(es) assigned to that string in the term list ('931.2' in our example) are assigned to the document. One class can be designated by many terms, and each time one of its terms is found, the corresponding weight ('1' in our example) is added to that class's score.
The scores for each class are summed up and classes with scores above a certain cut-off (heuristically defined) can be selected as the final ones for that document. Experiments with different weights and cut-offs are described in the following sections.
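As an illustration only (a sketch of the matching and scoring steps just described, not the crawler's actual code), the procedure could be expressed as follows, assuming term-list entries in the 'weight: term=class' form shown above:

```python
from collections import defaultdict

def parse_entry(entry):
    # 'weight: term=class' -> (weight, term, class)
    weight_part, rest = entry.split(":", 1)
    term, cls = rest.rsplit("=", 1)
    return float(weight_part), term.strip(), cls.strip()

def classify(document_text, term_list_entries, cutoff=1.0):
    text = document_text.lower()
    scores = defaultdict(float)
    for entry in term_list_entries:
        weight, term, cls = parse_entry(entry)
        parts = [p.strip() for p in term.split("@and")]
        # A Boolean term matches only if every part occurs, in any order and
        # at any distance; a plain term or phrase must occur verbatim.
        if all(part in text for part in parts):
            scores[cls] += weight
    # Classes whose summed score reaches the (heuristic) cut-off are assigned.
    return {cls: s for cls, s in scores.items() if s >= cutoff}

entries = [
    "1: magnetic anisotropy=931.2",
    "1: anisotropy=931.2",
    "1: sensors @and amperometric measurements=942.1",
]
print(classify("Magnetic anisotropy of thin films ...", entries))
# {'931.2': 2.0} -- both terms designating 931.2 are found and their weights summed
```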
The Ei thesaurus and classification scheme is rather large and deep (five hierarchical levels), allowing many different choices. Without a thorough qualitative analysis of the automatically assigned classes, one cannot be sure whether, for example, classes assigned by the algorithm that were not assigned intellectually are actually wrong, or whether they were left out by mistake or because of indexing policy.
In addition, subject indexers make errors, such as those related to exhaustivity policy (too many or too few terms get assigned) or to specificity of indexing (usually meaning that the most specific applicable term was not assigned); they may also omit important terms or assign an obviously incorrect term ([13], p. 86-87). Furthermore, it has been reported that different people, whether users or subject indexers, assign different subject terms or classes to the same document. Studies of inter-indexer and intra-indexer consistency generally report low consistency ([16], p. 99-101). Two main factors seem to affect it: 1) higher exhaustivity and specificity of subject indexing both lead to lower consistency (indexers choose the same first term for the major subject of the document, but consistency decreases as they choose more classes or terms); 2) the bigger the vocabulary, or the more choices the indexers have, the less likely they are to choose the same classes or terms (ibid.). Few studies have been conducted as to why indexers disagree [2].
Automated classification experiments today are mostly conducted under controlled conditions, ignoring the fact that the purpose of automated classification is improved information retrieval, which should be evaluated in context (cf. [12]). As Sebastiani ([17], p. 32) puts it, "the evaluation of document classifiers is typically conducted experimentally, rather than analytically. The reason is that we would need a formal specification of the problem that the system is trying to solve (e.g. with respect to what correctness and completeness are defined), and the central notion, that of membership of a document in a category, is, due to its subjective character, inherently nonformalizable".
Because a methodology for such experiments has yet to be developed, and because of limited resources, we follow the traditional approach to evaluation: we assume that the intellectually assigned classes in the data collection are correct and compare the results of automated classification against them.
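As a concrete illustration of this evaluation setup, a minimal sketch of micro-averaged precision, recall and F1 over per-document class sets might look as follows (the function name and the toy data are purely illustrative):

```python
def micro_averaged_scores(gold_sets, predicted_sets):
    """Micro-averaged precision, recall and F1 for multi-label class assignment."""
    tp = fp = fn = 0
    for gold, predicted in zip(gold_sets, predicted_sets):
        tp += len(gold & predicted)   # correctly assigned classes
        fp += len(predicted - gold)   # assigned but not intellectually assigned
        fn += len(gold - predicted)   # intellectually assigned classes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"931.2"}, {"942.1", "943.2"}]
pred = [{"931.2", "942.1"}, {"943.2"}]
print(micro_averaged_scores(gold, pred))  # (0.666..., 0.666..., 0.666...)
```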
In our collection we included only those documents that have at least one class within the area of Engineering, General, which is covered by the 92 classes we selected. The subset of 35,166 documents was selected from the Compendex database by simply retrieving the first documents offered by the Compendex user interface, without changing any preferences; the query asked for documents that had been assigned a certain class. A minimum of 100 documents per class was retrieved, at several different points in time during the last year. Compendex is a commercial database, so the subset cannot be made available to others; however, the authors can provide the documents' identification numbers on request. In the data collection there were on average 838 documents per class, ranging from 138 to 5,230.
With the other term types, too many classes get assigned, but this could be dealt with in the future by introducing cut-offs. Each class is on average designated by 88 terms, ranging from 1 to 756 terms per class. The majority of terms are related terms, followed by synonyms and preferred terms. An examination of the 10 best-performing classes showed that the number of terms designating a class alone does not seem to be proportional to performance. Moreover, these best-performing classes do not have a similar distribution of the types of terms designating them, i.e. the percentage of certain term types does not seem to be directly related to performance. The same was found for the 10 worst-performing classes.
In conclusion, the results showed that preferred terms perform best, whereas captions perform worst. Stemming improved performance in most cases, whereas the stop-word list did not have a significant impact. The majority of classes are found when using all the terms and stemming: micro-averaged recall is 73%. The remaining 27% of classes were not found because the words designating them in the term list did not occur in the text of the documents to be classified. This study implies that all types of terms should be used in the term list in order to achieve the best recall, but that higher weights could be given to preferred terms, captions and synonyms, as these yield the highest precision. Stemming seems useful for achieving higher recall, and could be balanced by introducing weights for stemmed terms. The stop-word list could be applied to captions, narrower terms and preferred terms.
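A sketch of how such weighting could be introduced is given below. The weight values are illustrative assumptions, not values used in the experiments, and nltk's PorterStemmer merely stands in for whatever stemmer is applied.

```python
from nltk.stem import PorterStemmer

TYPE_WEIGHTS = {           # assumed values, for illustration only
    "preferred": 1.0,      # TM
    "caption": 1.0,        # class captions
    "synonym": 1.0,        # UF
    "narrower": 0.7,       # NT
    "related": 0.5,        # RT
}
STEM_DISCOUNT = 0.8        # weight multiplier for matches found only after stemming

stemmer = PorterStemmer()

def term_weight(term_type, matched_after_stemming):
    weight = TYPE_WEIGHTS.get(term_type, 0.5)
    return weight * STEM_DISCOUNT if matched_after_stemming else weight

def stem_phrase(phrase):
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

# 'angle measurement' and 'angle measurements' collapse to the same stemmed
# form, so the plural in a document still contributes to class 943.2,
# but with a slightly lower weight.
print(stem_phrase("angle measurements") == stem_phrase("angle measurement"))  # True
print(term_weight("preferred", matched_after_stemming=True))                  # 0.8
```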
Synonyms were acquired using a rule-based system, SynoTerm, which infers synonymy relations between complex terms by employing semantic information extracted from lexical resources. Documents were first pre-processed, tagged with part-of-speech information and lemmatized. Terms were then identified using the term extractor YaTeA, based on parsing patterns and endogenous disambiguation. The semantic information provided by the WordNet database was used as a bootstrap to acquire synonyms of the basic terms.
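SynoTerm and YaTeA themselves are not sketched here. Purely as a rough illustration of the WordNet bootstrap idea, a snippet using nltk's WordNet interface (an assumption for illustration; the actual system does not rely on nltk) might look like this:

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wordnet_synonyms(word):
    """Return lemma names that WordNet lists in any synset of the word."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                synonyms.add(name)
    return sorted(synonyms)

# Candidate synonyms for the head word 'measurement' could be substituted into
# basic terms such as 'angle measurement' to generate candidate synonym terms,
# which would then be verified before being added to the term list.
print(wordnet_synonyms("measurement"))
```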
The number of classes that were enriched using these natural language processing methods is as follows: derivation 705 (93 adjective-to-noun, 78 noun-to-adjective, and 534 noun-to-verb derivations); permutation 1,373; coordination 483; insertion 742; preposition change 69; synonymy 292 automatically extracted, of which 168 were manually verified as correct.
When all the extracted terms are combined into one term list, the mean F1 is 0.14 with stemming, and micro-averaged recall is 0.11. This implies that enriching the original Ei-based term list should improve recall. Compared with the results obtained with the original term list (micro-averaged recall with stemming of 0.73), the best recall here, also micro-averaged and with stemming, is 0.76.
A number of weaknesses of the described approach were identified:
Ways to deal with these problems were proposed for further research, including enriching the term list with synonyms and different word forms, adjusting term weights and cut-off values, and word-sense disambiguation. In our further research we plan to implement automated methods. On the other hand, the suggested manual methods (e.g. adding synonyms) would at the same time improve Ei's original function, that of enhancing retrieval; with this purpose in mind, manually enriching a controlled vocabulary for automated classification or indexing would not necessarily create additional costs.
Since the two approaches are complementary, we investigated different combinations of the two based on combining their vocabularies. The linear SVM in the original setting was trained with no feature selection apart from stop-word removal. Additionally, three experiments were conducted using feature selection, taking:
In the experiments with the string-matching algorithm, four different term lists were created, and we report performance for each of them:
The SVM performs best using the original set of terms, and the string-matching approach also achieves its best precision with the original set of terms. The best recall for string matching is achieved when using descriptive terms. The reasons for these results need further investigation, including a larger data collection and combining the two approaches based on their predictions.
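As an illustrative sketch of the kind of feature-selection experiment described above, the snippet below restricts a linear SVM's bag-of-words features to a (here, tiny) controlled vocabulary. It uses scikit-learn; the library, variable names and toy data are assumptions for illustration, not the tools or data of the original experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

documents = ["magnetic anisotropy of thin films", "amperometric sensors for gases"]
labels = ["931.2", "942.1"]

# Feature selection: only terms from the Ei-based vocabulary are used as
# features; omitting the 'vocabulary' argument would fall back to the full
# document vocabulary (the original SVM setting, apart from stop-word removal).
ei_vocabulary = ["anisotropy", "magnetic anisotropy", "amperometric sensors"]
vectorizer = CountVectorizer(vocabulary=ei_vocabulary, ngram_range=(1, 2))
features = vectorizer.fit_transform(documents)

classifier = LinearSVC()
classifier.fit(features, labels)
print(classifier.predict(vectorizer.transform(["anisotropy measurements"])))
```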