Semantic classification of Dutch noun-noun compounds : a distributional semantics approach
Faculty of Arts. Linguistics and Literature
Computational linguistics in the Netherlands journal, p. 2-18
University of Antwerp
This article describes the first attempt to semantically analyse Dutch noun-noun compounds using the distributional hypothesis, which states that the semantics of a word is implicitly represented by the words in its context. The purpose is not only to classify compounds based on their semantics, but also to investigate under which circumstances this classification works best. Using Ó Séaghdha (2008) as a source of inspiration, a list of 1,802 noun-noun compounds was collected and annotated. The annotators had an annotation scheme and guidelines available, with six specific semantic categories (BE, HAVE, IN, ACTOR, INST, ABOUT) and five categories for less specific relations or incorrect compounds. An inter-annotator agreement of 60.2% was found on a 500-compound subset. The task of automatically analysing compound semantics was framed as a classification task to which supervised machine learning algorithms can be applied. The instance vectors were created by concatenating the vectors containing co-occurrence information on the compound constituents. In certain variants of the experiment, principal component analysis (PCA) was used to reduce the dimensionality of the dataset. Support vector machines (SVM) and instance-based learning were used for the machine learning experiments. A maximum F-score of 49.0% was reached on the normal bag-of-words (BOW) data using the SVM algorithm. The PCA data yielded a maximum F-score of 45.2%. These scores should be compared with a most frequent class baseline of 29.5%. The results achieved in both main variants significantly outperform this baseline.
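The construction of instance vectors described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the window size, the function names and the labels assigned to the two training compounds are all assumptions made for illustration, and a simple 1-nearest-neighbour classifier with cosine similarity stands in for the instance-based learner. PCA and SVM are omitted.

```python
from collections import Counter
import math

# Toy corpus (hypothetical); the study itself relied on far larger Dutch data.
CORPUS = [
    "de keuken heeft een tafel en een stoel",
    "de tafel staat in de keuken",
    "het boek gaat over geschiedenis",
    "de geschiedenis staat in het boek",
]

# Bag-of-words vocabulary: every token type in the toy corpus.
VOCAB = sorted({tok for sentence in CORPUS for tok in sentence.split()})


def cooccurrence_vector(word, sentences, vocab, window=2):
    """Count vocabulary words occurring within `window` tokens of `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return [counts[v] for v in vocab]


def instance_vector(modifier, head, sentences, vocab):
    """Concatenate the co-occurrence vectors of the two constituents."""
    return (cooccurrence_vector(modifier, sentences, vocab)
            + cooccurrence_vector(head, sentences, vocab))


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbour_label(query, training):
    """Return the label of the training instance most similar to `query`."""
    return max(training, key=lambda item: cosine(query, item[0]))[1]


# Hypothetical labelled compounds: keukentafel (IN), geschiedenisboek (ABOUT).
training = [
    (instance_vector("keuken", "tafel", CORPUS, VOCAB), "IN"),
    (instance_vector("geschiedenis", "boek", CORPUS, VOCAB), "ABOUT"),
]

# Classify an unseen compound, keukenstoel, by its nearest neighbour.
query = instance_vector("keuken", "stoel", CORPUS, VOCAB)
print(nearest_neighbour_label(query, training))
```

Because each instance is the concatenation of two constituent vectors, its dimensionality is twice the vocabulary size, which is one reason a reduction step such as PCA can be attractive on realistic vocabularies.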