Lumping, Splitting, and Natural Language Processing

by Orin Hargraves


Ever since machine-readable dictionaries (MRDs) became available, there have been attempts to incorporate them into natural language processing (NLP) tasks, including information extraction, question answering, and summarization. The assumption has been that NLP software that incorporates a dictionary should be able to take advantage of the thousands of man-hours of lexical analysis that the dictionary represents, particularly for the central NLP task of word sense disambiguation. In practice, however, MRDs have proved disappointing to many in the NLP field, and some have even denounced them as useless. This paper looks at

1) what elements of dictionary data are of use to an NLP system;

2) how dictionary definition structure can abet or hinder NLP;

3) some features of contemporary dictionaries that make them particularly good or bad candidates for use in NLP.


Consideration is also given to what an ideal MRD-for-NLP dictionary would look like, and whether such a dictionary could serve the needs of both human and machine users. Finally, the paper surveys several current tools available to lexicographers, with a view to their usefulness in bridging the gap between NLP and dictionary databases more effectively.

Orin Hargraves,
Feb 5, 2010, 1:26 PM