Abstract
Conversion of free-text strings in a natural language to a standard representation (codes) is an important reoccurring problem in biomedical informatics. Determining the content of a string involves identifying its meaningful constituents (morphemes). One current method of identifying these constituents is to look them up in a preexisting table (lexicon). Manual construction of lexicons and grammars in complex domains such as biomedicine is extremely laborious. As an alternative to the lexico-grammatical approach, we introduce a segmentation algorithm that automatically learns lexical and structural preferences from corpora via information compression. The method is based on the Minimum Description Length (MDL) principle from classic information theory.
Original language | English |
---|---|
Number of pages | 1 |
Journal | AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium |
State | Published - 2003 |