Extending RadLex by Automated Extraction of Terms
from the Medical Literature
 
Authors:
Rebecca J. Hazen, Rochester Institute of Technology; Alexander P. van Esbroeck; David S. Channin, MD
 
Hypothesis:
An automated natural language processing pipeline can be developed to extract imaging observations and imaging observation characteristics from the medical literature for potential inclusion in RadLex.
 
Introduction:
RadLex, the Radiology Lexicon, is a controlled vocabulary representing the terms and concepts of radiology. It was developed by the Radiological Society of North America (RSNA) in recognition of a lack of coverage of these radiology concepts by other lexicons[1]. RadLex was created and extended through the contributions of committees of radiologists and members of other radiology organizations. RadLex currently consists of approximately 12,000 individual terms, organized in a hierarchy with 12 top-level categories.

Though large, RadLex is still missing concepts, particularly those related to imaging observations and imaging observation characteristics, the lingua franca of radiologists. While the manual, committee-based mechanism for extending RadLex has contributed greatly to the lexicon, it is unsustainable in the long term. This paper describes an automatic term extraction system to accelerate the expansion of RadLex.

 
Methods:
This work did not involve human subject research. All software was developed in the Java language (Sun Microsystems, Mountain View, CA).

An article finder application was developed using the Entrez Programming Utilities (www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) from the National Library of Medicine (NLM).

LexEVS (Mayo Clinic, Rochester, MN, USA) was used to access existing controlled medical terminologies including RadLex.

A term extraction system was written as a combination of standalone Java programs and pipelines for the Apache Unstructured Information Management Architecture (UIMA) framework v2.2.2 (http://incubator.apache.org/uima). UIMA is a system for processing and extracting information from large volumes of documents. The documents were pre-processed with a UIMA pipeline consisting of a tokenizer and a part-of-speech tagger. The Whitespace Tokenizer Annotator, a component of the UIMA framework, was used to identify individual words, punctuation marks, and sentences, marking them as discrete tokens. The Tagger Annotator, another UIMA component, was used to determine parts of speech for each word.

Ranked lists of imaging observations and imaging observation characteristics were created from the corpus of medical articles by a five-stage process: candidate phrase identification, annotation with existing lexicons, context processing, term ranking, and term splitting.
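
As an illustration of the first stage, candidate phrase identification, the following is a minimal sketch of a part-of-speech pattern filter. The pattern (one or more adjectives or nouns followed by a noun), the tag prefixes JJ and NN, and all class and method names are assumptions made for this example; the abstract does not specify the pattern or tag set actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinguisticFilterSketch {
    // Assumed pattern: (adjective | noun)+ noun, over one symbol per token.
    private static final Pattern PHRASE = Pattern.compile("[AN]+N");

    // Collapse a part-of-speech tag to one symbol (tag prefixes are assumed).
    static char symbol(String tag) {
        if (tag.startsWith("JJ")) return 'A'; // adjective
        if (tag.startsWith("NN")) return 'N'; // noun
        return 'O';                           // anything else
    }

    // Return every maximal token sequence whose tags fit the pattern.
    static List<String> candidates(String[] words, String[] tags) {
        StringBuilder symbols = new StringBuilder();
        for (String tag : tags) {
            symbols.append(symbol(tag));
        }
        List<String> phrases = new ArrayList<>();
        Matcher m = PHRASE.matcher(symbols);
        while (m.find()) {
            // Symbol offsets map one-to-one onto token offsets.
            phrases.add(String.join(" ", Arrays.copyOfRange(words, m.start(), m.end())));
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"a", "ground", "glass", "opacity", "was", "seen", "in", "the", "left", "lung"};
        String[] tags = {"DT", "NN", "NN", "NN", "VBD", "VBN", "IN", "DT", "JJ", "NN"};
        System.out.println(candidates(words, tags));
    }
}
```

Encoding each tag as a single character lets an ordinary regular expression express the phrase pattern, so the pattern can be changed without touching the matching code.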

A linguistic filter was developed to select sequences of tokens fitting a defined pattern of parts of speech; these sequences formed the initial list of candidate phrases. The LexEVS Annotator took the candidate phrases and determined whether or not they already existed within RadLex. Existing RadLex terms within the articles were then used to learn context words that could surround candidate phrases. To learn context information, the NC-value method[2] was enhanced to distinguish between preceding and following context words. The enhanced method also identified negative context words, which frequently co-occurred with other kinds of terms. The set of identified context words was then used to generate a context score for every candidate phrase. Candidate phrases were ranked by a “termhood” value, a modified NC-value[2]. This termhood value was calculated from the context scores, the number of words in the term, and the nesting of the term. The nesting of a term represents its independence, that is, how often it occurs on its own rather than within longer candidate phrases. The list generated by the term ranker consisted of candidate phrases deemed highly likely to contain imaging observations. The term splitter application then used the ratio between frequencies of words and phrases within a term to distinguish individual characteristics and observations.
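
To make the roles of term length and nesting concrete, the following is a minimal sketch of the standard C-value from the cited C-value/NC-value method[2], which is the component of that method driven by length and nesting. It is not the modified termhood value used by the system, whose details are not given here; the context score is omitted, and the class name, method names, and example frequencies are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CValueSketch {
    // True if 'shorter' occurs as a whole-word sub-phrase of a different phrase 'longer'.
    static boolean isNestedIn(String shorter, String longer) {
        return !shorter.equals(longer)
                && (" " + longer + " ").contains(" " + shorter + " ");
    }

    // Standard C-value: log2(length) * (own frequency minus the mean
    // frequency of the longer candidate phrases that contain the term).
    static double cValue(String term, Map<String, Integer> freq) {
        int words = term.split(" ").length;
        double weight = Math.log(words) / Math.log(2);
        int containing = 0;      // number of longer candidates containing the term
        int containingFreq = 0;  // summed frequency of those candidates
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (isNestedIn(term, e.getKey())) {
                containing++;
                containingFreq += e.getValue();
            }
        }
        double own = freq.get(term);
        if (containing == 0) {
            return weight * own; // term never appears inside a longer candidate
        }
        return weight * (own - (double) containingFreq / containing);
    }

    public static void main(String[] args) {
        // Hypothetical candidate frequencies, for illustration only.
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("ground glass opacity", 8);
        freq.put("ground glass", 10);
        freq.put("glass opacity", 8);
        for (String term : freq.keySet()) {
            System.out.println(term + " -> " + cValue(term, freq));
        }
    }
}
```

In this example "glass opacity" never occurs outside "ground glass opacity", so its score falls to zero, whereas "ground glass" keeps a positive score because it also occurs independently.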

The top 100 imaging observations and the top 100 imaging observation characteristics identified by the system and not already present in RadLex were evaluated by three board-certified radiologists, all experienced in working with RadLex. Candidate phrases were deemed “valid” if there was consensus among the three reviewers that the term was classified correctly. The precision of the system was calculated as the percentage of valid terms identified in the first 100 candidates.

 
Results:
The system was run on a corpus of 1,128 journal articles identified and retrieved by the article finder program. Processing these articles through the pipeline produced two ranked lists: 624 candidate imaging observations and 444 candidate imaging observation characteristics. The domain experts evaluated the top 100 terms in each list for inclusion in RadLex. All three experts validated 52 of the top 100 suggested imaging observation characteristics (precision of 52%) and 26 of the top 100 suggested imaging observations (precision of 26%).
 
Discussion:
The article finder application searched for articles using the PubMed query syntax.

The focus of the search was on each imaging modality and its associated findings. The search strings included “imaging findings [Title],” “CT findings [Title],” “MRI findings [Title],” “X-ray findings [Title],” and “PET findings [Title].” The articles were then located using their respective PubMed ID numbers and saved to the local disk. Medical journal articles were selected as the system input because they are credible, well-structured, and rich sources of terms of interest. Articles collected from older issues of journals help provide fundamental terms used across time, whereas more recent articles contain many new terms reflecting current technologies and advances in the domain.
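
For illustration, the search strings above could be turned into Entrez ESearch requests roughly as follows. This is a minimal sketch rather than the article finder itself: the class and method names and the retmax value are assumptions, and the code only builds request URLs (using the public ESearch endpoint and its db, term, and retmax parameters) without performing any network access.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EsearchUrlSketch {
    // Public E-utilities ESearch endpoint.
    static final String ESEARCH =
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    // Build an ESearch URL for one PubMed query; retmax caps the number of IDs returned.
    static String buildUrl(String term, int retmax) {
        String encoded = URLEncoder.encode(term, StandardCharsets.UTF_8);
        return ESEARCH + "?db=pubmed&retmax=" + retmax + "&term=" + encoded;
    }

    public static void main(String[] args) {
        String[] queries = {
            "imaging findings [Title]",
            "CT findings [Title]",
            "MRI findings [Title]",
            "X-ray findings [Title]",
            "PET findings [Title]"
        };
        for (String query : queries) {
            System.out.println(buildUrl(query, 100)); // 100 is an arbitrary example limit
        }
    }
}
```

Each URL returns a list of PubMed IDs matching the query, which corresponds to the step in which articles are located by their PubMed ID numbers before being saved locally.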

While the system still requires domain experts to evaluate the final lists, the process is significantly less time-consuming and less demanding for those experts. The goal in creating RadLex was to achieve a unified lexicon, increasing clarity and decreasing variation within the community. Domain experts must, therefore, validate each new term in order to maintain the integrity and utility of the lexicon.

Capable of collecting and handling large amounts of text, the term extraction system provides a mechanism for accelerated expansion of RadLex. Unlike previous methods of expansion, this system places very little demand on the domain expert and is not dependent on the availability of committees. Journal articles can be collected and processed at any time, producing new lists of terms for evaluation. By adjusting the context and processing steps, other categories of phrases within RadLex can be located and extracted from journal articles. These elements provide flexibility in further development and expansion of RadLex.

 
Conclusion:
An automatic term extraction system was able to identify new imaging observations and imaging observation characteristics from the medical literature. The entire system was developed and deployed in 10 weeks by two engineering undergraduate students. The system is low cost, in that it relies entirely on free and open source software. While imperfect, with room for improvement in precision, the system offers a platform for iterative improvement of the algorithms used and a potential to greatly accelerate expansion of RadLex.
 
References:
1. Langlotz CP, Caldwell SA. The completeness of existing lexicons for representing radiology report information. J Digit Imaging. 2002;15(Suppl 1):201-5. Epub March 2002.

2. Frantzi K, Ananiadou S, Mima H. Automatic Recognition of Multi-Word Terms: the C-value/NC-value method. Research and Advanced Technology for Digital Libraries. 1998;585-604.