Improving Concept Extraction from Radiology Reports
through Semantic Class Bootstrapping
 
Authors:
Scott L. DuVall, VA Salt Lake City Healthcare System
 
Background:
Recording information pertinent to clinical care has always involved a balance between getting the information needed and minimizing the burden on the caregiver responsible for recording it. Structured forms with drop-down lists and check boxes have appeared in electronic medical records as today’s equivalent of the complex paper forms used in the past (and still used at some institutions). Despite the inconvenience to the user, input in structured forms allows medical findings, diagnoses, and treatments to be stored in a structured format, one that can be manipulated by computer applications to automate billing, ensure patient care follows accepted protocols, enable decision support, and facilitate research. Structured data alone has limitations, though, as relevant information may not be recorded or may only be fully understood in the context of other information in the record.[1] These limitations, as well as the extra time required for structured data input and the constraints imposed on how information can be described, are why clinical records have always included narrative text. Notes allow clinicians to record the information they feel is relevant in the manner that is most convenient. Radiology reports typically consist of narrative text that is typed, transcribed, or dictated with speech recognition. Although most reports have some structure, in terms of the sections that appear or the templates that are used, a more formal representation of the information is needed before computer applications can manipulate it. Attempts have been made to combine the benefits of structured data with the convenience of narrative text, and the field continues to be an active area of research.[2-4]
 
Evaluation:
Concept extraction (CE) is the process of mapping terms used in narrative text to structured data. CE often relies on domain experts developing long lists of terms to identify concepts of interest. CE systems frequently fail to identify concept references when documents contain acronyms, abbreviations, or misspellings, or when they describe concepts at a different granularity than the terms in the lexicon. Thus, great effort is needed to ensure that the lexicon includes as many of the variations that might be found in the text as possible. In a study to identify a cohort of patients with pneumonia, 65% of false negative cases resulted because the documents did not contain terms in the domain expert lexicon.[5] Also common in radiology reports is language like “position of support devices unchanged,” which could actually map to several specific concepts like “peripheral intravenous central catheter,” “nasogastric tube,” and “automated implantable cardiac defibrillator.” To reduce the number of concept references missed and the workload of the domain experts, we explore an algorithm that uses a bootstrapping method to generate semantic class lexicons.
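The failure mode described above can be illustrated with a minimal sketch of lexicon-based CE. The toy lexicon, the function name extract_concepts, and the sample sentences are assumptions made for illustration only; they are not taken from the study cited above or from any particular CE system.

```python
import re

# Hypothetical toy lexicon mapping surface terms to concept labels.
LEXICON = {
    "pneumonia": "pneumonia",
    "nasogastric tube": "nasogastric tube",
}

def extract_concepts(text, lexicon):
    """Return the sorted concept labels whose terms occur in the text as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for term, concept in lexicon.items():
        # Word boundaries keep a term from matching inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(concept)
    return sorted(found)

# A report that uses the lexicon term verbatim is matched.
exact_hits = extract_concepts("Findings consistent with pneumonia.", LEXICON)

# Abbreviations absent from the lexicon ("NG tube", "pna") are missed.
abbreviated_hits = extract_concepts("NG tube in place. Possible pna.", LEXICON)
```

In this sketch the second report yields no concepts even though it refers to both lexicon entries, which is the kind of false negative that a larger, automatically generated lexicon is meant to reduce.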
 
Discussion:
The Bootstrapping Approach to Semantic Lexicon Induction using Semantic Knowledge (Basilisk) algorithm was developed by Thelen and Riloff in 2002.[6] We discuss the application of the algorithm and its potential uses with CE in radiology reports. Semantic lexicons differ from keyword lists in that each word belongs to a semantic class. This means that we could have a list of words commonly found in reports of pneumonia patients with the semantic class of “evidence for pneumonia” or a list of catheters, pacemakers, and tubes labeled as “implantable devices.” Bootstrapping is a computing method that takes its name from the colloquialism “pull yourself up by your bootstraps” and describes the process of a simple system activating a more complex system. In the case of Basilisk, a short list of seed words for each semantic class can be used to generate hundreds or even thousands of relevant terms. To do so, a processing step is first performed on the reports to extract all noun phrases and the linguistic patterns that contain them. The seed words form the initial list of terms in each semantic lexicon and are used to determine the linguistic patterns most representative of the semantic class. For example, if the semantic class “disease” had the seed words “adenopathy” and “osteochondrosis,” Basilisk may pull out the linguistic pattern “is consistent with <disease>.” This means that the phrases “is consistent with osteochondrosis” and “is consistent with adenopathy” commonly occur in the set of documents. Basilisk then uses the pattern to find other words that appear in it. If the phrase “is consistent with nephropathy” also commonly occurs in the document set, the word “nephropathy” would be added to the lexicon. Basilisk repeats the steps of finding patterns that commonly occur with the words in the semantic lexicon and finding which other words are used in those same patterns, adding words to the lexicon with each iteration. The process ends when a specified number of terms has been discovered or all commonly occurring patterns have been explored.
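The iteration described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published Basilisk implementation: it replaces the pattern and candidate scoring of the original algorithm with a plain count threshold, and it assumes the preprocessing step has already produced (pattern, word) pairs. The function name bootstrap_lexicon, the parameters max_terms and min_hits, the "<np>" placeholder, and the sample extractions are all hypothetical.

```python
from collections import defaultdict

def bootstrap_lexicon(extractions, seeds, max_terms=10, min_hits=2):
    """Grow one semantic class lexicon from seed words.

    extractions: iterable of (pattern, word) pairs from preprocessing.
    Simplified stand-in for the published scoring: a pattern is trusted
    once it extracts at least `min_hits` words already in the lexicon,
    and every other word it extracts is then added as a new term.
    """
    words_by_pattern = defaultdict(set)
    for pattern, word in extractions:
        words_by_pattern[pattern].add(word)

    lexicon = set(seeds)
    explored = set()
    while len(lexicon) < max_terms:
        # Unexplored patterns that co-occur with enough known lexicon words.
        trusted = {
            p for p, words in words_by_pattern.items()
            if p not in explored and len(words & lexicon) >= min_hits
        }
        if not trusted:
            break  # no unexplored pattern meets the threshold
        explored |= trusted
        candidates = set()
        for p in trusted:
            candidates |= words_by_pattern[p] - lexicon
        # Add candidates in a fixed order until the size limit is reached.
        for word in sorted(candidates):
            if len(lexicon) >= max_terms:
                break
            lexicon.add(word)
    return lexicon

# Hypothetical extractions for the "disease" class.
extractions = [
    ("is consistent with <np>", "adenopathy"),
    ("is consistent with <np>", "osteochondrosis"),
    ("is consistent with <np>", "nephropathy"),
    ("evidence of <np>", "nephropathy"),
    ("evidence of <np>", "adenopathy"),
    ("evidence of <np>", "pneumonia"),
    ("position of <np>", "catheter"),
]
disease_lexicon = bootstrap_lexicon(extractions, {"adenopathy", "osteochondrosis"})
```

With these toy inputs, the first pass trusts “is consistent with <np>” and adds “nephropathy”; that new word then makes “evidence of <np>” qualify on the second pass, which adds “pneumonia.” The pattern “position of <np>” never co-occurs with a lexicon word, so “catheter” stays out of the “disease” class.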
 
References:
[1] Pawlson LG, Scholle SH, Powers A. Comparison of administrative-only versus administrative plus chart review data for reporting HEDIS hybrid measures. Am J Manag Care. October 2007;13(10):553-8.

[2] Bleeker SE, Derksen-Lubsen G, van Ginneken AM, van der Lei J, Moll HA. Structured data entry for narrative data in a broad specialty: patient history and physical examination in pediatrics. BMC Med Inform Decis Mak. July 13, 2006;6:29.

[3] Matsumura Y, Kuwata S, Yamamoto Y, et al. Template-based data entry for general description in medical records and data transfer to data warehouse for analysis. Stud Health Technol Inform. 2007;129(Pt 1):412-6.

[4] Johnson SB, Bakken S, Dine D, et al. An electronic health record based on structured narrative. J Am Med Inform Assoc. January-February 2008;15(1):54-64. Epub October 18, 2007.

[5] South BR, DuVall SL, Gundlapalli AV, Samore MH, Delisle S. Leveraging Domain Knowledge Obtained from a Clinical Reference Standard to Identify Pneumonia Cases and Severity Category from Chest X-ray Reports. 2009 International Society for Disease Surveillance Eighth Annual Conference. (submitted)

[6] Thelen M, Riloff E. A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).