“[..] artificial intelligence (AI) is well suited to the detection and reporting of follow-up recommendations because of the large volume of imaging studies requiring screening and the relatively standardized language employed by radiologists in preparing reports. Natural language processing (NLP) methods, including text pattern-matching and traditional machine-learning techniques, have been developed for this task. In this article, we use the term traditional machine learning to refer to all machine-learning methods that are not deep learning, and these terms will be defined in detail in the sections that follow. More recently, novel deep-learning methods for NLP have shown great promise for the detection of follow-up recommendations. [..]
We decided to develop an EHR [electronic health record]-integrated NLP system to automatically identify radiographic findings requiring follow-up. [..]
Given the anticipated high volume of lung findings and the structured clinical approach guiding their follow-up, we considered detection of lung findings requiring follow-up to be a realistic and impactful domain for clinical implementation of this system. [..]
We began by prototyping various NLP approaches to understand the scope of the problem and to inform our initial approach, starting with regular expressions (regex). Regex patterns are manually defined representations of word sequences of interest that can be used to identify relevant radiology report text via pattern matching. Because of its relative simplicity and ease of implementation, the regex method provides a useful baseline for the evaluation of subsequent approaches. We obtained an initial corpus of 200 radiology reports from our institution and annotated these for any findings requiring follow-up. On the basis of this corpus, 14 regex patterns were developed in an iterative design process to capture both the finding description and the follow-up recommended by the radiologist. A clinical expert validated that all reports in this corpus containing actionable findings were identified with 100% sensitivity and specificity. [..]
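For illustration, a pattern of this kind might look like the following in Python. This is a simplified toy stand-in written for illustration, not one of the study's 14 patterns:

```python
import re

# Toy pattern: a recommendation verb, follow-up language, and an imaging
# modality within one sentence. Illustrative only; not the study's patterns.
FOLLOWUP = re.compile(
    r"\b(recommend\w*|suggest\w*|advise\w*)\b"      # recommendation verb
    r"[^.]*\b(follow[- ]?up|repeat|dedicated)\b"    # follow-up language
    r"[^.]*\b(CT|MRI|ultrasound|imaging)\b",        # imaging modality
    re.IGNORECASE,
)

report = ("A 6 mm pulmonary nodule is seen in the right upper lobe. "
          "Recommend follow-up chest CT in 6 months.")
match = FOLLOWUP.search(report)
if match:
    print("Flagged:", match.group(0))   # -> Flagged: Recommend follow-up chest CT
```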
When we evaluated the regex approach on a larger data set comprising 10,916 labeled radiology reports containing 1,857 findings, the sensitivity and specificity fell to 74% and 82%, respectively, with an overall accuracy of 77% and positive predictive value (PPV) of 45%. Because regex is a literal text-search method, minor discrepancies such as misspellings or varied word placement may render relevant findings invisible to the patterns, resulting in false negatives. False positives may also occur when additional language qualifies the strength of the follow-up recommendation. Additionally, documentation practices may systematically change over time, and uncommon types of findings may not be sufficiently accounted for during regex pattern development. Considering that regex patterns may easily miss findings not anticipated during their development, as well as the inherent difficulty of scaling up the regex approach, we opted to explore more sophisticated methods for this task.
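Both failure modes are easy to reproduce with the toy pattern from the sketch above (again, an illustration rather than the study's patterns):

```python
# False negative: a single misspelling makes the report invisible to regex.
variant = "Reccomend folow-up chest CT in 6 months."
print(FOLLOWUP.search(variant))        # -> None

# False positive: qualifying language still matches the pattern.
hedged = "Recommend follow-up chest CT only if symptoms persist."
print(bool(FOLLOWUP.search(hedged)))   # -> True, despite the qualifier
```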
Next, we evaluated various machine-learning methods, which automate the process of learning problem-specific features and are thus better suited to handling the inherent variability of radiology reports. To determine the model best suited to detection of follow-up recommendations, we performed initial modeling on the annotated corpus of 200 radiology reports. We started with traditional NLP machine-learning methods such as logistic regression, using the Bag-of-Words method to convert our data from text to tabular form. Bag of Words counts the number of times each word from a selected vocabulary appears in a text sample, thereby representing the text as a numeric vector that can be fed to a model.
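A minimal sketch of this baseline using scikit-learn follows; the specific library and the toy data are assumptions made for illustration, not details from the study:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data; 1 = finding with a follow-up recommendation.
reports = [
    "Recommend follow-up chest CT in 6 months for pulmonary nodule.",
    "No acute cardiopulmonary abnormality.",
]
labels = [1, 0]

clf = make_pipeline(
    CountVectorizer(),      # Bag of Words: word counts -> sparse vectors
    LogisticRegression(),
)
clf.fit(reports, labels)
print(clf.predict(["Suggest repeat imaging to evaluate the nodule."]))
```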
However, a major disadvantage of this technique, and of traditional machine-learning NLP models in general, is that vectorizing model input data in this way discards the order of words in a given text. This yields less informative input data for a model to make predictions on and significantly hampers performance.
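The loss of word order is easy to demonstrate: two sentences with very different meanings can produce identical Bag-of-Words vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

a = "follow-up is recommended, biopsy is not"
b = "biopsy is recommended, follow-up is not"
X = CountVectorizer().fit_transform([a, b])

# The two rows are identical: word order (and thus meaning) is lost.
print((X[0].toarray() == X[1].toarray()).all())   # -> True
```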
We saw slightly improved performance using LightGBM and XGBoost, widely used traditional machine-learning models based on gradient boosting. Gradient boosting is a machine-learning technique that builds a strong ensemble model out of many specialized weak models. Typically, an ensemble of decision trees is built one by one, with each new tree trained to correct the errors made by the trees before it. Repeating this process yields an ensemble of highly specialized decision trees that, used together, make a strong predictive model. [..]
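Swapping the linear baseline for a gradient-boosted ensemble requires only a model change; here is a sketch using XGBoost's scikit-learn interface (LightGBM is analogous), with the same toy data as before and hyperparameters chosen purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

reports = [  # same placeholder data as in the earlier sketch
    "Recommend follow-up chest CT in 6 months for pulmonary nodule.",
    "No acute cardiopulmonary abnormality.",
]
labels = [1, 0]

X = CountVectorizer().fit_transform(reports)
model = XGBClassifier(
    n_estimators=200,    # number of trees added to the ensemble
    learning_rate=0.1,   # how strongly each new tree corrects the last
    max_depth=4,         # shallow trees keep each learner weak
)
model.fit(X, labels)
```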
Deep learning is a type of machine learning defined by the use of representation learning, in which the model learns features and representations of input data as a part of the training process. This is particularly beneficial for NLP tasks, because an understanding of language requires knowledge of contextual information in addition to vocabulary.
On the basis of comparisons of model performance on our initial data set, we decided to proceed with model development using a type of deep-learning architecture called bidirectional long short-term memory (BiLSTM). BiLSTMs are a form of recurrent neural network, a deep-learning architecture that operates on sequentially organized data such as text, preserving the position information of each word. A BiLSTM processes input data in both the forward and the backward direction, enabling it to learn word dependencies in a text corpus. [..]
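A minimal BiLSTM classifier in Keras illustrates the architecture class; the hyperparameters here are assumptions for the sketch, not the study's settings:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM = 20_000, 100   # assumed vocabulary/embedding sizes

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # One LSTM reads the token sequence forward and a second reads it
    # backward; their outputs are concatenated, so each position sees
    # context from both directions.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),   # finding vs. no finding
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```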
To further take advantage of advances in deep-learning strategies, we adopted another NLP technique called word embedding. To prepare text for use with machine-learning algorithms, individual words or word fragments must first be converted into a numeric, machine-readable format. Word embeddings are created by using deep learning in an unsupervised manner to analyze large databases of text, ultimately producing high-dimensional vector representations of words such that similar words lie close to each other in the vector space. We evaluated two word embeddings: GloVe, which was pretrained using Wikipedia and Gigaword as sources of text, and BioWordVec, which was trained on a text corpus derived from PubMed and Medical Subject Headings data. We decided to use GloVe, which outperformed BioWordVec despite the latter's focus on biomedical text; we attribute this result to the much larger size of GloVe's training corpus. [..]
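As a quick sketch of what such embeddings look like in practice, the pretrained Wikipedia-and-Gigaword GloVe vectors can be loaded through gensim's downloader; the neighbors suggested in the comments are indicative, not results from the study:

```python
import gensim.downloader

# 100-dimensional GloVe vectors pretrained on Wikipedia + Gigaword.
glove = gensim.downloader.load("glove-wiki-gigaword-100")

# Similar words sit close together in the vector space, so clinically
# related terms (e.g., "lesion", "cyst") tend to rank among the nearest
# neighbors of "nodule".
print(glove.most_similar("nodule", topn=3))
```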
We set up an online system on the internal NM [Northwestern Medicine] network using the open-source INCEpTION platform for annotation of semantic phenomena, in which trained clinical nurse annotators labeled curated radiology reports for relevant information. [..]
In the first stage of radiology report screening, the Finding/No Finding BiLSTM model classifies each radiology report as containing or not containing a finding with an associated follow-up recommendation. If [..] a finding is detected, the report is passed to two models working in parallel. One is an XGBoost model that performs comment extraction to identify the portion of the radiology report containing the relevant finding and recommended follow-up. This model is trained to predict the probability that any given sentence in a radiology report contains a follow-up recommendation and outputs the sentence with the highest predicted probability. The other model is a BiLSTM model [..]. If the finding is lung-related, then a final BiLSTM model classifies the recommended follow-up procedure as a chest CT or other procedure. [..]
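The staged control flow might be sketched as follows; the model objects, their interfaces, and the sentence splitter are placeholders invented for illustration, not the authors' implementation:

```python
def screen_report(report, finding_model, comment_model,
                  anatomy_model, procedure_model, split_sentences):
    # Stage 1: BiLSTM triage -- does the report contain a finding
    # with an associated follow-up recommendation?
    if not finding_model.predict(report):
        return None

    # Stage 2a: XGBoost comment extraction -- score every sentence and
    # keep the one with the highest predicted probability.
    sentences = split_sentences(report)
    comment = max(sentences, key=lambda s: comment_model.predict_proba(s))

    # Stage 2b: BiLSTM classification of the finding (run in parallel
    # with 2a in the deployed system).
    anatomy = anatomy_model.predict(report)

    # Stage 3: for lung-related findings, classify the recommended
    # follow-up procedure as chest CT or other.
    procedure = procedure_model.predict(report) if anatomy == "lung" else None
    return {"comment": comment, "anatomy": anatomy, "procedure": procedure}
```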
[The team developed an alert workflow that included an order set with patient notification through a patient portal and an escalation path if no action was taken.]
Of the 279 radiology reports with lung findings and follow-up recommendations, 215 were accurately identified by the system, and another 23 reports with no follow-up recommendation were incorrectly predicted to contain a lung follow-up, yielding a sensitivity of 77.1%, specificity of 99.5%, and PPV of 90.3% for lung follow-up identification. This indicated clinical performance of the models for lung-related finding detection comparable to our internal validation, as expected, although with somewhat decreased sensitivity. [..]
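The sensitivity and PPV follow directly from these counts, as a quick check confirms:

```python
tp, fp = 215, 23          # true positives and false positives reported above
sensitivity = tp / 279    # 215/279 ≈ 0.771 -> 77.1%
ppv = tp / (tp + fp)      # 215/238 ≈ 0.903 -> 90.3%
```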
At the conclusion of the 13-month evaluation period, more than 2,400 patients with a radiology report flagged by the Result Management system had already completed follow-up care, indicating that a significant number of relevant follow-up orders had been placed outside of our workflow. As more patients continue to complete recommended follow-ups, optimization of our workflow for clinical impact remains a top priority. All flagged studies that are not acknowledged continue to be tracked by the follow-up nurse team, who verify that appropriate follow-up is pursued. [..]
The low conversion rate from finding detection to BPA [best practice advisory] acknowledgment presents a substantial challenge to the efficacy of the NLP system and will likely improve with workflow refinements and greater clinician awareness.
Furthermore, only one-quarter of BPA acknowledgments resulted in the ordering of a follow-up imaging study through the system. Because not all follow-up recommendations involve imaging, we expect the follow-up ordering rate to be less than 100%. However, the most common acknowledgments that did not result in a follow-up order indicated “Managed by Oncology,” which applies to patients who already have established oncological follow-up relating to the finding. Refinement of the workflow to exclude these patients may mitigate these unnecessary alerts.”
Full article, J Domingo, G Galal, J Huang et al., NEJM Catalyst 2022.3.16