Extracting Pathology Condition Codes with LLMs
Surgical pathology reports typically include extensive textual notes, which are used to determine final diagnoses and ICD condition codes. In this project, we demonstrate how LLMs can automatically parse pathology reports and assign the appropriate condition codes. This is a complex problem for several reasons. The reports are lengthy and vary widely in detail, writing style, and accuracy from case to case. There is also a large number of unique ICD codes that can be assigned, and each case typically belongs to several code classifications at once, so every applicable condition must be identified.
For this project, we used a dataset containing over 100k surgical pathology reports associated with cancer-related codes, limiting the potential ICD range to around 3k codes. The textual report served as the LLM input, and a list of the correct ICD codes was provided as the target output. A variety of LLM models were tested to determine their efficacy. These included several BERT-based models, among them BioClinicalBERT and PathologyBERT, which are modifications of the base BERT model trained to focus on biomedical and clinical text or on pathology reports, respectively. Additionally, we created our own BERT variant, UKPathBERT, trained on UK’s own pathology report dataset. For comparison, LLaMA 7B and 13B parameter models were also trained on the same pathology data in an instruction-based format. All models were instructed to produce a structured output listing condition codes, allowing for consistent evaluation of the multilabel classification problem.
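To make the instruction-based setup concrete, here is a minimal sketch of how a report and its code list might be paired into a prompt/completion training example. The prompt wording, the `build_example` helper, and the report and code values shown are all illustrative assumptions, not the exact format used in the project.

```python
def build_example(report_text, icd_codes):
    """Pair a pathology report with its ICD code list as an
    instruction-style training example (illustrative format)."""
    prompt = (
        "### Instruction:\n"
        "Read the surgical pathology report and list all applicable "
        "ICD condition codes, one per line.\n\n"
        f"### Input:\n{report_text}\n\n"
        "### Response:\n"
    )
    # Sorting the codes gives the model a deterministic target order.
    completion = "\n".join(sorted(icd_codes))
    return {"prompt": prompt, "completion": completion}

# Hypothetical case with two simultaneous condition codes.
example = build_example(
    "Invasive ductal carcinoma identified in the left breast specimen.",
    ["C77.3", "C50.912"],
)
```

Emitting the codes as a plain newline-separated list in the response keeps the output structured enough to parse back into a code set for evaluation.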
When evaluated, the LLaMA 13B model performed best, achieving 80% accuracy, where accuracy was defined as the proportion of cases in which all condition codes were correctly identified. If the model classified some, but not all, of the correct condition codes for a given case, the answer was counted as wholly incorrect. The LLaMA 7B model performed similarly well, while all of the tested BERT models were significantly worse at this task, typically achieving only around 5% accuracy. These results demonstrate the power of modern LLMs like LLaMA at analyzing unstructured input and generating structured, accurate output.
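The strict "all codes or nothing" metric described above is exact-match (subset) accuracy for multilabel classification. A small sketch of how it can be computed, with hypothetical predictions and gold labels for illustration:

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of cases where the predicted code set exactly
    equals the gold code set (partial matches count as wrong)."""
    correct = sum(set(p) == set(g) for p, g in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical example: first case fully correct, second only partial.
preds = [["C50.912", "C77.3"], ["C50.912"]]
gold = [["C50.912", "C77.3"], ["C50.912", "C77.3"]]
print(exact_match_accuracy(preds, gold))  # 0.5
```

Comparing sets rather than lists makes the metric insensitive to the order in which the model emits the codes, which matters when parsing free-form structured output.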
For more information, read our arXiv paper here.