
Say you have a dataset that you want a large language model (LLM) to be able to answer questions about. One option is to fine-tune the model by retraining it on the new dataset, but this approach has several problems. Retraining is expensive, and updating the model’s parameters can overwrite information it learned during its original training. Additionally, LLMs can hallucinate and make up information about the dataset. Finally, if the dataset contains sensitive information, you likely won’t want to use it for retraining, because the LLM could reveal that information to users without proper access. A better way to approach this problem is to use Retrieval-Augmented Generation (RAG) to connect the LLM to the dataset. RAG requires no retraining of the model, and access to the data can be controlled at the user level. With this approach, the LLM can access the stored data and use it to answer questions. This post walks through an in-depth example of using an LLM with a graph database.

For this example, we will use a synthetic, connected graph of Fast Healthcare Interoperability Resources (FHIR) data. FHIR is a standard for storing healthcare information as an interconnected graph of resources, such as patients, treatments, visits, and providers, so it maps naturally onto a graph database platform such as Neo4j. The FHIR schema is complicated, with over 25 node types and hundreds of attribute and relationship types. Below is an example of a subgraph from this kind of database, showing all of the nodes connected to a single patient node.

Even just a subgraph of one patient’s neighbors is complicated, so successfully working with a database of this complexity requires some special techniques. The main problem is simply connecting the LLM to the database so that it can use the data to answer queries. In this post, I want to describe two separate techniques for accomplishing this, each with its own strengths and weaknesses.

Approach 1: Vector Store

This approach converts the entire graph database into a format that the LLM can work with more easily. First, each node’s attributes are summarized into sentences, forming a single paragraph that contains all of the information about that node. These paragraphs are then embedded using a vector embedding model, which converts each paragraph into a vector of numbers. The vectors are kept in a vector store, a database designed for efficient similarity search. Using a similarity search, the node most relevant to an input query can be retrieved and used to answer a question. More details about this vector store method for storing graph databases can be found here.
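To make this concrete, below is a minimal sketch of the summarize-and-embed step. It assumes a Neo4j database accessed through the official Python driver and a sentence-transformers embedding model; the connection details, model choice, and attribute handling are illustrative, not the exact setup used here.

```python
# Sketch: turn each node's attributes into a paragraph and embed it.
# Assumes the neo4j driver and sentence-transformers packages are installed;
# connection details and the embedding model are placeholders.
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def summarize_node(node):
    """Summarize a node's label and attributes as a single paragraph."""
    label = next(iter(node.labels), "Node")
    sentences = [f"This is a {label} node."]
    for key, value in dict(node).items():
        sentences.append(f"Its {key} is {value}.")
    return " ".join(sentences)

node_ids, summaries, embeddings = [], [], []
with driver.session() as session:
    for record in session.run("MATCH (n) RETURN n, elementId(n) AS id"):
        text = summarize_node(record["n"])
        node_ids.append(record["id"])
        summaries.append(text)
        embeddings.append(embedder.encode(text))

# The (node_id, summary, embedding) triples can then be loaded into any
# vector store (FAISS, Chroma, a Neo4j vector index, ...) for similarity search.
```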

However, just retrieving the single most relevant node is not enough. The strength of graph databases is their ability to connect different pieces of data, and this method would ignore those connections entirely. Thus, once the most relevant node is retrieved from the vector store, it is used in a query on the original graph database to gather all of the nodes connected to it. This produces a subgraph containing the relevant parts of the database, and the text summary used to generate each neighbor’s embedding can be retrieved and included in the context provided to the LLM. This way, the LLM can answer questions involving multiple connected nodes, because it has access to the text summaries from all of them. For example, if I ask “What is the heart rate of patient Adan632 Cassin499 on their most recent measurement?”, the similarity search finds the patient node first, and then the nodes connected to Adan632’s node are retrieved as well, including the patient’s vital sign measurements from every visit. The LLM can then use its natural language reasoning skills to determine the most recent measurement and provide it to the user.
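A sketch of this neighbor-gathering step follows. It assumes the node summaries from the previous step were written back onto each node as a `summary` property, and the `vector_search` helper that returns the best-matching node’s id is hypothetical.

```python
# Sketch: expand the similarity-search hit into a subgraph of its neighbors
# and collect their text summaries as context for the LLM.
def gather_neighbor_context(session, question, vector_search):
    # id of the most relevant node, from the vector store (hypothetical helper)
    best_node_id = vector_search(question)
    record = session.run(
        """
        MATCH (n)-[]-(neighbor)
        WHERE elementId(n) = $node_id
        RETURN n.summary AS center, collect(neighbor.summary) AS neighbors
        """,
        node_id=best_node_id,
    ).single()
    return "\n".join([record["center"], *record["neighbors"]])

# The returned text is then prepended to the user's question as context
# when prompting the LLM.
```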

However, it is not quite that simple. As shown in the image above, a single patient node is connected to many other nodes, and the text summaries of each node can be long. When every connected node is retrieved and summarized, the result is often too much data for the LLM to use. LLMs have a maximum context size that limits how much additional context they can utilize when responding to queries, and retrieving all connected nodes usually exceeds it. Thus, we need a way to limit which nodes are gathered so that the context size is not exceeded. We accomplish this by limiting the query to connected nodes whose types are relevant to the question. For instance, in the example query, we would want to gather any ‘Observation’ nodes connected to the patient, because these contain the vital sign measurements we need. To determine which node types are relevant, we use another LLM call. By passing in the user’s input query along with a description of each node type and when it would be relevant, the LLM can return a simple list of node types relevant to the question, which we then use in the graph query. With the query limited to those node types, the gathered context for the final response is small enough for the LLM to use.
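One way to implement that node-type selection step is sketched below, assuming an OpenAI chat model; the node-type guide text and the prompt wording are illustrative.

```python
# Sketch: ask a separate LLM call which FHIR node types are relevant to the
# question, then restrict the neighbor query to those labels.
from openai import OpenAI

client = OpenAI()

NODE_TYPE_GUIDE = """\
Patient: demographic information about a person.
Observation: vital signs and lab measurements recorded at a visit.
Condition: a diagnosis recorded for a patient.
"""  # ...one line per node type, describing when it is relevant

def relevant_node_types(question):
    prompt = (
        "Here are descriptions of the node types in a FHIR graph database:\n"
        f"{NODE_TYPE_GUIDE}\n"
        "Return only a comma-separated list of the node types needed to answer "
        f"this question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return [t.strip() for t in response.choices[0].message.content.split(",")]

# The returned labels are injected into the neighbor query, e.g.
# MATCH (n)-[]-(m) WHERE elementId(n) = $id AND any(l IN labels(m) WHERE l IN $types)
```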

There is another problem that can occur with this method: the initial similarity search does not always return the node we actually want. These vector embeddings contain a lot of information, and matching them based on one small feature, such as a patient name, will not always work correctly. Therefore, we need a way to ensure that if a patient name is mentioned in a question, that specific patient node is used. Thankfully, advances in natural language processing (NLP) make this easy to do. One aspect of NLP is entity extraction, which allows specific types of information to be pulled out of text. Any incoming query can be analyzed to determine whether any patient names are mentioned, and if so, they are extracted and passed directly to the graph query to ensure the right patient node is used. We used the spaCy library in Python to accomplish this, and it performs well, successfully extracting patient names even when those names, such as Adan632 Cassin499, do not resemble real names.
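A minimal sketch of that extraction step is shown below; the choice of spaCy’s small English model, and any pipeline tuning, are assumptions rather than the exact configuration used here.

```python
# Sketch: pull patient names out of the incoming question with spaCy's
# named-entity recognizer, then match them against Patient nodes.
import spacy

nlp = spacy.load("en_core_web_sm")  # model choice is an assumption

def extract_patient_names(question):
    doc = nlp(question)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

names = extract_patient_names(
    "What is the heart rate of patient Adan632 Cassin499 on their most recent measurement?"
)
# Any extracted name is matched directly against Patient nodes, e.g.
# MATCH (p:Patient) WHERE p.name CONTAINS $name
```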

With these efforts in place, the LLM can successfully answer the example query and many others. When asked “What is the heart rate of patient Adan632 Cassin499 on their most recent measurement?”, the patient name is extracted, the ‘Observation’ nodes connected to that patient are collected, the combined text is passed to the LLM as context, and the answer is given: “According to the provided data, the most recent measurement of Adan632 Cassin499’s heart rate is 79 /min, which was recorded on 10/27/2022 at 03:32:54.”

This method works very well for a variety of query types. It excels when asking questions about specific patients or conditions, and thanks to the similarity search and the strength of vector embeddings, it is very flexible to rewordings and misspellings. However, it has a weakness: queries involving a large number of nodes are not answered correctly. For example, take the question “How many patients have the condition ‘Impacted molars’?”. When asked this question, the most relevant node is fetched, which is a condition node for ‘Impacted molars’. However, that particular condition node is connected to only one patient, so only one patient is retrieved to answer the question. In reality, multiple patients in the database have this condition, but each one is connected to its own ‘Impacted molars’ condition node rather than all affected patients sharing a single condition node. With this method, a question like this cannot be answered without the ability to survey multiple unconnected nodes, so a reworking of the approach is necessary.

Approach 2: Directly Querying the Database

Newer developments in LLM capabilities now allow LLMs to query graph databases directly. LangChain provides a pipeline through which user questions are rewritten into valid queries that run against Neo4j-hosted databases using the Cypher query language. The results of the query are then returned to the LLM to be used in the answer. However, we again run into the context size problem here. LangChain automatically gathers the full schema for the database, which contains all node types, attributes, and relationships. For a database as complex as FHIR, this schema is too large for most models to use, preventing valid Cypher queries from being generated. Luckily, newer LLMs have larger context windows, and models such as GPT-4o can now fit the entire schema.
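A minimal sketch of this pipeline with LangChain is below. Module paths and chain arguments have moved between LangChain releases, so treat the exact imports as an assumption, and the connection details are placeholders.

```python
# Sketch: let the LLM generate and run Cypher against the Neo4j database.
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

chain = GraphCypherQAChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    graph=graph,  # the chain reads the full schema from this connection
    verbose=True,
    allow_dangerous_requests=True,  # required by recent versions; omit on older ones
)

result = chain.invoke({"query": "How many patients have the condition 'Impacted molars'?"})
print(result)
```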

Still, without any modifications, the generated Cypher queries are often incorrect, frequently misinterpreting the meanings of certain attributes. This is because many attributes in the FHIR schema are unclear from their names alone. To address this, we can pass in additional context that provides examples for each node type, which explain what each attribute actually stores. As in the first approach, we limit this to the node types relevant to the question, so that the context is not overwhelmed with irrelevant information. Providing these examples led to a significant improvement in query quality. This method is much more flexible than the first approach in the types of queries it can respond to. For example, the question “How many patients have the condition ‘Impacted molars’?” now returns the correct answer, because the entire database can be queried to determine which patients are connected to condition nodes with the condition ‘Impacted molars’.
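One way to pass those per-type examples in is through a custom Cypher-generation prompt, sketched below. The example strings, the way they are assembled, and the prompt wording are illustrative; only the overall pattern follows LangChain’s documented prompt customization.

```python
# Sketch: add example nodes for the relevant types to the Cypher-generation
# prompt so the model knows what each attribute actually stores.
from langchain_core.prompts import PromptTemplate
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

# Built per question, e.g. by reusing the node-type selection step from
# Approach 1 and dumping one example node of each relevant type (illustrative).
node_type_examples = """\
Condition example: {code: 'Impacted molars', clinicalStatus: 'resolved'}
Patient example: {name: 'Adan632 Cassin499', birthDate: '1985-03-02'}
"""

CYPHER_TEMPLATE = """Task: Write a Cypher query for a FHIR graph database.
Schema:
{schema}

Example nodes for the types relevant to this question:
{examples}

Question: {question}
Cypher query:"""

cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    partial_variables={"examples": node_type_examples},
    template=CYPHER_TEMPLATE,
)

chain = GraphCypherQAChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    graph=graph,
    cypher_prompt=cypher_prompt,
    allow_dangerous_requests=True,
)
```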

This method still does not work in all cases, however. This technique is less flexible to differently-worded queries compared to the first approach. For example, asking for the ‘body weight’ of a patient at a certain observation will work because the measurement’s name is ‘body weight’, but just asking for the ‘weight’ of a patient will not match and will return nothing. Also, the model will sometimes make simple mistakes with some relationship types, timestamps, and other details that, while seemingly minor, prevent an answer from being reached.

Conclusion

Both of these approaches work well in a variety of cases, but neither is perfect. Approach 1 works very well for questions relating to only a couple of nodes, and it is very flexible to different wordings of questions. However, it cannot answer questions that require information from many unconnected nodes. Approach 2 can answer those types of questions and more, but it lacks the same flexibility with different wordings. Therefore, a joint approach that utilizes both vector store similarity searches and direct graph queries seems to be the most consistent solution. There may also be other ways to improve both methods, such as different embedding models or similarity search techniques for Approach 1, or LLM adapters to improve Cypher query writing for Approach 2. As it stands, both of these approaches reflect the remarkable ongoing advancement in LLM capabilities, and the ability to efficiently gather data from graph databases such as FHIR is a valuable and impactful development.
