Team Members: Yao Xiao, Bowen Xu, Tong Xiao
This project was the final project for the Harvard AC215 (Fall 2024) course.
Warning
This project is no longer live online because it is not cheap to host the project on Google Cloud in the long term. If you are interested in creating a similar project, see the Recreating from Scratch section.
- Introduction
- Data Pipeline
- Model Training & Optimization
- Frontend Interface
- Backend Service
- Deployment
- Future Steps
- References
- Recreating from Scratch
Subdirectory READMEs:
Clinical trial data includes structured information collected during research studies designed to evaluate the safety, efficacy, and outcomes of medical interventions, treatments, or devices on human participants. In recent years, this type of data has expanded rapidly, especially in large repositories like ClinicalTrials.gov, creating immense opportunities to advance healthcare research and improve patient outcomes. However, the large volume and complexity of such data create challenges for researchers, clinicians, and patients in finding and understanding trials relevant to their specific needs.
One key limitation lies in the search functionality of platforms like ClinicalTrials.gov. The current search is based on fuzzy string matching, which struggles to deliver accurate results when user queries are not precise or involve long, complex sentences. Additionally, even when users locate a specific trial, understanding its details can be challenging due to the technical language and dense structure of the information.
Solution: To overcome these challenges, we aim to develop an AI-powered application that improves the information retrieval process for clinical trials. By leveraging state-of-the-art embedding models, our system will retrieve the most relevant trials from ClinicalTrials.gov based on user queries, even if those queries are less structured or precise. After retrieval, an intuitive conversational AI chatbot will enable users to explore specific details of a trial, such as endpoints, results, and eligibility criteria. Our app can benefit many different user groups. Clinical trial researchers can find trials relevant to their needs more accurately and interpret the results more efficiently. Patients may use it to identify recruiting clinical trials that they can participate in. This interactive approach streamlines access to critical information, empowering users to make more informed decisions in clinical research and patient care.
Project organization
├── .github > GitHub workflows
│ ├── workflows
│ └── dependabot.yaml
├── app
│ ├── backend/ > VeritasTrial backend
│ ├── frontend/ > VeritasTrial frontend
│ ├── docker-compose.yaml > VeritasTrial app compose
│ └── Makefile
├── deploy
│ ├── app/ > App deployment
│ ├── chromadb/ > ChromaDB deployment
│ ├── pipeline/ > Pipeline deployment
│ ├── inventory.yaml > Ansible inventory
│ └── ...
├── misc/ > Miscellaneous
├── secrets/ > Secrets (private)
├── src
│ ├── data-pipeline/ > Data pipeline
│ ├── embedding-model/ > Embedding model
│ ├── construct-qa/ > QA construction (legacy)
│ └── finetune-model/ > Model finetuning (legacy)
├── .gitignore
├── LICENSE.md
└── README.md
Raw data collection: We begin by collecting clinical trial data from ClinicalTrials.gov using their API. This data is preprocessed to retain only the necessary columns, focusing on clinical trials that are completed and have results available. These filtered trials are stored as JSONL files in Google Cloud Storage (GCS) buckets for easy access and scalability in downstream tasks. For more details, see /src/data-pipeline/.
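As a rough illustration, the sketch below shows how such a collection step might look against the ClinicalTrials.gov v2 API. The query parameters and pagination fields are assumptions about that API, not our actual pipeline code in /src/data-pipeline/.

```python
"""Hypothetical sketch: fetch completed trials with results and dump them to JSONL."""
import json
import requests

API_URL = "https://clinicaltrials.gov/api/v2/studies"  # v2 studies endpoint

def fetch_trials(page_size: int = 100, max_pages: int = 5):
    """Page through the studies endpoint, keeping completed trials that have results."""
    params = {
        "filter.overallStatus": "COMPLETED",  # assumed filter name for trial status
        "aggFilters": "results:with",         # assumed filter for "has results"
        "pageSize": page_size,
    }
    page_token, pages = None, 0
    while pages < max_pages:
        if page_token:
            params["pageToken"] = page_token
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        body = resp.json()
        yield from body.get("studies", [])
        page_token = body.get("nextPageToken")
        pages += 1
        if not page_token:
            break

if __name__ == "__main__":
    with open("trials.jsonl", "w") as f:
        for study in fetch_trials():
            f.write(json.dumps(study) + "\n")
```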
Training data curation (embedding model): To finetune our embedding model, we curate a triplet dataset specifically designed to improve its ability to match brief trial titles to corresponding summary-level information. The triplet structure is as follows:
| Query | Positive | Negative |
| --- | --- | --- |
| Title | Summary | Summaries from 5 other random trials |
This dataset is tailored for contrastive learning, enabling the embedding model to distinguish between relevant and irrelevant matches effectively. By learning from these structured triplets, the model is better equipped to embed clinical trial titles and summaries into a shared semantic space for improved retrieval accuracy. This step is not included in our pipeline but manually executed in Google Colab; see /misc/Finetune-BGE.ipynb for more details.
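For illustration, a minimal sketch of how such triplets could be assembled from trial records is shown below. The field names (`title`, `summary`) and the in-memory records are hypothetical, not the actual curation code.

```python
"""Hypothetical sketch: build (query, positive, negative) triplets from trial records."""
import json
import random

def build_triplets(trials, negatives_per_trial=5, seed=0):
    """Pair each trial title with its own summary (positive) and summaries
    sampled from other trials (negatives)."""
    rng = random.Random(seed)
    triplets = []
    for i, trial in enumerate(trials):
        others = [t for j, t in enumerate(trials) if j != i]
        for neg in rng.sample(others, k=min(negatives_per_trial, len(others))):
            triplets.append({
                "query": trial["title"],       # assumed field names
                "positive": trial["summary"],
                "negative": neg["summary"],
            })
    return triplets

# Example usage with illustrative in-memory records:
trials = [
    {"title": "Trial A", "summary": "Summary of trial A."},
    {"title": "Trial B", "summary": "Summary of trial B."},
    {"title": "Trial C", "summary": "Summary of trial C."},
]
print(json.dumps(build_triplets(trials, negatives_per_trial=2), indent=2))
```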
Training data curation (LLM): To finetune the LLM, we use an existing PubMedQA dataset and a self-curated ClinicalTrialQA dataset. For the former, we utilize the PubMedQA dataset available on Hugging Face, which contains 211K biomedical question-answer pairs derived from PubMed abstracts. This dataset provides a strong foundation for fine-tuning Gemini on domain-specific QA tasks. For the latter, to further specialize the chatbot for clinical trials in our dataset, we generate additional QA pairs directly from trial documents. Using Gemini 1.5 Flash on Vertex AI, we prompt the model to create relevant question-answer pairs based on the context of individual trial documents. This augmented dataset ensures that Gemini is well-equipped to handle nuanced questions about clinical trials, such as eligibility criteria, study results, or endpoints. For more details, see /src/construct-qa (legacy).
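A minimal sketch of how such QA generation could be driven through the Vertex AI SDK is shown below. The prompt wording, project ID, and region are illustrative assumptions, not our exact generation script in /src/construct-qa.

```python
"""Hypothetical sketch: generate QA pairs for one trial document with Gemini on Vertex AI."""
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")  # assumed project/region
model = GenerativeModel("gemini-1.5-flash-002")

def generate_qa_pairs(trial_document: str, num_pairs: int = 5) -> str:
    """Prompt Gemini to produce QA pairs grounded in a single trial document."""
    prompt = (
        f"Given the following clinical trial document, write {num_pairs} "
        "question-answer pairs about its eligibility criteria, endpoints, and results. "
        "Return one JSON object per line with 'question' and 'answer' keys.\n\n"
        f"{trial_document}"
    )
    response = model.generate_content(prompt)
    return response.text  # validate and parse the JSON lines downstream
```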
Vector database: After fine-tuning the embedding model, we embed the summary text for each clinical trial into a high-dimensional vector space. These embeddings, along with relevant metadata (e.g., study phases, conditions, eligibility criteria, etc.), are stored in a ChromaDB vector database. The database lives on a VM instance in GCP Compute Engine, which exposes the ChromaDB service to the rest of the system. See /deploy for deploying ChromaDB on GCP. For more details about the pipeline, see /src/embedding-model.
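The sketch below illustrates how summaries and metadata could be embedded and added to a ChromaDB collection over HTTP. The host address, collection name, and field names are assumptions, not the actual pipeline code.

```python
"""Hypothetical sketch: store trial summary embeddings and metadata in ChromaDB."""
import chromadb
from sentence_transformers import SentenceTransformer

# Connect to the ChromaDB server running on the GCP VM (address is illustrative).
client = chromadb.HttpClient(host="10.0.0.2", port=8000)
collection = client.get_or_create_collection(name="clinical-trials")  # assumed name

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # or the finetuned checkpoint

def index_trials(trials):
    """Embed each trial summary and add it with metadata to the collection."""
    summaries = [t["summary"] for t in trials]
    embeddings = model.encode(summaries, normalize_embeddings=True)
    collection.add(
        ids=[t["nct_id"] for t in trials],  # assumed field names
        embeddings=embeddings.tolist(),
        documents=summaries,
        metadatas=[
            {"phase": t.get("phase", ""), "conditions": ", ".join(t.get("conditions", []))}
            for t in trials
        ],
    )
```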
We have validated the quality of our embeddings in the vector database by generating N random samples, with one of them being the correct sample, and checking whether the model can accurately retrieve the correct one. Both the AUROC score (area under the receiver operating characteristic curve) and the MRR score (mean reciprocal rank) are above 0.99, indicating high retrieval accuracy.
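For reference, the sketch below shows one way such an MRR check could be implemented with sentence-transformers. The sampling scheme and field names are illustrative assumptions rather than our exact evaluation script.

```python
"""Hypothetical sketch: estimate retrieval MRR by ranking one correct summary among N candidates."""
import random
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # or the finetuned checkpoint

def mean_reciprocal_rank(trials, n_candidates=10, seed=0):
    """For each trial title, rank its own summary against random distractor summaries."""
    rng = random.Random(seed)
    reciprocal_ranks = []
    for i, trial in enumerate(trials):
        distractors = rng.sample(
            [t for j, t in enumerate(trials) if j != i],
            k=min(n_candidates - 1, len(trials) - 1),
        )
        candidates = [trial["summary"]] + [d["summary"] for d in distractors]
        query_emb = model.encode(trial["title"], normalize_embeddings=True)
        cand_embs = model.encode(candidates, normalize_embeddings=True)
        scores = util.cos_sim(query_emb, cand_embs)[0]
        # 1-based rank of the correct summary (index 0) among all candidates.
        rank = int((scores > scores[0]).sum().item()) + 1
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```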
Our application design involves two model training processes: finetuning the embedding model for trial retrieval and finetuning the LLM for trial interpretation.
Finetuning the embedding model: We use BGE-small-en-v1.5 as the base embedding model and finetune it with the sentence-transformers package on the triplet dataset. We adopt a contrastive learning approach, training with the triplet loss function. This step is not included in our pipeline but manually executed in Google Colab; see /misc/Finetune-BGE.ipynb for more details.
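A minimal sketch of this finetuning setup with the sentence-transformers triplet loss is shown below. The example triplets and hyperparameters are placeholders, not the values used in the Colab notebook.

```python
"""Hypothetical sketch: finetune BGE-small-en-v1.5 with a triplet loss."""
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Triplets as produced in the data curation step (contents are illustrative).
train_examples = [
    InputExample(texts=["Trial A title", "Trial A summary", "Trial B summary"]),
    InputExample(texts=["Trial B title", "Trial B summary", "Trial C summary"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.TripletLoss(model=model)

# Hyperparameters here are placeholders, not the values used in the project.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="finetuned-bge-small",
)
```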
Finetuning the LLM: We finetuned the Gemini 1.5 Flash model (gemini-1.5-flash-002) with 29,800 messages (15,764,259 tokens) over 3 epochs. The learning rate multiplier is 0.1 and the adapter size is 4. No sample was long enough to be truncated. The training metrics during the supervised finetuning process are as follows:
The validation metrics (on a validation set of size 256) during the supervised finetuning process are as follows:
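For reference, a sketch of how such a supervised tuning job could be launched with the Vertex AI SDK is shown below. The dataset paths, display name, and project settings are placeholders, and the exact SDK surface may differ from what we used.

```python
"""Hypothetical sketch: launch Gemini supervised finetuning on Vertex AI."""
import vertexai
from vertexai.tuning import sft

vertexai.init(project="your-gcp-project", location="us-central1")  # assumed project/region

# Dataset paths are placeholders; the hyperparameters mirror the ones reported above.
tuning_job = sft.train(
    source_model="gemini-1.5-flash-002",
    train_dataset="gs://your-bucket/qa-train.jsonl",
    validation_dataset="gs://your-bucket/qa-validation.jsonl",
    epochs=3,
    learning_rate_multiplier=0.1,
    adapter_size=4,
    tuned_model_display_name="veritastrial-gemini",
)
print(tuning_job.resource_name)  # track the job in the Vertex AI console
```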
See /app for details about the application.
We implemented a frontend prototype application using React and TypeScript to streamline clinical trial retrieval. The interface features a range of filters, including options for eligible sex, study type, study phases, patient types, age range, and result dates, allowing users to customize their search. Users can input their query in a text box, and then the system will retrieve relevant clinical trials that are displayed with the trial title, a clickable link to the trial's endpoint, and a chat icon that enables users to initiate a detailed conversation about the trial. Users can also adjust the number of results to display using the Top K option in the bottom-right corner, with choices of 1, 3, 5, 10, 20, or 30. Additionally, the top-right corner provides quick access to the project's GitHub repository and a toggle for switching between light and dark modes, ensuring a user-friendly experience.
After selecting a specific trial from the retrieval results, users can proceed to ask detailed questions about the trial, such as its summary, outcome measures, or sponsor information. This functionality is supported through the side chat panel. If the chat icon for a trial is clicked for the first time, a new chat is created in the panel. However, if the trial has already been accessed before, the interface automatically redirects the user to the existing chat for that trial, ensuring continuity and convenience. To return to the retrieval interface, users can simply click the "Trial Retrieval" button at the top of the side panel.
There are many more UI/UX designs to enhance user experience that we will not cover here. Some of them include: responsive design for mobile devices, copy button for queries and responses, automatic scroll-into-view for sidebars and chat panels, GitHub-flavored markdown support, etc.
See /app for details about the application.
We implemented the backend for VeritasTrial using FastAPI to manage RESTful APIs that facilitate seamless communication with the frontend. The backend handles clinical trial retrieval, filtering, and conversational interactions. Below are the implemented API endpoints:
- `/heartbeat`: A `GET` endpoint that checks server health by returning the current timestamp in nanoseconds.
- `/retrieve`: A `GET` endpoint that retrieves clinical trials based on user queries, specified filters (e.g., study type, age range, date range), and the desired number of results (Top K).
- `/meta/{item_id}`: A `GET` endpoint that retrieves metadata for a specific trial using its unique ID.
- `/chat/{model}/{item_id}`: A `POST` endpoint that enables interaction with a generative AI model about a specific trial. Users can ask questions (e.g., trial outcomes, sponsors), and the system provides context-aware answers. Chat sessions are automatically created and destroyed on demand.
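To illustrate the overall shape of these routes, a minimal FastAPI sketch is shown below. The handler bodies are stubs and the response schemas are assumptions, not the actual backend implementation.

```python
"""Hypothetical sketch of the route shapes; not the actual VeritasTrial backend."""
import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/heartbeat")
def heartbeat():
    # Report server health as the current timestamp in nanoseconds.
    return {"timestamp": time.time_ns()}

@app.get("/retrieve")
def retrieve(query: str, top_k: int = 10):
    # Embed the query, apply the requested filters, and search ChromaDB (omitted here).
    return {"ids": [], "titles": []}

@app.get("/meta/{item_id}")
def get_metadata(item_id: str):
    # Look up the stored metadata for the trial with this ID (omitted here).
    return {"id": item_id, "metadata": {}}

@app.post("/chat/{model}/{item_id}")
def chat(model: str, item_id: str, payload: dict):
    # Forward the question plus trial context to the generative model (omitted here).
    return {"answer": ""}
```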
For deployment instructions and details, see /deploy. Here we provide a brief overview.
We use Ansible playbooks to automate the deployment of the VeritasTrial application on Google Kubernetes Engine (GKE). We create or update a Kubernetes cluster with the specified configuration, including node pools and machine types. Then, we deploy the frontend and backend services as Kubernetes Deployments, exposing them via Kubernetes Services. To enable external access and SSL termination, we set up an Nginx Ingress controller. The Ingress routes incoming traffic to the appropriate service based on URL paths. Additionally, we manage secrets for SSL certificates and service account credentials. This automated deployment process ensures consistency, reduces manual effort, and facilitates efficient scaling of the VeritasTrial application.
The deployment of ChromaDB uses Terraform, as suggested in the ChromaDB docs. It deploys a VM instance that runs the ChromaDB service, separate from the Kubernetes cluster that hosts the frontend and backend, because the vector database is stateful and recreating it is expensive. This isolated design allows us to disaggregate the pipeline workflow (which updates data in the database and runs less frequently) from the app workflow (which accesses data in the database and runs more frequently).
All deployment steps can be triggered by GitHub Actions workflows.
Taking our goals and objectives into consideration, we aim to expand our project to reach a larger audience and provide greater utility for diverse user groups. Some additional work we might consider includes:
- Multilingual Support: Expand the application to support multiple languages beyond English, enabling users to retrieve and understand clinical trial data in their preferred language.
- Integration with Other Databases: Extend the system to integrate with additional clinical trial databases or medical resources, such as WHO ICTRP or PubMed, to provide users with a more comprehensive dataset.
- Real-Time Updates: Implement real-time updates for clinical trial information to ensure users have access to the most current data, including ongoing trial statuses and newly published results.
- Enhanced Conversational Capabilities: Improve the chatbot’s capabilities to handle more complex and contextual queries, such as comparing multiple trials or answering follow-up questions about a specific trial.
- Data Visualization: Add interactive data visualization tools to help users better understand clinical trial results and other relevant information.
- Chen J, Xiao S, Zhang P, et al. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv preprint arXiv:2402.03216, 2024. https://arxiv.org/abs/2402.03216
- Jin Q, Dhingra B, Liu Z, et al. PubMedQA: A Dataset for Biomedical Research Question Answering. arXiv preprint arXiv:1909.06146, 2019. https://arxiv.org/abs/1909.06146
- Gao T, Yao X, Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv preprint arXiv:2104.08821, 2021. https://arxiv.org/abs/2104.08821
In this section we describe how to recreate a similar project from scratch. Note that this is not tested to work, but you are welcome to open an issue in the issue tracker if the instructions do not work directly, so that we can refine them gradually.
- Clone this repository. Make sure to have GitHub Actions available.
- Create a project on Google Cloud Platform. Visit your dashboard, where you can see your project ID. Replace all occurrences in the codebase of `veritastrial` with your project ID (case-sensitive). Also pick the region and zone for your project. Replace all occurrences of `us-central1-a` with your zone, and `us-central1` with your region. You may want to exclude the `README.md` file in this process.
- Go to the APIs & Services dashboard and enable the following APIs: Cloud Monitoring API, Compute Engine API, Cloud Logging API, Vertex AI API, Kubernetes Engine API, Artifact Registry API, Cloud Resource Manager API, Cloud Run Admin API, Network Connectivity API, Notebooks API. Note that this list may not be complete, and you may enable other APIs when needed.
- Go to IAM & Admin > Service Accounts and create two service accounts. Let `your-project-name` be a project name which you can choose at random. The first service account should be named `your-project-name-service` and granted the following access: Storage Admin, Vertex AI Administrator. The second service account should be named `your-project-name-deployment` and granted the following access: Artifact Registry Administrator, Compute Admin, Compute OS Admin, Kubernetes Engine Admin, Service Account User, Storage Admin, Vertex AI Administrator. Click into your service accounts after creation, go to "Keys", click "Add key" then "Create new key", and download as a JSON file. Name the downloaded JSON files `your-project-name-service.json` and `your-project-name-deployment.json`, respectively. Put them under the `/secrets/` directory - they will be automatically git-ignored. Then change all occurrences in the codebase of `veritas-trial` to `your-project-name` (case-sensitive).
- Go to Cloud Storage > Buckets and create a bucket named `your-project-name`. Inside the bucket create the following folders: `data-pipeline`, `embedding-model`. Make sure the region and zone are correct.
- Go to Artifact Registry and create a repository named `docker`. Make sure that the region and zone are correct.
- On your forked repository in GitHub, go to "Actions", choose "Deploy ChromaDB", and click "Run workflow" with both checkboxes unchecked (which is the default). This should take a long time. After it succeeds, click into the workflow run, click into the job `deploy-chromadb`, and search the logs for `Nginx ingress IP address` (suppose it says `1.2.3.4`); then your deployment should be ready at `http://1.2.3.4.sslip.io/` in a few minutes. If you want `https` (so as to activate the clipboard API in the app), see Obtaining SSL certificate from ZeroSSL. The backend service will be ready at `http://1.2.3.4.sslip.io/api/` (see the smoke-test sketch after this list).
- A bit more detail about the previous step: the "Deploy ChromaDB" workflow actually deploys a ChromaDB service via GCP Compute Engine. Then it (1) deploys the pipeline, i.e., uploads some Docker images to the repository you created in the Artifact Registry, prepares the data in the created buckets, and adds vector embeddings into the deployed ChromaDB database, and (2) deploys the app, i.e., pushes some Docker images to the repository you created in the Artifact Registry and deploys a Kubernetes cluster in GKE hosting the frontend and the backend. Normally "Deploy ChromaDB" should be run only once. In the future, if you make changes to the pipeline (`/src/`) you should run the "Deploy Pipeline" workflow, and if you make changes to the app (`/app/`) you should run the "Deploy App" workflow. See /deploy for more details on deployment.
- After the deployment a pull request will be created and auto-merged (unless CI fails). That will update `/deploy/.docker-tag-app`, `/deploy/.docker-tag-pipeline`, and `/deploy/chromadb/.instance-ip`. The `/deploy/chromadb/.instance-ip` file contains the IP address where your deployed ChromaDB service is accessible.
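As a quick smoke test after deployment, you could hit the backend heartbeat endpoint. The address below is the placeholder from the steps above and must be replaced with your actual ingress IP.

```python
"""Hypothetical smoke test; replace 1.2.3.4 with your actual ingress IP."""
import requests

BASE_URL = "http://1.2.3.4.sslip.io/api"  # placeholder from the deployment steps above

resp = requests.get(f"{BASE_URL}/heartbeat", timeout=10)
resp.raise_for_status()
print(resp.json())  # should contain the server timestamp in nanoseconds
```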
For details of local development, check out the subdirectory READMEs. Again, feel free to open an issue in our issue tracker if you see anything wrong in the instructions, or if you have questions. Happy coding!