Summary
SOAP Note Generation through Ambient Listening, Large Language Model Fine-Tuning, and RAG
Tags
RAG
Fine-tune
Research to Industry
Research on LLM Embeddings
Industry/Client: UChicago Booth
Role: LLM Researcher / AI Researcher / Data Scientist / Research Assistant
notion image
Demo Video:
MediNotes: SOAP Note Generation through Ambient Listening, Large Language Model Fine-Tuning, and RAG
MediNotes is a first-gen GenAI framework that enhances clinical consultations by automating documentation and providing a healthcare-domain–fine-tuned copilot with retrieval-augmented generation (RAG) and ambient listening.
  • MediNotes was awarded “Best in Show” as one of the top capstone projects at the 2024 University of Chicago capstone showcase.
  • This project was a collaboration with UChicago Medicine to advance healthcare AI.
  • Building on groundbreaking research from the Microsoft AI team published in Nature, we developed an innovative framework designed to streamline medical documentation and the consultation process, with the goal of alleviating physician burnout.
  • By combining cutting-edge technologies like ambient listening, large language model fine-tuning, and retrieval-augmented generation (RAG), MediNotes represents a significant step forward in optimizing healthcare workflows and improving physician efficiency.
  • Collaborated with clinicians at UChicago Medicine to validate and refine the system, culminating in two IEEE publications and pilot deployments in live clinical environments.

 
This work was published in two IEEE conference papers:
Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation
A GEN AI Framework for Medical Note Generation
Please check out the GitHub repository for the code:
 

1. Motivation and Objectives

Clinicians spend as much as 25–50 percent of their time on Electronic Health Record (EHR) documentation, detracting from patient care and driving burnout. This paper introduces MediNotes, a generative AI system designed to automate SOAP-note creation from live or recorded doctor–patient conversations. By combining Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR), MediNotes aims to:
  • Reduce documentation time via real-time ambient transcription.
  • Maintain accuracy and completeness with advanced RAG retrieval.
  • Operate efficiently in typical clinical hardware environments using PEFT and QLoRA fine-tuning.

notion image

2. System Architecture

MediNotes handles two main scenarios (illustrated in Fig. 1 of the paper):
Live ASR and speaker diarization convert dialogue into cleaned, chunked transcripts. These chunks are embedded with an embedding model fine-tuned via RAFT and indexed in a PostgreSQL vector database through the PGVector extension. A customized hybrid search retrieves and re-ranks relevant chunks in response to queries, which are then inserted into the prompt of a fine-tuned LLM. The generated report is validated and stored, and Q&A responses are streamed in real time via FastAPI.
  1. Real-time SOAP-note generation
      • ASR & Speaker Diarization: Uses Whisper-base and Pyannote to capture and separate physician vs. patient speech.
      • LLM Processing: Transcribes and organizes dialogue into Subjective, Objective, Assessment, and Plan sections.
      • Storage: Stores resulting notes in a vector database for future retrieval.
  2. Query-based Retrieval
      • User Query: Accepts text or voice queries.
      • Embedding & RAG: Converts query to embeddings, retrieves relevant document chunks via PGVector, and uses the LLM to generate precise answers.
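The hybrid retrieval step above can be sketched as rank fusion between a vector ranking and a keyword ranking. The following is a minimal illustration using Reciprocal Rank Fusion over toy scores, not the production PGVector query (function names and scoring details are assumptions for illustration):

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, doc):
    """Toy lexical score: fraction of query terms present in the chunk."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def hybrid_rerank(query, query_vec, chunks, k=60):
    """Fuse a vector ranking and a keyword ranking with Reciprocal
    Rank Fusion (RRF). `chunks` is a list of (text, embedding) pairs;
    returns chunk texts ordered by fused score."""
    by_vec = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    by_kw = sorted(chunks, key=lambda c: keyword_score(query, c[0]), reverse=True)
    fused = {}
    for ranking in (by_vec, by_kw):
        for rank, (text, _) in enumerate(ranking):
            fused[text] = fused.get(text, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)
```

In the real pipeline the vector ranking would come from a PGVector similarity query and the keyword ranking from full-text search; RRF is one common way to combine the two before re-ranking.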
 
notion image

3. Model Fine-Tuning Techniques

To balance performance and resource constraints, the authors apply:
  • Parameter-Efficient Fine-Tuning (PEFT): Adapts only key weight matrices (e.g., q_proj, v_proj) with low-rank adapters (r = 16) to drastically reduce compute requirements.
  • Quantized LoRA (QLoRA): Further compresses model parameters to 4 bits, enabling high-fidelity fine-tuning on standard GPUs.
  • Instruction Tuning: Trains the model on “dialogue → SOAP note” prompts to improve structure and coherence.
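The low-rank adapter idea behind PEFT and QLoRA can be shown in a few lines of NumPy: the frozen weight W is augmented by a trainable rank-r product B·A, so only a small fraction of parameters is updated. A sketch with illustrative sizes (the hidden size and the scaling factor alpha below are assumed values, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 512       # hidden size of an attention projection (illustrative)
r = 16        # LoRA rank, as used for q_proj / v_proj
alpha = 32    # LoRA scaling factor (assumed value)

W = rng.standard_normal((d, d))        # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01 # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, zero-init

# Effective weight during fine-tuning: only A and B receive gradients.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size            # d * d
lora_params = A.size + B.size   # 2 * r * d, a tiny fraction of full_params
```

QLoRA applies the same adapter scheme on top of a 4-bit quantized base model, which is why full fine-tuning quality becomes reachable on a single standard GPU.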

notion image

4. Evaluation and Key Findings

  1. Automated Note Quality (ACI-BENCH Dataset):
      • Metrics: Outperforms GPT-4o and a SAMSum-fine-tuned BART baseline (BART+FT-SAMSum) on ROUGE-1 (up to 58.9), ROUGE-2, ROUGE-Lsum, BERTScore (F1 = 73.2), and BLEURT (> 41) across three test splits.
  2. Clinical Usability Study:
      • Participants: 10 doctors and 10 patients, each in 8 recorded and 8 query sessions.
      • Results:
        • 75 percent of notes required no manual correction (accuracy).
        • 60 percent met completeness thresholds.
        • Overall usefulness was rated 89 percent for reducing clinician burden.
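ROUGE-1, the headline metric above, measures unigram overlap between a generated note and its reference. A simplified scorer for intuition (the official ROUGE implementation adds stemming and bootstrap confidence intervals):

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated note and a
    reference note. Simplified re-implementation for illustration."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```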

5. Limitations and Ethical Considerations

  • Data Constraints: Reliance on the ACI-BENCH role-play dataset (207 dialogues) may not capture full clinical diversity, potentially limiting generalization.
  • Privacy & Bias: Handling sensitive medical conversations demands strict HIPAA-compliant protocols. Models must be audited to prevent biased or incorrect documentation.
  • Adoption Challenges: Integration into varied EHR platforms and training of medical staff will be essential for real-world deployment.

6. Practical Applications and Future Directions

  • EHR Integration: Embedding MediNotes in hospital systems to auto-populate SOAP notes in real time.
  • Telehealth Support: Streamlined note generation and sharing across remote consultations.
  • Enhanced Retrieval: Expanding vector store with broader clinical datasets to improve query accuracy.
  • Dataset Expansion: Acquiring de-identified real consultations to enrich model robustness and reduce overfitting to synthetic dialogues.
 

Further development

Further development has continued since the original work; these extensions are not yet published.
 
notion image
  • Ambient listening is the process of passively capturing and processing real-time conversations; in our case, doctor–patient consultations. For the doctor, it’s as simple as hitting the record button, but behind the scenes a lot happens to turn that raw audio into structured output.
  • In this diagram, the green boxes represent the input and output of our system, while the light blue boxes show the actions and techniques working in the background. First, the audio is transcribed by our fine-tuned model, converting it into text. Then, speaker identification labels who said what (doctor or patient), transforming the conversation into a structured, labeled format.
  • While the doctor and patient are talking, our system handles the work in the background, so the doctor doesn’t have to recall what was said later. This improves documentation accuracy and saves significant time and effort.
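The speaker-labeling step described above can be sketched as aligning ASR segments with diarization turns by time overlap. A minimal illustration (real Whisper and Pyannote outputs are richer than these plain tuples, which are assumed here for brevity):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(asr_segments, speaker_turns):
    """Assign each ASR segment the speaker whose diarization turn
    overlaps it most.

    asr_segments:  list of (start, end, text) from transcription.
    speaker_turns: list of (start, end, speaker) from diarization.
    Returns a labeled transcript as (speaker, text) pairs."""
    transcript = []
    for s_start, s_end, text in asr_segments:
        best = max(
            speaker_turns,
            key=lambda t: overlap(s_start, s_end, t[0], t[1]),
        )
        transcript.append((best[2], text))
    return transcript
```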
 

Additional Advanced Methods for the RAG Pipeline

 

Retrieval-Augmented Fine-Tuning (RAFT)

notion image
notion image
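RAFT trains the model on examples whose retrieved context mixes the oracle (answer-bearing) document with distractor documents, sometimes omitting the oracle entirely, so the model learns to use retrieved context without blindly trusting it. A hypothetical example builder (the function name, corpus format, and probabilities are illustrative assumptions, not our exact training setup):

```python
import random


def build_raft_example(question, answer, oracle_doc, corpus,
                       num_distractors=3, p_oracle=0.8, rng=None):
    """Build one RAFT-style training example.

    With probability p_oracle the oracle document is included alongside
    sampled distractors; otherwise only distractors appear, which
    discourages the model from ignoring the retrieved context."""
    rng = rng or random.Random()
    distractors = rng.sample([d for d in corpus if d != oracle_doc],
                             num_distractors)
    docs = distractors + ([oracle_doc] if rng.random() < p_oracle else [])
    rng.shuffle(docs)
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": answer}
```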
 

Matryoshka Representation Learning (MRL)

notion image
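MRL trains embeddings so that nested prefixes (e.g., the first 64, 128, or 256 dimensions) are themselves usable embeddings, letting the vector index trade accuracy for speed and storage by truncating. A sketch of inference-time truncation, with training details omitted (the dimensions below are illustrative assumptions):

```python
import numpy as np


def mrl_truncate(embedding, dim):
    """Truncate a Matryoshka embedding to its first `dim` coordinates
    and re-normalize, so cosine similarity remains meaningful at the
    reduced dimensionality."""
    prefix = np.asarray(embedding, dtype=float)[:dim]
    norm = np.linalg.norm(prefix)
    return prefix / norm if norm else prefix


# A full-size embedding can be stored once and served at several sizes.
full = np.random.default_rng(0).standard_normal(768)
small = mrl_truncate(full, 128)
```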