Clinical Registry Intelligence
Built clinical ETL and NLP systems at Carta Healthcare, normalizing heterogeneous data into HL7/FHIR-aligned workflows and integrating LLM-assisted extraction into registry operations.
Clinical Registry Intelligence is the kind of work that sits underneath reliable applied AI. Before extraction quality, analytics, or downstream modeling can matter, the data has to be normalized, legible, and stable enough to trust.
Overview
At Carta Healthcare, I designed and owned end-to-end ETL and clinical NLP pipelines that moved heterogeneous healthcare data into forms that registry workflows could actually use. The work combined data normalization, schema alignment, extraction logic, API-oriented processing, and LLM-assisted workflows for information that could not be captured cleanly with rules alone.
Context
Carta operates in a domain where operational value depends on turning difficult clinical data into structured registry-ready signals. Inputs arrived from multiple sources and in multiple shapes: structured fields, semi-structured exports, and free-text clinical documentation. The technical challenge was not only extracting information, but doing so in a way that downstream registry and analytics workflows could rely on consistently.
Problem
Clinical data is rarely neat enough for direct use. Important fields may be missing, encoded differently across systems, or embedded in narrative documentation. Without a normalization layer, each downstream consumer inherits ambiguity, and any ML or analytics system built on top becomes harder to validate and maintain.
For registry workflows specifically, that created three compounding issues:
- heterogeneous source schemas needed to map into a stable representation
- clinically relevant facts were often present only in unstructured text
- downstream teams needed outputs that were standardized, inspectable, and operationally usable
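To make the first of those issues concrete, here is a minimal, hypothetical sketch of the kind of mapping layer involved. The field names, source shapes, and helper functions are illustrative only (not the actual Carta schemas); the point is that two differently shaped source records converge on one stable, FHIR-style representation.

```python
# Hypothetical source records: the same hemoglobin result arrives in two shapes.
source_a = {"pat_id": "123", "hgb": "13.2", "units": "g/dL", "drawn": "2023-04-01"}
source_b = {"patient": {"id": "123"}, "labs": [{"code": "718-7", "value": 13.2}]}

def normalize_source_a(rec: dict) -> dict:
    """Map source A's flat export into a stable, FHIR-style Observation."""
    return {
        "resourceType": "Observation",
        "subject": {"reference": f"Patient/{rec['pat_id']}"},
        "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}]},
        "valueQuantity": {"value": float(rec["hgb"]), "unit": rec["units"]},
        "effectiveDateTime": rec["drawn"],
    }

def normalize_source_b(rec: dict) -> dict:
    """Map source B's nested export into the same stable shape."""
    lab = rec["labs"][0]
    return {
        "resourceType": "Observation",
        "subject": {"reference": f"Patient/{rec['patient']['id']}"},
        "code": {"coding": [{"system": "http://loinc.org", "code": lab["code"]}]},
        "valueQuantity": {"value": float(lab["value"]), "unit": "g/dL"},
    }

a = normalize_source_a(source_a)
b = normalize_source_b(source_b)
# Downstream consumers see one contract, regardless of source shape.
assert a["subject"] == b["subject"] and a["code"] == b["code"]
```

Once every source converges on a shape like this, downstream extraction and analytics can be written once, against the contract, instead of once per source.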
System Design
I treated this as a problem of workflow design rather than of any single model's performance.
The system centered on ETL pipelines that ingested heterogeneous clinical inputs, normalized them into HL7- and FHIR-aligned structures, and prepared them for downstream extraction and registry consumption. That meant combining parsing, field mapping, schema validation, and transformation logic in a way that made the outputs consistent across sources.
On top of that normalized layer, I integrated clinical NLP and LLM-assisted extraction for information that could not be captured reliably from structured data alone. The important design choice was separation of concerns: normalization first, extraction second, downstream consumption third. That kept the system easier to reason about and reduced the chance that extraction logic would become entangled with source-specific cleanup.
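The separation of concerns described above can be sketched as a staged pipeline. This is a toy illustration with hypothetical stage functions and field names, not the production code; it shows the ordering constraint, which is the actual design point: extraction never sees raw source quirks, and registry output never sees extraction internals.

```python
from typing import Callable

def normalize(raw: dict) -> dict:
    # Stage 1: source-specific cleanup lives here and nowhere else.
    return {"patient_id": raw["id"].strip(), "note": raw.get("note", "")}

def extract(record: dict) -> dict:
    # Stage 2: extraction sees only normalized records.
    record["smoker"] = "smoker" in record["note"].lower()
    return record

def to_registry_row(record: dict) -> dict:
    # Stage 3: downstream consumers get a fixed, inspectable schema.
    return {"patient_id": record["patient_id"], "smoking_status": record["smoker"]}

STAGES: list[Callable[[dict], dict]] = [normalize, extract, to_registry_row]

def run(raw: dict) -> dict:
    out = raw
    for stage in STAGES:
        out = stage(out)
    return out

row = run({"id": " p1 ", "note": "Former smoker, quit 2019."})
```

Because each stage only depends on the output contract of the one before it, a new source format changes only stage 1, and a new extraction target changes only stage 2.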
Key Technical Decisions
One important decision was to standardize early around HL7/FHIR-compatible representations rather than letting every downstream step interpret raw source formats differently. That created a cleaner contract between ingestion, extraction, and registry-facing outputs.
Another was to use LLM-assisted extraction selectively. I did not treat LLMs as a universal parsing layer. They were most valuable where clinical text contained useful information that was difficult to capture through deterministic rules alone. By constraining where they sat in the pipeline, it became easier to reason about reliability and to preserve a stable surrounding system.
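The "rules first, LLM only for the residual" pattern can be sketched as a gated extractor. This is a hypothetical example (the field, regex, and function names are mine, and the LLM call is stubbed out), but it captures the structure: deterministic logic handles everything it can, and the model is reserved for cases the rules cannot resolve.

```python
import re
from typing import Optional

def rule_based_ef(note: str) -> Optional[float]:
    """Deterministic extraction: ejection fraction stated in a standard form."""
    m = re.search(r"(?:EF|ejection fraction)[^\d]{0,15}(\d{1,2})\s*%", note, re.I)
    return float(m.group(1)) if m else None

def llm_extract_ef(note: str) -> Optional[float]:
    """Stub for an LLM-assisted extractor; invoked only as a fallback."""
    raise NotImplementedError  # hypothetical: a constrained LLM prompt would go here

def extract_ef(note: str) -> Optional[float]:
    value = rule_based_ef(note)
    if value is not None:
        return value                 # rules win whenever they apply
    try:
        return llm_extract_ef(note)  # the LLM handles only residual cases
    except NotImplementedError:
        return None                  # no answer is better than an unreliable one
```

Keeping the model behind a gate like this makes its reliability easier to reason about: every LLM invocation corresponds to a documented gap in the deterministic logic.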
I also favored pipeline designs that made intermediate states legible. In a clinical setting, traceability matters. If an extracted field looked wrong, the path back to the source and transformation logic needed to stay understandable.
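One way to keep that path legible is to carry provenance alongside each value rather than discarding it at every step. The sketch below is hypothetical (the class, file path, and step names are invented for illustration), but it shows the idea: an extracted field remembers its source document and every transformation applied to it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class TracedValue:
    """An extracted field that carries its own audit trail."""
    value: Any
    source_doc: str                                  # where the raw text came from
    steps: list[str] = field(default_factory=list)   # transformations applied, in order

    def apply(self, step_name: str, fn: Callable[[Any], Any]) -> "TracedValue":
        # Each transformation returns a new value with the trail extended.
        return TracedValue(fn(self.value), self.source_doc, self.steps + [step_name])

# Hypothetical path, for illustration only.
v = TracedValue("  13.2 g/dl ", source_doc="labs/example_export.csv")
v = v.apply("strip", str.strip).apply("parse_value", lambda s: float(s.split()[0]))
# v.value is now a clean float, and v.steps records the path back to the source.
```

When a reviewer questions an output, the answer to "where did this come from?" is on the record itself rather than reconstructed from logs.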
Reliability and Evaluation
Reliability here was not a single benchmark; it was a systems property.
I focused on pipeline behavior that could support trustworthy downstream use: schema consistency, validation around transformed outputs, and extraction workflows that could be reviewed and iterated when failure modes appeared. The goal was not just to maximize extraction coverage, but to make the overall system more dependable in real registry operations.
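Validation around transformed outputs can be as simple as checking every registry-facing row against an explicit contract before it leaves the pipeline. A minimal sketch, with an invented contract and field names:

```python
# Hypothetical output contract: required fields and their expected types.
REQUIRED = {"patient_id": str, "smoking_status": bool}

def validate_row(row: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the row passes."""
    errors = []
    for name, typ in REQUIRED.items():
        if name not in row:
            errors.append(f"missing field: {name}")
        elif not isinstance(row[name], typ):
            errors.append(f"wrong type for {name}: {type(row[name]).__name__}")
    return errors
```

Rows that fail are held for review instead of flowing downstream, so failure modes surface where they can be inspected rather than in a registry submission.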
That required thinking beyond model quality alone. A strong extraction component inside a weak normalization pipeline still produces brittle results. The system had to be coherent end to end.
Outcome
The result was a stronger clinical data foundation for registry and analytics workflows: more standardized inputs, more usable structured outputs, and a clearer path from difficult clinical documentation to downstream operational value.
The work reinforced a pattern that still shapes how I build AI systems: the highest-leverage improvements often come from better workflow design, cleaner data contracts, and more disciplined system boundaries rather than from model complexity alone.
What I Owned
I owned the design and implementation of end-to-end ETL and NLP pipelines for clinical data workflows, including normalization strategy, schema alignment, extraction integration, and the practical shape of how outputs became usable downstream.
Reflection
This project is a good representation of how I like to work: careful about interfaces, skeptical of unnecessary complexity, and focused on making advanced methods useful inside real operating constraints. In applied AI, credibility comes from systems that hold together under messy conditions. That is the kind of work I find most meaningful.