Pooja Putta | Data Engineer

Who I am

Engineering pipelines
teams depend on daily

I'm a Data Engineer with 3+ years of experience building cloud-native pipelines, real-time ingestion systems, and analytics platforms across AWS, Azure, Databricks, and Snowflake. I've delivered pipelines that process 5TB+ of banking transactions daily, cut data latency from T+1 to under 5 minutes, and sustained a 99.9% SLA through schema validation, observability, and automated deployment.

At Wells Fargo, I owned ELT pipelines on Azure Data Factory and Databricks, built Kafka producers/consumers for real-time ingestion, and enforced SOX/GDPR/CCPA compliance through the pipeline code itself. At UC Transportation, I built automated ingestion from 12 shuttle routes and delivered Power BI dashboards that replaced fully manual reporting.

Currently building medical imaging data pipelines for AI-assisted clinical research at UC Medical Center, including a U-Net that segments 6 cardiac structures and reaches 0.897 Dice on LIMA. Published researcher (Springer, ICDMLA 2024).

Education: M.Eng. Computer Science · University of Cincinnati · GPA 4.0 / 4.0

Certs: AWS Certified Data Engineer · IBM Data Engineering Professional

Location: Cincinnati, OH · Open to remote & US relocation

Targeting: Data Engineer · AI/ML Engineer · Analytics Engineer

Email: poojaputta3699@gmail.com

Phone: +1 615-864-0495

Career

Work Experience

Mar 2026
Present

UC Medical Center

Cincinnati, OH

Research Associate

Built an end-to-end medical imaging pipeline processing 1,559 DICOM files per patient — handling inconsistent DICOM metadata and automating SMARTPHASE series selection, STL-to-voxel conversion, U-Net inference, and color-coded STL export to support 3D surgical modeling and intraoperative AR visualization
Trained a U-Net segmenting 6 cardiac structures (Aorta, LAD, LIMA, Heart, Ribs, Sternum) with weighted loss up to 12× for class imbalance; achieved 0.897 Dice on LIMA, establishing a reproducible baseline for future patient cohorts
Designed incremental retraining so new patient batches onboard without full pipeline restarts, ensuring reproducible inference across batches — serving as the sole engineer collaborating directly with the clinical research team
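For readers less familiar with the metric, the Dice score and class weighting mentioned above can be sketched as follows. This is an illustrative NumPy example, not the project's code: the function names, the inverse-frequency weighting scheme, and the cap of 12 (chosen to mirror the "up to 12×" figure) are all assumptions.

```python
import numpy as np


def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


def inverse_frequency_weights(labels: np.ndarray, num_classes: int, cap: float = 12.0) -> np.ndarray:
    """Per-class loss weights proportional to inverse voxel frequency (hypothetical scheme).

    `labels` holds non-negative integer class ids. Weights are capped so that
    very small structures do not dominate the loss.
    """
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # avoid division by zero for absent classes
    weights = counts.max() / counts
    return np.minimum(weights, cap)


if __name__ == "__main__":
    pred = np.array([[1, 1], [0, 0]])
    target = np.array([[1, 0], [0, 0]])
    print(round(dice_score(pred, target), 4))
    print(inverse_frequency_weights(np.array([0] * 100 + [1]), num_classes=2))
```

Thin structures such as vessels occupy very few voxels, which is why a capped per-class weight is a common way to keep them from being ignored during training.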

TensorFlow · U-Net · DICOM · Python · 3D STL · NumPy

Sep 2024
Apr 2026

University of Cincinnati

Cincinnati, OH

Data Research Assistant

Built an automated ingestion system polling real-time GPS location and ridership data from 12 shuttle routes via REST APIs every 30 minutes; stored 8,000+ daily records as raw JSON in AWS S3, replacing fully manual data collection
Engineered SQL and Pandas transformation pipelines to clean, deduplicate, and model raw API payloads into analysis-ready datasets; maintained automated weekly refresh pipelines cutting manual report preparation time by 60%
Delivered 3 Power BI dashboards tracking route utilization, peak-hour ridership, and scheduling patterns — providing operations staff with visibility into scheduling gaps across all 12 routes to support data-driven planning decisions
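The clean-and-deduplicate step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual pipeline: the field names (`route_id`, `vehicle_id`, `recorded_at`, `riders`) and both function names are hypothetical.

```python
import pandas as pd


def clean_shuttle_records(raw_records: list) -> pd.DataFrame:
    """Deduplicate raw API payloads (a list of dicts) and type the columns for analysis."""
    df = pd.DataFrame(raw_records)
    df["recorded_at"] = pd.to_datetime(df["recorded_at"], utc=True)
    # Rows without a route or timestamp cannot be attributed, so drop them.
    df = df.dropna(subset=["route_id", "recorded_at"])
    # A repeated poll can return the same reading twice; keep one copy per key.
    df = (
        df.sort_values("recorded_at")
        .drop_duplicates(subset=["route_id", "vehicle_id", "recorded_at"], keep="last")
        .reset_index(drop=True)
    )
    df["riders"] = df["riders"].fillna(0).astype(int)
    return df


def hourly_ridership(df: pd.DataFrame) -> pd.DataFrame:
    """Total riders per route and hour of day, the shape a peak-hour chart needs."""
    with_hour = df.assign(hour=df["recorded_at"].dt.hour)
    return with_hour.groupby(["route_id", "hour"], as_index=False)["riders"].sum()


if __name__ == "__main__":
    sample = [
        {"route_id": "R1", "vehicle_id": "V1", "recorded_at": "2025-01-01T08:00:00Z", "riders": 10},
        {"route_id": "R1", "vehicle_id": "V1", "recorded_at": "2025-01-01T08:00:00Z", "riders": 10},
        {"route_id": "R1", "vehicle_id": "V1", "recorded_at": "2025-01-01T08:30:00Z", "riders": None},
    ]
    print(hourly_ridership(clean_shuttle_records(sample)))
```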

Python · REST APIs · AWS S3 · SQL · Pandas · Power BI

Jan 2022
Jul 2024

Wells Fargo

Hyderabad, India

Data Engineer

Designed and owned ELT ingestion pipelines on Azure Data Factory and Databricks processing 5TB+ of daily banking transactions into a Delta Lake Medallion Architecture (Bronze/Silver/Gold), ensuring reliable, audit-ready data storage for 3 regional analytics teams
Built star schema data models on top of the Medallion layers enabling consistent, self-service reporting across Power BI and Tableau; partnered with BI engineers and product managers to define SLAs and deliver high-quality datasets
Built Apache Kafka (Confluent) producers/consumers for real-time banking transaction ingestion — cut data latency from T+1 batch to under 5 minutes, enabling near-real-time fraud monitoring dashboards
Deployed Splunk observability and automated schema validation across all pipelines; reduced silent data failures by 25% and sustained 99.9% SLA; partnered with analysts to improve data availability and self-service reporting
Enforced SOX, GDPR, and CCPA compliance via PII masking, IAM access controls, and data lineage tracking; built CI/CD with GitHub Actions and Azure DevOps for automated pipeline deployment across environments
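One common way to mask PII inside pipeline code is keyed hashing, sketched below with only the standard library. This illustrates the general technique and is not the Wells Fargo implementation: the field list, the function name, and the choice of HMAC-SHA256 are assumptions.

```python
import hashlib
import hmac

# Hypothetical set of columns treated as PII.
PII_FIELDS = {"account_number", "email", "ssn"}


def mask_record(record: dict, secret_key: bytes) -> dict:
    """Replace PII values with a keyed hash so joins still work but raw values are hidden.

    The same input and key always produce the same token, which keeps masked
    columns usable as join keys downstream.
    """
    masked = {}
    for name, value in record.items():
        if name in PII_FIELDS and value is not None:
            token = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[name] = token.hexdigest()
        else:
            masked[name] = value
    return masked


if __name__ == "__main__":
    row = {"txn_id": 1, "email": "a@example.com", "amount": 42.5}
    print(mask_record(row, secret_key=b"demo-key-not-for-production"))
```

In practice the key would come from a secrets manager rather than source code, since anyone holding it can recompute the tokens.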

Azure ADF · Databricks · Delta Lake · Kafka (Confluent) · PySpark · Splunk · Power BI · Tableau · Azure DevOps

Deep Dive

Featured Case Study:
CDC Replication Pipeline

A personal project simulating a change data capture system — from PostgreSQL source to Redshift Serverless analytics target. Built to demonstrate production CDC patterns: schema evolution, idempotent consumers, end-to-end data lineage, and infrastructure-as-code.

Stack
Debezium · Amazon MSK · AWS Glue · Redshift · OpenLineage · dbt · Terraform

Key result
<30s replication lag · 5K+ events/sec

View on GitHub

Pipeline Architecture

PostgreSQL

Source DB · WAL logical replication enabled

↓

Debezium Connector

Captures row-level INSERT / UPDATE / DELETE events

↓

Amazon MSK (Kafka)

Avro-serialized events · Schema Registry

↓

AWS Glue Consumer

Idempotent writes · DynamoDB offset tracking · schema evolution

↓

Redshift Serverless

dbt models · data contracts · OpenLineage lineage graph

↓

Airflow + Marquez

Orchestration · health checks · visual lineage UI

Results

<30s Replication lag

5K+ Events/sec throughput

Zero Manual schema interventions

100% Idempotent delivery

Full Stack

Debezium · PostgreSQL WAL · Amazon MSK · Avro + Schema Registry · AWS Glue · Redshift Serverless · DynamoDB · OpenLineage · Marquez · dbt · Airflow · Terraform · GitHub Actions

Key Engineering Challenges

Schema Evolution Without Downtime

Source tables change without warning. Built Avro schema evolution with Schema Registry compatibility checks and AWS Glue schema auto-detection — Redshift targets adapt to column additions or type changes without manual intervention or pipeline restarts.
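A schema-change check of this kind can be sketched as a classifier that separates safe changes from breaking ones. The example below is illustrative only and does not call Schema Registry or Glue. The promotion table follows the Avro specification's type-promotion rules, while the function name and the flat column-to-type representation are assumptions.

```python
# Type promotions a reader may apply, per the Avro specification's schema resolution rules.
AVRO_PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def classify_schema_change(old: dict, new: dict) -> dict:
    """Compare two {column: avro_type} maps and bucket the differences.

    added:    columns only in the new schema
    removed:  columns only in the old schema
    promoted: type changed along an allowed Avro promotion
    breaking: type changed in a way a reader cannot resolve
    """
    promoted, breaking = [], []
    for column in sorted(set(old) & set(new)):
        if old[column] == new[column]:
            continue
        if new[column] in AVRO_PROMOTIONS.get(old[column], set()):
            promoted.append(column)
        else:
            breaking.append(column)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "promoted": promoted,
        "breaking": breaking,
    }


if __name__ == "__main__":
    before = {"id": "int", "amount": "float", "name": "string"}
    after = {"id": "long", "amount": "string", "name": "string", "email": "string"}
    print(classify_schema_change(before, after))
```

A pipeline could apply the "added" and "promoted" buckets automatically and stop for review on anything in "breaking"; that policy is itself an assumption rather than something stated in the case study.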

Exactly-Once Delivery

Kafka consumers can reprocess messages after a failure, which causes duplicate rows. Solved with DynamoDB offset tracking and idempotent upsert logic in Glue: delivery is at-least-once, but each event's effect is applied exactly once regardless of retries.
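The offset-tracking idea can be illustrated with an in-memory sketch. Two plain dictionaries stand in for the Redshift target and the DynamoDB offset table, and the class and field names are hypothetical. The `op` codes follow Debezium's change-event envelope ("c" create, "u" update, "d" delete).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    partition: int
    offset: int
    op: str               # "c", "u", or "d", as in Debezium change events
    key: int
    row: Optional[dict]   # new row image; None for deletes


@dataclass
class IdempotentSink:
    table: dict = field(default_factory=dict)      # stand-in for the analytics target
    committed: dict = field(default_factory=dict)  # stand-in for the offset store

    def apply(self, event: ChangeEvent) -> bool:
        """Apply an event once; return False if it was already applied."""
        last_offset = self.committed.get(event.partition, -1)
        if event.offset <= last_offset:
            return False  # replayed message: skip so retries cannot duplicate rows
        if event.op == "d":
            self.table.pop(event.key, None)
        else:
            self.table[event.key] = event.row  # upsert keyed on the primary key
        self.committed[event.partition] = event.offset
        return True


if __name__ == "__main__":
    sink = IdempotentSink()
    insert = ChangeEvent(partition=0, offset=0, op="c", key=1, row={"id": 1, "amount": 10})
    print(sink.apply(insert), sink.apply(insert), sink.table)
```

In a real deployment the row write and the offset update are separate operations, so the upsert also has to be safe to repeat if a crash lands between them. Keying the write on the primary key, as above, is what provides that safety.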

Data Lineage Across the Full Stack

Integrated OpenLineage emitters at each pipeline stage, feeding Marquez for a visual lineage graph from PostgreSQL source through to Redshift dbt marts — any downstream issue can be traced back to its origin table.
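Tracing a downstream dataset back to its origin tables amounts to walking a lineage graph upstream. The sketch below shows that traversal on a plain dictionary. It does not use the OpenLineage or Marquez APIs, and the dataset names and function name are made up for illustration.

```python
from collections import deque


def trace_to_sources(upstream: dict, dataset: str) -> list:
    """Return every root ancestor of `dataset`.

    `upstream` maps a dataset name to the list of datasets it reads from.
    A root is an ancestor with no recorded inputs of its own.
    """
    seen = {dataset}
    queue = deque([dataset])
    sources = []
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        if not parents and node != dataset:
            sources.append(node)
        for parent in parents:
            if parent not in seen:  # guards against revisiting shared ancestors
                seen.add(parent)
                queue.append(parent)
    return sorted(sources)


if __name__ == "__main__":
    graph = {
        "mart_revenue": ["stg_orders", "stg_customers"],
        "stg_orders": ["postgres.orders"],
        "stg_customers": ["postgres.customers"],
    }
    print(trace_to_sources(graph, "mart_revenue"))
```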

dbt Contracts at CI Time

dbt data contracts enforce column names and types. If a source schema change breaks a downstream mart, the CI pipeline catches it before merge — not after data has silently degraded in production.

Infrastructure as Code

Every AWS resource — MSK cluster, Glue jobs, Redshift Serverless, DynamoDB table, IAM roles, VPC — is Terraform-managed. The entire stack can be torn down and recreated in one command.

Portfolio

Selected Projects

Data Engineering

Featured Project

CDC Replication Pipeline

PostgreSQL WAL → Debezium → Amazon MSK → AWS Glue → Redshift Serverless. Idempotent consumers with DynamoDB offset tracking, Avro schema evolution, OpenLineage lineage graph, and full Terraform-managed AWS infra. Replication lag under 30 seconds.

Debezium · Amazon MSK · AWS Glue · Redshift · OpenLineage · dbt · Terraform

GitHub Case Study ↓

AI & ML

Production-Style Build

AI Data Pipeline + RAG with Drift Detection

Hybrid retrieval pipeline combining Pinecone vector search and Snowflake structured features. MMD drift detection flags embedding distribution shifts before they degrade retrieval quality. MLflow cost tracking per run across 10K+ events/sec on 8 Kafka partitions.
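MMD (maximum mean discrepancy) compares two samples by their mean similarity under a kernel. A minimal NumPy sketch of the biased squared-MMD estimate with an RBF kernel follows. It illustrates the statistic only: the function names, the default bandwidth of 1/dimension, and any alert threshold are assumptions, not the project's settings.

```python
from typing import Optional

import numpy as np


def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float) -> np.ndarray:
    """Pairwise k(a, b) = exp(-gamma * ||a - b||^2) between rows of x and y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding


def mmd_squared(x: np.ndarray, y: np.ndarray, gamma: Optional[float] = None) -> float:
    """Biased estimate of squared MMD between two samples of embeddings."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default bandwidth
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(0.0, 1.0, size=(200, 8))
    same_dist = rng.normal(0.0, 1.0, size=(200, 8))
    shifted = rng.normal(3.0, 1.0, size=(200, 8))
    print("no drift:", mmd_squared(reference, same_dist))
    print("drift:   ", mmd_squared(reference, shifted))
```

A drift monitor would compare a recent batch of embeddings against a reference window and raise an alert when the statistic exceeds a threshold calibrated on known-good data.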

OpenAI · Pinecone · Snowflake · Kafka · MLflow · MMD Drift · Airflow

GitHub

Data Engineering

Production-Style Build

Real-Time Fraud Detection Pipeline

Transaction streams via Azure Event Hubs → PySpark anomaly detection with dead-letter queue → dbt Bronze/Silver/Gold layers provisioned via Terraform. Automated Great Expectations quality gates at each layer. Live Power BI dashboards with per-event audit logging.
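The dead-letter pattern can be sketched in plain Python: malformed events are set aside with a reason instead of failing the batch, and the remaining events are screened for outliers. This is a simplified stand-in for the PySpark job, and the field names, function name, and z-score rule are assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical required fields for a transaction event.
REQUIRED_FIELDS = ("txn_id", "account_id", "amount")


def route_events(events: list, z_threshold: float = 3.0) -> tuple:
    """Split events into (valid, flagged, dead_letter).

    dead_letter holds malformed events plus the reason they were rejected;
    flagged is the subset of valid events whose amount is a statistical outlier.
    """
    valid, dead_letter = [], []
    for event in events:
        missing = [name for name in REQUIRED_FIELDS if event.get(name) is None]
        amount = event.get("amount")
        if missing:
            dead_letter.append({"event": event, "reason": f"missing fields: {missing}"})
        elif isinstance(amount, bool) or not isinstance(amount, (int, float)):
            dead_letter.append({"event": event, "reason": "amount is not numeric"})
        else:
            valid.append(event)

    amounts = [event["amount"] for event in valid]
    mu = mean(amounts) if amounts else 0.0
    sigma = pstdev(amounts) if len(amounts) > 1 else 0.0
    flagged = [
        event for event in valid
        if sigma > 0 and abs(event["amount"] - mu) / sigma > z_threshold
    ]
    return valid, flagged, dead_letter


if __name__ == "__main__":
    batch = [{"txn_id": i, "account_id": "A", "amount": 10.0} for i in range(20)]
    batch.append({"txn_id": 20, "account_id": "A", "amount": 10000.0})
    batch.append({"txn_id": 21, "account_id": "A", "amount": "ten"})
    valid, flagged, dead = route_events(batch)
    print(len(valid), [e["txn_id"] for e in flagged], [d["reason"] for d in dead])
```

Keeping the rejection reason next to each dead-lettered event makes it possible to replay them after the upstream issue is fixed.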

Azure Event Hubs · PySpark · dbt · Great Expectations · Terraform · Power BI

GitHub

Data Engineering

Production-Style Build

Cloud-Native Lakehouse with FinOps

Medallion architecture: Bronze (Parquet/S3) → Silver (Apache Iceberg) → Gold (Snowflake). AWS Cost Explorer + Snowflake ACCOUNT_USAGE for daily per-tier cost tracking. Automated partition pruning and cluster rightsizing, with dbt data contracts enforced at CI time.

AWS S3 · Apache Iceberg · Databricks · Snowflake · dbt · Airflow · Terraform

GitHub

AI & ML

Production-Style Build

OutreachAI — LLM Recruiter Outreach Platform

FastAPI + React + PostgreSQL platform. Gemini 2.5 Flash generates and scores outreach emails (tone / relevance / professionalism) before send. dbt models track outreach funnel and LLM prompt performance. 5 Airflow DAGs, 15-test pytest suite, GitHub Actions CI/CD.

FastAPI · Gemini 2.5 Flash · dbt · Airflow · PostgreSQL · React/TS · Docker

GitHub

Research

Active · UC Medical Center

AR_Cardio — Cardiac Vessel Segmentation

U-Net pipeline segmenting Aorta, LAD, LIMA from CT scans (1,559 DICOM files). Exports 5-structure colored 3D STL models for surgical planning and 3D printing. Phase 1 complete: LIMA Dice 0.897. Phase 2 targeting 50–100 patients pending IRB approval.

TensorFlow · U-Net · DICOM · Python 3.9 · 3D STL · NumPy

Repo private during IRB review

Research

★ Published · Springer · 2024

Hate Speech Detection

Published by Springer (ICDMLA 2024). Multi-model NLP pipeline (Decision Trees, KNN, Random Forest) for social media hate speech detection. Key finding: hate speech lacks unique discriminative linguistic features, a novel methodological insight contributed to the literature.

Python · Scikit-Learn · NLP · WordCloud · LazyPredict · Pandas

GitHub Springer ↗

Research

Seeded AR_Cardio Collaboration

Abdominal Trauma Detection

CNN-based pipeline for automated abdominal trauma detection in CT scans with SHAP explainability for clinical interpretability. This project led directly to the AR_Cardio collaboration: the UC Medical Center surgeon reached out after seeing it on my résumé.

TensorFlow · CNN · SHAP · Python · OpenCV

GitHub

Stack

Technical Skills

Languages

Python · SQL (T-SQL / PostgreSQL) · PySpark · Bash

Azure

ADF · Synapse · Event Hubs · Databricks · ADLS · Azure DevOps

AWS

Redshift · S3 · Glue · Kinesis · MSK · Lambda · Step Functions · CloudWatch

GCP & Snowflake

BigQuery · Dataflow · Snowflake

Data Engineering

Apache Kafka · Debezium CDC · Airflow · dbt · Delta Lake · Apache Iceberg · OpenLineage · Medallion Architecture · Splunk

ML / AI

TensorFlow · PyTorch · Scikit-Learn · MLflow · U-Net · MMD Drift Detection · RAG · Gemini API · Pandas · NumPy

BI & Quality

Power BI · Tableau · Great Expectations · Schema Drift Detection · Data Lineage

DevOps

CI/CD · GitHub Actions · Azure DevOps · Docker · Terraform

Compliance

SOX · GDPR · CCPA · PII Masking · IAM Access Controls · Audit Logging

Competitions

Hackathons & Challenges

Computer Vision · AI

WeCrafts Event

AI-Powered Handloom Authentication System

Built an app that detects the quality of fabric and determines whether it is original handloom-woven or machine-made — helping preserve and authenticate traditional artisan work. Used image classification and texture analysis to distinguish hand-weaving patterns from machine-produced imitations.

Computer Vision · Image Classification · Python · AI/ML

YOLO · OCR · Document AI

Hackathon

AI Document Validation for Government Scheme Applications

Developed an AI system that automatically detects misplaced document uploads (Aadhaar, PAN, signature, photographs) in government scheme applications using YOLO-based classification and OCR validation. Implemented real-time quality checks and field-level verification, reducing form rejection rates and manual review effort.

YOLO · OCR · Document Classification · Python · Computer Vision

Beyond the screen

A little more about me

🎭

Kuchipudi Dancer

Trained classical dancer in Kuchipudi, one of India's major classical dance forms. Years of rigorous practice taught discipline, precision, and performance under pressure — qualities that carry over into engineering.

🎨

Artist

Painter working across multiple styles. Creating art sharpens the ability to see structure in complexity — useful when designing data models and system architecture diagrams that need to communicate clearly at a glance.

🏸

Badminton Player

Competitive badminton player. The sport demands fast decision-making, spatial awareness, and staying calm under pressure — skills that translate directly to debugging production pipelines at 2am.

Let's connect

Open to Opportunities

Actively seeking Data Engineer and AI/ML Engineer roles. If you're building serious data infrastructure, I'd love to talk.

poojaputta3699@gmail.com +1 615-864-0495 LinkedIn GitHub
