Open to New Roles

3+ Years · Azure · AWS · Databricks · Snowflake

Pooja Putta.

M.Eng. Computer Science · University of Cincinnati · 4.0 GPA

Data Engineer · AI/ML Engineer · Analytics Engineer

✓ AWS Certified Data Engineer ✓ IBM Data Engineering Professional
3+ Years experience
4.0 M.Eng. GPA
5TB+ Daily data processed
<5 min Latency · T+1 to real-time

Who I am

Engineering pipelines teams depend on daily

I'm a Data Engineer with 3+ years of experience building cloud-native pipelines, real-time ingestion systems, and analytics platforms across AWS, Azure, Databricks, and Snowflake. My pipelines have processed 5TB+ of daily banking transactions, cut data latency from T+1 to under 5 minutes, and sustained a 99.9% SLA through schema validation, observability, and automated deployment.

At Wells Fargo, I owned ELT pipelines on Azure Data Factory and Databricks, built Kafka producers/consumers for real-time ingestion, and enforced SOX/GDPR/CCPA compliance through the pipeline code itself. At UC Transportation, I built automated ingestion from 12 shuttle routes and delivered Power BI dashboards that replaced fully manual reporting.

Currently building medical imaging data pipelines for AI-assisted clinical research at UC Medical Center, including a U-Net that segments 6 cardiac structures and reaches 0.897 Dice on LIMA. Published researcher (Springer, ICDMLA 2024).

Education: M.Eng. Computer Science · University of Cincinnati · GPA 4.0 / 4.0
Certs: AWS Certified Data Engineer · IBM Data Engineering Professional
Location: Cincinnati, OH · Open to remote & US relocation
Targeting: Data Engineer · AI/ML Engineer · Analytics Engineer

Career

Work Experience

Mar 2026 – Present
UC Medical Center
Cincinnati, OH
Research Associate
  • Built an end-to-end medical imaging pipeline processing 1,559 DICOM files per patient — handling inconsistent DICOM metadata and automating SMARTPHASE series selection, STL-to-voxel conversion, U-Net inference, and color-coded STL export to support 3D surgical modeling and intraoperative AR visualization
  • Trained a U-Net segmenting 6 cardiac structures (Aorta, LAD, LIMA, Heart, Ribs, Sternum) with weighted loss up to 12× for class imbalance; achieved 0.897 Dice on LIMA, establishing a reproducible baseline for future patient cohorts
  • Designed incremental retraining so new patient batches onboard without full pipeline restarts and inference stays reproducible across batches. Sole engineer on the project, collaborating directly with the clinical research team
TensorFlow · U-Net · DICOM · Python · 3D STL · NumPy
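For readers unfamiliar with the Dice score quoted in this role, here is a minimal sketch of how the metric is computed for one structure. It is an illustration only, not the project code: the function name, the smoothing constant, and the toy masks are all hypothetical.

```python
def dice_coefficient(pred, target, smooth=1e-6):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # For 0/1 masks the element-wise product counts the overlapping voxels.
    intersection = sum(p * t for p, t in zip(pred, target))
    # The small smoothing term (an assumption) avoids division by zero on empty masks.
    return (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


# Toy 1-D masks standing in for the flattened voxel labels of one structure.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(predicted, reference)
```

Two of the three predicted voxels overlap the three reference voxels here, so the score is about 0.667. A perfect match gives 1.0.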
Sep 2024 – Apr 2026
University of Cincinnati
Cincinnati, OH
Data Research Assistant
  • Built an automated ingestion system polling real-time GPS location and ridership data from 12 shuttle routes via REST APIs every 30 minutes; stored 8,000+ daily records as raw JSON in AWS S3, replacing fully manual data collection
  • Engineered SQL and Pandas transformation pipelines to clean, deduplicate, and model raw API payloads into analysis-ready datasets; maintained automated weekly refresh pipelines cutting manual report preparation time by 60%
  • Delivered 3 Power BI dashboards tracking route utilization, peak-hour ridership, and scheduling patterns — providing operations staff with visibility into scheduling gaps across all 12 routes to support data-driven planning decisions
Python · REST APIs · AWS S3 · SQL · Pandas · Power BI
Jan 2022 – Jul 2024
Wells Fargo
Hyderabad, India
Data Engineer
  • Designed and owned ELT ingestion pipelines on Azure Data Factory and Databricks processing 5TB+ of daily banking transactions into a Delta Lake Medallion Architecture (Bronze/Silver/Gold), ensuring reliable, audit-ready data storage for 3 regional analytics teams
  • Built star schema data models on top of the Medallion layers enabling consistent, self-service reporting across Power BI and Tableau; partnered with BI engineers and product managers to define SLAs and deliver high-quality datasets
  • Built Apache Kafka (Confluent) producers/consumers for real-time banking transaction ingestion — cut data latency from T+1 batch to under 5 minutes, enabling near-real-time fraud monitoring dashboards
  • Deployed Splunk observability and automated schema validation across all pipelines; reduced silent data failures by 25% and sustained 99.9% SLA; partnered with analysts to improve data availability and self-service reporting
  • Enforced SOX, GDPR, and CCPA compliance via PII masking, IAM access controls, and data lineage tracking; built CI/CD with GitHub Actions and Azure DevOps for automated pipeline deployment across environments
Azure ADF · Databricks · Delta Lake · Kafka (Confluent) · PySpark · Splunk · Power BI · Tableau · Azure DevOps
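The PII masking mentioned in this role can be pictured with a small, generic sketch. It does not reproduce the production implementation. The field names, the helper functions, and the demo key are hypothetical, and it assumes two common techniques: partial masking for display, and a keyed hash for stable pseudonyms.

```python
import hashlib
import hmac


def mask_account(account_number):
    """Keep the last four characters and replace the rest with asterisks."""
    return "*" * max(len(account_number) - 4, 0) + account_number[-4:]


def pseudonymize(value, secret_key):
    """Keyed SHA-256 hash: the same input always maps to the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical record and key; a real key would come from a secrets manager.
record = {"account": "1234567890123456", "email": "jane@example.com", "amount": 42.5}
key = b"demo-key-not-for-production"

masked = {
    "account": mask_account(record["account"]),
    "email": pseudonymize(record["email"], key),
    "amount": record["amount"],  # non-sensitive fields pass through unchanged
}
```

Because the keyed hash is deterministic, masked datasets can still be joined on the pseudonymized column without exposing the original value.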

Deep Dive

Featured Case Study:
CDC Replication Pipeline

A personal project simulating a change data capture system — from PostgreSQL source to Redshift Serverless analytics target. Built to demonstrate production CDC patterns: schema evolution, idempotent consumers, end-to-end data lineage, and infrastructure-as-code.

Stack
Debezium · Amazon MSK · AWS Glue · Redshift · OpenLineage · dbt · Terraform
Key result
<30s replication lag  ·  5K+ events/sec
PostgreSQL
Source DB · WAL logical replication enabled
Debezium Connector
Captures row-level INSERT / UPDATE / DELETE events
Amazon MSK (Kafka)
Avro-serialized events · Schema Registry
AWS Glue Consumer
Idempotent writes · DynamoDB offset tracking · schema evolution
Redshift Serverless
dbt models · data contracts · OpenLineage lineage graph
Airflow + Marquez
Orchestration · health checks · visual lineage UI
<30s Replication lag
5K+ Events/sec throughput
Zero Manual schema interventions
100% Idempotent delivery
Debezium · PostgreSQL WAL · Amazon MSK · Avro + Schema Registry · AWS Glue · Redshift Serverless · DynamoDB · OpenLineage · Marquez · dbt · Airflow · Terraform · GitHub Actions
Schema Evolution Without Downtime
Source tables change without warning. Built Avro schema evolution with Schema Registry compatibility checks and AWS Glue schema auto-detection — Redshift targets adapt to column additions or type changes without manual intervention or pipeline restarts.
Exactly-Once Delivery
Kafka consumers can reprocess messages after a failure, which would otherwise produce duplicate rows. DynamoDB offset tracking combined with idempotent upsert logic in Glue makes redelivery harmless: each event takes effect exactly once in the target, regardless of retries.
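The pattern can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the Glue job itself: the class name, the event shape, and the dictionaries standing in for the Redshift table and the DynamoDB offset store are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """In-memory stand-in for a target table plus an offset store."""
    rows: dict = field(default_factory=dict)         # primary key -> row
    last_offset: dict = field(default_factory=dict)  # partition -> last applied offset

    def apply(self, partition, offset, event):
        # A replayed message has an offset at or below the last applied one: skip it.
        if offset <= self.last_offset.get(partition, -1):
            return False
        if event["op"] == "delete":
            self.rows.pop(event["key"], None)
        else:
            # Inserts and updates are both written as an upsert keyed on the primary key.
            self.rows[event["key"]] = event["row"]
        self.last_offset[partition] = offset
        return True


sink = IdempotentSink()
events = [
    (0, 0, {"op": "insert", "key": 1, "row": {"balance": 100}}),
    (0, 1, {"op": "update", "key": 1, "row": {"balance": 80}}),
    (0, 2, {"op": "insert", "key": 2, "row": {"balance": 5}}),
]
# Deliver the batch twice to simulate a consumer retry after a failure.
applied = [sink.apply(*event) for event in events + events]
```

The second delivery is rejected by the offset check, so the table ends in the same state as after a single delivery. In a real system the row write and the offset update are separate operations, which is why the upsert itself also has to be safe to repeat.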
Data Lineage Across the Full Stack
Integrated OpenLineage emitters at each pipeline stage, feeding Marquez for a visual lineage graph from PostgreSQL source through to Redshift dbt marts — any downstream issue can be traced back to its origin table.
dbt Contracts at CI Time
dbt data contracts enforce column names and types. If a source schema change breaks a downstream mart, the CI pipeline catches it before merge — not after data has silently degraded in production.
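The kind of check a contract performs can be illustrated outside dbt with a short Python sketch. It is not dbt's implementation: the function, the column names, and the choice to flag extra columns as violations are assumptions made for this example.

```python
def contract_violations(contract, actual):
    """Compare declared column types with the columns a model actually produces."""
    problems = []
    for column, expected_type in contract.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"type mismatch on {column}: expected {expected_type}, got {actual[column]}"
            )
    # This sketch also treats undeclared columns as violations (an assumption).
    for column in actual:
        if column not in contract:
            problems.append(f"unexpected column: {column}")
    return problems


# Hypothetical contract for a mart, and the schema produced after an upstream change.
contract = {"txn_id": "bigint", "amount": "numeric", "posted_at": "timestamp"}
produced = {"txn_id": "bigint", "amount": "varchar", "region": "varchar"}
violations = contract_violations(contract, produced)
```

A CI job would fail the build whenever the returned list is non-empty, which is what keeps a breaking schema change from reaching the merge step.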
Infrastructure as Code
Every AWS resource — MSK cluster, Glue jobs, Redshift Serverless, DynamoDB table, IAM roles, VPC — is Terraform-managed. The entire stack can be torn down and recreated in one command.

Portfolio

Selected Projects

Data Engineering
Featured Project
CDC Replication Pipeline
PostgreSQL WAL → Debezium → Amazon MSK → AWS Glue → Redshift Serverless. Idempotent consumers with DynamoDB offset tracking, Avro schema evolution, OpenLineage lineage graph, and full Terraform-managed AWS infra. Replication lag under 30 seconds.
Debezium · Amazon MSK · AWS Glue · Redshift · OpenLineage · dbt · Terraform
AI & ML
Production-Style Build
AI Data Pipeline + RAG with Drift Detection
Hybrid retrieval pipeline combining Pinecone vector search and Snowflake structured features. MMD drift detection flags embedding distribution shifts before they degrade retrieval quality. MLflow cost tracking per run across 10K+ events/sec on 8 Kafka partitions.
OpenAI · Pinecone · Snowflake · Kafka · MLflow · MMD Drift · Airflow
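MMD (maximum mean discrepancy) compares two samples of vectors through a kernel. The sketch below shows the idea on synthetic data using only the standard library. It is not the project code: the RBF kernel, the gamma value, the sample sizes, and the helper names are assumptions for illustration.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def sample(mean, n=100, dim=4):
    """Draw n synthetic 'embeddings' with independent Gaussian coordinates."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


reference = sample(0.0)   # baseline embedding batch
same_dist = sample(0.0)   # new batch drawn from the same distribution
shifted = sample(2.0)     # new batch whose mean has drifted

mmd_same = mmd_squared(reference, same_dist, gamma=0.25)
mmd_shifted = mmd_squared(reference, shifted, gamma=0.25)
```

The drifted batch yields a much larger value than the batch drawn from the baseline distribution. A drift monitor would raise a flag when the statistic crosses a threshold calibrated on known-good batches.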
Data Engineering
Production-Style Build
Real-Time Fraud Detection Pipeline
Transaction streams via Azure Event Hubs → PySpark anomaly detection with dead-letter queue → dbt Bronze/Silver/Gold layers provisioned via Terraform. Automated Great Expectations quality gates at each layer. Live Power BI dashboards with per-event audit logging.
Azure Event Hubs · PySpark · dbt · Great Expectations · Terraform · Power BI
Data Engineering
Production-Style Build
Cloud-Native Lakehouse with FinOps
Medallion architecture: Bronze (Parquet/S3) → Silver (Apache Iceberg) → Gold (Snowflake). AWS Cost Explorer + Snowflake ACCOUNT_USAGE for daily per-tier cost tracking. Automated partition pruning and cluster rightsizing with dbt data contracts enforced at CI time.
AWS S3 · Apache Iceberg · Databricks · Snowflake · dbt · Airflow · Terraform
AI & ML
Production-Style Build
OutreachAI — LLM Recruiter Outreach Platform
FastAPI + React + PostgreSQL platform. Gemini 2.5 Flash generates and scores outreach emails (tone / relevance / professionalism) before send. dbt models track outreach funnel and LLM prompt performance. 5 Airflow DAGs, 15-test pytest suite, GitHub Actions CI/CD.
FastAPI · Gemini 2.5 Flash · dbt · Airflow · PostgreSQL · React/TS · Docker
Research
Active · UC Medical Center
AR_Cardio — Cardiac Vessel Segmentation
U-Net pipeline segmenting Aorta, LAD, LIMA from CT scans (1,559 DICOM files). Exports 5-structure colored 3D STL models for surgical planning and 3D printing. Phase 1 complete: LIMA Dice 0.897. Phase 2 targeting 50–100 patients pending IRB approval.
TensorFlow · U-Net · DICOM · Python 3.9 · 3D STL · NumPy
Research
★ Published · Springer · 2024
Hate Speech Detection
Published with Springer (2024) at ICDMLA. Multi-model NLP pipeline (Decision Trees, KNN, Random Forest) for social media hate speech detection. Key finding: hate speech lacks unique discriminative linguistic features, a methodological insight contributed to the literature.
Python · Scikit-Learn · NLP · WordCloud · LazyPredict · Pandas
Research
Seeded AR_Cardio Collaboration
Abdominal Trauma Detection
CNN-based pipeline for automated abdominal trauma detection in CT scans with SHAP explainability for clinical interpretability. This project led directly to the AR_Cardio collaboration — the UC Medical Center surgeon reached out after seeing it on the resume.
TensorFlow · CNN · SHAP · Python · OpenCV

Stack

Technical Skills

Languages
Python · SQL (T-SQL / PostgreSQL) · PySpark · Bash
Azure
ADF · Synapse · Event Hubs · Databricks · ADLS · Azure DevOps
AWS
Redshift · S3 · Glue · Kinesis · MSK · Lambda · Step Functions · CloudWatch
GCP & Snowflake
BigQuery · Dataflow · Snowflake
Data Engineering
Apache Kafka · Debezium CDC · Airflow · dbt · Delta Lake · Apache Iceberg · OpenLineage · Medallion Architecture · Splunk
ML / AI
TensorFlow · PyTorch · Scikit-Learn · MLflow · U-Net · MMD Drift Detection · RAG · Gemini API · Pandas · NumPy
BI & Quality
Power BI · Tableau · Great Expectations · Schema Drift Detection · Data Lineage
DevOps
CI/CD · GitHub Actions · Azure DevOps · Docker · Terraform
Compliance
SOX · GDPR · CCPA · PII Masking · IAM Access Controls · Audit Logging

Competitions

Hackathons & Challenges

Computer Vision · AI
WeCrafts Event
AI-Powered Handloom Authentication System
Built an app that detects the quality of fabric and determines whether it is original handloom-woven or machine-made — helping preserve and authenticate traditional artisan work. Used image classification and texture analysis to distinguish hand-weaving patterns from machine-produced imitations.
Computer Vision · Image Classification · Python · AI/ML
YOLO · OCR · Document AI
Hackathon
AI Document Validation for Government Scheme Applications
Developed an AI system that automatically detects misplaced document uploads (Aadhaar, PAN, signature, photographs) in government scheme applications using YOLO-based classification and OCR validation. Implemented real-time quality checks and field-level verification, reducing form rejection rates and manual review effort.
YOLO · OCR · Document Classification · Python · Computer Vision

Beyond the screen

A little more about me

🎭
Kuchipudi Dancer
Trained classical dancer in Kuchipudi, one of India's major classical dance forms. Years of rigorous practice taught discipline, precision, and performance under pressure — qualities that carry over into engineering.
🎨
Artist
Painter working across multiple styles. Creating art sharpens the ability to see structure in complexity — useful when designing data models and system architecture diagrams that need to communicate clearly at a glance.
🏸
Badminton Player
Competitive badminton player. The sport demands fast decision-making, spatial awareness, and staying calm under pressure — skills that translate directly to debugging production pipelines at 2am.

Let's connect

Open to Opportunities

Actively seeking Data Engineer and AI/ML Engineer roles. If you're building serious data infrastructure, I'd love to talk.