Data Engineering
Featured Project
CDC Replication Pipeline
PostgreSQL WAL → Debezium → Amazon MSK → AWS Glue → Redshift Serverless. Idempotent consumers with DynamoDB offset tracking, Avro schema evolution, OpenLineage lineage graph, and full Terraform-managed AWS infra. Replication lag under 30 seconds.
Debezium · Amazon MSK · AWS Glue · Redshift · OpenLineage · dbt · Terraform
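As a rough illustration of the idempotent-consumer idea in the CDC pipeline above, the sketch below skips any record whose offset is not newer than the last committed offset for its partition. It is a minimal sketch, not the project's code: `OffsetStore` is a hypothetical in-memory stand-in for the DynamoDB offset table, and `consume`, `records`, and `sink` are names assumed for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class OffsetStore:
    """Hypothetical in-memory stand-in for an offset table (not DynamoDB itself)."""
    committed: dict = field(default_factory=dict)

    def is_new(self, partition: int, offset: int) -> bool:
        # A record is new only if its offset is beyond the last committed one.
        return offset > self.committed.get(partition, -1)

    def commit(self, partition: int, offset: int) -> None:
        self.committed[partition] = offset


def consume(records, store: OffsetStore, sink: list) -> int:
    """Apply each (partition, offset, payload) record at most once.

    Returns the number of records actually applied to the sink.
    """
    applied = 0
    for partition, offset, payload in records:
        if not store.is_new(partition, offset):
            continue  # duplicate or replayed record: skip it
        sink.append(payload)             # apply the change first
        store.commit(partition, offset)  # then record progress
        applied += 1
    return applied
```

Replaying a batch after a restart leaves the sink unchanged, because every replayed offset is already committed. In a real deployment the apply and commit steps would need to be atomic, or the sink write itself would need to be idempotent, which this sketch does not attempt.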
AI & ML
Production-Style Build
AI Data Pipeline + RAG with Drift Detection
Hybrid retrieval pipeline combining Pinecone vector search and Snowflake structured features. MMD (maximum mean discrepancy) drift detection flags embedding distribution shifts before they degrade retrieval quality. MLflow tracks cost per run at 10K+ events/sec across 8 Kafka partitions.
OpenAI · Pinecone · Snowflake · Kafka · MLflow · MMD Drift · Airflow
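To make the MMD drift check above concrete, here is a minimal NumPy sketch of the biased squared-MMD estimate with an RBF kernel, comparing a reference batch of embeddings against a new batch. The kernel choice, the default `gamma`, and the function names are assumptions for illustration; the project's actual estimator and alert threshold are not stated in the source.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x: np.ndarray, y: np.ndarray, gamma=None) -> float:
    """Biased estimate of squared MMD between samples x and y (rows = points)."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default: 1 / embedding dimension
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

A drift monitor built on this would compute `mmd2(reference, current)` per window and raise a flag when the value exceeds a threshold calibrated on windows known to be stable.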
Data Engineering
Production-Style Build
Real-Time Fraud Detection Pipeline
Transaction streams via Azure Event Hubs → PySpark anomaly detection with dead-letter queue → dbt Bronze/Silver/Gold layers provisioned via Terraform. Automated Great Expectations quality gates at each layer. Live Power BI dashboards with per-event audit logging.
Azure Event Hubs · PySpark · dbt · Great Expectations · Terraform · Power BI
Data Engineering
Production-Style Build
Cloud-Native Lakehouse with FinOps
Medallion architecture: Bronze (Parquet/S3) → Silver (Apache Iceberg) → Gold (Snowflake). AWS Cost Explorer + Snowflake ACCOUNT_USAGE for daily per-tier cost tracking. Automated partition pruning and cluster rightsizing; dbt data contracts enforced at CI time.
AWS S3 · Apache Iceberg · Databricks · Snowflake · dbt · Airflow · Terraform
AI & ML
Production-Style Build
OutreachAI — LLM Recruiter Outreach Platform
FastAPI + React + PostgreSQL platform. Gemini 2.5 Flash generates and scores outreach emails (tone / relevance / professionalism) before send. dbt models track outreach funnel and LLM prompt performance. 5 Airflow DAGs, 15-test pytest suite, GitHub Actions CI/CD.
FastAPI · Gemini 2.5 Flash · dbt · Airflow · PostgreSQL · React/TS · Docker
Research
Active · UC Medical Center
AR_Cardio — Cardiac Vessel Segmentation
U-Net pipeline segmenting Aorta, LAD, LIMA from CT scans (1,559 DICOM files). Exports 5-structure colored 3D STL models for surgical planning and 3D printing. Phase 1 complete: LIMA Dice 0.897. Phase 2 targeting 50–100 patients pending IRB approval.
TensorFlow · U-Net · DICOM · Python 3.9 · 3D STL · NumPy
Repo private during IRB review
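For readers unfamiliar with the Dice score quoted for LIMA in the AR_Cardio entry above, it is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask and a ground-truth mask. The sketch below is a generic NumPy version, not the project's evaluation code; the function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
import numpy as np


def dice(pred, target) -> float:
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return float(2.0 * intersection / total)
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they share no foreground voxels.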
Research
★ Published · Springer · 2024
Hate Speech Detection
Presented at ICDMLA and published by Springer (2024). Multi-model NLP pipeline (Decision Trees, KNN, Random Forest) for social media hate speech detection. Key finding: hate speech lacks unique discriminative linguistic features, a novel methodological insight contributed to the literature.
Python · Scikit-Learn · NLP · WordCloud · LazyPredict · Pandas
Research
Seeded AR_Cardio Collaboration
Abdominal Trauma Detection
CNN-based pipeline for automated abdominal trauma detection in CT scans, with SHAP explainability for clinical interpretability. This project led directly to the AR_Cardio collaboration: the UC Medical Center surgeon reached out after seeing it on the resume.
TensorFlow · CNN · SHAP · Python · OpenCV