Senior AI Engineer · Berlin, Germany
I build cost-aware LLM/RAG/agent evaluation tooling, production AI systems, and product architectures for sensitive data domains.
Current direction
My work sits between applied AI engineering, production LLM systems, privacy-first product engineering, and business-impact optimization. I care about the practical tradeoffs that decide whether a product survives production: model quality, latency, reliability, inference cost, privacy boundaries, and maintainability.
At Solenix Engineering, I contribute to ESA and EUMETSAT-related AI initiatives across satellite health forecasting, telemetry anomaly detection, AI validation workflows, synthetic QA generation for RAG evaluation, multi-agent LLM systems, and Kubernetes/GitOps deployments. In public, I am building Pangolin Eval and Health Passport to show that the same production discipline carries over into useful tools.
Focus areas
LLM, RAG, and agent workload measurement across quality, latency, cost, reliability, and release gates.
RAG, AI agents, evaluation workflows, provider switching, synthetic QA data, and observability.
Local-first data flows, permission boundaries, receipts, and product architecture for sensitive domains.
MLflow, CI/CD, Docker, Kubernetes, GitOps, monitoring, and model lifecycle management.
Selected projects
Flagship open-source AI project
Pangolin Eval: public Python CLI/library for measuring LLM, RAG, and agent workloads across cost, latency, quality, and reliability. Includes weighted evaluators, gates, RAG diagnostics, TraceCards, OTel-style exports, gateway examples, Docker demos, and a v0.2.2 release track.
Product build
Health Passport: privacy-first iOS continuity layer for Fitbit/Google wearable data. It imports supported data, preserves normalized records locally, and writes clean supported samples back to Apple Health with user permission.
Product-style macOS maintenance CLI with dry-run-first safety, local memory, rules, profiles, hooks, and scriptable output.
RAG and LLM application experiments, including PDF chat workflows with LangChain, FAISS, and OpenAI embeddings.
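To make the "weighted evaluators" and "gates" idea from the evaluation toolkit above concrete, here is a minimal Python sketch. It is not Pangolin Eval's actual API: the names Metric, weighted_score, and passes_gate, the sample scores, and the 0.8 threshold are all hypothetical, and it assumes every metric has already been normalized to the range 0 to 1 with higher meaning better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One normalized measurement in [0, 1], where higher is better (assumption)."""
    name: str
    score: float
    weight: float


def weighted_score(metrics: list[Metric]) -> float:
    """Weighted mean of the metric scores; weights do not need to sum to 1."""
    total_weight = sum(m.weight for m in metrics)
    if total_weight <= 0:
        raise ValueError("at least one metric needs a positive weight")
    return sum(m.score * m.weight for m in metrics) / total_weight


def passes_gate(metrics: list[Metric], threshold: float) -> bool:
    """Release gate: pass only if the weighted score reaches the threshold."""
    return weighted_score(metrics) >= threshold


# Hypothetical run: illustrative scores and weights, not measured results.
run = [
    Metric("quality", 0.9, 0.5),
    Metric("latency", 0.6, 0.2),
    Metric("cost", 0.8, 0.2),
    Metric("reliability", 1.0, 0.1),
]
print(round(weighted_score(run), 3), passes_gate(run, threshold=0.8))
```

A single weighted score keeps the gate easy to reason about in CI, at the price of letting a strong metric hide a weak one. Per-metric minimum thresholds are a common complement when that tradeoff matters.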
Experience
Published Pangolin Eval as an open-source local evaluation toolkit for LLM, RAG, and agent workflows. Building Health Passport as an iOS-first, privacy-first wearable data continuity product with HealthKit, TypeScript normalization rules, local vault boundaries, sync receipts, and backend Pro-service foundations.
Contributing to ESA and EUMETSAT-related AI initiatives across mission operations, satellite health forecasting, telemetry anomaly detection, AI validation, synthetic QA generation, multi-agent LLM systems, MLflow monitoring, and Kubernetes/GitOps deployment workflows.
Built LLM-backed and ML systems for entity intelligence, matching, ranking, and operational automation. Reduced monthly cloud expenditure by about 35% (more than $32K) through model, VM, and storage optimization.
Delivered forecasting, recommendation, segmentation, and neural translation systems tied to revenue, sales uplift, campaign response, and translation cost reduction.
Built NLP chatbots, sentiment analysis, demand forecasting, ETL automation, SQL optimization, and data analysis pipelines across startup and consulting environments.
Stack
Contact
I am building in public around Pangolin Eval, cost-aware LLMOps, model evaluation, privacy-first product engineering, and practical tooling for teams moving from prototype to production.