Job description

About the role

You will join an early-stage, AI-first startup that already has clear market traction. The company develops advanced AI products for Governance, Risk and Compliance (GRC) teams at enterprises globally. Because the users are auditors, risk managers, and compliance professionals, the product must meet high standards for evaluation rigor, traceability, and readiness for EU AI Act requirements.

As the AI Engineer for LLMOps & Evaluation, you will take full ownership of the LLMOps pipeline and work closely with the founding team. The role is centered on turning experiments into reliable customer-facing AI capabilities, while continuously improving quality, reliability, and operational control.

What you'll do

Take end-to-end ownership of the LLMOps workflow, including evaluation infrastructure, prompt improvement loops, and production integration.
Define the evaluation approach for each output type, choosing between deterministic checks (such as exact match, schema validation, and embedding similarity) and LLM-as-judge where appropriate.
Build the supporting framework for trustworthy evaluation, including rubrics, test datasets, and human review processes.
Lead prompt engineering across the product and evolve prompt work from manual tuning into a measurable, repeatable optimization process.
Select the most suitable technical approach for each problem, whether that means an LLM-based solution, embedding and classical NLP methods, or deterministic logic.
Own the production operation of AI features, including observability, latency and cost optimization, and incident handling when performance drops.
Create human-in-the-loop workflows such as review queues, feedback capture, and labeling so production insights improve evaluations and prompt iterations.
Mentor the AI & Analytics Intern and help shape how the AI team develops over time.

What we're looking for

At least 3 years of practical experience building and deploying ML or AI systems in production, with emphasis on shipped outcomes rather than CV length.
Direct experience owning an LLM evaluation or prompt optimization pipeline end to end.
Strong hands-on knowledge of LLM-as-judge, including common variance issues and ways to control them.
A solid base in classical NLP and ML techniques, including embeddings, semantic similarity, entity matching, classification, and fuzzy matching.
Practical judgment on when deterministic evaluation is better than LLM-based evaluation, informed by real experience.
Experience handling production tradeoffs around cost, latency, observability, incident response, and prompt regressions for an LLM-powered feature.
Excellent Python skills.
Strong written and spoken English for frequent technical discussions with the founders and customers.
Comfort working through ambiguity, experimenting on real data, building intuition, and knowing when to stop iterating.

Nice to have

Experience with LLM observability and evaluation tools such as Langfuse, LangSmith, Phoenix/Arize, Helicone, Braintrust, or Weights & Biases.
Familiarity with DSPy or similar prompt optimization frameworks, including where they are effective and where they fall short.
Exposure to Azure OpenAI in EU regions or EU-sovereign providers such as Mistral or Aleph Alpha.
Background in guardrails, content safety, or AI governance.
Experience in enterprise software, especially GRC, compliance, audit, or regulated environments.
Working familiarity with Java/Spring Boot or Kubernetes on Azure, to integrate AI features cleanly with the existing stack.
German language skills.

Benefits

Direct ownership of a real AI product used by enterprise customers.
Close collaboration with the founding team from the start.
A hybrid setup based in Munich North, with at least one day per week in the office and flexibility otherwise. Strong candidates elsewhere in the EU may also be considered; onboarding will happen in person.
A steep learning opportunity at the intersection of LLM engineering, enterprise GRC, and startup operations.
The opportunity to help build and shape the AI team as the company grows.

About the company

The company builds AI-driven GRC solutions for enterprises. It is early-stage, growing quickly, and already supported by paying customers. The technology stack includes Java and Spring Boot, Angular, Kubernetes on Azure, and LLMs from OpenAI and Anthropic.

AI Engineer for LLMOps & Evaluation (m/f/d)
