Role overview

Designing and implementing evaluation frameworks to measure LLM and agent performance across reasoning, accuracy, multi-turn dialogue, and tool usage
Creating and open-sourcing benchmarks for evaluating LLM output on investment-research-specific tasks such as synthesis quality and citation grounding
Building prompt refinement systems that learn from production signals and human feedback to improve reliability and performance
Developing and maintaining agentic tooling including research assistants, deep research flows, and voice agents
Integrating external APIs, search, speech-to-text, and text-to-speech technologies into production systems
Prototyping lightweight voice agent frameworks with strong evaluation around latency, error recovery, and conversational flow
Collaborating closely with research and product teams to productionize new prompting, retrieval, and multi-agent orchestration techniques
Contributing meaningfully to product direction, prioritization, and long-term technical strategy

The ideal candidate

Is based in NYC (in-person, 5 days per week)
Has 5+ years of professional experience, with recent, hands-on work with LLMs
Has strong opinions and enjoys contributing to product and architectural decisions
Communicates clearly and is comfortable in a client-facing environment
Can explain complex AI concepts to non-technical stakeholders and turn ideas into testable experiments
Has built LLM systems end to end in a product-focused organization, from data and logging to evaluation and prompt optimization
Has a strong bias to action and experience delivering complex projects with senior stakeholders
Is excited to help grow a team and shape engineering culture
Has deep, recent experience working with LLMs and agentic systems (required)
Has a strong software engineering mindset rather than a purely research-focused background
Is either a software engineer who has transitioned into LLM systems, or an ML engineer who has spent the last few years heavily focused on LLMs
Has experience forming clear views on improving LLM output, reliability, and evaluation
Shows leadership potential, with the opportunity to grow into a Head of AI role over time
