Role overview
We're seeking a data-driven analyst to conduct comprehensive failure analysis of AI agent performance on finance-sector tasks. You'll identify patterns, root causes, and systemic issues in our evaluation framework by analyzing task performance across multiple dimensions, such as task type, file type, and evaluation criteria.
Responsibilities
- Statistical Failure Analysis: Identify patterns in AI agent failures across task components (prompts, rubrics, templates, file types, tags)
- Root Cause Analysis: Determine whether failures stem from task design, rubric clarity, file complexity, or agent limitations
- Dimension Analysis: Analyze performance variations across finance sub-domains, file types, and task categories
- Reporting & Visualization: Create dashboards and reports highlighting failure clusters, edge cases, and improvement opportunities
- Quality Framework: Recommend improvements to task design, rubric structure, and evaluation criteria based on statistical findings
- Stakeholder Communication: Present insights to data labeling experts and technical teams
Basic qualifications
- Statistical Expertise: Strong foundation in statistical analysis, hypothesis testing, and pattern recognition
- Programming: Proficiency in Python (pandas, scipy, matplotlib/seaborn) or R for data analysis
- Data Analysis: Experience with exploratory data analysis and creating actionable insights from complex datasets
- AI/ML Familiarity: Understanding of LLM evaluation methods and quality metrics
- Tools: Comfortable working with Excel, data visualization tools (Tableau/Looker), and SQL
Preferred qualifications
- Experience with AI/ML model evaluation or quality assurance
- Background in finance or willingness to learn finance domain concepts
- Experience with multi-dimensional failure analysis
- Familiarity with benchmark datasets and evaluation frameworks
- 2-4 years of relevant experience
Engagement details
- You will be engaged as an independent contractor.
- This is a fully remote role that can be completed on your own schedule.
- Projects can be extended, shortened, or concluded early depending on needs and performance.
- Your work will not involve access to confidential or proprietary information from any employer, client, or institution.
- Payments are made weekly via Stripe or Wise, based on services rendered.