Design and maintain evaluation systems for machine learning models, ensuring quality and deployment controls for enterprise workflows. Requires strong Python skills, experience with ML or software engineering, and ability to work with noisy data.
Job Description
Machine Learning Eval Engineer
San Francisco, CA · On-site · Full-time
$150K–$300K base + competitive equity
The company
The vast majority of enterprise data lives in PDFs, spreadsheets, and other files that are awkward for models to handle. The company builds software that turns those documents into LLM-ready inputs with the accuracy and deployment controls enterprise workflows require.
The business has grown revenue 8x year over year, has hundreds of companies using the product, and now processes tens of millions of pages every month.
Customers include leading AI teams like Harvey, Vanta, Scale, and Meta, plus enterprises across FAANG and top trading firms.
Deployment is part of the product: cloud, fully air-gapped environments, SOC 2 and HIPAA compliance, and zero data retention.
Founded in 2023, the company has raised over $100M from a16z, Benchmark, and First Round Capital. The team includes people from Stripe, Discord, Scale AI, Dynamo AI, HRT, BAM, and similar companies.
The role
This person owns how the company measures model quality.
You will design the evaluation systems, benchmarks, and inspection tools that tell the team where models are strong, where they fail, and which failures are important enough to change training or product decisions.
The work sits between ML engineering, data analysis, and lightweight tooling. You will work closely with ML, platform, and GTM teams, and you will often be the one translating a vague customer problem into a reproducible benchmark.
This role shapes release confidence and the training priorities that come next.
The technical problem
Document workflows fail in ways that are hard to see in curated benchmarks: layout shifts, table errors, scan quality, long-tail formats, and customer-specific distributions.
A model can look excellent on a small sample and still break on the cases that matter in production.
The hard part is building an evaluation system that stays useful as the product and data distribution change: fast regression checks, hard-example mining, bespoke customer benchmarks, and metrics that predict real-world performance.
The eval surface is already large enough that manual review does not scale. The systems you build need to operate across billions of documents and still surface signals the ML team can act on.
What you'll own
• Benchmarks and regression suites: design and maintain test sets that capture real document failure modes, not only curated samples.
• Failure detection: build metrics, heuristics, and automated workflows that surface new errors across large, messy datasets.
• Model feedback loops: turn evaluation results into training priorities, error analysis, and concrete model improvements with the ML team.
• Document inspection: work hands-on with PDFs, spreadsheets, and other difficult formats to find edge cases and construct hard examples.
• Internal and customer-facing tooling: build lightweight Python tools, including simple Flask interfaces, so teams can inspect outputs and explain model behavior.
• Customer-specific evals: partner with customers and GTM to define bespoke benchmarks that reflect real deployment requirements.
• Data plumbing: use AWS S3 and analytics systems like Tinybird to store, query, and analyze large-scale evaluation runs.
Who this is for
You are likely a fit if you have:
• 1–5 years of experience in ML or software engineering, with work that already shows strong independence.
• Strong Python skills and the ability to build clean, reliable technical solutions without much hand-holding.
• Experience building evaluation, analytics, or data systems end to end.
• Comfort working through noisy data, ambiguous failure cases, and metrics that need to be validated before anyone trusts them.
• Enough product sense to build tools other engineers, researchers, or customer-facing teams will actually use.
• Comfort with AWS S3 and OLAP or analytics systems like Tinybird.
• The habit of taking ownership from problem definition through implementation and iteration.
• Clear communication with both technical and non-technical stakeholders.
You do not need to come from document AI specifically, but you should enjoy getting close to the data and the failure cases instead of staying at the level of abstract model metrics.
Tech stack
• Python
• Flask for lightweight internal tools
• AWS S3
• OLAP / analytics systems such as Tinybird
• LLM evaluation tooling and data inspection workflows are a plus
The stack is intentionally small because the hard part is the measurement system, not the framework.
Why now
The company already has real customer usage and real enterprise constraints. Revenue has grown 8x year over year, and the product is being used in environments where deployment flexibility and data handling are part of the buying decision.
The next constraint is not whether the company can process documents. It is whether the team can measure quality well enough to keep improving models as the customer base and document surface area expand.
The evaluation systems you build will run at significant scale, across billions of documents, so reproducibility and signal quality matter more than one-off analysis.
This role will define that measurement layer.
This role is not for you if
• You want a pure research role detached from production systems.
• You prefer clean datasets over messy real-world document distributions.
• You need fully specified tickets before you can start.
• You are not comfortable building tools and workflows that other teams depend on.
• You do not want to be on-site in San Francisco five days a week.
Compensation and logistics
• Base salary: $150K–$300K
• Equity: competitive
• Location: San Francisco, CA
• Work model: on-site, 5 days per week
• Employment: full-time
• Visa sponsorship: available on a case-by-case basis
Benefits include daily lunch, transportation reimbursement, health insurance, a wellness budget, and parental leave support.
About Aurora
Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.
We work with teams that value high ownership, strong technical standards, and clear scope.