The question to ask every software company you're invested in

By Amr Ellabban, Head of AI

Every software company I speak to has an AI story by now. Most of them are compelling. But over the past two years, talking to prospective new investments and to companies in our own portfolio, I've started asking one question that cuts through the narrative and gets to the impact faster than any other.

How do you evaluate whether AI is getting better at the specific thing your customers pay you for?

The answer tells you a lot. Not about ambition, investment levels, or technical talent, but about whether a company is building something durable with AI, or simply riding improvements in the underlying LLMs without accumulating any compounding advantage of its own.

What's actually being evaluated?

The frontier AI labs have spent years solving a version of this problem for themselves. How do you know if a new model is genuinely better, rather than just better at appearing better? The answer was structured benchmarks (“evals”).

For example, SWE-bench tests whether a model can resolve real software engineering problems. Others cover mathematical reasoning, legal analysis, and medical diagnosis. Each defines a specific task, a body of source material with ‘golden answers’, and a rubric that domain experts can apply repeatedly. Run across successive model releases, they show whether progress is real.
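
In code terms, the shape of one of these evals is simple. The sketch below is illustrative rather than a description of SWE-bench or any other real benchmark: a handful of tasks with golden answers, a scoring rule, and a single number that can be re-run every time a new model ships. The ask_model function is a hypothetical stand-in for whichever model is under test.

```python
# Minimal sketch of an eval: tasks with golden answers, a scoring rule,
# and one number that can be re-run against every model release.
# `ask_model` is a hypothetical stand-in for the model being tested.

TASKS = [
    {"prompt": "Fix the off-by-one bug in last_n(items, n)", "golden": "return items[-n:]"},
    {"prompt": "What is 15% of 240?", "golden": "36"},
]

def ask_model(prompt: str) -> str:
    # Placeholder: call the model version under evaluation here.
    return ""

def score(answer: str, golden: str) -> float:
    # Simplest possible rubric: exact match. Real evals use richer,
    # expert-written grading criteria.
    return 1.0 if answer.strip() == golden.strip() else 0.0

def run_eval() -> float:
    results = [score(ask_model(t["prompt"]), t["golden"]) for t in TASKS]
    return sum(results) / len(results)

print(f"Eval score: {run_eval():.2f}")  # Track this number across model releases.
```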

Software engineering is a useful starting point because it's one of the few domains where 'correct' has a relatively clear meaning. A bug is either fixed, or it isn't. Competent engineers can typically assess the quality of the output, regardless of which industry the software serves.

Most professional workflows aren't like that. Take payroll. The core calculation is deterministic, rules-based software, built and refined over decades. AI isn't replacing that. What it's starting to do is handle the ring of expert work that surrounds it: interpreting edge cases, responding to queries, flagging anomalies, and preparing for audits. That work currently sits with human experts. Agentic tools are beginning to take it on.

To know whether those tools are getting better, you need a rubric. What does a good answer look like when an employee raises a query spanning two jurisdictions? What counts as correctly flagged versus incorrectly escalated? There's no universal standard. The rubric has to be built by people who've spent years in that specific domain, seeing that specific set of edge cases across specific customer sets. You can only define what good looks like if you deeply understand both the deterministic core the agentic layer operates around and your own customers’ processes.
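
To make that concrete, here is an illustrative sketch of what a single entry in such a rubric might look like for the payroll example. Every name, criterion, and threshold below is invented for illustration; the point is the structure, not the content.

```python
# Illustrative sketch only: one entry from a hypothetical domain-specific
# rubric for a payroll agent, plus the grading step that can be re-run
# against every model release. All names and criteria are invented.

RUBRIC = [
    {
        "case": "Employee query spanning two tax jurisdictions in one pay period",
        "expected_action": "flag_for_review",   # vs. "auto_resolve" or "escalate"
        "must_mention": ["tax residency", "split withholding"],
    },
]

def grade(entry: dict, agent_action: str, agent_response: str) -> float:
    """Score one agent output against one expert-written rubric entry (0 to 1)."""
    action_ok = agent_action == entry["expected_action"]
    coverage = sum(term.lower() in agent_response.lower()
                   for term in entry["must_mention"]) / len(entry["must_mention"])
    return 0.5 * action_ok + 0.5 * coverage

# Example: the agent flagged the case but only mentioned tax residency.
print(grade(RUBRIC[0], "flag_for_review", "Flagged: check tax residency rules."))  # 0.75
```

The specifics are made up, but notice where they have to come from: the expected action and the must-mention points can only be written by people who know the domain and the customers.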

That's the moat. A new entrant with access to the same frontier model can build a product. They can't yet define what good looks like for your customers.

What we've learned

This distinction matters for how investors should think about AI progress in their portfolios. Many AI features are thin wrappers over the underlying models: a point-in-time bet on a particular model or architecture. But the underlying models keep improving, which means any specific capability likely has a short shelf life. An in-house evaluation framework is different. It survives model changes and improves with each iteration. It's the thing that compounds.

We've been working on this across the portfolio for two years, drawing on the experience of Hg Catalyst, our in-house AI product team, who build and ship AI products directly inside portfolio companies. The companies that have progressed fastest brought domain experts into the process early, rather than leaving it to engineering teams working in isolation.

Writing an in-house, domain-specific eval rubric demands deep domain knowledge more than software engineering expertise. The decision to build one, and to hold the business to account against it, is therefore a business decision rather than a technical one. And these scores deserve to sit alongside revenue growth and net revenue retention (NRR) as board-level metrics, not stay buried with technical teams several levels below the CEO.

That's why you can learn a lot from this question. If the answer is specific and measurable, you know the company is building genuine AI advantage into its products. If it's vague, there's work to do.