Lunette is a platform for understanding and improving agents. You can run your agents in isolated sandboxes using Lunette, then run Lunette investigations after the fact to understand your agent’s bottlenecks and whether your environment is measuring the right things. It is available to try directly here – contact us to scale your usage.

Issues in agent evals are pernicious. Even SWE-bench Verified, the most popular software engineering benchmark, human-validated by OpenAI, contains unsolvable tasks that are therefore useless for understanding an agent’s abilities. We manually annotated Django tasks in SWE-bench and found that 24% were broken due to test mis-specification – the tests enforce requirements that were never stated in the original spec given to the model. We ran the Lunette investigator on this set of tasks, with controls, and it achieved an issue detection accuracy of 81%, whereas a transcript-only baseline got 9%.

Lunette has investigator agents, which probe for issues in the same environment your agents ran in. These issues then get filtered and reported to the end user so they can improve their system.

What are you supposed to do with Lunette?
The main task Lunette helps you solve is figuring out whether there’s a difference between what you expect your agents to be doing on a task and what they’re actually doing – including what your task is actually measuring. We use Lunette to find cases where evals are measuring something you don’t realize they are: because your env is broken, your agent is failing in stupid ways, etc.

How Lunette Works

Users provide Lunette with investigation specs for agents or evals they want to understand. Lunette then launches investigator agents that operate in parallel. For each trajectory, an investigator agent reads the agent trace, modifies and runs commands in the eval environment to test hypotheses, and writes findings. Validator agents then critique these findings and filter for high-quality results. At the end, users can explore investigation results in the Fulcrum frontend and chat with an agent to learn more.
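To make the input concrete, here is a rough sketch of what an investigation spec might contain. This is illustrative only – the field names below are hypothetical placeholders, not Lunette’s actual schema.

```python
# Hypothetical sketch of an investigation spec -- field names are placeholders,
# not Lunette's real schema.
investigation_spec = {
    # Which eval runs / sandboxes to investigate.
    "eval": "swe_bench_verified_django",
    "trajectories": ["run-001", "run-002", "run-003"],
    # What the investigator agents should probe for in each trajectory.
    "question": (
        "Did the agent fail because the task is broken (e.g. tests enforce "
        "requirements absent from the problem statement), or because of a "
        "genuine agent mistake?"
    ),
    # One investigator per trajectory, run in parallel; validators filter findings.
    "investigators_per_trajectory": 1,
    "validate_findings": True,
}
```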

Validating Lunette by finding broken tasks in SWE-bench Verified

To validate and improve Lunette’s investigators, we manually found issues in SWE-bench tasks.

From OpenAI’s Introducing SWE-bench Verified:

Each sample in the SWE-bench test set is created from a resolved GitHub issue in one of 12 open-source Python repositories on GitHub. Each sample has an associated pull request (PR), which includes both the solution code and unit tests to verify code correctness. These unit tests fail before the solution code in the PR is added, but pass afterwards, and are therefore called FAIL_TO_PASS tests. Each sample also has associated PASS_TO_PASS tests, which pass both before and after the PR is merged, and are used to check that existing unrelated functionality in the codebase has not been broken by the PR.

For each sample in SWE-bench, agents are provided with the original text from the GitHub issue, known as the problem statement, and are given access to the codebase. Given these, agents must edit the files in the codebase to resolve the issue. The tests are not shown to the agent.

A proposed edit is evaluated by running both the FAIL_TO_PASS and PASS_TO_PASS tests. If the FAIL_TO_PASS tests pass, this means the edit solves the issue. If the PASS_TO_PASS tests pass, then the edit has not inadvertently broken unrelated sections of the codebase. Both sets of tests are required to pass for the edit to fully resolve the original GitHub issue.
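In code, this grading rule reduces to a conjunction over the two test sets. A minimal sketch (the function and argument names are ours, not the SWE-bench harness’s):

```python
def edit_resolves_issue(fail_to_pass_results: dict[str, bool],
                        pass_to_pass_results: dict[str, bool]) -> bool:
    """Return True iff a proposed edit fully resolves the task.

    fail_to_pass_results: outcomes of the FAIL_TO_PASS tests after applying the edit.
    pass_to_pass_results: outcomes of the PASS_TO_PASS tests after applying the edit.
    """
    solves_issue = all(fail_to_pass_results.values())     # issue-specific tests now pass
    nothing_broken = all(pass_to_pass_results.values())   # pre-existing behavior preserved
    return solves_issue and nothing_broken
```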

The core, repeated problem is that the FAIL_TO_PASS tests aren’t necessarily specified by the issues, since they are often provided by the implementer of the PR alongside the updated implementation.

This was, of course, the motivation for developing SWE-bench Verified in the first place, which used human validators to filter out problems with:

  1. Overly specific unit tests
  2. Underspecified issue descriptions
  3. Difficulty with environmental setup

In our investigation, we found that many of these issues persist even after filtering by human validators. To generate an eval set, we manually reviewed and understood a set of Django tasks, labeling each with the issues we found and whether it was impossible to solve as specified.

We then ran Lunette on this eval and compared it to a transcript-only baseline – the same base model and prompt (minus the Lunette-specific information), given only the agent transcript. We found that Lunette dominated (see fig. 1), validating that the tools and infrastructure we’ve built greatly improve our system’s ability to understand and detect issues in environments.

See [examples] for more details on the kinds of issues we found.

Fig. 1: Investigator performance

Try Lunette

Lunette works out of the box for Inspect-based evals with a one-line change to the sandbox type. We also have a live demo where you can upload your own data and run investigations here. We’d be really excited to see whether this tool is useful for you. To set it up, please book a call with us!
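For Inspect, that change is on the Task’s sandbox argument. Here is a minimal sketch – the "lunette" sandbox name is a hypothetical placeholder (check the docs, or ask us, for the actual value); everything else is a vanilla Inspect task:

```python
# Minimal Inspect task; the only Lunette-related change is the sandbox argument.
# "lunette" is a hypothetical placeholder for the actual sandbox type name.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def my_eval():
    return Task(
        dataset=[Sample(input="Say the word hello.", target="hello")],
        solver=[generate()],
        scorer=includes(),
        sandbox="lunette",  # was: sandbox="docker" -- the one-line change
    )
```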

Examples of SWE-bench issues

Most problematic tasks are ill-posed, in the sense that there simply isn’t enough information in the task description for the agent to solve the task in the intended way. (We only found one example of a well-posed yet problematic task; its tests were too lenient.)

The ill-posed tasks we reviewed can be classified into two groups.

  1. Problem descriptions that are too vague to clearly identify the intended solution. In these cases, the desired behavior of the code wasn’t specified in the issue report; it was only decided upon in the subsequent discussion.

    Example: django__django-11964. The bug is that TextChoices and IntegerChoices enum fields return different types for created versus retrieved model instances. When a model instance is created with a TextChoices or IntegerChoices field, the field value is an enum instance (e.g., MyChoice.FIRST_CHOICE), but when the same instance is retrieved from the database, the field value becomes a plain string or integer. This inconsistency particularly causes issues when calling str() on the field value or rendering it in templates, where created instances show “MyChoice.FIRST_CHOICE” while retrieved instances show “first”. The fix adds a custom __str__() method to the choice enums to ensure a consistent string representation. (A self-contained sketch of this inconsistency appears after this list.)

    The PR discussion reveals this task is problematic because there are seven different proposed solutions with fundamentally different intended behaviors. The developers debated whether to: (1) keep enum types everywhere, (2) convert to primitives on save, (3) convert in field descriptors, (4) override __str__() on the enums, (5) convert in to_python(), (6) use a proxy object, or (7) document the current behavior as intended. The agents have no way of knowing which of these solutions is expected of them.

  2. Tasks whose tests are so strict that they rule out valid solutions.

    Example: django__django-11790. The bug is that the client-side username validation in AuthenticationForm.UsernameField is broken, always using a maximum length of 150. This is because the maxlength attribute in the corresponding HTML widget is not set at initialization. The solution is a one-liner that sets that attribute.

    In our experiments, all agents successfully diagnose the bug. However, they all fail to complete the task, because they set the maxlength attribute to a str instead of the int the tests expect. Since previous versions of Django had set that same HTML attribute to a str, and the rendered HTML is the same either way, we think this task is problematic and that all of the agents should have passed it.
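To make the first group concrete, here is a self-contained sketch of the str() inconsistency from django__django-11964. It uses a plain str-backed Enum as a stand-in for Django’s models.TextChoices (the class and member names follow the issue report, but this is an illustration, not the actual Django code):

```python
# Illustrative stand-in for the behavior described in django__django-11964.
# A plain str-backed Enum mimics models.TextChoices before the fix: a freshly
# created model instance holds the enum member, while a database round trip
# hands back the underlying plain string, so str() renders them differently.
import enum


class MyChoice(str, enum.Enum):
    FIRST_CHOICE = "first"


created_value = MyChoice.FIRST_CHOICE   # value on the instance returned by objects.create(...)
retrieved_value = "first"               # value after re-fetching the instance from the database

print(str(created_value))                # MyChoice.FIRST_CHOICE
print(str(retrieved_value))              # first
print(created_value == retrieved_value)  # True: equal values, different string rendering
```

With the merged fix, str() of the enum member returns the underlying value, so both paths render as “first” – but nothing in the problem statement tells the agent that this is the expected behavior out of the seven options discussed above.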