Why does our tool keep getting better?

Short answer: it runs a loop. Most JSON fixers don't.

Where this came from

I'd been looking at Andrej Karpathy's autoresearch project. An AI agent edits training code, runs a short experiment, scores the result, and keeps or discards the change — then does it again, on its own, all night. The part that stuck wasn't the agent. It was the loop: look at the real data, trust only what you measured, keep what's better, throw out what isn't, go again.

So I asked a dumb question. Could that work on something as boring as a JSON repair tool?

This page is the answer. It did.

The same loop, pointed at JSON

The parallels aren't loose. They're the same design decisions, one domain over:

An ungameable metric. autoresearch scores bits per byte, picked so the agent can't win by changing the thing it's allowed to change. This tool's score is safety-weighted: a confident wrong answer costs -2, more than an honest failure. Neither metric can be cheated by the move it's measuring.
A keep-or-discard ratchet. autoresearch: change → train → score → keep or discard → again. This tool: change → benchmark → keep if better → revert if worse → again. Same loop. Neither hopes; both gate every change on a number.
A narrow, reviewable surface. The agent there only touches one file, so diffs stay readable. Here the engine is one deterministic parser and every run is logged by commit. You can read the whole trajectory.
The human owns the rules, not the grind. There, a human writes the instructions; the agent does the search. Here, I own the scoring and what counts as unsafe; the loop does the grinding.
Real data, not synthetic comfort. A small but real training setup there. Real broken JSON people actually complained about here. No made-up coverage.

One thing is deliberately not parallel. autoresearch puts an agent in the loop. This tool keeps the agent out of your runtime — the engine is deterministic, and the loop runs in development, never on your data. Same method, no unpredictable thing in your hot path.

Credit where it's due: the method came from Andrej Karpathy's autoresearch. This is that idea, aimed at the most boring target I could find — on purpose. If the loop makes even a JSON fixer measurably better, the loop is the point.

Poor, then fair, then leading

It started poor.

The first measured version: 33 test cases, a benchmark score of 21, and 2 unsafe repairs — output that looked like valid JSON but was wrong. That's the worst thing a tool like this can do. You ship corrupted data and don't find out until production.

Then it got fair.

We pulled real broken JSON from where people actually complain about it — Reddit, Stack Overflow, GitHub — and turned each one into a test. The corpus grew to 46 cases. The score went to 41. Unsafe repairs went to 0.

The corpus kept growing, so the raw score isn't a clean line — 33 cases isn't 52 cases, and you can't compare them like they are. Here's the number that is clean: unsafe repairs went 2 → 0, and have stayed at 0 on every run since.

Now it leads, in the places that matter.

On the current 52-case corpus (commit 04a509d), head to head with the standard jsonrepair npm library:

Metric	This engine	`jsonrepair` library
Correct outcomes	45 / 52	34 / 52
Repair accuracy	86.5%	—
LLM-style success	88.1% (37/42)	—
Safe-failure rate	100%	—
Unsafe repairs	0	7
Valid-but-wrong outputs	0	17

On the same inputs, the standard library returned 7 unsafe repairs and 17 outputs that were valid JSON but wrong. This engine returned 0 of each. That's the whole point.
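The two percentages in the table follow from the raw counts next to them. A minimal arithmetic check, using only figures from the table (the helper name `toPercent` is just for illustration):

```javascript
// Recompute a percentage from raw counts, rounded to one decimal place
// to match the table's formatting.
function toPercent(numerator, denominator) {
  return (100 * numerator / denominator).toFixed(1) + "%";
}

console.log(toPercent(45, 52)); // correct outcomes for this engine
console.log(toPercent(37, 42)); // LLM-style success
```

Both lines should reproduce the table's 86.5% and 88.1%.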

How a fix gets in

The loop is simple. It runs in development, between commits — never while you're waiting on a repair.

1. Mine real failures. A script searches Reddit, Stack Overflow, and GitHub for real broken-JSON complaints — "invalid json chatgpt", "jsonrepair not working". It pulls the broken snippet, sorts it by failure type, and freezes it as a test. The corpus grows from breakage that actually happened, not guesses about what might.

2. Score for safety, not just parsing. Every case runs through this engine and through the standard jsonrepair library. The scoring is lopsided on purpose:

+1 correct repair
+1 correct refusal on genuinely ambiguous input
-1 invalid output
-2 valid JSON that's semantically wrong
-2 ambiguous input silently guessed

We don't score "did it parse." We score "did it parse and not lie." Confident wrongness costs the most, because it's the most expensive thing that can happen to you.
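To make the lopsidedness concrete, here is a minimal sketch of the point table as code. The point values are the ones listed above; the outcome labels and the `scoreRun` function are hypothetical names for illustration, not the benchmark runner's actual code.

```javascript
// Points per outcome, copied from the list above.
// The label strings are assumptions made for this sketch.
const SCORE_POINTS = new Map([
  ["correct-repair", 1],
  ["correct-refusal", 1],
  ["invalid-output", -1],
  ["valid-but-wrong", -2],
  ["ambiguous-guessed", -2],
]);

// Sum the points for a list of per-case outcome labels.
// Unknown labels throw instead of being silently scored as zero.
function scoreRun(outcomes) {
  let total = 0;
  for (const outcome of outcomes) {
    if (!SCORE_POINTS.has(outcome)) {
      throw new Error(`Unknown outcome: ${outcome}`);
    }
    total += SCORE_POINTS.get(outcome);
  }
  return total;
}

// Three correct repairs and one confident wrong answer.
console.log(scoreRun([
  "correct-repair",
  "correct-repair",
  "correct-repair",
  "valid-but-wrong",
])); // 3 - 2 = 1
```

One valid-but-wrong output cancels two correct repairs, which is the asymmetry the scoring is built around.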

3. Keep it or revert it. A change ships only if the score goes up and neither unsafe repairs nor valid-JSON regressions increase. Otherwise it is reverted. Every run is logged by commit. The whole trajectory is on record, not asserted.
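The gate in step 3 fits in a few lines. This is a sketch under assumptions: the field names `score`, `unsafeRepairs`, and `validJsonRegressions` are hypothetical, and the real runner may record its results differently.

```javascript
// Decide whether a change survives the ratchet.
// `before` and `after` are benchmark summaries with hypothetical field names.
function shouldKeepChange(before, after) {
  const scoreImproved = after.score > before.score;
  const noNewUnsafe = after.unsafeRepairs <= before.unsafeRepairs;
  const noNewRegressions =
    after.validJsonRegressions <= before.validJsonRegressions;
  return scoreImproved && noNewUnsafe && noNewRegressions;
}

const baseline = { score: 41, unsafeRepairs: 0, validJsonRegressions: 0 };

// A higher score does not help if it introduces an unsafe repair.
console.log(shouldKeepChange(baseline, { score: 43, unsafeRepairs: 1, validJsonRegressions: 0 })); // false
console.log(shouldKeepChange(baseline, { score: 42, unsafeRepairs: 0, validJsonRegressions: 0 })); // true
```

The score alone never decides. A change that raises it while adding a single unsafe repair is still discarded.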

What runs when you paste JSON

The thing that runs is a plain parser. It does this, in order:

1. Normalize smart quotes.
2. Strip a Markdown code fence if there is one.
3. Pull out the single top-level JSON-like region.
4. Tokenize it and walk a small state machine that closes unclosed structures, fixes separators, and normalizes True/False/None.
5. Try JSON.parse on each candidate; return the first valid one.

The design rule behind all of it: repair what is unambiguous, refuse what is not. A guess that parses but is wrong is worse than a clean failure. So when the input is ambiguous, the tool stops instead of guessing.
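To show the shape of that pipeline, here is a deliberately simplified sketch. It is not the engine's code: every function name and the result shape are assumptions, the scanner handles only string state, bracket closing, and True/False/None, and it does not fix separators.

```javascript
function normalizeSmartQuotes(text) {
  return text.replace(/[\u201C\u201D]/g, '"').replace(/[\u2018\u2019]/g, "'");
}

function stripCodeFence(text) {
  const fenced = text.match(/```[a-zA-Z]*\s*\n([\s\S]*?)```/);
  return fenced ? fenced[1] : text;
}

const PYTHON_LITERALS = new Map([["True", "true"], ["False", "false"], ["None", "null"]]);

// Walk the text once from the first "{" or "[", tracking string state and a
// stack of expected closers. Returns a candidate string, or null to refuse.
function scanRegion(text) {
  const start = text.search(/[{\[]/);
  if (start === -1) return null;
  const closers = [];
  let out = "";
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      out += ch;
      if (ch === "\\") out += text[++i] ?? ""; // copy the escaped character verbatim
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') { inString = true; out += ch; continue; }
    if (ch === "{" || ch === "[") {
      closers.push(ch === "{" ? "}" : "]");
      out += ch;
      continue;
    }
    if (ch === "}" || ch === "]") {
      if (closers.pop() !== ch) return null; // mismatched closer: refuse
      out += ch;
      if (closers.length === 0) return out; // top-level region is complete
      continue;
    }
    const word = /^[A-Za-z]+/.exec(text.slice(i));
    if (word) {
      out += PYTHON_LITERALS.get(word[0]) ?? word[0];
      i += word[0].length - 1;
      continue;
    }
    out += ch;
  }
  if (inString) return null; // truncated mid-string: refuse rather than guess
  return out + closers.reverse().join(""); // close whatever is still open
}

// Try each candidate with JSON.parse and return the first valid one.
function repairSketch(raw) {
  const cleaned = stripCodeFence(normalizeSmartQuotes(raw));
  const candidates = [cleaned.trim(), scanRegion(cleaned)].filter((c) => c !== null);
  for (const candidate of candidates) {
    try {
      return { ok: true, value: JSON.parse(candidate) };
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  return { ok: false, reason: "no unambiguous repair" };
}

console.log(repairSketch('```json\n{"a": True, "b": [1, 2\n```'));
console.log(repairSketch('{"a": "cut off mid-str'));
```

The first call yields a repaired object; the second refuses, because there is no way to know where the string was meant to end. JSON.parse stays the final gate, so nothing that fails to parse is ever returned as a success.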

Where it still loses

The benchmark also records where jsonrepair beats this engine:

trailing-comma repair in plain malformed objects and arrays
deeper tail-truncation on large nested objects

These are known, logged, and next on the list: the next numbers the loop is built to move. They're here because a tool that hides its weak spots is the kind of tool this one is built to replace.

Use it in your code

The exact engine described above — the same deterministic parser, no network, no model, the same safe-or-refuse rule — is on npm:

npm install @datatool/json-heal

Zero dependencies. ESM and CommonJS. It returns a result you branch on: a repaired value or an honest failure, never a confident guess. Source, the full benchmark corpus, and every logged run are public at github.com/datatool-dev/json-heal. MIT.
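This page doesn't spell out the package's export names or result fields, so the sketch below uses a local stand-in to show the branching pattern only. `healStandIn` and the `{ ok, value, reason }` shape are assumptions, not the package's real API; check the repository's README for the actual names.

```javascript
// Stand-in with a hypothetical result shape. NOT the package's real API.
// It only wraps JSON.parse so the branching pattern can run on its own.
function healStandIn(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (error) {
    return { ok: false, reason: error.message };
  }
}

// Branch on the result: use the value, or stop on an honest failure.
function loadConfig(text) {
  const result = healStandIn(text);
  if (!result.ok) {
    // Surface the failure instead of continuing with guessed data.
    throw new Error(`Refusing to use unrepaired JSON: ${result.reason}`);
  }
  return result.value;
}

console.log(loadConfig('{"retries": 3}').retries); // 3
```

The point of the pattern is that the failure branch is explicit: the caller decides what to do with input that couldn't be repaired safely.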

How to check that any of this is true

Don't take the numbers on faith. Here's how to verify them.

The engine does not learn while it repairs your JSON. Nothing on this page means it "optimizes" or "recurses" at runtime. It runs the five steps above, once, and stops. No machine-learning model. No network call. No iteration. The same input always produces the same output. The loop described above is a development methodology, not a runtime behavior — it runs between commits, never in front of you. That determinism is the safety property. A tool you can't predict is a tool you can't trust with data.

Every number on this page maps to a row in artifacts/json-repair/history.jsonl (latest run: commit 04a509d, 2026-05-17). The baseline figures come from BENCHMARK.md. Both files, the corpus, and the runner are public in the repository — clone it and re-run it yourself:

npm run benchmark:json-repair
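To cross-check a claim against the log by hand, something like the sketch below would do. The field names `cases`, `score`, `unsafe`, and `commit` are assumptions about the file's schema, and the sample rows are built only from figures quoted on this page; the real rows may look different.

```javascript
// Parse JSON Lines text: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Check the claim "unsafe repairs reached 0 and stayed at 0 on every run since".
// Assumes each row has a numeric `unsafe` field (hypothetical name).
function unsafeStaysAtZero(rows) {
  const firstZero = rows.findIndex((row) => row.unsafe === 0);
  return firstZero !== -1 && rows.slice(firstZero).every((row) => row.unsafe === 0);
}

// Sample rows using figures from this page, with assumed field names.
const sampleLog = [
  '{"cases": 33, "score": 21, "unsafe": 2}',
  '{"cases": 46, "score": 41, "unsafe": 0}',
  '{"commit": "04a509d", "cases": 52, "unsafe": 0}',
].join("\n");

console.log(unsafeStaysAtZero(parseJsonl(sampleLog))); // true
```

To run it against the real log, read artifacts/json-repair/history.jsonl from a clone of the repository (for example with Node's fs.readFileSync) and adjust the field names to match what the file actually contains.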

If a claim here doesn't match the log, the log is right and this page is wrong. Tell us.