AI Made Every Test Pass. The Code Was Still Wrong.

We used AI to validate our Solidity converter against 17 real-world contracts. Every test passed on day one. That was the problem.

Seventeen contracts. Two conversion passes each. Every single test: green.

We had just finished wiring up an AI-powered testing loop to validate the core of Doodledapp, the engine that converts visual flows into Solidity code and back again. The idea was simple: take real, widely-used smart contracts, feed them through the converter, and have AI write tests to catch every bug. The AI ran, the tests ran, and everything passed on the first try.

That should have been the celebration moment. Instead, it was the moment we realized something was deeply wrong.

Seventeen contracts and an ambitious idea

Doodledapp converts visual node graphs into Solidity smart contracts. To trust that conversion, we needed to prove it worked against real code, not toy examples. We grabbed 17 contracts that developers actually use in production: OpenZeppelin’s ERC-20 and ERC-721 implementations, Solmate’s gas-optimized token contracts, Uniswap V2 and V3 pool contracts, proxy patterns, a Merkle distributor, a vesting wallet, and more.

The validation strategy was what some call “round-trip testing.” Take a Solidity contract, convert it to a visual flow, then convert it back to Solidity. If the output matches the input semantically, the converter works. Do it twice, and you can prove the process is stable: the second pass should produce identical output to the first.
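Sketched in TypeScript, with hypothetical toFlow, toSolidity, and semanticallyEqual helpers standing in for the real converter, the strategy boils down to two checks:

```typescript
// Minimal sketch of round-trip validation. toFlow, toSolidity, and
// semanticallyEqual are hypothetical stand-ins for the real converter API.
import { toFlow, toSolidity } from "./converter";
import { semanticallyEqual } from "./compare";

// One round trip: Solidity -> visual flow -> Solidity.
export function roundTrip(source: string): string {
  return toSolidity(toFlow(source));
}

export function validate(original: string): void {
  const pass1 = roundTrip(original);
  const pass2 = roundTrip(pass1);

  // Correctness: the round-tripped contract must mean the same thing as the
  // original. Defining "semantically equal" is the hard part.
  if (!semanticallyEqual(original, pass1)) {
    throw new Error("round trip changed the contract's semantics");
  }

  // Stability: a second pass must reproduce the first pass exactly.
  if (pass1 !== pass2) {
    throw new Error("converter output is not stable across passes");
  }
}
```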

We had 17 contracts and a converter we needed to trust. We also had AI that was very good at writing tests. The plan was to point the AI at the converter, let it generate a full test suite, then loop: run the tests, fix failures, regenerate, repeat. An ouroboros of AI-driven validation that would eat its own bugs until nothing remained.

The moment everything went green (and wrong)

The AI generated the test suite. We ran it. Every test passed.

Seventeen contracts, two passes each, dozens of assertions. All green. On the first run.

We knew the converter was not perfect. We had been finding edge cases by hand for weeks. There was no way a first-generation test suite would catch zero issues. So we looked at what the tests were actually checking.

The AI had read the converter, understood what it does, and written tests confirming that it behaves exactly as implemented. It verified that functions get converted, that state variables appear in the output, that control flow structures are present. Every assertion was technically correct. The converter does those things.

But the tests never compared the output against the input. They never asked: “Does the generated Solidity do the same thing as the original contract?” They confirmed the converter runs without errors. They did not confirm the converter produces correct results.

The AI tested the code we had. Not the code we wanted.

Why AI tests your implementation, not your intent

This is not a flaw in any specific AI tool. It is a property of how AI writes tests. When you point an AI at your code and say “write tests for this,” it reads the implementation and generates assertions that the implementation satisfies. If your code has a bug that silently drops a modifier, the AI sees code that drops modifiers and writes a test confirming modifiers get dropped. The test passes. The bug is invisible.
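Here is a hypothetical illustration of the trap, written as a vitest-style test against an invented roundTrip helper: the converter silently drops a modifier, and because the expectation was derived from the converter’s own output, the test blesses the bug.

```typescript
// Hypothetical AI-generated test: the assertion encodes what the converter
// currently emits, not what the original contract requires.
import { describe, it, expect } from "vitest";
import { roundTrip } from "./roundTrip"; // invented helper for illustration

describe("converter", () => {
  it("converts a function with a modifier", () => {
    const source = "function mint(address to) public onlyOwner { _mint(to, 1); }";
    const output = roundTrip(source);

    // Bug: the converter drops `onlyOwner`. The expectation below was written
    // from the buggy output, so the test passes and the bug stays invisible.
    expect(output).toContain("function mint(address to) public {");
  });
});
```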

Researchers call this the “ground truth problem.” Something has to know what correct behavior looks like. When a human writes tests, they bring an understanding of intent: what the code should do, not just what it does. When AI writes tests from code alone, it has no independent source of truth. The code is both the subject and the specification.

A 2024 study from the University of Alberta found that large language models generate test assertions that “predicate on implemented behavior” rather than specified behavior. The tests are syntactically valid, they run, and they pass. They also protect bugs.

Hillel Wayne describes this as Goodhart’s Law applied to software: “When a measure becomes a target, it ceases to be a good measure.” A passing test suite feels like safety. But if the tests were generated by reading the code, the passing suite only proves the code does what it does. That is a circular argument, not a safety net.

Rewriting the loop with the right reference point

Here is what the first approach looked like:

graph TD
    A[Original Solidity] --> B[Pass 1: Convert Round-trip]
    B --> C[Normalized Output]
    C --> D[Pass 2: Convert Round-trip]
    D --> E[Second Output]
    E --> F{Pass 1 = Pass 2?}
    F -->|Yes| G[Test Passes]
    style G fill:#22c55e,color:#fff

The original plan was to normalize each contract before comparing: strip comments, enforce consistent indentation, standardize formatting. That way the diff would only surface real semantic differences, not cosmetic ones. But instead of building a dedicated normalizer, the AI ran each contract through the converter itself as “pass one” to normalize it. Then it ran the normalized output through a second time and checked that the two outputs matched.

The problem is that the original contract disappears after the first pass. If the converter silently drops a modifier, flattens an expression, or loses a data location, the first pass bakes that error into the “normalized” output. The second pass converts the already-broken version and produces the same broken result. Pass one equals pass two. Test passes. Bug is invisible.

This only proved idempotency: the converter produces consistent output. It never proved correctness: the converter produces the right output.
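Reduced to a sketch (reusing the hypothetical roundTrip helper from above), the flawed check amounts to this. Note that the original contract never appears in the final comparison:

```typescript
// The flawed loop: the original is only an input, never a reference point.
import { roundTrip } from "./roundTrip"; // hypothetical helper

export function flawedCheck(original: string): boolean {
  const normalized = roundTrip(original); // pass 1 "normalizes" (and may corrupt)
  const second = roundTrip(normalized);   // pass 2 converts the corrupted output
  return second === normalized;           // proves idempotency, not correctness
}
```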

Once we identified the problem, we restructured the entire approach. The AI could not be both the test writer and the source of truth. The original contracts had to be the reference point.

The key insight was to stop comparing Solidity strings altogether. Formatting differences, comment styles, whitespace: none of that matters. What matters is whether the contract’s logic survives the round trip. So we compared at the AST level instead. Parse the original contract into an abstract syntax tree, run it through the converter and back, parse the result into another AST, and compare the two trees. If the structures match, the logic is preserved, regardless of how the code looks on the page.
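Here is a minimal sketch of that idea, with @solidity-parser/parser as a stand-in Solidity parser and the same hypothetical roundTrip helper:

```typescript
// Sketch of AST-level round-trip comparison. @solidity-parser/parser is used
// here as a stand-in Solidity parser; roundTrip is the hypothetical helper
// from the earlier sketches.
import { parse } from "@solidity-parser/parser";
import { roundTrip } from "./roundTrip";

// Drop source positions so formatting and whitespace never show up in the diff.
function stripLocations(node: unknown): unknown {
  if (Array.isArray(node)) return node.map(stripLocations);
  if (node && typeof node === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(node)) {
      if (key === "loc" || key === "range") continue;
      out[key] = stripLocations(value);
    }
    return out;
  }
  return node;
}

export function survivesRoundTrip(original: string): boolean {
  const originalAst = stripLocations(parse(original));
  const roundTrippedAst = stripLocations(parse(roundTrip(original)));
  // Structural equality: if the trees match, the logic survived the round trip.
  return JSON.stringify(originalAst) === JSON.stringify(roundTrippedAst);
}
```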

The new loop worked like this:

graph TD
    A[Original Solidity] --> B[Parse to AST]
    A --> C[Convert to Flow]
    C --> D[Convert Back to Solidity]
    D --> E[Parse to AST]
    E --> F{ASTs Match?}
    B --> F
    F -->|No| G[AI Analyzes Diff]
    G --> H[Fix Converter]
    H --> A
    F -->|Yes| I[Contract Passes]

Take a real contract. Parse it into an AST. Convert it to a visual flow, convert it back, and parse that result into an AST. Compare the two trees. When they differ, hand the diff to AI and ask: “What is the converter getting wrong?” Fix the converter. Run again.
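A rough sketch of what such a loop driver might look like, with diffAsts and analyzeWithAi as hypothetical stand-ins for the structural diff and the AI analysis step (the converter fix itself happens outside the loop, after each report):

```typescript
// Rough sketch of the validation loop. diffAsts and analyzeWithAi are
// hypothetical stand-ins; CONTRACTS lists the real-world contracts under test.
import { readFileSync } from "node:fs";
import { parse } from "@solidity-parser/parser";
import { roundTrip } from "./roundTrip";              // hypothetical
import { diffAsts, analyzeWithAi } from "./analysis"; // hypothetical

const CONTRACTS = ["contracts/ERC20.sol", "contracts/ERC721.sol"]; // ...and the rest

export async function runValidation(): Promise<void> {
  for (const file of CONTRACTS) {
    const original = readFileSync(file, "utf8");
    const diff = diffAsts(parse(original), parse(roundTrip(original)));

    if (diff.length === 0) continue; // ASTs match: contract passes

    // Hand the concrete structural differences, not the raw source, to the AI
    // and ask: "What is the converter getting wrong?"
    const report = await analyzeWithAi(diff);
    console.log(`${file}: ${report}`);
  }
}
```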

This version of the loop worked. The AI was no longer writing tests from code. It was analyzing concrete structural differences between two ASTs: one that we knew was correct (the original) and one the converter produced. With that frame of reference, the AI could identify real bugs, like missing modifiers, incorrect operator precedence, malformed loop boundaries, and lost data locations.

The ouroboros ran for hours. Each cycle fixed a few edge cases across those 17 contracts. Some were small, like whitespace handling in nested structures. Others were significant, like incorrect reconstruction of complex expressions or mishandled inheritance chains. The AI analyzed hundreds of diffs and identified the root cause more reliably than we expected, as long as it had the original contract to compare against.

What stuck with us

The biggest takeaway was not about our converter. It was about the relationship between AI and intent.

AI is excellent at understanding what code does. It cannot know what code should do unless you give it that reference point. When we asked it to “write tests for the converter,” it tested the converter. When we asked it to “compare this output against this known-good input,” it found real bugs.

The difference is the reference point. The original contracts were our ground truth. Without them, the AI had nothing to measure against except the code itself. With them, it became genuinely useful.

If you are using AI to test your software, the question to ask is not “did the tests pass?” It is “what are the tests comparing against?” If the answer is “the code itself,” your test suite is a mirror. It will show you exactly what you built and tell you it looks great, even when it does not.

The tests that matter are the ones that know what correct looks like before they ever see your code.
