October 7, 2025 · Dzianis Vashchuk · 9 min read

Vibe Engineering: Testing an Agentic AI Browser Auto-Pilot with a Three-Stage Verification Strategy

How Vibe tests browser agents in practice: OCR-backed extractor validation, deterministic mock LLM workflows, and real-model integration tests against live sites.

engineering · testing · ai-agent · browser-automation · ocr · llm

Testing an AI-powered browser automation agent at vibebrowser.app is not like testing a normal web app.

Traditional software tends to be deterministic: give it the same input, and you usually get the same output. Agentic browser systems are different. They operate on live sites, react to changing layouts, depend on browser APIs, and route decisions through LLMs that can vary from run to run. That means a test strategy built around only unit tests or DOM assertions is not enough.

What worked for us was a three-stage testing pyramid that balances speed, cost, and real-world confidence:

  1. Foundation: page extraction with OCR verification
  2. Middle layer: mock LLM server testing
  3. Top layer: real LLM integration tests

The testing pyramid

Stage | What it validates | Why it matters
Page extraction + OCR | Whether we capture what is actually visible on the page | DOM-only tests can lie when content is hidden, off-screen, or rendered incorrectly
Mock LLM workflow tests | Whether the extension, tool loop, and agent flow work end to end | Fast, deterministic, and cheap enough to run often
Real LLM integration tests | Whether prompts, tool calling, and live browsing work against real APIs and sites | Final proof that production behavior matches expectations

The important part is not just having three layers. It is using each one for the thing it is best at.

Stage 1: page extraction testing with OCR verification

Primary file: tests/page-extraction.test.js

The first question we ask is simple: when Vibe reads a web page, does it actually capture the content the user can see?

That sounds obvious, but it is surprisingly easy to get wrong. A DOM-based extractor can look perfect in tests while still missing content that is visually rendered to the user. You can also get false confidence from HTML assertions when text is hidden, overlapped, white-on-white, or loaded in a way that does not match the extractor's assumptions.

Why OCR matters

We use OCR as a visual ground truth.

The test flow takes a screenshot of the rendered page, runs OCR on the image, and compares that OCR output to the extractor result. That gives us a stronger guarantee than checking the DOM alone:

  • the extractor captures what the user actually sees
  • critical text is not silently dropped
  • indexed elements line up with visual rendering
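
The flow above can be sketched as a single orchestration function with the browser, OCR engine, extractor, and matcher injected as plain functions. The real harness drives a browser and an OCR library; every name and parameter below is illustrative only:

```javascript
// Sketch of the Stage 1 flow with all dependencies injected so the
// orchestration itself stays testable without a browser.
async function verifyVisibleContent({ screenshot, runOcr, extract, match, minPercentage }) {
  const imagePath = await screenshot()      // render the page, capture an image
  const ocrText = await runOcr(imagePath)   // visual ground truth
  const extracted = await extract()         // extractor output under test
  const { matchPercentage } = match(ocrText, extracted)
  if (matchPercentage < minPercentage) {
    throw new Error(`Extractor captured only ${matchPercentage.toFixed(1)}% of visible text`)
  }
  return matchPercentage
}
```

Injecting the dependencies also means the comparison logic can be exercised with stubs before paying for a real browser launch.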

We test two extraction paths

We validate both of the core extraction strategies:

const markdownResult = await page.evaluate(() => {
  const extractor = new MarkdownPageExtractor()
  return extractor.extractContent({
    showHighlights: false,
    maxElements: 1000,
  })
})

const htmlResult = await page.evaluate(() => {
  const extractor = new HtmlPageExtractor()
  return extractor.extractIndexedHtml({
    maxElements: 1000,
  })
})

That split matters because a browser agent needs both high-level readable content and structured, index-addressable HTML for automation.

OCR validation with a baseline

OCR is not perfect, so we do not treat it as infallible. We first compare OCR against the page's raw HTML to estimate OCR quality, then require the extractor to match that baseline within a narrow tolerance.

const ocrText = await extractTextFromScreenshot(screenshotPath)

const baselineMatchStats = calculateOCRMatch(ocrText, htmlText)

const markdownMatchStats = calculateOCRMatch(ocrText, markdownResult.content)
const requiredPercentage = Math.max(90.0, baselineMatchStats.matchPercentage - 1.5)

if (markdownMatchStats.matchPercentage < requiredPercentage) {
  throw new Error('Extractor failing to capture visible content')
}

This approach gives us two protections at once:

  • an absolute floor (>= 90%)
  • a relative floor tied to OCR quality on that page

Fuzzy matching is required

OCR introduces character errors, dropped punctuation, and spacing noise. So we do fuzzy matching with a small Levenshtein tolerance instead of insisting on exact string equality.

function calculateOCRMatch(ocrText, markdownText) {
  const ocrWords = ocrText.split(/\s+/).filter((word) => word.length >= 3)
  const mdWordsArray = markdownText.split(/\s+/)
  const mdWordsSet = new Set(mdWordsArray)

  let exactMatches = 0
  let fuzzyMatches = 0

  for (const word of ocrWords) {
    if (mdWordsSet.has(word)) {
      exactMatches++
    } else {
      const match = findBestMatch(word, mdWordsArray, 2)
      if (match) fuzzyMatches++
    }
  }

  const totalWords = ocrWords.length
  // Guard against screenshots where OCR finds no usable words at all
  const matchPercentage = totalWords === 0 ? 0 : ((exactMatches + fuzzyMatches) / totalWords) * 100

  return { matchPercentage, exactMatches, fuzzyMatches, totalWords }
}
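
The post does not show `findBestMatch`; a minimal sketch of it, assuming it returns the closest candidate within a maximum Levenshtein distance (and `null` when nothing qualifies), could look like:

```javascript
// Classic dynamic-programming edit distance between two strings.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i])
  for (let j = 1; j <= b.length; j++) dp[0][j] = j
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      )
    }
  }
  return dp[a.length][b.length]
}

// Return the closest candidate within maxDistance edits, or null.
function findBestMatch(word, candidates, maxDistance) {
  let best = null
  let bestDistance = maxDistance + 1
  for (const candidate of candidates) {
    // Cheap length filter before the O(n*m) distance computation
    if (Math.abs(candidate.length - word.length) > maxDistance) continue
    const distance = levenshtein(word, candidate)
    if (distance <= maxDistance && distance < bestDistance) {
      best = candidate
      bestDistance = distance
    }
  }
  return best
}
```

A tolerance of 2 edits is enough to absorb typical OCR noise ("wrod" still matches "word") without letting unrelated words through.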

That catches the errors you expect from OCR without lowering the bar so far that extraction bugs get a free pass.

Coverage

We run this against both reference pages and live sites:

  • static pages with known screenshots
  • real websites like Wikipedia and other production pages

That lets us catch regressions in both the extractor logic and the rendering assumptions behind it.

Stage 2: mock LLM server testing

Primary files: tests/extension.mock.test.js and tests/utils/mock-llm-test-server.js

Once extraction is reliable, the next question is: can the extension and tool loop complete a browser task end to end?

This is where mock LLM testing earns its keep.

Testing everything with real models is too slow, too expensive, and too noisy for fast iteration. So we built a deterministic mock server that speaks the same OpenAI-style API contract the extension expects, but returns scripted responses instantly.

Why mock the model

Real LLM tests are:

  • expensive
  • slow
  • non-deterministic

Mock tests give us:

  • consistent tool-call sequences
  • instant feedback
  • reproducible failures
  • CI-friendly execution

The mock server behaves like a state machine

The mock server follows a simple phase-based model of the agent loop:

const testState = {
  phase: 'initial',
  messageCount: 0,
  toolCallsExecuted: [],
}

app.post('/v1/chat/completions', (req, res) => {
  const { messages } = req.body
  const userMessage = messages?.find((message) => message.role === 'user')?.content || ''

  if (testState.phase === 'initial' && userMessage.includes('test')) {
    testState.phase = 'navigated'
    return res.json({
      choices: [
        {
          message: {
            role: 'assistant',
            content: "Let's test it",
            tool_calls: [
              {
                function: {
                  name: 'navigate_to_url',
                  arguments: JSON.stringify({ url: `http://localhost:${PORT}/test-page` }),
                },
              },
            ],
          },
        },
      ],
    })
  }
})

That lets us verify the control loop, tool execution, and UI behavior without paying for real API calls.
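
Later phases follow the same pattern. One detail worth copying: a scripted handler should always end in a terminal branch that returns a message without tool_calls, otherwise an unmatched phase leaves the HTTP request hanging. A sketch of that tail (phase names here are illustrative, not the actual mock's):

```javascript
// Terminal branch of a hypothetical phase machine: a reply with no
// tool_calls is what the agent loop treats as task completion.
function completionFor(testState) {
  if (testState.phase === 'navigated') {
    testState.phase = 'done'
    return {
      choices: [{ message: { role: 'assistant', content: 'Task complete' } }],
    }
  }
  // Fallback so an unexpected phase still gets a well-formed response
  return {
    choices: [
      { message: { role: 'assistant', content: `Unhandled phase: ${testState.phase}` } },
    ],
  }
}
```

The fallback doubles as a debugging aid: the unexpected phase name shows up directly in the agent transcript instead of as a timeout.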

Parallel-safe test infrastructure

The tests run with dynamic port allocation so concurrent runs do not step on each other:

async function findAvailablePort(startPort) {
  return new Promise((resolve, reject) => {
    const server = createServer()

    server.listen(startPort, () => {
      const port = server.address().port
      server.close(() => resolve(port))
    })

    server.on('error', (err) => {
      if (err.code === 'EADDRINUSE') {
        findAvailablePort(startPort + 1).then(resolve).catch(reject)
      } else {
        reject(err)
      }
    })
  })
}

That sounds mundane, but it matters. Browser automation tests leak state easily. Port conflicts and half-cleaned mock servers are a common source of flaky failures.

We verify behavior at multiple levels

The extension tests do not stop at "the assistant said it succeeded."

We verify:

  • the extension settings page was configured
  • the home page triggered the agent
  • the Chrome sidepanel opened correctly
  • the target page actually changed state
  • screenshots show the expected UI updates

For example, after the mock server instructs the agent to fill and select form values, we assert the real DOM state:

const formValues = await testPage.evaluate(() => {
  const input = document.getElementById('destinationInput')
  const select = document.getElementById('classDropdown')

  return {
    inputValue: input?.value || '',
    selectedClass: select?.value || '',
  }
})

if (!formValues.inputValue.includes('Test Input Value')) {
  throw new Error(`Expected "Test Input Value", found "${formValues.inputValue}"`)
}

if (formValues.selectedClass !== 'economy') {
  throw new Error(`Expected "economy", found "${formValues.selectedClass}"`)
}

So the test proves both that the UI updated and that the underlying interaction actually happened.

Stage 3: real LLM integration tests

Primary file: tests/extension.test.js

Mock testing is great for confidence in the mechanics. But it cannot tell you whether real prompts, real model behavior, and real websites still work together.

That is the job of the top layer.

What real tests validate

Real LLM integration tests answer questions the mock layer cannot:

  • does prompt engineering hold up with live model behavior?
  • do tool calls come back in the right shape?
  • do we handle real network delays and API failures?
  • can the agent navigate external websites successfully?

The extension is configured against the real provider instead of the mock endpoint:

if (!process.env.OPENROUTER_API_KEY) {
  throw new Error('OPENROUTER_API_KEY environment variable not set')
}

await allInputs[0].type('openrouter:gpt-oss-120b')
await allInputs[1].type(process.env.OPENROUTER_API_KEY)

We test real-world browsing, not toy pages

The top layer targets live destinations, not just local fixtures:

const testQuery =
  "Navigate to google.com, search for 'raspberry pi zero 2 w', click first link"

const externalPage = pages.find((page) => {
  const url = page.url()
  return !url.includes('chrome-extension://') && url.startsWith('http')
})

if (!externalPage) {
  throw new Error('Agent did not navigate to any external page')
}

This is where production reality shows up: CAPTCHAs, changing layouts, anti-bot behavior, slow sites, and unexpected navigation paths.

Graceful degradation matters

Real tests do not require a fairy-tale success path every time. They also verify that the agent fails in understandable, inspectable ways.

For example, the OCR assertions accept meaningful completion or failure states such as:

  • successful fill or completion
  • explicit failure
  • CAPTCHA or verification challenge

That matters because real-world browser agents live in messy environments. You do not want a test suite that hides every partial failure behind brittle exact-match expectations.

What we learned

1) OCR is the closest thing we have to ground truth

DOM assertions are necessary, but not sufficient.

They can pass even when:

  • content is hidden with CSS
  • text is off-screen
  • rendering is incomplete
  • a page looks wrong to the user

OCR gives us a user-visible check instead of only an implementation-visible check.

2) Screenshot everything

Every important phase should leave behind evidence:

  • screenshots
  • OCR text
  • extracted content
  • DOM snapshots
  • console logs

When tests fail, that evidence is the difference between guessing and knowing.

3) Mock and real tests are both necessary

Mock tests are the fast loop.

Real tests are the truth loop.

If you skip mock testing, iteration becomes too slow. If you skip real testing, you will ship prompt and integration failures that the mock layer can never catch.

4) Cleanup is part of the test design

Browser automation leaks resources constantly: browsers, mock servers, OCR workers, temporary files. Cleanup has to be explicit and defensive:

async function cleanup() {
  if (cleanupPerformed) return // safe to call from multiple handlers
  cleanupPerformed = true

  if (browser) {
    await browser.close()
  }

  if (mockServer && !mockServer.killed) {
    mockServer.kill('SIGKILL')
  }

  await cleanupOCR()
}

// The handlers must exit explicitly: installing a SIGINT/SIGTERM handler
// replaces Node's default terminate-on-signal behavior.
process.on('SIGINT', async () => { await cleanup(); process.exit(130) })
process.on('SIGTERM', async () => { await cleanup(); process.exit(143) })
process.on('uncaughtException', async (err) => { console.error(err); await cleanup(); process.exit(1) })
process.on('unhandledRejection', async (reason) => { console.error(reason); await cleanup(); process.exit(1) })

Without that, you end up debugging the test harness instead of the product.

Running the suite

The three layers can be run independently:

# Stage 1: Page extraction verification
node tests/page-extraction.test.js

# Stage 2: Mock LLM workflow testing
node tests/extension.mock.test.js

# Stage 3: Real LLM integration (requires API key)
export $(< .env)   # assumes simple KEY=VALUE lines with no spaces or quotes
node tests/extension.test.js

That separation is useful operationally too. Engineers can run fast checks during development, then escalate to slower and more expensive validation as confidence grows.

The broader point

Testing AI browser agents requires a different mindset from testing standard web apps.

You are not just validating code paths. You are validating:

  • extraction quality
  • browser API behavior
  • tool execution
  • prompt effectiveness
  • model behavior
  • live-site resilience

That is why our testing strategy is evidence-driven instead of assumption-driven.

The core rule is simple:

Do not trust the DOM alone. Verify what the user actually sees, then verify what the agent actually did.

That is the foundation of how we test Vibe today.


Originally published on Medium:
https://medium.com/@dzianisv/vibe-engineering-testing-an-agentic-ai-browser-auto-pilot-a-three-stage-verification-strategy-946c07963d0b
