
The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks


by Jay · 3 min read · VORA B.LOG

Running speech AI fully in the browser sounds perfect on paper:

  • no server dependency
  • stronger privacy
  • offline capability

We tested that path with Whisper + WASM and learned a simple lesson: "technically possible" is not the same as "product-ready."

This post covers what we tried, where it failed, and where browser inference is actually useful today.

Why We Tried Whisper in the Browser

Whisper is a strong speech model, and the modern browser stack is much better than it used to be:

  • WebAssembly for CPU-heavy compute
  • ONNX Runtime Web
  • growing WebGPU support

So the idea made sense: run transcription locally in the client, avoid sending raw audio out, and reduce backend load.
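Before picking a runtime, it helps to check which of these backends the client actually has. A minimal sketch (the function name and result shape are ours, not any library's API):

```javascript
// Sketch: detect which browser inference backends are available.
// Accepts a globals object so it can be exercised outside a browser.
function detectBackends(globals = globalThis) {
  const nav = globals.navigator ?? {};
  return {
    wasm: typeof globals.WebAssembly === "object",            // baseline CPU path
    webgpu: "gpu" in nav,                                     // WebGPU entry point
    threads: typeof globals.SharedArrayBuffer === "function", // needs cross-origin isolation
  };
}
```

Note that `threads` reflects whether `SharedArrayBuffer` is exposed at all; the page must also be cross-origin isolated (more on that below) for it to be usable.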

Experiment 1: ONNX Runtime Web + Whisper

Our first implementation worked functionally, but several constraints showed up fast.

1) Model size vs user experience

Large Whisper models are too heavy for first-run web UX. Smaller models are faster, but accuracy drops on real meeting audio.

That creates a hard tradeoff:

  • accurate model: too large to load comfortably
  • lightweight model: not accurate enough for production transcription
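One pragmatic way to soften the tradeoff is to pick the model tier from coarse client signals. A sketch, with illustrative thresholds and tier names (in a real browser you might feed in `navigator.hardwareConcurrency` and the Chrome-only `navigator.deviceMemory`):

```javascript
// Hypothetical heuristic: choose a Whisper model tier from rough client
// signals. Thresholds and tier names are illustrative, not benchmarked.
function chooseModelTier({ cores = 2, memoryGB = 2 } = {}) {
  if (cores >= 8 && memoryGB >= 8) return "small"; // better accuracy, heavier download
  if (cores >= 4 && memoryGB >= 4) return "base";  // middle ground
  return "tiny";                                   // fastest first load
}
```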

2) SharedArrayBuffer and cross-origin isolation

To get practical performance with threading, we needed SharedArrayBuffer, which requires:

  • Cross-Origin-Opener-Policy: same-origin
  • Cross-Origin-Embedder-Policy: require-corp

Enabling these headers broke multiple existing assumptions around third-party assets and static hosting setup. We eventually got this working, but the deployment surface became much more complex than expected.
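For reference, setting both headers is simple in isolation; the hard part is everything they break. A minimal Express-style middleware sketch (the exact wiring depends on your host):

```javascript
// Sketch: Express-style middleware adding the two headers required for
// cross-origin isolation (and therefore usable SharedArrayBuffer).
function crossOriginIsolation(req, res, next) {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

In the page itself, `self.crossOriginIsolated === true` confirms the headers actually took effect.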

Experiment 2: Streaming with sherpa-onnx

We also tested sherpa-onnx for streaming behavior. Latency characteristics were promising, but we still faced:

  • bundle and model size concerns
  • cross-origin isolation requirements
  • uneven quality on Korean technical meeting audio

So even though the runtime model was attractive, the end-to-end product constraints remained.
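Independent of the engine, streaming ASR consumes fixed-size PCM frames. A minimal chunker over a mono `Float32Array` might look like this (the frame size is illustrative):

```javascript
// Sketch: split mono PCM samples into fixed-size frames for a streaming
// recognizer. The last partial frame is zero-padded.
function chunkPCM(samples, frameSize = 1600) { // 100 ms at 16 kHz
  const frames = [];
  for (let i = 0; i < samples.length; i += frameSize) {
    const frame = new Float32Array(frameSize);
    frame.set(samples.subarray(i, i + frameSize));
    frames.push(frame);
  }
  return frames;
}
```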

Performance Reality

In practice, real-time experience depended heavily on hardware class:

  • desktop: acceptable for some model sizes and configs
  • mobile: often too slow for a smooth live transcript

For our use case (live meeting capture), users care most about immediate feedback. Delayed "better" text is useful, but not enough by itself.

Korean Technical Speech Is a Special Case

Korean meeting audio with mixed English technical vocabulary is difficult for generic ASR pipelines:

  • acronyms
  • code and infra terms
  • fast topic shifts
  • overlapping speech

In this setting, domain-aware correction matters as much as raw model quality.
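To illustrate, here is a toy version of such a correction pass, driven by a domain glossary. The entries are made-up examples, not our production list:

```javascript
// Toy domain-aware correction: replace known ASR misrecognitions of
// technical terms. Glossary entries are illustrative examples only.
const GLOSSARY = new Map([
  ["쿠버 네티스", "Kubernetes"],
  ["도커 파일", "Dockerfile"],
]);

function correctTranscript(text, glossary = GLOSSARY) {
  let out = text;
  for (const [wrong, right] of glossary) {
    out = out.split(wrong).join(right); // literal global replace
  }
  return out;
}
```

A production pass would need tokenization and context awareness; blind string replacement is only the starting point.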

Where Browser Whisper Actually Works Well

We did not abandon browser inference. We narrowed the scope.

Great use cases:

  • offline transcription of pre-recorded audio
  • privacy-sensitive local processing
  • non-real-time workflows where users accept delay

Weak use case today:

  • low-latency, always-on live meeting transcription on mixed client hardware

Product Decision: Labs, Not Core Path


We kept Whisper in our Labs track as an experimental feature, while the main app's real-time transcription stays on a lower-latency path.

We are still exploring hybrid workflows:

  • live transcript first
  • higher-accuracy pass afterward

This gives users immediate usability and a better final artifact.
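The hybrid flow can be modeled as ordered segments that start as live text and get upgraded in place by the slower pass. A sketch (the shape below is our illustration, not a library API):

```javascript
// Sketch: ordered transcript segments; live results land immediately,
// and a slower high-accuracy pass upgrades them in place by id.
function createTranscript() {
  const segments = []; // { id, text, final }
  return {
    addLive(id, text) {
      segments.push({ id, text, final: false });
    },
    upgrade(id, text) {
      const seg = segments.find((s) => s.id === id);
      if (seg) { seg.text = text; seg.final = true; }
    },
    render() {
      return segments.map((s) => s.text).join(" ");
    },
  };
}
```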

Takeaway

Before committing to browser AI for production, benchmark the exact user path on real user hardware.

If the product needs immediate output, optimize for latency first. If the product needs maximum accuracy and privacy on recorded content, browser inference can be a strong fit.