
The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks


by Jay · 3 min read · VORA B.LOG

Running speech AI fully in the browser sounds perfect on paper:

  • no server dependency
  • stronger privacy
  • offline capability

We tested that path with Whisper + WASM and learned a simple lesson: "technically possible" is not the same as "product-ready."

This post covers what we tried, where it failed, and where browser inference is actually useful today.

Why We Tried Whisper in the Browser

Whisper is a strong speech model, and the modern browser stack is much better than it used to be:

  • WebAssembly for CPU-heavy compute
  • ONNX Runtime Web
  • growing WebGPU support

So the idea made sense: run transcription locally in the client, avoid sending raw audio out, and reduce backend load.
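Before picking a runtime, it helps to check which of these backends the client actually has. A minimal sketch (the function name and result shape are ours, not any library's API):

```javascript
// Sketch: detect which browser inference backends are available.
// Accepts a globals object so it can be exercised outside a browser.
function detectBackends(globals = globalThis) {
  const nav = globals.navigator ?? {};
  return {
    wasm: typeof globals.WebAssembly === "object",            // baseline CPU path
    webgpu: "gpu" in nav,                                     // WebGPU entry point
    threads: typeof globals.SharedArrayBuffer === "function", // needs cross-origin isolation
  };
}
```

Note that `threads` reflects whether `SharedArrayBuffer` is exposed at all; the page must also be cross-origin isolated (more on that below) for it to be usable.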

Experiment 1: ONNX Runtime Web + Whisper

Our first implementation worked functionally, but several constraints showed up fast.

1) Model size vs user experience

Large Whisper models are too heavy for first-run web UX. Smaller models are faster, but accuracy drops on real meeting audio.

That creates a hard tradeoff:

  • accurate model: too large to load comfortably
  • lightweight model: not accurate enough for production transcription
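One pragmatic way to soften the tradeoff is to pick the model tier from coarse client signals. A sketch, with illustrative thresholds and tier names (in a real browser you might feed in `navigator.hardwareConcurrency` and the Chrome-only `navigator.deviceMemory`):

```javascript
// Hypothetical heuristic: choose a Whisper model tier from rough client
// signals. Thresholds and tier names are illustrative, not benchmarked.
function chooseModelTier({ cores = 2, memoryGB = 2 } = {}) {
  if (cores >= 8 && memoryGB >= 8) return "small"; // better accuracy, heavier download
  if (cores >= 4 && memoryGB >= 4) return "base";  // middle ground
  return "tiny";                                   // fastest first load
}
```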

2) SharedArrayBuffer and cross-origin isolation

To get practical performance with threading, we needed SharedArrayBuffer, which requires:

  • Cross-Origin-Opener-Policy: same-origin
  • Cross-Origin-Embedder-Policy: require-corp

Enabling these headers broke multiple existing assumptions around third-party assets and static hosting setup. We eventually got this working, but the deployment surface became much more complex than expected.
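For reference, setting both headers is simple in isolation; the hard part is everything they break. A minimal Express-style middleware sketch (the exact wiring depends on your host):

```javascript
// Sketch: Express-style middleware adding the two headers required for
// cross-origin isolation (and therefore usable SharedArrayBuffer).
function crossOriginIsolation(req, res, next) {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

In the page itself, `self.crossOriginIsolated === true` confirms the headers actually took effect.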

Experiment 2: Streaming with sherpa-onnx

We also tested sherpa-onnx for streaming behavior. Latency characteristics were promising, but we still faced:

  • bundle and model size concerns
  • cross-origin isolation requirements
  • uneven quality on Korean technical meeting audio

So even though the runtime model was attractive, the end-to-end product constraints remained.
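Independent of the engine, streaming ASR consumes fixed-size PCM frames. A minimal chunker over a mono `Float32Array` might look like this (the frame size is illustrative):

```javascript
// Sketch: split mono PCM samples into fixed-size frames for a streaming
// recognizer. The last partial frame is zero-padded.
function chunkPCM(samples, frameSize = 1600) { // 100 ms at 16 kHz
  const frames = [];
  for (let i = 0; i < samples.length; i += frameSize) {
    const frame = new Float32Array(frameSize);
    frame.set(samples.subarray(i, i + frameSize));
    frames.push(frame);
  }
  return frames;
}
```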

Performance Reality

In practice, real-time experience depended heavily on hardware class:

  • desktop: acceptable for some model sizes and configs
  • mobile: often too slow for a smooth live transcript

For our use case (live meeting capture), users care most about immediate feedback. Delayed "better" text is useful, but not enough by itself.

Korean Technical Speech Is a Special Case

Korean meeting audio with mixed English technical vocabulary is difficult for generic ASR pipelines:

  • acronyms
  • code and infra terms
  • fast topic shifts
  • overlapping speech

In this setting, domain-aware correction matters as much as raw model quality.
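To illustrate, here is a toy version of such a correction pass, driven by a domain glossary. The entries are made-up examples, not our production list:

```javascript
// Toy domain-aware correction: replace known ASR misrecognitions of
// technical terms. Glossary entries are illustrative examples only.
const GLOSSARY = new Map([
  ["쿠버 네티스", "Kubernetes"],
  ["도커 파일", "Dockerfile"],
]);

function correctTranscript(text, glossary = GLOSSARY) {
  let out = text;
  for (const [wrong, right] of glossary) {
    out = out.split(wrong).join(right); // literal global replace
  }
  return out;
}
```

A production pass would need tokenization and context awareness; blind string replacement is only the starting point.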

Where Browser Whisper Actually Works Well

We did not abandon browser inference. We narrowed the scope.

Great use cases:

  • offline transcription of pre-recorded audio
  • privacy-sensitive local processing
  • non-real-time workflows where users accept delay

Weak use case today:

  • low-latency, always-on live meeting transcription on mixed client hardware

Product Decision: Labs, Not Core Path


We kept Whisper in our Labs track as an experimental feature, while the main app's real-time transcription stays on a lower-latency path.

We are still exploring hybrid workflows:

  • live transcript first
  • higher-accuracy pass afterward

This gives users immediate usability and a better final artifact.
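The hybrid flow can be modeled as ordered segments that start as live text and get upgraded in place by the slower pass. A sketch (the shape below is our illustration, not a library API):

```javascript
// Sketch: ordered transcript segments; live results land immediately,
// and a slower high-accuracy pass upgrades them in place by id.
function createTranscript() {
  const segments = []; // { id, text, final }
  return {
    addLive(id, text) {
      segments.push({ id, text, final: false });
    },
    upgrade(id, text) {
      const seg = segments.find((s) => s.id === id);
      if (seg) { seg.text = text; seg.final = true; }
    },
    render() {
      return segments.map((s) => s.text).join(" ");
    },
  };
}
```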

Takeaway

Before committing to browser AI for production, benchmark the exact user path on real user hardware.

If the product needs immediate output, optimize for latency first. If the product needs maximum accuracy and privacy on recorded content, browser inference can be a strong fit.