Speech-to-Text Integration

~20 min Intermediate

Set up BptAISpeechToText and BptAIRealtimeTranscriber, pick a backend (local Whisper, OpenAI, Azure), and wire transcribed text into form fields or chat workflows.

Two components, two use cases

| Component | Use case | Audio source | Latency |
| --- | --- | --- | --- |
| BptAISpeechToText | Transcribe a recorded audio file end-to-end. | Uploaded Stream | Whole-file processing (seconds to minutes depending on length). |
| BptAIRealtimeTranscriber | Live captioning, voice commands, voice-to-form. | Microphone, segmented by VAD | ~200–800 ms after each utterance (depending on backend). |

Pick the right one: If your user uploads a meeting recording and waits for a transcript, use BptAISpeechToText. If they speak and the UI reacts as they talk, use BptAIRealtimeTranscriber.

Step 1: Choose a backend

Both components are backend-agnostic. The "backend" is the actual recognition engine — picked at runtime based on the parameters you pass. There's no special registration step; the right backend loads when its parameters are populated.

| Backend | Server / WASM | Pros | Cons |
| --- | --- | --- | --- |
| Whisper.Net (local GGML) | Server | No API costs, offline-capable, fully private. | Large model files (~1.5 GB for whisper-medium), CPU/GPU-bound. |
| ONNX Whisper / Sherpa | Server & WASM | Smaller footprint than GGML; can run in-browser. | WASM transcripts are noticeably slower; quality varies by ONNX export. |
| OpenAI Whisper API | Server | Best accuracy, no infra to manage. | Per-minute pricing, requires sending audio to OpenAI. |
| Azure Speech Services | Server | Enterprise SLA, streaming + speaker diarization built-in. | Azure account needed; region matters for latency. |

Step 2: Local Whisper on the server

For most apps this is the right starting point — no API keys, no per-call cost, and you can iterate offline. Download a GGML model from the Whisper.cpp project and point ModelPath at it.

    @page "/transcribe"
    @using Bpt.Components.AI

    <BptAISpeechToText Mode="BptAISpeechToTextMode.Server"
                       ModelPath="C:\models\ggml-medium.bin"
                       Language="auto"
                       OnResponseCompleted="OnTranscribed" />

    @code {
        private string _transcript = "";

        private async Task OnTranscribed(string text)
        {
            _transcript = text;
            await InvokeAsync(StateHasChanged);
        }
    }

With ShowBuiltInUI="true" (the default), the component renders a Load-model button, a file picker, and a transcript card. Users get a working transcription page with no further code on your side.

Step 3: WebAssembly mode (in-browser)

When Mode is WebAssembly, the component resolves a model from Hugging Face and runs ONNX inference directly in the browser. No model file ships with your app; the first run downloads weights and caches them in IndexedDB.

    <BptAISpeechToText Mode="BptAISpeechToTextMode.WebAssembly"
                       HuggingFaceModelName="Xenova/whisper-tiny.en"
                       ShowModelSelector="true"
                       ShowLanguageSelector="true"
                       ShowApiKeyInput="false"
                       OnResponseCompleted="OnTranscribed" />

WASM trade-offs: Browser-side inference is slower than the server (especially on phones), and the model download is a 25–200 MB blocking event on first run. Show a clear progress UI: the OnStatusUpdate callback fires repeatedly during download and load.

Step 4: OpenAI Whisper

Drop in the ApiKey parameter (server-side) and the component routes through the OpenAI API instead of running local inference. The downstream OnResponseCompleted contract is identical — your code doesn't change.

    // In Program.cs, read the key from configuration (env var or .env)
    builder.Services.AddSingleton(new OpenAiSettings
    {
        ApiKey = builder.Configuration["OpenAi:ApiKey"]
            ?? throw new InvalidOperationException("OpenAi:ApiKey missing")
    });

    @* In your component: inject the settings object registered above *@
    @inject OpenAiSettings _settings

    <BptAIRealtimeTranscriber Mode="BptAIRealtimeTranscriberMode.Server"
                              TranscriptorType="openai whisper"
                              ApiKey="@_settings.ApiKey"
                              Language="en-US"
                              OnResponseCompleted="OnSpoken" />

Never hard-code an API key in markup — keep it in appsettings / environment variables and inject it as shown. The component passes the key to the OpenAI endpoint only from the server; it never reaches the browser even in WASM mode.
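As a rough guide to where that configuration value can live, the sketch below annotates the providers that the default ASP.NET Core builder registers. The key name OpenAi:ApiKey is taken from the snippet above; everything else is standard ASP.NET Core behavior rather than anything BPT-specific. Note that a .env file is not among the default providers, so it only works if you add a loader for it yourself.

```csharp
// Program.cs: where "OpenAi:ApiKey" can come from with the default builder.
var builder = WebApplication.CreateBuilder(args);

// WebApplication.CreateBuilder registers these configuration providers,
// and later ones override earlier ones:
//   1. appsettings.json and appsettings.{Environment}.json
//   2. user secrets (Development environment only)
//   3. environment variables, where "__" stands in for ":"
//   4. command-line arguments
//
// So the key can be supplied as the environment variable OpenAi__ApiKey,
// or during development with:
//   dotnet user-secrets set "OpenAi:ApiKey" "<your key>"
string apiKey = builder.Configuration["OpenAi:ApiKey"]
    ?? throw new InvalidOperationException("OpenAi:ApiKey missing");
```

Failing fast at startup, as the original snippet does, surfaces a missing key before the first user ever presses the mic button.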

Step 5: Azure Speech Services

Azure needs both an ApiKey and an AzureEndpoint (the region URL from your Azure portal). It supports diarization out of the box — if you need "who said what?", this is the easy path.

    <BptAIRealtimeTranscriber TranscriptorType="azure"
                              ApiKey="@_azureKey"
                              AzureEndpoint="https://westeurope.api.cognitive.microsoft.com/"
                              Language="en-US"
                              OnResponseCompleted="OnSpoken" />

Step 6: Voice activity detection (VAD)

Real-time transcription needs to know when a user has stopped talking, otherwise the transcript only arrives when the mic shuts off. BPT supports two VAD backends, both running client-side regardless of where transcription happens.

| VAD type | Pros | Cons |
| --- | --- | --- |
| silero (default) | Neural; very accurate; handles noisy environments. | ~5 MB ONNX model; first load adds latency. |
| webrtc | Tiny (kilobytes), zero startup cost, used by browsers natively. | Aggression-tuned: prone to mid-sentence cutoff in noisy rooms. |

    <BptAIRealtimeTranscriber TranscriptorType="whisper onnx"
                              VadType="silero"
                              Language="en-US"
                              OnResponseCompleted="OnSpoken" />
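To make the idea of utterance segmentation concrete, here is a toy energy-threshold detector. It is not how Silero or WebRTC VAD work internally and it is not part of BPT: the EnergyVad class, the 0.02 threshold, and the three-frame hangover are all illustrative assumptions. The sketch only shows why a run of silent frames is what closes an utterance and releases it for transcription.

```csharp
using System;
using System.Collections.Generic;

// Toy energy-threshold VAD (hypothetical helper, not a BPT or Silero API).
public static class EnergyVad
{
    // Returns (Start, End) frame indices per utterance; End is exclusive.
    // frames:         fixed-length chunks of PCM samples in [-1, 1]
    // threshold:      RMS level at or above which a frame counts as speech
    // hangoverFrames: consecutive silent frames needed to close an utterance
    public static List<(int Start, int End)> Segment(
        IReadOnlyList<float[]> frames,
        double threshold = 0.02,
        int hangoverFrames = 3)
    {
        var segments = new List<(int Start, int End)>();
        int start = -1;     // -1 means "not inside an utterance"
        int silentRun = 0;  // silent frames seen since the last speech frame

        for (int i = 0; i < frames.Count; i++)
        {
            bool speech = Rms(frames[i]) >= threshold;
            if (speech)
            {
                if (start < 0) start = i;
                silentRun = 0;
            }
            else if (start >= 0 && ++silentRun >= hangoverFrames)
            {
                // Close the utterance at the first frame of the silent run.
                segments.Add((start, i - silentRun + 1));
                start = -1;
                silentRun = 0;
            }
        }

        // Audio ended mid-utterance: close it, dropping trailing silence.
        if (start >= 0) segments.Add((start, frames.Count - silentRun));
        return segments;
    }

    private static double Rms(float[] frame)
    {
        if (frame.Length == 0) return 0;
        double sum = 0;
        foreach (float s in frame) sum += s * s;
        return Math.Sqrt(sum / frame.Length);
    }
}
```

A longer hangover tolerates pauses in the middle of a sentence but delays delivery of the transcript; a shorter one responds faster but risks cutting a speaker off.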

Step 7: Voice-activated form filling

The classic "fill this textbox by speaking" pattern: hide the built-in UI, render your own mic button, and let OnResponseTokenPartComplete stream text into a bound field as the user talks. The component is fully controllable from outside via its public methods (StartRecording(), StopRecording()) — see the live demo for the headless wiring.

    <BptAIRealtimeTranscriber @ref="_transcriber"
                              ShowBuiltInUI="false"
                              OnResponseTokenPartComplete="OnPartial"
                              OnResponseCompleted="OnFinal" />

    <BptTextInput @bind-Value="_note" Placeholder="Speak or type…" />

    <button class="btn btn-primary" @onclick="ToggleAsync">
        @(_recording ? "Stop" : "🎤 Dictate")
    </button>

    @code {
        private BptAIRealtimeTranscriber? _transcriber;
        private string _note = "";
        private bool _recording;

        private async Task ToggleAsync()
        {
            if (_recording) await _transcriber!.StopRecording();
            else await _transcriber!.StartRecording();
            _recording = !_recording;
        }

        private void OnPartial(string token) => _note += token;

        private void OnFinal(string fullText)
        {
            // Optional: snap _note to the cleaned-up final text rather than the
            // running concatenation of partials.
            _note = fullText;
        }
    }
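One design question the snippet above leaves open is what happens to text that was already in the field when a final transcript arrives. A possible approach, sketched below, keeps confirmed text separate from the utterance still in progress. DictationBuffer is a hypothetical helper, not a BPT type, and it assumes that partial tokens arrive as increments and that the final text covers only the utterance just spoken; check both against the component's actual behavior.

```csharp
using System;

// Hypothetical helper (not part of BPT): separates confirmed text from
// the partial tokens of the utterance currently being spoken.
public sealed class DictationBuffer
{
    private string _committed;      // typed text plus finalized utterances
    private string _pending = "";   // partial tokens of the current utterance

    public DictationBuffer(string initialText = "") => _committed = initialText;

    // What the bound field should display right now.
    public string Text => Join(_committed, _pending);

    // Call from the partial-token callback.
    public void AppendPartial(string token) => _pending += token;

    // Call from the final-text callback: the cleaned-up text replaces
    // the accumulated partials instead of being appended after them.
    public void CommitFinal(string fullText)
    {
        _committed = Join(_committed, fullText.Trim());
        _pending = "";
    }

    // Joins two fragments with a single space, ignoring empty sides.
    private static string Join(string left, string right)
    {
        if (left.Length == 0) return right;
        if (right.Length == 0) return left;
        return left.TrimEnd() + " " + right.TrimStart();
    }
}
```

In the component, the partial handler would call AppendPartial and then assign Text to the bound field, and the final handler would do the same with CommitFinal. Earlier notes then survive each new utterance instead of being overwritten by it.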

Common gotchas

  • SignalR message size. Audio streams from browser to server can be several MB. BPT sets MaximumReceiveMessageSize to 204 MB by default — if you've overridden it, check the limit before going to production.
  • Microphone permission. Browsers require user-initiated permission. The first call to StartRecording() triggers the prompt; if the user denies it, OnStatusUpdate fires with an error payload. Treat this as a first-class UI state.
  • Language autodetect. Whisper's autodetect needs ~5 seconds of audio. For short utterances, pin Language="en-US" explicitly — the transcription quality is markedly better.
  • Hugging Face model caching. In WASM mode the first run downloads model weights. If your CSP blocks huggingface.co, the load fails silently, so allow that origin in your Content-Security-Policy headers.
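For the SignalR and CSP items above, this is roughly where the relevant settings live in a host project. The sketch assumes the classic Blazor Server template (AddServerSideBlazor with a _Host page); the 32 MB figure and the exact policy string are placeholders, not BPT defaults. A real policy usually carries more directives, and weight downloads may redirect to additional hosts, so check your browser's network tab before settling on the allow-list.

```csharp
// Program.cs sketch: Blazor Server template layout assumed.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRazorPages();
builder.Services.AddServerSideBlazor()
    .AddHubOptions(options =>
    {
        // An explicit override like this replaces whatever limit was in
        // effect before. 32 MB is a placeholder: size it to your longest
        // expected audio segment.
        options.MaximumReceiveMessageSize = 32L * 1024 * 1024;
    });

var app = builder.Build();

// Add the model host to connect-src so in-browser weight downloads are
// not blocked. Merge this with your existing policy rather than
// replacing it; only the one directive is shown here.
app.Use(async (context, next) =>
{
    context.Response.Headers["Content-Security-Policy"] =
        "connect-src 'self' https://huggingface.co";
    await next(context);
});

app.UseStaticFiles();
app.MapBlazorHub();
app.MapFallbackToPage("/_Host");

app.Run();
```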
