Speech-to-Text Integration — Blazor Power Tools

Two components, two use cases

Component	Use case	Audio source	Latency
`BptAISpeechToText`	Transcribe a recorded audio file end-to-end.	Uploaded `Stream`	Whole-file processing (seconds to minutes depending on length).
`BptAIRealtimeTranscriber`	Live captioning, voice commands, voice-to-form.	Microphone, segmented by VAD	~200–800 ms after each utterance (depending on backend).

Pick the right one. If your user uploads a meeting recording and waits for a transcript, use BptAISpeechToText. If they speak and the UI reacts as they talk, use BptAIRealtimeTranscriber.

Step 1: Choose a backend

Both components are backend-agnostic. The "backend" is the actual recognition engine — picked at runtime based on the parameters you pass. There's no special registration step; the right backend loads when its parameters are populated.

Backend	Server / WASM	Pros	Cons
Whisper.Net (local GGML)	Server	No API costs, offline-capable, fully private.	Large model files (~1.5 GB for whisper-medium), CPU/GPU-bound.
ONNX Whisper / Sherpa	Server & WASM	Smaller footprint than GGML; can run in-browser.	In-browser (WASM) transcription is noticeably slower; quality varies by ONNX export.
OpenAI Whisper API	Server	Best accuracy, no infra to manage.	Per-minute pricing, requires sending audio to OpenAI.
Azure Speech Services	Server	Enterprise SLA, streaming + speaker diarization built-in.	Azure account needed; region matters for latency.

Step 2: Local Whisper on the server

For most apps this is the right starting point: no API keys, no per-call cost, and you can iterate offline. Download a GGML model from the whisper.cpp project and point ModelPath at it.

@page "/transcribe"
@using Bpt.Components.AI

<BptAISpeechToText Mode="BptAISpeechToTextMode.Server"
                   ModelPath="C:\models\ggml-medium.bin"
                   Language="auto"
                   OnResponseCompleted="OnTranscribed" />

@code {
    private string _transcript = "";

    private async Task OnTranscribed(string text)
    {
        _transcript = text;
        await InvokeAsync(StateHasChanged);
    }
}

With ShowBuiltInUI="true" (the default), the component renders a Load-model button, a file picker, and a transcript card. Users get a working transcription page with no further code on your side.

Step 3: WebAssembly mode (in-browser)

When Mode is WebAssembly, the component resolves a model from Hugging Face and runs ONNX inference directly in the browser. No model file ships with your app; the first run downloads weights and caches them in IndexedDB.

<BptAISpeechToText Mode="BptAISpeechToTextMode.WebAssembly"
                   HuggingFaceModelName="Xenova/whisper-tiny.en"
                   ShowModelSelector="true"
                   ShowLanguageSelector="true"
                   ShowApiKeyInput="false"
                   OnResponseCompleted="OnTranscribed" />

WASM trade-offs. Browser-side inference is slower than server-side inference (especially on phones), and the first run blocks on a 25–200 MB model download. Show a clear progress UI; the OnStatusUpdate callback fires repeatedly during download and load.
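A progress readout along those lines might look like the sketch below. It reuses only parameters named on this page. The payload type of OnStatusUpdate is not documented here, so the `string` parameter on `OnStatus` is an assumption, and the handler and field names are illustrative.

```razor
@using Bpt.Components.AI

<BptAISpeechToText Mode="BptAISpeechToTextMode.WebAssembly"
                   HuggingFaceModelName="Xenova/whisper-tiny.en"
                   OnStatusUpdate="OnStatus"
                   OnResponseCompleted="OnTranscribed" />

@* Live region so the download/load progress is announced, not just shown. *@
<p role="status" aria-live="polite">@_status</p>
<p>@_transcript</p>

@code {
    private string _status = "Waiting for the model to load...";
    private string _transcript = "";

    // ASSUMPTION: OnStatusUpdate passes a plain string. Change the parameter
    // type if the component delivers a richer payload.
    private async Task OnStatus(string message)
    {
        _status = message;
        await InvokeAsync(StateHasChanged);
    }

    private async Task OnTranscribed(string text)
    {
        _status = "Done.";
        _transcript = text;
        await InvokeAsync(StateHasChanged);
    }
}
```

Keeping the status line outside the component means it stays visible even if you later switch off the built-in UI.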

Step 4: OpenAI Whisper

Drop in the ApiKey parameter (server-side) and the component routes through the OpenAI API instead of running local inference. The downstream OnResponseCompleted contract is identical — your code doesn't change.

// In Program.cs, read the key from configuration (env var or .env)
builder.Services.AddSingleton(new OpenAiSettings
{
    ApiKey = builder.Configuration["OpenAi:ApiKey"]
        ?? throw new InvalidOperationException("OpenAi:ApiKey missing")
});

@* In your component (.razor): inject the settings registered above *@
@inject OpenAiSettings _settings
<BptAIRealtimeTranscriber Mode="BptAIRealtimeTranscriberMode.Server"
                          TranscriptorType="openai whisper"
                          ApiKey="@_settings.ApiKey"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />

Never hard-code an API key in markup — keep it in appsettings / environment variables and inject it as shown. The component passes the key to the OpenAI endpoint only from the server; it never reaches the browser even in WASM mode.
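The registration above uses an OpenAiSettings type that this page does not define. A minimal sketch, assuming it is a plain class you declare in your own app rather than something BPT ships:

```csharp
// ASSUMPTION: OpenAiSettings is an app-defined type, not a BPT class.
public sealed class OpenAiSettings
{
    // Filled in Program.cs from the "OpenAi:ApiKey" configuration key.
    public string ApiKey { get; init; } = string.Empty;
}
```

With the default ASP.NET Core configuration providers, an environment variable named OpenAi__ApiKey (double underscore) maps to the OpenAi:ApiKey key, so the secret can stay out of appsettings.json entirely.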

Step 5: Azure Speech Services

Azure needs both an ApiKey and an AzureEndpoint (the region URL from your Azure portal). It supports diarization out of the box — if you need "who said what?", this is the easy path.

<BptAIRealtimeTranscriber TranscriptorType="azure"
                          ApiKey="@_azureKey"
                          AzureEndpoint="https://westeurope.api.cognitive.microsoft.com/"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />

Step 6: Voice activity detection (VAD)

Real-time transcription needs to know when a user has stopped talking; otherwise the transcript arrives only when the mic shuts off. BPT supports two VAD backends, both running client-side regardless of where transcription happens.

VAD type	Pros	Cons
`silero` (default)	Neural; very accurate; handles noisy environments.	~5 MB ONNX model; first load adds latency.
`webrtc`	Tiny (kilobytes), zero startup cost, used by browsers natively.	Aggression-tuned: prone to mid-sentence cutoff in noisy rooms.

<BptAIRealtimeTranscriber TranscriptorType="whisper onnx"
                          VadType="silero"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />

Step 7: Voice-activated form filling

The classic "fill this textbox by speaking" pattern: hide the built-in UI, render your own mic button, and let OnResponseTokenPartComplete stream text into a bound field as the user talks. The component is fully controllable from outside via its public methods (StartRecording(), StopRecording()) — see the live demo for the headless wiring.

<BptAIRealtimeTranscriber @ref="_transcriber"
                          ShowBuiltInUI="false"
                          OnResponseTokenPartComplete="OnPartial"
                          OnResponseCompleted="OnFinal" />

<BptTextInput @bind-Value="_note" Placeholder="Speak or type…" />

<button class="btn btn-primary"
        @onclick="ToggleAsync">
    @(_recording ? "Stop" : "🎤 Dictate")
</button>

@code {
    private BptAIRealtimeTranscriber? _transcriber;
    private string _note = "";
    private bool _recording;

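    private async Task ToggleAsync()
    {
        // The @ref is only set after the first render; bail out until then.
        if (_transcriber is null) return;

        if (_recording) await _transcriber.StopRecording();
        else await _transcriber.StartRecording();
        _recording = !_recording;
    }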

    private void OnPartial(string token) => _note += token;
    private void OnFinal(string fullText)
    {
        // Optional: snap _note to the cleaned-up final text rather than the
        // running concatenation of partials.
        _note = fullText;
    }
}
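If the field should keep earlier utterances when the final text arrives, one option is to track committed text and in-flight partials separately. The helper below is a hypothetical sketch, not part of BPT. It assumes the final callback delivers only the utterance that just ended, which this page does not confirm.

```csharp
using System.Text;

// Hypothetical helper (not part of BPT): keeps finished utterances apart
// from the partial tokens of the utterance still being spoken.
public sealed class DictationBuffer
{
    private readonly StringBuilder _committed = new();
    private readonly StringBuilder _pending = new();

    // Text for the bound field: finished utterances plus live partials.
    public string Text
    {
        get
        {
            var pending = _pending.ToString().Trim();
            if (pending.Length == 0) return _committed.ToString();
            return _committed.Length == 0 ? pending : $"{_committed} {pending}";
        }
    }

    // Call from the OnResponseTokenPartComplete handler.
    public void AppendPartial(string token) => _pending.Append(token);

    // Call from the OnResponseCompleted handler.
    // ASSUMPTION: finalText covers only the utterance that just ended.
    public void CommitFinal(string finalText)
    {
        _pending.Clear();
        var cleaned = finalText.Trim();
        if (cleaned.Length == 0) return;
        if (_committed.Length > 0) _committed.Append(' ');
        _committed.Append(cleaned);
    }
}
```

In the component, each handler would call the matching method and then assign `_note = _buffer.Text`. Text the user types by hand is not tracked by this sketch, so decide whether dictation should append to or replace manual edits.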

Common gotchas

SignalR message size. Audio streams from browser to server can be several MB. BPT sets MaximumReceiveMessageSize to 204 MB by default — if you've overridden it, check the limit before going to production.
Microphone permission. Browsers require user-initiated permission. The first call to StartRecording() triggers the prompt; if the user denies it, OnStatusUpdate fires with an error payload. Treat this as a first-class UI state.
Language autodetect. Whisper's autodetect needs ~5 seconds of audio. For short utterances, pin Language="en-US" explicitly — the transcription quality is markedly better.
Hugging Face model caching. In WASM mode the first run downloads model weights. If your CSP blocks huggingface.co, the load fails silently, so allow that origin in your Content-Security-Policy header (typically in the connect-src directive).
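The first and last items both come down to a few lines in Program.cs. The sketch below assumes a classic Blazor Server host (AddServerSideBlazor with a _Host page). The 32 MiB limit and the policy string are illustrative values, not BPT defaults, and how BPT applies its own default limit is not shown on this page.

```csharp
// Program.cs: a sketch for a Blazor Server app; merge into your existing file.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRazorPages();
builder.Services
    .AddServerSideBlazor()
    .AddHubOptions(options =>
    {
        // Illustrative value only. If you override the limit, keep it above
        // the largest audio message your app sends.
        options.MaximumReceiveMessageSize = 32L * 1024 * 1024; // 32 MiB
    });

var app = builder.Build();

// ASSUMPTION: model weights are loaded with fetch(), which CSP governs via
// connect-src. Merge this directive into your real policy; a complete policy
// for a Blazor app needs more directives than this fragment.
app.Use(async (context, next) =>
{
    context.Response.Headers["Content-Security-Policy"] =
        "connect-src 'self' https://huggingface.co";
    await next(context);
});

app.UseStaticFiles();
app.UseRouting();
app.MapBlazorHub();
app.MapFallbackToPage("/_Host");

app.Run();
```

Hugging Face may serve large files from additional hosts via redirects, so check the browser's network tab during a first load and add any extra origins you see there.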

Next: Image Processing in the Browser

Use BptImageAnalysis for edge detection and BptAnimationEffect for GPU-accelerated visuals — all client-side.

Live Demo

Try every backend interactively, including the headless mode used in Step 7.
