Speech-to-Text Integration
Set up BptAISpeechToText and BptAIRealtimeTranscriber, pick a backend (local Whisper, OpenAI, Azure), and wire transcribed text into form fields or chat workflows.
Two components, two use cases
| Component | Use case | Audio source | Latency |
|---|---|---|---|
| BptAISpeechToText | Transcribe a recorded audio file end-to-end. | Uploaded Stream | Whole-file processing (seconds to minutes depending on length). |
| BptAIRealtimeTranscriber | Live captioning, voice commands, voice-to-form. | Microphone, segmented by VAD | ~200–800 ms after each utterance (depending on backend). |
If users upload a recording and wait for the result, use BptAISpeechToText.
If they speak and the UI reacts as they talk, use BptAIRealtimeTranscriber.
Step 1: Choose a backend
Both components are backend-agnostic. The "backend" is the actual recognition engine — picked at runtime based on the parameters you pass. There's no special registration step; the right backend loads when its parameters are populated.
| Backend | Server / WASM | Pros | Cons |
|---|---|---|---|
| Whisper.Net (local GGML) | Server | No API costs, offline-capable, fully private. | Large model files (~1.5 GB for whisper-medium), CPU/GPU-bound. |
| ONNX Whisper / Sherpa | Server & WASM | Smaller footprint than GGML; can run in-browser. | WASM transcription is noticeably slower; quality varies by ONNX export. |
| OpenAI Whisper API | Server | Best accuracy, no infra to manage. | Per-minute pricing, requires sending audio to OpenAI. |
| Azure Speech Services | Server | Enterprise SLA, streaming + speaker diarization built-in. | Azure account needed; region matters for latency. |
Step 2: Local Whisper on the server
For most apps this is the right starting point — no API keys, no per-call cost, and you can iterate
offline. Download a GGML model from the Whisper.cpp project and point ModelPath at it.
@page "/transcribe"
@using Bpt.Components.AI

<BptAISpeechToText Mode="BptAISpeechToTextMode.Server"
                   ModelPath="C:\models\ggml-medium.bin"
                   Language="auto"
                   OnResponseCompleted="OnTranscribed" />

@code {
    private string _transcript = "";

    private async Task OnTranscribed(string text)
    {
        _transcript = text;
        await InvokeAsync(StateHasChanged);
    }
}
With ShowBuiltInUI="true" (the default), the component renders a Load-model button, a
file picker, and a transcript card. Users get a working transcription page with no further code
on your side.
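The example above stores the result in _transcript but never renders it. A minimal sketch of wiring the text into an editable form field follows. It assumes only the component parameters already shown plus standard Blazor @bind on a textarea; the route name is hypothetical.

```razor
@page "/transcribe-edit"
@using Bpt.Components.AI

<BptAISpeechToText Mode="BptAISpeechToTextMode.Server"
                   ModelPath="C:\models\ggml-medium.bin"
                   Language="auto"
                   OnResponseCompleted="OnTranscribed" />

@* Plain Blazor binding: the user can correct the transcript before saving it. *@
<textarea class="form-control" rows="6" @bind="_transcript"></textarea>

@code {
    private string _transcript = "";

    // Same handler shape as the example above: copy the text, then re-render.
    private async Task OnTranscribed(string text)
    {
        _transcript = text;
        await InvokeAsync(StateHasChanged);
    }
}
```

Because the textarea is two-way bound, edits the user makes after transcription stay in _transcript and can be submitted with the rest of the form.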
Step 3: WebAssembly mode (in-browser)
When Mode is WebAssembly, the component resolves a model from
Hugging Face and runs ONNX inference directly in the browser. No model file ships with your app;
the first run downloads weights and caches them in IndexedDB.
<BptAISpeechToText Mode="BptAISpeechToTextMode.WebAssembly"
                   HuggingFaceModelName="Xenova/whisper-tiny.en"
                   ShowModelSelector="true"
                   ShowLanguageSelector="true"
                   ShowApiKeyInput="false"
                   OnResponseCompleted="OnTranscribed" />
The OnStatusUpdate callback fires repeatedly during download and load.
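Since the first WebAssembly run can take a while, it may help to surface those status updates to the user. The sketch below assumes that OnStatusUpdate delivers a plain string message; the payload type is not stated in this guide, so check the component's signature and adjust the handler parameter if it differs.

```razor
@using Bpt.Components.AI

<BptAISpeechToText Mode="BptAISpeechToTextMode.WebAssembly"
                   HuggingFaceModelName="Xenova/whisper-tiny.en"
                   OnStatusUpdate="OnStatus"
                   OnResponseCompleted="OnTranscribed" />

@* Shown while weights download and the model loads. *@
<p role="status">@_status</p>
<p>@_transcript</p>

@code {
    private string _status = "Idle";
    private string _transcript = "";

    // Assumption: the callback payload is a human-readable string.
    private async Task OnStatus(string message)
    {
        _status = message;
        await InvokeAsync(StateHasChanged);
    }

    private async Task OnTranscribed(string text)
    {
        _transcript = text;
        await InvokeAsync(StateHasChanged);
    }
}
```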
Step 4: OpenAI Whisper
Drop in the ApiKey parameter (server-side) and the component routes through the OpenAI
API instead of running local inference. The downstream OnResponseCompleted contract
is identical — your code doesn't change.
// In Program.cs, read the key from configuration (appsettings or an environment variable)
builder.Services.AddSingleton(new OpenAiSettings
{
    ApiKey = builder.Configuration["OpenAi:ApiKey"]
        ?? throw new InvalidOperationException("OpenAi:ApiKey missing")
});

// In your component:
<BptAIRealtimeTranscriber Mode="BptAIRealtimeTranscriberMode.Server"
                          TranscriptorType="openai whisper"
                          ApiKey="@_settings.ApiKey"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />
Never hard-code an API key in markup — keep it in appsettings /
environment variables and inject it as shown. The component passes the key to the OpenAI
endpoint only from the server; it never reaches the browser even in WASM mode.
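The snippet above uses _settings without showing where it comes from. One way to wire it up is sketched here, assuming OpenAiSettings is a plain class of your own with a string ApiKey property (it is registered as a singleton in the Program.cs fragment above), and assuming the page renders with server-side interactivity so the injected value stays on the server. The route and the _lastUtterance field are hypothetical.

```razor
@page "/dictate"
@using Bpt.Components.AI
@* Resolves the singleton registered in Program.cs. *@
@inject OpenAiSettings _settings

<BptAIRealtimeTranscriber Mode="BptAIRealtimeTranscriberMode.Server"
                          TranscriptorType="openai whisper"
                          ApiKey="@_settings.ApiKey"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />

<p>@_lastUtterance</p>

@code {
    private string _lastUtterance = "";

    private async Task OnSpoken(string text)
    {
        _lastUtterance = text;
        await InvokeAsync(StateHasChanged);
    }
}
```

With ASP.NET Core's default configuration providers, the key `OpenAi:ApiKey` can be supplied through an environment variable named `OpenAi__ApiKey` (double underscore), which keeps the secret out of source control.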
Step 5: Azure Speech Services
Azure needs both an ApiKey and an AzureEndpoint (the region URL from your
Azure portal). It supports diarization out of the box — if you need "who said what?", this is the
easy path.
<BptAIRealtimeTranscriber TranscriptorType="azure"
                          ApiKey="@_azureKey"
                          AzureEndpoint="https://westeurope.api.cognitive.microsoft.com/"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />
Step 6: Voice activity detection (VAD)
Real-time transcription needs to know when a user has stopped talking; otherwise the transcript only arrives when the mic shuts off. BPT supports two VAD backends, both running client-side regardless of where transcription happens.
| VAD type | Pros | Cons |
|---|---|---|
| silero (default) | Neural; very accurate; handles noisy environments. | ~5 MB ONNX model; first load adds latency. |
| webrtc | Tiny (kilobytes), zero startup cost, used by browsers natively. | Aggression-tuned: prone to mid-sentence cutoff in noisy rooms. |
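To make the idea of end-of-utterance detection concrete, here is a toy energy-threshold detector. It is neither Silero nor WebRTC VAD and is not part of BPT; the class name, threshold, and hangover count are illustrative assumptions. It only shows why a detector needs both a speech test and a run of trailing silence before it declares that the user has stopped talking.

```csharp
using System;

// Toy end-of-utterance detector based on frame energy (RMS).
// Illustrative only: real VADs such as Silero or WebRTC are far more robust.
public sealed class EnergyEndpointDetector
{
    private readonly double _threshold;   // RMS level at or above which a frame counts as speech
    private readonly int _hangoverFrames; // consecutive quiet frames needed to end an utterance
    private int _silentRun;
    private bool _inSpeech;

    public EnergyEndpointDetector(double threshold = 0.02, int hangoverFrames = 15)
    {
        _threshold = threshold;
        _hangoverFrames = hangoverFrames;
    }

    // Feed one frame of samples in the range [-1, 1].
    // Returns true once per utterance, on the frame where it is judged to have ended.
    public bool ProcessFrame(ReadOnlySpan<float> samples)
    {
        double sumSquares = 0;
        foreach (float s in samples) sumSquares += s * s;
        double rms = samples.Length == 0 ? 0 : Math.Sqrt(sumSquares / samples.Length);

        if (rms >= _threshold)
        {
            // Speech frame: start or continue the utterance and reset the silence counter.
            _inSpeech = true;
            _silentRun = 0;
            return false;
        }

        if (!_inSpeech) return false; // silence before any speech: nothing to end

        _silentRun++;
        if (_silentRun < _hangoverFrames) return false; // short pause, keep waiting

        // Enough trailing silence: close the utterance and reset for the next one.
        _inSpeech = false;
        _silentRun = 0;
        return true;
    }
}
```

With 20 ms frames, the default of 15 hangover frames corresponds to roughly 300 ms of silence. A shorter hangover responds faster but cuts speakers off mid-sentence more often, which is the same trade-off the table above describes for aggressive settings.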
<BptAIRealtimeTranscriber TranscriptorType="whisper onnx"
                          VadType="silero"
                          Language="en-US"
                          OnResponseCompleted="OnSpoken" />
Step 7: Voice-activated form filling
The classic "fill this textbox by speaking" pattern: hide the built-in UI, render your own mic
button, and let OnResponseTokenPartComplete stream text into a bound field as the
user talks. The component is fully controllable from outside via its public methods (StartRecording(),
StopRecording()) — see the live demo for the headless wiring.
<BptAIRealtimeTranscriber @ref="_transcriber"
                          ShowBuiltInUI="false"
                          OnResponseTokenPartComplete="OnPartial"
                          OnResponseCompleted="OnFinal" />

<BptTextInput @bind-Value="_note" Placeholder="Speak or type…" />

<button class="btn btn-primary"
        @onclick="ToggleAsync">
    @(_recording ? "Stop" : "🎤 Dictate")
</button>

@code {
    private BptAIRealtimeTranscriber? _transcriber;
    private string _note = "";
    private bool _recording;

    private async Task ToggleAsync()
    {
        if (_recording) await _transcriber!.StopRecording();
        else await _transcriber!.StartRecording();
        _recording = !_recording;
    }

    private void OnPartial(string token) => _note += token;

    private void OnFinal(string fullText)
    {
        // Optional: snap _note to the cleaned-up final text rather than the
        // running concatenation of partials.
        _note = fullText;
    }
}
Common gotchas
- SignalR message size. Audio streams from browser to server can be several MB. BPT sets MaximumReceiveMessageSize to 204 MB by default; if you've overridden it, check the limit before going to production.
- Microphone permission. Browsers require user-initiated permission. The first call to StartRecording() triggers the prompt; if the user denies it, OnStatusUpdate fires with an error payload. Treat this as a first-class UI state.
- Language autodetect. Whisper's autodetect needs ~5 seconds of audio. For short utterances, pin Language="en-US" explicitly; the transcription quality is markedly better.
- HuggingFace model caching. In WASM mode the first run downloads model weights. If your CSP blocks huggingface.co, the load fails silently, so allow that origin in your Content-Security-Policy headers.
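For the first and last items, the relevant settings live in Program.cs. The fragment below is a sketch using standard ASP.NET Core APIs (AddHubOptions and a small inline middleware). The 32 MB value is purely illustrative, how an explicit setting interacts with BPT's own default is not covered in this guide, and the single connect-src directive is an assumption that should be merged into whatever policy your app already sends.

```csharp
var builder = WebApplication.CreateBuilder(args);

// Set the SignalR receive limit explicitly for the Blazor Server circuit.
// 32 MB is an illustrative value, not a BPT recommendation.
builder.Services.AddRazorComponents()
    .AddInteractiveServerComponents()
    .AddHubOptions(options =>
    {
        options.MaximumReceiveMessageSize = 32 * 1024 * 1024;
    });

var app = builder.Build();

// Let the browser fetch model weights from huggingface.co in WASM mode.
// Merge this directive into your existing Content-Security-Policy if you have one.
app.Use(async (context, next) =>
{
    context.Response.Headers["Content-Security-Policy"] =
        "connect-src 'self' https://huggingface.co";
    await next(context);
});

// ... map your components and endpoints here as in your existing Program.cs.

app.Run();
```

If model downloads still fail after this change, check the browser's network tab: weight files may be served from additional hosts that also need to be listed in connect-src.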