How dictation works
Every InkSpoke dictation follows the same fast loop: you press a hotkey, speak, and the finished text lands where your cursor already is. This page opens up that loop so you can see what happens at each stage — the state the app is in, and how your audio becomes clean text — without needing to touch any of it.
The simple version
You only ever do three things:
- Press the activation hotkey — a small listening overlay appears.
- Speak.
- Press again (or click Send) — InkSpoke transcribes, optionally polishes the text, and types it into whatever app was focused.
Press Esc at any time to cancel and discard the session. Everything else (gain, noise cleanup, silence detection, choosing a workspace, restoring your clipboard) happens automatically in between.
The state machine
Under the hood, a single session moves through a small set of states. The listening overlay mirrors the current one with a label and a live waveform, so you always know where you are.
| State | Overlay shows | What's happening |
|---|---|---|
| Idle | (hidden) | Waiting for a hotkey press. |
| Listening | "Listening…" + waveform | Capturing and buffering your audio. |
| Processing | "Processing…" | Transcribing, applying your dictionary, and (optionally) refining. |
| Injecting | result / typing | Delivering the finished text to your app. |
| Done | "Done" | Text delivered; the overlay dismisses. |
| Cancelled | — | You aborted; nothing is processed or injected. |
| Error | "Error" | Something failed; an error cue plays and no text is injected. |
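For readers who think in code, the table can be read as a small transition map. This is a minimal sketch, not InkSpoke's actual implementation: the `State` enum, the `TRANSITIONS` map, and the `advance` and `run` helpers are hypothetical names, and the allowed moves are inferred from the table rather than confirmed. How an Esc press during Injecting is represented is not stated here, so the sketch leaves that transition out.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    INJECTING = auto()
    DONE = auto()
    CANCELLED = auto()
    ERROR = auto()


# Assumed transition map, inferred from the table; the real app may differ.
TRANSITIONS = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.PROCESSING, State.CANCELLED, State.ERROR},
    State.PROCESSING: {State.INJECTING, State.CANCELLED, State.ERROR},
    State.INJECTING: {State.DONE, State.ERROR},
    State.DONE: {State.IDLE},
    State.CANCELLED: {State.IDLE},
    State.ERROR: {State.IDLE},
}


def advance(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def run(path):
    """Walk a list of target states starting from IDLE and return the last one."""
    state = State.IDLE
    for target in path:
        state = advance(state, target)
    return state
```

A normal session would be `run([State.LISTENING, State.PROCESSING, State.INJECTING, State.DONE])`. Keeping the map in one dictionary makes it easy to see that every terminal state leads back to Idle.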
The very first dictation after launch may briefly show "Preparing…" while the speech model finishes loading. After that, sessions start instantly.
Two safety limits can end a Listening session on their own: continuous silence for about 30 seconds auto-cancels it, and a single recording is capped at roughly 5 minutes. Both are adjustable (and can be turned off) in audio settings.
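The two limits amount to a check that could run on every audio tick. The sketch below is illustrative only: `SafetyLimits`, `check_limits`, and the returned labels are hypothetical, the defaults simply mirror the approximate figures above, and `None` stands in for a limit that has been turned off. What the app does after the length cap is reached is not specified here, so the sketch only reports which limit fired.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafetyLimits:
    # Defaults mirror the approximate figures in the text; None disables a limit.
    silence_timeout_s: Optional[float] = 30.0
    max_recording_s: Optional[float] = 300.0


def check_limits(elapsed_s: float, silent_for_s: float, limits: SafetyLimits) -> str:
    """Return 'continue', 'silence_cancel', or 'length_cap' for the current tick."""
    if limits.silence_timeout_s is not None and silent_for_s >= limits.silence_timeout_s:
        return "silence_cancel"
    if limits.max_recording_s is not None and elapsed_s >= limits.max_recording_s:
        return "length_cap"
    return "continue"
```

For example, `check_limits(40, 31, SafetyLimits())` reports `"silence_cancel"`, while the same call with `SafetyLimits(None, None)` reports `"continue"`.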
Inside the pipeline
Your audio, and then the text produced from it, flows through a fixed sequence of stages, and the early ones run while you are still speaking. Most are invisible; each one hands cleaner input to the next.
Stage by stage:
- Capture — InkSpoke records from your selected input device while the overlay is up. The device name is shown in the overlay footer.
- Auto-calibrated gain — quiet microphones are amplified automatically. InkSpoke samples the first frames to measure your ambient level, then boosts the whole session to a consistent target so soft speech still transcribes well. No setting to tweak.
- Noise suppression — a neural filter (DeepFilterNet) plus a low-pass filter strip out fans, keyboard clatter, and ambient chatter while preserving your voice. On by default when its model is available; tunable in audio settings.
- Voice activity detection (VAD) — Silero VAD separates speech from silence, trims the quiet gaps before transcription, and drives the "speech detected" feedback. It's also what powers the silence auto-stop above.
- Transcribe — the speech-to-text step. By default this runs on-device with the Whisper Small model, so your audio never leaves your computer; you can switch to a larger on-device model or a cloud provider. Whether it streams a live preview or transcribes in one pass depends on your dictation mode.
- Dictionary substitutions — your personal dictionary applies deterministic, whole-word fixes (for example, spoken "gpt" → "GPT") before anything else sees the text. A shared team dictionary can sit underneath it.
- Context harvest — InkSpoke reads the foreground app and window title so it can adapt: this is what lets it match a workspace, apply the right tone, and switch into code- or terminal-aware output when you're dictating into an IDE or shell.
- Workspace resolution — the app settles on one workspace for this dictation, in strict order: a pinned workspace, then your overlay override for this session, then a smart match against the window, then your default workspace. The overlay previews the resolved workspace before you even finish speaking.
- AI refinement (optional) — if refinement is on (the default), the cleaned transcription plus your resolved workspace's vocabulary and instructions go to your text model, which returns polished, app-aware writing. If it's off, your raw transcription is used verbatim and no workspace prompt data is sent.
- Inject + clipboard restore — the finished text is typed into the app that had your cursor. When InkSpoke pastes via the clipboard, it backs up your existing clipboard first and restores it right after — so your copied content is never clobbered.
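The dictionary stage in the list above is the easiest one to picture in code, because it is deterministic. The following is a rough sketch under stated assumptions: `merge_dictionaries` and `apply_dictionary` are hypothetical helpers, matching is assumed to be case-insensitive, and "sits underneath" is read as "personal entries override team entries".

```python
import re
from typing import Dict


def merge_dictionaries(team: Dict[str, str], personal: Dict[str, str]) -> Dict[str, str]:
    """Layer the personal dictionary over the team one (personal wins on conflict)."""
    merged = {k.lower(): v for k, v in team.items()}
    merged.update({k.lower(): v for k, v in personal.items()})
    return merged


def apply_dictionary(text: str, entries: Dict[str, str]) -> str:
    """Replace whole words only, so 'gpt' matches but 'gpts' does not."""
    if not entries:
        return text
    lookup = {k.lower(): v for k, v in entries.items()}
    # Longest keys first so multi-word entries beat their own prefixes.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

The lookarounds `(?<!\w)` and `(?!\w)` are what make the substitution whole-word: a match is rejected if a letter, digit, or underscore touches either side of it.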
The master AI Refinement switch sits at the top of this decision. Even with it on, an individual workspace can opt out of refinement or pin its own text model. The full precedence is covered in How refinement works.
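Workspace resolution and the refinement decision can be sketched together, since the second depends on the first. This is a simplified reading of the rules on this page, not the full precedence from How refinement works: the `Workspace` fields, `resolve_workspace`, and `refinement_plan` are all hypothetical names, and the model names in the usage note are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workspace:
    name: str
    refine: bool = True               # workspace-level opt-out (assumed field)
    text_model: Optional[str] = None  # workspace-pinned model (assumed field)


def resolve_workspace(pinned, override, smart_match, default):
    """First candidate that is not None wins, in the order the page lists."""
    for candidate in (pinned, override, smart_match, default):
        if candidate is not None:
            return candidate
    raise ValueError("no default workspace configured")


def refinement_plan(master_on: bool, ws: Workspace, global_model: str) -> Optional[str]:
    """Return the text model to use, or None when the raw transcription is used."""
    if not master_on or not ws.refine:
        return None
    return ws.text_model or global_model
```

With `code = Workspace("Code", text_model="code-model")`, the call `refinement_plan(True, code, "general-model")` picks `"code-model"`, and switching the master flag to `False` returns `None` regardless of the workspace.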
When there's no place to put the text
If the focused element can't accept text at injection time (for example, you clicked the desktop or a button, or switched to Finder), InkSpoke doesn't drop your words. Instead, the overlay shows the full transcript with a Copy button and a "No text field detected" label, so you can paste it wherever you like.
Similarly, during slow character-by-character typing (terminals and remote sessions), you can press Esc to stop mid-inject. Whatever was already typed stays, and the dictation is still saved to your history.
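Both fallbacks can be pictured as branches of one delivery routine. The sketch below is only an illustration: `deliver`, its parameters, and its return labels are hypothetical, and `send_key` and `cancelled` stand in for whatever the app uses to type a character and to detect Esc. Whether the no-text-field case also writes to history is not stated on this page, so the sketch saves history only on the typing path.

```python
def deliver(text, target_accepts_text, send_key, cancelled, history):
    """Type text one character at a time, honouring the two fallbacks above."""
    if not target_accepts_text:
        # Nothing is typed; the overlay would show the transcript with a Copy button.
        return "show_transcript"
    typed = 0
    for ch in text:
        if cancelled():  # e.g. Esc was pressed mid-inject
            break
        send_key(ch)
        typed += 1
    # The full dictation is kept even if typing was interrupted.
    history.append(text)
    return "done" if typed == len(text) else "stopped"
```

Characters already sent before the interruption are never undone, which matches the "whatever was already typed stays" behaviour.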
Power users
- Two activation hotkeys. A secondary binding does the exact same start/stop as the primary one — handy if your main combo clashes with another app.
- Quiet-speech mode. A tuned mode lowers the speech-detection threshold and boosts gain so you can dictate under your breath in a shared space. Toggle it in audio settings.
- Live Preview. Switch your dictation mode from Standard (one high-accuracy pass after you stop) to Live Preview to see a streaming draft as you talk. Live Preview needs the VAD model and works only with on-device transcription. See Dictation modes and languages.
- Command Mode. The same pipeline, but it captures your currently selected text first, then treats your speech as an instruction to transform it ("make this formal", "translate to Spanish"). See Command Mode.
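The Live Preview requirements in the list above reduce to a two-item checklist. As a small sketch, assuming a hypothetical `live_preview_blockers` helper and the made-up backend label `"on-device"`, the check might look like this. It only reports unmet requirements; what the app does when one is missing is not described here.

```python
from typing import List


def live_preview_blockers(vad_model_available: bool, backend: str) -> List[str]:
    """List the stated Live Preview requirements that are not currently met."""
    blockers = []
    if not vad_model_available:
        blockers.append("VAD model is not available")
    if backend != "on-device":
        blockers.append("transcription is not on-device")
    return blockers
```

An empty list means both stated requirements are satisfied.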
Platform notes
The pipeline is identical across Windows, macOS, and Linux; only the default hotkeys and some low-level injection details differ. The defaults:
| Action | Windows / Linux | macOS |
|---|---|---|
| Start / stop dictation | Alt + Space | ⌥ + Space |
| Start / stop (secondary) | Ctrl + Shift + Space | ⌃ + ⇧ + Space |
| Cancel (overlay up) | Esc | Esc |
All of these are configurable in General and hotkeys. How text is delivered to the target app — clipboard paste, simulated keystrokes, or a remote-desktop-safe path — is chosen automatically per OS and per app; see Text injection for the details.
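If it helps to see the defaults as data, the table can be written as a per-platform lookup with user overrides layered on top. This is a sketch only: `DEFAULT_HOTKEYS`, the action names, and `default_hotkeys` are hypothetical, and the key spellings (`Option`, `Control`) are just text versions of the macOS symbols in the table. The one real detail is that Python reports macOS as `sys.platform == "darwin"`.

```python
import sys

# Defaults copied from the table; action names are illustrative.
DEFAULT_HOTKEYS = {
    "windows_linux": {
        "toggle": "Alt+Space",
        "toggle_secondary": "Ctrl+Shift+Space",
        "cancel": "Esc",
    },
    "macos": {
        "toggle": "Option+Space",
        "toggle_secondary": "Control+Shift+Space",
        "cancel": "Esc",
    },
}


def default_hotkeys(platform: str = sys.platform, overrides=None) -> dict:
    """Pick the platform defaults, then apply any user overrides on top."""
    family = "macos" if platform == "darwin" else "windows_linux"
    bindings = dict(DEFAULT_HOTKEYS[family])
    bindings.update(overrides or {})
    return bindings
```

For instance, `default_hotkeys("win32", {"toggle": "F9"})` changes only the primary binding and leaves the secondary and cancel keys at their defaults.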
Next steps
- Push-to-talk basics — the everyday start, stop, and cancel flow in practice.
- The listening overlay — every control on the HUD, from the waveform to the pickers.
- On-device vs. cloud and privacy — where the transcribe and refine stages run, and what leaves your device.
- Smart matching and precedence — how the workspace-resolution stage picks a workspace for you.