
Why I Built a Mac Dictation App That Doesn't Use Whisper

March 2026

macOS Built-In Dictation Isn't Enough

Apple ships dictation on every Mac. Press the microphone key, talk, and text appears. It works — for about 30 seconds.

If you're dictating anything longer than a quick message, you hit the wall fast: the recording times out, accuracy drops on technical terms, and every word gets sent to Apple's servers for processing. For people who dictate meeting notes, long-form writing, or code documentation, built-in dictation is a demo, not a tool.

I wanted something that could handle 10-minute recordings, worked in any app, and kept my audio on my machine. So I built SpokenKey.


Why Everyone Uses Whisper (And Why I Didn't)

When you search for “speech to text” in the open-source world, every result uses OpenAI's Whisper. It's the default choice — well-documented, multilingual, and genuinely good.

But Whisper has quirks that matter for a dictation app:

Whisper hallucinates on silence. Feed it a recording with a long pause, and it might invent words that were never spoken. This is a known issue with the architecture: Whisper was trained on paired audio/text data that rarely contains silence, so when the audio goes quiet, its decoder still predicts plausible-looking text.

Whisper's punctuation and capitalization are inconsistent. It's a general-purpose model covering 99 languages. English-specific formatting gets less attention.

Whisper is batch-only. You record, wait, get the result. There's no streaming mode that shows text as you speak.

I went looking for alternatives.


Parakeet TDT: NVIDIA's English-Optimized Alternative

NVIDIA's NeMo team trains a family of ASR models called Parakeet. The specific model SpokenKey uses is Parakeet TDT 0.6B v3 — a 600M-parameter Transducer model optimized for English.

Why Parakeet over Whisper: it's trained specifically for English, so punctuation and capitalization get full attention instead of a 1/99th share, and its Transducer architecture only emits tokens when the acoustics support them, which makes it far less prone to inventing words during silence.

SpokenKey ships with Parakeet TDT v3 as primary, v2 as fallback, and Whisper as a last resort — a safety net for the rare case where neither Parakeet model loads, usually a corrupted model download. In practice, Parakeet handles everything.
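The fallback chain itself is simple to sketch: try each loader in order and take the first one that succeeds. The function and loader names here are illustrative, not SpokenKey's actual loading code.

```python
def load_asr_model(loaders):
    """Try (name, loader) pairs in order; return the first that loads.

    `loaders` is a list like [("parakeet-tdt-v3", load_v3), ...] where
    each loader is a zero-argument callable. Names are hypothetical.
    """
    errors = {}
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # e.g. a corrupted model download
            errors[name] = exc
    raise RuntimeError(f"no ASR model could be loaded: {errors}")
```

In practice each loader would wrap something like NeMo's `ASRModel.from_pretrained` for the Parakeet checkpoints, with a Whisper load as the final entry.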

For streaming (showing text as you speak), SpokenKey uses a separate Zipformer model — it's less accurate but fast enough for real-time preview. When you release the hotkey, Parakeet re-transcribes the full audio for the final result.
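The two-pass flow reduces to: feed audio chunks to the fast streaming model for a live preview while the key is held, then re-run the whole recording through the accurate model on release. The model objects and method names below are stand-ins for the Zipformer and Parakeet wrappers, not SpokenKey's actual API.

```python
def transcribe_session(chunks, streaming_model, final_model, show_preview):
    """Cheap live preview while recording, accurate full pass afterwards.

    `streaming_model` and `final_model` are hypothetical stand-ins; only
    the two-pass sequencing is the point here.
    """
    audio = b""
    for chunk in chunks:  # chunks arrive while the hotkey is held
        audio += chunk
        show_preview(streaming_model.transcribe_chunk(chunk))
    # On release: one accurate pass over the complete recording
    return final_model.transcribe(audio)
```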


Clipboard-Free Text Injection with Quartz CGEvents

Most dictation apps work like this: transcribe the audio, copy the text to the clipboard, simulate Cmd+V to paste. This has two problems:

  1. It overwrites your clipboard. Whatever you had copied is gone.
  2. Clipboard managers capture every paste. If you use CopyClip or Paste, your clipboard history fills up with dictation fragments.

SpokenKey's streaming mode doesn't use the clipboard at all. It uses macOS Quartz CGEvents to simulate individual key presses — the same mechanism the OS uses for physical keyboard input. Text appears at the cursor as if you typed it.

# Quartz bindings come from the pyobjc-framework-Quartz package
from Quartz import (CGEventCreateKeyboardEvent, CGEventKeyboardSetUnicodeString,
                    CGEventPost, kCGHIDEventTap)

# Create a key-down event for `char`, set its Unicode content, post it
event = CGEventCreateKeyboardEvent(None, 0, True)
CGEventKeyboardSetUnicodeString(event, len(char), char)
CGEventPost(kCGHIDEventTap, event)

A matching key-up event (the same calls with the final argument flipped to False) follows each key-down, so the OS never sees a stuck key.

The offline path (for when streaming is disabled) does use the clipboard, but it saves the clipboard contents before pasting and restores them after. Your clipboard stays intact.
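The sequencing of that save-paste-restore dance looks roughly like this, with the platform-specific calls injected as callbacks. The callback names are illustrative; on macOS they would wrap NSPasteboard reads/writes and a CGEvent-simulated Cmd+V.

```python
import time

def paste_preserving_clipboard(text, read_clip, write_clip, send_cmd_v,
                               settle=0.05):
    """Paste `text` without losing the user's clipboard.

    The callbacks are hypothetical stand-ins for pasteboard access and a
    simulated Cmd+V keystroke.
    """
    saved = read_clip()   # whatever the user had copied
    write_clip(text)      # stage the transcription
    send_cmd_v()          # paste into the frontmost app
    time.sleep(settle)    # let the paste land before restoring
    write_clip(saved)     # put the user's clipboard back
```

The small delay before restoring matters: restore too quickly and the frontmost app may read the restored clipboard instead of the staged text.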


Hold-to-Record UX

Most voice apps use a click-to-start, click-to-stop model. SpokenKey uses hold-to-record: press and hold your hotkey to record, release to transcribe.

This feels more natural — like a walkie-talkie. You don't have to remember whether you're currently recording. The physical action of holding the key maps directly to “I'm speaking now.”

The hotkey is configurable (Right Option, Right Command, Fn, F5, or F6) and uses macOS native event monitoring, so it works globally — in any app, any window.
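Stripped of the macOS event plumbing, hold-to-record is a two-state machine: key down starts the recorder, key up stops it and hands the audio off for transcription. A minimal sketch, with the recorder calls as stand-in callbacks:

```python
class HoldToRecord:
    """Walkie-talkie state machine.

    `start` and `stop` are hypothetical stand-ins for the real audio
    recorder; the macOS global event monitor would feed `on_hotkey`.
    """

    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.recording = False

    def on_hotkey(self, pressed: bool):
        if pressed and not self.recording:    # key down: begin capture
            self.recording = True
            self.start()
        elif not pressed and self.recording:  # key up: stop, transcribe
            self.recording = False
            self.stop()
```

Tracking the `recording` flag also makes key-repeat events while the key is held harmless.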


Local AI Cleanup Without Cloud APIs

Raw transcription isn't perfect. SpokenKey optionally pipes the result through a local LLM (Ollama + Llama 3.2 3B) for grammar and punctuation cleanup.

This runs entirely on your Mac. No API keys, no usage limits, no data going anywhere. If the LLM times out (10 seconds) or returns garbage, SpokenKey falls back to the raw transcription — you never lose your words.

There's also a truncation guard: if the AI returns less than 50% of the original text length, SpokenKey assumes something went wrong and uses the raw transcription instead. Small LLMs occasionally summarize when they should just clean up.
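Both safeguards fit in a few lines: catch any failure from the LLM call and apply the 50% length check. Here `ask_llm` is a hypothetical wrapper around the HTTP request to a local Ollama instance (`POST /api/generate` on port 11434).

```python
def clean_transcript(raw: str, ask_llm, timeout: float = 10.0) -> str:
    """Return the LLM-cleaned text, or the raw transcription if the
    model times out, errors, or appears to have truncated the input.

    `ask_llm` is a stand-in for the Ollama call, not SpokenKey's API.
    """
    try:
        cleaned = ask_llm(raw, timeout=timeout)
    except Exception:  # timeout, Ollama not running, malformed reply...
        return raw
    if len(cleaned) < 0.5 * len(raw):  # truncation guard: it summarized
        return raw
    return cleaned
```

Because every failure path returns `raw`, the cleanup step can only improve the output, never lose it.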


What's Next

SpokenKey is a one-person project. The roadmap is driven by what I actually need.

SpokenKey is released under the PolyForm Noncommercial license — free to read, learn from, and use personally.

The terminal workflow is free forever. The .app bundle is $29, one-time, at Gumroad. Learn more about SpokenKey.