
What we shipped. A native macOS dictation application that inserts text at the cursor in any app. Sub-300 ms perceived latency across hostile mobile networks. 40 UI languages on day one. A privacy posture that retains no audio by default.
Why it sits in our case studies. The hard problems behind Lispr are the problems we solve on client engagements. Latency engineering on cloud-dependent paths. Multilingual production at scale. Edge architecture under provider-side geo-blocks. Release engineering disciplined enough to ship 67 notarized builds in three weeks.
Live in production. 46,011 dictations and 1.1M+ words across 29 countries in the first three weeks, weighted toward users whose languages Apple Dictation and Apple Intelligence still underserve.
The delivery model. Codebridge engineers ran every workstream alongside AI agents that handled code review, localization fleets, and competitor research. The same model we apply on client engagements, demonstrated end-to-end on a product we own.
Apple Dictation underserves users whose first language is not English. Apple Intelligence excludes Ukrainian and most of the languages Lispr's user base writes in. Privacy is the second wedge. Lispr captures audio only, with no screenshots, no account, and nothing retained by default.
Cursor insertion in any application. Sub-second end-to-end response. No account, email, or payment surface. Multilingual from day one. A small, native, signed binary with auto-update.
Five constraints set the architectural decisions.
The latency floor. The conventional path (record file, POST to a Whisper endpoint, wait for response) takes 1500 to 2500 ms. The target was a path that feels instant end-to-end, fast enough that the user keeps thinking in speech instead of waiting on the machine.
No "type anywhere" API on macOS. macOS provides no system service or extension that both records audio and inserts text into arbitrary applications. We assembled the capture surface from CGEventTap, Carbon hotkeys, and synthesized keystrokes. No off-the-shelf integrated path existed.
Hostile networks. Mobile hotspots drop QUIC and UDP without surfacing an error. The ASR provider geo-blocks entire regions. The transport had to survive networks that look connected but drop packets, and reach users in countries the provider refuses to serve.
Forty languages on day one. The standard path of human translation contracts per language cannot match a multi-release-per-day cadence and does not feed quality signal back into the build. We needed broad coverage from launch with reviewable quality and a closed feedback loop into transcription.
Trust without an account. Users download Lispr and dictate. They do not sign up. The binary ships no API keys. We retain no audio by default. The trust model had to survive technical scrutiny without any data collection on the user.
A listen-only CGEventTap detects a held modifier anywhere in the system. A gesture state machine handles hold, double-tap-latch, and quick-tap interactions. Carbon RegisterEventHotKey powers the History shortcut. Lispr runs as a lightweight native macOS app with a Dock icon and menu-bar presence.
macOS offers no path for an input method that records audio and types into arbitrary apps, so we ruled out the keyboard-extension route on day one. The capture surface is built from primitives.
The client encodes PCM to Opus on every ~64 ms buffer (24 kbit/s, 20 ms packets, Ogg framing with CRC32 self-test) and streams it up a chunked URLSession upload during recording. TLS pre-warms on key-down. A shared singleton URLSession keeps connections warm.
We instrumented the full request path before tuning anything. Cloudflare fully buffers the client-to-worker body before worker.fetch() runs (body available 0 to 2 ms after key-up, even on 26-second clips). That told us the latency win lives in client-side streaming, not in the Worker.
The Worker calls the upstream ASR direct first. On a regional 403 it falls back to a Durable Object pinned to Sydney via locationHint, selected by benchmark. We key the trigger on cf.colo because the provider blocks egress IPs, not user geos. Users in unblocked colos pay zero overhead.
One English master string file feeds AI translator agents per language. A separate per-language reviewer agent scores the output. A generator bakes .lproj bundles and pre-rendered marketing pages, with English fallback. The Worker handles custom vocabularies in a dedicated Cloudflare Container (Python with lingua, wordfreq, and MeCab), off the hot path, then applies them to transcription. Instant translate runs on gpt-oss-120b.
Every build goes through Apple notarization and stapling. Updates ship via Sparkle with an EdDSA-signed appcast and a SHA-256 integrity check on the download. Analytics Engine collects telemetry that an internal dashboard visualizes behind Cloudflare Access. We forward audio transiently for transcription and retain none of it by default. A training-corpus opt-in exists; consent is required twice.
A native Swift client streams Opus audio while the user is still talking. The audio reaches a Cloudflare Worker that handles key custody, telemetry, and geo-relay. The Worker forwards to a cloud ASR provider running Whisper large-v3-turbo. Text pastes at the cursor. A StatsCounter Durable Object, R2 buckets, and Analytics Engine sit behind the Worker.

One additional modifier turns dictation into translation. Hold ⌥ to dictate; hold ⌥+⌃ to translate while you speak. The user speaks in one language, the output lands at the cursor in another, in whichever Mac app is focused.
The pipeline reuses every layer described above. The streaming Opus path delivers audio to the Worker. Whisper large-v3-turbo transcribes with automatic source-language identification. gpt-oss-120b on the same edge route translates the transcript. The smart-insertion layer pastes at the cursor. No copy-paste, no separate UI, no extra round trip on the user's side. The engineering point is the composition: a single user gesture wires a multi-model pipeline (ASR plus LLM) through one edge route with no new UI surface.
Codebridge engineers ran every workstream alongside AI agents: translation fleets with per-language reviewers, multi-agent competitor research, agent-assisted design exploration, code review. The combination is why a senior team shipped 67 Apple-notarized production releases in three weeks.
Live in production for three weeks. The metrics below are what we measured in flight.
| Dimension | Market baseline | Lispr |
|---|---|---|
| Perceived latency | 1 to 3 s | ~300 ms |
| Download size | 150 to 200+ MB | 3.67 MB |
| Download to first dictation | 3+ min (install, account, model download) | under 60 s, no account |
| UI languages at launch | 1 | 34 to 35 (now 40) |
| Account required | Yes | No |
| Mobile-network reliability | Frequent silent failures | HTTP/2-only zone plus retry layer |
| Infra to run it | Servers and GPUs | 1 Cloudflare Worker, ~1 ms CPU/dictation |
| Release cadence | Weekly at best | ~3 production releases per day |
Ukraine 8,590 · Indonesia 7,690 · Germany 1,730 · Czech Republic 1,700 · USA 1,300 · Portugal 1,260 · United Arab Emirates 780. The geography validates the multilingual-gap hypothesis: users underserved by Apple's dictation offerings dominate the early traffic.
p25: 272 ms · p50: 346 ms · p75: 441 ms · p90: 602 ms · p99: 1,251 ms.
The architecture pays the inference provider per call. The Worker handles relay, telemetry, and key custody at edge cost. Infrastructure cost scales linearly with active usage rather than with always-on GPU capacity, which is the architectural shape that lets a small operator run this at production scale.
Client memory and CPU footprint are instrumented but not stress-tested across hardware classes. Retry success on hostile networks is inferred from telemetry rather than from a controlled corpus. We do not retain lifetime audio-minutes by design, which costs us a metric we would otherwise report.
Windows release. On-device "race mode": local Apple SpeechAnalyzer running in parallel with cloud, first non-empty result wins. Per-device attestation.
iOS, built as a keyboard extension plus container app. Optional AI formatting pass. Possible single-user Pro tier.
Android, via overlay plus AccessibilityService.