CODEBRIDGE LABS PRODUCT

How We Built Lispr: Native Multilingual Voice Dictation for macOS

Native macOS dictation, sub-300 ms perceived latency across hostile networks, 40 languages on day one. A Codebridge engineering case study: latency, edge architecture, multilingual delivery, and release discipline.

Software Development
https://lispr.ai/
June 15, 2026
Country: USA
Team size: 5
Duration: ongoing
Budget: In-house R&D
Industry: Automation Tools
Technologies: Swift / Cloudflare Workers / Durable Objects / Whisper large-v3-turbo
Myroslav Budzanivskyi
Co-Founder & CTO

SUMMARY

What we shipped. A native macOS dictation application that inserts text at the cursor in any app. Sub-300 ms perceived latency across hostile mobile networks. 40 UI languages on day one. A privacy posture that retains no audio by default.

Why it sits in our case studies. The hard problems behind Lispr are the problems we solve on client engagements. Latency engineering on cloud-dependent paths. Multilingual production at scale. Edge architecture under provider-side geo-blocks. Release engineering disciplined enough to ship 67 notarized builds in three weeks.

Live in production. 46,011 dictations and 1.1M+ words across 29 countries in the first three weeks, weighted toward users whose languages Apple Dictation and Apple Intelligence still underserve.

The delivery model. Codebridge engineers ran every workstream alongside AI agents that handled code review, localization fleets, and competitor research. The same model we apply on client engagements, demonstrated end-to-end on a product we own.

Strategic Context

The multilingual gap

Apple Dictation underserves users whose first language is not English. Apple Intelligence excludes Ukrainian and most of the languages Lispr's user base writes in. Privacy is the second wedge. Lispr captures audio only, with no screenshots, no account, and nothing retained by default.

Why we built it in-house

  1. A real product bet on the multilingual gap, funded from our own balance sheet.
  2. A public demonstration of the Codebridge AI-augmented delivery model on a project we own end-to-end.
  3. A testbed for engineering practices we apply on client work, validated on our own product before we recommend them.

Success criteria at kickoff

Cursor insertion in any application. Sub-second end-to-end response. No account, email, or payment surface. Multilingual from day one. A small, native, signed binary with auto-update.

The Challenge

Five constraints set the architectural decisions.

The latency floor. The conventional path (record file, POST to a Whisper endpoint, wait for response) takes 1500 to 2500 ms. The target was a path that feels instant end-to-end, fast enough that the user keeps thinking in speech instead of waiting on the machine.

No "type anywhere" API on macOS. macOS provides no system service or extension that both records audio and inserts text into arbitrary applications. We assembled the capture surface from CGEventTap, Carbon hotkeys, and synthesized keystrokes. No off-the-shelf integrated path existed.

Hostile networks. Mobile hotspots drop QUIC and UDP without surfacing an error. The ASR provider geo-blocks entire regions. The transport had to survive networks that look connected but drop packets, and reach users in countries the provider refuses to serve.

Forty languages on day one. The standard path of human translation contracts per language cannot match a multi-release-per-day cadence and does not feed quality signal back into the build. We needed broad coverage from launch with reviewable quality and a closed feedback loop into transcription.

Trust without an account. Users download Lispr and dictate. They do not sign up. The binary ships no API keys. We retain no audio by default. The trust model had to survive technical scrutiny without any data collection on the user.

Scope of Work

To tackle these challenges, our scope of work included:

Push-to-talk capture engine

A listen-only CGEventTap detects a held modifier anywhere in the system. A gesture state machine handles hold, double-tap-latch, and quick-tap interactions. Carbon RegisterEventHotKey powers the History shortcut. Lispr runs as a lightweight native macOS app with a Dock icon and menu-bar presence.
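The gesture handling can be pictured as a small state machine over key-down and key-up timestamps. The sketch below is illustrative only: it is written in TypeScript rather than the Swift the client uses, the thresholds `HOLD_MIN_MS` and `DOUBLE_TAP_GAP_MS` are hypothetical values, and it classifies gestures on key-up for clarity, whereas a shipping push-to-talk path would presumably start recording on key-down.

```typescript
// Illustrative gesture classifier; names and thresholds are assumptions, not Lispr's values.
type Gesture = "hold" | "quick-tap" | "double-tap-latch";

interface KeyEvent {
  kind: "down" | "up";
  atMs: number; // event timestamp in milliseconds
}

const HOLD_MIN_MS = 250; // hypothetical: presses at least this long count as a hold
const DOUBLE_TAP_GAP_MS = 300; // hypothetical: max gap between two taps for a latch

class GestureMachine {
  private downAt: number | null = null;
  private lastTapUpAt: number | null = null;

  // Feed one modifier event; returns the recognized gesture on key-up, else null.
  feed(event: KeyEvent): Gesture | null {
    if (event.kind === "down") {
      this.downAt = event.atMs;
      return null;
    }
    if (this.downAt === null) return null; // stray key-up without a key-down
    const pressedAt = this.downAt;
    this.downAt = null;

    if (event.atMs - pressedAt >= HOLD_MIN_MS) {
      this.lastTapUpAt = null;
      return "hold";
    }
    // Short press: a second tap soon after the previous one latches recording.
    if (this.lastTapUpAt !== null && pressedAt - this.lastTapUpAt <= DOUBLE_TAP_GAP_MS) {
      this.lastTapUpAt = null;
      return "double-tap-latch";
    }
    // Simplification: the first tap is reported immediately instead of being deferred.
    this.lastTapUpAt = event.atMs;
    return "quick-tap";
  }
}
```

Keeping the classifier free of timers and system calls makes it easy to replay recorded event sequences against it.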

macOS offers no path for an input method that records audio and types into arbitrary apps, so we ruled out the keyboard-extension route on day one. The capture surface is built from primitives.

Streaming transcription pipeline

The client encodes PCM to Opus on every ~64 ms buffer (24 kbit/s, 20 ms packets, Ogg framing with CRC32 self-test) and streams it up a chunked URLSession upload during recording. TLS pre-warms on key-down. A shared singleton URLSession keeps connections warm.
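The numbers in this paragraph can be checked with a little arithmetic. The sketch below assumes 16 kHz mono capture (the rate listed in the technology stack), so a ~64 ms buffer is about 1,024 samples and a 20 ms packet is 320 samples. The `PacketFramer` class is a hypothetical helper that only cuts buffers into packet-sized frames with carry-over. It does not encode Opus, and the byte figure is the nominal size at 24 kbit/s.

```typescript
// Framing arithmetic only; PacketFramer is a hypothetical helper, not the shipped encoder.
const SAMPLE_RATE_HZ = 16_000; // assumed capture rate (16 kHz mono PCM)
const PACKET_MS = 20; // packet duration from the text
const BITRATE_BPS = 24_000; // 24 kbit/s from the text

const samplesPerPacket = (SAMPLE_RATE_HZ * PACKET_MS) / 1000; // 320 samples
const bytesPerPacket = (BITRATE_BPS * PACKET_MS) / 1000 / 8; // nominal 60 bytes

class PacketFramer {
  private carry: number[] = [];

  // Accepts one capture buffer and returns every complete 20 ms frame it can cut.
  push(buffer: number[]): number[][] {
    const samples = this.carry.concat(buffer);
    const frames: number[][] = [];
    let offset = 0;
    while (samples.length - offset >= samplesPerPacket) {
      frames.push(samples.slice(offset, offset + samplesPerPacket));
      offset += samplesPerPacket;
    }
    this.carry = samples.slice(offset); // leftover samples wait for the next buffer
    return frames;
  }

  // Number of samples held back until the next buffer arrives.
  get pending(): number {
    return this.carry.length;
  }
}
```

Under these assumptions a 1,024-sample buffer yields three 20 ms frames and carries 64 samples forward, so packets can leave the client while the user is still speaking.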

We instrumented the full request path before tuning anything. Cloudflare fully buffers the client-to-worker body before worker.fetch() runs: the body is available 0 to 2 ms after key-up, even on 26-second clips. That told us the latency win lives in client-side streaming, not in the Worker.

Geo-aware edge relay

The Worker calls the upstream ASR directly first. On a regional 403 it falls back to a Durable Object pinned to Sydney via locationHint, a location we selected by benchmark. We key the trigger on cf.colo because the provider blocks egress IPs, not user geos. Users in unblocked colos pay zero overhead.
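A minimal sketch of the direct-first, relay-on-403 decision follows. It rests on assumptions: the in-memory `blockedColos` cache, the function names, and the injected `direct` and `relay` callables are all hypothetical, and the Durable Object lookup itself is left out. The source only states that the trigger is keyed on cf.colo and that a regional 403 causes the fallback.

```typescript
// Hypothetical routing helpers; the Worker's real bookkeeping is not described in the source.
type Route = "direct" | "relay";

interface AsrResult {
  status: number;
  text: string;
}

type Upstream = (audio: Uint8Array) => Promise<AsrResult>;

// Assumption: colos that have seen a regional 403 are remembered in memory.
const blockedColos = new Set<string>();

// Unblocked colos go direct, so they pay no relay overhead.
function chooseRoute(colo: string): Route {
  return blockedColos.has(colo) ? "relay" : "direct";
}

// Records the outcome of a direct call; returns true when the relay must be used.
function noteDirectStatus(colo: string, status: number): boolean {
  if (status !== 403) return false;
  blockedColos.add(colo);
  return true;
}

// Direct first, relay on a regional 403. `direct` and `relay` are injected stand-ins
// for the provider call and for the Durable Object stub.
async function transcribe(
  colo: string,
  audio: Uint8Array,
  direct: Upstream,
  relay: Upstream,
): Promise<AsrResult> {
  if (chooseRoute(colo) === "direct") {
    const result = await direct(audio);
    if (!noteDirectStatus(colo, result.status)) return result;
  }
  return relay(audio);
}
```

Keying the cache on the colo rather than on the user's country mirrors the observation that the provider blocks egress IPs.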

Multi-language and vocabulary layer

One English master string file feeds AI translator agents per language. A separate per-language reviewer agent scores the output. A generator bakes .lproj bundles and pre-rendered marketing pages, with English fallback. The Worker handles custom vocabularies in a dedicated Cloudflare Container (Python with lingua, wordfreq, and MeCab), off the hot path, then applies them to transcription. Instant translate runs on gpt-oss-120b.
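The English-fallback step of the generator can be illustrated with a small merge function. This is a sketch under assumptions: `bakeLocale` and the flat key-to-string shape are hypothetical, and the real generator emits .lproj bundles rather than plain objects.

```typescript
// Hypothetical merge step: every key in the English master gets a value,
// falling back to English when a translation is missing or blank.
type Strings = Record<string, string>;

function bakeLocale(master: Strings, translated: Strings): Strings {
  const baked: Strings = {};
  for (const [key, english] of Object.entries(master)) {
    const candidate = translated[key];
    const usable = candidate !== undefined && candidate.trim() !== "";
    baked[key] = usable ? candidate : english;
  }
  return baked; // keys absent from the master are dropped
}
```

Driving the loop from the master file means a partially translated locale still ships a complete string table.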

Trust, update, and observability layer

Every build goes through Apple notarization and stapling. Updates ship via Sparkle with an EdDSA-signed appcast and a SHA-256 integrity check on the download. Analytics Engine collects telemetry that an internal dashboard visualizes behind Cloudflare Access. We forward audio transiently for transcription and retain none of it by default. A training-corpus opt-in exists; consent is required twice.

Solution Architecture

A native Swift client streams Opus audio while the user is still talking. The audio reaches a Cloudflare Worker that handles key custody, telemetry, and geo-relay. The Worker forwards to a cloud ASR provider running Whisper large-v3-turbo. Text pastes at the cursor. A StatsCounter Durable Object, R2 buckets, and Analytics Engine sit behind the Worker.
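Key custody at the Worker can be sketched as a header rewrite: the client presents its own token, and the provider key is attached only at the edge. Everything named here is an assumption made for illustration, including the `x-client-token` header, the `EdgeSecrets` shape, the bearer scheme, and the default content type. The source states only that keys live server-side and that the client authenticates with a token.

```typescript
// Hypothetical header rewrite; header names and token storage are assumptions.
interface EdgeSecrets {
  asrApiKey: string; // provider key, held only at the edge
  validTokens: Set<string>; // assumed store of accepted client tokens
}

type HeaderMap = Record<string, string>;

// Returns the headers for the upstream call, or null when the client token is rejected.
function buildUpstreamHeaders(client: HeaderMap, secrets: EdgeSecrets): HeaderMap | null {
  const token = client["x-client-token"];
  if (token === undefined || !secrets.validTokens.has(token)) return null;
  return {
    // The client token is not forwarded upstream; the provider key replaces it.
    authorization: `Bearer ${secrets.asrApiKey}`,
    "content-type": client["content-type"] ?? "audio/ogg",
  };
}
```

Because the key is added only in this step, nothing extracted from the binary can call the provider directly.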

Figure: three-layer architecture diagram for Lispr. Layer 1, macOS Client: push-to-talk trigger, streaming Opus encoder, smart insertion at the cursor, local history and retry. Layer 2, Cloudflare Edge: an Edge Worker handles key custody and proxy at about 1 ms CPU per dictation, a Geo Relay Durable Object covers blocked regions, Analytics Engine and R2 sit alongside, and a Vocabulary Service runs off the hot path. Layer 3, AI Inference: Whisper large-v3-turbo for speech-to-text and gpt-oss-120b for instant translation, covering 40 languages. Audio streams up from the client while the user is speaking; text comes back at the cursor in around 300 ms round trip.
Three layers behind one keypress. The native macOS client streams Opus audio to a Cloudflare Worker, which routes to Whisper for transcription or gpt-oss-120b for instant translation. The result lands at the cursor in around 300 ms.
Phase | What shipped | Versions
0. Spike / PoC | Record, POST to Whisper, paste | pre-0.2
1. App shell + push-to-talk | CGEventTap trigger, AVCapture audio, recording indicator | 0.2-0.7
2. Worker + ASR proxy | Cloudflare Worker, server-side keys, client token auth | 0.2
3. Trust and distribution | Notarization, Sparkle EdDSA auto-update, signed DMG, one-command release | 0.2-0.7
4. Onboarding + i18n | Onboarding flow, trigger-key picker, history, 35-language localization | 0.8-0.16
5. Network hardening | HTTP/2-only zone, geo-relay Durable Object, retry layer | 0.31-0.36
6. Streaming latency rewrite | Opus streaming upload, shared URLSession. ~2500 to ~300 ms | 0.41-0.42
7. Feature expansion | Instant translate, dictation-language picker, custom vocabularies | 0.50-0.62
8. Observability plane | Analytics Engine telemetry, internal dashboard, e2e latency beacon | 0.63-0.67
9. Platform expansion | Windows code-complete; iOS and Android planned | ongoing

Instant Translate (shipped capability)

One additional modifier turns dictation into translation. Hold ⌥ to dictate; hold ⌥+⌃ to translate while you speak. The user speaks in one language, the output lands at the cursor in another, in whichever Mac app is focused.

The pipeline reuses every layer described above. The streaming Opus path delivers audio to the Worker. Whisper large-v3-turbo transcribes with automatic source-language identification. gpt-oss-120b on the same edge route translates the transcript. The smart-insertion layer pastes at the cursor. No copy-paste, no separate UI, no extra round trip on the user's side. The engineering point is the composition: a single user gesture wires a multi-model pipeline (ASR plus LLM) through one edge route with no new UI surface.
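The composition can be sketched as one function that always runs ASR and adds the translation step only when the second modifier is held. The names `modeFor` and `runDictation`, the `Modifiers` shape, and the injected `asr` and `translate` callables are hypothetical stand-ins for the real gesture layer and model calls.

```typescript
// Hypothetical composition of the dictate/translate paths; model calls are injected.
type Mode = "dictate" | "translate";

interface Modifiers {
  option: boolean; // ⌥ held
  control: boolean; // ⌃ held
}

// ⌥ alone dictates, ⌥+⌃ translates, anything else does not trigger the pipeline.
function modeFor(held: Modifiers): Mode | null {
  if (!held.option) return null;
  return held.control ? "translate" : "dictate";
}

type Asr = (audio: Uint8Array) => Promise<string>;
type Translate = (text: string, targetLanguage: string) => Promise<string>;

// Same audio path for both modes; translation is one extra step on the same route.
async function runDictation(
  mode: Mode,
  audio: Uint8Array,
  targetLanguage: string,
  asr: Asr,
  translate: Translate,
): Promise<string> {
  const transcript = await asr(audio);
  return mode === "translate" ? translate(transcript, targetLanguage) : transcript;
}
```

Both modes return a single string, so the insertion layer does not need to know which path produced it.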

Technology stack

Layer | Stack
macOS App | Swift, AppKit, AVFoundation, CGEventTap + Carbon RegisterEventHotKey, AudioToolbox AudioConverter (custom Opus encoder), Sparkle, CoreAudio HAL, IOKit. Universal arm64+x86_64, macOS 11+. ~9,200 LOC across 42 Swift files.
Audio pipeline | 16 kHz mono PCM to Opus 48 kHz at 24 kbit/s, 20 ms packets, Ogg framing + CRC32 self-test. Dual sink: live stream plus local .ogg for retry.
Edge / backend | Cloudflare Workers (1,335 LOC), Durable Objects (geo relay and StatsCounter), R2, Analytics Engine, cron keep-alive.
AI / ML | Cloud inference: Whisper large-v3-turbo (ASR), gpt-oss-120b (instant translate). Custom vocabulary: lingua, wordfreq, and MeCab in a Cloudflare Container. Formatting pass (gpt-oss-20b) designed, on the roadmap.
Updates & distribution | Sparkle, EdDSA-signed appcast, Apple notarization, SHA-256 download integrity check, direct .dmg distribution.
Web / marketing | Static HTML and per-locale JSON, cheerio-based i18n build, 40 pre-rendered language pages on Cloudflare Pages with Accept-Language redirect, deployed via GitHub Actions.
Windows port | .NET 8 + WPF, ~2,548 LOC across 24 C# files (code-complete).
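The Accept-Language redirect in the web layer amounts to picking the best supported locale from the request header, with English as the fallback. The sketch below is one plausible implementation, not the site's actual code: `pickLocale` is a hypothetical name, and matching a regional tag such as uk-UA to its base language is an assumption.

```typescript
// Hypothetical locale negotiation for the Accept-Language redirect.
// Returns a lowercased tag from `supported`, or `fallback` when nothing matches.
function pickLocale(header: string | null, supported: string[], fallback: string = "en"): string {
  if (header === null || header.trim() === "") return fallback;
  const known = new Set(supported.map((s) => s.toLowerCase()));

  const ranked = header
    .split(",")
    .map((part, index) => {
      const pieces = part.trim().split(";");
      const tag = (pieces[0] ?? "").trim().toLowerCase();
      const qPiece = pieces.slice(1).map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qPiece === undefined ? 1 : Number(qPiece.slice(2));
      return { tag, q: Number.isFinite(q) ? q : 0, index };
    })
    .filter((entry) => entry.tag !== "" && entry.q > 0)
    // Highest quality first; ties keep the order the browser sent.
    .sort((a, b) => b.q - a.q || a.index - b.index);

  for (const entry of ranked) {
    if (known.has(entry.tag)) return entry.tag;
    const base = entry.tag.split("-")[0] ?? entry.tag; // e.g. "uk-ua" -> "uk"
    if (known.has(base)) return base;
  }
  return fallback;
}
```

For a header of `uk-UA,uk;q=0.9,en;q=0.8` and a supported list containing `uk`, this returns `uk`, which a redirect handler could map to that locale's pre-rendered page.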

Team and delivery model

Function | How it was delivered
Architecture and technical leadership | Senior engineering across the full project lifecycle
macOS engineering | Native Swift development with AI-assisted multi-agent code review passes per release
Backend and edge engineering | Cloudflare Worker plus Durable Objects, same AI-assisted workflow
Localization | Agent-assisted translation pipeline with per-language automated review and human oversight
QA | Telemetry-driven testing, multi-agent code review, on-device verification
Design | In-code design system, agent-assisted exploration
Release engineering and signing | CTO-held Apple Developer identity, one-command notarized release pipeline

Codebridge engineers ran every workstream alongside AI agents: translation fleets with per-language reviewers, multi-agent competitor research, agent-assisted design exploration, and code review. The combination is why a senior team shipped 67 Apple-notarized production releases in three weeks.

Production Metrics and Outcomes

Live in production for three weeks. The metrics below are what we measured in flight.

  • Perceived latency. ~300 ms on the warm path. Server-side median 346 ms.
  • p99 latency. 1,251 ms across n ≈ 26,860 dictations. The geo-relay path absorbs the worst of the upstream provider's regional variance.
  • Production volume. 46,011 dictations and 1.1M+ words across 29 countries.
  • Multilingual coverage. 40 UI languages live, with a pre-rendered per-locale marketing page for each one.
  • Worker CPU per dictation. ~1 ms. The edge handles relay, not transcription.
  • Release cadence. 67 Apple-notarized production builds across three weeks.

Before and after, against the wider product bar

Dimension | Market baseline | Lispr
Perceived latency | 1 to 3 s | ~300 ms
Download size | 150 to 200+ MB | 3.67 MB
Download to first dictation | 3+ min (install, account, model download) | under 60 s, no account
UI languages at launch | 1 | 34 to 35 (now 40)
Account required | Yes | No
Mobile-network reliability | Frequent silent failures | HTTP/2-only zone plus retry layer
Infra to run it | Servers and GPUs | 1 Cloudflare Worker, ~1 ms CPU/dictation
Release cadence | Weekly at best | ~3 production releases per day

Top countries (first three weeks)

Ukraine 8,590 · Indonesia 7,690 · Germany 1,730 · Czech Republic 1,700 · USA 1,300 · Portugal 1,260 · United Arab Emirates 780. The geography validates the multilingual-gap hypothesis: users underserved by Apple's dictation offerings dominate the early traffic.

Latency percentiles (server-side, live 7 days, n ≈ 26,860)

p25: 272 ms · p50: 346 ms · p75: 441 ms · p90: 602 ms · p99: 1,251 ms.
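For readers who want to reproduce figures like these from raw samples, the sketch below computes percentiles with the nearest-rank method. The method is an assumption, since the source does not say how the dashboard derives its percentiles, and the function name `percentile` is hypothetical.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
// Assumption: the source does not specify which percentile method the dashboard uses.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1]!;
}
```

Nearest-rank always returns a latency that was actually observed, which suits tail percentiles such as p99.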

Cost shape

The architecture pays the inference provider per call. The Worker handles relay, telemetry, and key custody at edge cost. Infrastructure cost scales linearly with active usage rather than with always-on GPU capacity, which is the architectural shape that lets a small operator run this at production scale.

What we did not measure

Client memory and CPU footprint are instrumented but not stress-tested across hardware classes. Retry success on hostile networks is inferred from telemetry rather than from a controlled corpus. We do not retain lifetime audio-minutes by design, which costs us a metric we would otherwise report.

Future Plans

Phase 2 (active)

Windows release. On-device "race mode": local Apple SpeechAnalyzer running in parallel with cloud, first non-empty result wins. Per-device attestation.
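The "first non-empty result wins" rule can be sketched as a small arbiter that both recognizers report into. This is a design illustration under assumptions, not the planned implementation: `createRace`, the `Source` labels, and the callback shape are hypothetical, and the case where every source comes back empty is left to the caller.

```typescript
// Hypothetical race arbiter: the first non-empty transcript settles the race.
type Source = "local" | "cloud";

interface Winner {
  source: Source;
  text: string;
}

function createRace(onWinner: (winner: Winner) => void) {
  let settled = false;
  return {
    // Called by each recognizer when it finishes; empty results never win.
    report(source: Source, text: string): void {
      if (settled || text.trim() === "") return;
      settled = true;
      onWinner({ source, text });
    },
    // True once a winner has been delivered.
    isSettled(): boolean {
      return settled;
    },
  };
}
```

A caller would start both recognizers, pass each result to `report`, and insert whatever `onWinner` delivers; later reports are ignored.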

Phase 3 (planned)

iOS, built as a keyboard extension plus container app. Optional AI formatting pass. Possible single-user Pro tier.

Phase 4 (planned)

Android, via overlay plus AccessibilityService.
