6 changes: 3 additions & 3 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th
- **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay
- **Pattern**: MVVM with `@StateObject` / `@Published` state management
- **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming
- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks
- **Speech-to-Text**: AssemblyAI real-time streaming via websocket, using `u3-rt-pro` for English and `whisper-rt` for Chinese, with OpenAI and Apple Speech as fallbacks
- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy
- **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support
- **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap.
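The English/Chinese model switch described in the speech-to-text bullet above can be sketched as follows. This is an illustrative TypeScript sketch of the selection rule only; the function name and shape are assumptions, not the app's actual Swift code.

```typescript
// Selection rule: English (or no explicit language) uses the universal
// realtime model; any other language falls back to Whisper realtime.
// Function name and signature are illustrative, not from the codebase.
function pickSpeechModel(languageCode?: string): string {
  const normalized = languageCode?.trim().toLowerCase();
  if (!normalized || normalized === "en") {
    return "u3-rt-pro";
  }
  return "whisper-rt";
}
```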
@@ -34,7 +34,7 @@ The app never calls external APIs directly. All requests go through a Cloudflare
| `POST /transcribe-token` | `streaming.assemblyai.com/v3/token` | Fetches a short-lived (480s) AssemblyAI websocket token |

Worker secrets: `ANTHROPIC_API_KEY`, `ASSEMBLYAI_API_KEY`, `ELEVENLABS_API_KEY`
Worker vars: `ELEVENLABS_VOICE_ID`
Worker vars: `ELEVENLABS_VOICE_ID`, `ELEVENLABS_CHINESE_VOICE_ID` (optional)

### Key Architecture Decisions

@@ -61,7 +61,7 @@ Worker vars: `ELEVENLABS_VOICE_ID`
| `CompanionScreenCaptureUtility.swift` | ~132 | Multi-monitor screenshot capture using ScreenCaptureKit. Returns labeled image data for each connected display. |
| `BuddyDictationManager.swift` | ~866 | Push-to-talk voice pipeline. Handles microphone capture via `AVAudioEngine`, provider-aware permission checks, keyboard/button dictation sessions, transcript finalization, shortcut parsing, contextual keyterms, and live audio-level reporting for waveform feedback. |
| `BuddyTranscriptionProvider.swift` | ~100 | Protocol surface and provider factory for voice transcription backends. Resolves provider based on `VoiceTranscriptionProvider` in Info.plist — AssemblyAI, OpenAI, or Apple Speech. |
| `AssemblyAIStreamingTranscriptionProvider.swift` | ~478 | Streaming transcription provider. Fetches temp tokens from the Cloudflare Worker, opens an AssemblyAI v3 websocket, streams PCM16 audio, tracks turn-based transcripts, and delivers finalized text on key-up. Shares a single URLSession across all sessions. |
| `AssemblyAIStreamingTranscriptionProvider.swift` | ~541 | Streaming transcription provider. Fetches temp tokens from the Cloudflare Worker, opens an AssemblyAI v3 websocket, streams PCM16 audio, switches between `u3-rt-pro` for English and `whisper-rt` for Chinese, tracks turn-based transcripts, and delivers finalized text on key-up. Shares a single URLSession across all sessions. |
| `OpenAIAudioTranscriptionProvider.swift` | ~317 | Upload-based transcription provider. Buffers push-to-talk audio locally, uploads as WAV on release, returns finalized transcript. |
| `AppleSpeechTranscriptionProvider.swift` | ~147 | Local fallback transcription provider backed by Apple's Speech framework. |
| `BuddyAudioConversionSupport.swift` | ~108 | Audio conversion helpers. Converts live mic buffers to PCM16 mono audio and builds WAV payloads for upload-based providers. |
8 changes: 6 additions & 2 deletions README.md
@@ -56,13 +56,16 @@ npx wrangler secret put ASSEMBLYAI_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY
```

For the ElevenLabs voice ID, open `wrangler.toml` and set it there (it's not sensitive):
For the ElevenLabs voice IDs, open `wrangler.toml` and set them there (they're not sensitive):

```toml
[vars]
ELEVENLABS_VOICE_ID = "your-voice-id-here"
ELEVENLABS_CHINESE_VOICE_ID = "optional-chinese-voice-id"
```

`ELEVENLABS_VOICE_ID` remains the default voice. `ELEVENLABS_CHINESE_VOICE_ID` is optional and is used only when the app is in Chinese voice mode; if you leave it blank, the default voice is reused.
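That fallback behavior might look like the following Worker-side sketch. Only the env var names come from the `wrangler.toml` snippet; the `voiceMode` field and the function itself are hypothetical assumptions, not the Worker's actual code.

```typescript
// Hypothetical Worker-side voice selection. Env var names match the
// wrangler.toml [vars] section; `voiceMode` and the function shape are
// illustrative assumptions.
interface VoiceEnv {
  ELEVENLABS_VOICE_ID: string;
  ELEVENLABS_CHINESE_VOICE_ID?: string;
}

function resolveVoiceId(env: VoiceEnv, voiceMode: string): string {
  // A blank or missing Chinese voice ID falls back to the default voice.
  if (voiceMode === "chinese" && env.ELEVENLABS_CHINESE_VOICE_ID) {
    return env.ELEVENLABS_CHINESE_VOICE_ID;
  }
  return env.ELEVENLABS_VOICE_ID;
}
```

Because an empty string is falsy in JavaScript, leaving the Chinese voice ID blank naturally selects the default voice.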

Deploy it:

```bash
@@ -87,6 +90,7 @@ ANTHROPIC_API_KEY=sk-ant-...
ASSEMBLYAI_API_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_VOICE_ID=...
ELEVENLABS_CHINESE_VOICE_ID=...
```

Then update the proxy URLs in the Swift code to point to `http://localhost:8787` instead of the deployed Worker URL while developing. Grep for `clicky-proxy` to find them all.
@@ -127,7 +131,7 @@ The app will appear in your menu bar (not the dock). Click the icon to open the

If you want the full technical breakdown, read `CLAUDE.md`. But here's the short version:

**Menu bar app** (no dock icon) with two `NSPanel` windows — one for the control panel dropdown, one for the full-screen transparent cursor overlay. Push-to-talk streams audio over a websocket to AssemblyAI, sends the transcript + screenshot to Claude via streaming SSE, and plays the response through ElevenLabs TTS. Claude can embed `[POINT:x,y:label:screenN]` tags in its responses to make the cursor fly to specific UI elements across multiple monitors. All three APIs are proxied through a Cloudflare Worker.
**Menu bar app** (no dock icon) with two `NSPanel` windows — one for the control panel dropdown, one for the full-screen transparent cursor overlay. Push-to-talk streams audio over a websocket to AssemblyAI, sends the transcript + screenshot to Claude via streaming SSE, and plays the response through ElevenLabs TTS. English uses AssemblyAI's `u3-rt-pro` model, while Chinese mode switches to `whisper-rt` for reliable transcription. Claude can embed `[POINT:x,y:label:screenN]` tags in its responses to make the cursor fly to specific UI elements across multiple monitors. All three APIs are proxied through a Cloudflare Worker.
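The `[POINT:x,y:label:screenN]` format lends itself to simple regex extraction. A hypothetical parser is sketched below in TypeScript for brevity; the tag format comes from the description above, but the parsing code itself is illustrative, not the app's Swift implementation.

```typescript
// Hypothetical parser for the [POINT:x,y:label:screenN] tag format.
// The format is documented; the code is an illustrative sketch.
interface PointTag {
  x: number;
  y: number;
  label: string;
  screen: number;
}

function parsePointTags(responseText: string): PointTag[] {
  const pattern = /\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]/g;
  const tags: PointTag[] = [];
  for (const match of responseText.matchAll(pattern)) {
    tags.push({
      x: Number(match[1]),
      y: Number(match[2]),
      label: match[3],
      screen: Number(match[4]),
    });
  }
  return tags;
}
```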

## Project structure

11 changes: 11 additions & 0 deletions install-clicky.sh
@@ -0,0 +1,11 @@
#!/bin/bash
APP_PATH="$HOME/Library/Developer/Xcode/DerivedData/leanring-buddy-dvwepfgqqgvpjhbjcbybcytzaawt/Index.noindex/Build/Products/Debug/Clicky.app"

if [ -d "$APP_PATH" ]; then
rm -rf "/Applications/Clicky.app" 2>/dev/null
cp -R "$APP_PATH" /Applications/
echo "✅ Clicky installed to /Applications!"
echo "Now open Clicky from Applications folder and grant permissions."
else
echo "❌ Clicky.app not found. Please build in Xcode first (⌘R)"
fi
25 changes: 19 additions & 6 deletions leanring-buddy.xcodeproj/project.pbxproj
@@ -34,9 +34,22 @@
28F22CD62F56440300A0FC59 /* leanring-buddyUITests.xctest */ = {isa = PBXFileReference; explicitFileType = wrapper.cfbundle; includeInIndex = 0; path = "leanring-buddyUITests.xctest"; sourceTree = BUILT_PRODUCTS_DIR; };
/* End PBXFileReference section */

/* Begin PBXFileSystemSynchronizedBuildFileExceptionSet section */
AA11CC112F7000010039DA55 /* Exceptions for "leanring-buddy" folder in "leanring-buddy" target */ = {
isa = PBXFileSystemSynchronizedBuildFileExceptionSet;
membershipExceptions = (
Info.plist,
);
target = 28F22CBE2F56440300A0FC59 /* leanring-buddy */;
};
/* End PBXFileSystemSynchronizedBuildFileExceptionSet section */

/* Begin PBXFileSystemSynchronizedRootGroup section */
28F22CC12F56440300A0FC59 /* leanring-buddy */ = {
isa = PBXFileSystemSynchronizedRootGroup;
exceptions = (
AA11CC112F7000010039DA55 /* Exceptions for "leanring-buddy" folder in "leanring-buddy" target */,
);
path = "leanring-buddy";
sourceTree = "<group>";
};
@@ -411,7 +424,7 @@
CODE_SIGN_STYLE = Automatic;
COMBINE_HIDPI_IMAGES = YES;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 2UDAY4J48G;
DEVELOPMENT_TEAM = ZZY6K862N2;
ENABLE_APP_SANDBOX = NO;
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_OUTGOING_NETWORK_CONNECTIONS = YES;
@@ -449,7 +462,7 @@
CODE_SIGN_STYLE = Automatic;
COMBINE_HIDPI_IMAGES = YES;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 2UDAY4J48G;
DEVELOPMENT_TEAM = ZZY6K862N2;
ENABLE_APP_SANDBOX = NO;
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_OUTGOING_NETWORK_CONNECTIONS = YES;
@@ -484,7 +497,7 @@
BUNDLE_LOADER = "$(TEST_HOST)";
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MACOSX_DEPLOYMENT_TARGET = 14.2;
MARKETING_VERSION = 1.0;
@@ -505,7 +518,7 @@
BUNDLE_LOADER = "$(TEST_HOST)";
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MACOSX_DEPLOYMENT_TARGET = 14.2;
MARKETING_VERSION = 1.0;
@@ -525,7 +538,7 @@
buildSettings = {
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MARKETING_VERSION = 1.0;
PRODUCT_BUNDLE_IDENTIFIER = "com.yourcompany.leanring-buddyUITests";
@@ -544,7 +557,7 @@
buildSettings = {
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MARKETING_VERSION = 1.0;
PRODUCT_BUNDLE_IDENTIFIER = "com.yourcompany.leanring-buddyUITests";
@@ -0,0 +1,14 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>SchemeUserState</key>
<dict>
<key>leanring-buddy.xcscheme_^#shared#^_</key>
<dict>
<key>orderHint</key>
<integer>0</integer>
</dict>
</dict>
</dict>
</plist>
24 changes: 17 additions & 7 deletions leanring-buddy/AppleSpeechTranscriptionProvider.swift
@@ -25,11 +25,12 @@ final class AppleSpeechTranscriptionProvider: BuddyTranscriptionProvider {

func startStreamingSession(
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void
) async throws -> any BuddyStreamingTranscriptionSession {
guard let speechRecognizer = Self.makeBestAvailableSpeechRecognizer() else {
guard let speechRecognizer = Self.makeBestAvailableSpeechRecognizer(languageCode: languageCode) else {
throw AppleSpeechTranscriptionProviderError(message: "dictation is not available on this mac.")
}

@@ -41,14 +41,23 @@
)
}

private static func makeBestAvailableSpeechRecognizer() -> SFSpeechRecognizer? {
let preferredLocales = [
Locale.autoupdatingCurrent,
Locale(identifier: "en-US")
]
private static func makeBestAvailableSpeechRecognizer(languageCode: String?) -> SFSpeechRecognizer? {
var preferredLocales: [Locale] = []

if let languageCode {
preferredLocales.append(Locale(identifier: languageCode))
}

preferredLocales.append(Locale.autoupdatingCurrent)
preferredLocales.append(Locale(identifier: "en-US"))

if languageCode != "zh-CN" {
preferredLocales.append(Locale(identifier: "zh-CN"))
}

for preferredLocale in preferredLocales {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale) {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale),
speechRecognizer.isAvailable {
return speechRecognizer
}
}
79 changes: 74 additions & 5 deletions leanring-buddy/AssemblyAIStreamingTranscriptionProvider.swift
@@ -19,7 +19,7 @@ struct AssemblyAIStreamingTranscriptionProviderError: LocalizedError {
final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider {
/// URL for the Cloudflare Worker endpoint that returns a short-lived
/// AssemblyAI streaming token. The real API key never leaves the server.
private static let tokenProxyURL = "https://your-worker-name.your-subdomain.workers.dev/transcribe-token"
private static let tokenProxyURL = "https://clicky-proxy.clicky-mark.workers.dev/transcribe-token"

let displayName = "AssemblyAI"
let requiresSpeechRecognitionPermission = false
@@ -35,11 +35,11 @@ final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider {

func startStreamingSession(
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void
) async throws -> any BuddyStreamingTranscriptionSession {
// Fetch a fresh temporary token from the proxy before each session
let temporaryToken = try await fetchTemporaryToken()
print("🎙️ AssemblyAI: fetched temporary token (\(temporaryToken.prefix(20))...)")

Expand All @@ -48,6 +48,7 @@ final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider
temporaryToken: temporaryToken,
urlSession: sharedWebSocketURLSession,
keyterms: keyterms,
languageCode: languageCode,
onTranscriptUpdate: onTranscriptUpdate,
onFinalTranscriptReady: onFinalTranscriptReady,
onError: onError
@@ -85,6 +86,53 @@ final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider {
}

private final class AssemblyAIStreamingTranscriptionSession: NSObject, BuddyStreamingTranscriptionSession {
private enum StreamingSpeechModelConfiguration {
case universalRealtimePro
case whisperRealtime

init(languageCode: String?) {
let normalizedLanguageCode = languageCode?
.trimmingCharacters(in: .whitespacesAndNewlines)
.lowercased()

if let normalizedLanguageCode,
!normalizedLanguageCode.isEmpty,
normalizedLanguageCode != "en" {
self = .whisperRealtime
return
}

self = .universalRealtimePro
}

var modelIdentifier: String {
switch self {
case .universalRealtimePro:
return "u3-rt-pro"
case .whisperRealtime:
return "whisper-rt"
}
}

var supportsExplicitLanguageCode: Bool {
switch self {
case .universalRealtimePro:
return true
case .whisperRealtime:
return false
}
}

var shouldEnableLanguageDetection: Bool {
switch self {
case .universalRealtimePro:
return false
case .whisperRealtime:
return true
}
}
}

private struct MessageEnvelope: Decodable {
let type: String
}
@@ -117,6 +165,7 @@ private final class AssemblyAIStreamingTranscriptionSession: NSObject, BuddyStreamingTranscriptionSession {
private let apiKey: String?
private let temporaryToken: String?
private let keyterms: [String]
private let languageCode: String?
private let onTranscriptUpdate: (String) -> Void
private let onFinalTranscriptReady: (String) -> Void
private let onError: (Error) -> Void
@@ -142,6 +191,7 @@
temporaryToken: String?,
urlSession: URLSession,
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void
@@ -150,6 +200,7 @@
self.temporaryToken = temporaryToken
self.urlSession = urlSession
self.keyterms = keyterms
self.languageCode = languageCode
self.onTranscriptUpdate = onTranscriptUpdate
self.onFinalTranscriptReady = onFinalTranscriptReady
self.onError = onError
@@ -158,7 +209,8 @@
func open() async throws {
let websocketURL = try Self.makeWebsocketURL(
temporaryToken: temporaryToken,
keyterms: keyterms
keyterms: keyterms,
languageCode: languageCode
)

var websocketRequest = URLRequest(url: websocketURL)
@@ -436,21 +488,38 @@

private static func makeWebsocketURL(
temporaryToken: String?,
keyterms: [String]
keyterms: [String],
languageCode: String?
) throws -> URL {
guard var websocketURLComponents = URLComponents(string: websocketBaseURLString) else {
throw AssemblyAIStreamingTranscriptionProviderError(
message: "AssemblyAI websocket URL is invalid."
)
}

let streamingSpeechModelConfiguration = StreamingSpeechModelConfiguration(
languageCode: languageCode
)

var queryItems = [
URLQueryItem(name: "sample_rate", value: "16000"),
URLQueryItem(name: "encoding", value: "pcm_s16le"),
URLQueryItem(name: "format_turns", value: "true"),
URLQueryItem(name: "speech_model", value: "u3-rt-pro")
URLQueryItem(
name: "speech_model",
value: streamingSpeechModelConfiguration.modelIdentifier
)
]

if streamingSpeechModelConfiguration.shouldEnableLanguageDetection {
queryItems.append(URLQueryItem(name: "language_detection", value: "true"))
}

if streamingSpeechModelConfiguration.supportsExplicitLanguageCode,
let languageCode {
queryItems.append(URLQueryItem(name: "language_code", value: languageCode))
}

let normalizedKeyterms = keyterms
.map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
.filter { !$0.isEmpty }
2 changes: 2 additions & 0 deletions leanring-buddy/BuddyDictationManager.swift
@@ -265,6 +265,7 @@ final class BuddyDictationManager: NSObject, ObservableObject {
private let transcriptionProvider: any BuddyTranscriptionProvider
private let audioEngine = AVAudioEngine()
private var activeTranscriptionSession: (any BuddyStreamingTranscriptionSession)?
var languageCode: String?
private var activeStartSource: BuddyDictationStartSource?
private var draftCallbacks: BuddyDictationDraftCallbacks?
private var draftTextBeforeCurrentDictation = ""
@@ -519,6 +520,7 @@

let activeTranscriptionSession = try await transcriptionProvider.startStreamingSession(
keyterms: buildTranscriptionKeyterms(),
languageCode: languageCode,
onTranscriptUpdate: { [weak self] transcriptText in
Task { @MainActor in
self?.latestRecognizedText = transcriptText
1 change: 1 addition & 0 deletions leanring-buddy/BuddyTranscriptionProvider.swift
@@ -23,6 +23,7 @@ protocol BuddyTranscriptionProvider {

func startStreamingSession(
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void