6 changes: 3 additions & 3 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th
- **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay
- **Pattern**: MVVM with `@StateObject` / `@Published` state management
- **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming
- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks
- **Speech-to-Text**: AssemblyAI real-time streaming via websocket, using `u3-rt-pro` for English and `whisper-rt` for Chinese, with OpenAI and Apple Speech as fallbacks
- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy
- **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support
- **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap.
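The English/Chinese model switch described in the speech-to-text bullet above can be sketched as follows. This is an illustrative TypeScript sketch of the selection rule only; the function name and shape are assumptions, not the app's actual Swift code.

```typescript
// Selection rule: English (or no explicit language) uses the universal
// realtime model; any other language falls back to Whisper realtime.
// Function name and signature are illustrative, not from the codebase.
function pickSpeechModel(languageCode?: string): string {
  const normalized = languageCode?.trim().toLowerCase();
  if (!normalized || normalized === "en") {
    return "u3-rt-pro";
  }
  return "whisper-rt";
}
```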
@@ -34,7 +34,7 @@ The app never calls external APIs directly. All requests go through a Cloudflare
| `POST /transcribe-token` | `streaming.assemblyai.com/v3/token` | Fetches a short-lived (480s) AssemblyAI websocket token |

Worker secrets: `ANTHROPIC_API_KEY`, `ASSEMBLYAI_API_KEY`, `ELEVENLABS_API_KEY`
Worker vars: `ELEVENLABS_VOICE_ID`
Worker vars: `ELEVENLABS_VOICE_ID`, `ELEVENLABS_CHINESE_VOICE_ID` (optional)

### Key Architecture Decisions

@@ -61,7 +61,7 @@ Worker vars: `ELEVENLABS_VOICE_ID`
| `CompanionScreenCaptureUtility.swift` | ~132 | Multi-monitor screenshot capture using ScreenCaptureKit. Returns labeled image data for each connected display. |
| `BuddyDictationManager.swift` | ~866 | Push-to-talk voice pipeline. Handles microphone capture via `AVAudioEngine`, provider-aware permission checks, keyboard/button dictation sessions, transcript finalization, shortcut parsing, contextual keyterms, and live audio-level reporting for waveform feedback. |
| `BuddyTranscriptionProvider.swift` | ~100 | Protocol surface and provider factory for voice transcription backends. Resolves provider based on `VoiceTranscriptionProvider` in Info.plist — AssemblyAI, OpenAI, or Apple Speech. |
| `AssemblyAIStreamingTranscriptionProvider.swift` | ~478 | Streaming transcription provider. Fetches temp tokens from the Cloudflare Worker, opens an AssemblyAI v3 websocket, streams PCM16 audio, tracks turn-based transcripts, and delivers finalized text on key-up. Shares a single URLSession across all sessions. |
| `AssemblyAIStreamingTranscriptionProvider.swift` | ~541 | Streaming transcription provider. Fetches temp tokens from the Cloudflare Worker, opens an AssemblyAI v3 websocket, streams PCM16 audio, switches between `u3-rt-pro` for English and `whisper-rt` for Chinese, tracks turn-based transcripts, and delivers finalized text on key-up. Shares a single URLSession across all sessions. |
| `OpenAIAudioTranscriptionProvider.swift` | ~317 | Upload-based transcription provider. Buffers push-to-talk audio locally, uploads as WAV on release, returns finalized transcript. |
| `AppleSpeechTranscriptionProvider.swift` | ~147 | Local fallback transcription provider backed by Apple's Speech framework. |
| `BuddyAudioConversionSupport.swift` | ~108 | Audio conversion helpers. Converts live mic buffers to PCM16 mono audio and builds WAV payloads for upload-based providers. |
8 changes: 6 additions & 2 deletions README.md
@@ -56,13 +56,16 @@ npx wrangler secret put ASSEMBLYAI_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY
```

For the ElevenLabs voice ID, open `wrangler.toml` and set it there (it's not sensitive):
For the ElevenLabs voice IDs, open `wrangler.toml` and set them there (they're not sensitive):

```toml
[vars]
ELEVENLABS_VOICE_ID = "your-voice-id-here"
ELEVENLABS_CHINESE_VOICE_ID = "optional-chinese-voice-id"
```

`ELEVENLABS_VOICE_ID` remains the default voice. `ELEVENLABS_CHINESE_VOICE_ID` is optional and is used only when the app is in Chinese voice mode; if you leave it blank, the default voice is reused.
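That fallback behavior might look like the following Worker-side sketch. Only the env var names come from the `wrangler.toml` snippet; the `voiceMode` field and the function itself are hypothetical assumptions, not the Worker's actual code.

```typescript
// Hypothetical Worker-side voice selection. Env var names match the
// wrangler.toml [vars] section; `voiceMode` and the function shape are
// illustrative assumptions.
interface VoiceEnv {
  ELEVENLABS_VOICE_ID: string;
  ELEVENLABS_CHINESE_VOICE_ID?: string;
}

function resolveVoiceId(env: VoiceEnv, voiceMode: string): string {
  // A blank or missing Chinese voice ID falls back to the default voice.
  if (voiceMode === "chinese" && env.ELEVENLABS_CHINESE_VOICE_ID) {
    return env.ELEVENLABS_CHINESE_VOICE_ID;
  }
  return env.ELEVENLABS_VOICE_ID;
}
```

Because an empty string is falsy in JavaScript, leaving the Chinese voice ID blank naturally selects the default voice.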

Deploy it:

```bash
@@ -87,6 +90,7 @@ ANTHROPIC_API_KEY=sk-ant-...
ASSEMBLYAI_API_KEY=...
ELEVENLABS_API_KEY=...
ELEVENLABS_VOICE_ID=...
ELEVENLABS_CHINESE_VOICE_ID=...
```

Then update the proxy URLs in the Swift code to point to `http://localhost:8787` instead of the deployed Worker URL while developing. Grep for `clicky-proxy` to find them all.
@@ -127,7 +131,7 @@ The app will appear in your menu bar (not the dock). Click the icon to open the

If you want the full technical breakdown, read `CLAUDE.md`. But here's the short version:

**Menu bar app** (no dock icon) with two `NSPanel` windows — one for the control panel dropdown, one for the full-screen transparent cursor overlay. Push-to-talk streams audio over a websocket to AssemblyAI, sends the transcript + screenshot to Claude via streaming SSE, and plays the response through ElevenLabs TTS. Claude can embed `[POINT:x,y:label:screenN]` tags in its responses to make the cursor fly to specific UI elements across multiple monitors. All three APIs are proxied through a Cloudflare Worker.
**Menu bar app** (no dock icon) with two `NSPanel` windows — one for the control panel dropdown, one for the full-screen transparent cursor overlay. Push-to-talk streams audio over a websocket to AssemblyAI, sends the transcript + screenshot to Claude via streaming SSE, and plays the response through ElevenLabs TTS. English uses AssemblyAI's `u3-rt-pro` model, while Chinese mode switches to `whisper-rt` for reliable transcription. Claude can embed `[POINT:x,y:label:screenN]` tags in its responses to make the cursor fly to specific UI elements across multiple monitors. All three APIs are proxied through a Cloudflare Worker.
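The `[POINT:x,y:label:screenN]` format lends itself to simple regex extraction. A hypothetical parser is sketched below in TypeScript for brevity; the tag format comes from the description above, but the parsing code itself is illustrative, not the app's Swift implementation.

```typescript
// Hypothetical parser for the [POINT:x,y:label:screenN] tag format.
// The format is documented; the code is an illustrative sketch.
interface PointTag {
  x: number;
  y: number;
  label: string;
  screen: number;
}

function parsePointTags(responseText: string): PointTag[] {
  const pattern = /\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]/g;
  const tags: PointTag[] = [];
  for (const match of responseText.matchAll(pattern)) {
    tags.push({
      x: Number(match[1]),
      y: Number(match[2]),
      label: match[3],
      screen: Number(match[4]),
    });
  }
  return tags;
}
```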

## Project structure

11 changes: 11 additions & 0 deletions install-clicky.sh
@@ -0,0 +1,11 @@
#!/bin/bash
APP_PATH="$HOME/Library/Developer/Xcode/DerivedData/leanring-buddy-dvwepfgqqgvpjhbjcbybcytzaawt/Index.noindex/Build/Products/Debug/Clicky.app"

if [ -d "$APP_PATH" ]; then
rm -rf "/Applications/Clicky.app" 2>/dev/null
cp -R "$APP_PATH" /Applications/
echo "✅ Clicky installed to /Applications!"
echo "Now open Clicky from Applications folder and grant permissions."
else
echo "❌ Clicky.app not found. Please build in Xcode first (⌘R)"
fi
25 changes: 19 additions & 6 deletions leanring-buddy.xcodeproj/project.pbxproj
@@ -34,9 +34,22 @@
28F22CD62F56440300A0FC59 /* leanring-buddyUITests.xctest */ = {isa = PBXFileReference; explicitFileType = wrapper.cfbundle; includeInIndex = 0; path = "leanring-buddyUITests.xctest"; sourceTree = BUILT_PRODUCTS_DIR; };
/* End PBXFileReference section */

/* Begin PBXFileSystemSynchronizedBuildFileExceptionSet section */
AA11CC112F7000010039DA55 /* Exceptions for "leanring-buddy" folder in "leanring-buddy" target */ = {
isa = PBXFileSystemSynchronizedBuildFileExceptionSet;
membershipExceptions = (
Info.plist,
);
target = 28F22CBE2F56440300A0FC59 /* leanring-buddy */;
};
/* End PBXFileSystemSynchronizedBuildFileExceptionSet section */

/* Begin PBXFileSystemSynchronizedRootGroup section */
28F22CC12F56440300A0FC59 /* leanring-buddy */ = {
isa = PBXFileSystemSynchronizedRootGroup;
exceptions = (
AA11CC112F7000010039DA55 /* Exceptions for "leanring-buddy" folder in "leanring-buddy" target */,
);
path = "leanring-buddy";
sourceTree = "<group>";
};
@@ -411,7 +424,7 @@
CODE_SIGN_STYLE = Automatic;
COMBINE_HIDPI_IMAGES = YES;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 2UDAY4J48G;
DEVELOPMENT_TEAM = ZZY6K862N2;
ENABLE_APP_SANDBOX = NO;
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_OUTGOING_NETWORK_CONNECTIONS = YES;
@@ -449,7 +462,7 @@
CODE_SIGN_STYLE = Automatic;
COMBINE_HIDPI_IMAGES = YES;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 2UDAY4J48G;
DEVELOPMENT_TEAM = ZZY6K862N2;
ENABLE_APP_SANDBOX = NO;
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_OUTGOING_NETWORK_CONNECTIONS = YES;
@@ -484,7 +497,7 @@
BUNDLE_LOADER = "$(TEST_HOST)";
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MACOSX_DEPLOYMENT_TARGET = 14.2;
MARKETING_VERSION = 1.0;
@@ -505,7 +518,7 @@
BUNDLE_LOADER = "$(TEST_HOST)";
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MACOSX_DEPLOYMENT_TARGET = 14.2;
MARKETING_VERSION = 1.0;
@@ -525,7 +538,7 @@
buildSettings = {
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MARKETING_VERSION = 1.0;
PRODUCT_BUNDLE_IDENTIFIER = "com.yourcompany.leanring-buddyUITests";
@@ -544,7 +557,7 @@
buildSettings = {
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = ZZY6K862N2;
GENERATE_INFOPLIST_FILE = YES;
MARKETING_VERSION = 1.0;
PRODUCT_BUNDLE_IDENTIFIER = "com.yourcompany.leanring-buddyUITests";
@@ -0,0 +1,14 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>SchemeUserState</key>
<dict>
<key>leanring-buddy.xcscheme_^#shared#^_</key>
<dict>
<key>orderHint</key>
<integer>0</integer>
</dict>
</dict>
</dict>
</plist>
24 changes: 17 additions & 7 deletions leanring-buddy/AppleSpeechTranscriptionProvider.swift
@@ -25,11 +25,12 @@ final class AppleSpeechTranscriptionProvider: BuddyTranscriptionProvider {

func startStreamingSession(
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void
) async throws -> any BuddyStreamingTranscriptionSession {
guard let speechRecognizer = Self.makeBestAvailableSpeechRecognizer() else {
guard let speechRecognizer = Self.makeBestAvailableSpeechRecognizer(languageCode: languageCode) else {
throw AppleSpeechTranscriptionProviderError(message: "dictation is not available on this mac.")
}

@@ -41,14 +41,23 @@
)
}

private static func makeBestAvailableSpeechRecognizer() -> SFSpeechRecognizer? {
let preferredLocales = [
Locale.autoupdatingCurrent,
Locale(identifier: "en-US")
]
private static func makeBestAvailableSpeechRecognizer(languageCode: String?) -> SFSpeechRecognizer? {
var preferredLocales: [Locale] = []

if let languageCode {
preferredLocales.append(Locale(identifier: languageCode))
}

preferredLocales.append(Locale.autoupdatingCurrent)
preferredLocales.append(Locale(identifier: "en-US"))

if languageCode != "zh-CN" {
preferredLocales.append(Locale(identifier: "zh-CN"))
}

for preferredLocale in preferredLocales {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale) {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale),
speechRecognizer.isAvailable {
return speechRecognizer
}
}
79 changes: 74 additions & 5 deletions leanring-buddy/AssemblyAIStreamingTranscriptionProvider.swift
@@ -19,7 +19,7 @@ struct AssemblyAIStreamingTranscriptionProviderError: LocalizedError {
final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider {
/// URL for the Cloudflare Worker endpoint that returns a short-lived
/// AssemblyAI streaming token. The real API key never leaves the server.
private static let tokenProxyURL = "https://your-worker-name.your-subdomain.workers.dev/transcribe-token"
private static let tokenProxyURL = "https://clicky-proxy.clicky-mark.workers.dev/transcribe-token"

let displayName = "AssemblyAI"
let requiresSpeechRecognitionPermission = false
@@ -35,11 +35,11 @@ final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider {

func startStreamingSession(
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void
) async throws -> any BuddyStreamingTranscriptionSession {
// Fetch a fresh temporary token from the proxy before each session
let temporaryToken = try await fetchTemporaryToken()
print("🎙️ AssemblyAI: fetched temporary token (\(temporaryToken.prefix(20))...)")

Expand All @@ -48,6 +48,7 @@ final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider
temporaryToken: temporaryToken,
urlSession: sharedWebSocketURLSession,
keyterms: keyterms,
languageCode: languageCode,
onTranscriptUpdate: onTranscriptUpdate,
onFinalTranscriptReady: onFinalTranscriptReady,
onError: onError
@@ -85,6 +86,53 @@ final class AssemblyAIStreamingTranscriptionProvider: BuddyTranscriptionProvider {
}

private final class AssemblyAIStreamingTranscriptionSession: NSObject, BuddyStreamingTranscriptionSession {
private enum StreamingSpeechModelConfiguration {
case universalRealtimePro
case whisperRealtime

init(languageCode: String?) {
let normalizedLanguageCode = languageCode?
.trimmingCharacters(in: .whitespacesAndNewlines)
.lowercased()

if let normalizedLanguageCode,
!normalizedLanguageCode.isEmpty,
normalizedLanguageCode != "en" {
self = .whisperRealtime
return
}

self = .universalRealtimePro
}

var modelIdentifier: String {
switch self {
case .universalRealtimePro:
return "u3-rt-pro"
case .whisperRealtime:
return "whisper-rt"
}
}

var supportsExplicitLanguageCode: Bool {
switch self {
case .universalRealtimePro:
return true
case .whisperRealtime:
return false
}
}

var shouldEnableLanguageDetection: Bool {
switch self {
case .universalRealtimePro:
return false
case .whisperRealtime:
return true
}
}
}

private struct MessageEnvelope: Decodable {
let type: String
}
@@ -117,6 +165,7 @@ private final class AssemblyAIStreamingTranscriptionSession: NSObject, BuddyStreamingTranscriptionSession {
private let apiKey: String?
private let temporaryToken: String?
private let keyterms: [String]
private let languageCode: String?
private let onTranscriptUpdate: (String) -> Void
private let onFinalTranscriptReady: (String) -> Void
private let onError: (Error) -> Void
@@ -142,6 +191,7 @@
temporaryToken: String?,
urlSession: URLSession,
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void
@@ -150,6 +200,7 @@
self.temporaryToken = temporaryToken
self.urlSession = urlSession
self.keyterms = keyterms
self.languageCode = languageCode
self.onTranscriptUpdate = onTranscriptUpdate
self.onFinalTranscriptReady = onFinalTranscriptReady
self.onError = onError
@@ -158,7 +209,8 @@
func open() async throws {
let websocketURL = try Self.makeWebsocketURL(
temporaryToken: temporaryToken,
keyterms: keyterms
keyterms: keyterms,
languageCode: languageCode
)

var websocketRequest = URLRequest(url: websocketURL)
@@ -436,21 +488,38 @@

private static func makeWebsocketURL(
temporaryToken: String?,
keyterms: [String]
keyterms: [String],
languageCode: String?
) throws -> URL {
guard var websocketURLComponents = URLComponents(string: websocketBaseURLString) else {
throw AssemblyAIStreamingTranscriptionProviderError(
message: "AssemblyAI websocket URL is invalid."
)
}

let streamingSpeechModelConfiguration = StreamingSpeechModelConfiguration(
languageCode: languageCode
)

var queryItems = [
URLQueryItem(name: "sample_rate", value: "16000"),
URLQueryItem(name: "encoding", value: "pcm_s16le"),
URLQueryItem(name: "format_turns", value: "true"),
URLQueryItem(name: "speech_model", value: "u3-rt-pro")
URLQueryItem(
name: "speech_model",
value: streamingSpeechModelConfiguration.modelIdentifier
)
]

if streamingSpeechModelConfiguration.shouldEnableLanguageDetection {
queryItems.append(URLQueryItem(name: "language_detection", value: "true"))
}

if streamingSpeechModelConfiguration.supportsExplicitLanguageCode,
let languageCode {
queryItems.append(URLQueryItem(name: "language_code", value: languageCode))
}

let normalizedKeyterms = keyterms
.map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
.filter { !$0.isEmpty }
2 changes: 2 additions & 0 deletions leanring-buddy/BuddyDictationManager.swift
@@ -265,6 +265,7 @@ final class BuddyDictationManager: NSObject, ObservableObject {
private let transcriptionProvider: any BuddyTranscriptionProvider
private let audioEngine = AVAudioEngine()
private var activeTranscriptionSession: (any BuddyStreamingTranscriptionSession)?
var languageCode: String?
private var activeStartSource: BuddyDictationStartSource?
private var draftCallbacks: BuddyDictationDraftCallbacks?
private var draftTextBeforeCurrentDictation = ""
@@ -519,6 +520,7 @@

let activeTranscriptionSession = try await transcriptionProvider.startStreamingSession(
keyterms: buildTranscriptionKeyterms(),
languageCode: languageCode,
onTranscriptUpdate: { [weak self] transcriptText in
Task { @MainActor in
self?.latestRecognizedText = transcriptText
1 change: 1 addition & 0 deletions leanring-buddy/BuddyTranscriptionProvider.swift
@@ -23,6 +23,7 @@ protocol BuddyTranscriptionProvider {

func startStreamingSession(
keyterms: [String],
languageCode: String?,
onTranscriptUpdate: @escaping (String) -> Void,
onFinalTranscriptReady: @escaping (String) -> Void,
onError: @escaping (Error) -> Void