Voice Notes: Talk to Your AI Agent

Stop typing — just talk. Enable voice in and voice out in minutes. Your agent listens, understands, and can talk back.

The moment you add voice, your agent stops feeling like a tool and starts feeling like a conversation.

Why Voice Changes Everything

Text creates friction: you stop, type, format, wait, read. Voice removes that. You speak naturally, and your agent responds in kind. The full round trip from your voice note to an audio reply is typically under 3 seconds.

Use case	Details
☀️ Morning briefings	Ask your agent to read your calendar and summarise messages while you're getting ready. Hands free, eyes free.
🚗 On the move	Driving, walking, at the gym — voice works where typing is impossible or dangerous.
💭 Think out loud	Speaking clarifies thinking in ways typing doesn't. Your agent becomes a sounding board that actually responds.
⚡ It's just faster	Most people speak 3× faster than they type. Voice is the highest-bandwidth interface you already have.

How the Round Trip Works

🎤 You send a voice note in Telegram — Hold mic button → speak → release
📝 OpenAI transcribes it to text — gpt-4o-mini-transcribe — accurate, fast, handles accents and multiple languages
🧠 Your AI model processes the request — Identical capability to typed messages — nothing lost
🔊 Optional: agent replies with a voice note — Sent as a round voice bubble in Telegram

Full round trip — voice in to voice reply — typically under 3 seconds
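
The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the agent's actual implementation: the names handle_voice_note, Reply, and the fake_* helpers are hypothetical, and the stand-in functions replace the real transcription, model, and text-to-speech calls so the sketch runs without network access or API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    text: str                      # the model's answer as text
    audio: Optional[bytes] = None  # synthesized speech, if voice replies are on


def handle_voice_note(
    audio: bytes,
    transcribe: Callable[[bytes], str],              # speech-to-text step
    think: Callable[[str], str],                     # the AI model step
    speak: Optional[Callable[[str], bytes]] = None,  # optional text-to-speech step
) -> Reply:
    """Run one voice round trip: audio in, text (and optionally audio) out."""
    transcript = transcribe(audio)  # voice note -> text
    answer = think(transcript)      # same path a typed message would take
    voice = speak(answer) if speak is not None else None  # voice reply is optional
    return Reply(text=answer, audio=voice)


# Hypothetical stand-ins for the real services (assumptions for illustration).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_think(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


reply = handle_voice_note(
    b"what is on my calendar", fake_transcribe, fake_think, fake_speak
)
print(reply.text)  # You said: what is on my calendar
```

Passing the three steps in as functions keeps the flow independent of any one provider, which mirrors how the transcription and voice reply services are chosen separately in the setup options.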

Choose Your Setup

Both options use OpenAI for voice transcription. The difference is the voice reply quality.

Option A — OpenAI (1 key)

Full two-way voice with a single OpenAI key. Great quality, simple setup.

STT: gpt-4o-mini-transcribe
TTS: nova — natural, clear voice

Option B — ElevenLabs (2 keys)

OpenAI for transcription + ElevenLabs for ultra-realistic voice replies. Custom voice personalities.

STT: gpt-4o-mini-transcribe (OpenAI key)
TTS: eleven_multilingual_v2 (ElevenLabs key)

💡 ElevenLabs lets you create a custom voice personality, so your agent gets a consistent voice that feels genuinely its own. That changes the whole feel of the interaction.

Get Your API Key(s)

OpenAI — required for both options

Go to platform.openai.com/api-keys
Click Create new secret key
Copy it — starts with sk-proj-...

ElevenLabs — Option B only

Go to elevenlabs.io → sign up (free tier available)
Go to Profile → API Keys → copy your key
Optionally browse elevenlabs.io/voice-library and note the Voice ID you want

Paste This Prompt in Telegram

Open your chat with your agent, copy the prompt for your option, fill in your key(s), and send it. Your agent configures itself and restarts.

🔒 Your keys travel through Telegram to your agent, which runs on your private VPS. Note that Telegram bot chats use client-server encryption, not end-to-end encryption (that applies only to Secret Chats, which bots cannot join).

Option A — OpenAI

Enable voice notes on yourself. My OpenAI API key is: [YOUR_OPENAI_KEY]

Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
3. Enable TTS voice replies using openai nova voice, auto mode "inbound" (only reply with voice when I send a voice note)
4. Restart docker compose
5. Confirm when done and send me a voice reply to test

Option B — ElevenLabs

Enable voice notes on yourself with premium ElevenLabs voice replies.
My OpenAI API key is: [YOUR_OPENAI_KEY]
My ElevenLabs API key is: [YOUR_ELEVENLABS_KEY]

Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Add ELEVENLABS_API_KEY=[YOUR_ELEVENLABS_KEY] to /root/.openclaw/.env
3. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
4. Enable TTS voice replies using ElevenLabs eleven_multilingual_v2, auto mode "inbound"
5. Restart docker compose
6. Confirm when done and send me a voice reply to test

Replace the key placeholders before sending. Ask your agent to list available ElevenLabs voices if you want to pick a specific one.
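
If you prepare the prompt on a computer, a small helper can catch a forgotten placeholder before you send it. The sketch below is an assumption for illustration only: fill_prompt and PROMPT_TEMPLATE are hypothetical names, the template is shortened to the first line of the Option A prompt, and the key value is a dummy.

```python
import re

# Shortened copy of the Option A prompt; the full prompt works the same way.
PROMPT_TEMPLATE = (
    "Enable voice notes on yourself. My OpenAI API key is: [YOUR_OPENAI_KEY]"
)


def fill_prompt(template: str, keys: dict[str, str]) -> str:
    """Substitute [PLACEHOLDER] markers and refuse to return a half-filled prompt."""
    filled = template
    for placeholder, value in keys.items():
        filled = filled.replace(f"[{placeholder}]", value)
    # Any remaining [YOUR_...] marker means a key was not supplied.
    leftover = re.findall(r"\[YOUR_[A-Z_]+\]", filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return filled


prompt = fill_prompt(PROMPT_TEMPLATE, {"YOUR_OPENAI_KEY": "sk-proj-dummy"})
print(prompt)
```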

What Your Agent Does

Adds any new API key(s) to its secure environment file
Updates its config to enable voice transcription (OpenAI gpt-4o-mini-transcribe)
If requested, configures TTS voice replies in inbound mode (only speaks back when you send voice)
Restarts to apply all changes
Sends you a voice reply to confirm everything works
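
The first step, adding a key to the environment file, amounts to an idempotent update: set the variable if it is missing, replace it if it already exists. The sketch below shows one way this could look. It is not the agent's actual code; upsert_env_var is a hypothetical helper, the key values are dummies, and the demo writes to a temporary directory rather than /root/.openclaw/.env.

```python
import tempfile
from pathlib import Path


def upsert_env_var(env_path: Path, key: str, value: str) -> None:
    """Set KEY=value in a .env file, replacing an existing entry if present."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith(f"{key}="):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    env_path.write_text("\n".join(lines) + "\n")
    env_path.chmod(0o600)  # restrict the file to owner read/write


# Demo in a throwaway directory: writing the same key twice keeps one entry.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    upsert_env_var(env_file, "OPENAI_API_KEY", "sk-proj-dummy")
    upsert_env_var(env_file, "OPENAI_API_KEY", "sk-proj-rotated")
    demo_contents = env_file.read_text()

print(demo_contents)
```

Because the update replaces rather than appends, pasting the setup prompt a second time would not leave duplicate entries behind under this approach.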

After Setup

Send a voice note — Hold the mic icon in Telegram, speak, release. Your agent transcribes and replies — with voice if you set up TTS.

Mix voice and text freely — Send a voice note then type a follow-up — your agent handles both in the same conversation with no mode switching.

Always reply with voice — By default voice replies only trigger when you send voice. Tell your agent "always reply with voice" to make it permanent.

Change the voice — Ask your agent to list available ElevenLabs voices and update to one you prefer. It updates its own config.
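
The difference between the default behaviour and "always reply with voice" comes down to one decision per message. The sketch below illustrates that decision. Only the mode name "inbound" comes from the setup prompt; "always" and "off" are assumed names for illustration, and should_reply_with_voice is a hypothetical function.

```python
def should_reply_with_voice(auto_mode: str, inbound_was_voice: bool) -> bool:
    """Decide whether a reply should be sent as a voice note."""
    if auto_mode == "always":   # assumed name: speak every reply
        return True
    if auto_mode == "inbound":  # from the setup prompt: mirror the user's input
        return inbound_was_voice
    if auto_mode == "off":      # assumed name: never speak
        return False
    raise ValueError(f"unknown auto mode: {auto_mode!r}")


# In inbound mode a typed message gets text back and a voice note gets voice back.
print(should_reply_with_voice("inbound", inbound_was_voice=False))  # False
print(should_reply_with_voice("inbound", inbound_was_voice=True))   # True
```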

Common Questions

Does transcription work in other languages? Yes. gpt-4o-mini-transcribe handles most major languages automatically, with no extra configuration needed.

What does transcription cost? OpenAI gpt-4o-mini-transcribe costs about $0.003 per minute of audio. 100 voice notes a day at 30 seconds each is 50 minutes of audio a day, which comes to about $4.50/month.

What does TTS cost? OpenAI TTS is ~$0.015 per 1,000 characters — a typical voice reply costs fractions of a cent. ElevenLabs has a free tier (10,000 chars/month), then usage-based pricing.
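
To estimate your own usage, the arithmetic behind both costs is easy to reproduce. The sketch below uses the rates quoted above. The 30-day month and the 300-character reply length are assumptions chosen for illustration, and the function names are hypothetical.

```python
def monthly_transcription_cost(
    notes_per_day: int,
    seconds_per_note: float,
    price_per_minute: float = 0.003,  # quoted gpt-4o-mini-transcribe rate
    days_per_month: int = 30,         # assumption: 30-day month
) -> float:
    """Monthly speech-to-text cost in dollars."""
    minutes_per_day = notes_per_day * seconds_per_note / 60
    return minutes_per_day * price_per_minute * days_per_month


def tts_reply_cost(characters: int, price_per_1k_chars: float = 0.015) -> float:
    """Cost in dollars of synthesizing one reply of the given length."""
    return characters / 1000 * price_per_1k_chars


stt_monthly = monthly_transcription_cost(notes_per_day=100, seconds_per_note=30)
reply_cost = tts_reply_cost(300)  # assumption: a 300-character reply
free_tier_replies = 10_000 // 300  # replies that fit in 10,000 characters

print(f"Transcription: ${stt_monthly:.2f} per month")            # $4.50
print(f"One 300-character reply: {reply_cost * 100:.2f} cents")  # 0.45 cents
print(f"Replies within 10,000 characters: {free_tier_replies}")  # 33
```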

What if something doesn't work after setup? Ask your agent to check its voice config with /tts status or /status. You can also paste the setup prompt again — the agent will re-run the configuration steps.

✅ Pro Tip: Fully Hands-Free Mode

Voice in + voice out = completely hands-free AI on Telegram. Perfect for driving, cooking, the gym — or whenever you think faster than you type. Tell your agent "always reply with voice" to make it permanent.