
Voice Notes: Talk to Your AI Agent
Stop typing and just talk. Enable voice in and voice out in minutes: your agent listens, understands, and can talk back. The moment you add voice, your agent stops feeling like a tool and starts feeling like a conversation.
Why Voice Changes Everything
Text creates friction: you stop, type, format, wait, read. Voice removes all of that. You speak naturally, and your agent responds in kind. The full round trip from your voice to an audio reply typically takes under 3 seconds.
| Use case | Details |
|---|---|
| ☀️ Morning briefings | Ask your agent to read your calendar and summarise messages while you're getting ready. Hands free, eyes free. |
| 🚗 On the move | Driving, walking, at the gym — voice works where typing is impossible or dangerous. |
| 💭 Think out loud | Speaking clarifies thinking in ways typing doesn't. Your agent becomes a sounding board that actually responds. |
| ⚡ It's just faster | Most people speak 3× faster than they type. Voice is the highest-bandwidth interface you already have. |
How the Round Trip Works
- 🎤 You send a voice note in Telegram — Hold mic button → speak → release
- 📝 OpenAI transcribes it to text — gpt-4o-mini-transcribe — accurate, fast, handles accents and multiple languages
- 🧠 Your AI model processes the request — Identical capability to typed messages — nothing lost
- 🔊 Optional: agent replies with a voice note — Sent as a round voice bubble in Telegram
Full round trip — voice in to voice reply — typically under 3 seconds
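The four steps above can be sketched as a small pipeline. This is only an illustration of the flow, not your agent's actual code: the function and class names are hypothetical, and the three stages are passed in as plain callables so the example runs without network access or API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    text: str
    audio: Optional[bytes] = None  # set only when a voice reply was synthesized


def handle_voice_note(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # speech-to-text stage
    respond: Callable[[str], str],        # the agent's model, same path as typed input
    synthesize: Optional[Callable[[str], bytes]] = None,  # optional text-to-speech stage
) -> Reply:
    """Run one voice note through the stages: transcribe, process, optionally speak."""
    transcript = transcribe(audio)
    answer = respond(transcript)
    voice = synthesize(answer) if synthesize is not None else None
    return Reply(text=answer, audio=voice)


# Stand-in stages (hypothetical) so the sketch is runnable offline.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply = handle_voice_note(
    b"what is on my calendar", fake_transcribe, fake_respond, fake_synthesize
)
print(reply.text)  # prints: You said: what is on my calendar
```

Leaving `synthesize` out models the text-only reply: the transcript still goes through the same model, and only the final audio step is skipped.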
Choose Your Setup
Both options use OpenAI for voice transcription. The difference is the voice reply quality.
Option A — OpenAI (1 key)
Full two-way voice with a single OpenAI key. Great quality, simple setup.
- STT: gpt-4o-mini-transcribe
- TTS: nova (natural, clear voice)
Option B — ElevenLabs (2 keys)
OpenAI for transcription + ElevenLabs for ultra-realistic voice replies. Custom voice personalities.
- STT: gpt-4o-mini-transcribe (OpenAI key)
- TTS: eleven_multilingual_v2 (ElevenLabs key)
💡 ElevenLabs lets you create a custom voice personality — your agent gets a consistent voice that feels genuinely its own. Changes the whole interaction model.
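To compare the two options side by side, here is a minimal sketch that records what each one uses and checks which keys are still missing. The `VoiceSetup` class and its field names are made up for this illustration and are not the openclaw.json schema. Only the model names, the voice name, and the environment variable names come from this guide.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class VoiceSetup:
    # Hypothetical field names, for illustration only.
    stt_model: str
    tts_provider: str
    tts_voice_or_model: str
    required_env: Tuple[str, ...]


OPTION_A = VoiceSetup(
    stt_model="gpt-4o-mini-transcribe",
    tts_provider="openai",
    tts_voice_or_model="nova",
    required_env=("OPENAI_API_KEY",),
)

OPTION_B = VoiceSetup(
    stt_model="gpt-4o-mini-transcribe",
    tts_provider="elevenlabs",
    tts_voice_or_model="eleven_multilingual_v2",
    required_env=("OPENAI_API_KEY", "ELEVENLABS_API_KEY"),
)


def missing_keys(setup: VoiceSetup, env: Mapping[str, str]) -> list:
    """Return the names of required keys that are absent or empty in env."""
    return [name for name in setup.required_env if not env.get(name)]


have = {"OPENAI_API_KEY": "sk-proj-example"}  # dummy value
print(missing_keys(OPTION_A, have))  # prints: []
print(missing_keys(OPTION_B, have))  # prints: ['ELEVENLABS_API_KEY']
```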
Get Your API Key(s)
OpenAI — required for both options
- Go to platform.openai.com/api-keys
- Click Create new secret key
- Copy it (it starts with sk-proj-...)
ElevenLabs — Option B only
- Go to elevenlabs.io → sign up (free tier available)
- Go to Profile → API Keys → copy your key
- Optionally browse elevenlabs.io/voice-library and note the Voice ID you want
Paste This Prompt in Telegram
Open your chat with your agent, copy the prompt for your option, fill in your key(s), and send it. Your agent configures itself and restarts.
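🔒 Your keys go directly to your private agent, which runs on your private VPS. Telegram encrypts the chat between your device and its servers, but chats with bots are not end-to-end encrypted, so delete the message containing your keys once setup is confirmed.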
Option A — OpenAI
Enable voice notes on yourself. My OpenAI API key is: [YOUR_OPENAI_KEY]
Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
3. Enable TTS voice replies using openai nova voice, auto mode "inbound" (only reply with voice when I send a voice note)
4. Restart docker compose
5. Confirm when done and send me a voice reply to test
Option B — ElevenLabs
Enable voice notes on yourself with premium ElevenLabs voice replies.
My OpenAI API key is: [YOUR_OPENAI_KEY]
My ElevenLabs API key is: [YOUR_ELEVENLABS_KEY]
Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Add ELEVENLABS_API_KEY=[YOUR_ELEVENLABS_KEY] to /root/.openclaw/.env
3. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
4. Enable TTS voice replies using ElevenLabs eleven_multilingual_v2, auto mode "inbound"
5. Restart docker compose
6. Confirm when done and send me a voice reply to test
Replace the key placeholders before sending. Ask your agent to list available ElevenLabs voices if you want to pick a specific one.
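If you would rather fill in the prompt with a script than by hand, a sketch like the following does the substitution and refuses to return a prompt that still contains a placeholder. The helper name `fill_prompt` is hypothetical, the template is shortened to two lines of the Option A prompt, and the key is a dummy value.

```python
import re

# Matches placeholders such as [YOUR_OPENAI_KEY] or [YOUR_ELEVENLABS_KEY].
PLACEHOLDER = re.compile(r"\[YOUR_[A-Z_]+\]")

TEMPLATE = (
    "Enable voice notes on yourself. My OpenAI API key is: [YOUR_OPENAI_KEY]\n"
    "1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env"
)


def fill_prompt(template: str, keys: dict) -> str:
    """Replace each [NAME] placeholder with its value; fail if any remain."""
    filled = template
    for name, value in keys.items():
        filled = filled.replace(f"[{name}]", value)
    leftover = PLACEHOLDER.findall(filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    return filled


prompt = fill_prompt(TEMPLATE, {"YOUR_OPENAI_KEY": "sk-proj-example"})
print(prompt)
```

The leftover check catches the most common slip, which is sending the Option B prompt with only one of the two keys filled in.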
What Your Agent Does
- Adds any new API key(s) to its secure environment file
- Updates its config to enable voice transcription (OpenAI gpt-4o-mini-transcribe)
- Configures TTS voice replies in inbound mode (only speaks back when you send voice) — if requested
- Restarts to apply all changes
- Sends you a voice reply to confirm everything works
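The first step, adding a key to the environment file, can be sketched as below. This is an assumption about how such an update might be done safely, not the agent's actual implementation: `upsert_env_var` is a hypothetical helper that handles only simple `NAME=value` lines, and the demo writes to a temporary directory rather than /root/.openclaw/.env.

```python
import tempfile
from pathlib import Path


def upsert_env_var(env_path: Path, name: str, value: str) -> None:
    """Set NAME=value in a .env file, replacing any existing line for NAME."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    kept = [line for line in lines if not line.startswith(f"{name}=")]
    kept.append(f"{name}={value}")
    env_path.write_text("\n".join(kept) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("OTHER=1\nOPENAI_API_KEY=old\n")
    upsert_env_var(env_file, "OPENAI_API_KEY", "sk-proj-example")  # dummy value
    # Running it a second time leaves the file unchanged.
    upsert_env_var(env_file, "OPENAI_API_KEY", "sk-proj-example")
    result = env_file.read_text()

print(result)
```

Replacing the old line instead of appending blindly is what makes pasting the setup prompt a second time harmless: the file ends up with one entry per key.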
After Setup
Send a voice note — Hold the mic icon in Telegram, speak, release. Your agent transcribes and replies — with voice if you set up TTS.
Mix voice and text freely — Send a voice note then type a follow-up — your agent handles both in the same conversation with no mode switching.
Always reply with voice — By default voice replies only trigger when you send voice. Tell your agent "always reply with voice" to make it permanent.
Change the voice — Ask your agent to list available ElevenLabs voices and update to one you prefer. It updates its own config.
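The difference between the default behaviour and "always reply with voice" comes down to one decision per message. A minimal sketch, assuming the mode is stored as a string: "inbound" is the mode named in the setup prompts, while "always" and "off" are hypothetical names chosen here for illustration.

```python
def should_reply_with_voice(auto_mode: str, inbound_is_voice: bool) -> bool:
    """Decide whether the next reply should be sent as a voice note."""
    if auto_mode == "always":   # hypothetical name for the permanent voice mode
        return True
    if auto_mode == "inbound":  # mode used in the setup prompts
        return inbound_is_voice
    return False                # any other value, e.g. "off": text only


for mode in ("inbound", "always", "off"):
    for is_voice in (True, False):
        print(mode, is_voice, should_reply_with_voice(mode, is_voice))
```

In "inbound" mode, mixing voice and text works message by message: a typed follow-up gets a text reply, and the next voice note gets a voice reply again.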
Common Questions
Does transcription work in other languages? Yes. gpt-4o-mini-transcribe handles most major languages automatically, with no extra configuration needed.
What does transcription cost? OpenAI gpt-4o-mini-transcribe costs $0.003/minute of audio. 100 voice notes a day at 30 seconds each is 50 minutes of audio a day, which comes to about $0.15/day, or roughly $4.50/month.
What does TTS cost? OpenAI TTS is ~$0.015 per 1,000 characters — a typical voice reply costs fractions of a cent. ElevenLabs has a free tier (10,000 chars/month), then usage-based pricing.
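To estimate costs for your own usage, the arithmetic can be worked through with the per-unit prices quoted above. The 30-day month and the 300-character reply length are assumptions for illustration, and the function names are hypothetical. Check current provider pricing before relying on the numbers.

```python
def monthly_transcription_cost(
    notes_per_day: int,
    seconds_per_note: float,
    usd_per_minute: float = 0.003,  # gpt-4o-mini-transcribe price quoted above
    days: int = 30,                 # assumed month length
) -> float:
    """Monthly speech-to-text cost in USD."""
    minutes_per_day = notes_per_day * seconds_per_note / 60
    return minutes_per_day * usd_per_minute * days


def tts_reply_cost(characters: int, usd_per_1000_chars: float = 0.015) -> float:
    """Cost in USD of synthesizing one reply of the given length."""
    return characters / 1000 * usd_per_1000_chars


stt_month = monthly_transcription_cost(notes_per_day=100, seconds_per_note=30)
one_reply = tts_reply_cost(characters=300)  # 300 characters is an assumed reply length

print(f"Transcription: ${stt_month:.2f}/month")  # prints: Transcription: $4.50/month
print(f"One voice reply: ${one_reply:.4f}")      # prints: One voice reply: $0.0045
```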
What if something doesn't work after setup?
Ask your agent to check its voice config with /tts status or /status. You can also paste the setup prompt again — the agent will re-run the configuration steps.
✅ Pro Tip: Fully Hands-Free Mode
Voice in + voice out = completely hands-free AI on Telegram. Perfect for driving, cooking, the gym — or whenever you think faster than you type. Tell your agent "always reply with voice" to make it permanent.