
Voice Notes: Talk to Your AI Agent
Stop typing. Just talk. The moment you add voice, your agent stops feeling like a tool and starts feeling like a conversation.
Why Voice Changes Everything
Text creates friction: you stop, type, format, wait, read. Voice removes all of that. You speak naturally and your agent responds in kind. The full round trip from your voice note to an audio reply typically takes under 3 seconds.
Morning briefings
Ask your agent to read your calendar and summarise messages while you're getting ready. Hands free, eyes free.
On the move
Driving, walking, at the gym — voice works where typing is impossible or dangerous.
Think out loud
Speaking clarifies thinking in ways typing doesn't. Your agent becomes a sounding board that actually responds.
It's just faster
Most people speak 3× faster than they type. Voice is the highest-bandwidth interface you already have.
How the Round Trip Works
You send a voice note in Telegram
Hold mic button → speak → release
OpenAI transcribes it to text
gpt-4o-mini-transcribe — accurate, fast, handles accents and multiple languages
Your AI model processes the request
Identical capability to typed messages — nothing lost
Optional: agent replies with a voice note
Sent as a round voice bubble in Telegram
Full round trip — voice in to voice reply — typically under 3 seconds
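The four steps above can be sketched as a small pipeline. This is an illustrative outline, not the agent's actual code: the function and type names (`handle_voice_note`, `Reply`, and the three stage callables) are hypothetical, and the stages are passed in as plain functions so the sketch runs without any API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures. A real agent would back these with a
# speech-to-text call, a chat model call, and a text-to-speech call.
Transcribe = Callable[[bytes], str]
Respond = Callable[[str], str]
Synthesize = Callable[[str], bytes]


@dataclass
class Reply:
    text: str                      # always present
    audio: Optional[bytes] = None  # set only when a voice reply is produced


def handle_voice_note(
    audio: bytes,
    transcribe: Transcribe,
    respond: Respond,
    synthesize: Optional[Synthesize] = None,
) -> Reply:
    """Run one voice note through the round trip described above."""
    transcript = transcribe(audio)   # step 2: speech to text
    answer = respond(transcript)     # step 3: same path as a typed message
    if synthesize is None:           # step 4 is optional
        return Reply(text=answer)
    return Reply(text=answer, audio=synthesize(answer))


if __name__ == "__main__":
    # Stub stages stand in for the real services.
    reply = handle_voice_note(
        b"fake-audio-bytes",
        transcribe=lambda _audio: "what is on my calendar today",
        respond=lambda text: f"You asked: {text}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(reply.text)
```

Because the transcript goes through the same `respond` step as typed text, nothing about the request handling changes when the input is voice.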
Choose Your Setup
Both options use OpenAI for voice transcription. The difference is the voice reply quality.
Option A: OpenAI only
Full two-way voice with a single OpenAI key. Great quality, simple setup.
- Transcription: gpt-4o-mini-transcribe
- Voice: nova (natural, clear voice)
Option B: OpenAI + ElevenLabs
OpenAI for transcription + ElevenLabs for ultra-realistic voice replies. Custom voice personalities.
- Transcription: gpt-4o-mini-transcribe (OpenAI key)
- Voice replies: eleven_multilingual_v2 (ElevenLabs key)
💡 ElevenLabs lets you create a custom voice personality: your agent gets a consistent voice that feels genuinely its own. Changes the whole interaction model.
Get Your API Key(s)
OpenAI — required for both options
- 1. Go to platform.openai.com/api-keys
- 2. Click Create new secret key
- 3. Copy it. It starts with sk-proj-...
ElevenLabs — Option B only
- 1. Go to elevenlabs.io → sign up (free tier available)
- 2. Go to Profile → API Keys → copy your key
- 3. Optionally browse elevenlabs.io/voice-library and note the Voice ID you want
Paste This Prompt in Telegram
Open your chat with your agent, copy the prompt for your option, fill in your key(s), and send it. Your agent configures itself and restarts.
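🔒 Your keys go to your private agent, which runs on your own VPS. Telegram bot chats use client-server encryption rather than end-to-end encryption, so treat the message as sensitive and delete it from the chat once setup is confirmed.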
Enable voice notes on yourself.
My OpenAI API key is: [YOUR_OPENAI_KEY]
Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
3. Enable TTS voice replies using openai nova voice, auto mode "inbound" (only reply with voice when I send a voice note)
4. Restart docker compose
5. Confirm when done and send me a voice reply to test
Replace [YOUR_OPENAI_KEY] before sending
Enable voice notes on yourself with premium ElevenLabs voice replies.
My OpenAI API key is: [YOUR_OPENAI_KEY]
My ElevenLabs API key is: [YOUR_ELEVENLABS_KEY]
Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Add ELEVENLABS_API_KEY=[YOUR_ELEVENLABS_KEY] to /root/.openclaw/.env
3. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
4. Enable TTS voice replies using ElevenLabs eleven_multilingual_v2, auto mode "inbound"
5. Restart docker compose
6. Confirm when done and send me a voice reply to test
Replace both key placeholders. Ask your agent to list available ElevenLabs voices if you want to pick a specific one.
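If you keep the prompt in a script or notes file, a small check can catch a placeholder you forgot to replace before you send it. This is a minimal sketch: `fill_prompt` is a hypothetical helper, and it assumes placeholders follow the `[YOUR_..._KEY]` pattern used in the prompts above.

```python
import re

# Matches unreplaced placeholders such as [YOUR_OPENAI_KEY].
PLACEHOLDER = re.compile(r"\[YOUR_[A-Z_]+\]")


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace [NAME] placeholders and fail if any are left unfilled."""
    filled = template
    for name, value in values.items():
        filled = filled.replace(f"[{name}]", value)
    leftover = PLACEHOLDER.findall(filled)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    return filled
```

Calling `fill_prompt(template, {"YOUR_OPENAI_KEY": "..."})` on the ElevenLabs prompt raises a `ValueError` naming `[YOUR_ELEVENLABS_KEY]`, which is the mistake the note above warns about.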
What Your Agent Does
Adds any new API key(s) to its secure environment file
Updates its config to enable voice transcription (OpenAI gpt-4o-mini-transcribe)
Configures TTS voice replies in inbound mode (only speaks back when you send voice) — if requested
Restarts to apply all changes
Sends you a voice reply to confirm everything works
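The first step, adding a key to the environment file, might look roughly like the sketch below. It is an assumption about how such an update could be done, not the agent's actual implementation: `upsert_env_var` is a hypothetical helper that treats the file as simple `KEY=value` lines and replaces an existing entry instead of appending a duplicate, which is why pasting the setup prompt a second time would be harmless.

```python
from pathlib import Path


def upsert_env_var(env_path: Path, key: str, value: str) -> None:
    """Set KEY=value in a dotenv-style file, replacing any existing entry."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    entry = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith(f"{key}="):
            lines[i] = entry      # overwrite the old value in place
            replaced = True
    if not replaced:
        lines.append(entry)       # first time this key is written
    env_path.write_text("\n".join(lines) + "\n")
```

For example, `upsert_env_var(Path("/root/.openclaw/.env"), "OPENAI_API_KEY", key)` would write or refresh the OpenAI entry while leaving other lines untouched.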
After Setup
Send a voice note
Hold the mic icon in Telegram, speak, release. Your agent transcribes and replies — with voice if you set up TTS.
Mix voice and text freely
Send a voice note then type a follow-up — your agent handles both in the same conversation with no mode switching.
Always reply with voice
By default voice replies only trigger when you send voice. Tell your agent "always reply with voice" to make it permanent.
Change the voice (Option B)
Ask your agent to list available ElevenLabs voices and update to one you prefer. It updates its own config.
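The reply behaviour described under "Always reply with voice" comes down to one small decision. The sketch below is illustrative only: "inbound" is the mode named in the setup prompts, while "always" and "off" are assumed mode names used here to show the idea, and `should_reply_with_voice` is a hypothetical function.

```python
def should_reply_with_voice(auto_mode: str, inbound_is_voice: bool) -> bool:
    """Decide whether a reply should be sent as a voice note."""
    if auto_mode == "inbound":
        # Default from the setup prompts: speak back only to voice notes.
        return inbound_is_voice
    if auto_mode == "always":
        # Assumed name for the "always reply with voice" setting.
        return True
    if auto_mode == "off":
        # Assumed name for text-only replies.
        return False
    raise ValueError(f"unknown auto mode: {auto_mode!r}")
```

In "inbound" mode a typed follow-up gets a text reply and a voice note gets a voice reply, which is what lets you mix the two freely in one conversation.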
Common Questions
Does transcription work in other languages?
Yes. gpt-4o-mini-transcribe handles most major languages automatically, with no extra configuration needed.
What does transcription cost?
OpenAI gpt-4o-mini-transcribe costs about $0.003 per minute of audio. 100 voice notes a day at 30 seconds each is 1,500 minutes a month, which comes to about $4.50/month.
What does TTS cost?
OpenAI TTS is ~$0.015 per 1,000 characters — a typical voice reply costs fractions of a cent. ElevenLabs has a free tier (10,000 chars/month), then usage-based pricing.
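To estimate costs for your own usage, the arithmetic can be checked with a short sketch. The two prices are the ones quoted in these answers and may change; the function names, the 30-day month, and the 200-character reply length are assumptions chosen for illustration.

```python
TRANSCRIBE_USD_PER_MINUTE = 0.003  # gpt-4o-mini-transcribe, as quoted above
TTS_USD_PER_1K_CHARS = 0.015       # OpenAI TTS, as quoted above


def monthly_transcription_cost(
    notes_per_day: int, seconds_per_note: float, days: int = 30
) -> float:
    """Cost in USD of transcribing a month of voice notes."""
    minutes = notes_per_day * seconds_per_note / 60 * days
    return minutes * TRANSCRIBE_USD_PER_MINUTE


def tts_reply_cost(characters: int) -> float:
    """Cost in USD of synthesizing one reply of the given length."""
    return characters / 1000 * TTS_USD_PER_1K_CHARS


if __name__ == "__main__":
    # 100 notes/day x 30 s = 1,500 minutes of audio over 30 days.
    print(f"${monthly_transcription_cost(100, 30):.2f} per month")  # $4.50 per month
    # An assumed 200-character reply.
    print(f"${tts_reply_cost(200):.4f} per reply")                  # $0.0030 per reply
```

Swap in your own note count and length to see how the monthly figure scales; it grows linearly with total minutes of audio.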
What if something doesn't work after setup?
Ask your agent to check its voice config with /tts status or /status. You can also paste the setup prompt again — the agent will re-run the configuration steps.
Pro Tip: Fully Hands-Free Mode
Voice in + voice out = completely hands-free AI on Telegram. Perfect for driving, cooking, the gym — or whenever you think faster than you type. Tell your agent "always reply with voice" to make it permanent.
Ready? Open Telegram and paste the prompt.
Your agent configures itself. You'll be talking to it within minutes.