
Voice Notes: Talk to Your AI Agent

Stop typing. Just talk. The moment you add voice, your agent stops feeling like a tool and starts feeling like a conversation.

Why Voice Changes Everything

Text creates friction: you stop, type, format, wait, read. Voice removes all of that. You speak naturally, and your agent responds in kind. The full round trip from your voice to an audio reply is typically under 3 seconds.

☀️

Morning briefings

Ask your agent to read your calendar and summarise messages while you're getting ready. Hands free, eyes free.

🚗

On the move

Driving, walking, at the gym — voice works where typing is impossible or dangerous.

💭

Think out loud

Speaking clarifies thinking in ways typing doesn't. Your agent becomes a sounding board that actually responds.

It's just faster

Most people speak 3× faster than they type. Voice is the highest-bandwidth interface you already have.

How the Round Trip Works

🎤

You send a voice note in Telegram

Hold mic button → speak → release

📝

OpenAI transcribes it to text

gpt-4o-mini-transcribe — accurate, fast, handles accents and multiple languages

🧠

Your AI model processes the request

Identical capability to typed messages — nothing lost

🔊

Optional: agent replies with a voice note

Sent as a round voice bubble in Telegram

Full round trip — voice in to voice reply — typically under 3 seconds
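To make the transcription step concrete, here is a minimal standard-library Python sketch of the request behind it. It assumes OpenAI's `/v1/audio/transcriptions` endpoint (a multipart upload with a `model` field and a `file` part) and that the voice note arrives from Telegram as OGG/Opus bytes. The function name is hypothetical, and this tutorial does not show how your agent actually makes the call.

```python
import urllib.request
import uuid

# Assumption: OpenAI's transcription endpoint, which takes a multipart/form-data
# upload with a "model" field and a "file" part.
TRANSCRIBE_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(api_key: str, audio: bytes,
                                filename: str = "voice.ogg",
                                model: str = "gpt-4o-mini-transcribe") -> urllib.request.Request:
    """Build (but do not send) a POST that uploads one voice note for transcription."""
    boundary = uuid.uuid4().hex  # random separator between multipart sections
    body = b"".join([
        # Form field naming the speech-to-text model.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # File part carrying the raw audio bytes (Telegram voice notes are OGG/Opus).
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: audio/ogg\r\n\r\n").encode() + audio + b"\r\n",
        # Closing boundary ends the multipart body.
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        TRANSCRIBE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` returns JSON whose `text` field holds the transcript. The sketch deliberately stops before the network call so the request can be inspected offline.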

Choose Your Setup

Both options use OpenAI for voice transcription. The difference is the voice reply quality.

Option A: OpenAI (1 key)

Full two-way voice with a single OpenAI key. Great quality, simple setup.

STT: gpt-4o-mini-transcribe
TTS: nova (natural, clear voice)

Option B: ElevenLabs (2 keys)

OpenAI for transcription + ElevenLabs for ultra-realistic voice replies. Custom voice personalities.

STT: gpt-4o-mini-transcribe (OpenAI key)
TTS: eleven_multilingual_v2 (ElevenLabs key)

💡 ElevenLabs lets you create a custom voice personality, so your agent speaks with a consistent voice that feels genuinely its own. It changes the whole feel of the interaction.

Get Your API Key(s)

OpenAI — required for both options

  1. Go to platform.openai.com/api-keys
  2. Click Create new secret key
  3. Copy it (it starts with sk-proj-...)

ElevenLabs — Option B only

  1. Go to elevenlabs.io → sign up (free tier available)
  2. Go to Profile → API Keys → copy your key
  3. Optionally, browse elevenlabs.io/voice-library and note the Voice ID you want

Paste This Prompt in Telegram

Open your chat with your agent, copy the prompt for your option, fill in your key(s), and send it. Your agent configures itself and restarts.

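🔒 Your keys are sent to your own agent, which runs on your private VPS. Note that Telegram chats with bots are encrypted between your app and Telegram's servers but are not end-to-end encrypted, so consider deleting the message containing your key once setup is confirmed.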

OPTION A — OPENAI
Enable voice notes on yourself. My OpenAI API key is: [YOUR_OPENAI_KEY]

Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
3. Enable TTS voice replies using openai nova voice, auto mode "inbound" (only reply with voice when I send a voice note)
4. Restart docker compose
5. Confirm when done and send me a voice reply to test

Replace [YOUR_OPENAI_KEY] before sending
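Step 3 asks for replies in the nova voice. Below is a minimal sketch of the kind of request that produces such a reply, assuming OpenAI's `/v1/audio/speech` endpoint. The prompt names only the voice, so the `tts-1` model and `opus` output format are illustrative assumptions, chosen on the premise that Opus audio can be posted to Telegram as a voice note. The function name is hypothetical.

```python
import json
import urllib.request

# Assumption: OpenAI's text-to-speech endpoint. The prompt only names the "nova"
# voice; the "tts-1" model and "opus" format below are illustrative defaults.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(api_key: str, text: str, voice: str = "nova",
                         model: str = "tts-1",
                         response_format: str = "opus") -> urllib.request.Request:
    """Build (but do not send) a POST whose response body is the spoken reply."""
    payload = {
        "model": model,
        "voice": voice,
        "input": text,                       # the agent's text reply
        "response_format": response_format,  # Opus audio suits Telegram voice notes
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

The response body is the audio itself rather than JSON, so the agent would write those bytes out and send them to Telegram as a voice note.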

OPTION B — ELEVENLABS
Enable voice notes on yourself with premium ElevenLabs voice replies.
My OpenAI API key is: [YOUR_OPENAI_KEY]
My ElevenLabs API key is: [YOUR_ELEVENLABS_KEY]

Please:
1. Add OPENAI_API_KEY=[YOUR_OPENAI_KEY] to /root/.openclaw/.env
2. Add ELEVENLABS_API_KEY=[YOUR_ELEVENLABS_KEY] to /root/.openclaw/.env
3. Update /root/.openclaw/openclaw.json to enable voice transcription using openai gpt-4o-mini-transcribe
4. Enable TTS voice replies using ElevenLabs eleven_multilingual_v2, auto mode "inbound"
5. Restart docker compose
6. Confirm when done and send me a voice reply to test

Replace both key placeholders. Ask your agent to list available ElevenLabs voices if you want to pick a specific one.
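For Option B, the reply audio comes from ElevenLabs instead. This sketch builds a text-to-speech request and a voice-list request, assuming ElevenLabs' v1 REST API (`/v1/text-to-speech/{voice_id}` and `/v1/voices`, authenticated with an `xi-api-key` header). The helper names are hypothetical. `voice_choices` turns the voice-list JSON into a name-to-ID map, which is what you need when asking your agent to switch to a specific voice.

```python
import json
import urllib.request

# Assumption: ElevenLabs v1 REST API, authenticated with an "xi-api-key" header.
ELEVEN_BASE = "https://api.elevenlabs.io/v1"


def build_eleven_tts_request(api_key: str, voice_id: str, text: str,
                             model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build (but do not send) a POST that turns the agent's reply into speech."""
    return urllib.request.Request(
        f"{ELEVEN_BASE}/text-to-speech/{voice_id}",
        data=json.dumps({"text": text, "model_id": model_id}).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def build_list_voices_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET for the voices available to this account."""
    return urllib.request.Request(f"{ELEVEN_BASE}/voices",
                                  headers={"xi-api-key": api_key}, method="GET")


def voice_choices(voices_json: str) -> dict:
    """Map each voice name to its voice_id, given a /v1/voices response body."""
    return {v["name"]: v["voice_id"] for v in json.loads(voices_json).get("voices", [])}
```

Given a response listing a voice named "Aria", `voice_choices(body)["Aria"]` yields the Voice ID to pass to `build_eleven_tts_request`.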

What Your Agent Does

Adds any new API key(s) to its secure environment file

Updates its config to enable voice transcription (OpenAI gpt-4o-mini-transcribe)

Configures TTS voice replies in inbound mode (only speaks back when you send voice) — if requested

Restarts to apply all changes

Sends you a voice reply to confirm everything works
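The first step, adding keys to the environment file, should be safe to repeat. Below is a minimal sketch of an idempotent update, assuming a plain `KEY=value` .env format. The function names are hypothetical and this is not the agent's actual code. Re-running it replaces an existing line (and drops duplicates for that key) rather than appending another copy.

```python
import os
import tempfile
from pathlib import Path


def upsert_env_text(text: str, key: str, value: str) -> str:
    """Return .env content with KEY=value set exactly once (replace, or append if absent)."""
    out, found = [], False
    for line in text.splitlines():
        if line.split("=", 1)[0].strip() == key:
            if not found:              # keep the first occurrence, updated
                out.append(f"{key}={value}")
                found = True
            continue                   # drop any duplicate lines for the same key
        out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


def upsert_env_file(path: str, key: str, value: str) -> None:
    """Apply upsert_env_text to a file, swapping the new version in atomically."""
    env = Path(path)
    old = env.read_text() if env.exists() else ""
    # mkstemp creates the temp file readable and writable by the owner only.
    fd, tmp = tempfile.mkstemp(dir=str(env.parent))
    with os.fdopen(fd, "w") as fh:
        fh.write(upsert_env_text(old, key, value))
    os.replace(tmp, env)  # atomic rename over the old file
```

Commented-out lines such as `# OPENAI_API_KEY=...` are left alone, because only lines whose key matches exactly are replaced.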

After Setup

Send a voice note

Hold the mic icon in Telegram, speak, release. Your agent transcribes and replies — with voice if you set up TTS.

Mix voice and text freely

Send a voice note then type a follow-up — your agent handles both in the same conversation with no mode switching.

Always reply with voice

By default voice replies only trigger when you send voice. Tell your agent "always reply with voice" to make it permanent.
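The difference between the default "inbound" behaviour and "always reply with voice" comes down to one decision per message. A sketch of that rule follows. `inbound` is the mode named in the setup prompt; the `always` and `off` names are hypothetical labels, not confirmed config values.

```python
from enum import Enum


class VoiceReplyMode(Enum):
    OFF = "off"          # hypothetical: never reply with voice
    INBOUND = "inbound"  # from the setup prompt: voice only when you sent voice
    ALWAYS = "always"    # hypothetical: every reply is spoken


def should_reply_with_voice(mode: VoiceReplyMode, incoming_was_voice: bool) -> bool:
    """Decide whether the agent's next reply should be a voice note."""
    if mode is VoiceReplyMode.ALWAYS:
        return True
    if mode is VoiceReplyMode.INBOUND:
        return incoming_was_voice
    return False
```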

Change the voice (Option B)

Ask your agent to list available ElevenLabs voices and switch to the one you prefer. It updates its own config.

Common Questions

Does transcription work in other languages?

Yes. gpt-4o-mini-transcribe handles most major languages automatically, so no extra configuration is needed.

What does transcription cost?

OpenAI gpt-4o-mini-transcribe costs $0.003 per minute of audio. 100 voice notes a day at 30 seconds each is 1,500 minutes a month, which comes to about $4.50/month.
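To redo this estimate for your own usage, multiply the minutes of audio per month by the per-minute price. A small helper, assuming a 30-day month and the $0.003/minute rate quoted above:

```python
def monthly_transcription_cost(notes_per_day: int, seconds_per_note: float,
                               usd_per_minute: float = 0.003, days: int = 30) -> float:
    """Estimated monthly speech-to-text cost in USD."""
    minutes_per_month = notes_per_day * seconds_per_note / 60 * days
    return minutes_per_month * usd_per_minute
```

For 100 notes a day at 30 seconds each, `monthly_transcription_cost(100, 30)` is 1,500 minutes times $0.003, or $4.50.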

What does TTS cost?

OpenAI TTS is ~$0.015 per 1,000 characters — a typical voice reply costs fractions of a cent. ElevenLabs has a free tier (10,000 chars/month), then usage-based pricing.
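The same kind of estimate works for voice replies. This sketch uses the ~$0.015 per 1,000 characters and 10,000 free characters per month quoted above. The 300-character reply length used below is an assumption, not a measured figure.

```python
def tts_cost_usd(characters: int, usd_per_1k_chars: float = 0.015) -> float:
    """Estimated cost of speaking `characters` characters at a per-1,000-character rate."""
    return characters / 1000 * usd_per_1k_chars


def replies_in_free_tier(chars_per_reply: int, free_chars_per_month: int = 10_000) -> int:
    """How many whole replies of a given length fit in a monthly character allowance."""
    return free_chars_per_month // chars_per_reply
```

With a 300-character reply, `tts_cost_usd(300)` is about $0.0045 (under half a cent), and `replies_in_free_tier(300)` is 33 replies a month on the ElevenLabs free tier.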

What if something doesn't work after setup?

Ask your agent to check its voice config with /tts status or /status. You can also paste the setup prompt again — the agent will re-run the configuration steps.
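Before re-running the prompt, you can rule out one simple cause: a key that never reached the environment file, or a placeholder that was pasted unchanged. Below is a minimal diagnostic sketch, assuming the `/root/.openclaw/.env` path from the setup prompts and a plain `KEY=value` format. The helper names are hypothetical. The sketch complements, rather than replaces, the agent's own status commands.

```python
from pathlib import Path

# Keys each setup option needs, taken from the two prompts above.
REQUIRED_KEYS = {
    "A": ["OPENAI_API_KEY"],
    "B": ["OPENAI_API_KEY", "ELEVENLABS_API_KEY"],
}


def parse_env(text: str) -> dict:
    """Minimal KEY=value parser: skips blanks and comments, strips surrounding quotes."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(env: dict, option: str) -> list:
    """Keys the chosen option needs that are absent, empty, or still a [YOUR_...] placeholder."""
    return [k for k in REQUIRED_KEYS[option]
            if not env.get(k) or env[k].startswith("[YOUR_")]


def check_env_file(path: str = "/root/.openclaw/.env", option: str = "A") -> list:
    """Read the agent's .env file and report which required keys need attention."""
    return missing_keys(parse_env(Path(path).read_text()), option)
```

On the VPS, `check_env_file(option="B")` returning an empty list means both keys are present and filled in.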

Pro Tip: Fully Hands-Free Mode

Voice in + voice out = completely hands-free AI on Telegram. Perfect for driving, cooking, the gym — or whenever you think faster than you type. Tell your agent "always reply with voice" to make it permanent.

Ready? Open Telegram and paste the prompt.

Your agent configures itself. You'll be talking to it within minutes.
