Everything we learned, and everything we think you need to know: technical details on 24 kHz/G.711 audio, RTMP, HLS, and WebRTC; interruptions and VAD; cost, latency, tool calls, and context management
Really appreciate the detailed post Kwindla!
After a couple of hundred hours of Realtime API usage, there are some things that have us looking at Pipecat and owning more of the voice AI pipeline rather than waiting on Realtime to reach GA:
1. Turn-detection + hallucination bugs - we've found it fairly easy for the API to "go off the rails" under rapid interruptions. The API initially respects the end-of-speech settings, but rapid interruptions seem to dramatically degrade that. When this happens, the API occasionally generates audio completions that are both (1) far off the topic area and (2) absent from the output transcript. I can't reproduce the behavior with gpt-4o. (See the session-config sketch after this list.)
2. Sometimes OpenAI does not appear to send the audio deltas associated with the transcript. This shows up as long pauses in the conversation audio recording. (See the event-logging sketch below.)
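To rule out config on our side for issue 1, here's roughly how we pin the end-of-speech settings via session.update. A minimal sketch: the event and field names follow the Realtime API beta docs, and the threshold/padding values are illustrative, not recommendations.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def pin_turn_detection():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # extra_headers works on websockets < 14; newer releases call it
    # additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,            # speech-probability cutoff
                    "prefix_padding_ms": 300,    # audio kept before speech starts
                    "silence_duration_ms": 500,  # silence required to end a turn
                },
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])  # expect session.created, then session.updated
            if event["type"] == "session.updated":
                break

asyncio.run(pin_turn_detection())
```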
The completions are not as deterministic as gpt-4o's, but they have been "close enough". The two issues above (especially the first) have been hard for us to work around for production use cases at this point.
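And for anyone chasing issue 2, a rough way to make it visible in logs: timestamp audio vs. transcript deltas per response, and flag transcript text that arrives while audio has stalled. The event names are from the Realtime API beta; the 2-second gap threshold is arbitrary.

```python
import json
import time
from collections import defaultdict

audio_bytes = defaultdict(int)      # response_id -> base64 audio chars seen
last_audio_at = defaultdict(float)  # response_id -> time of last audio delta

def on_event(raw: str) -> None:
    """Feed every raw websocket message from the Realtime API through this."""
    event = json.loads(raw)
    rid = event.get("response_id", "?")
    now = time.monotonic()
    if event["type"] == "response.audio.delta":
        audio_bytes[rid] += len(event["delta"])  # base64 length as a proxy
        last_audio_at[rid] = now
    elif event["type"] == "response.audio_transcript.delta":
        # Transcript text streaming while audio has stalled for >2s is the
        # "silent transcript" symptom from issue 2 above.
        if last_audio_at[rid] and now - last_audio_at[rid] > 2.0:
            print(f"[warn] transcript w/o audio on {rid}: {event['delta']!r}")
```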
Have you tried feeding realtime voice input to the Realtime API with text output? Is that even possible with the Realtime API, and are there cost benefits to doing that?
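Something like this is what I have in mind, going off the documented `modalities` session field (just a sketch; I haven't verified the billing side):

```python
# Hypothetical voice-in / text-out session config. "modalities": ["text"]
# should suppress audio generation while audio input keeps streaming;
# output would then be billed at text-token rates instead of the much
# pricier audio-token rates.
import json

session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text"],         # drop "audio" from the outputs
        "input_audio_format": "pcm16",  # still sending 24 kHz PCM16 audio in
    },
}
print(json.dumps(session_update, indent=2))
```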