A low latency video chat only works when capture, encoding, transport, and server placement are tuned as one system. The difference between a conversation that feels natural and one that keeps tripping over itself is often only a few hundred milliseconds, so the useful work is in the details that actually cut delay, not the features that merely look impressive. In this article, I focus on the practical architecture choices, tuning steps, and tradeoffs that matter most for real-time video conversations and live video products.
The pieces that matter most
- WebRTC is the default foundation for conversational video because it is built for real-time media, not playback.
- UDP-first connectivity, nearby media regions, and a working TURN fallback often matter more than fancy UI features.
- Moderate resolution and frame rate usually beat pushing maximum quality by default.
- SFUs are the usual answer once a call moves beyond simple one-to-one peer connections.
- Low-latency streaming is useful for one-to-many broadcasts, but it is still a different tool from a true video call.
What latency actually means in a video conversation
In a live conversation, latency is not an abstract network metric. It is the pause between one person speaking and the other person seeing and hearing that moment in time. Once one-way delay rises much past 150 ms, turn-taking starts to feel less fluid; around 250 to 300 ms, people begin interrupting each other or pausing awkwardly because the rhythm of speech no longer matches the rhythm of the call.
That is why I treat latency as a conversation quality problem, not just a technical one. Twilio’s latency guidance lines up with what most teams see in production: people usually notice delay around 100 to 120 ms, and conversations begin to feel broken once the lag reaches the 250 to 300 ms range. You can still build something usable beyond that, but it will stop feeling immediate.
| Latency range | What it feels like | Practical takeaway |
|---|---|---|
| Under 100 ms one-way | Very close to face-to-face | Ideal for natural turn-taking and live collaboration |
| 100 to 150 ms one-way | Still responsive | Usually acceptable for most consumer calls |
| 150 to 300 ms one-way | Noticeable lag | Usable, but people will start talking over each other |
| Above 300 ms one-way | Awkward and delayed | Fine for some broadcasts, poor for real conversation |
The important point is that end-to-end delay is cumulative. Camera capture, processing, encoding, route selection, packet loss recovery, jitter buffering, decode time, and rendering all add up. Once you think in those terms, the next step is obvious: you need to find where the extra delay is being introduced before you can remove it.

Where delay is introduced in the stack
Most teams focus on bitrate too early. Bitrate matters, but the more reliable way to reduce delay is to trace the whole path and remove the slowest links first. In practice, the delay usually comes from four places:
| Stage | What it adds | What I would optimise first |
|---|---|---|
| Capture and preprocessing | Camera readout, background effects, beauty filters, and local compositing | Keep processing light and avoid unnecessary effects before the frame leaves the device |
| Encoding | Compression work on the sender | Use hardware acceleration where possible and avoid oversized defaults |
| Transport | Network distance, routing, packet loss, and relay hops | Keep media close to users and prefer direct UDP paths first |
| Buffering and decode | Receiver-side smoothing and rendering | Stabilise the network rather than hiding problems behind a bigger buffer |
The trap is easy to describe and easy to miss: when the network looks unstable, teams often increase buffering to keep the picture smooth. That can help with stutter, but it also pushes the conversation further behind real time. I would rather accept a small amount of controlled variation than hide a broken path behind extra delay.
Once you understand where the delay comes from, the next question is which protocol stack is actually designed to keep a call responsive in the first place.
Why WebRTC is usually the right foundation
For conversational video, WebRTC is the default baseline for a reason. It is built for real-time media in browsers and native apps, and it does not require plugins or a playback-style delivery chain. That makes it a strong fit for one-to-one calls, small group rooms, tutoring, telehealth, support sessions, and any other use case where the timing of speech matters.
WebRTC also handles the unglamorous but essential part of connectivity: discovery and negotiation. ICE, STUN, and TURN are not optional side details; they are the machinery that lets the call survive NATs, firewalls, and restrictive networks. In plain terms, STUN helps a client discover how it appears to the outside world, ICE tries the best available routes, and TURN steps in as a relay when direct media paths are blocked.
For multiparty rooms, the usual answer is not pure peer-to-peer. A Selective Forwarding Unit receives media and forwards the right stream to each participant without forcing every device to encode and send separate copies to everyone else. That matters because once the room grows, the bottleneck is no longer just bandwidth; it is also device load, scaling logic, and the quality of the server-side forwarding strategy.
Codec choice matters too. In WebRTC environments, Opus is the audio codec I expect by default, and VP8 or H.264 are the usual video workhorses. That baseline is important because the best codec is not the one with the best theoretical quality; it is the one that keeps the conversation stable across real devices and mixed network conditions. If you need extra control in a browser app, WebCodecs can also help because it exposes low-level, hardware-accelerated encode and decode paths with per-frame control.
WebRTC gives you the right foundation, but it does not save you from bad configuration. The next gains usually come from how you capture and encode video, not from changing the entire product direction.
How to tune capture, encoding, and bandwidth without overdoing it
The fastest way to ruin a real-time call is to assume that higher quality automatically feels better. It does not. A stable 720p call with low delay is usually better than a flaky 1080p stream that arrives late and chews through CPU. I usually start with the smallest settings that still fit the use case, then scale up only when the network and device can prove they deserve it.
- Keep resolution realistic. For many calls, 640x360 or 1280x720 is enough. Full HD by default is often wasted work.
- Use a sensible frame rate. 24 to 30 fps feels smooth for most conversations. If the use case is mostly faces and speech, 15 fps can be acceptable on weaker networks.
- Cap bitrate deliberately. Use sender parameters such as max bitrate and maximum frame rate where the platform supports them, instead of letting the encoder run unbounded.
- Prefer hardware acceleration. Hardware encode and decode usually reduce CPU pressure and help keep latency steady.
- Reduce heavy preprocessing. Background blur, super-resolution, and cosmetic filters can be useful, but they should be treated as optional, not as the default path.
- Use adaptive delivery. Simulcast or scalable video coding helps multiparty rooms by letting each participant receive the stream quality that fits their connection.
There is also a practical constraint question that many teams ignore too long: what is the minimum quality your users actually need? If the session is mostly head-and-shoulders conversation, the extra delay and bandwidth of a high-motion setup rarely pays off. If the session includes product demos, teaching, or screen sharing, you may need a different profile for the shared content than for the camera feed.
That is why tuning is never just about the sender. The network path and server location can erase all of those gains if they are poorly chosen, which leads straight into the infrastructure layer.
When network topology matters more than the app
For a UK audience, region placement is often the most underrated latency decision. If your users are mostly in the UK, I would place media close to London or at least in a nearby Western Europe region before I worried about more exotic optimisations. A long-haul route to another continent can add delay before your app code even gets a chance to help.
The other issue is connectivity quality. The best path is usually direct media over UDP, because it keeps latency low and avoids some of the head-of-line blocking you get from more conservative transports. When that path is unavailable, relays and TCP fallbacks are necessary, but every fallback adds friction. That is not a failure condition; it is a reality of consumer networks, VPNs, mobile carriers, and corporate firewalls.
| Connection path | What it means | Latency impact |
|---|---|---|
| Direct UDP | Best-case media path between client and media server | Lowest delay |
| TURN over UDP | Relay path when direct connectivity fails | Extra hop, but still a reasonable fallback |
| ICE over TCP | Fallback for networks that block UDP | Higher delay and more buffering risk |
| TURN over TLS | Last-resort route for restrictive firewalls | Most resilient, usually the slowest |
That is why I like products that try the best transport first and fall back only when they must. It keeps the happy path fast while still allowing calls to work in poor network environments. If your product serves mixed users across the UK, that balance is more useful than chasing an ideal connection profile that only works on a lab network.
Once the path is stable, the final enemies are usually product decisions that quietly add seconds even when the stack itself is healthy.
Common mistakes that quietly add seconds
The most common errors are not clever bugs. They are defaults that made sense in a different kind of product and never got revisited. I see the same patterns over and over:
- Treating broadcast delivery as chat delivery. HLS is excellent for large-scale playback, but it is the wrong tool for natural back-and-forth conversation.
- Shipping high resolution by default. Users notice lag more than they notice a minor drop from 1080p to 720p.
- Using large buffers as a bandage. This can hide jitter while making the call feel slower.
- Ignoring the uplink. Many call products are limited by what the user can send, not what they can receive.
- Forcing every call through a distant region. That saves configuration time and costs responsiveness.
- Adding heavy visual effects before the call is stable. Filters are fine when the core path is already healthy; they are a liability when the device is under load.
The deeper mistake behind most of these is the same: optimising for the best-case network instead of the worst 20 percent of users. Real products are judged by the people on uneven Wi-Fi, commuter 4G, corporate VPNs, and older laptops. If the system still feels live there, it will feel excellent for everyone else.
That same distinction also helps when you have to choose between a true video-call stack and a streaming stack, which is where many projects make the wrong architectural compromise.
Choosing between WebRTC and streaming stacks
Not every live video product should be built the same way. If the goal is a conversation, use a conversational stack. If the goal is to reach a large audience with a bit of delay, use a streaming stack. Mixing the two only works when you are clear about which requirement is more important.
| Option | Typical delay | Best for | Main limitation |
|---|---|---|---|
| WebRTC | Sub-second to around 1 second in well-tuned setups | 1:1 calls, small groups, support, telehealth, teaching | More complex to scale as a pure broadcast medium |
| Low-latency HLS | Roughly 1 to 2 seconds at scale in the best-designed deployments | Interactive broadcasts, watch parties, live shopping with chat | Still not fast enough for natural two-way turn-taking |
| Traditional HLS | Many seconds, often more | Broad compatibility, CDN-scale reach, passive audiences | Too slow for genuine video chat |
Apple’s low-latency HLS work is useful here because it shows where the streaming line really sits: the technology is built to get live broadcasts down to roughly one to two seconds at scale, not to replace real-time conversation. That is a good result for a show, a sports stream, or a shopping event. It is still not the same thing as being able to interrupt someone naturally in a call.
So if your product needs human back-and-forth, I would not force it into a playback-shaped pipeline. I would keep the media path conversational and reserve streaming stacks for the use cases that are actually broadcast-first. From there, the launch checklist becomes much simpler.
What I would ship first for a UK audience
If I were building this for users across the UK, I would start with a conservative, boring, and reliable setup before touching any advanced tricks. That usually produces the best first version of a low-latency calling product.
- Place signalling and media as close to London as possible for UK-heavy traffic.
- Use WebRTC with Opus for audio and VP8 or H.264 for video.
- Default to adaptive 360p to 720p video, not maximum resolution.
- Keep frame rates in the 24 to 30 fps range unless the use case allows less.
- Allow direct UDP first, then TURN over UDP, then TCP or TLS fallbacks.
- Enable simulcast or SVC for multiparty rooms instead of sending one oversized stream to everyone.
- Measure one-way delay, jitter, packet loss, CPU load, and call setup time together.
That combination usually gets you far more benefit than chasing exotic media tweaks too early. Once the call is stable, you can layer in better background effects, smarter bitrate adaptation, and richer room features without breaking the feeling of immediacy. For this kind of product, the real win is not simply connecting two users; it is making the connection feel close enough that the conversation stays human.