This article kicks off “Audio over IP Networks for Events - An Opinionated Guide”, a blog series that aims to bring best practices from large-scale datacenter & internet service provider networking into the field of temporary event networks that primarily transport audio content.
I’m hoping to eventually develop these articles into hands-on workshops to help audio professionals build scalable, resilient and reliable networks.
I assume that the reader has basic knowledge of networking concepts, which means that you should be able to set up a basic Audio over IP network with VLANs, Spanning Tree Protocol & IGMP snooping on your own.
I will try to guide you through the concepts in a way that lays the foundations for a deeper understanding, but this is by no means a complete guide to networking. Given the starting point laid out above, my goal is that you don’t have to go back and forth, googling every second word to understand the concepts I am trying to explain. I have tried to find a middle ground for the level of detail. However, if you feel that I am skipping too much, please let me know and I will try to improve the article.
- Part 1: Foundations and why L2 is considered harmful
- Part 2: Layer 3 Network Design Principles
- Part 3: OSPF for self-healing networks that just work (TM)
- Part 4: BGP as advanced routing protocol for when you need a little bit more spice
- Part 5: Using PIM-SM to distribute Multicast
- Part 6: Best Practices: Proven Design Patterns and Reference Designs
- Part 7: Gear Guide: Selecting Hardware That Actually Works
- Part 8: Test Before You Deploy! Network Simulation Tools and Techniques
As the title warns, this is an opinionated guide that reflects my personal opinions and field experience.
I make no claims to absolute truth and it is well within the realms of possibility that some statements I make are just plain wrong or lack exposure to scenarios that would shift my thinking.
I welcome all questions, suggestions and feedback (even if it’s a rant about how you think I’ve completely missed the mark).
The state of Audio over IP Networks for Events
The state of temporary Audio over IP Networks is poor. You commonly see large, stretched VLANs and redundancy provided by some variant of Spanning Tree Protocol. If you’re lucky, you might find that IGMP Snooping is enabled and configured correctly, but more often than not, it is not.
This is a brittle design that is prone to failure and does not scale well. If you dive into the world where the real networking happens (e.g. large-scale datacenter & internet service provider networking), you will quickly find that operators there moved beyond this kind of network design ages ago and now consider it bad practice.
Design goals for Audio over IP Networks
The design goals for Audio over IP Networks are simple:
- The network must be compatible with real-world protocols and devices used in the audio industry.
  - Common protocols are primarily Dante, sometimes Milan, AVB or Ravenna
  - End devices (like audio consoles, speakers, etc.) must be assumed to be “dumb” and not capable of any routing protocols (not even static routing, except a default route)
- The network must be able to transport Multicast traffic and provide consistent, low latency
- The network must be reliable: there must not be any interruptions or dropouts
- The network must be resilient: it must be able to recover from failures without manual intervention
- The network should be sufficiently simple to configure, maintain and troubleshoot, so that it can be prepared and set up by a single person in a short amount of time
- The network should be scalable and able to handle a large number of devices distributed across a large area
- The network should be flexible and able to transport a wide range of protocols and devices, not just audio (e.g. control traffic)
Why Layer 2 is considered harmful
As outlined above, temporary Audio over IP networks for events often use Layer 2 fabrics with stretched VLANs.
Transport or forwarding of packets in Layer 2 fabrics is based on Bridging, which is a method of forwarding packets based on their MAC addresses. This is in contrast to Layer 3 networks, which use routing to forward packets based on their IP addresses.
I will use the following terminology:
- Layer 2 -> Bridging -> Forwarding based on MAC addresses
- Layer 3 -> Routing -> Forwarding based on IP addresses
I will also use the terms “Fabric” and “Network” interchangeably.
What even is a stretched VLAN?
This term is quite common in the world of network operator groups, but probably not so much in the world of audio professionals.
“Stretched VLANs” refers to a network design where a single VLAN is extended across multiple network bridges (Layer-2 switches). In other words, a VLAN that spans multiple switches is a stretched VLAN. If you have a link between two switches, and this link transports traffic for a single VLAN, then this is a stretched VLAN.
A stretched VLAN is a single “Broadcast Domain”. A Broadcast Domain is a logical division of a computer network, in which all nodes can reach each other by broadcast at the data link layer (Layer 2). If any node sends a broadcast frame, all other nodes in the same Broadcast Domain will receive it. VLANs (VIRTUAL Local Area Networks) are used to create separate Broadcast Domains within a single physical network.
The term “virtual” can be confusing here. “Virtualization” simply means that a single physical resource is divided into multiple somewhat isolated logical resources.
Flooding of BUM (Broadcast, Unknown Unicast and Multicast) traffic - the fundamental problem
BUM traffic stands for Broadcast, Unknown Unicast and Multicast traffic and it is reasonable to say that this is the root of all evil in Layer 2 networks.
BUM traffic is fundamentally flooded across the entire Layer 2 fabric, which means that every Bridge (Layer-2 Switch) will receive it and forward / copy / duplicate it to all ports in the respective Broadcast Domain (VLAN), except the port it arrived on.
This leads to a number of problems.
The most widely known and most common issue is the network loop. If you have a network loop (or bridging loop), BUM traffic will be flooded infinitely, leading to a broadcast storm that will likely bring down the entire network. Layer 2 frames do not have a Time To Live (TTL) field, so they will just keep circulating. The network meltdown is caused by multiple effects:
- Link congestion: The network links become saturated with the circulating traffic
- Control Plane overload: In network devices like switches, BUM traffic is often punted from the ASIC (“Network Chip”) to the Control Plane for processing, which can slow down or even crash the device
- Overload of end hosts: The end hosts can be overwhelmed by the sheer amount of traffic they receive
Therefore, it is crucial to prevent network loops in Layer 2 networks. Several mechanisms exist to prevent network loops, the most common one being the Spanning Tree Protocol (STP).
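The missing TTL can be illustrated with a small simulation. This is a minimal sketch under strong assumptions: it only models switches (no end hosts), a single broadcast frame, and one link per switch pair. The function name `flood` and the three topologies are made up for illustration.

```python
def flood(links, origin, steps):
    """Count the copies of one broadcast frame in flight after each hop."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each copy in flight is (switch it came from, switch it is at now).
    in_flight = [(None, origin)]
    history = []
    for _ in range(steps):
        forwarded = []
        for came_from, here in in_flight:
            for peer in neighbors.get(here, ()):
                if peer != came_from:  # flood to every port except ingress
                    forwarded.append((here, peer))
        in_flight = forwarded
        history.append(len(in_flight))
    return history


# Hypothetical switch-only topologies.
chain = [("A", "B"), ("B", "C")]
triangle = [("A", "B"), ("B", "C"), ("C", "A")]
full_mesh = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]

print(flood(chain, "A", 5))      # [1, 1, 0, 0, 0]
print(flood(triangle, "A", 5))   # [2, 2, 2, 2, 2]
print(flood(full_mesh, "A", 5))  # [3, 6, 12, 24, 48]
```

In the loop-free chain the frame dies out once it reaches the last switch. In the triangle the two copies circulate forever, and in the full mesh the number of copies doubles with every hop. Nothing in the frame itself ever tells a switch to stop forwarding it.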
Lack of address hierarchy
Layer-2 works based on MAC addresses, which are mostly flat and do not have any helpful hierarchy. This means that there is no way to group devices based on their location or function, which makes it difficult to manage and troubleshoot the network. MAC addresses do have some hierarchy, in that the first 3 octets (bytes) represent some Organizationally Unique Identifier (OUI) and the last 3 octets are assigned by the manufacturer. However, this is not very useful in practice, as it does not provide any information about the location or function of the device.
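To make the split concrete, here is a minimal Python sketch that separates a MAC address into its OUI and its manufacturer-assigned half. The function name `split_mac` and the device names are made up for illustration, and the addresses come from the range reserved for documentation in RFC 7042.

```python
def split_mac(mac):
    """Split a MAC address into (OUI, manufacturer-assigned part)."""
    octets = mac.lower().replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}: {mac!r}")
    return ":".join(octets[:3]), ":".join(octets[3:])


# Hypothetical devices: one at front of house, one on stage.
foh_console = "00:00:5E:00:53:01"
stage_box = "00-00-5e-00-53-a7"

print(split_mac(foh_console))  # ('00:00:5e', '00:53:01')
print(split_mac(stage_box))    # ('00:00:5e', '00:53:a7')
```

Both addresses share the same OUI, yet nothing in either half says which switch, room or VLAN the device is connected to.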
Layer-3 addressing (IP addresses), on the other hand, is inherently hierarchical, which routers exploit through a principle called “LPM” (Longest Prefix Match). This will be covered in the next article, but in short, it means that you can establish a hierarchy of devices and group them based on their location or function by creating subnets with varying prefix lengths.
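As a preview, Longest Prefix Match can be sketched in a few lines using Python's standard `ipaddress` module. The route table, the prefixes and the next-hop names are hypothetical. Real routers use specialised lookup structures instead of a linear scan.

```python
import ipaddress


def longest_prefix_match(routes, destination):
    """Return the next hop of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    best_network, best_next_hop = None, None
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if dest in network and (
            best_network is None or network.prefixlen > best_network.prefixlen
        ):
            best_network, best_next_hop = network, next_hop
    return best_next_hop


# Hypothetical hierarchy: whole event -> stage left -> monitor rack.
routes = {
    "0.0.0.0/0": "uplink",
    "10.0.0.0/8": "event-core",
    "10.1.0.0/16": "stage-left",
    "10.1.20.0/24": "stage-left-monitors",
}

print(longest_prefix_match(routes, "10.1.20.7"))   # stage-left-monitors
print(longest_prefix_match(routes, "10.1.99.1"))   # stage-left
print(longest_prefix_match(routes, "10.200.0.1"))  # event-core
print(longest_prefix_match(routes, "192.0.2.1"))   # uplink
```

A single `/16` entry is enough to reach every device on stage left, no matter how many there are. That kind of aggregation has no equivalent for flat MAC addresses.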
Scalability issues
Layer-2 networks are not very scalable, as they rely on flooding BUM traffic across the entire Broadcast Domain. This means that the more devices you add to the network, the more traffic will be flooded.
Due to the lack of an address hierarchy, it is also impossible to group devices or establish hierarchies.
Every bridge must know the MAC address of every device in the Broadcast Domain, which means that the MAC address table will grow with the number of devices. If you have too many devices, the MAC address table will overflow and the switch will start flooding traffic for every destination it could not learn, which will just reinforce the problem.
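The overflow behaviour can be sketched with a toy learning bridge. This is a simplified model, not how any particular switch works: entries never age out, a full table simply stops learning, and the class name `LearningBridge`, the port numbers and the short MAC labels are all made up.

```python
class LearningBridge:
    """Toy MAC-learning bridge with a fixed-size address table."""

    def __init__(self, ports, table_size):
        self.ports = list(ports)
        self.table_size = table_size
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source, then return the list of egress ports."""
        # Learn or refresh the source, but only while there is room.
        if src_mac in self.mac_table or len(self.mac_table) < self.table_size:
            self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown unicast: flood to every port except ingress.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination is on the ingress segment: filter
        return [out_port]


bridge = LearningBridge(ports=[1, 2, 3], table_size=2)
print(bridge.receive("aa", "bb", in_port=1))  # [2, 3]  bb not learned yet
print(bridge.receive("bb", "aa", in_port=2))  # [1]     aa is known
print(bridge.receive("cc", "aa", in_port=3))  # [1]     table full, cc not learned
print(bridge.receive("aa", "cc", in_port=1))  # [2, 3]  cc stays unknown: flooded
```

Once the table is full, every frame towards `cc` is flooded for as long as the table stays full, even though `cc` has already sent traffic through the bridge.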
As explained above, Layer-2 fabrics take special care to avoid network loops. Because only one path is allowed to be active at a time (notwithstanding exotics like SPB or TRILL, see below), scalability is limited. You can add additional links, but they will not be used for forwarding traffic, they will only be used for redundancy and not increase the available bandwidth.
Therefore, a single Broadcast Domain must be considered a single failure domain.
Multi-Homing on Layer-2 is hard
Consider the common scenario where a server has multiple uplinks connected to multiple different switches. The uplinks should appear as if they are a single logical link and all be utilized (not just active-standby), so that the server can use all uplinks for redundancy, load balancing and increased bandwidth. This is called Multi-Homing.
On Layer-3, this is easily solved by a principle called Equal Cost Multi-Path (ECMP) routing, which allows multiple paths to be used for forwarding traffic.
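The core idea of ECMP can be sketched as a hash over the flow identifier: packets of the same flow always take the same path, while different flows spread across all available paths. This is only an illustration. The function name `pick_next_hop`, the addresses and the next-hop names are hypothetical, and real hardware uses much simpler hash functions than SHA-256.

```python
import hashlib
from collections import Counter


def pick_next_hop(flow, next_hops):
    """Map a flow (5-tuple) onto one of several equal-cost next hops."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical equal-cost uplinks towards two different switches.
next_hops = ["switch-a", "switch-b"]

# Flow = (source IP, destination IP, protocol, source port, destination port)
flows = [("10.1.20.7", "10.2.30.9", "udp", 50000 + i, 5004) for i in range(200)]

usage = Counter(pick_next_hop(flow, next_hops) for flow in flows)
print(usage)  # both uplinks carry a share of the 200 flows

# The same flow always hashes to the same uplink, so packets stay in order.
assert pick_next_hop(flows[0], next_hops) == pick_next_hop(flows[0], next_hops)
```

If one uplink fails, the router removes it from `next_hops` and the remaining paths take over. No protocol has to keep the two switches in sync to make that happen.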
On Layer-2, this is much more difficult. In a basic Layer-2 fabric, the solution is mostly a collection of proprietary protocols like MLAG (Multi-Chassis Link Aggregation, also called MCLAG, MC-LAG or MCT - Multi-Chassis Trunking). While some good MLAG implementations exist (like Arista’s MLAG or Cumulus Linux’s MLAG), MLAG has a history of being buggy and leading to entire data center network meltdowns. Additionally, MLAG is proprietary and not standardized, which means that it is not interoperable between different vendors.
In virtualized Layer-2 fabrics, e.g. EVPN-VXLAN fabrics, complex protocols like EVPN-MH (EVPN Multi-Homing) are used to achieve the same goal. These protocols are complex and require a lot of configuration, which makes them difficult to set up and maintain.
The workarounds, aka Spanning Tree Protocol, VLANs, IGMP Snooping, MLAG and exotics like SPB or TRILL
Over the years, several workarounds have been developed in order to mitigate the issues with Layer-2 networks. I am specifically referring to these as “workarounds” and not “solutions” because they do not solve the fundamental problems, but rather try to work around them. They treat the symptoms, not the root cause.
Spanning Tree Protocol (STP)
Spanning Tree Protocol is a protocol that is used to prevent network loops in Layer-2 networks. It works by blocking redundant links in the network, so that there is only one active path between any two devices. Several variants of STP exist, e.g. RSTP (Rapid Spanning Tree Protocol) or MSTP (Multiple Spanning Tree Protocol), but they all work on the same principle.
STP has several drawbacks. As mentioned above, it limits scalability, as only one path is active at a time. This means that you cannot use multiple links to increase the available bandwidth.
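The effect on available links can be shown with a simplified model. The sketch below does not implement STP itself (no BPDUs, bridge priorities or port costs). It assumes equal link costs and builds a tree from a chosen root with a breadth-first search, which is enough to show how many links end up unused. The function name `spanning_tree` and the topologies are made up.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (active, blocked) link sets for a loop-free tree from root."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        here = queue.popleft()
        for peer in sorted(neighbors[here]):  # deterministic tie-break
            if peer not in visited:
                visited.add(peer)
                active.add(frozenset((here, peer)))
                queue.append(peer)

    blocked = {frozenset(link) for link in links} - active
    return active, blocked


def as_pairs(link_set):
    """Render a set of links as a sorted list of tuples for printing."""
    return sorted(tuple(sorted(link)) for link in link_set)


triangle = [("A", "B"), ("B", "C"), ("C", "A")]
full_mesh = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]

active, blocked = spanning_tree(triangle, root="A")
print(as_pairs(active))   # [('A', 'B'), ('A', 'C')]
print(as_pairs(blocked))  # [('B', 'C')]

active, blocked = spanning_tree(full_mesh, root="A")
print(len(active), len(blocked))  # 3 3
```

In the four-switch full mesh, half of the links carry no traffic at all. Traffic between B and C has to detour via A even though a direct link exists.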
Depending on the specific flavour (e.g. RSTP vs normal STP), STP also has a slow (or very slow) convergence time, which means that it can take a long time to recover from a failure. This is especially problematic in temporary networks for events, where you need to be able to recover quickly from failures. Even “Rapid” Spanning Tree Protocol (RSTP) can take several seconds to converge, which is unsuitable for audio networks.
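To put these convergence times into an audio perspective, here is a small back-of-the-envelope calculation. The classic STP timers are the standard defaults (Max Age 20 s, Forward Delay 15 s). The 2-second figure for RSTP and the 48 kHz sample rate are assumptions chosen for illustration, not measurements.

```python
MAX_AGE_S = 20        # default Max Age timer of classic STP
FORWARD_DELAY_S = 15  # default Forward Delay timer of classic STP


def classic_stp_worst_case_outage(max_age=MAX_AGE_S,
                                  forward_delay=FORWARD_DELAY_S):
    """Age out the stale BPDU, then pass through listening and learning."""
    return max_age + 2 * forward_delay


def samples_lost(outage_seconds, sample_rate_hz=48_000):
    """Number of audio samples per channel that fall into the outage."""
    return int(outage_seconds * sample_rate_hz)


worst_case = classic_stp_worst_case_outage()
print(worst_case)                # 50
print(samples_lost(worst_case))  # 2400000

# Assumed example: an RSTP reconvergence that takes 2 seconds.
print(samples_lost(2))           # 96000
```

Even the optimistic 2-second case is a clearly audible gap on every affected channel.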
STP can also require a significant amount of configuration and tuning (e.g. setting the correct bridge priority, port costs, port guard, etc.). Some proprietary features like portfast can be a real footgun and lead to network loops (ironic, isn’t it?).
Finally, STP is risky in the sense that a bug in the implementation can lead to a network meltdown. While most STP implementations are quite stable, there have been cases where bugs in the implementation have led to network loops and broadcast storms.
VLANs
VLANs try to moderate the inherent problems of Layer-2 networks by creating separate logical partitions or Broadcast Domains. While VLANs provide some degree of isolation, they are merely a drop in the bucket and the fundamental problems of Layer-2 networks still remain.
VLANs also give many people a false sense of security. VLANs are not a security feature! Once you have some kind of Inter-VLAN routing that does not go through a firewall with proper filtering, you can easily access devices in other VLANs.
IGMP Snooping
IGMP Snooping is a feature that is used to manage Multicast traffic in Layer-2 networks. The goal is to forward Multicast traffic only to the ports that have requested it. This can help to reduce the amount of BUM traffic that is flooded across the network, but it still does not solve the fundamental problems of Layer-2 networks.
Configuring IGMP Snooping correctly can also be quite tricky and it is often not done properly in practice.
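The forwarding decision behind IGMP Snooping can be sketched as a table that maps multicast groups to the ports that asked for them. This is a toy model: the class name `SnoopingSwitch`, the port numbers and the group address are hypothetical, and the handling of unregistered groups (flood or not) differs between vendors, so it is modelled here as a switchable assumption.

```python
class SnoopingSwitch:
    """Toy model of the IGMP snooping forwarding decision."""

    def __init__(self, ports, router_ports=(), flood_unregistered=True):
        self.ports = set(ports)
        self.router_ports = set(router_ports)  # ports towards the querier
        self.flood_unregistered = flood_unregistered
        self.members = {}  # group address -> set of ports that joined

    def join(self, group, port):
        """A membership report for group was seen on port."""
        self.members.setdefault(group, set()).add(port)

    def leave(self, group, port):
        """A leave for group was seen on port."""
        ports = self.members.get(group, set())
        ports.discard(port)
        if not ports:
            self.members.pop(group, None)

    def egress_ports(self, group, in_port):
        """Ports a frame for group is copied to."""
        if group not in self.members:
            # No receiver known: flood or restrict, depending on config.
            targets = self.ports if self.flood_unregistered else self.router_ports
        else:
            targets = self.members[group] | self.router_ports
        return sorted(targets - {in_port})


stream = "239.69.0.10"  # hypothetical multicast group of one audio stream
switch = SnoopingSwitch(ports=[1, 2, 3, 4], router_ports=[4])

print(switch.egress_ports(stream, in_port=1))  # [2, 3, 4]  nobody joined: flooded
switch.join(stream, port=2)
print(switch.egress_ports(stream, in_port=1))  # [2, 4]     receiver + router port
switch.leave(stream, port=2)
print(switch.egress_ports(stream, in_port=1))  # [2, 3, 4]  flooded again
```

The sketch also hints at why misconfiguration is so common: the table only helps while membership reports keep arriving, and what happens to groups without known receivers depends on a setting that is easy to overlook.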
MLAG (Multi-Chassis Link Aggregation)
As explained above, MLAG exists as a proprietary workaround to the problem of Multi-Homing on Layer-2 networks, with all the drawbacks.
Exotics like SPB or TRILL
In order to address the scalability issues of Layer-2 networks, some exotic protocols have been developed, such as SPB (Shortest Path Bridging, IEEE 802.1aq) or TRILL (Transparent Interconnection of Lots of Links, RFC 5556 and others).
The idea of SPB is great. It utilises the mature IS-IS protocol (Intermediate System to Intermediate System) to allow multiple active paths between devices (increasing the available bandwidth and improving redundancy), while still preventing network loops and preserving the plug-and-play nature of Layer-2 networks. TRILL has a similar goal, but uses a different approach.
The problem with these protocols is that they are not widely supported and rather exotic. Additionally, they are a step in the wrong direction. Instead of getting down to the root of the problem, they try to be another (better?) workaround for the fundamental issues of Layer-2 networks. This leads us to the next question:
Why are we still using Layer 2 fabrics and why is it so hard to get rid of them?
This is a more general question that applies to all Layer-2 networks, not just Audio over IP networks for events.
From my point of view, there are four main reasons why Layer-2 networks are still so prevalent:
- Plug and Play, aka “they just work”
  - Layer-2 networks are easy to set up and do not require any configuration. You can just plug in the devices and they will start communicating with each other.
  - This is especially true for Audio over IP networks, where devices are often “dumb” and do not support any routing protocols.
  - But there is no reason why an alternative could not be just as easy to set up!
- Legacy
  - There are too many existing devices, so even if there were a perfect alternative, it would not be possible to switch to it overnight, and some degree of backwards compatibility is required.
  - We still have MAC addresses for the same reason - they were required in 10BASE5 networks, but most Ethernet links are point-to-point these days and could work with an alternative approach.
- Many people are dogmatic about Layer-2 networks
  - “We’ve always done it this way, and it has always worked” - ignoring the fact that it has always been brittle and prone to failure, and not admitting that there WERE meltdowns.
  - They often do not understand the fundamental problems of Layer-2 networks and think that they are just fine.
- False assumptions about Layer-3 networks or lack of knowledge that they even exist
  - Many people think that Layer-3 networks are too complex or require too much configuration, which is not true.
  - Layer-3 networks can be just as easy to set up and maintain as Layer-2 networks, especially with modern routing protocols like OSPF or BGP.
  - Many people think that Layer-3 networks are overpowered for their purposes and only required for super-large networks, which is also not true. Layer-3 networks can be just as simple and effective for small networks as they are for large networks.
  - Many people do not even know that Layer-3 networks exist and think that Layer-2 is the only option.
Conclusion
We’ve explored why the event industry’s standard approach, Layer 2 networks with stretched VLANs, taped together with the Spanning Tree Protocol and IGMP Snooping, is fundamentally flawed. These networks flood excessive BUM traffic, don’t scale well and offer poor reliability and resilience, resulting in a high risk of traffic disruption and thus audio dropouts.
There’s a better way. Layer 3 fabrics solve these problems through hierarchical addressing and routing protocols. They’re not more complex. They’re just different, and they actually work much more reliably at scale.
In Part 2, we’ll explore Layer 3 network design principles and show you how to build networks that scale, recover gracefully from failures, and provide the reliability that events demand. It’s time to move beyond the workarounds.