SD-WAN in a Work from Anywhere World

SD-WAN vs VPN

Connecting remote users to corporate IT resources has never been trivial. The shift to remote work was already well underway for many before 2020, but the global COVID-19 pandemic accelerated the transition dramatically. The change happened so fast that many organizations struggled to get users connected once working from home became a requirement. Remote access to an organization’s applications traditionally requires some kind of virtual private network (VPN) connection so users can encrypt and secure access to the sensitive data they need to do their jobs. We witnessed firsthand as our clients scaled up to meet VPN demand: firewalls crashing from exceeding capacity, mad scrambles to upgrade licensing for more users, software hastily installed on laptops, and cloud-based VPN termination added to elastically expand capacity. It was certainly an eye-opening experience for many to find that their infrastructure simply was not prepared to scale up quickly when they needed it most.

So how can we accommodate this sort of demand better in the future? With a bit of hindsight and introspection on what we just went through, are we as an industry thinking about remote connectivity in the correct way? Managing and securing remote connectivity for users can be achieved a number of ways, each with its own trade-offs. There are some very progressive, forward-looking models built with a “cloud first” mindset that do not require VPN connectivity at all. That said, those are outliers in the corporate world today. Most organizations have a legacy application or some other need that requires a VPN for connectivity into their IT resources. Let’s explore a couple of methods of using VPNs for corporate connectivity and the compromises of each.

Endpoint Agent/Client Software

Many folks leverage an approach that involves a software client or agent running on the user device to establish the VPN and “tunnel” the user’s traffic securely to and from the IT environment. Though very common and popular, is this the best way to connect users in a modern hybrid and multi-cloud environment?

The positive things about this approach:

  • Most organizations have a firewall, and the functionality to connect users via VPN is often already baked in.
  • No additional hardware is required beyond the firewall and the endpoint itself.
  • Some VPN software agents inspect traffic before it enters the tunnel, so threats can be caught before traffic is sent into the corporate network. This puts the security perimeter very close to the user.
  • This is a well-understood and very common deployment scenario.
  • User identity, and therefore access policy, is known and can be managed by virtue of the user logging into the machine hosting the agent.

The negative things about this approach:

  • Certain devices can’t run endpoint software (iOS, Android, unsupported and older operating systems, etc.), so using phones, tablets and older computers may not be an option.
  • Agents put the burden of inspecting, securing and reporting on network and application usage on the user device, which consumes compute resources and can detract from the user experience.
  • Managing permissions and controls on the user device can be difficult and time consuming.
  • Hybrid, multi-cloud and SaaS network connectivity needs become complex to manage and secure.
  • Additional licensing costs may be required to add users.
  • Lack of network visibility without additional tools on the device.

So we get VPN connectivity included with components we may already have, but there are some reasons why it is not a one-size-fits-all model. Let’s contrast it with an alternate approach.

SD-WAN Network Appliance

Another approach for remote users to access IT resources is to leverage an actual network appliance to terminate the WAN connectivity and then connect the user device via Ethernet or Wi-Fi. Many platforms have SD-WAN capabilities today, not to mention security features baked in, so for the sake of argument let’s assume we are working with a modern edge appliance with these features.

The positive things about this approach:

  • No software agent is required, so any device that can connect via Ethernet or Wi-Fi can participate, independent of operating system.
  • Inspection, routing, access control and content filtering happen on the appliance rather than consuming resources on the user device.
  • Network and application visibility/telemetry is available at the edge, including the ability to issue packet captures.
  • WAN and application optimization features are typically baked in to correct problems like packet loss and jitter on the fly.

The negative things about this approach:

  • Additional devices to install, manage and support
  • Additional hardware costs
  • Depending on the platform, additional licensing costs
  • Without an agent, the state of the end-user device cannot be validated before it connects
  • More planning and coordination with users is required to get the network connected versus simply getting on Wi-Fi/Ethernet and firing up a VPN client.

In conclusion, which is better?

So which is preferable? The age-old “it depends” applies. In most cases, my design preference would be the SD-WAN network appliance. I may be biased as a network practitioner, but I predict many will move to a network-based approach for work from anywhere. As computing capabilities evolve into ever smaller packages, remote users will have a little puck-sized appliance that gives them access to network resources.

My key reasons for this are:

  • The lack of a software agent requirement allows for user device independence.
  • There is no need to manage software on user machines, i.e. no dealing with OS permission issues, no keeping agent software up to date, no user performance impact from an agent, etc.
  • A network appliance offers more network and application visibility/telemetry opportunities, since it can stream this information, not to mention the ability to easily issue a packet capture at the edge (see the sketch after this list).
  • Though you will have some additional costs to install and manage the hardware, there are great options to automate and orchestrate this control, not to mention features like zero-touch provisioning to stand the appliances up. It can be argued that deployment can happen more rapidly.
  • In the future, there is the potential to install apps at the edge. Examples would be synthetic application monitoring and measurement platforms, application optimization, data synchronization, etc.
  • WAN and application optimization tools are typically baked in to clean up performance, correcting problems like packet loss and jitter on the fly.
  • Managing routing, access control, content filtering and the other functions we typically depend on network devices for today is well understood and easier on a network appliance.
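
To illustrate the kind of on-demand edge visibility I mean, here is a minimal Python sketch of a quick flow summary. It uses the scapy library as a stand-in for the capture tooling built into a real appliance; sniffing typically requires root privileges, and the packet count is just an example.

```python
from collections import Counter
from scapy.all import IP, sniff  # pip install scapy

flows = Counter()

def tally(pkt):
    """Count packets per (source, destination) IP pair."""
    if IP in pkt:
        flows[(pkt[IP].src, pkt[IP].dst)] += 1

# Grab 100 packets on the default interface without storing them in memory.
sniff(prn=tally, count=100, store=False)

# Print the five busiest conversations seen at the edge.
for (src, dst), count in flows.most_common(5):
    print(f"{src} -> {dst}: {count} packets")
```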

What do you think? Which approach seems better to you for remote connectivity, agent software or SD-WAN? Please comment here or on social media with your thoughts. As always, thanks for reading and I certainly would appreciate any input you may have!

Is your Internet connectivity REALLY redundant?

Outages

The CenturyLink/Level3 internet outage on August 30th, 2020 got a lot of network engineers thinking about internet reachability and the ways things can go wrong. The way this particular failure played out was unique and definitely gave us all a lot to consider in the way of oddball failure scenarios. Problems started for CenturyLink/Level3 when a BGP Flowspec announcement came from a datacenter in Mississauga, Ontario, Canada. Flowspec, a security mechanism within BGP, is commonly used to filter large volumetric distributed denial of service (DDoS) attacks within a network. The root cause of this particular issue appears to be operator error: a CenturyLink engineer was allowed to put a wildcard entry into a Flowspec command meant to block such an attack. This misformatted entry caused many more IP addresses than intended to be filtered, wreaking havoc on the CenturyLink/Level3 backbone within Autonomous System (AS) 3356. The rule’s filtering tore down BGP sessions across the backbone, causing instability and reachability issues throughout.
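
To make the scale of that kind of mistake concrete, here is a minimal Python sketch. The prefixes are illustrative documentation ranges, not the actual rule from the incident; the point is how a wildcard destination match balloons the address space a filter covers.

```python
import ipaddress

# What the operator intends to filter: one hypothetical attack source.
intended = ipaddress.ip_network("203.0.113.7/32")

# What a wildcard/malformed destination match effectively becomes: everything.
accidental = ipaddress.ip_network("0.0.0.0/0")

print(f"Intended rule covers {intended.num_addresses} address")
print(f"Wildcard rule covers {accidental.num_addresses:,} addresses")

# Any BGP session whose endpoint falls inside the filtered range is disrupted,
# which is how one Flowspec rule can destabilize an entire backbone.
peer = ipaddress.ip_address("198.51.100.1")  # hypothetical session endpoint
print("Peer traffic filtered by wildcard rule?", peer in accidental)
```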

One very interesting bit about how things failed was what happened when other networks tried to shut down their BGP sessions to AS 3356. CenturyLink/Level3 didn’t stop propagating prefixes/IP address blocks even after the BGP sessions were shut down. This made the BGP speakers still connected to AS 3356 believe it was still a valid path to reach those prefixes/IP addresses when it no longer was. That traffic was then “blackholed” within the CenturyLink/Level3 backbone because there was no longer an exit point to reach the IP addresses. So not only could you not use the backbone during the disruption; the failure could actually have prevented those who proactively disconnected from CenturyLink/Level3 from utilizing the alternate paths they were connected to.
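
A toy model of that failure mode might look like the sketch below. The ASN and prefix are hypothetical placeholders; the point is that an announcement that outlives the session it was learned over leaves peers forwarding traffic toward a path with no exit.

```python
# Routes the backbone learned, keyed by prefix, valued by the exit session.
learned = {"198.51.100.0/24": "BGP session to AS64500"}

# What the backbone is announcing to everyone still peered with it.
advertised = set(learned)

# The downstream network (hypothetical AS64500) shuts down its session...
learned.pop("198.51.100.0/24")

# ...but the withdrawal never propagates to the backbone's remaining peers.
for prefix in advertised - set(learned):
    print(f"{prefix}: still advertised but has no exit -> traffic is blackholed")
```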

So a question comes to the mind of many network engineers examining the post-mortem of this event: how can I make sure my network is not affected if this happens again? A few items come to mind as things to take into consideration:

SD-WAN – Now mainstream and very mature, SD-WAN is a fantastic way to overcome connectivity issues over the Internet. Because probes are sent periodically to measure path performance, the right SD-WAN solution can route around performance problems on a network. An SD-WAN overlay alone can’t resolve every issue, but combined with some of the other recommendations here, it certainly gives you greater resilience.
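
The probing logic itself is simple at heart. Here is a rough Python sketch of the idea; real SD-WAN platforms source probes out each WAN interface and use purpose-built probe protocols, and the hostnames below are placeholders, so treat this as an illustration of the selection logic only.

```python
import socket
import statistics
import time

# Hypothetical probe targets, one reachable over each uplink.
PATHS = {
    "isp_a": ("probe-a.example.net", 443),
    "isp_b": ("probe-b.example.net", 443),
}

def probe_rtt_ms(host, port, attempts=5, timeout=1.0):
    """Use TCP connect time as a rough RTT proxy; return None if all probes fail."""
    samples = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append((time.monotonic() - start) * 1000)
        except OSError:
            pass  # a lost probe simply shrinks the sample set for this path
    return statistics.mean(samples) if samples else None

def pick_path(paths):
    """Prefer the lowest-latency path that answered any probes at all."""
    live = {}
    for name, (host, port) in paths.items():
        rtt = probe_rtt_ms(host, port)
        if rtt is not None:
            live[name] = rtt
    return min(live, key=live.get) if live else None

print("Preferred path:", pick_path(PATHS))
```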

Autonomous System Diversity – When designing internet connectivity resilience, the goal is to make the links you have as independent from one another as you can. The autonomous system paths of the providers you select are important to examine to be sure they do not depend on one another for transit. A great tool to assist with this is CAIDA’s ASRank, which is helpful for seeing how ASNs relate to one another. Take a look at the ASNs of the providers you are considering to see their relationship. In particular, you likely want to avoid the two ASes having a “customer” or “provider” relationship; ideally, you want them to be “peers”. Unfortunately, that doesn’t 100% guarantee you won’t be affected by something like what happened on August 30th, when AS3356 kept advertising and blackholing, but it gets you about as close as you can to ASNs with no interdependence on one another, and a better chance of survivability.
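
That check can be automated against CAIDA’s published AS relationship data. Here is a small Python sketch that parses the serial-1 “as-rel” format, where each line is `<as1>|<as2>|<rel>` with 0 meaning the two ASes are peers and -1 meaning as1 is a provider of as2; the filename is a placeholder for whichever snapshot you download, and the ASN pair is just an example.

```python
def load_relationships(path):
    """Parse a CAIDA as-rel file into a {(as1, as2): rel} mapping."""
    rels = {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip dataset header comments
            as1, as2, rel = line.strip().split("|")[:3]
            rels[(as1, as2)] = int(rel)
    return rels

def describe(rels, a, b):
    """Describe how two candidate provider ASNs relate to each other."""
    if (a, b) in rels:
        return "peers" if rels[(a, b)] == 0 else f"AS{a} is a provider of AS{b}"
    if (b, a) in rels:
        return "peers" if rels[(b, a)] == 0 else f"AS{b} is a provider of AS{a}"
    return "no direct relationship found"

rels = load_relationships("as-rel.txt")  # placeholder filename
print("AS3356 vs AS174:", describe(rels, "3356", "174"))
```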

Three Connections or More – Many with redundant Internet connections assume two are enough. I would contend that having a third connection, even if it’s a backup-only connection via 4G/5G through a wireless carrier, can save your bacon if the other two carriers are affected by the same outage.

IXPs, CXPs and Cloud Direct Connections – You may want to consider peering into one of the following:

  • Internet Exchange Point (IXP) – You’ll find IXPs all over the world as a means to inexpensively peer networks directly in a multilateral or bilateral peering arrangement. With multilateral peering, you connect to a route server with one BGP peering session and then send and receive routes with everyone else connected to the route server. Bilateral peering is a direct BGP peering relationship with another entity on the exchange. Either way, an IXP lets a network reach regional networks directly without the need for transit, saving money, reducing latency and improving overall performance. Quick plug: I work with the Ohio IX, so if you’re in Ohio, I highly recommend checking them out.
  • Cloud Exchange Points (CXP) or Direct Cloud Connections – As the public cloud becomes more important to IT infrastructures, finding a way to stay directly connected to these resources becomes critical. As with an IXP, connecting to a CXP or via a direct cloud connection to key cloud providers is another opportunity to improve not just redundancy but performance as well.

In closing, it’s difficult to plan for every type of network failure that can occur. This most recent CenturyLink/Level3 outage was one for the books, that’s for sure. All we can do as network engineers is learn from it and strive to build better networks from the lessons we take away.

Thanks for reading! If you’re an Ohio network engineer, be sure to check out a couple of organizations I’m involved with: (OH)NUG and Ohio IX. I might be a little biased but feel they are great resources right in our backyard!