Old School Blue/Green Deployment for IaaS Cloud Deployments, Part I: Theory

Sat Oct 07 2023

The initial version of joelj.ca, launched about a year ago, was built as an experiment to learn infrastructure as code and cloud-based deployments by leveraging Azure’s PaaS offerings. Gradually, the architecture evolved for cost-saving purposes, since all that cool cloud infrastructure was a little pricey to run. In January, I shifted the site onto a plain old Linux VM provisioned through Akamai’s IaaS offering, Linode. The physical architecture was about as simple as you could get, and a reboot of the host for patching would yield some downtime. This was good enough for my purposes at the time, since I’d set the project aside to focus on other things.

Recently, I was inspired to make the architecture more resilient while sticking to the IaaS model to keep costs down. First and foremost, I wanted to be able to deploy new code without discernible downtime. I also wanted to keep my hosts up to date, again without taking the site down. The vision was to provision a new VM with up-to-date software and bring it into my pool of active servers with a simple configuration change in version control. Old servers could simply be removed from the pool and discarded rather than patched and restarted, an approach that closely resembles immutable infrastructure.

The solution was effectively an implementation of the blue/green deployment pattern, with routing at both the DNS and reverse proxy layers. Shell scripts and declarative infrastructure tooling enabled the deployment of code and the configuration of hosts to be fully automated. This series is intended to document the solution and provide a viable template for web application deployments in an IaaS cloud environment.

This first part describes the theory behind Blue / Green, some abstract physical architectures, and their potential pitfalls. In part 2, I’ll describe the details of my implementation, including observability and monitoring considerations.

Starting Point

A naive web application architecture resembling my original site would look something like this:

Here, a single physical host serves the website. DNS points directly to the sole web server. Deploying code might bring your site down depending on which web server application you’re using, and how well you’re leveraging its features. Restarting the host for a kernel patch will certainly bring the site down for at least the duration of the power cycle. It will be tricky to test a deployment before making it live (though not impossible with some creativity).

Blue / Green Theory

Blue / Green is a deployment pattern that tries to solve the pitfalls described above. I’ll start with someone else’s definition and break it down. The first result I got when googling “blue green deployment” was from Red Hat, but their definition actually describes a rolling deployment (a more advanced extension of Blue / Green), which is inaccurate, so I skipped it. The definition from Amazon (https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/bluegreen-deployments.html) is a lot better:

A blue/green deployment is a deployment strategy in which you create two separate, but identical environments. One environment (blue) is running the current application version and one environment (green) is running the new application version. Using a blue/green deployment strategy increases application availability and reduces deployment risk by simplifying the rollback process if a deployment fails. Once testing has been completed on the green environment, live application traffic is directed to the green environment and the blue environment is deprecated.

Putting it into easier words for smaller brains like mine: We’re gonna duplicate our web server. We’ll direct all our production traffic to the first, and deploy a new build to the second. After successfully testing the deployment on the second server internally, we point production traffic to it. Now we can invert the roles, rinse and repeat.

The weasel words in abstract definitions of Blue / Green (mine included) refer to “routing” or “directing” production traffic. How this is achieved is really up to your specific implementation. The textbook definition uses the DNS server as the routing layer:

Here, the DNS entry for your domain is initially configured to point to Web Server 1. When you want to switch traffic over, you update the entry to point to Web Server 2. This works very well for most purposes, but there are some pitfalls. The most significant is DNS caching. Looking up a DNS record adds overhead to a web call, so most clients will cache the entry. Consequently, an update to the entry will not be picked up by returning visitors until their cached copy expires. You can influence the cache duration through the entry’s Time To Live (TTL), but this is a balancing act: shorter durations add more overhead to your clients’ web calls and increase load on the DNS server, while longer durations increase latency for the switchover. Five minutes would be considered a very short DNS TTL, but it’s perhaps a long time to recover from a bad deployment.
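As a rough illustration of the knob involved (a BIND-style zone entry with placeholder names and addresses; your DNS provider’s interface may look different), the switchover is just repointing one record, and the TTL governs how long clients may keep serving the old answer:

```
$TTL 300                        ; cache answers for at most 5 minutes
www    IN  A    203.0.113.10    ; blue: the currently live web server
; www  IN  A    203.0.113.20    ; green: swap which record is active to switch traffic over
```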

To make things faster, we’ll have to move up to Layer 7 of the network stack and use a Reverse Proxy instead. A Reverse Proxy acts as a middleman that forwards HTTP requests from the web browser to your upstream web server. Since the target upstream server is a configuration item, we can alter it to direct traffic. A good reverse proxy like Nginx will gracefully drain existing connections after a configuration reload rather than terminating them, so the change can be made without downtime. Reverse proxies also usually double as load balancers, so we may as well take full advantage of the two physical hosts and distribute the load. Here’s what the architecture looks like now:

At the top of the diagram, we see that our DNS entry now points to the Reverse Proxy / Load Balancer. At the bottom, we see that both web servers now have a blue and a green “slot”. A “slot” is an instance of the application bound to a specific port, which can be deployed to independently of the other slot. So blue / green no longer refers to a specific host, but rather to a port on every participating node. This enables us to scale out to an arbitrary number of web servers. The Load Balancer’s configuration specifies which port to use across the pool of backend servers, and switchover is simply updating that port from the blue slot’s to the green slot’s (or vice versa).
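To make that concrete, here’s a minimal sketch of what the Load Balancer’s backend pool might look like in Nginx. The private addresses and slot ports are placeholders, not my actual configuration:

```nginx
# Every web server runs a blue slot on port 8080 and a green slot on 8081.
# Whichever port this upstream points at is the live slot.
upstream live_pool {
    server 10.0.0.11:8080;   # change 8080 -> 8081 on every line to cut over to green
    server 10.0.0.12:8080;
}
```

After editing the port, `nginx -s reload` (or `systemctl reload nginx`) applies the change gracefully: new worker processes pick up the updated pool while the old workers finish serving their in-flight requests.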

Also of note is that the Load Balancer is doing TLS termination (HTTPS), and communication with the backend server pool happens over plain, unencrypted HTTP. This is safe since all servers sit on a dedicated VLAN of trusted hosts. Encrypted communication to the backends could still be supported, but it would introduce unnecessary overhead and complicate the deployment. It might be a necessary evil where hosts cannot be isolated to a dedicated network.
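Here’s a hedged sketch of what the termination point might look like, reusing the live_pool upstream from the sketch above; the certificate paths and server name are placeholders, and I’m assuming certificates are already provisioned on the load balancer:

```nginx
server {
    listen 443 ssl;
    server_name example.com;

    # TLS terminates here, at the load balancer
    ssl_certificate     /etc/ssl/certs/example.com.fullchain.pem;
    ssl_certificate_key /etc/ssl/private/example.com.key;

    location / {
        # Traffic onto the trusted backend VLAN stays plain HTTP
        proxy_pass http://live_pool;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```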

With the proxy in place, switchover can now be done almost instantaneously, but we’ve introduced a new problem: the Load Balancer is now a single point of failure. What if we wanted to patch the load balancer itself?! The solution is to add redundancy to the Load Balancer as well:

This new diagram is kind of a mashup of the previous two. Switchover between the redundant Load Balancers is back to using DNS. We can accept the higher latency here, since patching activities should be relatively infrequent and can be planned well in advance. Switchover for application deployments can still be managed independently through the Reverse Proxy configuration.

Conclusion

Originally, this whole piece was planned as a single post covering both the Blue / Green background and my implementation. But I ended up with over 1000 words of theory alone and decided it would be better to break it up. In part two, I’ll discuss how I implemented the final architecture diagram with my Linode hosts, and my tooling for monitoring the health of the deployment. The automation / CI/CD toolchain leverages a fun combination of PowerShell for Linux, Ansible, Azure Key Vault, and GitHub Actions. The web stack itself consists of Angular Universal (Node.js), PM2, and Nginx. To be continued.