
Azure BlueGreen Deployment Challenges and Learnings

24 Jul 2016
The following article describes some of the high-level approaches and challenges with implementing BlueGreen deployment on Azure.

Introduction

Having implemented BlueGreen deployments on AWS with OctopusDeploy, I recently had the chance to implement a similar solution on the Microsoft Azure stack. In this case we were deploying Sitecore 8.2 on .NET to multiple Azure regions with automated failover.

The following article describes some of the high-level approaches and challenges.

Beware of Preview

The first challenge with Azure is the number of features still in Preview. Preview terms and conditions do not allow production use, and preview features are largely unsupported. Always check the documentation, as Azure is not always clear about what is in preview and what isn't.

We started implementing auto-scaling using Azure scale sets but ran into numerous reliability issues. In the end we moved to availability sets and a fixed number of servers. Azure still allows you to scale by updating the stack, but it's more of a manual process. I was surprised how far behind Azure is with this feature; AWS has an advantage here with a much more mature auto-scaling solution which has been in place for many years.

Of course Azure does have fully managed Web Apps as an alternative to running servers, similar to AWS Elastic Beanstalk. Unfortunately it's not yet suited to products like Sitecore, but the move away from hosted servers is coming fast.

There Can Be Only One

One thing to be aware of is that Azure is in the middle of migrating between its legacy web portal and the newer portal. Likewise the PowerShell Cmdlets are migrating from the legacy APIs to the new AzureRM APIs. If you don't know which you are using, all AzureRM Cmdlets are prefixed with AzureRM. The old mechanism for switching between modes is gone.

Each set of Cmdlets has a different authentication implementation, and you should never mix and match when looking to automate. It's important to use the Cmdlets prefixed with AzureRM everywhere or you will run into issues, particularly (I found) with SQL Azure. When mixing APIs, SQL Azure often got stuck not recognising that deleted databases were in fact deleted: one set of Cmdlets would report that a database still existed, while the other insisted it had been deleted.

AzureRM is the future and legacy APIs are being phased out. There are gaps in the AzureRM APIs, but they are catching up fast and most things are possible. Finding documentation is often the hardest challenge.

Applications Not Users

Authentication in Azure is a mess, especially via PowerShell. It's possible to have the same email address associated with an organisation or with a Live account. This allows you to sign into completely different accounts with the same email via the portal, but Azure PowerShell isn't that smart or helpful in these scenarios. Legacy Azure PowerShell and AzureRM behave differently depending on what type of account you are using and whether it will let you authenticate from PowerShell. You can't always authenticate with the credentials you want, and the errors will leave you with no clue. You may want to throw your computer out of the window. Resist. There is hope.

Live accounts can't be used from PowerShell with AzureRM, but accounts associated with an organisation can. Most of the docs suggest downloading and using the PublishSettingsFile. This is a legacy solution which installs a certificate on the local machine and allows authentication with Azure under your account. It also contains a lot of sensitive credential information linked to your account, so it's not great to check into source control or leave lying around on servers. I find it doesn't work well with automation, and I don't want my personal credentials used everywhere in scripts. It's also unlikely that a client you are consulting for will give you an organisation account, so this can be a problem when it comes to automation with AzureRM.

Instead I found it's possible to set up an AD Application with public/private tokens and use that to authenticate with AzureRM from PowerShell. This bypasses the need for an organisational service account and works just fine. Note this doesn't work with the legacy Azure PowerShell Cmdlets, so you must commit to using AzureRM everywhere. And yes, sometimes it hurts!

Setting up an AD Application allows you to authenticate with a username/password from PowerShell. The password is temporary and lasts up to 2 years. It's not linked to a personal account, so it's great for automation. I use PowerShell SecureString encryption to protect the application password and keep it in source control. More on this another time.

To set up the AD Application via the legacy portal (a scripted sketch follows this list):

  • Navigate to Active Directory in the legacy Azure portal
  • Select the AD associated with your subscription
  • Select Applications and choose Add
  • Provide a Name and leave "WEB APPLICATION AND/OR WEB API" checked
  • Add any unique URI for "SIGN-ON URL" and "APP ID URI"
  • Once created, click on Configure
  • The "Client ID" is the service username to use for authentication
  • Under Keys, create a new key with a duration of 2 years
  • The key shown is the service password used for authentication
  • In the new Azure Portal you can now select your named application and assign permissions
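
If you'd rather script this step, the AzureRM Cmdlets can create the application and service principal directly. The following is a minimal sketch rather than a tested recipe; the display name, URIs, role and password are placeholders, and depending on your AzureRM version the -Password parameter may expect a SecureString:

PowerShell
# Sketch: create the AD Application, its service principal, and grant it
# rights over the subscription (Contributor is used here as an example)
$appPassword = "replace-with-a-generated-secret"

$app = New-AzureRmADApplication -DisplayName "deployment-automation" `
                                -HomePage "https://deployment-automation.example" `
                                -IdentifierUris "https://deployment-automation.example" `
                                -Password $appPassword

New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId

# Role assignment can fail immediately after creation while AD propagates,
# so a short wait or retry may be needed
New-AzureRmRoleAssignment -RoleDefinitionName "Contributor" `
                          -ServicePrincipalName $app.ApplicationId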

To authenticate AzureRM in PowerShell, use the credentials above as follows:

PowerShell
# Build a PSCredential from the AD Application key, using the Client ID as the username
$password = ConvertTo-SecureString -String $servicePassword -AsPlainText -Force
$credential = New-Object -TypeName System.Management.Automation.PSCredential `
                         -ArgumentList $serviceUserName, $password

# Sign in as the service principal rather than a user account
Login-AzureRmAccount -Credential $credential `
                     -SubscriptionId $subscriptionId `
                     -TenantId $tenantId `
                     -ServicePrincipal

Note that SubscriptionId and TenantId are optional, but you may run into problems with multi-tenant and multi-subscription accounts if they aren't specified, and the errors won't tell you why.

Herding Cattle

If you believe in Cattle not Pets, then you should consider scripting your environment stacks and keeping them organised. ResourceGroups are used for this purpose and are the equivalent of Stacks in AWS.

Templates in Azure are like AWS CloudFormation templates, using a declarative JSON format. The documentation is poor, and they are mainly described by example on GitHub. It's also possible to reverse engineer templates from within the portal itself, but you'll need to rework these. It is, however, a good way to learn.

As with AWS, the best way to automate provisioning is by passing a template and parameters (answer file) to the API. You can also pass in key/value tags to be associated with the ResourceGroup, and assign tags to individual resources within the template. Tags can be a useful way of passing variables to provisioning scripts running on servers; for example, they can be used to bind an OctopusDeploy Tentacle to a particular environment name during an automated install. Tags shouldn't be used to pass secrets.

PowerShell
Create-AzureResourceGroup -infrastructureTemplate $infrastructureTemplate `
    -templateParams $paramsFilePath `
    -resourceGroupName $resourceGroupName `
    -rgRegion $location `
    -tags @{Name="environment";Value=$Environment}
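
Note Create-AzureResourceGroup is a wrapper function rather than a built-in Cmdlet; under the hood the provisioning boils down to something like the following sketch using the standard AzureRM Cmdlets (the tag format may vary with your AzureRM version):

PowerShell
# Sketch: create (or update) the ResourceGroup with tags, then deploy the template
New-AzureRmResourceGroup -Name $resourceGroupName `
                         -Location $location `
                         -Tag @{Name="environment";Value=$Environment} `
                         -Force

New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName `
                                   -TemplateFile $infrastructureTemplate `
                                   -TemplateParameterFile $paramsFilePath `
                                   -Verbose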

Debugging errors with templates is hard, but there is some verbose JSON output which can be parsed. It's often easier to look in the portal under the Deployments tab for details of the error.

The following is a crude way to log better errors within a try/catch block if provisioning fails:

PowerShell
# Pull the status message from the most recent failed operation in the ResourceGroup
(Get-AzureRmLog -Status Failed -ResourceGroup $resourceGroupName -DetailedOutput).Properties `
                         | ?{ $_.Content -and $_.Content["statusMessage"] } `
                         | %{ $_.Content["statusMessage"] } | Select-Object -First 1

BlueGreen Sauce

Once you treat infrastructure as code by implementing AzureRM templates and automating everything, you have Cattle and not Pets. Implementing BlueGreen then becomes easier, and is a great way to deploy releases with zero-downtime. It allows you to safely rollout changes to code, content and even infrastructure.

BlueGreen involves spinning up a replica environment and rotating the entire stack into production, usually via a DNS change. As DNS is heavily cached throughout the network and on devices, it works best when implemented behind a CDN. When using a CDN the public facing DNS remains fixed and only the origin DNS records are switched behind the scenes. The changeover is therefore seamless as CDNs are good at respecting TTLs, unlike browsers. Remember to also automate clearing the CDN cache on go-live.

Azure does have Azure DNS in the pipeline to provide a scriptable DNS solution, like Route 53 on AWS. This will provide better ways to implement BlueGreen on Azure once it comes out of preview. In the meantime there's TrafficManager, a DNS-based solution that allows you to route traffic to different endpoints. In this case it allows switching routing between the Blue and Green environments for deployments.

To implement BlueGreen with TrafficManager, set up two TrafficManager instances to represent Production and PreProduction. In each TrafficManager instance set up two endpoints using the external endpoint option, which lets you route traffic to a given URL. Point one endpoint at your Blue stack and the other at the Green stack. On the TrafficManager you choose as Production, disable the Blue endpoint to make Green the Production environment. On the PreProduction TrafficManager, disable the Green endpoint to make Blue PreProduction.

Next create a PowerShell script to automate swapping the enabled/disabled endpoint status settings. The script should update and synchronise the configuration across both TrafficManagers. It's also a good idea to create a utility PowerShell function for looking up the current Production or PreProduction colour. This can be used to automate tearing down PreProduction when not in use, or when targeting deployments. It avoids human error and keeps the current BlueGreen status as a single source of truth.
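
A minimal sketch of the swap using the AzureRM TrafficManager Cmdlets (the profile and endpoint names here are placeholders). Enabling the new endpoint before disabling the old one keeps at least one endpoint live in each profile:

PowerShell
# Sketch: flip the live colour. Afterwards Production routes to Blue and
# PreProduction routes to Green (profile/endpoint names are placeholders).
Enable-AzureRmTrafficManagerEndpoint  -Name "blue" -Type ExternalEndpoints `
    -ProfileName "production-tm" -ResourceGroupName $resourceGroupName
Disable-AzureRmTrafficManagerEndpoint -Name "green" -Type ExternalEndpoints `
    -ProfileName "production-tm" -ResourceGroupName $resourceGroupName -Force

Enable-AzureRmTrafficManagerEndpoint  -Name "green" -Type ExternalEndpoints `
    -ProfileName "preproduction-tm" -ResourceGroupName $resourceGroupName
Disable-AzureRmTrafficManagerEndpoint -Name "blue" -Type ExternalEndpoints `
    -ProfileName "preproduction-tm" -ResourceGroupName $resourceGroupName -Force

# Utility: look up the current Production colour from the enabled endpoint
function Get-ProductionColour {
    $tmProfile = Get-AzureRmTrafficManagerProfile -Name "production-tm" `
                     -ResourceGroupName $resourceGroupName
    ($tmProfile.Endpoints | Where-Object { $_.EndpointStatus -eq "Enabled" }).Name
}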

This scratches the surface of implementing BlueGreen. It's important to rotate the entire stack during BlueGreen deployment, including databases and content. This has many challenges outside the scope of this article, but allows seamless go-live and rollback capability when done right. If you are modifying your production data during BlueGreen then you are doing something wrong.

Confidence To Fail

Long gone are the days of complex disaster recovery plans and standby servers. If you look after your Cattle and adhere to good practices such as treating infrastructure as code, then there is nothing to fear. Obviously it's still vital to look after your assets and keep your data redundant, but keep as much as possible in source control. The golden rule is never manually log in to a server and configure anything! Once the hard work is done, scaling out to multiple data-centres is easy. At its simplest it's a variable change in your scripts.

Azure offers some great SQL replication features to push your data around the world as read-only copies. This can all be automated via PowerShell and baked into your deployment sequence. We use SQL Azure replication to spin up Sitecore delivery instances in other locations around the world and implement BlueGreen across all of them.
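
As a sketch of what that automation can look like, the AzureRM SQL Cmdlets can establish a readable secondary in another region; the server, database and resource group names below are placeholders:

PowerShell
# Sketch: create a read-only secondary of a content database on a server
# in another region (names are placeholders)
New-AzureRmSqlDatabaseSecondary -ResourceGroupName "sitecore-primary-rg" `
                                -ServerName "sitecore-primary-sql" `
                                -DatabaseName "Sitecore_Web" `
                                -PartnerResourceGroupName "sitecore-secondary-rg" `
                                -PartnerServerName "sitecore-secondary-sql" `
                                -AllowConnections "All"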

Once you have more than one data-centre, a second tier of TrafficManagers can be used to automatically fail over between locations if health checks in one location fail. This can be implemented using a priority mode TrafficManager. Alternatively, use performance mode to automatically route traffic to the data-centre closest to the user; again, if a location becomes unhealthy it will automatically divert traffic to the next healthy location. When using priority mode you can also set up the lowest priority endpoint as a fallback to a static holding page.
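
A sketch of a priority mode profile with a static holding page as the last resort (DNS names, targets and the health check path are placeholders):

PowerShell
# Sketch: failover profile across locations. The lowest Priority endpoint
# receives traffic while its health checks pass (names are placeholders).
New-AzureRmTrafficManagerProfile -Name "global-failover" `
    -ResourceGroupName $resourceGroupName `
    -TrafficRoutingMethod Priority `
    -RelativeDnsName "mysite-global" -Ttl 30 `
    -MonitorProtocol HTTPS -MonitorPort 443 -MonitorPath "/healthcheck"

New-AzureRmTrafficManagerEndpoint -Name "primary-region" -Type ExternalEndpoints `
    -ProfileName "global-failover" -ResourceGroupName $resourceGroupName `
    -Target "mysite-primary.example.com" -EndpointStatus Enabled -Priority 1

New-AzureRmTrafficManagerEndpoint -Name "holding-page" -Type ExternalEndpoints `
    -ProfileName "global-failover" -ResourceGroupName $resourceGroupName `
    -Target "holding.example.com" -EndpointStatus Enabled -Priority 100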

The Cake Is A Lie

One of the problems with automation on Azure is that it tends to lie to you. No seriously. Lies.

An item can be reported as deleted, but Azure won't let you create a new one with the same name immediately; it will tell you the resource already exists. In fact I still don't know how long the wait is before you can re-provision a deleted item under the same name. I usually wait 5-10 minutes or it will complain.

This behaviour can cause problems with automation scripts if you frequently tear down and re-provision resources. SQL Azure seems to be the biggest culprit with its server names. For databases it's usually fine, so it wasn't a major issue for us outside of testing our templates.
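
One way to cope in automation is a generic retry wrapper around the provisioning calls; this is an illustrative helper rather than anything from our actual scripts:

PowerShell
# Illustrative helper: retry a provisioning call while Azure catches up
# with its own deletes
function Invoke-WithRetry {
    param(
        [scriptblock]$Action,
        [int]$MaxAttempts = 6,
        [int]$DelaySeconds = 120
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            if ($attempt -eq $MaxAttempts) { throw }
            Write-Warning "Attempt $attempt failed: $_"
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}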

Reaching Out

A benefit of Azure Load Balancers is that all outbound traffic originates from the Load Balancer IP address, not the IPs of the individual servers. This can simplify setting up firewall rules between services, and is useful when you need to leverage IP restrictions.

When using OctopusDeploy Tentacles in listening mode, I'd recommend using NAT rules on the Load Balancer. Setting up NAT routing to each server on a unique port allows the OctopusDeploy server to connect to each Tentacle. You'll also need to allow the OctopusDeploy server IP inbound access on these ports. Firewall rules can all be automated via PowerShell provisioning or baked into templates. We look up the Octopus server IP address on the fly during provisioning and add a specific firewall rule for the current server.
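
A sketch of adding one such NAT rule with the AzureRM Cmdlets (the load balancer name, rule name and front-end port are placeholders; 10933 is the default Tentacle listening port). The rule still needs binding to the VM's NIC, or can be declared in the template instead:

PowerShell
# Sketch: expose one server's Tentacle through the Load Balancer on a
# unique front-end port (names and ports are placeholders)
$lb = Get-AzureRmLoadBalancer -Name "web-lb" -ResourceGroupName $resourceGroupName

Add-AzureRmLoadBalancerInboundNatRuleConfig -LoadBalancer $lb `
    -Name "octopus-web-0" `
    -FrontendIpConfiguration $lb.FrontendIpConfigurations[0] `
    -Protocol Tcp -FrontendPort 10940 -BackendPort 10933

Set-AzureRmLoadBalancer -LoadBalancer $lb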

Dude Where's My Data

A crucial part of any repeatable deployment process that involves a database is resetting the initial state to avoid pollution. In the case of Sitecore we needed to reset all of the content databases to a recent copy of production at the start of every deployment.

There are two approaches to automating this: either backup/restore the databases via scripts, or use the Azure database copy feature. Azure database copy is substantially quicker, and we found backup/restore quite fragile.
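
The copy itself is straightforward with the AzureRM SQL Cmdlets; a sketch with placeholder names, dropping the old test database before copying in a fresh one from a standby server:

PowerShell
# Sketch: reset a test content database from a recent production copy
# (server, database and resource group names are placeholders)
Remove-AzureRmSqlDatabase -ResourceGroupName "test-rg" -ServerName "test-sql" `
                          -DatabaseName "Sitecore_Web" -Force

New-AzureRmSqlDatabaseCopy -ResourceGroupName "standby-rg" `
                           -ServerName "standby-sql" `
                           -DatabaseName "Sitecore_Web" `
                           -CopyResourceGroupName "test-rg" `
                           -CopyServerName "test-sql" `
                           -CopyDatabaseName "Sitecore_Web"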

In our case we had split the production and non-production environments across two subscriptions for added security. Unfortunately database copy doesn't work across subscriptions. To minimise our deployment times, we set up scheduled tasks to backup/restore the production databases overnight to a standby server in the non-production subscription, then used database copy during all our test environment deployments. For production deployments we use copy against production directly.

Summary

Azure is not without its challenges and seems behind AWS in terms of maturity in many areas. It is however vastly improving its Cloud offerings, with many features in the pipeline. Once out of preview, these features will better position Azure alongside AWS in terms of its Cloud server offering, with better support for auto-scaling and self-healing.

Having implemented the same conceptual deployment pipeline on both Azure and AWS, it's reassuring to know that the principles of Continuous Delivery can be applied anywhere and hold true. Automation is the key, but sometimes the hardest challenge is achieving stability and repeatability with vendor APIs. Be prepared to build in retries to handle unexpected failures.

When successfully implemented, Continuous Delivery and single-click deployments are hugely satisfying. The rewards are lower risk and faster speed to market for the business. It does however require a large investment to reach maturity.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect
United Kingdom
Mike Carlisle - Technical Architect with over 20 years experience in a wide range of technologies.

@TheCodeKing
