The technology value stream is often viewed within the context of development and Agile environments, but with the help of a book or two (or a dozen) I’ve recently begun to think about it in terms of operations and infrastructure. When you analogise the term to the Thames, Seine, Yangtze, Mississippi or any other great river for that matter, there is always a head water, or source, and a mouth where the river eventually spills out into a wider body of water; typically an ocean.
In this example let’s assume the so-called ‘head water’ is analogous to the architecture of the platform on which a given solution runs. The ‘mouth’ is the actual deployment of the solution to the greater outside world and its consumption by its customers.
Like pollution that starts at the head water of a river and eventually ends up in our oceans, so too do the efforts of mismanaged, ill-organised, and self-focused teams of architecture, engineering, and operations negatively affect downstream developers and, eventually, the ocean of users we depend on for our very existence.
How does this happen?
More often than not it’s because solutions are built within silos that cohorts downstream don’t want, don’t need, or, in the worst cases, can’t work with in order to achieve success. What started off with good intentions ends up becoming someone else’s blocker.
This pollution eventually manifests itself in the guise of slipped delivery times, overly complex systems which require an act of God to institute change, and, perhaps worst of all, outages and/or security breaches, which we all know lead to a decrease in customer satisfaction and trust and an eventual decline in either revenue or productivity. If one is lucky, it’s not catastrophic.
Does your organisation pollute its own river? Are you talking to your clients or customers, be they internal or external, on a regular basis and establishing feedback loops?
If not, then it’s likely that you are the source of your own pollution.
Throughout the great state of Texas you will find signs along its highways and byways which simply exclaim “Don’t Mess with Texas!” A lot of people think the message is the in-your-face embodiment of American bravado, or that of Texas in particular. It isn’t. It’s an anti-pollution slogan.
“Don’t Mess with the Technology Value Stream!”
The recent announcement of the Azure Service Fabric Preview has become an increasingly popular topic of discussion across the board, not only amongst developers, who stand to benefit most directly, but also amongst those focused on infrastructure, business delivery, and operations. To these groups the concepts may not come so readily as they are abstracted from BAU. I certainly fell into this category. By deconstructing something traditional and familiar I was able to paint a picture that made sense and that could be communicated to a variety of audiences.
Using the attached whiteboard I try to explain in four high level snippets. I’ve taken some liberty when describing how IIS works for illustrative purposes.
1. The Traditional Landscape
The story begins with the three tier web application model. You normally have a SQL backend, an application, or logic, secondary tier, and a web front end tier manned by IIS – at least in the MSFT world. It is easy to consider an IIS server as a single entity. After all it’s just a server that serves web pages, right?
Well, yes and no.
2. The Anatomy of IIS
Traditionally when we expected traffic to increase we needed to build up the various tiers of our solution which meant adding additional servers – physical or virtual – to the relevant tier. In this example the tier that needs to be expanded is commonly known as the Web Front End tier which is the entry point for any client trying to access www.Uber.com* for example.
As it turns out an IIS server often performs multiple functions. For example, when you install IIS you have the option to also include services such as SMTP (for sending and receiving email), MSMQ (for managing messages and queues between tiers of a given solution), .NET (or a multitude of other platforms), and the like. So, you see, an IIS server often does much more than serve up web pages.
3. The Scalability Challenge
So now we have a server that in reality is running a number of supporting services. During times of increased traffic we might choose to light up additional IIS servers. Whether those servers are physical or virtual the problem is the same: management. Patching, avoiding so-called configuration drift, accommodating continuous integration and deployments, maintaining security, etc.
In the case of a high fidelity solution it may be that the web service itself isn’t the bottleneck. Maybe we need to scale out one of the supporting services like MSMQ or the .NET logic tier.
Either way we’re deploying additional servers, sometimes just to handle these small but critical additional services.
4. The Azure Service Fabric and the Microservice paradigm
Depending on who you ask the Azure fabric can mean many things. It’s the control plane. It’s the data plane. It’s the ‘brain’ of the Azure platform. For me it’s all of the above. And it’s powerful.
The Azure Service Fabric erodes the physical and virtual machine footprint for solutions built cloud-first from the ground up. These services can now be deployed in a roaming, ‘head-less’ state throughout the world. A developer can ‘hook’ into the fabric and access the service of her or his choice, and operations doesn’t have to be concerned with yesterday’s problems; not all of them, anyway. Better yet, these services, when used optimally, can be self-healing and self-spawning.
When correctly designed and architected your PaaS based, Service Fabric integrated solution should match or exceed the general SLA of Azure at 99.95% at a fraction of the cost of traditional H/A deployments.
Spoiler Alert: For those of you too busy to read this post the answer is an emphatic YES.
Azure is in the midst of a paradigm shift as it evolves from Azure Service Manager (ASM) to Azure Resource Manager (ARM) – so-called Azure v1 and Azure v2.
Many organisations have looked to Azure (and other cloud providers) as the solution to a quick exit from costly traditional datacentre contracts and private cloud solutions. Often these efforts take the form of a ‘lift and shift’ exercise which, whilst serving the tactical requirement of a quick exit, quite often fails to deliver on the more strategic vision of moving to the cloud in order to take advantage of its highly touted flexibility, elasticity, and cost efficiency. In other words, it’s the classic case of garbage in, garbage out.
Be that as it may, many programmes of migration that I have seen are continuing to deploy into ASM, and one must question that strategy now that ARM is becoming more and more mature with each passing sprint. For smaller organisations vNet to vNet connectivity between ASM and ARM has been possible for a while, so the co-existence of the two has been a reality since last year. For larger organisations, however, ExpressRoute (Microsoft.Network/expressRouteCircuits/) has been a considerable blocker. ExpressRoute is the facility by which an organisation can privately connect a traditional datacentre with Azure without traversing the internet. Until recently one had to create TWO ExpressRoute circuits to achieve co-existence between ASM and ARM. Because ExpressRoute circuits have a monthly cost associated with them (varying depending on model, size, and provider) this simply equated to twice the cost for network connectivity.
Well, no more. MSFT has recently announced that it will soon be possible to have both ASM and ARM networks running over a single ExpressRoute circuit. So, given this new integration, does it still make sense to keep deploying onto ASM, or does it make sense to step back, re-assess, and invest in the necessary changes to target ongoing migration efforts to ARM? In my opinion it is the latter for three overriding reasons:
- Financial Accounting
- Security
- Management and Operations
Many large organisations operate under a shared services model in which a centralised body offers IT services to the rest of the organisation and its business units, and they often wish to ‘charge back’ these services. With ASM this is near impossible at any granular level within a single subscription, not to mention across multiple subscriptions or tenants. With ARM, however, metadata TAGGING is the order of the day, where one is able to TAG a resource to a specific owner. In fact, one can TAG at the most minute of levels; an individual NIC, for example. More common is the desire to have different workloads from different departments running under the same subscription and still retain the ability to split out the accounting for each resource and issue a separate bill of services for each department, division, or business unit. ARM enables this.
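As a rough illustration of the charge-back idea, the sketch below rolls up cost by a tag. It is not an Azure API call: the resource names, the `costCentre` tag key, and the cost figures are all invented for the example, and a real bill would come from an export of the subscription's usage data.

```python
from collections import defaultdict

# Hypothetical usage records. Each resource carries ARM-style tags and a
# monthly cost; names, tag keys, and figures are invented for illustration.
usage = [
    {"resource": "nic-web-01", "tags": {"costCentre": "Marketing"}, "cost": 4.20},
    {"resource": "vm-web-01", "tags": {"costCentre": "Marketing"}, "cost": 95.00},
    {"resource": "sql-fin-01", "tags": {"costCentre": "Finance"}, "cost": 210.50},
    {"resource": "vm-orphan", "tags": {}, "cost": 30.00},
]


def chargeback(records, tag_key="costCentre"):
    """Sum cost per tag value; resources without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        owner = record["tags"].get(tag_key, "untagged")
        totals[owner] += record["cost"]
    return dict(totals)


bill = chargeback(usage)
for owner, total in sorted(bill.items()):
    print(f"{owner}: {total:.2f}")
```

The 'untagged' bucket is a deliberate choice here: anything that escapes tagging still shows up on someone's radar rather than silently disappearing from the bill.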
Microsoft has always embraced the concept of least-privileged access, meaning that someone only has the rights necessary to perform a specific operation or function. This evolved into Role Based Access Control, or RBAC, which is more granular and has made its way through the Microsoft stack from Active Directory to Exchange to SharePoint to System Center, etc., and now it has come to Azure. In ASM one had a very limited set of roles to choose from:
- Owner – full management rights
- Contributor – full management rights except for user management
- Reader – view resource rights
This was a major constraint from a security perspective and did not follow the ‘least privilege’ mantra. For example, both the owner and contributor roles could delete a resource; there was no separation of duty. As an illustration of how things have now changed (this list is probably not fully up to date), these are some typical roles one would find within ARM and its service providers:
- API Management Service Contributor
- Application Insights Component Contributor
- BizTalk Contributor
- ClearDB MySQL DB Contributor
- Data Factory Contributor
- DocumentDB Account Contributor
- Intelligent Systems Account Contributor
- New Relic APM Account Contributor
- Redis Cache Contributor
- SQL DB Contributor
- SQL Security Manager
- SQL Server Contributor
- Scheduler Job Collections Contributor
- Search Service Contributor
- Storage Account Contributor
- User Access Administrator
- Virtual Machine Contributor
- Virtual Network Contributor
- Web Plan Contributor
- Website Contributor
Clearly more granular. Clearly more secure.
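To make the least-privilege point concrete, here is a minimal sketch of how scoped roles narrow what an identity can do. The role names are taken from the lists above, but the action strings and the wildcard matching are simplified assumptions for illustration, not the actual Azure role definitions.

```python
from fnmatch import fnmatchcase

# Hypothetical, heavily simplified role definitions. The "Example.*" action
# strings are invented; real role definitions should be read from Azure itself.
ROLES = {
    "Owner": ["*"],
    "Reader": ["*/read"],
    "Virtual Machine Contributor": ["Example.Compute/virtualMachines/*"],
    "Storage Account Contributor": ["Example.Storage/storageAccounts/*"],
}


def is_allowed(role, action):
    """Return True if any action pattern granted to the role matches."""
    return any(fnmatchcase(action, pattern) for pattern in ROLES.get(role, []))


# A VM-scoped role can manage VMs but cannot touch storage accounts.
print(is_allowed("Virtual Machine Contributor", "Example.Compute/virtualMachines/restart"))
print(is_allowed("Virtual Machine Contributor", "Example.Storage/storageAccounts/delete"))
```

With only broad roles, the second call would have to succeed for anyone who needed the first; with scoped roles the two duties can be separated.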
From an authorisation perspective ARM is hands down the better platform. Indeed, from an identity perspective ARM now only supports Azure Active Directory and has deprecated X.509 certificate authentication which some will agree is more secure and others, myself included, will agree is more convenient.
MANAGEMENT and OPERATIONS
Perhaps the core difference between Azure v1 and Azure v2 is that of the REST APIs underlying each. A lot of customers don’t realise that ASM is over five years old. In cloud years that’s a long time.
ARM brings with it many advantages like those already discussed but perhaps the most significant and underappreciated advantage would have to be its ‘pluggable’ model which means that future services can be seamlessly integrated. The term ‘future proof’ is almost always itself a misnomer but with ARM we get as close to it as possible.
Parallelism is a major improvement. In order to deploy 10 VMs under ASM you had to create one VM at a time. This was true whether you were scripting the process with PowerShell or performing it through the portal. With ARM, however, all 10 VMs are provisioned at the same time. Equally, it is much faster to decommission a complex environment: a development environment, for example.
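The difference is easy to feel with a toy model. The sketch below does not call Azure at all: `provision_vm` is a hypothetical stand-in that simply sleeps, and the thread pool only mimics the effect of a platform that accepts all ten requests at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def provision_vm(name):
    """Hypothetical stand-in for a provisioning call; sleeps to simulate the wait."""
    time.sleep(0.05)
    return f"{name}: ready"


names = [f"vm-{i:02d}" for i in range(1, 11)]


def deploy_sequentially(vm_names):
    # One VM at a time: total time grows with the number of VMs.
    return [provision_vm(n) for n in vm_names]


def deploy_in_parallel(vm_names):
    # All requests in flight together: total time is roughly one VM's wait.
    with ThreadPoolExecutor(max_workers=len(vm_names)) as pool:
        return list(pool.map(provision_vm, vm_names))


start = time.perf_counter()
sequential = deploy_sequentially(names)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = deploy_in_parallel(names)
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both approaches end with the same ten machines; only the elapsed time differs, which is the whole point when an environment has to be stood up or torn down quickly.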
The enablement of Infrastructure as Code (IaC) is critical to achieving the elasticity and operations of any modern day cloud platform. Simplistically the cloud is asking us to reduce a datacentre into a collection of files…a lot of files. What ARM provides is a framework with which to couple, decouple, and otherwise associate these files through the JSON syntax allowing complex environments to be described and deployed in a declarative fashion. This is important because it allows for the creation of reusable templates that describe various aspects of an infrastructure that can quickly be deployed and updated and moves your organisation one step closer to the holy grail: Automation.
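A minimal sketch of what 'a datacentre as files' looks like in practice: the Python below builds a small ARM-style template as a dictionary and renders it as JSON. The virtual network name, address ranges, and tag are invented for the example, and the schema URL and `apiVersion` are the values I believe were current for early ARM templates, so check them against the documentation before deploying anything.

```python
import json

# Declarative description of one virtual network. Names, address ranges, and
# the tag are hypothetical; schema URL and apiVersion are assumed values.
template = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "vnetName": {"type": "string", "defaultValue": "vNetDev"},
    },
    "resources": [
        {
            "type": "Microsoft.Network/virtualNetworks",
            "apiVersion": "2015-06-15",
            "name": "[parameters('vnetName')]",
            "location": "[resourceGroup().location]",
            "tags": {"environment": "development"},
            "properties": {
                "addressSpace": {"addressPrefixes": ["10.0.0.0/16"]},
                "subnets": [
                    {"name": "default", "properties": {"addressPrefix": "10.0.0.0/24"}}
                ],
            },
        }
    ],
}

rendered = json.dumps(template, indent=2)
print(rendered)
```

Because the template says what should exist rather than how to create it, the same file can be parameterised and reused for each environment, which is where the reusable-template benefit comes from.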
What I am starting to hear now on a regular basis from various forums, both public and private, colleagues’ work in the field, and even some senior level MSFT architects is that now is the time to start integrating ARM into your plans. If you’re just getting started the choice is clear: ARM is the future, go with ARM.
If, however, you find yourself in the middle of a global ‘lift and shift’ and have a good way to go before your regional datacentres are fully decommissioned, I believe now is the time to step back, re-assess, and invest in ARM integration. At the very least roll out the prerequisites: set up additional subscriptions if required, configure your ExpressRoute links, connect your vNets, and deploy supporting services like domain controllers if you’re in a hybrid situation. This will give you the option to choose whether to move future workloads into ASM or ARM and, in future cases, prevent the dreaded ‘double hop’ whereby instead of going directly to your target state you are forced to migrate everything TWICE. It’s costly, it’s complicated, and it takes up valuable resources.
One Final Thought….
The story of the ‘big bang’ approach to migrating from ASM to ARM is still being written and much of it is closely guarded. What we’ve heard, however, is that the transition largely revolves around metadata and therefore in theory could eventually be achieved without downtime. In other words, you would call up MSFT, tell them you’re ready to convert your subscription, they click a box on the back end, and voilà! Job done.
Realistically, however, when you consider the interdependencies among the components of even the most basic multi-tiered web application you quickly realise that this nirvana may be a bit further out of reach than at first glance.
**Some services have not yet been transitioned to ARM but these are few and the list keeps dwindling. ASM should be targeted only in specific cases where a required service does not yet exist in ARM.
Some time ago I was asked to design an Azure Defined Datacentre.
The overriding design principle was that its footprint must resemble what one would typically find in a traditional ‘terrestrial’ deployment that has a large dependency on developing in-house line of business (LOB) applications. That is to say that it had to consist of four independent environments, or tenants: development, testing, staging, and production.
With today’s emphasis on continuous integration and agile development this concept of segregating four identical environments can be cumbersome to the application lifecycle and deployment management and there are perhaps better models to follow, however the decision had been made.
Sometimes ours is not to reason why, ours is but to do and design. And so the stage had been set.
At the highest level the business requirements were:
- Extend the current terrestrial environment into Azure
- Create four logically separated environments as listed above
- Security and billing boundaries must exist between the environments for auditing and consumption control
- It must allow for resiliency and failover by stretching the solution across two geographic Azure regions
Keep in mind that all of this was prior to the general availability (GA) of some of the capabilities of Azure Resource Manager so I was forced to make certain decisions then that I probably wouldn’t make now. But such is the case when you find yourself living with the breakneck cadence of change that is inherent when working with the greatest public cloud platform on earth (MSFT you can find my billing details under my Xbox account to reward me for that shout out). Or any public cloud offering for that matter…my colleagues who are devoted to AWS and OpenStack I am sure face the same realities.
The subscription taxonomy was most notably affected. At the time I was unable to use tagging or implement Resource Groups in order to achieve requirement #3 above. The unavailability of UDR (User Defined Routing) also posed a challenge but I was able to get around that with the help of a well-known third party virtual appliance provi…..oh, hell, I’ll just say it: It was my good friends at Barracuda Networks that helped me get around this issue. We needed a way to ensure that traffic amongst subscriptions and their respective VNets stayed within the fabric and did not have to ‘bounce down and up’ an ExpressRoute circuit in order to be routed.
And so it began with this high concept scribble (on my Surface Pro 3, of course).
Not much to look at but it gave a 40,000-foot, high-level view and provided two things:
- I didn’t have to exercise my Visio OCD and spend hours getting boxes to line up perfectly (yeah, I know, there are only four, but those of you who would understand – understand).
- I quickly realised that even though these environments were to be logically separated, some services (name resolution, authentication and authorisation, certain file services, etc.) would need to be shared. After all it would be a nightmare to deploy four separate forests and their dependencies in order to support the solution.
So far so good. But how do we tether back to HQ?
Fortunately, as I was sweating over site-to-site VPNs, gateways, and the like, MSFT made an important announcement: ExpressRoute (#XR) now supported multiple subscriptions hanging off of the same dedicated circuit, albeit in a piggy back fashion.
This would allow us to carve up our 1 Gbps connection into #XR ‘lanes’ for the purposes of bandwidth assignment. Whilst it’s beyond the scope of this ramble to dive into #XR at length, there are essentially two options for #XR: connecting it through a network service provider (NSP) or an exchange provider (amazingly I have no acronym for this one). Depending on your points of termination within your terrestrial datacentres you’ll have the option to go with one or the other or both. The primary differentiator is the bandwidth that will be available to you. We had to go through an NSP and so had a maximum of 1 Gbps to work with.
We now had the subscription taxonomy, knew that we were going to share some core services, and use #XR to connect back into our two terrestrial datacentres. It was time to go back to the drawing board and further flesh things out.
For the next evolution I started to drill down into more specifics as I began to think about the wider eco-system and other ongoing efforts that were in motion and how they would fit into the picture: O365 adoption and SSO for third party services for example.
Besides being wonderfully colourful the scribble to the left revealed a number of things:
- Per the requirements it captured the four environments and their associated subscriptions: SubProd, SubStage, SubTest, and SubDev
- It introduced the concept of wrapping each subscription around its own virtual network (VNet), or Super VNets as they would later become known
- It depicted an area for core services mentioned above although these seem to be ‘hanging’ in the ether at the moment
- I also realised that we would need some form of DMZ (duh), not unlike the traditional terrestrial DMZs we all know and love, to house things like a web application proxy (WAP) or a *spoiler alert* web application firewall (WAF)
- Azure Active Directory (AAD) had to be included because I knew that O365 was coming down the pike and there were murmurs of Power BI
- Ancillary capabilities like Azure SSO & MFA were called out to light up third party web based applications like SalesForce, SAP, and the like.
This was only half of the picture, of course, as there was to be another mirrored deployment across the English Channel to fulfil the resiliency, high-availability, and business continuity requirements.
Now that we had all or most of the components represented I had to somehow string them all together in a fashion that provided the following:
- Efficiency in routing for both performance and cost control. I knew that any egress traffic out of the Azure datacentres would incur costs as well as any traffic flowing between regions. For this reason, it was imperative that intra-region, inter-subscription traffic be confined to the fabric and not require traversal of #XR to reach its intended target.
- Robust network analytics must be made available regardless of where traffic was being generated or consumed
- A security model that could be easily surfaced and ideally visually represented
- Management that could quickly be adopted by traditional network engineers
This posed, and still poses in some cases, manageability issues with regards to Azure’s data and control planes. While Network Security Groups (NSGs) and Access Control Lists (ACLs) can do most in the way of network segregation there is no easy way to fully configure them outside of the beloved PowerShell. Even then the management and operational handover of the architecture is a challenge because unless you have a network engineer who is keen to learn PowerShell (anyone?) or dive into User Defined Routing (UDR) Azure-style then it will be difficult to transfer ownership in such a way as to achieve Operational Acceptance (OA).
Analytics also pose a challenge. This may have changed by now but at the time there was no way to robustly inspect traffic to and from VNets and/or subscriptions in the way that one might wish to do so today with any number of physical devices. Never forget your network team.
Enter the virtual network appliance:
Widely used interface that is familiar to network engineers and architects? Check.
Easily mapped visual of the underlying architecture? Check.
Robust analytics for security auditing and anomaly detection? Check.
Able to route entirely within the fabric? Check.
And so we arrive at the final evolution. No longer a scribble but an actual Visio diagram.
If you are going through a fast paced sprint into the cloud and you don’t have a visual representation of what your cloud looks like, get one. I cannot emphasise this enough. A picture truly is worth a thousand words…or in the case of an Excel spreadsheet – 10,000 rows.
So here are the characteristics of my blueprint and what will eventually be built:
- Azure Core Virtualised Network (ACVN) for Azure Region I which will more or less be duplicated in Azure Region II
- Four Azure subscriptions each ‘wrapped’ by a Super vNet representing development, test, staging, and production
- Each subscription and its corresponding vNet have dedicated subnets (vSnets) which house virtual network appliances that will govern traffic to and from subscriptions, provide analytics, and provide a familiar interface for my network engineers
- vNetProd contains a dedicated vSnet for a DMZ that will at first house a WAP, or WAF, to facilitate ADFS and other services
- vNetProd also contains a dedicated Service Layer vSnet which houses our virtual network appliances. To start with these are firewalls but the service layer is capable of expanding to include other appliances such as WAN optimisation products (Riverbed, for example) and others.
- The cornerstone of each subscription is a network appliance which allows for intra-regional traffic to be contained within the fabric, and within its own subscription if so desired, without the need to traverse the #XR circuit (disregard any model numbers as they were used as placeholders)
- vNetProd contains a dedicated vSnet to house core, or foundation, services like domain controllers, ADFS servers, file services, or anything else that makes sense to share amongst subscriptions.
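One practical consequence of the blueprint above is that the four vNets and their subnets need an address plan that never overlaps, otherwise the appliances cannot route between them. The sketch below checks such a plan with Python's standard `ipaddress` module. Every address range is an assumption made up for illustration, as are the vNetStage, vNetTest, and vNetDev names (only vNetProd is named in the blueprint).

```python
import ipaddress
from itertools import combinations

# Hypothetical address block for one Azure region, split into one /16 per vNet.
region_block = ipaddress.ip_network("10.0.0.0/14")
environments = ["vNetProd", "vNetStage", "vNetTest", "vNetDev"]
vnets = dict(zip(environments, region_block.subnets(new_prefix=16)))

# Hypothetical /24 subnets inside vNetProd for the DMZ, the service layer
# (network appliances), and the shared core services.
prod_subnet_names = ["dmz", "service-layer", "core-services"]
prod_subnets = dict(zip(prod_subnet_names, vnets["vNetProd"].subnets(new_prefix=24)))


def overlapping_pairs(networks):
    """Return the names of any two networks whose address ranges overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


for name, network in vnets.items():
    print(f"{name}: {network}")
for name, network in prod_subnets.items():
    print(f"  vNetProd/{name}: {network}")
print("overlaps:", overlapping_pairs(vnets) + overlapping_pairs(prod_subnets))
```

The same check could be run against the mirrored plan for the second region before anything is deployed.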
And there you have it! From scribbles to reality.
Clearly the behind the scenes story was much more involved: countless discussions, heated debates, loathed politics, the involvement of my friends at both Microsoft and Barracuda (as well as other big players in the network space), collaboration between myself and my colleagues at Network-Insight and MonoConsultancy, and of course the direction of the world class Azure Programme team in Redmond and around the world.
All in all it was a great journey and I look forward to the next one which is already underway….so stay tuned AND STAY TETHERED.
Q: What does Azure ExpressRoute provide that a rival public cloud provider (Contoso) does not?*
A: An SLA! [Service Level Agreement]
Yes, that’s right. Apparently Contoso’s DirectAttach can fail at any time for any reason. As an avid proponent of that ‘other’ public cloud, Azure, I have to say that I was shocked when I heard this. So much so that I am hoping that someone jumps on this post and tells me that it’s not true.
Can you imagine? The DirectAttach link between your regional corporate hub and the Contoso Cloud, which after a painful migration houses mission critical applications that require the link to refer to data sitting in an on-premises data warehouse, goes dark. Potentially millions of clients, sales, or widgets are lost as a result of the outage.
What do you do? I guess you call up the network exchange provider or Contoso and ask ‘Hey, what just happened?’
I suppose they just shrug and reply, ‘What? We never guaranteed it was going to be available. What’s the problem?’
Try explaining why you signed that contract. Ouch.
Now maybe it’s not quite so dark. Someone will likely mention architecting redundancy into the solution. But at what cost? Clearly redundancy is not optional when you choose Contoso, which offers a Service Level Agreement of ZERO.
*I’m sure there are other benefits in addition to having the widest global footprint of any public cloud provider on Earth. #Azure #XR