Mission-critical services
It is said that a chain is only as strong as its weakest link, and this has never been more true than in the world of mission-critical cloud services.
But what actually is a cloud service? Typically, any service made available to users over the Internet from a cloud computing provider’s servers fits this definition. Mission-critical cloud services simply have one additional requirement: they have to keep working no matter what happens to the underlying server architecture.
In the following sections we dissect cloud services from the perspective of redundancy and failover.
Dreaded Single Point of Failure (SPOF)
The typical architecture of a cloud service consists of several layers:
Hardware layer
Infrastructure layer
Application layer
Each layer consists of multiple components, each with a specific role. A malfunction in any of them can have effects ranging from a performance penalty to complete failure of the service; in the latter case the component is a Single Point of Failure. For mission-critical services, having any SPOF is not an option.
Rule number one in this context is that anything can and will fail. Rather than trying to design components that never fail (which is almost impossible in practice), we let them fail in a controlled manner and have other components take over their role to sustain the service.
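The idea of letting a component fail while another takes over its role can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the names call_with_failover, AllReplicasFailed, broken_replica and healthy_replica are hypothetical, and a real system would add timeouts, health checks and retries with backoff.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could handle the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a failing component is expected, so move on
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for request {request!r}")


# Hypothetical replicas: the first one is down, the second one is healthy.
def broken_replica(request: str) -> str:
    raise ConnectionError("replica unavailable")


def healthy_replica(request: str) -> str:
    return f"handled {request}"


if __name__ == "__main__":
    # The caller never sees the failure of the first replica.
    print(call_with_failover([broken_replica, healthy_replica], "ping"))
```

The service fails only when every replica fails, which is the property the rest of this article tries to achieve at each layer.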
Monolithic approach versus microservices
A typical monolithic application is one large application that bundles multiple services and is deployed on a pair of identical servers behind a load balancer. Although the complexity of such a solution is lower, it has multiple Single Points of Failure: one heavily loaded service can overload both servers, which can lead to lost or delayed messages; the hardware and infrastructure layers are not fault-tolerant; and the database can fail or suffer a heavy performance penalty. None of this is acceptable for mission-critical services.
Scalability is also an issue, as large monolithic applications take more time to prepare and deploy, which hurts the ability to scale dynamically with load.
A much better approach, which addresses these issues, is to migrate to containerized microservices with automatic scaling, distributed across multiple physical or virtual servers. A spike in requests to one service can no longer overload an entire server, because the load is distributed across multiple instances of the service and the cloud manager spawns more instances if needed.
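To make the effect of distributing a spike concrete, here is a small sketch that spreads requests over service instances in round-robin order. Round-robin is only one possible strategy and is an assumption here, as are the function name distribute_round_robin and the instance names; the point is simply that the per-instance load drops as instances are added.

```python
from itertools import cycle
from typing import Dict, Sequence


def distribute_round_robin(request_count: int, instances: Sequence[str]) -> Dict[str, int]:
    """Return how many requests each instance receives under round-robin dispatch."""
    if not instances:
        raise ValueError("at least one instance is required")
    load = {name: 0 for name in instances}
    targets = cycle(instances)  # endlessly repeat the instance list in order
    for _ in range(request_count):
        load[next(targets)] += 1
    return load


if __name__ == "__main__":
    # Hypothetical spike of 1000 requests, before and after scaling out.
    print(distribute_round_robin(1000, ["svc-1", "svc-2"]))
    print(distribute_round_robin(1000, ["svc-1", "svc-2", "svc-3", "svc-4"]))
```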
Hardware and infrastructure layers
At the bottom of any cloud service are the machines that provide its processing power, together with the network infrastructure it needs in order to operate. These can be physical machines deployed in data centers or virtual machines from a third-party provider.
Which one is better, and which one can fail? Remember rule number one: both can fail. Whether you run state-of-the-art physical servers in the most modern data center or rely on the best third-party provider, outages can happen and become a Single Point of Failure for your services.
How do we prevent that?
The key is to distribute the hardware and infrastructure across multiple providers and zones, so that even a large-scale event affecting an entire data center won’t disrupt your service.
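A rough calculation shows why distribution helps. The sketch below uses the standard formulas for components in series (all must work) and in parallel (one is enough). It assumes that failures are independent, which real zones and providers only approximate, and the 99% figure is purely illustrative, not a measured value.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must work."""
    return math.prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of independent copies where one working copy is enough."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    zone = 0.99  # illustrative availability of a single zone
    for copies in (1, 2, 3):
        print(copies, "zone(s):", redundant_availability(zone, copies))
    # Three such components chained without redundancy are weaker than any one of them.
    print("chain of three:", serial_availability([zone, zone, zone]))
```

Under these assumptions, two zones at 99% each give roughly 99.99% combined availability, while three non-redundant components in a chain drop to roughly 97%.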
Application layer
The software components of your application typically consist of a set of third-party tools and frameworks (such as a web server, database, message queues, storage services, etc.) glued together by your own custom code.
How can we make sure they don’t fail? We don’t. We actually allow them to fail, and other components take over their role to do the job.
How do we actually do that? The key is containerization with automatic management: our applications are encapsulated in Docker containers and orchestrated with Kubernetes.
This ensures automatic scalability of the platform, as Kubernetes can run more containers when load is high and stop them when they are no longer needed, which also allows better cost control of the infrastructure.
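As an illustration of how such a scaling decision can be made, the sketch below follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: the desired count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. This is a simplification. The real autoscaler also applies a tolerance band and stabilization windows, and the function name, the default limits and the CPU figures used here are assumptions for the example.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,   # assumed lower bound for this example
    max_replicas: int = 10,  # assumed upper bound for this example
) -> int:
    """Simplified replica calculation: ceil(current * observed / target), clamped."""
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("current_replicas must be >= 1 and target_metric > 0")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical target of 60% average CPU utilization per container.
    print(desired_replicas(4, current_metric=90, target_metric=60))  # load is high
    print(desired_replicas(4, current_metric=15, target_metric=60))  # load is low
```

With four containers at 90% utilization against a 60% target, the calculation asks for six; at 15% it scales down to one, which is where the cost control comes from.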
The third-party components we rely upon have their own failover and redundancy mechanisms, built on the same concepts as our entire architecture, to ensure high availability, scalability and performance.