Service Mesh with Istio for Cloud-Native Applications
One of my favorite quotes when designing architecture is: “Everything fails all the time”
Werner Vogels, VP and CTO of Amazon.
He states a simple truth: every architecture will fail sooner or later. So instead of pouring all your energy, time, and money into trying to prevent failure, plan for it. The question is how your architecture handles failures without impacting your users, that is, how resilient your system is.
In the cloud-native application world there is a tool that helps with exactly this need: the service mesh. What is a service mesh? In the book “Mastering Service Mesh”, Anjali Khatri and Vikram Khatri write:
“A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud-native application. In practice, the service mesh’s implementation is an array of lightweight network proxies deployed alongside microservices, without the applications needing to be aware.”
To me, a former network engineer, it looks a lot like Software-Defined Networking (SDN), but service meshes differ most notably in their emphasis on a developer-centric approach rather than a network engineer-centric one. For the most part, today’s service meshes are entirely software-based. The usual next question is: why do I need a service mesh when I already have Kubernetes? Well, those are two different things. Kubernetes schedules and evicts pods, tracks their health, manages labels and selectors, and creates Services. A service mesh adds a layer on top that handles traffic management, security, observability, and more.
The service mesh space is still quite new and continuously evolving. As far as I know, there are three leading service meshes: Istio, Linkerd, and Consul. Here is my quick take: Linkerd, especially version 2.x, focuses on performance, simplicity, and ease of use. Istio has more features but comes with a learning curve. As for Consul, sorry, I don’t have experience with it. It is difficult to rank service meshes, because the right choice depends on each end user's case.
Here at PT. Alto Network, we are trying out Istio because of its many features, such as:
1. Telemetry
Istio generates detailed telemetry for all services within a mesh. Because all communication is routed through Envoy proxies, Istio can gather information such as logs and metrics. With that we can measure, for example, application performance, how often certain features are used, start-up time, and processing time. Istio also supports distributed tracing and integrates with tools such as Zipkin, Jaeger, and Kiali. Unfortunately, the proxies cannot link inbound and outbound requests on their own, so you need to add trace header propagation to your code if you want the spans stitched into complete traces.
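As a rough illustration of what header propagation means in application code, here is a minimal Python sketch. It assumes the B3-style header names listed in Istio's tracing documentation; the function name `propagate_trace_headers` and the sample request are hypothetical, and a real service would pass the returned headers to whatever HTTP client it uses for outbound calls.

```python
# Header names Istio's tracing docs ask applications to forward (B3 style).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(incoming: dict) -> dict:
    """Return only the tracing headers found on the incoming request."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Hypothetical incoming request headers, for illustration only.
incoming = {
    "X-Request-Id": "abc-123",
    "X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",
    "X-B3-SpanId": "e457b5a2e4d86bd1",
    "X-B3-Sampled": "1",
    "Content-Type": "application/json",
}

# Attach these to every outbound call made while handling the request.
outgoing = propagate_trace_headers(incoming)
print(outgoing)
```

Unrelated headers such as Content-Type are deliberately left out, so the outbound request only inherits what the tracing backend needs.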
2. Mutual TLS
One of the challenges with a microservices-based architecture is how to properly secure communication between services. I think it’s hard to manage a certificate for every microservice yourself. First you set up a CA server, then for each service you generate a key, create a CSR file, and send it to the CA server to get the CRT file. Can you imagine? You must repeat all of those steps for each service. This is where Istio comes into the picture. Istio automatically configures workload proxies (sidecars) to use mutual TLS when calling other workloads. By default, Istio uses permissive mode, which means a service accepts both encrypted and plain-text traffic. If you want to accept only encrypted traffic, you can switch to strict mode.
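Switching between the two modes is done with a PeerAuthentication resource. The sketch below builds one as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The helper name `peer_authentication` is hypothetical, and it assumes `istio-system` is the mesh's root namespace (the default install), where a policy named `default` applies mesh-wide.

```python
import json


def peer_authentication(mode: str, namespace: str = "istio-system") -> dict:
    """Build a PeerAuthentication manifest with the given mTLS mode."""
    # Only the two modes discussed in this article are allowed here.
    if mode not in ("PERMISSIVE", "STRICT"):
        raise ValueError("mode must be PERMISSIVE or STRICT")
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        # Assumption: a policy named "default" in the root namespace
        # is treated as the mesh-wide policy.
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


# STRICT makes workloads reject plain-text traffic.
print(json.dumps(peer_authentication("STRICT"), indent=2))
```

A common approach is to start in permissive mode, confirm in the telemetry that all traffic is already encrypted, and only then apply the strict policy.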
3. Circuit Breaking
The circuit breaker is a design pattern for building resilient microservices by limiting the impact of service failures and latency. One of its goals is to handle failures gracefully and prevent cascading failures. Unlike with telemetry, you don’t need to change your code to use this feature; Istio handles it at the proxy (sidecar) level. Many of you might be familiar with Hystrix, a library from Netflix. Hystrix is a Java circuit breaker library designed to enable resilience in complex distributed systems, but you have to add the library to your code. Hystrix is also no longer in active development.
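To make the pattern itself concrete, here is a minimal Python sketch of a circuit breaker. It illustrates the general idea only; it is not how Envoy or Hystrix implement it, and the class name, thresholds, and `flaky` function are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of hitting the unhealthy service.
                raise CircuitOpenError("upstream looks unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    """Hypothetical upstream call that always fails."""
    raise ConnectionError("service down")


demo = CircuitBreaker(max_failures=2, reset_after=30.0)
for attempt in range(3):
    try:
        demo.call(flaky)
    except CircuitOpenError as err:
        print("rejected without calling upstream:", err)
    except ConnectionError as err:
        print("upstream failed:", err)
```

With Istio, the same fail-fast behaviour lives in the sidecar, so none of this bookkeeping has to be written or maintained inside each service.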
4. Fault Injection
What if I told you Istio can make a service in your environment run slowly by adding delays to requests? Why would you want to do that? Well, when you build a distributed architecture, you should never assume it is 100% reliable. You must build in fault tolerance, and Istio lets you test it. You can inject faults at the application layer, under specific conditions, to simulate service failures and higher latency. To me it sounds like Chaos Engineering in practice and, correct me if I'm wrong, Netflix was the one who first really popularized that concept with its tool “Chaos Monkey”.
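A delay fault is configured on a VirtualService. The sketch below builds such a manifest as a Python dictionary and prints it as JSON for `kubectl apply -f`. The helper name, the `payments` host, the `v1` subset, and the `x-chaos-test` header are all hypothetical, and the subset is assumed to be defined in a matching DestinationRule.

```python
import json


def delay_fault_virtual_service(host, subset, header, value,
                                delay="5s", percent=100.0):
    """Build a VirtualService that delays requests carrying a given header."""
    destination = {"destination": {"host": host, "subset": subset}}
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # The "specific condition": only requests with this
                    # header get the injected delay.
                    "match": [{"headers": {header: {"exact": value}}}],
                    "fault": {
                        "delay": {
                            "percentage": {"value": percent},
                            "fixedDelay": delay,
                        }
                    },
                    "route": [destination],
                },
                # Every other request is routed normally.
                {"route": [destination]},
            ],
        },
    }


manifest = delay_fault_virtual_service("payments", "v1", "x-chaos-test", "true")
print(json.dumps(manifest, indent=2))
```

Scoping the fault to a header keeps the experiment away from real users: only requests that opt in with the header see the extra latency.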
5. Dark Release
If you want to release a risky feature in production and, even after testing it on a staging server, you still worry about how it will behave in production, you can use this feature. A dark release is when only specific users are able to access the new version of a service. We can ship the risky feature to production without end users knowing about it, so that only test engineers can reach it. When the feature proves ready in the production environment, we can switch it to live for everyone. But as with telemetry, we need to propagate headers at the code level.
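The routing decision behind a dark release can be modelled in a few lines of Python. This is only a toy model of header-based routing, where rules are checked in order and the first match wins; it is not Istio's API, and the `x-dark-release` header and the `v1`/`v2` subset names are hypothetical.

```python
# Rules are evaluated top to bottom; the first match wins.
ROUTES = [
    {"match": {"x-dark-release": "true"}, "subset": "v2"},  # test engineers
    {"match": {}, "subset": "v1"},                          # everyone else
]


def pick_subset(headers, routes=ROUTES):
    """Return the subset chosen for a request with the given headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for rule in routes:
        # An empty match dict matches every request (catch-all).
        if all(lowered.get(h) == v for h, v in rule["match"].items()):
            return rule["subset"]
    raise LookupError("no route matched")


print(pick_subset({"X-Dark-Release": "true"}))  # request from a test engineer
print(pick_subset({"User-Agent": "browser"}))   # ordinary end user
print(pick_subset({}))                          # header lost between services
```

The last call shows why propagation matters: if a service in the middle of the call chain drops the header, everything downstream falls back to the live version.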
6. And the others…
Istio has many other features, such as canary deployments. You can automate canaries with the Flagger tool to reduce the risk of releasing a feature. For example, you can first send 1% of traffic to the canary. Prometheus monitoring then captures the error results; if they look okay, Flagger moves to the next step with 5% of traffic, and so on until the canary receives 100%. Istio also lets you set ingress and egress controls at layer 7, the application layer, and shift traffic from the old version to the new version.
With so many features, it is no wonder Istio is so popular. I think service meshes will become the new normal in cloud-native architecture, especially on Kubernetes. But I need to be realistic: Istio probably isn’t for everyone, and you need to weigh the pros and cons. Istio is still complex but, as I said before, it is continuously evolving. And with Istio, you can add more layers of security and reliability.
At my company, PT. Alto Network, architecture security is our priority. We are currently going through the PCI-DSS (Payment Card Industry Data Security Standard) process. That’s why our CI/CD pipeline already includes several automated security layers: code analysis, static security testing, and container scanning. After deployment we also run dynamic security testing and interactive security testing. For secret management, we are implementing Vault from HashiCorp. And of course, we are adding a service mesh as well. In summary, we are committed to providing our customers with safe products.
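The step-by-step canary progression can be sketched as a small loop. This is a simplified model and not Flagger's actual algorithm: the function name `run_canary`, the weights after 5%, the 1% error threshold, and the `error_rate_at` callback (standing in for a Prometheus query) are all assumptions made for illustration.

```python
def run_canary(weights, error_rate_at, max_error_rate=0.01):
    """Walk through the canary weights; roll back on the first bad step."""
    history = []
    for weight in weights:
        # Stand-in for querying Prometheus at this traffic weight.
        rate = error_rate_at(weight)
        history.append((weight, rate))
        if rate > max_error_rate:
            return "rolled back", history
    return "promoted", history


# Assumed schedule: the steps after 5% are illustrative only.
WEIGHTS = [1, 5, 10, 25, 50, 100]

# A healthy canary passes every step and gets promoted.
print(run_canary(WEIGHTS, lambda weight: 0.0))

# A canary that starts failing under load is stopped early.
print(run_canary(WEIGHTS, lambda weight: 0.05 if weight >= 10 else 0.0))
```

The benefit of small early steps is that a bad release is caught while only a small share of users is exposed to it.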
Reference book:
- Mastering Service Mesh
Anjali Khatri, Vikram Khatri — March 2020
- Istio Up and Running
Lee Calcote, Zack Butcher — October 2019