What I Learned from Migrating a Large Enterprise Microservices System to Kubernetes
We built a SIEM system, and one of our customers recently required it to run on Kubernetes (K8s), so we did the development work and migrated it.
Here is some advice I'd like to share:
1. Get to Know Your System Well
Cloud Native is not simply moving processes into Docker. You need to fully understand your app and estimate the work of migration and alteration; it may even require a new system/hardware plan.
Internal Service Dependency
Does the service support multiple instances? E.g., some code may have to run only on the primary.
Do services need to start one by one, or can they start concurrently? Do they require other services to be up first?
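If a service must wait for another, an init container can hold its startup until the dependency is resolvable. A minimal sketch, assuming a hypothetical "orders" service that waits for "user-service" (names and images are placeholders):

```yaml
# A minimal sketch: block startup until the dependency's Service DNS resolves.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      initContainers:
        - name: wait-for-user-service
          image: busybox:1.28
          # Loop until the "user-service" Service name resolves in cluster DNS.
          command: ["sh", "-c",
            "until nslookup user-service; do echo waiting; sleep 2; done"]
      containers:
        - name: orders
          image: registry.example.com/orders:1.0
```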
External Service Dependency
You have no control over external services. What if only your own services can run in containers?
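One option in that situation is to give the external service a stable in-cluster DNS name with an ExternalName Service, so in-cluster code does not need to change. A sketch with hypothetical names:

```yaml
# A minimal sketch: the external database stays where it is but gets an
# in-cluster name. "legacy-db" and the hostname are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: legacy-db
spec:
  type: ExternalName
  externalName: db.corp.example.com
```

In-cluster services can then connect to legacy-db as if it were any other Service.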
OS Dependency
Any bash/shell commands, scripts, file reads/writes, or cron jobs?
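Host crontab entries can usually be moved into Kubernetes CronJobs. A minimal sketch; the schedule, image, and script path are hypothetical, and clusters older than v1.21 use apiVersion batch/v1beta1:

```yaml
# A minimal sketch of a host crontab entry moved to a CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
spec:
  schedule: "0 2 * * *"   # 02:00 every day, same syntax as crontab
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: registry.example.com/tools:1.0
              command: ["sh", "-c", "/scripts/cleanup.sh"]
```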
Hardware Dependency
Serial port communication? NIC management?
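If a pod truly needs host hardware such as a serial port, one workaround is to pin it to the node that has the device and mount the device node via hostPath; this needs privileged mode (or a device plugin). A sketch with a hypothetical node label and image:

```yaml
# A minimal sketch: pin the pod to the node with the device and mount
# the device node. Label and image are hypothetical; privileged mode
# (or a device plugin) is required for device access.
apiVersion: v1
kind: Pod
metadata:
  name: sms-gateway
spec:
  nodeSelector:
    hardware/serial: "true"
  containers:
    - name: sms
      image: registry.example.com/sms-gateway:1.0
      securityContext:
        privileged: true
      volumeMounts:
        - name: serial
          mountPath: /dev/ttyS0
  volumes:
    - name: serial
      hostPath:
        path: /dev/ttyS0
        type: CharDevice
```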
2. Step by Step
Rome was not built in a day. Make a reasonable, safe plan.
Phase I: try a subset of services, non-critical (side) business first
Phase II: migrate the core business online
Phase III: migrate as much as possible; the target is to get rid of the old machines
3. System Resource Planning
Be aware that Cloud Native requires more resources to run, especially when your system was carefully designed around specific hardware.
Storage
How much disk space does each service need? What IOPS? Does it require writes, or only reads?
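A PersistentVolumeClaim is a good place to record these answers explicitly. A minimal sketch; the size, access mode, and storage class are hypothetical:

```yaml
# A minimal sketch: a PVC that states size and access mode up front.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: reports-data
spec:
  accessModes:
    - ReadWriteOnce      # ReadOnlyMany if the service only reads
  storageClassName: standard
  resources:
    requests:
      storage: 20Gi
```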
CPU
Is it compute-intensive? What is the typical range? What is the minimum?
Memory
Similar questions as for CPU. Be careful with JVM arguments and memory overcommitment (ref1, ref2, ref3) if your app is Java-based.
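Requests and limits make the answers explicit, and a flag like -XX:MaxRAMPercentage (JDK 10+, or 8u191+) lets the JVM size its heap from the container limit instead of the node's physical memory. A sketch with hypothetical values; tune them from measurements:

```yaml
# A minimal sketch: explicit requests/limits for a Java service, with
# the heap sized from the container limit rather than node RAM.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: report-service
  template:
    metadata:
      labels:
        app: report-service
    spec:
      containers:
        - name: report-service
          image: registry.example.com/report-service:1.0
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 2Gi
          env:
            - name: JAVA_TOOL_OPTIONS
              # Heap = 75% of the 2Gi container limit.
              value: "-XX:MaxRAMPercentage=75.0"
```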
Network
Every reachability and performance rule, both inside the cluster and between the cluster and the outside, should be reviewed and kept unchanged.
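Each reachability rule can be written down as a NetworkPolicy so it survives the migration, assuming your CNI plugin enforces policies. A sketch with hypothetical labels and port:

```yaml
# A minimal sketch: only "gateway" pods may reach "orders" pods on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-allow-gateway
spec:
  podSelector:
    matchLabels:
      app: orders
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway
      ports:
        - protocol: TCP
          port: 8080
```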
4. Application Changes
The app's design was coupled to the OS/hardware, so parts of the program may need to be rebuilt or reconsidered.
OS Monitor
Talking to the OS directly, as you did before, is no longer possible.
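The usual replacement is to move host-level monitoring into a per-node DaemonSet instead of in-process OS calls. A sketch using Prometheus node-exporter (the image tag is only an example):

```yaml
# A minimal sketch: one monitoring pod per node, reading the host's
# /proc and /sys through hostPath mounts.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
```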
Logging
Logging is no longer just writing messages to a local file.
App Services Management
The way you control services changes completely.
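Start/stop/restart scripts give way to declarative control: you declare a replica count and health probes, and the kubelet restarts failing containers for you. A sketch, assuming a Spring Boot Actuator health endpoint (the path and port are assumptions):

```yaml
# A minimal sketch: health is declared as probes; the kubelet replaces
# your old service-control scripts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: registry.example.com/user-service:1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 30
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 15
```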
Network Management
Forget it: pod networking is handled by Kubernetes and the CNI plugin, not by your app.
Hardware communication
Generally, you don't want to keep the old machines around any longer.
Our Practices
Before Cloud Native
A multi-node, mixed-style distributed system:
Java APPs
About 50 services implemented with Spring Cloud. Some services neither run well with multiple instances nor are stateless, and some require other services to be running first.
Nginx
Relational Database
PostgreSQL with pgpool.
Redis
Big data platform
Hadoop, Hive, Spark, Kafka, Elasticsearch.
3rd-party Service APIs
Operating System (commands, bash/shell, file r/w, cron jobs)
Reading hardware to generate things like licenses; information-security-related data; implicit resources such as “getClass().getClassLoader().getResource”.
Hardware
Serial port for an add-on SMS card.
After Cloud Native
Four-fifths of the services were migrated to K8s (Services and Pods); half of them needed small code changes
System components able to run on K8s: Nginx (Ingress); PostgreSQL; Redis
The big data stack remains in a traditional deployment (VMs)
Lesson Learned
At the beginning, we tried to remove Eureka entirely, which was a mistake. Eureka was the basis of our microservices, something every Java service needed; removing it from even one service required several config changes, and sometimes code changes. Even worse, we realized some services were too hard to convert quickly, which meant Eureka could not be removed for the moment. Two weeks were wasted this way.
Build one common Docker image, and load the per-service differences dynamically from remote storage.
We first tried building a customized image for each service on top of openjdk-alpine, which led to two critical problems:
Tons of images added up to a huge install package, even though many of them were actually very similar.
openjdk-alpine is indeed small. However, it has critical debugging problems (ref4, ref5). For use cases that don't require frequent elastic scaling, one big common image is not an issue. By the way, Docker has a great image-shrinking tool called docker-slim, and K8s Ephemeral Containers is still in alpha as of today.
So we turned to building one common image for all services and moved the runtime (app JARs, JDK, shared libs, Python, etc.) to network storage. A global ConfigMap keeps every environment value the same across services.
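Putting the pieces together, the pattern looks roughly like this in manifest form; the NFS server/path, ConfigMap name, and JAR layout here are hypothetical:

```yaml
# A minimal sketch: every service runs the same image, pulls its runtime
# from network storage, and shares one global ConfigMap.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: app
          image: registry.example.com/common-runtime:1.0  # same image for all services
          command: ["sh", "-c",
            "exec /runtime/jdk/bin/java -jar /runtime/apps/user-service.jar"]
          envFrom:
            - configMapRef:
                name: global-env    # one ConfigMap shared by every service
          volumeMounts:
            - name: runtime
              mountPath: /runtime
              readOnly: true
      volumes:
        - name: runtime
          nfs:
            server: nfs.example.com
            path: /exports/runtime
```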
We keep logging to a “local file” for now, by adding the hostname to the file name format.
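A pod's hostname defaults to the pod name, so the logging framework's hostname variable already works; another option is injecting the name via the Downward API. A minimal sketch with hypothetical variable and path names:

```yaml
# A minimal sketch: each replica writes its own log file, named after
# the pod, on shared storage.
apiVersion: v1
kind: Pod
metadata:
  name: user-service
spec:
  containers:
    - name: app
      image: registry.example.com/user-service:1.0
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # LOG_FILE resolves per pod, e.g. /logs/user-service-<pod-name>.log
        - name: LOG_FILE
          value: "/logs/user-service-$(POD_NAME).log"
```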
Network changes have to be made.
The direct way to get the client IP is broken after the migration, by K8s Service design. An extra LB is required if you need it, as we do.
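If the load balancer supports it, setting externalTrafficPolicy: Local on the Service also preserves the client source IP by skipping the extra SNAT hop, at the cost of only routing to nodes that host a backing pod. A sketch:

```yaml
# A minimal sketch: a LoadBalancer Service whose pods see the real
# client IP instead of a node-SNAT address.
apiVersion: v1
kind: Service
metadata:
  name: gateway
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: gateway
  ports:
    - port: 443
      targetPort: 8443
```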
Client mode of spark-submit does not work from inside a K8s pod, because the server can't reach back into the pod. You need to change the app to use cluster mode and pass all environment values at startup (client code that fetches environment values on the fly will no longer make sense).
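A rough sketch of what the submit step can look like, assuming Spark on YARN and wrapping spark-submit in a K8s Job; the image, env names, and JAR path are all hypothetical:

```yaml
# A rough sketch: force cluster mode and pass every needed variable
# explicitly to the driver (YARN application master) and executors.
apiVersion: batch/v1
kind: Job
metadata:
  name: submit-report-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: submit
          image: registry.example.com/spark-client:1.0
          command:
            - spark-submit
            - --master
            - yarn
            - --deploy-mode
            - cluster
            - --conf
            - spark.yarn.appMasterEnv.APP_ENV=prod
            - --conf
            - spark.executorEnv.APP_ENV=prod
            - /jobs/report.jar
```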
NEXT
The rest of the migration
Some system components are still in traditional deployments, or not yet well tested on K8s (Redis, PostgreSQL, big data).