First major Kubernetes problem


A few days ago we had our first major Kubernetes problem. We've been fiddling with it for almost 3 months, so it had to come eventually :)

It was quite a journey and an opportunity to learn about some low-level knobs and gears of K8s.

A few words on the setup

We've got a few clusters, all hosted on Azure; some are AKS and some are installed on plain Azure VMs and managed by Rancher. Not all clusters are in the same Azure region (this turned out to be important for the problem).

Rancher installs clusters using a tool called RKE and manages them through pods called cattle-agents. It's a simple setup where etcd lives on the master nodes along with all the other Kubernetes management services, and all user pods run on compute nodes.
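If you're curious what that looks like in practice: on an RKE master node the control plane shows up as plain Docker containers. A rough sketch of what docker ps prints there (the container names are what RKE used in our setup):

# docker ps --format '{{.Names}}'
etcd
kube-apiserver
kube-controller-manager
kube-scheduler
kubelet
kube-proxy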

One important thing is that the nodes (master and compute) need to "call back" to Rancher over HTTPS. Our first approach was to put Rancher on a public IP address (due to problems with connectivity and SSL certs).

The problem occurred on one of the clusters installed by Rancher on Azure VMs. Fortunately, all of these clusters are still only used for development and learning, so a few hours of outage was no big deal :)

What happened

We finally got time to change the Rancher setup: move Rancher to a private network and change its hostname.

We are already familiar with:

  • internal load balancers/ingress controllers,
  • auto-generated Let's Encrypt certs on K8s (cert-manager FTW!),
  • we've got peerings set up between Azure VNets,

so the whole idea of moving to a private network looked pretty easy, though the devil is in the details.

We put up another ingress controller on a private IP and configured the new hostname.
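For reference, exposing an ingress controller on a private IP in Azure mostly boils down to a LoadBalancer Service with the internal-LB annotation. A minimal sketch (the namespace, labels and IP are placeholders, not our real values):

# kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-internal
  namespace: ingress-nginx
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  loadBalancerIP: 10.0.0.10
  selector:
    app: nginx-ingress
  ports:
    - name: https
      port: 443
      targetPort: 443
EOF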

Rancher seemed to work, so the next step was to drop the ingress controller on the public IP.

After deleting the public Azure LB, Rancher reported errors communicating with the clusters. The idea was to reinstall masters and nodes, as it's pretty easy with Rancher: just add a new master and delete the old one. It's easy when everything communicates without problems...

We installed a second master node with Rancher, but it wasn't able to fully connect to the cluster. I removed it, and then all hell broke loose :-D

Before going further - what really happened is that we found out the hard way that Azure Global VNet Peering isn't THAT global: connections to an internal Azure LB from a peered VNet in another region simply do not work.
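Once we knew what to look for, the symptom was easy to reproduce: from a node in a peered VNet in another region, connections to the internal LB frontend just hang until they time out. Roughly like this (the IP is made up):

# curl --connect-timeout 10 https://10.0.0.10/
curl: (28) Connection timed out after 10001 milliseconds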

Back to the disaster:

In the meantime, I switched Rancher back to a public Azure LB/ingress controller and pointed the domain records there.

Main problem: etcd

After the second master stopped working, etcd on the only remaining master stopped responding. It took some time to figure this out, because initially I thought the errors from Rancher (that it couldn't connect to etcd) were caused by network connection problems and not by etcd itself.

In our setup, etcd lives on the master nodes inside a Docker container beside Kubernetes (it's not a K8s pod!).

etcd was logging raft errors saying it couldn't connect to the second node.
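Since etcd is just a Docker container here, its logs are one docker logs away; ours were full of messages along these lines (the peer ID and IP are made up, and the exact wording depends on the etcd version):

# docker logs --tail 20 etcd
... rafthttp: health check for peer abc123def456 could not connect: dial tcp 10.0.0.5:2380: i/o timeout ...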

Unfortunately, etcd didn't accept any connections. The TCP backlog was full, and I couldn't interact with it in any way. The diagnostic tool etcdctl couldn't connect either:

# etcdctl member list
Error:  context deadline exceeded
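The "backlog is full" part can be seen with ss: for a listening socket, Recv-Q is the current length of the accept queue and Send-Q is its limit, so Recv-Q sitting at (or above) Send-Q means the process isn't accepting connections at all. Something like this, on the standard etcd client port (the numbers and address are illustrative):

# ss -lnt '( sport = :2379 )'
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
LISTEN  129     128     10.0.0.4:2379        *:*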

What I tried:

  • installing the second master node again from Rancher - it couldn't register in the cluster because etcd was unresponsive (d'oh!),
  • bringing up the IP address of the second master node on the same machine, in the hope of getting management working - no luck. etcd uses SSL and verifies keys, so it logged that the second node had the wrong certificates and it couldn't connect to it. Management still not working.

I couldn't change the data in etcd without connecting to it, but the etcd peers were visible in the args of the etcd container.
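You can check those args without touching the container (the container name etcd is what RKE uses on our nodes; the member names and addresses below are made up):

# docker inspect etcd --format '{{join .Args " "}}'
... --initial-cluster etcd-master1=https://10.0.0.4:2380,etcd-master2=https://10.0.0.5:2380 ...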

How do you change the args of a running Docker container?

You can't. Sort of :)

It's a bit tricky, but possible (a rough sketch of the commands follows the list):

  • stop dockerd (note that ALL containers on the node stop with dockerd!),
  • edit the config.json of the etcd container (I removed the uninitialized node from the args),
  • start dockerd.
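On the master node it looked roughly like this; the exact file name and location depend on the Docker version (on newer Docker it's config.v2.json), so treat it as a sketch rather than a recipe:

# ID=$(docker inspect etcd --format '{{.Id}}')    # grab the container ID while docker is still up
# systemctl stop docker                           # stops ALL containers on this node
# vi /var/lib/docker/containers/$ID/config.json   # remove the dead peer from the "Args" array
# systemctl start docker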

Still no luck: no second master in the args, but etcd still tried to connect to it and still didn't accept connections.

After some googling, I found the argument --force-new-cluster. Another dockerd stop, a quick edit of config.json and... etcd works!

Don't forget to remove --force-new-cluster afterwards (which means yet another restart of dockerd ;).
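Before removing it, it's worth making sure etcd really answers again and no longer complains about the dead peer; the same two commands from before are enough for that:

# etcdctl member list         # should respond now and list only the surviving member
# docker logs --tail 5 etcd   # no more raft/peer connection errors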

Now everything is working and the nodes can communicate with Rancher. But none of the planned changes were actually made (neither 'Rancher on a private IP' nor 'new domain for Rancher').

Aftermath

  1. We managed to work around the Global VNet Peering & internal LB problem with a VPN.
  2. There's still one thing left to do - change the Rancher domain.

    For now, the nodes still connect to the old Rancher domain. I couldn't find how to change it in the Rancher docs, so it may become a task for another post-mortem blog entry ;)

Hope you liked it - it's always fun to read about someone else's failures :)

Maybe someone will be in similar trouble and this post will make their life (and outage) a little easier.


Daniel Fenert