Thursday, 27 October 2016

Connections to Kubernetes Service lost at an interval (23-24 hours)

We are currently experiencing a problem with our Kubernetes deployment where client applications consuming our web services are unable to connect to Pods via Kubernetes Services at a set interval (about every 24 hours). The rest of the time our applications work as expected. We suspect that something in kube-proxy, which routes traffic from the Service to the Pods, is the culprit, but we are unable to find the root cause. Simply deleting the kube-proxy pods on the nodes so that the proxy pods are recreated does not seem to restart the proxy or fix the issue either.
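For reference, this is roughly what we do to bounce the kube-proxy pods. It is a minimal sketch using the Python kubernetes client; the k8s-app=kube-proxy label selector is an assumption about how the proxy pods are labelled in kube-system, so adjust it to your cluster.

    # Minimal sketch: delete the kube-proxy pods so the kubelet recreates them.
    # Assumes kube-proxy runs as pods in kube-system labelled k8s-app=kube-proxy.
    from kubernetes import client, config

    config.load_kube_config()          # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod("kube-system", label_selector="k8s-app=kube-proxy")
    for pod in pods.items:
        print("deleting", pod.metadata.name, "on node", pod.spec.node_name)
        v1.delete_namespaced_pod(pod.metadata.name, "kube-system",
                                 body=client.V1DeleteOptions())

One thing we are not sure about: if kube-proxy runs as a static pod from a kubelet manifest, deleting its mirror pod through the API does not restart the actual process on the node, which might be why deleting the proxies appeared to have no effect.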

We would appreciate help/ideas on the following:

  1. Troubleshooting ideas for finding the root cause (we have tried the obvious documented routes); see the probe sketch after this list.
  2. How do we restart Kubernetes without any service interruption?
  3. What should we check in our configuration? Is there anything scheduled every 24 hours (updates, rotations, etc.) that could cause this and that we might have missed?
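To make point 1 more concrete, this is the kind of probe we are thinking of running from a client machine, so the exact failure time can be correlated with kube-proxy and conntrack events on the nodes. The URL is a placeholder for one of our REST Services, and the timeout and polling interval are just assumptions.

    # Minimal probe sketch: poll a REST Service and log exactly when connections
    # start timing out. SERVICE_URL is a hypothetical placeholder.
    import time
    import requests

    SERVICE_URL = "http://telemetry-rest.default.svc.cluster.local/healthz"  # placeholder

    while True:
        start = time.time()
        try:
            r = requests.get(SERVICE_URL, timeout=5)
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "OK", r.status_code,
                  "%.3fs" % (time.time() - start))
        except requests.exceptions.RequestException as exc:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "FAIL", type(exc).__name__, exc)
        time.sleep(30)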

Summary of our deployment:

  * Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.4", GitCommit:"3eed1e3be6848b877ff80a93da3785d9034d0a4f", GitTreeState:"clean"}
  * Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.3+coreos.0", GitCommit:"d8dcde95b396ecd9a74b779cda9bc4d5b71e8550", GitTreeState:"clean"}
  * OS: CoreOS stable (1068.8.0)

We have a microservices architecture that allows third-party applications to request telemetry data. We are streaming about 1200 messages/s via a service that uses 0MQ (http://zeromq.org/) as the underlying transport layer, and this data is then written to a Cassandra cluster. The stream is also "published", which allows literally thousands of clients to listen to their own data stream, potentially creating thousands of connections and contributing to the problem above. We also expose a REST web service to request data in case a client has lost its connection and wants to "download" the missing data.
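For context, our subscribers look roughly like the sketch below (the endpoint, port and keepalive values are placeholders, not our exact code). We are wondering whether enabling TCP keepalives shorter than the conntrack established timeout would keep otherwise idle subscriber connections from being aged out of the nodes' conntrack tables.

    # Minimal pyzmq subscriber sketch (endpoint and options are placeholders).
    import zmq

    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.TCP_KEEPALIVE, 1)            # enable SO_KEEPALIVE on the socket
    sub.setsockopt(zmq.TCP_KEEPALIVE_IDLE, 300)     # first probe after 5 min idle
    sub.setsockopt(zmq.TCP_KEEPALIVE_INTVL, 60)     # then every 60 s
    sub.connect("tcp://telemetry-stream.default.svc.cluster.local:5556")  # hypothetical Service
    sub.setsockopt_string(zmq.SUBSCRIBE, "")        # subscribe to everything

    while True:
        print(sub.recv_string())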

What we are seeing is that all services run perfectly (multiple replicas, Pods exposed via Kubernetes Services) for up to 24 hours, and then all connections to the Services are dropped. Digging through the logs (I am sure we have not covered all of them, as we are still new to Kubernetes), the problem is mainly visible from the client applications, which get a "Connection timed out" when trying to retrieve data from the REST web services.

From http://ift.tt/2bU1yqp, if I understand it correctly, the default conntrack-tcp-timeout-established duration is set to 24 hours. Is this just a coincidence with the failures we are experiencing, or is our failure due to what is described in that post? If we leave Kubernetes alone and do not try any restarts of services, then it … Any pointers on how and where to look in order to solve the above problem will be gladly appreciated.
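One check we can do, if it helps, is to read the effective conntrack settings directly on a node (or from a pod with access to the host's /proc). A minimal sketch, assuming the standard netfilter sysctl paths:

    # Minimal sketch: read the effective conntrack settings on a node.
    # Run directly on the node (or in a pod with the host's /proc mounted).
    def read_sysctl(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError as exc:
            return "unreadable (%s)" % exc

    for name in ("nf_conntrack_tcp_timeout_established",  # the value the post above says defaults to 24 h
                 "nf_conntrack_count",
                 "nf_conntrack_max"):
        print(name, "=", read_sysctl("/proc/sys/net/netfilter/" + name))

If nf_conntrack_tcp_timeout_established reports 86400 (24 hours) and some of our long-lived connections can sit idle longer than that, that would at least line up with the interval we are seeing.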

If we need to provide specific logs, files, etc., we can.



