
Fixing a kubelet memory leak in Kubernetes 1.36

38 points - today at 2:14 AM

  • compumike

    today at 2:15 AM

    Author here! If you're running a Kubernetes cluster, I recommend you check `kubectl version` and see if you're running "Server Version: v1.36.[0,1,2]". If so, you may want to use the one-liner at the end of the article to check your "process_resident_memory_bytes" on each node, and consider restarting kubelet as a temporary workaround to tame the memory leak until v1.36.3 is released.

    • rirze

      today at 4:52 PM

      Very cool. It's often daunting to contribute to such a well-established and recognizable project, but this is exactly how it should work.

      • __turbobrew__

        today at 5:50 PM

        A good reason to health check the kubelet process and restart it when the checks fail.

          • compumike

            today at 6:06 PM

            What kind of health checks? In my case, the kubelet process was staying alive and responsive to queries, I believe due to:

              # cat /proc/$(pgrep kubelet)/oom_score_adj
              -999

              (from OOMScoreAdjust=-999 in /etc/systemd/system/kubelet.service)

            With this score, the Linux OOM killer wouldn't touch it, but any of my Pods were fair game.
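
Since the process stayed alive and responsive, a liveness-only check would likely have passed. One option is a check that looks at memory instead. Below is a minimal sketch, assuming you already have the kubelet's Prometheus text-format metrics as a string (how you fetch them is left out). The function names, the sample text, and the 1 GiB limit are all made up for illustration; only the metric name `process_resident_memory_bytes` comes from the thread.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// residentBytes scans Prometheus text-format metrics for the
// process_resident_memory_bytes sample and returns its value.
func residentBytes(metrics string) (float64, error) {
	const name = "process_resident_memory_bytes"
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip "# HELP" / "# TYPE" comment lines and unrelated metrics.
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		return strconv.ParseFloat(fields[1], 64)
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("%s not found", name)
}

// overLimit reports whether resident memory exceeds limitBytes.
// The limit is a hypothetical threshold chosen by the operator.
func overLimit(metrics string, limitBytes float64) (bool, error) {
	rss, err := residentBytes(metrics)
	if err != nil {
		return false, err
	}
	return rss > limitBytes, nil
}

func main() {
	// Made-up sample, not real kubelet output.
	sample := "# TYPE process_resident_memory_bytes gauge\n" +
		"process_resident_memory_bytes 1.5e+09\n"
	over, err := overLimit(sample, 1<<30) // 1 GiB, an arbitrary example limit
	fmt.Println(over, err)
}
```

A check like this only flags the symptom; what to do when it trips (alert, restart kubelet) is a separate decision.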

        • CamouflagedKiwi

          today at 5:11 PM

          Nice find.

          Can't help but feel this is one of the subtle traps hidden beneath the advice that contexts aren't supposed to be stored. I know it's not always that easy, of course.

            • compumike

              today at 5:40 PM

              Thanks. I know there's a `go vet` tool that's run as part of Kubernetes CI, and one of its checks is:

                lostcancel: check cancel func returned by context.WithCancel is called
              
              I'm not 100% sure why `go vet` didn't catch this issue, but storing the cancelFn in the struct is probably part of the reason. Any Go experts know if that's the case?

                • cyberax

                  today at 6:23 PM

                  The cancel function escapes the function body, so static analysis can't detect it. There's another lint for that (containedctx), but I think it's off in K8s.

                  This is a serious tripping point with Go. There's no way to express "this is a root context that I _want_ to store and only use to create derived contexts". Goroutines are also a source of problems: you can't easily say "I'm passing the ownership of this context to a goroutine".

          • fsuts

            today at 5:35 PM

            Not all heroes wear capes! Well done.