Kubelet1.7.16使用kubeconfig时,没有设置--require-kubeconfig,导致node不能注册

Tags: kubernetes_problem 

目录

现象

kubelet启动时日志报错:

E0114 11:06:58.217478     738 kubelet.go:1234] Image garbage collection failed once. Stats initialization may not have completed yet: unable to find data for container /
E0114 11:06:58.223804     738 kubelet.go:1733] Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container /
E0114 11:06:58.223845     738 kubelet.go:1741] Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container /
E0114 11:06:58.228106     738 container_manager_linux.go:543] [ContainerManager]: Fail to get rootfs information unable to find data for container /

kubectl get node命令一直看不到对应的node,也就说kubelet一直没有向master注册。

kubelet版本是1.7.16,kubernetes master版本是1.9.11

结论

分析过程走了好大一个弯路,上面的日志是在命令行运行kubelet时打印的,只有ERROR。

一直以为是这几个ERROR导致node不能注册,调查了好久。

后来纳闷代码中的一些日志怎么没在屏幕上打印出来,查看kubelet的运行参数,发现--logtostderr=false,所以屏幕上只打印了Error…

修改为--logtostderr=true之后,出现了大量日志,调查过程就顺利多了。

日志中有这样一行:

W0114 15:32:13.582668   27521 kubelet.go:1318] No api server defined - no node status update will be sent.

“因为没有指定apiservers,所以node状态不会上报”。kubelet用的是kubeconfig文件,没有在命令行指定apiserver,难道这样就不能上报node状态了??

检查kubelet 1.7.16的参数发现有一个--require-kubeconfig参数:

--require-kubeconfig     If true the Kubelet will exit if there are configuration errors, and will ignore the 
                         value of --api-servers in favor of the server defined in the kubeconfig file.

加上这个参数,就可以注册了。

存档之前走的歪路

分析

打印日志的kubelet代码:

// kubernetes/pkg/kubelet/kubelet.go:1727
// isOutOfDisk detects if pods can't fit due to lack of disk space.
func (kl *Kubelet) isOutOfDisk() bool {
	// Check disk space once globally and reject or accept all new pods.
	withinBounds, err := kl.diskSpaceManager.IsRuntimeDiskSpaceAvailable()
	// Assume enough space in case of errors.
	if err != nil {
		glog.Errorf("Failed to check if disk space is available for the runtime: %v", err)
	} else if !withinBounds {
		return true
	}

	withinBounds, err = kl.diskSpaceManager.IsRootDiskSpaceAvailable()
	// Assume enough space in case of errors.
	if err != nil {
		glog.Errorf("Failed to check if disk space is available on the root partition: %v", err)
	} else if !withinBounds {
		return true
	}
	return false
}

通过分析代码可以确定是下面的dm.cadvisor.ImagesFsInfodm.cadvisor.RootFsInfo函数运行报错:

// pkg/kubelet/disk_manager.go: 86
func (dm *realDiskSpaceManager) IsRuntimeDiskSpaceAvailable() (bool, error) {
	return dm.isSpaceAvailable("runtime", dm.policy.DockerFreeDiskMB, dm.cadvisor.ImagesFsInfo)
}

func (dm *realDiskSpaceManager) IsRootDiskSpaceAvailable() (bool, error) {
	return dm.isSpaceAvailable("root", dm.policy.RootFreeDiskMB, dm.cadvisor.RootFsInfo)
}

另外第三行日志也与cadvisor有关,cm.cadvisorInterface.RootFsInfo()

// kubernetes/pkg/kubelet/cm/container_manager_linux.go: 543
func (cm *containerManagerImpl) Start(node *v1.Node, activePods ActivePodsFunc) error {
	...
	stopChan := make(chan struct{})
	go wait.Until(func() {
		if err := cm.setFsCapacity(); err != nil {
			glog.Errorf("[ContainerManager]: %v", err)
			return
		}
		close(stopChan)
	}, time.Second, stopChan)
	return nil
}

func (cm *containerManagerImpl) setFsCapacity() error {
	rootfs, err := cm.cadvisorInterface.RootFsInfo()
	if err != nil {
		return fmt.Errorf("Fail to get rootfs information %v", err)
	}
	...
}

查找dm.cadvisor的创建位置

dm.cadvisor是第一个参数kubeDeps.CAdvisorInterface

// kubernetes/pkg/kubelet/kubelet.go: 424
diskSpaceManager, err := newDiskSpaceManager(kubeDeps.CAdvisorInterface, diskSpacePolicy)

kubeDeps.CAdvisorInterface的创建:

// kubernetes/cmd/kubelet/app/server.go: 538
func run(s *options.KubeletServer, kubeDeps *kubelet.KubeletDeps) (err error) {
	...
	if kubeDeps.CAdvisorInterface == nil {
		kubeDeps.CAdvisorInterface, err = cadvisor.New(uint(s.CAdvisorPort), s.ContainerRuntime, s.RootDirectory)
		if err != nil {
			return err
		}
	}
}

上面调用的cadvisor是"k8s.io/kubernetes/pkg/kubelet/cadvisor"

// kubernetes/pkg/kubelet/cadvisor/cadvisor_linux.go: 97
func New(port uint, runtime string, rootPath string) (Interface, error) {
	sysFs := sysfs.NewRealSysFs()

	// Create and start the cAdvisor container manager.
	m, err := manager.New(memory.New(statsCacheDuration, nil), sysFs, maxHousekeepingInterval, allowDynamicHousekeeping, \
	            cadvisormetrics.MetricSet{cadvisormetrics.NetworkTcpUsageMetrics: struct{}{}}, http.DefaultClient)
	if err != nil {
		return nil, err
	}

	cadvisorClient := &cadvisorClient{
		runtime:  runtime,
		rootPath: rootPath,
		Manager:  m,
	}

	err = cadvisorClient.exportHTTP(port)
	if err != nil {
		return nil, err
	}
	return cadvisorClient, nil
}

怎么是cadvisorUnsupported{}

// kubernetes/pkg/kubelet/cadvisor/cadvisor_linux.go: 188
func (cc *cadvisorClient) ImagesFsInfo() (cadvisorapiv2.FsInfo, error) {
	var label string
	
	switch cc.runtime {
	case "docker":
		label = cadvisorfs.LabelDockerImages
	case "rkt":
		label = cadvisorfs.LabelRktImages
	default:
		return cadvisorapiv2.FsInfo{}, fmt.Errorf("ImagesFsInfo: unknown runtime: %v", cc.runtime)
	}
	
	return cc.getFsInfo(label)
}

func (cc *cadvisorClient) RootFsInfo() (cadvisorapiv2.FsInfo, error) {
	return cc.GetDirFsInfo(cc.rootPath)
}

kubernetes_problem

  1. kubernetes ingress-nginx 启用 upstream 长连接,需要注意,否则容易 502
  2. kubernetes ingress-nginx 的 canary 影响指向同一个 service 的所有 ingress
  3. ingress-nginx 启用 tls 加密,配置了不存在的证书,导致 unable to get local issuer certificate
  4. https 协议访问,误用 http 端口,CONNECT_CR_SRVR_HELLO: wrong version number
  5. Kubernetes ingress-nginx 4 层 tcp 代理,无限重试不存在的地址,高达百万次
  6. Kubernetes 集群中个别 Pod 的 CPU 使用率异常高的问题调查
  7. Kubernetes 集群 Node 间歇性变为 NotReady 状态: IO 负载高,延迟严重
  8. Kubernetes的nginx-ingress-controller刷新nginx的配置滞后十分钟导致504
  9. Kubernetes的Nginx Ingress 0.20之前的版本,upstream的keep-alive不生效
  10. Kubernetes node 的 xfs文件系统损坏,kubelet主动退出且重启失败,恢复后无法创建pod
  11. Kubernetes的Pod无法删除,glusterfs导致docker无响应,集群雪崩
  12. Kubernetes集群node无法访问service: kube-proxy没有正确设置cluster-cidr
  13. Kubernetes集群node上的容器无法ping通外网: iptables snat规则缺失导致
  14. Kubernetes问题调查: failed to get cgroup stats for /systemd/system.slice
  15. Kubelet1.7.16使用kubeconfig时,没有设置--require-kubeconfig,导致node不能注册
  16. Kubelet从1.7.16升级到1.9.11,Sandbox以外的容器都被重建的问题调查
  17. Kubernetes: 内核参数rp_filter设置为Strict RPF,导致Service不通
  18. Kubernetes使用过程中遇到的一些问题与解决方法
  19. Kubernetes集群节点被入侵挖矿,CPU被占满
  20. kubernetes的node上的重启linux网络服务后,pod无法联通
  21. kubernetes的pod因为同名Sandbox的存在,一直无法删除
  22. kubelet升级,导致calico中存在多余的workloadendpoint,node上存在多余的veth设备
  23. 使用petset创建的etcd集群在kubernetes中运行失败
  24. Kubernetes 容器启动失败: unable to create nf_conn slab cache
  25. 未在calico中创建hostendpoint,导致开启隔离后,在kubernetes的node上无法访问pod
  26. calico分配的ip冲突,pod内部arp记录丢失,pod无法访问外部服务
  27. kubernetes的dnsmasq缓存查询结果,导致pod偶尔无法访问域名
  28. k8s: rbd image is locked by other nodes
  29. kuberntes的node无法通过物理机网卡访问Service

推荐阅读

Copyright @2011-2019 All rights reserved. 转载请添加原文连接,合作请加微信lijiaocn或者发送邮件: lijiaocn@foxmail.com,备注网站合作

友情链接:  系统软件  程序语言  运营经验  水库文集  网络课程  微信网文  发现知识星球