A Note on Fixing NotReady K8s Cluster Nodes Caused by Disk Corruption After a Forced VM Power-Off


Before we start


  • This happened in my own lab environment; I am sharing the troubleshooting process
  • My understanding may be incomplete, so corrections are welcome

I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? -- Hermann Hesse, Demian


What went wrong

Ha. When I left at noon my keys got locked inside the apartment. In a rush to get home and find a locksmith, and since I had to bring the work NUC back with me, I force-powered-off my own NUC. The result: the k8s cluster deployed in its VMs would not come up when I got back. Not a total disaster, though: at least the master was still alive, a small mercy. The last time I force-powered it off, the whole cluster failed to come up, the etcd pod was down as well, and with no backup I had no choice but to reset the cluster with kubeadm.
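That earlier incident is also why I now think a periodic etcd snapshot is worth the trouble. A minimal sketch, assuming a kubeadm cluster with the default certificate paths and a hypothetical backup directory (run it on the master; adjust paths to your own setup):

mkdir -p /var/lib/etcd-backup
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd-backup/snap-$(date +%F).db   # one snapshot file per day

Back to the incident at hand, this is what the cluster looked like: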

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

Ha, some of the cluster nodes are NotReady and the corresponding VMs will not even boot; below is what the console shows when booting drops straight into rescue (emergency) mode.

[    9.800336] XFS (sda1): Metadata corruption detected at xfs_agf_read_verify+0x78/... [xfs], xfs_agf block 0x4b00001
[    9.808356] XFS (sda1): Unmount and run xfs_repair
[    9.808376] XFS (sda1): First 64 bytes of corrupted metadata buffer:
[    9.808395] ffff88803610a400: 58 41 47 46 ...                          XAGF............
               (the remaining hex-dump lines were unreadable in the console capture)
[    9.808515] XFS (sda1): metadata I/O error: block 0x4b00001 ("xfs_trans_read_buf_map") error 117 numblks 1
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot after mounting them and attach it to a bug report.
:/#

The disk is corrupted and needs to be repaired. Ha, what a pain.

How I fixed it

After looking up how to recover the disk, the steps were:

  1. Boot the VM and press e at the GRUB menu to edit the boot entry.
  2. Append rd.break to the end of the line that starts with linux16.
  3. Press Ctrl+X to boot with that entry and drop into the rescue shell, then run xfs_repair -L /dev/sda1. Here sda1 is the corrupted device reported in the rescue-mode output above; see the sketch right after this list.
  4. Run reboot.
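For reference, a minimal sketch of the rescue-shell session. The device name is the one from the emergency output above; the -n dry run is my own addition, and -L zeroes the XFS log, which can discard in-flight metadata, so it is very much a last resort:

xfs_repair -n /dev/sda1   # dry run: report what would be fixed, change nothing
xfs_repair -L /dev/sda1   # force the repair, discarding the dirty log
reboot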

OK. After repairing the disks one by one and booting the VMs again, checking the nodes showed that one of them had recovered, but two were still NotReady.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    NotReady   <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

At first I suspected kubelet, but its status and logs on the node looked fine.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 二 2023-01-17 20:53:02 CST; 1min 18s ago
....
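Since kubelet itself was running, another quick check (not part of my original session; the node name is taken from the listing above) is to ask the API server why it marks the node NotReady. The Ready condition message usually points straight at the container runtime:

kubectl describe node vms82.liruilongs.github.io | grep -A 8 'Conditions:'
kubectl get node vms82.liruilongs.github.io \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'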

Then, in the cluster events, I found messages like "Is the docker daemon running?" and "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get events | grep -i error
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
54m         Warning   Unhealthy                pod/calico-node-nfkzd                                 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/calico-node-vxpxt                                 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "calico-node-vxpxt": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "kube-proxy-htg7t": Error response from daemon: transport is closing: unavailable
44m         Warning   FailedCreatePodSandBox   pod/kube-proxy-htg7t                                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "kube-proxy-htg7t": error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.41/containers/create?name=k8s_POD_kube-proxy-htg7t_kube-system_85fe510d-d713-4fe6-b852-dd1655d37fff_15": EOF
44m         Warning   FailedKillPod            pod/skooner-5b65f884f8-9cs4k                          error killing pod: failed to "KillPodSandbox" for "eb888be0-5f30-4620-a4a2-111f14bb092d" with KillPodSandbo
Error: "rpc error: code = Unknown desc = [networkPlugin cni failed to teardown pod \"skooner-5b65f884f8-9cs4k_kube-system\" network: error getting ClusterInformation: Get \"https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused, Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]"
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

So docker had probably not come up on some of the nodes. I checked the docker status on one of the NotReady nodes:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl  status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://docs.docker.com
1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:19 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
1月 17 21:08:25 vms82.liruilongs.github.io systemd[1]: Job docker.service/start failed with result 'dependency'.
1月 17 21:08:30 vms82.liruilongs.github.io systemd[1]: Dependency failed for Docker Application Container Engine.
......

Sure enough, docker had failed to start, and the hint is that one of its dependencies failed. Let's look at docker's forward dependencies, i.e. the units that have to come up before docker:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl list-dependencies docker.service
docker.service
● ├─containerd.service
● ├─docker.socket
● ├─system.slice
● ├─basic.target
● │ ├─microcode.service
● │ ├─rhel-autorelabel-mark.service
● │ ├─rhel-autorelabel.service
● │ ├─rhel-configure.service
● │ ├─rhel-dmesg.service
● │ ├─rhel-loadmodules.service
● │ ├─selinux-policy-migrate-local-changes@targeted.service
● │ ├─paths.target
● │ ├─slices.target
● │ │ ├─-.slice
● │ │ └─system.slice
● │ ├─sockets.target
............................
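Rather than walking the whole dependency tree by eye, a hedged shortcut is to ask systemd directly which units are currently in the failed state:

systemctl list-units --state=failed
systemctl --failed                  # short form of the same query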

Now look at the first dependency, containerd.service. It turns out it had not started successfully either.

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since 二 2023-01-17 21:14:58 CST; 4s ago
     Docs: https://containerd.io
  Process: 6494 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 6491 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 6494 (code=exited, status=2)
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Failed to start containerd container runtime.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: Unit containerd.service entered failed state.
1月 17 21:14:58 vms82.liruilongs.github.io systemd[1]: containerd.service failed.
┌──[root@vms82.liruilongs.github.io]-[~]
└─$

There is not much more detail here beyond the fact that it exited with status 2, so let's try restarting it:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$systemctl restart containerd.service
Job for containerd.service failed because the control process exited with error code. See "systemctl status containerd.service" and "journalctl -xe" for details.

Check the containerd unit log, starting with a grep for error messages:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$journalctl -u  containerd | grep -i error -m 3
117 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203387028+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"modprobe: FATAL: Module aufs not found.\\n\"): skip plugin" type=io.containerd.snapshotter.v1
117 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.203699262+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
117 20:41:56 vms82.liruilongs.github.io containerd[962]: time="2023-01-17T20:41:56.204050775+08:00" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
┌──[root@vms82.liruilongs.github.io]-[~]
└─$
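The grep only turns up plugin-skip notices, and a containerd exit with status 2 usually comes with a panic or fatal message that does not contain the word "error". A hedged extra step, not in my original session, is to read the unfiltered tail of the unit log as well:

journalctl -u containerd --no-pager -n 50              # last 50 lines, no filtering
journalctl -u containerd -b --no-pager | tail -n 100   # everything since this boot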

The grep gave me the lines below. They are mostly harmless plugin-skip notices, but combined with the exit status 2 my guess was that the on-disk state under /var/lib/containerd/ had been corrupted by the power loss. So the plan: back up /var/lib/containerd/, then delete its contents and let containerd rebuild it.

aufs is not supported (modprobe aufs failed: exit status 1 "modprobe: FATAL: Module aufs not found.")
path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin
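A minimal sketch of the backup step mentioned above (my session below only shows the deletion); it assumes there is enough free space and that briefly stopping the runtime on this already-broken node is acceptable:

systemctl stop docker containerd
tar -czf /root/containerd-$(date +%F).tar.gz -C /var/lib containerd   # keep a copy, just in case
# or simply move the directory aside instead of archiving it:
# mv /var/lib/containerd /var/lib/containerd.bak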

Delete everything under the directory:

┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
io.containerd.content.v1.content/       io.containerd.runtime.v1.linux/         io.containerd.snapshotter.v1.native/    tmpmounts/
io.containerd.metadata.v1.bolt/         io.containerd.runtime.v2.task/          io.containerd.snapshotter.v1.overlayfs/
┌──[root@vms82.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$ls

After the deletion, try to start containerd again:

┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms82.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since 二 2023-01-17 21:25:13 CST; 51s ago
     Docs: https://containerd.io
  Process: 8180 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 8182 (containerd)
   Memory: 146.8M
...........

OK, containerd started successfully, and at this point the node went back to Ready as well.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS     ROLES                  AGE    VERSION
vms155.liruilongs.github.io   NotReady   <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready      <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready      control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready      <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
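A quick hedged sanity check on the recovered node (beyond what my original session shows), to confirm the whole runtime chain is really back:

systemctl is-active containerd docker kubelet
docker ps --format '{{.Names}}' | head    # the kube-system pod containers should be running again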

The remaining nodes get the same treatment, one by one.
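If there were many nodes, the same fix could be scripted over SSH. A hedged sketch, assuming key-based root access; the node list is illustrative, the operation is destructive, and, as the next node shows, not every failure is identical, so each node still needs to be verified afterwards:

for n in 192.168.26.155 192.168.26.82; do
  ssh root@"$n" 'systemctl stop docker containerd;
                 mv /var/lib/containerd /var/lib/containerd.bak.$(date +%s);
                 systemctl start containerd docker'
done

In practice I simply went node by node, starting with 192.168.26.155.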

On 192.168.26.155 docker had also failed to start, but the problem was a little different: after the same cleanup and starting containerd, docker did not come back on its own, and its log contained error-level messages.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$ssh root@192.168.26.155
Last login: Mon Jan 16 02:26:43 2023 from 192.168.26.81
┌──[root@vms155.liruilongs.github.io]-[~]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[~]
└─$cd /var/lib/containerd/
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$rm -rf *
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl start containerd
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl is-active  docker
failed
┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since 二 2023-01-17 20:20:03 CST; 1h 31min ago
     Docs: https://docs.docker.com
  Process: 2030 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 2030 (code=exited, status=0/SUCCESS)
1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796621853+08:00" level=error msg="712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e37724402f10d3045 cleanup: failed to de...
1月 17 20:20:02 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:02.796669296+08:00" level=error msg="Handler for POST /v1.41/containers/712fd90a1962d0f546eaf6c9db05c2577ac9855b38f9f41e377...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.285529266+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containe...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.783878143+08:00" level=info msg="Processing signal 'terminated'"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Stopping Docker Application Container Engine...
1月 17 20:20:03 vms155.liruilongs.github.io dockerd[2030]: time="2023-01-17T20:20:03.784550238+08:00" level=info msg="Daemon shutdown complete"
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: start request repeated too quickly for docker.service
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Failed to start Docker Application Container Engine.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: Unit docker.service entered failed state.
1月 17 20:20:03 vms155.liruilongs.github.io systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

The service did not restart on its own (it hit systemd's start-limit), so restart it by hand:

┌──[root@vms155.liruilongs.github.io]-[/var/lib/containerd]
└─$systemctl restart docker
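For the record, a hedged side note: systemctl restart clears the failed state here, but if a unit were still blocked by the start-limit, the failure counter can also be reset explicitly before starting it:

systemctl reset-failed docker.service
systemctl start docker.service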

Check the node status again: all nodes are Ready.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS   ROLES                  AGE    VERSION
vms155.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms156.liruilongs.github.io   Ready    <none>                 76d    v1.22.2
vms81.liruilongs.github.io    Ready    control-plane,master   400d   v1.22.2
vms82.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
vms83.liruilongs.github.io    Ready    <none>                 400d   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
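One last hedged check, beyond what the original session shows: make sure the workloads came back along with the nodes, since pods that died during the outage may still be restarting:

kubectl get pods -A -o wide | grep -vE 'Running|Completed'    # anything printed here still needs attention
kubectl get events -A --field-selector type=Warning | tail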

References


https://blog.csdn.net/qq_35022803/article/details/109287086
