Recovering from an Unexpected Server Room Power Outage

1. Ubuntu Filesystem Error

the root filesystem on /dev/mapper/ubuntu--vg--ubuntu--lv  requires a manual fsck

Solution

Run fsck manually to repair the filesystem errors.

fsck -y /dev/mapper/ubuntu--vg--ubuntu--lv
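
This prompt typically comes from the initramfs (BusyBox) shell after the unclean shutdown; once the repair above finishes, resume the boot from that prompt. A minimal sketch, assuming the initramfs prompt:

exit          # hand control back to the normal boot
# reboot -f   # fallback if the boot does not continue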

2. Fixing K8s

Node .12 (master, 10.180.244.12)

The etcd service fails to start.
The log from journalctl -xeu etcd is as follows:

Apr 12 11:08:27 master etcd[9791]: etcd Version: 3.4.13
Apr 12 11:08:27 master etcd[9791]: Git SHA: ae9734ed2
Apr 12 11:08:27 master etcd[9791]: Go Version: go1.12.17
Apr 12 11:08:27 master etcd[9791]: Go OS/Arch: linux/amd64
Apr 12 11:08:27 master etcd[9791]: setting maximum number of CPUs to 6, total number of available CPUs is 6
Apr 12 11:08:27 master etcd[9791]: the server is already initialized as member before, starting as etcd member...
Apr 12 11:08:27 master etcd[9791]: peerTLS: cert = /etc/ssl/etcd/ssl/member-master.pem, key = /etc/ssl/etcd/ssl/member-master-key.pem, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file =
Apr 12 11:08:27 master etcd[9791]: name = etcd-master
Apr 12 11:08:27 master etcd[9791]: data dir = /var/lib/etcd
Apr 12 11:08:27 master etcd[9791]: member dir = /var/lib/etcd/member
Apr 12 11:08:27 master etcd[9791]: heartbeat = 250ms
Apr 12 11:08:27 master etcd[9791]: election = 5000ms
Apr 12 11:08:27 master etcd[9791]: snapshot count = 10000
Apr 12 11:08:27 master etcd[9791]: advertise client URLs = https://10.180.244.12:2379
Apr 12 11:08:27 master etcd[9791]: initial advertise peer URLs = https://10.180.244.12:2380
Apr 12 11:08:27 master etcd[9791]: initial cluster =
Apr 12 11:08:27 master etcd[9791]: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
Apr 12 11:08:27 master etcd[9791]: recovered store from snapshot at index 173568063
Apr 12 11:08:27 master etcd[9791]: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
Apr 12 11:08:27 master etcd[9791]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
Apr 12 11:08:27 master etcd[9791]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 12 11:08:27 master etcd[9791]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0587e]
Apr 12 11:08:27 master etcd[9791]: goroutine 1 [running]:
Apr 12 11:08:27 master etcd[9791]: go.etcd.io/etcd/etcdserver.NewServer.func1(0xc0002c4e30, 0xc0002c2d80)
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:334 +0x3e
Apr 12 11:08:27 master etcd[9791]: panic(0xee6840, 0xc00003aef0)
Apr 12 11:08:27 master etcd[9791]: /usr/local/go/src/runtime/panic.go:522 +0x1b5
Apr 12 11:08:27 master etcd[9791]: github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001b36c0, 0x10c13b2, 0x2a, 0xc0002c2e50, 0x1, 0x1)
Apr 12 11:08:27 master etcd[9791]: /home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/github.com/coreos/[email protected]/capnslog/pkg_logger.go:75 +0x135
Apr 12 11:08:27 master etcd[9791]: go.etcd.io/etcd/etcdserver.NewServer(0xc00004406a, 0xb, 0x0, 0x0, 0x0, 0x0, 0xc000109100, 0x1, 0x1, 0xc000109280, ...)
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:464 +0x433c
Apr 12 11:08:27 master etcd[9791]: go.etcd.io/etcd/embed.StartEtcd(0xc000290000, 0xc0000e9600, 0x0, 0x0)
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/embed/etcd.go:214 +0x988
Apr 12 11:08:27 master etcd[9791]: go.etcd.io/etcd/etcdmain.startEtcd(0xc000290000, 0x10963d6, 0x6, 0x1, 0xc000223140)
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
Apr 12 11:08:27 master etcd[9791]: go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:144 +0x2ef9
Apr 12 11:08:27 master etcd[9791]: go.etcd.io/etcd/etcdmain.Main()
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/main.go:46 +0x38
Apr 12 11:08:27 master etcd[9791]: main.main()
Apr 12 11:08:27 master etcd[9791]: /tmp/etcd-release-3.4.13/etcd/release/etcd/main.go:28 +0x20
Apr 12 11:08:27 master systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
-- Subject: Unit process exited

Node .201 (master1, 10.180.244.201)

Apr 12 11:20:37 master1 systemd[1]: Starting etcd...
░░ Subject: A start job for unit etcd.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit etcd.service has begun execution.
░░
░░ The job identifier is 9387.
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683452+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ADVERTISE_CLIENT_URLS","variable-value":"https://10.180.244.201:2379"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683574+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"8"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683589+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CERT_FILE","variable-value":"/etc/ssl/etcd/ssl/member-master1.pem"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.6836+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CLIENT_CERT_AUTH","variable-value":"true"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683613+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_DATA_DIR","variable-value":"/var/lib/etcd"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683628+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ELECTION_TIMEOUT","variable-value":"5000"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683637+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ENABLE_V2","variable-value":"true"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683666+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_HEARTBEAT_INTERVAL","variable-value":"250"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.68368+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"https://10.180.244.201:2380"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683688+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd-master1=https://10.180.244.201:2380"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683696+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"existing"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683704+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_TOKEN","variable-value":"k8s_etcd"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683716+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_KEY_FILE","variable-value":"/etc/ssl/etcd/ssl/member-master1-key.pem"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683729+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_CLIENT_URLS","variable-value":"https://10.180.244.201:2379,https://127.0.0.1:2379"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683742+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_PEER_URLS","variable-value":"https://10.180.244.201:2380"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683755+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_METRICS","variable-value":"basic"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683762+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"etcd-master1"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683772+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_CERT_FILE","variable-value":"/etc/ssl/etcd/ssl/member-master1.pem"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.68378+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_CLIENT_CERT_AUTH","variable-value":"true"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.68379+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_KEY_FILE","variable-value":"/etc/ssl/etcd/ssl/member-master1-key.pem"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683799+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_TRUSTED_CA_FILE","variable-value":"/etc/ssl/etcd/ssl/ca.pem"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683807+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PROXY","variable-value":"off"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683823+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"10000"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.683839+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_TRUSTED_CA_FILE","variable-value":"/etc/ssl/etcd/ssl/ca.pem"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"warn","ts":"2025-04-12T11:20:37.683932+0800","caller":"embed/config.go:679","msg":"Running http and grpc server on single port. This is not recommended for production."}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.68396+0800","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/usr/local/bin/etcd"]}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.684035+0800","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"warn","ts":"2025-04-12T11:20:37.68406+0800","caller":"embed/config.go:679","msg":"Running http and grpc server on single port. This is not recommended for production."}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.684075+0800","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["https://10.180.244.201:2380"]}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.684111+0800","caller":"embed/etcd.go:494","msg":"starting with peer TLS","tls-info":"cert = /etc/ssl/etcd/ssl/member-master1.pem, key = /etc/ssl/etcd/ssl/member-master1-key.pem, client-cert=, client-key=, trusted-ca = /etc/ssl/etcd/ssl/ca.pem, client-cert-auth = true, crl-file = ","cipher-suites":[]}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.685281+0800","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["https://10.180.244.201:2379","https://127.0.0.1:2379"]}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"info","ts":"2025-04-12T11:20:37.685515+0800","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"c9063a0dc","go-version":"go1.21.8","go-os":"linux","go-arch":"amd64","max-cpu-set":16,"max-cpu-available":16,"member-initialized":true,"name":"etcd-master1","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"250ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.180.244.201:2380"],"listen-peer-urls":["https://10.180.244.201:2380"],"advertise-client-urls":["https://10.180.244.201:2379"],"listen-client-urls":["https://10.180.244.201:2379","https://127.0.0.1:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"8h0m0s","auto-compaction-interval":"8h0m0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Apr 12 11:20:37 master1 etcd[8932]: {"level":"warn","ts":"2025-04-12T11:20:37.685779+0800","caller":"fileutil/fileutil.go:53","msg":"check file permission","error":"directory \"/var/lib/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
Apr 12 11:20:37 master1 etcd[8932]: panic: freepages: failed to get all reachable pages (page 1545: multiple references (stack: [16767 14011 1545]))
Apr 12 11:20:37 master1 etcd[8932]: goroutine 59 [running]:
Apr 12 11:20:37 master1 etcd[8932]: go.etcd.io/bbolt.(*DB).freepages.func2()
Apr 12 11:20:37 master1 etcd[8932]: go.etcd.io/[email protected]/db.go:1202 +0x8d
Apr 12 11:20:37 master1 etcd[8932]: created by go.etcd.io/bbolt.(*DB).freepages in goroutine 58
Apr 12 11:20:37 master1 etcd[8932]: go.etcd.io/[email protected]/db.go:1200 +0x1e5
Apr 12 11:20:37 master1 systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
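
Both panics point at damage inside the etcd data directory: on .12 the database snapshot file is missing, and on .201 the bbolt database itself has corrupt free pages. Before picking a fix, it can help to see what is actually left on disk; a diagnostic sketch using the default layout reported in the logs above:

# etcd v3 keeps its bbolt database and snapshot metadata under member/snap,
# and the write-ahead log under member/wal
ls -l /var/lib/etcd/member/snap/
ls -l /var/lib/etcd/member/wal/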

2.1 Solution 1 - Reinstall

Use this if it does not matter that the cluster data is wiped.

2.1.1 Delete the cluster with KubeKey

./kk delete cluster -f config.yaml

Output log:

W0414 16:19:32.364542   25465 cleanupnode.go:99] [reset] Failed to remove containers: [failed to stop running pod c2806a3f94f0e02cbfb43709e646d7954825eb15fd46fc575ece2b88a71c2142: output: E0414 16:09:01.277986   25763 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="c2806a3f94f0e02cbfb43709e646d7954825eb15fd46fc575ece2b88a71c2142"
time="2025-04-14T16:09:01+08:00" level=fatal msg="stopping the pod sandbox \"c2806a3f94f0e02cbfb43709e646d7954825eb15fd46fc575ece2b88a71c2142\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 089f68c27162df73574cccfeee09e1c6c93145f2bc86fb8e8c61b8696f2e87d2: output: E0414 16:09:26.497640 25970 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="089f68c27162df73574cccfeee09e1c6c93145f2bc86fb8e8c61b8696f2e87d2"
time="2025-04-14T16:09:26+08:00" level=fatal msg="stopping the pod sandbox \"089f68c27162df73574cccfeee09e1c6c93145f2bc86fb8e8c61b8696f2e87d2\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 25d98b3eb6b3795ebdbd8a0a1609888fba0faee05da32ca349738144c74723d3: output: E0414 16:09:51.636863 26199 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="25d98b3eb6b3795ebdbd8a0a1609888fba0faee05da32ca349738144c74723d3"
time="2025-04-14T16:09:51+08:00" level=fatal msg="stopping the pod sandbox \"25d98b3eb6b3795ebdbd8a0a1609888fba0faee05da32ca349738144c74723d3\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 1c8c6c21193d2e1cab18303f3e9c64ed3c56da5a21b1e8eb7701966f5acb577f: output: E0414 16:10:17.118486 26471 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="1c8c6c21193d2e1cab18303f3e9c64ed3c56da5a21b1e8eb7701966f5acb577f"
time="2025-04-14T16:10:17+08:00" level=fatal msg="stopping the pod sandbox \"1c8c6c21193d2e1cab18303f3e9c64ed3c56da5a21b1e8eb7701966f5acb577f\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 59cf8e938c590b34482292c09371116a40e7050659476401a109c1eb77d4f08d: output: E0414 16:10:42.485352 26775 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="59cf8e938c590b34482292c09371116a40e7050659476401a109c1eb77d4f08d"
time="2025-04-14T16:10:42+08:00" level=fatal msg="stopping the pod sandbox \"59cf8e938c590b34482292c09371116a40e7050659476401a109c1eb77d4f08d\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 4f5dfcb197dd7b2b459f22eff4778de044f94beae2ec0a22774232d061d08dd5: output: E0414 16:11:07.664678 27057 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="4f5dfcb197dd7b2b459f22eff4778de044f94beae2ec0a22774232d061d08dd5"
time="2025-04-14T16:11:07+08:00" level=fatal msg="stopping the pod sandbox \"4f5dfcb197dd7b2b459f22eff4778de044f94beae2ec0a22774232d061d08dd5\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod a277f3b4673219c78a6ce806bd160a6ea54b70da3bb290bfd7f2835051ae284e: output: E0414 16:11:32.857703 27343 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="a277f3b4673219c78a6ce806bd160a6ea54b70da3bb290bfd7f2835051ae284e"
time="2025-04-14T16:11:32+08:00" level=fatal msg="stopping the pod sandbox \"a277f3b4673219c78a6ce806bd160a6ea54b70da3bb290bfd7f2835051ae284e\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 96be8163987f96a1dc1ed612247ee2e0ae566abaef50efe1462f3788c89dfad8: output: E0414 16:11:58.019128 27619 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="96be8163987f96a1dc1ed612247ee2e0ae566abaef50efe1462f3788c89dfad8"
time="2025-04-14T16:11:58+08:00" level=fatal msg="stopping the pod sandbox \"96be8163987f96a1dc1ed612247ee2e0ae566abaef50efe1462f3788c89dfad8\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 1a89cb3b7db65aa86e08c1c5ec4ff9ed3415ea7ad264935189b4b4da4f818e70: output: E0414 16:12:23.173529 27907 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="1a89cb3b7db65aa86e08c1c5ec4ff9ed3415ea7ad264935189b4b4da4f818e70"
time="2025-04-14T16:12:23+08:00" level=fatal msg="stopping the pod sandbox \"1a89cb3b7db65aa86e08c1c5ec4ff9ed3415ea7ad264935189b4b4da4f818e70\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 1e504fdf8d77880f9ab366214e6ced2b9c40eabb4b23deb82215d5dabe8313f9: output: E0414 16:12:48.354015 28188 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="1e504fdf8d77880f9ab366214e6ced2b9c40eabb4b23deb82215d5dabe8313f9"
time="2025-04-14T16:12:48+08:00" level=fatal msg="stopping the pod sandbox \"1e504fdf8d77880f9ab366214e6ced2b9c40eabb4b23deb82215d5dabe8313f9\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod fe2aa182e41e79f265bb56f0aaf816906682e782853ca4ff87cae9c817e009d2: output: E0414 16:13:13.579420 28462 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="fe2aa182e41e79f265bb56f0aaf816906682e782853ca4ff87cae9c817e009d2"
time="2025-04-14T16:13:13+08:00" level=fatal msg="stopping the pod sandbox \"fe2aa182e41e79f265bb56f0aaf816906682e782853ca4ff87cae9c817e009d2\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod efbf55a44e1ff8d25bdae11be2b77ec108455e09de8d6586b85a1c7f5dd8fac0: output: E0414 16:13:39.212120 28771 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="efbf55a44e1ff8d25bdae11be2b77ec108455e09de8d6586b85a1c7f5dd8fac0"
time="2025-04-14T16:13:39+08:00" level=fatal msg="stopping the pod sandbox \"efbf55a44e1ff8d25bdae11be2b77ec108455e09de8d6586b85a1c7f5dd8fac0\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 31eba6df7655d4171119cd71c44fc435ea76a5557124f28033333b25edeb24f0: output: E0414 16:14:04.598831 29067 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="31eba6df7655d4171119cd71c44fc435ea76a5557124f28033333b25edeb24f0"
time="2025-04-14T16:14:04+08:00" level=fatal msg="stopping the pod sandbox \"31eba6df7655d4171119cd71c44fc435ea76a5557124f28033333b25edeb24f0\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod eca3dc6e439b2d86fa87f9a55caab8524646047afe6b0c3fe352818cb5d6c1ef: output: E0414 16:14:29.755758 29347 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="eca3dc6e439b2d86fa87f9a55caab8524646047afe6b0c3fe352818cb5d6c1ef"
time="2025-04-14T16:14:29+08:00" level=fatal msg="stopping the pod sandbox \"eca3dc6e439b2d86fa87f9a55caab8524646047afe6b0c3fe352818cb5d6c1ef\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod e3ebc59cadb6419a8234cca445f17afda27ccdd7848c3f608dc3a7f8823fb732: output: E0414 16:14:54.904993 29622 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="e3ebc59cadb6419a8234cca445f17afda27ccdd7848c3f608dc3a7f8823fb732"
time="2025-04-14T16:14:54+08:00" level=fatal msg="stopping the pod sandbox \"e3ebc59cadb6419a8234cca445f17afda27ccdd7848c3f608dc3a7f8823fb732\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 312982760e677932d7a1da16e4b9d7aaba57a445835b1af5502c081e7602fcc9: output: E0414 16:15:20.398950 29921 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="312982760e677932d7a1da16e4b9d7aaba57a445835b1af5502c081e7602fcc9"
time="2025-04-14T16:15:20+08:00" level=fatal msg="stopping the pod sandbox \"312982760e677932d7a1da16e4b9d7aaba57a445835b1af5502c081e7602fcc9\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod d6e8531295d767877e5fa82e56e7fafefd318bbfa89b90c45d99609db87c9aee: output: E0414 16:15:45.849515 30219 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="d6e8531295d767877e5fa82e56e7fafefd318bbfa89b90c45d99609db87c9aee"
time="2025-04-14T16:15:45+08:00" level=fatal msg="stopping the pod sandbox \"d6e8531295d767877e5fa82e56e7fafefd318bbfa89b90c45d99609db87c9aee\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod a544829555939bd57b6aeeb98bb57665461daa7c187d7f318f435b6917c18b6d: output: E0414 16:16:11.071381 30501 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="a544829555939bd57b6aeeb98bb57665461daa7c187d7f318f435b6917c18b6d"
time="2025-04-14T16:16:11+08:00" level=fatal msg="stopping the pod sandbox \"a544829555939bd57b6aeeb98bb57665461daa7c187d7f318f435b6917c18b6d\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 90bafd8eb19d53c0f97a9f78af2df5d5cd26467e331cd5647e5e45ed5f0bd7aa: output: E0414 16:16:36.253130 30754 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="90bafd8eb19d53c0f97a9f78af2df5d5cd26467e331cd5647e5e45ed5f0bd7aa"
time="2025-04-14T16:16:36+08:00" level=fatal msg="stopping the pod sandbox \"90bafd8eb19d53c0f97a9f78af2df5d5cd26467e331cd5647e5e45ed5f0bd7aa\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 1871b44d9fcb5804a117ff2855e31455293c3ee50f8b9f45ab791b8f747838d3: output: E0414 16:17:01.392112 31001 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="1871b44d9fcb5804a117ff2855e31455293c3ee50f8b9f45ab791b8f747838d3"
time="2025-04-14T16:17:01+08:00" level=fatal msg="stopping the pod sandbox \"1871b44d9fcb5804a117ff2855e31455293c3ee50f8b9f45ab791b8f747838d3\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 4267de7bc6b2969a2d34835b94b3bc3d62c5cf8eaf0c0eb334289871a721cd12: output: E0414 16:17:26.581459 31292 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="4267de7bc6b2969a2d34835b94b3bc3d62c5cf8eaf0c0eb334289871a721cd12"
time="2025-04-14T16:17:26+08:00" level=fatal msg="stopping the pod sandbox \"4267de7bc6b2969a2d34835b94b3bc3d62c5cf8eaf0c0eb334289871a721cd12\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod a953161f9ad43abe6c798c56b3375770f8c625d2bb558b8d3e6f64feb4dcdbfa: output: E0414 16:17:51.774054 31567 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="a953161f9ad43abe6c798c56b3375770f8c625d2bb558b8d3e6f64feb4dcdbfa"
time="2025-04-14T16:17:51+08:00" level=fatal msg="stopping the pod sandbox \"a953161f9ad43abe6c798c56b3375770f8c625d2bb558b8d3e6f64feb4dcdbfa\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 8df9db69c22f8752cb5c86b38c8efcf9512f8f72ff632e5955da04b9fdff6a62: output: E0414 16:18:16.903184 31855 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="8df9db69c22f8752cb5c86b38c8efcf9512f8f72ff632e5955da04b9fdff6a62"
time="2025-04-14T16:18:16+08:00" level=fatal msg="stopping the pod sandbox \"8df9db69c22f8752cb5c86b38c8efcf9512f8f72ff632e5955da04b9fdff6a62\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod 5b4a48e414e7f1901bbc97636f1ee93aef8f6b191c2c3d402515ade3035958bf: output: E0414 16:18:42.064658 32165 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="5b4a48e414e7f1901bbc97636f1ee93aef8f6b191c2c3d402515ade3035958bf"
time="2025-04-14T16:18:42+08:00" level=fatal msg="stopping the pod sandbox \"5b4a48e414e7f1901bbc97636f1ee93aef8f6b191c2c3d402515ade3035958bf\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod ff5b6bc2e1ad1e2b07f69513a9c04040e2010821079a9a11f1e14a5893153b2a: output: E0414 16:19:07.198720 32500 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="ff5b6bc2e1ad1e2b07f69513a9c04040e2010821079a9a11f1e14a5893153b2a"
time="2025-04-14T16:19:07+08:00" level=fatal msg="stopping the pod sandbox \"ff5b6bc2e1ad1e2b07f69513a9c04040e2010821079a9a11f1e14a5893153b2a\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1, failed to stop running pod e70c440c0cb0b4db4652602405259e62bd6775ead44da553f7848334ef908615: output: E0414 16:19:32.361107 32786 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="e70c440c0cb0b4db4652602405259e62bd6775ead44da553f7848334ef908615"
time="2025-04-14T16:19:32+08:00" level=fatal msg="stopping the pod sandbox \"e70c440c0cb0b4db4652602405259e62bd6775ead44da553f7848334ef908615\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
: exit status 1]
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.

2.1.2 Remove etcd and clean up other files

rm -rf /etc/etcd.env /usr/local/bin/etcd* /etc/systemd/system/etcd.service
rm -rf /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf /etc/cni/net.d
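
These commands remove the etcd binary, unit file and environment file, but not the damaged data under /var/lib/etcd. Whether kk delete cluster already wipes that directory was not verified here, so clearing it explicitly before the reinstall is an extra assumption on top of the original steps (acceptable only because this path already treats the data as disposable):

rm -rf /var/lib/etcd    # assumption: data loss is acceptable, per 2.1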

2.1.3 Install the cluster with KubeKey

./kk create cluster -f config.yaml --with-kubernetes v1.29.10
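
Once the reinstall finishes, a quick sanity check that the control plane is back (standard kubectl commands, not part of the original notes):

kubectl get nodes -o wide
kubectl get pods -n kube-system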

2.2 Solution 2 - Restore from a backup

Reference: "ETCD 备份还原实战" (hands-on etcd backup and restore) - KubeSphere developer community

Clusters deployed with KubeKey automatically back up etcd every day; the backups live under /var/backups/kube_etcd/.

2.2.0 Take a VM snapshot before proceeding

2.2.1 Stop services

systemctl stop kubelet
systemctl stop docker.service
systemctl stop docker.socket
systemctl stop etcd

2.2.2 Back up the current etcd data

mkdir -p /var/lib/etcd_bak
cp -a /var/lib/etcd/* /var/lib/etcd_bak/

2.2.3 Get the current etcd parameters

For a cluster deployed with KubeKey, the etcd parameters are stored in /etc/etcd.env.

cat /etc/etcd.env

The main parameters needed are listed below (a quick way to extract them follows the list):

  • ETCD_NAME
  • ETCD_DATA_DIR
  • ETCDCTL_ENDPOINTS
  • ETCD_INITIAL_CLUSTER
  • ETCD_INITIAL_CLUSTER_TOKEN
  • ETCD_INITIAL_ADVERTISE_PEER_URLS
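
A small convenience sketch (not in the original post) to pull just these values out of the env file:

grep -E '^(ETCD_NAME|ETCD_DATA_DIR|ETCDCTL_ENDPOINTS|ETCD_INITIAL_CLUSTER|ETCD_INITIAL_CLUSTER_TOKEN|ETCD_INITIAL_ADVERTISE_PEER_URLS)=' /etc/etcd.env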

2.2.4 Restore the etcd backup

Delete the existing etcd data, then restore the etcd backup using the parameters obtained in the previous step.

rm -r /var/lib/etcd/*
etcdctl snapshot restore /var/backups/kube_etcd/etcd-<DATE>/snapshot.db \
--name=<ETCD_NAME> --endpoints=<ETCDCTL_ENDPOINTS> \
--initial-cluster=<ETCD_INITIAL_CLUSTER> \
--initial-advertise-peer-urls=<ETCD_INITIAL_ADVERTISE_PEER_URLS> \
--initial-cluster-token=<ETCD_INITIAL_CLUSTER_TOKEN> \
--data-dir=<ETCD_DATA_DIR>

For example:

cat /etc/etcd.env
# Environment file for etcd v3.5.13
ETCD_DATA_DIR=/var/lib/etcd
ETCD_ADVERTISE_CLIENT_URLS=https://10.180.244.201:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.180.244.201:2380
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_METRICS=basic
ETCD_LISTEN_CLIENT_URLS=https://10.180.244.201:2379,https://127.0.0.1:2379
ETCD_INITIAL_CLUSTER_TOKEN=k8s_etcd
ETCD_LISTEN_PEER_URLS=https://10.180.244.201:2380
ETCD_NAME=etcd-master1
ETCD_PROXY=off
ETCD_ENABLE_V2=true
ETCD_INITIAL_CLUSTER=etcd-master1=https://10.180.244.201:2380
ETCD_ELECTION_TIMEOUT=5000
ETCD_HEARTBEAT_INTERVAL=250
ETCD_AUTO_COMPACTION_RETENTION=8
ETCD_SNAPSHOT_COUNT=10000

# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-master1.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-master1-key.pem
ETCD_CLIENT_CERT_AUTH=true

ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-master1.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-master1-key.pem
ETCD_PEER_CLIENT_CERT_AUTH=true

# CLI settings
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-master1-key.pem
ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-master1.pem
etcdctl snapshot restore /var/backups/kube_etcd/etcd-2025-06-26-02-00-00/snapshot.db \
--name=etcd-master1 \
--endpoints=https://10.180.244.201:2379 \
--initial-cluster=etcd-master1=https://10.180.244.201:2380 \
--initial-advertise-peer-urls=https://10.180.244.201:2380 \
--initial-cluster-token=k8s_etcd \
--data-dir=/var/lib/etcd
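
The startup logs earlier also warned that /var/lib/etcd was drwxr-xr-x while 0700 is recommended. After the restore recreates the data directory, tightening the mode silences that warning and keeps the data private (a small extra step, not in the original notes):

chmod 700 /var/lib/etcd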

2.2.5 Start services

Start etcd:

systemctl start etcd

Check the etcd status:

export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-master1.pem
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-master1-key.pem
export ETCDCTL_ENDPOINTS=https://10.180.244.201:2379
etcdctl endpoint status -w table
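
etcdctl endpoint health is another quick check worth running next to the status table, with the same environment variables exported above:

etcdctl endpoint health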

Start the other services:

systemctl start docker.service
systemctl start docker.socket
systemctl start kubelet
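
With kubelet back, node and workload state can be verified from the master; standard kubectl commands, assuming the kubeconfig survived:

kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running   # anything listed here still needs attention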

3. VM fails to power on: "The redo log of XXX.vmdk is corrupted. If the problem persists, discard the redo log."

After the VM powered on, while the guest was running fsck on the disk, ESXi popped up this error and powered the VM off automatically.

Solution

3.1 Back up the VM

Datastore browser > copy and paste the VM files.

3.2 Delete all snapshots

Delete the snapshots and consolidate the disks.

3.3 Power on normally

Optional: delete the backup.

4. Pod creation fails after the K8s recovery

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod “XXX”: operation timeout: context deadline exceeded

Solution

  1. Check that NFS is mounted on every node: if the mounts are done at boot via /etc/fstab, the simplest fix is to start the NFS server VM first and then reboot the other VMs so the mounts come up automatically;
  2. Check the network: if pulls of external images fail, check the proxies entry in /etc/docker/daemon.json on each node, confirm the proxy process is running on the corresponding node, and confirm the node can reach it (see the checks sketched after this list).
  3. Restore Harbor first: services that use images from the private registry can only pull them after the Harbor service is back.
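
A few quick checks for items 1 and 2, sketched as shell commands (paths follow this post; adjust to your nodes):

# 1. Are the NFS shares mounted? Once the NFS server VM is up, retry the fstab entries.
findmnt -t nfs,nfs4 || echo "no NFS mounts found"
mount -a

# 2. Is the image-pull proxy configured and reachable?
grep -i proxy /etc/docker/daemon.json
docker info 2>/dev/null | grep -i proxy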
