[05:19:15] I just 'resolved' a tools nfs outage, see T380827
[05:19:36] T380827: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827
[05:19:38] Not sure if this is an ongoing issue or not; I don't see any related tickets.
[05:20:43] Going to sleep soon but I'll check in ~30 to see that it's not recurring
[07:06:55] arturo: please lmk when you're online
[07:07:06] well, or anyone really :)
[07:14:23] andrewbogott: is the nfs issue still happening?
[07:14:41] thanks for fixing, it must be late for you!
[07:14:51] I don't think so, but as soon as I started to reboot k8s-nfs workers, the jobs api failed
[07:15:04] prior to that it was failing intermittently but now it's just entirely down
[07:15:20] Shouldn't that just be 'webservice restart' as the 'jobs' tool?
[07:16:02] do you mean that jobs are failing, or that the jobs-api is down?
[07:16:35] the api
[07:16:41] at least
[07:16:58] some actual things were failing too due to dns issues, but I'm hoping that will be resolved with the reboots
[07:17:07] yeah I see it now
[07:17:16] the jobs tool itself claims to be fine
[07:17:25] https://www.irccloud.com/pastebin/2TqF0h1a/
[07:17:35] which makes me think that's a decoy from the actual jobs api
[07:17:48] restarting the jobs-api deployment
[07:18:07] ok :) tell me more?
[07:18:11] seems to be coming up ok now
[07:18:37] I've been reading https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Jobs_framework which says 'jobs-framework-api (code) --- uses flask-restful and runs inside the k8s cluster as a webservice.'
[07:19:09] it's a k8s component
[07:19:23] oh no, it's crashing again
[07:19:41] I'm looking at it via k9s on control-9
[07:19:53] so is 'runs inside the k8s cluster as a webservice' just not true, or am I misunderstanding that somehow?
[07:20:13] no, it's a helm deployment
[07:20:59] the pods running on tools-k8s-worker-nfs-27,37 are crashing, the one on -71 seems okay
[07:21:44] checking the logs now
[07:22:34] https://www.irccloud.com/pastebin/VSOuhxIa/
[07:23:44] blancadesal: T380832
[07:23:54] T380832: jobs-api crashing - https://phabricator.wikimedia.org/T380832
[07:24:14] ok, that nameservice failure is what had me rebooting worker nodes in the first place
[07:24:43] I absolutely cannot get 'dig' or 'host' or 'nslookup' to fail anyplace, so I don't have any theory about where those dns failures are coming from really
[07:25:07] but I'm still in the middle of the reboot cookbook, don't know if 27,37 are already rebooted
[07:25:18] even the one pod that isn't crashing is having issues
[07:25:25] https://www.irccloud.com/pastebin/tzno9z61/
[07:25:58] 027 was already rebooted
[07:26:20] ok, so there's still some DNS brokenness at the heart of this
[07:26:32] seems like it
[07:26:43] I wonder if that's related to the dns outage from 24 hours ago?
[07:27:02] Can you reproduce a dns failure from /outside/ of a container?
[07:27:31] how would I try that?
[07:28:43] the one running on -27 seems ok now, it's handled the requests it's got from tools ok since the restart
[07:29:19] I would just like to see 'dig tools-harbor.wmcloud.org' or 'host tools-harbor.wmcloud.org' fail on a cli someplace so that I can reproduce and investigate
[07:30:10] Ok, all nfs workers are rebooted now
[07:31:49] the pod on -37 is the only one still crashing. I've deleted it to force a restart on another node
[07:32:01] ok
[07:32:34] so no more dns failures? Maybe the reboots really did fix that...
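For context, a minimal sketch of the kind of repeated lookup andrewbogott asks for above, to catch the intermittent failures from a bastion or worker shell; the hostname comes from the conversation, everything else is illustrative:

# Repeat the lookup and log only the failures; intermittent DNS errors
# rarely show up on a single manual run.
while true; do
  host tools-harbor.wmcloud.org >/dev/null 2>&1 || echo "$(date -Is) lookup failed"
  sleep 2
done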
[07:32:36] * andrewbogott can hope
[07:33:25] it was rescheduled on -1, it works now
[07:34:36] welp
[07:34:51] seems weird that there were no alerts during all this
[07:35:37] I'm rebooting some tool deployments that seem to have got stuck
[07:36:25] some also had dns errors: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "mxumxbmjinr.svc.trove.eqiad1.wikimedia.cloud"
[07:36:37] OK. If this was indeed a long-running dns issue resolved by reboots, then we should reboot the non-nfs worker nodes too.
[07:37:01] Not sure we have a cookbook for non-nfs nodes though, maybe just for all nodes
[07:37:11] which, probably it's fine to reboot the nfs nodes again
[07:37:43] But I'd like someone else to do that so I can sleep :)
[07:39:56] wmcs.toolforge.k8s.reboot ?
[07:40:45] yeah, with either --all-workers (which, I mostly understand what that will do) or --all (which seems scary but is maybe necessary)
[07:41:57] if you want to go to bed, I can monitor until someone else comes online, then we can try rebooting
[07:42:13] ok
[07:42:23] I should probably sleep, thanks for sorting out the jobs api!
[07:43:04] good night :)
[07:46:58] \o hi
[07:48:49] blancadesal: is this still an issue? what's the status?
[07:49:10] morning
[07:49:17] it seems to still be an issue, yes
[07:49:35] the jobs-api is ok now, but some tools are still failing with dns errors
[07:49:56] ack, did you restart coredns?
[07:50:02] andrew suggested we try rebooting the nodes, it seems to have worked with the nfs nodes
[07:50:10] re coredns: I don't think so
[07:51:37] when nfs is stuck, restarting the worker node is the only way yes
[07:52:26] I don't see any more workers stuck
[07:52:44] (or at least none showing D state processes in https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&from=now-30m&to=now)
[07:53:19] the dns problem might be a different issue? or is it related to the nfs failure?
[07:54:41] I have not seen them happen at the same time before, so my guess is it's a different issue
[07:56:43] I'm running the functional tests from one of my tools
[07:58:27] maintain-kubeusers is down too, looking
[07:58:58] I see it up though :/
[08:00:17] https://www.irccloud.com/pastebin/8RbeYErk/
[08:03:13] it might be on the ingress side of things, it seems to be failing to connect sometimes
[08:04:04] looking into the haproxies
[08:04:59] i'm still manually restarting some deployments
[08:05:18] it's at the api-gateway level, the nginx was having issues
[08:05:29] oh :(
[08:05:31] │ 2024/11/26 08:03:57 [error] 22#22: *317184 jobs-api.jobs-api.svc.tools.local could not be resolved (110: Operation timed out) while sending to client, client: 192.168.254.192, server: , request: "POST /jobs/v1/tool/zoomviewer/jobs/ HTTP/1.1", host: "api.svc.tools.eqiad1.wikimedia.cloud:30003" │
[08:05:36] dns issues, restarting
[08:05:55] on the tools-k8s-ingress-7 node (for the record)
[08:06:26] okok, now it seems to work all the time :/
[08:06:35] running tests again
[08:06:41] many tools crashed due to dns failures
[08:06:56] do you know if they are crashing on the same set of workers?
[08:07:12] (trying to see if it's the VM/worker misbehaving, or just the pod)
[08:07:33] no, but I will check from now on
[08:07:40] ack
[08:09:28] blancadesal: how are you discovering the deployments that need restarting?
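The Grafana panel referenced above counts processes stuck in uninterruptible sleep (D state), the usual symptom of a hung NFS mount. A rough equivalent, run directly against a worker (the host name is just an example):

# Any process stuck in D state for a long time usually points at a hung NFS mount.
ssh tools-k8s-worker-nfs-27.tools.eqiad1.wikimedia.cloud \
  "ps -eo state,pid,cmd | awk '\$1 == \"D\"'"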
[08:10:11] going through the deployments in k9s, looking at the logs/events to see what the issue is
[08:13:46] I can resolve dns from within some other containers on ingress-7, so maybe not all pods are borked on the node (might be pod-specific)
[08:17:39] hmm, probably unrelated but there are quite a few tools that are failing to pull from harbor simply because the image doesn't exist
[08:17:58] one of those is my nodejs sample tool, and I know for sure the image should exist
[08:18:27] in some cases, the project in harbor doesn't even exist
[08:18:36] this is the case for my tool
[08:18:58] hmm, that might be a maintain-harbor issue
[08:18:59] :/
[08:19:16] can you open a task with the tool? we can look into it later
[08:19:28] ok
[08:19:54] can you let me know if you find one of the pods that does not work? I want to try debugging it a bit
[08:20:56] oh, a build failed
[08:20:59] [step-clone] 2024-11-26T08:13:17.330085261Z {"level":"fatal","ts":1732608797.3299356,"caller":"git-init/main.go:54","msg":"Error fetching git repository: failed to fetch []: exit status 128","stacktrace":"main.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:54\nruntime.main\n\truntime/proc.go:250"}
[08:21:21] Could not resolve host: gitlab.wikimedia.org\n"
[08:23:41] so it's not something that happened and then stopped happening (and things just need a restart), it's still creating pods that fail
[08:23:48] tracked it to node tools-k8s-worker-nfs-61
[08:26:52] tool-quickstatements/bot-65b6cfd8cc-445p2 is one such case – it failed with dns errors right now when attempting to restart
[08:27:19] on nfs-72
[08:29:26] I see one on nfs-61 too
[08:29:27] :/
[08:29:59] exec-ing into a container and running a small python script to do a request fails with name resolution on an existing pod, let me check the others
[08:32:10] other pods fail too, I'll reboot that worker
[08:32:16] (61)
[08:35:32] i'm also seeing quite a few "Warning BackOff 2m40s (x413 over 92m) kubelet Back-off restarting failed container job in pod..." always on nfs-workers (not necessarily the same one). not sure if related.
[08:36:01] example rustbot-cbcd8db7f-pfw84
[08:36:46] does it say anything more about why it failed?
[08:37:06] no :/
[08:38:18] would there be anything in the kubelet logs on the worker?
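A minimal sketch of the in-pod check described above (exec into the suspect pod and try name resolution); the pod and namespace names are the ones from the conversation, and the image is assumed to ship python3, with getent as a fallback:

# Test name resolution from inside the failing pod itself.
kubectl -n tool-quickstatements exec -it bot-65b6cfd8cc-445p2 -- \
  python3 -c "import socket; print(socket.gethostbyname('wikimedia.org'))"
# If the image has no python3, getent exercises the same resolver path:
kubectl -n tool-quickstatements exec -it bot-65b6cfd8cc-445p2 -- getent hosts wikimedia.org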
[08:39:48] maybe the return code
[08:40:34] nfs-72 is even failing to pull images :/
[08:40:40] "tools-harbor.wmcloud.org/toolforge/library-bash:5.4.1": failed to resolve
[08:43:30] the tag was wrong also xd
[08:43:56] with this we can check a specific node
[08:44:00] node=tools-k8s-worker-nfs-72; kubectl run --image=tools-harbor.wmcloud.org/toolforge/library-bash:5.1.4 --overrides='{"apiVersion": "v1", "spec": {"nodeSelector": { "kubernetes.io/hostname": "'$node'" }}}' $node-nettest --command --attach -- nslookup wikimedia.org
[08:44:33] putting it in a loop
[08:46:08] https://www.irccloud.com/pastebin/mAthDNmL/
[08:46:52] that does not help much :/
[08:50:04] oh, kyverno started timing out too
[08:50:07] Error from server (InternalError): Internal error occurred: failed calling webhook "mutate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/mutate/fail?timeout=10s": context deadline exceeded
[08:51:53] ohh, I just saw that one too
[08:52:26] while trying to delete a pod to see if getting it rescheduled on another worker would help
[08:55:53] well nope, it's having the same issues on nfs-32 as on nfs-14
[08:56:54] nfs-14 is resolving ok :/
[08:57:03] https://www.irccloud.com/pastebin/aLw94fTf/
[08:57:11] 17 is not
[08:57:14] https://www.irccloud.com/pastebin/yVGzwdf0/
[08:58:29] blancadesal: did you find any non-nfs worker with issues?
[08:59:07] oh, 17 is now passing, unfortunate
[08:59:16] just right now: greetbot on -107 is having a similar backoff issue. all the others I saw were on nfs
[09:00:34] that is tool-dewikigreetbot
[09:00:37] hello, I just got in the laptop
[09:00:39] how can I help?
[09:01:30] blancadesal: tool-dewikigreetbot ?
[09:01:37] yes
[09:02:12] arturo: there's random (not all the time, not all in the same worker, not all in the same pod) DNS resolution errors around the k8s cluster pods (only seen inside pods so far)
[09:02:29] also tool-congressedits on -108
[09:02:36] dcaro: ok, thanks, did you check coredns pods already?
[09:02:47] blancadesal: the last status change for that is from 2020
[09:03:28] dcaro: to clarify, those are 'Back-off restarting failed' errors, not sure if related to dns
[09:04:05] all those that I restarted that were having dns issues seem fine now
[09:04:13] blancadesal: ack, that one seems to me expected (as in it was not working due to script failing/etc)
[09:05:24] https://www.irccloud.com/pastebin/r3xZpHUx/
[09:05:32] from the pod
[09:05:44] ack
[09:06:28] I have a silly script to run nslookup inside a pod on all the worker nodes:
[09:06:35] https://www.irccloud.com/pastebin/xfY50yQ4/
[09:07:03] but I get intermittent failures :/
[09:08:19] I will reboot coredns pods
[09:09:22] done
[09:09:24] Error from server (InternalError): Internal error occurred: failed calling webhook "mutate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/mutate/fail?timeout=10s": context deadline exceeded
[09:09:26] another
[09:11:12] https://www.irccloud.com/pastebin/KXXX4aFF/
[09:11:15] still failing
[09:11:30] is it always nfs workers?
[09:12:29] so far yes
[09:13:17] I'm rebooting the workers I find, let's see if that makes them stable
[09:13:28] the api-servers are struggling to contact etcd
[09:13:29] W1126 08:43:26.669757 1 logging.go:59] [core] [Channel #233841 SubChannel #233842] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }.
Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.3.135:42034->172.20.255.1:53: i/o timeout"
[09:14:16] well, it's the same DNS issue
[09:14:41] but etcd being unavailable most likely explains everything else
[09:14:43] yep dns :/
[09:15:25] if rebooting a node fixes the dns issues on it, we can reboot the control nodes
[09:16:17] nope, worker-nfs-50 still failing after reboot
[09:18:03] apiserver is also timing out on kyverno calls
[09:18:04] │ W1126 09:09:15.545291 1 dispatcher.go:225] Failed calling webhook, failing closed mutate.kyverno.svc-fail: failed calling webhook "mutate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/mutate/fail?timeout=10s": context deadline exceeded │
[09:18:17] do we need to declare an incident, etc?
[09:18:39] yes please
[09:20:26] ok, doing
[09:21:03] how can I help?
[09:21:24] kube-apiserver-tools-k8s-control-8 is having lots of trouble
[09:22:28] https://docs.google.com/document/d/1g485z-mX9Y9rN1ajY0lSOBP3tsyPJO5ZjL7xosgCeKo/edit?tab=t.0#heading=h.95p2g5d67t9q
[09:24:43] arturo: do you want help with the incident doc so that you can be free to troubleshoot?
[09:25:02] blancadesal: you can be incident coordinator if you want!
[09:25:11] supporting docs are https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_response_process
[09:25:23] arturo: ok
[09:25:31] ok, thanks
[09:25:38] I am the IC
[09:25:56] hm, the dns resolving issues started sunday
[09:26:28] I will send an email to cloud-announce
[09:26:45] dcaro: ok, I'll see if I can match any automated package upgrade or something
[09:27:14] could it be the nameservers puppet issue fallout? (the nameservers got emptied on the host, then new pods started getting it from the host, ...)
[09:27:25] dcaro: yes, it could be
[09:27:52] looking
[09:27:59] new pods sometimes fail though :/
[09:28:39] I'm scanning pdns recursor logs
[09:28:48] two consecutive runs on the same worker
[09:28:50] (new pods)
[09:28:56] https://www.irccloud.com/pastebin/t2VTCgTP/
[09:30:03] dcaro: so you think this is a worker-level problem, i.e. on the VM?
[09:30:45] not sure, as it's intermittent on the worker
[09:31:15] oh, it's also intermittent inside the same pod
[09:31:21] https://www.irccloud.com/pastebin/G9dvF4G2/
[09:31:31] that points to the DNS servers maybe?
[09:31:43] yeah
[09:31:58] let me force all dns traffic into one of the servers and see what happens?
[09:32:27] ack
[09:32:55] done, stopped pdns-recursor on cloudservices1005
[09:33:22] I see traffic on cloudservices1006
[09:34:39] still hanging, let me increase the cpu request for the coredns pods, they are at ~140% of the current one
[09:34:49] ok
[09:36:31] oh, their cpu usage spiked now, still using ~2x what I set, I'll increase more
[09:37:14] set to 500m
[09:37:42] maybe give 2 full cpus to each coredns and see what happens?
[09:38:04] scratch that, I see they have no limits
[09:38:21] still hanging
[09:39:12] should I scale it down to 1 replica, then add new ones bit by bit?
[09:39:23] (might be a node misbehaving somehow?)
[09:39:48] I had re-created the coredns pods earlier, including rescheduling, with no effect
[09:40:27] I will fail over the pdns-recursor to cloudservices1005
[09:40:29] that still spreads them throughout many nodes, no?
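The pastebins above are not preserved; a sketch of what the per-node nslookup script mentioned earlier could look like, built around the single-node command quoted in the conversation (the image, overrides and pod naming come from the transcript, while the node listing and the --rm/--restart flags are additions):

# Spin up a short-lived pod pinned to each worker and run one lookup from it.
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $node"
  kubectl run "$node-nettest" --rm --restart=Never --attach --command \
    --image=tools-harbor.wmcloud.org/toolforge/library-bash:5.1.4 \
    --overrides='{"apiVersion": "v1", "spec": {"nodeSelector": { "kubernetes.io/hostname": "'$node'" }}}' \
    -- nslookup wikimedia.org || echo "lookup failed on $node"
done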
[09:41:03] yeah, try again, maybe it is different this time
[09:42:08] I had briefly stopped pdns-recursor on both nodes to verify the service is indeed failing over
[09:42:18] pdns-recursor is now running on cloudservices1005
[09:42:52] querying the dns servers 8.8.8.8 and 172.20.255.1 directly does not have issues
[09:43:08] using the coredns ones still hangs sometimes
[09:43:22] maybe scale up the replicas
[09:43:53] give it 8 replicas and see if that makes any difference
[09:44:00] done
[09:44:47] also, the memory limit should be lifted
[09:44:55] I'm reading some docs about it being a scaling factor
[09:45:02] it's nowhere near it though
[09:45:12] ok
[09:45:15] (also doubled it before)
[09:45:19] just in case
[09:45:44] I'm reading this https://github.com/coredns/deployment/blob/master/kubernetes/Scaling_CoreDNS.md
[09:45:45] still having issues :/
[09:47:49] we have ~4000 services+pods
[09:49:07] according to that we should need <100M, we have 170M currently (and pods seem to be ~50-60 after the restarts)
[09:50:01] ok
[09:51:15] I'll check calico for problems
[09:55:15] mmm
[09:55:30] some of the calico-node pods seem to be rebooting
[09:56:17] I'm testing now to resolve on each of the coredns pods from within itself (kubectl debug)
[09:58:22] they seem to be replying without issues :/ (tested 3/8 so far)
[09:58:48] I see a bunch of messages like this from calico-node
[09:58:49] calico-node 2024-11-26 09:56:05.608 [WARNING][82] felix/table.go 680: Chain had unexpected inserts, marking for resync actualRuleIDs=[]string{"", "0i8pjzKKPyA34aQD"} chainName="POSTROUTING" ipVersion=0x4 table="nat"
[09:58:57] and I think that might be concerning
[09:59:38] if the NAT config is changing constantly, that might explain the intermittent behavior we are seeing
[10:00:47] ack, what other things would that affect? (webhooks/any internal service/external services?)
[10:01:01] kube-proxy maybe
[10:03:22] hmm, there seem to be connectivity issues between pods
[10:03:42] https://www.irccloud.com/pastebin/MAlyuZtZ/
[10:04:01] that could also be explained by a calico failure
[10:04:13] from pod/coredns-fcdb7d5f5-kpm96, I can't reach those two other coredns pods, but I can reach others
[10:04:29] (and that's not flaky so far, as in, no success for those ips, 100% success for the others)
[10:04:39] I just confirmed that calico is flushing/recreating the ruleset
[10:04:47] reason unknown, but definitely not good
[10:05:10] I think we should focus on calico
[10:05:36] to confirm what I just saw
[10:05:41] calico is using iptables
[10:05:49] but on these nodes, iptables is actually nftables
[10:06:08] so if you install `apt-get install nftables`, you get the `nft` binary
[10:06:22] and if you run `nft monitor` you will see the ruleset being deleted/created by calico
[10:06:40] why don't we run calico in nftables mode directly?
[10:06:43] 192.168.173.104 -> 192.168.57.85 traffic (udp 53) is completely not working
[10:08:35] and to 192.168.42.159 (cloudcontrol-8 and -9)
[10:10:22] hmm, though the coredns deployment has
[10:10:28] https://www.irccloud.com/pastebin/c6SjL56z/
[10:11:50] that's the toleration xd, that enables it to run there
[10:12:27] I will set `FELIX_IPTABLESBACKEND=nft` in the calico configmap
[10:12:29] arturo: is there any difference between control 8 and 9, and 7?
[10:12:41] dcaro: what do you mean?
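For the record, the check described above (watching the ruleset churn and the felix resync warnings) looks roughly like this; run as root on an affected node, the calico-node namespace is an assumption and the grep string comes from the log line quoted above:

# Watch netfilter rule changes live; constant delete/create churn in the
# nat table is what was observed during the incident.
apt-get install -y nftables
nft monitor

# In parallel, look for the felix resync warnings in the calico-node logs:
kubectl -n kube-system logs ds/calico-node -c calico-node --since=10m | grep 'marking for resync'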
[10:13:02] 8 and 9 do not get traffic from nfs workers (coredns traffic), while 7 does
[10:13:18] (generalizing, in the test I did)
[10:13:43] I don't know
[10:15:40] I'll wait for you to do the changes you want to do, I'll reboot the control nodes if that does not work
[10:16:27] you can reboot them while I figure out how to inject the config I want
[10:16:46] ack
[10:20:28] good morning, do you have a task in Phabricator for the ongoing toolforge/network/dns issue? :)
[10:20:47] I'd like to mark it as a blocker of the mw train which somehow relies heavily on toolforge
[10:20:51] T380844 I think (blancadesal just created)
[10:21:01] T380844: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844
[10:21:04] great thank you
[10:22:00] I'll send an announcement to wikitech-l
[10:22:19] thanks
[10:25:20] after the reboot, I can nslookup directly to all of them
[10:25:31] checking the service ip
[10:27:39] so far looking good
[10:27:56] (manually tested a pod as I did before, and it did not fail anymore)
[10:28:11] one failure:
[10:28:14] https://www.irccloud.com/pastebin/axHMOQce/
[10:28:34] at least on control-7, I still see calico recreating the ruleset
[10:29:09] I did not reboot control-7
[10:29:37] let me do it just in case
[10:30:15] ok
[10:30:35] no other failures though
[10:31:07] ok, the ruleset refreshes I see are mostly from kube-proxy I believe
[10:31:46] oh, the cookbook crashed
[10:31:47] TypeError: RemoteHosts.wait_reboot_since() got an unexpected keyword argument 'tries'
[10:32:28] weird, it did not crash before (and I did not change anything in between, pip/package/venv/..., same shell I ran it on)
[10:34:46] same from cloudcumin o.O, weird
[10:36:15] blancadesal: do you see any dns issues still happening? (new ones since the control reboots)
[10:36:25] dcaro: let me check
[10:37:28] ooohhh, the cookbook is trying to reboot the node it uses for kubectl stuff
[10:40:59] dcaro: i'm not seeing any crashed deployments that are due to dns right now
[10:41:13] ack, let me know if you do, I can't reproduce the issues either
[10:44:13] ok
[10:44:16] I won't do any more changes if the cluster is stable now
[10:45:04] I'm suspicious that it might be related to the calico issues you mentioned, and that eventually it will degrade again
[10:45:49] I wanted to do some calico changes, but my spider-sense warned me
[10:45:58] playing with the calico config might need further testing
[10:46:05] any ideas why traffic between nodes would not be working? could that be related to the nat tables?
[10:46:09] so, we may play in lima-kilo / toolsbeta first
[10:46:33] that was not temporary, it was sustained, so it was not just the recreation of the table
[10:46:47] (maybe a lost nat entry?)
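A sketch of the kind of per-endpoint probe used above to see which coredns pods actually answer, bypassing the service VIP; the kube-dns service/endpoints name and namespace are assumptions, and this needs to run somewhere with a route to the pod network (e.g. a cluster node):

# Query every coredns pod IP directly to spot endpoints that never answer.
for ip in $(kubectl -n kube-system get endpoints kube-dns \
              -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  echo "=== $ip"
  dig +time=2 +tries=1 +short @"$ip" wikimedia.org || echo "no answer from $ip"
done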
[10:47:17] I don't know enough about how calico and kube-proxy operate
[10:47:27] all I know is that I see the ruleset being recreated
[10:47:30] in a loop
[10:47:34] dcaro: i'm looking at job pods too, but all the dns errors I'm seeing so far are from ~3-4 hours ago
[10:47:39] and that warning by calico is still ongoing
[10:47:54] calico-node 2024-11-26 10:45:54.480 [WARNING][87] felix/table.go 680: Chain had unexpected inserts, marking for resync actualRuleIDs=[]string{"", "0i8pjzKKPyA34aQD"} chainName="POSTROUTING" ipVersion=0x4 table="nat"
[10:48:10] blancadesal: ack, I'll start running the functional tests in a loop again (using my tool, to avoid collisions)
[10:53:02] went through all the crashed pods, confirming all the dns errors i'm seeing are from before the control plane reboot
[10:54:45] ack, awesome
[10:55:05] functional tests are passing so far, and the dns resolution tests per-worker are passing too
[10:56:40] ok
[11:04:31] the ruleset recreation loop is by kube-proxy, not by calico
[11:08:27] I'm not getting any more errors, I think we can declare the incident over
[11:09:50] ok
[11:10:13] ok!
[11:10:39] I'll send updates to the mailing lists
[11:10:40] I think there was a calico related incident in the wiki k8s cluster not long ago
[11:12:31] https://docs.google.com/document/d/1w5x8-_KTEzL1ARuyVeROlslAMNI68g1JRqVyqGKr_jY/edit?tab=t.0#heading=h.95p2g5d67t9q
[11:16:32] fix for the cookbook issue https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1097991
[11:18:15] I am resolving the incident
[11:18:47] I'll go for lunch, then refine the incident doc
[11:18:53] thanks
[11:19:13] feel free to add details/fix inaccuracies, etc!
[11:21:05] * dcaro going for lunch too
[11:21:09] thanks both!
[11:25:11] thank you all of you :)
[11:35:59] all this is probably my fault because just last week I was thinking "wow, we haven't really had any incidents in a while" 😅
[12:02:18] heh
[12:29:24] thanks blancadesal arturo dcaro for handling the incident! I had to catch up on sleep this morning and I only opened IRC when the incident was already over :)
[12:29:40] that's ok :-)
[12:30:00] blancadesal: were the IC docs clear enough? any suggestions on what could be improved?
[12:30:56] one thing on my mind is that we often get asked for a phab link (as it happened today), but the incident doc doesn't mention explicitly to create one.
[12:32:00] dhinus: +1 for being explicit about creating an incident task in phab
[12:32:43] one thing that both T380827 and T380844 may have in common as a potential source of problems is the openstack virtual network
[12:32:44] T380827: tools-nfs outage 2024-11-25 - https://phabricator.wikimedia.org/T380827
[12:32:44] T380844: 2024-11-26 Toolforge DNS incident - https://phabricator.wikimedia.org/T380844
[12:33:18] dhinus: I did skim through the docs several times to make sure I wasn't missing anything major, but other than that I was mostly winging it so might still have missed something. But that wouldn't be the fault of the docs :))
[13:05:11] incident report draft: https://wikitech.wikimedia.org/wiki/Incidents/2024-11-26_WMCS_Toolforge_k8s_DNS_resolution_failures
[13:05:56] I'll fill out as much as I can, then delegate to you arturo and dcaro
[13:06:04] thanks
[13:25:49] dcaro, arturo: I think you might have already done so, but is there something you'd want to add/change in the timeline before I copy the relevant parts over to the report?
[13:26:04] to the google doc?
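The sweep described above (checking whether any DNS failure postdates the control-plane reboot) can be approximated with something like the following; the error strings and the approach are assumptions, not the exact commands used:

# List pods that are not Running/Succeeded and grep their recent logs for
# name-resolution errors.
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded \
  -o custom-columns=NS:.metadata.namespace,POD:.metadata.name --no-headers |
while read -r ns pod; do
  if kubectl -n "$ns" logs "$pod" --all-containers --tail=50 2>/dev/null \
       | grep -qiE 'could not resolve|name resolution|no such host'; then
    echo "dns-looking errors in $ns/$pod"
  fi
done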
[13:26:04] I have changed a few bits
[13:26:15] dcaro: the gdoc yes
[13:26:22] ack, looking
[13:26:33] from the timeline, it is not clear to me _when_ the DNS outage started
[13:27:35] we detected it while dealing with the nfs issue – it might have started before that though
[13:28:05] I suspect they may be two effects of the same underlying problem
[13:28:47] first relevant user report in -cloud is from 3:58 utc
[13:31:58] from the logs the dns issues started earlier, around the same time as the nfs issues
[13:32:16] let me see if I can find the logs in my scroll (from some of the pods)
[13:32:43] restarting so many of the pods lost a bit of the history :/
[13:32:54] yeah
[13:33:53] if we can find some logs that are in a ~30m time range in the NFS outage, we could say they are both the same
[13:34:06] for the DNS issue some CI jobs failed on Monday around 09:43:35 UTC ( https://phabricator.wikimedia.org/T374830#10352082 ) but it looks like it was due to `nameserver` being removed from resolv.conf or something similar
[13:37:24] this is from the 24th
[13:37:28] https://www.irccloud.com/pastebin/OdUv5Wtx/
[13:37:43] the 25th
[13:37:47] from that task, I also found out the beta cluster keeps making requests to commons.wikimedia.org, apparently frequently enough to catch the issue from time to time
[13:37:47] https://www.irccloud.com/pastebin/5BaDPzmc/
[13:38:38] the 26th
[13:38:42] https://www.irccloud.com/pastebin/WIm9rotk/
[13:39:15] those are timing out reading from the cloudvps dns though, not coredns
[13:39:26] https://beta-logs.wmcloud.org/goto/d1cbec87f734b84b7887aac5ad072f50 ( password is in /root of deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud )
[13:39:32] maybe that can give some info
[13:42:00] awesome, it shows things all over the place though
[13:42:24] (I mean, not that that's bad, just unexpected)
[13:42:51] there's definitely a spike around the time of the nfs outage
[13:45:34] Hii, I need a wikireplica in s8 depooled. Is this the most correct path to do it? https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#Depooling_using_conftool_on_cumin_hosts
[13:46:15] Amir1: yep
[13:46:17] (why I need it: [[T379724]])
[13:46:18] T379724: s8 replication on an-redacteddb1001 is broken - https://phabricator.wikimedia.org/T379724
[13:46:20] awesome
[13:46:22] TTYL
[13:48:47] if you just need one, please depool the "web" and not the "analytics" one
[13:49:13] (see the previous paragraph on that same page: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#web_vs_analytics_considerations)
[13:55:32] * arturo food
[14:02:30] dcaro: toolforge sync?
[14:03:05] 🤦‍♂️ coming
[14:58:04] I haven't had breakfast yet, but is there any followup needed for the dns issue last night?
[14:58:44] (My mental model when I went to bed last night was: "the network changes that caused the dns issue were reverted but left some latent bad caching/crashed agents on some VMs which is resolved by rebooting them" is that anywhere close to right?)
[15:02:32] no :-(
[15:02:58] it was the neutron.conf cleanup triggering an ovs restart, as far as my current theory goes
[15:04:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094471 ?
[15:04:40] I thought the dns issues preceded that but maybe there were different/multiple dns issues
[15:08:35] I'm catching up on the phab task now
[15:21:27] is there a way to flag as "ok" the kernel error logs already in the logs?
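To pin down when the resolution failures actually started (the open question above), one option, given that many pod logs were lost to restarts, is to scan the journal on a few workers for the earliest resolver errors; the host, units and search strings here are assumptions:

# Look for the earliest DNS-related errors logged by kubelet/containerd on a
# worker since the 24th; image pulls and probes log resolution failures.
ssh tools-k8s-worker-nfs-61.tools.eqiad1.wikimedia.cloud \
  "sudo journalctl --since '2024-11-24' -u kubelet -u containerd --no-pager \
     | grep -iE 'i/o timeout|could not resolve|no such host' | head -20"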
[15:21:36] T380877
[15:21:36] T380877: Kernel error Server cloudcephmon1004 may have kernel errors - https://phabricator.wikimedia.org/T380877
[15:22:03] there's no note in the runbook (I'll add one if there is)
[15:22:12] *there is a way
[15:23:44] dcaro: I don't think there is any
[15:24:07] the system for detection is a bit all-or-nothing
[15:26:10] we could have the detection script read a config file with regexes of lines to ignore
[15:26:16] and fill the file via hiera
[15:28:31] Added some notes for now here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic#Common_issues
[15:42:22] dcaro, arturo: do we have any action items/follow-up tasks for the incident?
[15:44:10] there is one cloudvirt node (cloudvirt1062.eqiad.wmnet) to be rebooted for https://phabricator.wikimedia.org/T380731, could one of you please take care of it?
[16:15:39] because we have a strong suspicion that the 2 outages today are related, I created T380882 as parent ticket for the other 2
[16:15:40] T380882: 2024-11-16 openstack network problems - https://phabricator.wikimedia.org/T380882
[16:23:59] I will wrap up https://wikitech.wikimedia.org/wiki/Incidents/2024-11-26_WMCS_Toolforge_k8s_DNS_resolution_failures tomorrow
[16:24:15] feel free to edit if you wish in the meantime
[16:24:45] I will create a couple of action item tickets now
[16:26:32] thank you
[16:41:18] sorry about leaving the meeting early! lmk if I need to catch up on anything
[16:41:26] moritzm: I will do the reboot if no one else has yet
[16:42:38] I have created T380886
[16:42:38] T380886: openstack: increase virtual network observability - https://phabricator.wikimedia.org/T380886
[16:43:21] moritzm: that host was reimaged, uptime 7 days. So should be good, I'll tick the box
[16:43:24] blancadesal: I have renamed the incident wikipage to https://wikitech.wikimedia.org/wiki/Incidents/2024-11-26_WMCS_network_problems
[16:47:22] ack
[16:53:49] andrewbogott: hey, cloudcephosd1001/2/3 can be decommissioned anytime, I had no time to do anything today, but if you have time (and want), you can give it a go :), note that I have never decommissioned a monitor yet, so it might not be documented what to clean up
[16:54:34] sure, is there a decom ticket? If not I can at least start with that much :)
[16:55:02] let me try to find it (I think there is, not 100% sure)
[16:57:09] hmpf... I think there's not :/
[16:57:22] (I created one for the setup of the new ones, not the decom of the old ones)
[16:57:41] andrewbogott: thanks!
[16:57:56] dcaro: link me to the creation ticket and I'll attach the decom to it
[16:57:57] thx
[16:58:13] this is the refresh T361363
[16:58:42] the task I created for the setting up is T374005
[16:58:42] thanks!
[16:58:43] T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements - https://phabricator.wikimedia.org/T374005
[16:59:04] gtg, thanks!
[16:59:05] * dcaro off
[17:01:05] * arturo offline
[17:29:26] I dumped some info from my gitlab-account-approval-bot tool into the incident doc. I think things were messed up from 2024-11-26T02:21:55Z through 2024-11-26T10:33:46Z.
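A minimal sketch of the ignore-list idea floated above, i.e. the detection script filtering kernel messages through a regex file that could be populated from Hiera; the file path and the exact filtering are made up for illustration:

# Drop known-benign kernel messages before deciding whether to alert.
# One regex per line in the ignore file; it may be empty but must exist.
IGNORE_FILE=/etc/kernel-errors.ignore
dmesg --level=err,crit,alert,emerg | grep -Evf "$IGNORE_FILE" | grep -q . \
  && { echo "unexpected kernel errors found"; exit 1; }
exit 0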
[18:02:48] thanks, that looks like the timeline yep
[18:25:48] I think alerts from prometheus.wmcloud stopped sending emails, I raised T380901
[18:25:48] T380901: prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901
[18:26:18] * dhinus offline
[20:56:00] T380833 seems pretty alarming
[20:56:00] T380833: [harbor] some artifacts and projects seems to have gone missing - https://phabricator.wikimedia.org/T380833