[06:44:01] * dcaro online
[06:44:08] <dcaro>	 looking at the backlog
[06:47:43] <blancadesal>	 morning :)
[06:51:37] <dcaro>	 morning ☕
[07:07:19] <dcaro>	 functional tests are not passing on tools, looking (read timed out when doing `build quota`)
[07:09:08] <dcaro>	 two failures in a row
[07:09:17] <dcaro>	 I'll open a task
[07:41:59] <dcaro>	 those are dns issues, so the patch to pdn did not solve it, looking
[08:42:55] <dcaro>	 oh, got a kyverno error
[08:42:57] <dcaro>	    Error from server (InternalError): Internal error occurred: failed calling webhook "mutate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/mutate/fail?timeout=10s": context deadline exceeded
[08:43:00] <dcaro>	 looking
[08:54:34] <dcaro>	 quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/38 to help debugging the openapi.json intermittent errors now happening on both tools and toolsbeta
[08:55:09] <blancadesal>	 dcaro: done
[08:55:16] <dcaro>	 thanks!
[09:00:15] <dcaro>	 hmmm
[09:00:16] <dcaro>	 | calico | chart | calico | calico-0.0.6-20230710081103-dcbbe692 | toolforge-deploy has calico-0.0.7-20240411190528-63954490 |                                                                                                                                                                                                                         
[09:00:19] <dcaro>	 on toolsbeta
[09:00:29] <dcaro>	 we have a calico < than the one in toolforge-deploy
[09:03:33] <dcaro>	 that's on tools, sorry
[09:06:55] <blancadesal>	 you mean tools never got whatever is the our most recent version deployed, only toolsbeta?
[09:09:55] <dcaro>	 yep
[09:10:02] <dcaro>	 toolsbeta has 0.0.7
[09:10:05] <dcaro>	 | calico | chart | calico | calico-0.0.7-20240411190528-63954490 |  |                                                                                                                                                                                                                                                                                  
[09:12:31] <dcaro>	 the change is a pre-commit autoupdate
[09:13:00] <blancadesal>	 it'd be cool to have a tool that's just a UI showing the output of toolforge_get_versions for tools and toolsbeta, (and maybe include kubeVersion where applicable)
[09:13:25] <dcaro>	 yep :)
[09:14:11] <dcaro>	 the main issue is that the info of the installed versions in only available for admins (helm list)
[09:14:19] <dcaro>	 so we would have to sort that out somehow
[09:19:39] <dcaro>	 got tekton down alert looking
[09:20:40] <blancadesal>	 everything seems a bit shaky today xd
[09:23:06] <dcaro>	 to get us in shape after pto xd
[09:44:32] <dcaro>	 hmpf... the tools prometheus instances are struggling with memory, and getting killed periodically
[09:59:01] <dcaro>	 dns is still an issue
[09:59:02] <dcaro>	    [step-clone] 2024-08-26T09:56:50.943727100Z {"level":"error","ts":1724666210.942934,"caller":"git/git.go:55","msg":"Error running git [fetch --recurse-submodules=yes --depth=1 origin --update-head-ok --force ]: exit status 128\nfatal: unable to access 'https://gitlab.wikimedia.org/toolforge-repos/sample-static-buildpack-app/': Could not resolve host: 
[09:59:02] <dcaro>	 gitlab.wikimedia.org\n","stacktrace":"github.com/tektoncd/pipeline/pkg/git.run\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:55\ngithub.com/tektoncd/pipeline/pkg/git.Fetch\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:150\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:53\nruntime.main\n\truntime/proc.go:255"}
[10:12:16] <dcaro>	 oh my, all the calico stuff is in a single humongous yaml again... it's going to be very painful to debug
[10:12:47] <_joe_>	 hi, dunno if you've seen https://phabricator.wikimedia.org/T373250 
[10:15:21] <dcaro>	 I had not, looking, thanks for the notice
[10:15:43] <_joe_>	 yeah i see it's been one of those mornings, heh
[10:15:44] <_joe_>	 :(
[10:15:56] <dcaro>	 first day after pto
[10:19:39] <dcaro>	 topranks: can you check the cloudswitch that was misbehaving to see if it's doing weird things again? we are having a bunch or semi-random network issues that might be explained by a wild switch
[10:20:05] <dcaro>	 (it's a stretch as I don't se packages dropped, but well, weirder things have happened)
[10:21:06] <arturo-afk>	 dcaro: I'm briefly in front of the laptop, in case there is an emergency going on?
[10:22:31] <dcaro>	 random intermittent network issues (DNS inside k8s, timeouts between pods, toolsadmin timing out too, ...)
[10:27:18] <arturo-afk>	 I don't see anything weird with coredns
[10:27:46] <arturo-afk>	 coredns logs feel OK in all 3 tools replicas
[10:27:49] <dcaro>	 I didn't either, it's using ~4x the request cpu it sets, but it does not seem to be strucgling
[10:27:57] <dcaro>	 (no restarts or anything)
[10:28:22] <arturo-afk>	 yeah, and the -k8s-control nodes have CPU headroom, loadavg 2
[10:28:24] <dcaro>	 I scaled it up to 4 replicas before, and seemed to help shortly (maybe just my imagination, or random luck), and it's back misbehaving
[10:28:27] <arturo-afk>	 (8 cpu)
[10:29:00] <dcaro>	 striker seems happy now, will monitor for a bit
[10:32:58] <arturo-afk>	 I'm checking cloudservices nodes
[10:33:08] <arturo-afk>	 as they are the DNS upstream for some of the work coredns is doing
[10:33:32] <arturo-afk>	 https://www.irccloud.com/pastebin/nqg7HMnR/
[10:33:44] <arturo-afk>	 there seems to be some disk issue in cloudservices1005
[10:34:35] <dcaro>	 is that increasing or stable?
[10:34:40] <dcaro>	 (it's the same metric as the ceph nodes)
[10:35:33] <arturo-afk>	 I have only seen that log entry once so far since I started watching the logs
[10:36:13] <dcaro>	 seems stable from previous logs
[10:37:01] <dcaro>	 sda increased by 8 in 3 days
[10:37:01] <arturo-afk>	 other than that pdns / designate seems to be working as expected
[10:37:56] <dcaro>	 https://www.irccloud.com/pastebin/LLYACkEl/
[10:38:21] <arturo-afk>	 I wonder if a rollout restart of coredns would make any difference
[10:39:28] <dcaro>	 I did a rollout restart already
[10:39:36] <arturo-afk>	 ok, I see 140m ago
[10:43:06] <dcaro>	 I think it times out before
[10:43:15] <dcaro>	 https://www.irccloud.com/pastebin/9vkQPBqw/
[10:43:51] <dcaro>	 even three retries fails sometimes
[10:44:00] <dcaro>	 https://www.irccloud.com/pastebin/9l69vwzp/
[10:44:12] <arturo-afk>	 I wonder if there has been any package updates on the workers recently
[10:44:22] <arturo-afk>	 or kernel?
[10:44:49] <dcaro>	 maybe, external dns seems to reply reliably
[10:45:00] <dcaro>	 (ex. `I have no name!@shell-1724668943:~$ nslookup github.com 8.8.8.8`)
[10:46:15] <dcaro>	 this also, so k8s specific for sure
[10:46:19] <dcaro>	 `I have no name!@shell-1724668943:~$ nslookup tools-harbor.wmcloud.org 172.20.255.1`
[10:46:47] <topranks>	 dcaro: just out of meeting 
[10:46:49] * topranks looking 
[10:47:46] <topranks>	 not the exact same problem anyway - but unlikely it would be I guess 
[10:48:05] <dcaro>	 yep, this seems k8s only (though there were a few other issues around, like striker and such)
[10:49:09] <topranks>	 ok, if there is any specific network thing to check let me know 
[10:49:40] <topranks>	 that 10.96.0.10 IP looks internal to the cluster, it's not in any of the switch routing tables 
[10:50:20] <dcaro>	 yep, it's internal yes
[10:51:04] <arturo-afk>	 I need to go offline now, ping me if the problem escalates into a full outage
[10:51:28] <dcaro>	 sure, thanks, cya
[10:55:29] <dcaro>	 coredns queries per second look ok
[10:55:36] <dcaro>	 https://usercontent.irccloud-cdn.com/file/797u7AGx/image.png
[10:56:08] <dcaro>	 the peaks are more or less the same for the last 30 days, when I scaled up, it went down, but there's no sudden peak in the last days either
[11:18:26] <dcaro>	 something I see is that we do many requests to resolve any non-resolvable name, as we have many domains configured in resolv.conf, not that that would help, but would decrease the load
[11:18:44] <dcaro>	 https://www.irccloud.com/pastebin/KdYwo9ih/
[12:30:06] <dcaro>	 andrewbogott: also, can you elaborate a bit on what did you do yesterday for the pdns replication?
[12:30:14] <dcaro>	 I'm still struggling to get dns working on k8s
[12:35:03] <andrewbogott>	 There were two things I did right at the same time. The most obvious is this config setting rename: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1065037
[12:35:47] <andrewbogott>	 The other  is that I manually reset the domain serial in the database so that axfr would think it was way out of date and sync. I only did that for one domain though so I doubt that fixed anything affecting you.
[12:36:14] <andrewbogott>	 My tests over the weekend suggested that pdns is fully in sync and that the failure is internal to k8s, are you seeing things that are out of sync or misbehaving within pdns?
[12:40:57] <dcaro>	 ack, thanks, yep, it does not seem related, as the current issues affect all  dns entries
[12:41:07] <dcaro>	 they seem to affect only certain workrers though
[12:41:26] <dcaro>	 trying to resolve from k8s-worker-106 works fast and reliably from the coredns pod and from envvars-api pod
[12:43:11] <andrewbogott>	 seems like we should try rebooting one bad worker and one good one and see if they trade places :)
[12:43:29] <dcaro>	 yep, on it :)
[12:43:39] <dcaro>	 rebooting 104
[12:50:41] <andrewbogott>	 I need to go out until our checkin but I can look at the prometheus thing when I'm back.  Sorry to vanish!
[12:50:42] <dcaro>	 the reboot did not help :(
[12:50:46] <dcaro>	 https://www.irccloud.com/pastebin/JMRISqs4/
[13:51:44] * dcaro paged harbor down
[13:53:52] <dcaro>	 hmpf... I see the entry in cloud-feed but not on karma ui
[13:55:29] <dcaro>	 grafana does not show any issues either, not sure why it triggered, looking
[13:56:34] <dcaro>	 `Got no data for any component, all might be down or unresponsive, toolforge might be unable to pull images.` might be related to the restarts of prometheusn
[13:57:30] <dcaro>	 I'm going to take the 7 worker nodes that are having dns issues out of the pool, I might create new workers after if I don't find a solution for those
[14:01:59] * andrewbogott is back
[14:02:14] <andrewbogott>	 dcaro: that seems reasonable although it's frustrating that they're misbehaving arbitrarily.
[14:02:43] <andrewbogott>	 And it does seem like it corresponds to the openstack upgrade? But I can't think why unless they got upset about the service flapping and somehow can never get over it.
[14:06:11] <dcaro>	 I rebooted all of them, and hard stop/started one without any improvement :/
[14:06:33] <dcaro>	 they are all scattered around cloudvirts too, on different racks and all
[14:06:59] <dcaro>	 so seems bound to the k8s layer to me, maybe just bad timing?
[14:09:09] <andrewbogott>	 sounds like but it's suspicious!
[14:19:02] <dcaro>	 stashbot was down, we should add an alert for that
[14:19:02] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[14:22:54] <dcaro>	 thanks stashbot xd
[14:23:48] <andrewbogott>	 clearly not down now
[14:24:00] <RhinosF1>	 Lucas already restarted it
[14:27:32] <dcaro>	 it's also tricky as it's not just checking that there's pods running, as it gets stuck but is still running, we would have to check if it replies on irc or similar
[14:27:49] <dcaro>	 (maybe the logs?) so we might not have the stats to set the alert right away
[14:29:03] * andrewbogott imagines once-per minute conversations on cloud-feed "you there?" "yep, still here"
[14:29:35] <dcaro>	 that'd be doable actually xd
[14:54:02] <dcaro>	 https://www.irccloud.com/pastebin/IT26jBrQ/%20
[15:35:54] <bd808>	 > stashbot was down, we should add an alert for that -- y'all want to take over all of my toys ;)
[15:37:28] <dcaro>	 that's because they are useful!
[15:37:30] <dcaro>	 :)
[15:39:39] <bd808>	 I see some ldap connection drops in stashbot's err log and a handful of dns lookup failures that almost certainly were related to the general k8s dns problems. Nothing exciting though.
[16:46:20] <dcaro>	 andrewbogott: I'm getting `Keystone service is temporarily unavailable. ` on the ui :/
[16:46:35] <dcaro>	 (horizon) when trying to get the instances for tools
[16:46:51] <andrewbogott>	 try again? I just restarted some services
[16:47:25] <dcaro>	 ack
[16:47:51] <dcaro>	 seems to work yep
[16:50:14] <dcaro>	 I think I'm going to call it a day for now, I'll leave the current worker being added, and I'll add a couple more workers bit by bit and retake tomorrow
[16:50:51] * dcaro off
[16:51:03] <andrewbogott>	 sgtm. Thanks dcaro