[07:01:27] I just got paged by ceph
[07:01:56] I'm at a doctor appointment and don't have the laptop with me
[07:03:51] dhinus: are you able to take a look?
[07:10:45] * dcaro paged
[07:10:51] it just went away
[07:10:53] looking
[07:10:56] I was also paged
[07:13:34] I don't see anything happening right now
[07:13:47] there was an alert for 1 slow ops, and ceph unknown state
[07:13:56] a ticket about slow ops was just created
[07:15:58] the metric for ceph unknown status `(ceph_health_status{job="ceph_eqiad"} or on () vector(-1)) == -1` does not show anything to me :/, feels like the one we had for the openstack apis
[07:16:19] as if we were getting different data at different times
[07:16:49] and that one is also triggering right now xd
[07:16:50] Got no data from the Openstack API to check the response times. It may mean that the control plane is unreliable.
[07:17:06] might be worth pinging o11y about it
[07:17:13] see if they know what's what
[07:17:42] do you think they are related?
[07:18:05] are cloudcontrol servers involved in the ceph metric scrapes?
[07:18:09] both are related to no data being found, but both show some data when queried (and show gaps sometimes when queried too)
[07:18:21] like, is the ceph metric exporter running on cloudcontrol?
[07:18:37] ok
[07:19:02] no, they get scraped by prometheus directly, but I'm thinking that it might be that we are hitting different thanos/prometheus instances or similar, and they have different data for some reason
[07:19:12] or something like that
[07:20:43] ah, I see
[07:22:31] that happened in the past on toolforge iirc
[07:26:15] * dcaro off
[07:26:28] I'll let you handle this during regular work hours :)
[07:27:29] sure
[07:28:03] I'm hoping to be back on the laptop in max 2h from now
[07:29:00] thanks for showing up on your PTO day!
[08:58:21] * arturo on the laptop now
[09:36:05] I have created this ticket: T374599
[09:36:05] T374599: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599
[10:07:52] * dhinus paged ToolsToolsDBWritable
[10:08:44] * dcaro paged too
[10:08:46] tools-db-1 is running, but read-only
[10:08:55] ceph is having issues
[10:09:34] mariadb crashed and recovered
[10:09:44] hmm... the slow ops are not showing on ceph status
[10:09:45] :/
[10:09:47] it's normal that it restarts as read-only after a crash
[10:09:53] it seems it recovered
[10:10:00] yeah, slow ops are flapping
[10:10:08] it was present in ceph status just a few moments ago
[10:10:11] I will `SET GLOBAL read_only=OFF;`
[10:10:51] I don't see anything in `ceph crash ls`
[10:11:37] would be nice to have a `ceph health detail` from when the slow ops happened
[10:11:49] dcaro: I have some info in my backscroll
[10:11:49] (that would show the osds/nodes that were having issues)
[10:11:51] https://www.irccloud.com/pastebin/R35wgI8G/
[10:11:57] the toolsdb alert is not firing anymore
[10:12:37] those are 1023 (D5) and 1030 (F4)
[10:13:17] root@cloudcephmon1006:~# ceph device ls | grep 175
[10:13:17] HFS1T9G32FEH-BA10A_KN0CNA143I040120F cloudcephosd1023:sdc osd.175
[10:13:17] root@cloudcephmon1006:~# ceph device ls | grep 242
[10:13:17] HFS1T9G32FEH-BA10A_KJA8N5701I0308I4P cloudcephosd1030:sdg osd.242
[10:14:41] https://www.irccloud.com/pastebin/GX7hOfP3/
[10:15:08] is this the known hard drive error we are negotiating with dell?
[10:15:13] yep
[10:15:39] did it increase?
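One way to answer that directly on the OSD hosts is to read the drive's SMART attributes; a minimal sketch, assuming smartctl is installed and using the device mapping from `ceph device ls` above (osd.175 on cloudcephosd1023:sdc, osd.242 on cloudcephosd1030:sdg). The exact attribute names vary by drive vendor, so the grep pattern is only a guess:

```
# On cloudcephosd1023, check the sector error counters for osd.175 (sdc).
smartctl -A /dev/sdc | grep -iE 'realloc|pending|uncorrect'

# Same check on cloudcephosd1030 for osd.242 (sdg).
smartctl -A /dev/sdg | grep -iE 'realloc|pending|uncorrect'
```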
[10:16:20] let me check
[10:16:56] for 1023 it did not increase
[10:17:14] also not for 1030
[10:18:15] there are two services reporting on that counter: one is smartd, and the other smart_failure (the latter does not like it)
[10:18:18] https://www.irccloud.com/pastebin/RG9BPIgu/
[10:18:37] but, no movement on that drive for that counter yep
[10:18:53] ack
[10:19:57] can you attach the new task to T348643 and note the lack of increase in the counter?
[10:19:57] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[10:20:08] (and any other notes you want to add)
[10:20:19] ok
[10:20:23] which new task?
[10:20:35] do you think we should create a ticket about today's slow ops?
[10:21:29] I thought it had created an automated task
[10:21:35] then just a note there is enough
[10:21:46] I think the cluster is struggling a bit extra due to the drainage
[10:22:07] maybe T373632
[10:22:08] T373632: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632
[10:22:39] but I thought T373632 was about the prometheus weirdness rather than ceph having actual issues
[10:23:00] (the ticket was created earlier today, then updated with this new alert)
[10:23:12] that's the ticket yep
[10:23:22] I think that the one about ceph not getting data did not create a ticket
[10:23:39] ok, then I'll remove the parent task
[10:23:48] ack, thanks
[10:24:23] * dcaro disappears into the shadows
[10:24:23] the ceph no data alert did create a task BTW, it is here: T374593
[10:24:24] T374593: CephClusterInUnknown - https://phabricator.wikimedia.org/T374593
[10:24:40] 👍
[10:24:47] thanks for showing up
[11:16:30] new alert: tools k8s workers with many D procs
[11:17:46] which seems to be a legit failure in this case
[11:24:03] the worker with the most D procs is tools-k8s-worker-nfs-16
[11:24:23] but I just checked and it has a bunch of containers getting oomkilled
[11:24:44] and the D procs are those perl containers from the checkwiki tool
[11:33:21] the nfs servers seem happy
[11:33:33] tools-k8s-worker-nfs-28 is stuck, won't let me in via ssh
[11:33:35] will reboot it
[11:36:15] opened T374612 to track this work
[11:36:16] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[16:36:33] * dhinus offline
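For the "many D procs" alert above, a hedged sketch of how the stuck processes can be identified on a worker such as tools-k8s-worker-nfs-16 (run as root on the node; `<PID>` is a placeholder for one of the reported processes):

```
# List processes currently in uninterruptible sleep (state D) and the kernel
# function they are waiting in; these are typically blocked on I/O such as NFS.
ps -eo state,pid,ppid,comm,wchan | awk '$1 == "D"'

# For a specific stuck process, the kernel stack gives more detail on what it
# is blocked on (<PID> is a placeholder).
cat /proc/<PID>/stack
```

Processes stuck in D state usually point at blocked I/O, which is consistent with the NFS-mounted checkwiki containers and the eventual reboot of tools-k8s-worker-nfs-28 noted above.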