[00:04:00] FIRING: [2x] SystemdUnitFailed: man-db.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:34] RESOLVED: DiskSpace: Disk space thanos-be2006:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=thanos-be2006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:29:00] FIRING: [2x] SystemdUnitFailed: man-db.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:44:00] FIRING: [2x] SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.11.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:44:17] FIRING: [2x] SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.11.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:14] moss-be1001 has a failed disk, but it's also due to be retired soon, so I guess I'll just do that today.
[07:19:00] RESOLVED: SystemdUnitFailed: man-db.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:21:31] that (thanos-be2006) was a consequence of the overnight disk-filling, which is T423690
[07:21:32] T423690: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690
[08:06:44] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275260 please? It's to do the eqiad apus rollover (then I can get started on draining the two old nodes, including the one with a failed disk)
[08:33:43] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275286
[09:16:50] federico3: thanks for your review of my CR. Are you happy enough with it to give it a +1, please? :)
[09:27:01] @Emperor done
[09:29:38] TY :)
[10:15:44] And could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275366 please? It takes those two old nodes out of hiera so I can decom them (they've been drained, demonstration of which is in the commit msg).
[10:46:32] looking
[11:26:03] did you come to a view on it?
[11:52:32] @Emperor LGTM, done
[11:56:57] TY
[12:07:38] OK, now to address the disk-filling on thanos backends. I'm going to have to reimage them; here's a CR for the preseed changes - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275384
[12:08:07] I've tested the change already using the new virtualised testing setup, output in the CR text, so hopefully not too stressful to +1, please :)
[12:39:35] federico3: db2157 is now ok, once you've got the eqiad host let me know so I can check/fix if needed
[12:40:08] @marostegui was it only one wiki on it or multiple ones?
[12:42:59] federico3: it was fine for all the wikis
[12:46:43] federico3: I found the issue with pplwiki
[12:46:46] On db2157
[12:47:05] il_to column cannot be dropped because it is part of the PK
[12:47:29] so the PK is different there than on the rest of the wikis
[12:47:33] ah, so that would fail even with "IF EXISTS"
[12:47:37] yep
[12:48:18] Amir1: can you check MW files to see if the PK for imagelinks is supposed to be (`il_from`,`il_target_id`) or (`il_from`,`il_to`) ?
[12:48:47] mw on master should be il_from, il_target_id
[12:48:54] https://phabricator.wikimedia.org/T415786
[12:49:08] Yep, for some reason I guess that wiki was created with a different pk or was skipped
[12:49:10] that would explain it
[12:49:16] on that host
[12:49:19] marostegui: btw going-merry.toolforge.org/?table=imagelinks :D
[12:50:29] oh nice!
[12:50:33] I will fix the pk
[12:53:24] ok, now all done
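[Editor's note, not part of the original log: a minimal SQL sketch of the kind of fix described above, i.e. realigning the pplwiki imagelinks primary key with the (il_from, il_target_id) definition expected on master so that il_to can then be dropped. The exact statements run on db2157 are not shown in the log; the table and column names come from the discussion, everything else is illustrative.]

    -- Inspect the current definition on the affected replica (illustrative).
    SHOW CREATE TABLE pplwiki.imagelinks;

    -- Swap the primary key to the expected columns in a single ALTER,
    -- so the table is never left without a primary key in between.
    ALTER TABLE pplwiki.imagelinks
      DROP PRIMARY KEY,
      ADD PRIMARY KEY (il_from, il_target_id);

    -- Once the PK matches the other wikis, the original schema change
    -- (dropping il_to) can be re-applied.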
[13:19:59] Any volunteers to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275384 please? As I mentioned in our meeting, I have tested the change...
[15:45:28] marostegui: something that happens with very low frequency but it's a bit annoying and weird. Some really fast queries rarely (1 ppm basically) become really slow https://logstash.wikimedia.org/goto/0d2fee84b2f29f293d65ad8e406e848a
[15:45:58] maybe not worth fixing but wonder if you have some way of fixing it so it doesn't pollute the logs
[15:46:08] More context: T423349
[15:46:08] T423349: globalblocks query randomly becomes slow - https://phabricator.wikimedia.org/T423349
[15:46:20] Amir1: Is it always when it hits the same replica?
[15:46:33] nope, it's many different replicas
[15:46:44] and I optimized the table too
[15:46:46] so that query is normally fast until it isn't?
[15:47:09] yup
[15:47:23] Is it slow now?
[15:47:31] So I can get some traces of the optimizer
[15:47:45] it doesn't become slow reliably
[15:47:48] just random times
[15:47:59] And then it fixes itself?
[15:48:16] yup, it's usually one query even, so it's not a burst in one second and gone
[15:48:22] :O
[15:48:59] one guess is that maybe it's a slowdown in the network?
[15:49:13] I wonder if it depends on the status of the innodb pool at that point
[15:50:00] If it fixes itself...that maybe means as soon as the query runs, pages are back in the buffer pool and then that's why it gets "fixed"
[15:52:02] could be, but taking +5 seconds to read a couple of rows from disk? that's a bit weird
[15:52:15] is the table very big?
[15:52:28] (the slowdown gets reported when it's longer than 5s)
[15:52:37] nope, it should be quite small and very heavily hit
[15:52:41] if it happens across many replicas, I am more inclined towards the buffer pool theory
[15:52:56] but in any case, if it is that rare, is it worth spending time on this?
[15:53:12] but also the query returns empty results 99.99% of the time
[15:53:27] (it checks if the user is blocked, which most of the time they're not)
[15:53:53] maybe lock contention on writes?
[15:54:00] so a write blocking that read?
[15:54:10] "but in any case, if it is that rare, is it worth spending time on this?" only problem is the logs being polluted, otherwise no other issues
[15:54:45] oh could be, I hate it when mw starts a commit on replicas even, because it wants to enforce repeatable read
[15:54:46] Maybe we need to fix the log filters :)
[15:55:04] fair
[15:55:17] Like even if it is the lock on the writes, it is going to be hard to chase
[15:56:12] yeah
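[Editor's note, not part of the original log: a rough sketch of how the two theories discussed above (cold buffer pool vs. lock contention / optimizer behaviour) might be checked on a replica. Only the globalblocks table name comes from the discussion; the actual database holding it is not named in the log, the SELECT below is a placeholder, and querying INNODB_BUFFER_PAGE can be heavy on a busy server, so this is illustrative only.]

    -- How many pages of the globalblocks table are resident in the buffer
    -- pool right now? (Expensive on large pools; for investigation only.)
    SELECT COUNT(*) AS pages_in_pool
      FROM information_schema.INNODB_BUFFER_PAGE
     WHERE TABLE_NAME LIKE '%globalblocks%';

    -- Capture an optimizer trace for one execution of the suspect query,
    -- to compare a "fast" run against a "slow" one.
    -- (optimizer_trace is available from MariaDB 10.4.)
    SET optimizer_trace = 'enabled=on';
    SELECT 1;  -- placeholder: run the actual globalblocks SELECT here
    SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;
    SET optimizer_trace = 'enabled=off';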
[19:04:01] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:18] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed