[00:04:00] FIRING: [2x] SystemdUnitFailed: man-db.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:34] RESOLVED: DiskSpace: Disk space thanos-be2006:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=thanos-be2006 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:29:00] FIRING: [2x] SystemdUnitFailed: man-db.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:44:00] FIRING: [2x] SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.11.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:44:17] FIRING: [2x] SystemdUnitFailed: ceph-3f38ada2-2d88-11ef-8c7c-bc97e1bb7c18@osd.11.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:14] moss-be1001 has a failed disk, but it's also due to be retired soon, so I guess I'll just do that today.
[07:19:00] RESOLVED: SystemdUnitFailed: man-db.service on thanos-be2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:21:31] that (thanos-be2006) was a consequence of the overnight disk-filling, which is T423690
[07:21:32] T423690: Thanos backends filling their root filesystems overnight - https://phabricator.wikimedia.org/T423690
[08:06:44] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275260 please? It's to do the eqiad apus rollover (then I can get started on draining the two old nodes, including the one with a failed disk)
[08:33:43] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275286
[09:16:50] federico3: thanks for your review of my CR. Are you happy enough with it to give it a +1, please? :)
[09:27:01] @Emperor done
[09:29:38] TY :)
[10:15:44] And could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275366 please? It takes those two old nodes out of hiera so I can decom them (they've been drained, demonstration of which is in the commit msg).
[10:46:32] looking
[11:26:03] did you come to a view on it?
[11:52:32] @Emperor LGTM, done
[11:56:57] TY
[12:07:38] OK, now to address the disk-filling on thanos backends. I'm going to have to reimage them; here's a CR for the preseed changes - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275384
[12:08:07] I've tested the change already using the new virtualised testing setup, output in the CR text, so hopefully not too stressful to +1, please :)
[12:39:35] federico3: db2157 is now ok, once you've got the eqiad host let me know so I can check/fix if needed
[12:40:08] @marostegui was it only one wiki on it or multiple ones?
[12:42:59] federico3: it was fine for all the wikis
[12:46:43] federico3: I found the issue with pplwiki
[12:46:46] On db2157
[12:47:05] il_to column cannot be dropped because it is part of the PK
[12:47:29] so the PK is different there than on the rest of the wikis
[12:47:33] ah, so that would fail even with "IF EXISTS"
[12:47:37] yep
[12:48:18] Amir1: can you check MW files to see if the PK for imagelinks is supposed to be (`il_from`,`il_target_id`) or (`il_from`,`il_to`) ?
[12:48:47] mw on master should be il_from, il_target_id
[12:48:54] https://phabricator.wikimedia.org/T415786
[12:49:08] Yep, for some reason I guess that wiki was created with a different pk or was skipped
[12:49:10] that would explain it
[12:49:16] on that host
[12:49:19] marostegui: btw going-merry.toolforge.org/?table=imagelinks :D
[12:50:29] oh nice!
[12:50:33] I will fix the pk
[12:53:24] ok, now all done
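[Editor's note, not part of the original log: a minimal SQL sketch of the kind of fix described above, i.e. realigning the pplwiki imagelinks primary key with the (il_from, il_target_id) definition expected on master so that il_to can then be dropped. The exact statements run on db2157 are not shown in the log; the table and column names come from the discussion, everything else is illustrative.]

    -- Inspect the current definition on the affected replica (illustrative).
    SHOW CREATE TABLE pplwiki.imagelinks;

    -- Swap the primary key to the expected columns in a single ALTER,
    -- so the table is never left without a primary key in between.
    ALTER TABLE pplwiki.imagelinks
      DROP PRIMARY KEY,
      ADD PRIMARY KEY (il_from, il_target_id);

    -- Once the PK matches the other wikis, the original schema change
    -- (dropping il_to) can be re-applied.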
[13:19:59] Any volunteers to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275384 please? As I mentioned in our meeting, I have tested the change...
[15:45:28] marostegui: something that happens with very low frequency but it's a bit annoying and weird. Some really fast queries rarely (1 ppm basically) become really slow https://logstash.wikimedia.org/goto/0d2fee84b2f29f293d65ad8e406e848a
[15:45:58] maybe not worth fixing but wonder if you have some way of fixing it so it doesn't pollute the logs
[15:46:08] More context: T423349
[15:46:08] T423349: globalblocks query randomly becomes slow - https://phabricator.wikimedia.org/T423349
[15:46:20] Amir1: Is it always when it hits the same replica?
[15:46:33] nope, it's many different replicas
[15:46:44] and I optimized the table too
[15:46:46] so that query is normally fast until it isn't?
[15:47:09] yup
[15:47:23] Is it slow now?
[15:47:31] So I can get some traces of the optimizer
[15:47:45] it doesn't become slow reliably
[15:47:48] just random times
[15:47:59] And then it fixes itself?
[15:48:16] yup, it's usually one query even, so it's not a burst in one second and gone
[15:48:22] :O
[15:48:59] one guess is that maybe it's a slowdown in the network?
[15:49:13] I wonder if it depends on the status of the innodb pool at that point
[15:50:00] If it fixes itself...that maybe means as soon as the query runs, pages are back in the buffer pool and then that's why it gets "fixed"
[15:52:02] could be, but taking +5 seconds to read a couple of rows from disk? that's a bit weird
[15:52:15] is the table very big?
[15:52:28] (the slowdown gets reported when it's longer than 5s)
[15:52:37] nope, it should be quite small and very heavily hit
[15:52:41] if it happens across many replicas, I am more inclined towards the buffer pool theory
[15:52:56] but in any case, if it is that rare, is it worth spending time on this?
[15:53:12] but also the query returns empty results 99.99% of the time
[15:53:27] (it checks if the user is blocked, which most of the time they're not)
[15:53:53] maybe lock contention on writes?
[15:54:00] so a write blocking that read?
[15:54:10] "but in any case, if it is that rare, is it worth spending time on this?" only problem is the logs being polluted, otherwise no other issues
[15:54:45] oh could be, I hate it when mw starts a commit on replicas even, because it wants to enforce repeatable read
[15:54:46] Maybe we need to fix the log filters :)
[15:55:04] fair
[15:55:17] Like even if it is the lock on the writes, it is going to be hard to chase
[15:56:12] yeah
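[Editor's note, not part of the original log: a rough sketch of how the two theories discussed above (cold buffer pool vs. lock contention / optimizer behaviour) might be checked on a replica. Only the globalblocks table name comes from the discussion; the actual database holding it is not named in the log, the SELECT below is a placeholder, and querying INNODB_BUFFER_PAGE can be heavy on a busy server, so this is illustrative only.]

    -- How many pages of the globalblocks table are resident in the buffer
    -- pool right now? (Expensive on large pools; for investigation only.)
    SELECT COUNT(*) AS pages_in_pool
      FROM information_schema.INNODB_BUFFER_PAGE
     WHERE TABLE_NAME LIKE '%globalblocks%';

    -- Capture an optimizer trace for one execution of the suspect query,
    -- to compare a "fast" run against a "slow" one.
    -- (optimizer_trace is available from MariaDB 10.4.)
    SET optimizer_trace = 'enabled=on';
    SELECT 1;  -- placeholder: run the actual globalblocks SELECT here
    SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;
    SET optimizer_trace = 'enabled=off';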
[19:04:01] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:18] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed