[03:04:18] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:35] FIRING: DiskSpace: Disk space ms-be1064:9100:/srv/swift-storage/sda3 3.896% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ms-be1064 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:12:15] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275748/2 ready
[07:37:31] marostegui: awesome, +1d
[07:38:43] dhinus: merged, will you take care of the lb part?
[07:40:13] sure
[07:40:27] thanks
[07:49:01] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:54] marostegui: clouddb1025@s6 is now pooled
[07:50:02] sweet
[07:50:03] thanks
[07:50:49] I think I can pool s4 as well?
[07:50:54] yeah, go for it
[07:51:31] done!
[07:52:29] I've also reset the wmf-pt-kill timeout on clouddb1015 to its default value
[07:52:33] we should be all set
[07:53:14] excellent, thanks
[07:57:20] es6 & es7 read only are running now, will finish by 10 UTC, and then we will see how much time the new clusters take
[08:35:39] User report via #talk-to-sre on Slack about cassandra-a on aqs1010 being down. The service does seem deliberately disabled (/etc/cassandra-a/service-enabled is absent); I'll drop u.random an email.
[08:36:27] Emperor: probably worth tracking on a task?
[08:36:29] [AFAICT it's been this way for a while, so maybe it's deliberate]
[08:37:41] Ah, I've found https://phabricator.wikimedia.org/T412830 - it's being decommissioned
[09:09:04] FIRING: MysqlPredictiveFreeDiskSpace: Host pc2021:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[09:09:16] ^ expected
[09:23:02] thanks
[09:33:51] @marostegui for T423998 I can see `maintain-views` running against `s7.hewiki_p.imagelinks` but it does not indicate whether changes have been made or not. I tried `describe imagelinks` in Quarry and it's showing the same error
[09:33:53] T423998: The imagelinks replica in hewiki is not available - https://phabricator.wikimedia.org/T423998
[09:34:16] federico3: you have to check inside hewiki_p
[09:34:47] I selected hewiki_p in Quarry and ran describe imagelinks
[09:35:22] I'd check an individual host and skip Quarry for now, simply do a select * from hewiki_p.imagelinks on the host you ran the script at
[09:38:17] federico3: maintain-views should print the name of the views that it changed. if it doesn't, maybe you forgot to add "--replace"?
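A minimal sketch of the per-host check being suggested in this exchange, assuming shell access to the clouddb host where maintain-views was run; the hewiki_p.imagelinks names come from the thread, while the client invocation (sudo mysql against the local socket) is illustrative and varies per host:

    # On the clouddb host where maintain-views ran (illustrative invocation;
    # the local socket/credentials setup differs between hosts).
    sudo mysql -e "SHOW CREATE VIEW hewiki_p.imagelinks\G"     # was the view (re)created?
    sudo mysql -e "SELECT * FROM hewiki_p.imagelinks LIMIT 5"  # does it resolve at all?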
[09:38:48] federico3: the exact command is at https://phabricator.wikimedia.org/T422459
[09:39:19] dhinus: this is what I see https://phabricator.wikimedia.org/P91263
[09:39:39] [11:35:22] I'd check an individual host and skip Quarry for now, simply do a select * from hewiki_p.imagelinks on the host you ran the script at
[09:39:54] "INFO [s7.frwiktionary_p.imagelinks]" means it did recreate that view
[09:41:16] I'm trying to find hosts with the table
[09:41:26] federico3: all the hosts should have the table
[09:42:58] I'm not finding it
[09:43:22] federico3: https://phabricator.wikimedia.org/T422459 this tells you exactly what you have to do and where
[09:44:14] btw the new version of maintain-views is almost ready for merge, marostegui I replied to your comments on https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/9
[09:44:37] dhinus: yeah, sorry, it's in my unread emails to review, but I have 1000 things on my plate at the moment
[09:44:46] I will try to get it reviewed this week
[09:45:10] marostegui: no rush at all
[09:45:38] I just wanted to make sure the gitlab notification did not get lost :)
[09:46:07] yeah, no, it is there :)
[09:47:11] I also pasted some examples in the comments at T351637 to see what the new version looks like on a real host
[09:47:12] T351637: [wikireplicas] add proper dry-run/diff mode to maintain-views - https://phabricator.wikimedia.org/T351637
[09:49:11] dhinus: it should be ok now
[09:50:16] marostegui: the task mentions running on "all clouddb* and an-redacteddb1001" but it seems to need only the hosts in the replication chain for a given section?
[09:50:19] federico3: looks good thanks!
[09:50:28] federico3: well yeah, of course
[09:52:29] new es backups on codfw worked, failed on eqiad, investigating
[09:52:53] marostegui: when you have a sec, db1161 needs the manual cleanup for "DROP INDEX il_to" in pplwiki
[09:53:09] ok, doing it now
[09:53:22] [ERROR] - There are queries in PROCESSLIST running longer than 60s, aborting dump
[09:53:43] federico3: done
[09:54:37] I think it is because dumps are running
[09:58:43] marostegui: thanks
[10:00:34] federico3: if the issue is fixed you should probably update and close the UBN
[10:01:08] I did
[10:01:46] excellent, thanks
[10:02:20] but there could be a glitch, see https://phabricator.wikimedia.org/T423998#11843261
[10:03:39] let's see what he says. We're not deleting data there, but we did remove a column, so maybe the query needs to be redone to account for those matches
[10:13:20] retrying worked, es backups went from over 8 hours to 7 minutes
[10:13:55] *10 hours
[11:14:01] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:01] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
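A hedged sketch of the triage later reported for db1151 at 11:59:56 ("I restarted them both just in case"); the unit names come from the alerts above, and the commands are standard systemd tooling rather than a documented runbook:

    # Inspect the failed auto-restart unit and the exporter it wraps.
    systemctl status wmf_auto_restart_prometheus-mysqld-exporter.service
    journalctl -u wmf_auto_restart_prometheus-mysqld-exporter.service -n 50
    # Restart both, then clear the failed state so the alert can resolve.
    sudo systemctl restart prometheus-mysqld-exporter.service
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter.service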
[11:38:09] Is db1151 a known issue? It's been alerting for a while now
[11:39:01] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:39] federico3: can you check the above^
[11:43:41] ?
[11:44:12] yes, but for context I was not doing reboots on pc* so maybe it's unrelated
[11:44:21] federico3: I think he means db1151
[11:45:13] looking
[11:51:45] Amir1: the amount of moving pieces with dbctl and parsercache is making the refreshes _very_ interesting
[11:57:33] Interesting definition of interesting
[11:58:46] I'd be really surprised if I don't bring the site down when pooling back
[11:59:56] db1151 sneezed a few times after the reboot, but then wmf_auto_restart_prometheus... and the exporter itself succeeded. I restarted them both just in case and I'm not seeing errors 🤷
[12:00:29] thanks for checking federico3
[12:06:34] RESOLVED: MysqlPredictiveFreeDiskSpace: Host pc2021:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[12:17:05] it's been a while since we caused a db-induced outage
[12:17:14] right about time
[12:17:25] but jokes aside, since we have 8 of those, it should be fine
[12:17:30] I am definitely doing my best today with pc
[12:17:43] even empty should be fine as long as it's done one per day
[12:17:48] I repooled pc1 with the new host and so far it looks good
[14:04:01] FIRING: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2251:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:10] ^ expected
[14:14:01] FIRING: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2251:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:11] RESOLVED: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2251:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:05] FIRING: MySQLReplicaNotUsingGTID: MySQL replica db2251:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[14:25:05] RESOLVED: MySQLReplicaNotUsingGTID: MySQL replica db2251:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[14:31:24] Amir1: I've replaced the ms1 codfw master with the new HW, if you see something strange, let me know. Dealing with ms1 is as painful as with pc from a dbctl/cookbook point of view
[14:31:28] But I think it is all good
[14:31:44] thanks!
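A minimal sketch of checking the condition behind the MySQLReplicaNotUsingGTID alert above, assuming a MariaDB replica; db2251 is from the alert (9104 there is the exporter port, not MySQL), while the client invocation is illustrative:

    # On db2251, ask the replica how it is tracking the primary (illustrative;
    # credentials/socket setup varies per host).
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep Using_Gtid
    # MariaDB reports Using_Gtid: No | Slave_Pos | Current_Pos;
    # the alert corresponds to it being "No".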