[03:04:18] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:35] FIRING: DiskSpace: Disk space ms-be1064:9100:/srv/swift-storage/sda3 3.896% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=ms-be1064 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:12:15] dhinus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275748/2 ready
[07:37:31] marostegui: awesome, +1d
[07:38:43] dhinus: merged, will you take care of the lb part?
[07:40:13] sure
[07:40:27] thanks
[07:49:01] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:54] marostegui: clouddb1025@s6 is now pooled
[07:50:02] sweet
[07:50:03] thanks
[07:50:49] I think I can pool s4 as well?
[07:50:54] yeah, go for it
[07:51:31] done!
[07:52:29] I've also reset the wmf-pt-kill timeout on clouddb1015 to its default value
[07:52:33] we should be all set
[07:53:14] excellent, thanks
[07:57:20] es6 & es7 read only are running now, will finish by 10 UTC, and then we will see how much time the new clusters take
[08:35:39] User report via #talk-to-sre on Slack about cassandra-a on aqs1010 being down. The service does seem deliberately disabled (/etc/cassandra-a/service-enabled is absent); I'll drop u.random an email.
[08:36:27] Emperor: probably worth tracking on a task?
[08:36:29] [AFAICT it's been this way for a while, so maybe it's deliberate]
[08:37:41] Ah, I've found https://phabricator.wikimedia.org/T412830 - it's being decommissioned
[09:09:04] FIRING: MysqlPredictiveFreeDiskSpace: Host pc2021:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[09:09:16] ^ expected
[09:23:02] thanks
[09:33:51] @marostegui for T423998 I can see `maintain-views` running against `s7.hewiki_p.imagelinks` but it does not indicate whether changes have been made or not. I tried `describe imagelinks` in Quarry and it's showing the same error
[09:33:53] T423998: The imagelinks replica in hewiki is not available - https://phabricator.wikimedia.org/T423998
[09:34:16] federico3: you have to check inside hewiki_p
[09:34:47] I selected hewiki_p in Quarry and ran describe imagelinks
[09:35:22] I'd check an individual host and skip Quarry for now, simply do a select * from hewiki_p.imagelinks on the host you ran the script at
[09:38:17] federico3: maintain-views should print the name of the views that it changed. if it doesn't, maybe you forgot to add "--replace"?
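A minimal sketch of the per-host check being suggested in this exchange, assuming shell access to the clouddb host where maintain-views was run; the hewiki_p.imagelinks names come from the thread, while the client invocation (sudo mysql against the local socket) is illustrative and varies per host:

    # On the clouddb host where maintain-views ran (illustrative invocation;
    # the local socket/credentials setup differs between hosts).
    sudo mysql -e "SHOW CREATE VIEW hewiki_p.imagelinks\G"     # was the view (re)created?
    sudo mysql -e "SELECT * FROM hewiki_p.imagelinks LIMIT 5"  # does it resolve at all?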
[09:38:48] federico3: the exact command is at https://phabricator.wikimedia.org/T422459
[09:39:19] dhinus: this is what I see https://phabricator.wikimedia.org/P91263
[09:39:39] [11:35:22] I'd check an individual host and skip Quarry for now, simply do a select * from hewiki_p.imagelinks on the host you ran the script at
[09:39:54] "INFO [s7.frwiktionary_p.imagelinks]" means it did recreate that view
[09:41:16] I'm trying to find hosts with the table
[09:41:26] federico3: all the hosts should have the table
[09:42:58] I'm not finding it
[09:43:22] federico3: https://phabricator.wikimedia.org/T422459 this tells you exactly what you have to do and where
[09:44:14] btw the new version of maintain-views is almost ready for merge, marostegui I replied to your comments on https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/9
[09:44:37] dhinus: yeah, sorry, it's in my unread emails to review, but I have 1000 things on my plate at the moment
[09:44:46] I will try to get it reviewed this week
[09:45:10] marostegui: no rush at all
[09:45:38] I just wanted to make sure the gitlab notification did not get lost :)
[09:46:07] yeah, no, it is there :)
[09:47:11] I also pasted some examples in the comments at T351637 to see what the new version looks like on a real host
[09:47:12] T351637: [wikireplicas] add proper dry-run/diff mode to maintain-views - https://phabricator.wikimedia.org/T351637
[09:49:11] dhinus: it should be ok now
[09:50:16] marostegui: the task mentions running on "all clouddb* and an-redacteddb1001" but it seems to need only the hosts in the replication chain for a given section?
[09:50:19] federico3: looks good thanks!
[09:50:28] federico3: well yeah, of course
[09:52:29] new es backups on codfw worked, failed on eqiad, investigating
[09:52:53] marostegui: when you have a sec, db1161 needs the manual cleanup for "DROP INDEX il_to" in pplwiki
[09:53:09] ok, doing it now
[09:53:22] [ERROR] - There are queries in PROCESSLIST running longer than 60s, aborting dump
[09:53:43] federico3: done
[09:54:37] I think it is because dumps are running
[09:58:43] marostegui: thanks
[10:00:34] federico3: if the issue is fixed you should probably update and close the UBN
[10:01:08] I did
[10:01:46] excellent, thanks
[10:02:20] but there could be a glitch, see https://phabricator.wikimedia.org/T423998#11843261
[10:03:39] let's see what he says. We're not deleting data there, but we did remove a column, so maybe the query needs to be redone to account for those matches
[10:13:20] retrying worked, es backups went from over 8 hours to 7 minutes
[10:13:55] *10 hours
[11:14:01] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:01] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1151:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
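A hedged sketch of the triage later reported for db1151 at 11:59:56 ("I restarted them both just in case"); the unit names come from the alerts above, and the commands are standard systemd tooling rather than a documented runbook:

    # Inspect the failed auto-restart unit and the exporter it wraps.
    systemctl status wmf_auto_restart_prometheus-mysqld-exporter.service
    journalctl -u wmf_auto_restart_prometheus-mysqld-exporter.service -n 50
    # Restart both, then clear the failed state so the alert can resolve.
    sudo systemctl restart prometheus-mysqld-exporter.service
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter.service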
[11:38:09] Is db1151 a known issue? It's been alerting for a while now
[11:39:01] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:39] federico3: can you check the above^
[11:43:41] ?
[11:44:12] yes, but for context I was not doing reboots on pc* so maybe it's unrelated
[11:44:21] federico3: I think he means db1151
[11:45:13] looking
[11:51:45] Amir1: the amount of moving pieces with dbctl and parsercache is making the refreshes _very_ interesting
[11:57:33] Interesting definition of interesting
[11:58:46] I'd be really surprised if I don't bring the site down when pooling back
[11:59:56] db1151 sneezed a few times after the reboot, but then wmf_auto_restart_prometheus... and the exporter itself succeeded. I restarted them both just in case and I'm not seeing errors 🤷
[12:00:29] thanks for checking federico3
[12:06:34] RESOLVED: MysqlPredictiveFreeDiskSpace: Host pc2021:9100 predictive low disk space on /srv - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/goto/Jdz2PnLNg?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMysqlPredictiveFreeDiskSpace
[12:17:05] it's been a while since we caused a db-induced outage
[12:17:14] right about time
[12:17:25] but jokes aside, since we have 8 of those, it should be fine
[12:17:30] I am definitely doing my best today with pc
[12:17:43] even empty should be fine as long as it's done one per day
[12:17:48] I repooled pc1 with the new host and so far it looks good
[14:04:01] FIRING: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2251:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:10] ^ expected
[14:14:01] FIRING: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2251:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:11] RESOLVED: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on db2251:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:05] FIRING: MySQLReplicaNotUsingGTID: MySQL replica db2251:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[14:25:05] RESOLVED: MySQLReplicaNotUsingGTID: MySQL replica db2251:9104 not using GTID - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting - https://grafana.wikimedia.org/d/0fec1d02-1b0b-44c0-84b0-64894f3ba682/mariadb-gtid - https://alerts.wikimedia.org/?q=alertname%3DMySQLReplicaNotUsingGTID
[14:31:24] Amir1: I've replaced the ms1 codfw master with the new HW, if you see something strange, let me know. Dealing with ms1 is as painful as with pc from a dbctl/cookbook point of view
[14:31:28] But I think it is all good
[14:31:44] thanks!
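A minimal sketch of checking the condition behind the MySQLReplicaNotUsingGTID alert above, assuming a MariaDB replica; db2251 is from the alert (9104 there is the exporter port, not MySQL), while the client invocation is illustrative:

    # On db2251, ask the replica how it is tracking the primary (illustrative;
    # credentials/socket setup varies per host).
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep Using_Gtid
    # MariaDB reports Using_Gtid: No | Slave_Pos | Current_Pos;
    # the alert corresponds to it being "No".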