[05:56:25] FIRING: SystemdUnitFailed: cassandra-a.service on restbase2028:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:25] RESOLVED: SystemdUnitFailed: cassandra-a.service on restbase2028:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:42] "yay"
[09:40:31] we lost a race with a redirection change this time.
[09:46:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:30] I am switching es5 master
[10:01:38] Amir1: I am going to start reorganizing pc3
[10:01:56] marostegui: sure, I'm just deleting stuff from pc5 in eqiad
[10:02:03] (pc1014)
[10:02:19] I can pause for now, it's not anything urgent
[10:02:30] (and we have until the switchover to clean it up)
[10:03:02] Amir1: Would it affect pc3?
[10:03:04] No, no?
[10:03:31] I am going to fully depool pc3
[10:03:37] So I can operate with no pressure
[10:03:57] yeah, it won't affect anything there
[10:04:06] great
[10:04:08] let me know if you like the new way of depooling
[10:04:17] basically set both masters to 0 right?
[10:04:43] yup: https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Depooling_a_parsercache_host
[10:04:54] Great
[10:04:59] I will upgrade kernel, mariadb etc too
[10:05:16] you can use the upgrade cookbook, it's sooooo easy
[10:05:29] ah yes, that includes the reboot
[10:05:33] I always forget
[10:05:35] > sudo cookbook sre.mysql.upgrade 'pc1014*'
[10:06:07] https://phabricator.wikimedia.org/P71989
[10:07:16] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=All&var-role=All&from=now-1h&to=now
[10:07:21] It looks like it's being drained
[10:09:28] your diff is different than mine when I depooled. It doesn't matter I think but just to be sure: https://phabricator.wikimedia.org/P71989 vs https://phabricator.wikimedia.org/P71926
[10:10:10] Ah
[10:10:18] Because I did depool
[10:10:20] Instead of 0
[10:10:39] that seems to work too (funnily enough, it shouldn't)
[10:11:40] I just fixed it just in case
[10:16:21] Thanks!
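A minimal sketch of the weight-based depool discussed above, assuming the dbctl syntax from the linked wikitech runbook; the host names in angle brackets are placeholders, since the pc3 masters are not named in this log:

    # Set both pc3 masters (eqiad and codfw) to weight 0 -- placeholder host names.
    sudo dbctl instance <pc3-eqiad-master> set-weight 0
    sudo dbctl instance <pc3-codfw-master> set-weight 0
    sudo dbctl config diff                                # review the change, cf. P71989 vs P71926
    sudo dbctl config commit -m "Depool pc3 for maintenance"
    # Once traffic has drained, the cookbook mentioned above handles the package
    # upgrade and the reboot in one go:
    sudo cookbook sre.mysql.upgrade '<pc3-eqiad-master>*'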
[10:29:05] Amir1: pc3 is done and back in production
[10:29:17] \o/
[10:29:37] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=All&var-role=All&from=now-1h&to=now
[10:30:12] https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m latency looks okay too
[10:34:50] yeah, I am going to give it a few minutes and depool pc4 to reorganize pc4 codfw
[11:02:52] jynus: I think m2 backups are finished per the dbbackups.backups table, but can you double check and confirm?
[11:03:22] I believe so, I was checking them earlier, but let me confirm again
[11:03:40] Thanks
[11:03:51] m2 will run tonight at 0 hours
[11:04:07] wanna do a quick run, if you are going to do any maintenance?
[11:04:15] It is just a clone
[11:04:18] So we should be good
[11:04:26] ok!
[11:04:31] Thanks!
[11:05:07] oh, those backups take 7 hours, so it wouldn't be a quick run
[11:10:46] Yeah m2 has otrs........
[11:21:26] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2233:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:58] ^ expected
[12:14:02] I would appreciate another pair of eyes at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1110758/1/hieradata/hosts/dbproxy2005.yaml#4
[12:14:09] Basically that host/ip are what I sent
[12:14:19] I double checked, but I'd prefer another recheck
[12:25:33] Switching m2 proxy now
[12:25:47] 👀
[12:27:03] Thank you Emperor <3
[12:29:58] NP
[12:59:47] Emperor: o/
[12:59:54] do you have a min for ms-be1090?
[13:01:38] as far as I can see from the BMC's webui, there are 24 JBODs
[13:02:27] so dcops should already have fixed it, I think that the tooling part involves rebooting into the BIOS and setting JBOD manually
[13:06:00] Oh, I was going on the most recent update to the phab ticket
[13:06:23] [and, err, these are meant to be hot-swappable drives, which is rather defeated by having to reboot every time?]
[13:06:37] phab> T382874
[13:06:37] T382874: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874
[13:41:12] elukey: drive does look to be present (oddly, as of 19:02:51 on 10 Jan, about 3 minutes after the reboot)
[13:43:09] PROBLEM - MariaDB sustained replica lag on s8 on db2195 is CRITICAL: 33.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[13:43:23] PROBLEM - MariaDB sustained replica lag on s8 on db2167 is CRITICAL: 23.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[13:45:23] RECOVERY - MariaDB sustained replica lag on s8 on db2167 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[13:46:09] RECOVERY - MariaDB sustained replica lag on s8 on db2195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[13:51:12] Emperor: sorry I got distracted by another thing - yes, it is a pity that we can't use hot-swap at the moment; maybe there is a way via Redfish, but so far I haven't found one. I can try to open a task to dig a bit more, but that SAS controller has a lot of limitations, I am afraid.
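A rough sketch of what checking for the swapped drive over Redfish could look like, purely as an assumption about the approach; the BMC host, credentials and resource IDs below are placeholders, and the exact resource paths vary by vendor:

    # List systems, then storage controllers, then the Drives array on a controller;
    # a hot-swapped disk should appear here even before it has been set as a JBOD.
    curl -sk -u USER:PASS https://BMC_HOST/redfish/v1/Systems/ | jq '.Members'
    curl -sk -u USER:PASS https://BMC_HOST/redfish/v1/Systems/SYSTEM_ID/Storage/ | jq '.Members'
    curl -sk -u USER:PASS https://BMC_HOST/redfish/v1/Systems/SYSTEM_ID/Storage/STORAGE_ID | jq '.Drives'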
[13:51:34] for the monitoring part, I am going to prioritize it, hopefully we'll fix it in a couple of weeks
[13:58:19] elukey: if we can't hot-swap drives on these systems, that would be quite a significant issue, I think.
[13:59:26] (though maybe if we get a megacli-equivalent that actually works, it might be able to do it?)
[14:04:18] Emperor: The hot swap works as intended, it is the JBOD setting that for the moment seems to require more work
[14:04:45] we anticipated some limitations when talking about using SAS or upgrading all ms-bes to the new controller
[14:14:34] elukey: I see that distinction, but from my perspective as service owner, "it's hot-swap but you have to reboot to actually use the new drive" isn't much good.
[14:16:28] Emperor: sure, I agree, but infra foundations tries to offer tools related to what the hw can do; the choice of the hw is between the service owner and dcops
[14:17:17] I am aware that there was a misunderstanding between the two teams; I tried to resolve the problem, but dcops offered the possibility to upgrade the controllers
[14:17:26] and IIRC your team decided against
[14:18:05] now, I'll do my best to make the how swap really working, I just wanted to point out that some high level discussions about "this can't do XYZ" need also to involve dcops
[14:18:14] (namely, it is also a service-owner/dcops responsibility)
[14:18:47] on my side I am going to open a task to investigate this, I'll report it in here shortly, and we could take a decision after it (upgrade to a new controller or live with the limitation)
[14:19:25] *hot swap gets working
[14:26:08] elukey: My understanding was that the controller upgrade was for hosts where the BBU / hardware-RAID was a key requirement (which it isn't here); I didn't think that we wouldn't be able to hot-swap JBODs (because that seems like a ... strange ... restriction). If that turns out to be the case, we will be wanting to think about a new controller.
[14:32:41] okok makes sense
[14:43:30] Emperor: actually I checked hotswapping for backup1011. But for the nearer plane, there was mention of a more hidden one which was more difficult to access.
[14:43:56] it is true that I did it for RAID, not JBOD
[14:43:58] jynus: it's not a physical issue, it's that having hot-swapped a drive you can't make it into a JBOD
[14:44:02] ah
[14:44:05] without a reboot
[14:44:20] sorry, I arrived late to the conversation
[14:45:24] it is true that for backup hosts, hot swapping is Not a requirement also
[15:50:51] Switching es4 eqiad master
[16:39:37] Switchover all the things
[16:46:26] RESOLVED: SystemdUnitFailed: user@499.service on dbprov2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:52] that was me, downtime must have expired
[16:47:22] Emperor: added some info to https://phabricator.wikimedia.org/T377853#10454289, will try to work on it this or next week
[16:47:36] maybe storcli or megactl solves both issues
[16:58:25] Thanks!
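If storcli does turn out to support this controller, converting a hot-swapped drive to a JBOD without a reboot might look roughly like the following; the controller/enclosure/slot numbers are placeholders, and whether "set jbod" actually works on this hardware is exactly what the task above is meant to find out:

    sudo storcli64 /c0 show                  # controller summary, including JBOD capability
    sudo storcli64 /c0/eall/sall show        # locate the newly inserted drive (enclosure/slot)
    sudo storcli64 /c0/eENC/sSLOT set jbod   # expose the replaced drive as a JBOD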
[19:46:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@m1.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
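For SystemdUnitFailed alerts like these, a generic first look at the failed unit (the unit name is taken from the 19:46 alert; the check_systemd_state page linked in the alert remains the authoritative procedure):

    sudo systemctl status wmf_auto_restart_prometheus-mysqld-exporter.service
    sudo journalctl -u wmf_auto_restart_prometheus-mysqld-exporter.service --since "2 hours ago"
    # once the cause is understood or fixed, clear the failed state so the alert resolves:
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter.service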