[05:56:25] FIRING: SystemdUnitFailed: cassandra-a.service on restbase2028:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:16:25] RESOLVED: SystemdUnitFailed: cassandra-a.service on restbase2028:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:42] "yay"
[09:40:31] we lost a race with a redirection change this time.
[09:46:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:30] I am switching es5 master
[10:01:38] Amir1: I am going to start reorganizing pc3
[10:01:56] marostegui: sure, I'm just deleting stuff from pc5 in eqiad
[10:02:03] (pc1014)
[10:02:19] I can pause for now, it's not anything urgent
[10:02:30] (and we have until the switchover to clean it up)
[10:03:02] Amir1: Would it affect pc3?
[10:03:04] No, no?
[10:03:31] I am going to fully depool pc3
[10:03:37] So I can operate with no pressure
[10:03:57] yeah, it won't affect anything there
[10:04:06] great
[10:04:08] let me know if you like the new way of depooling
[10:04:17] basically set both masters to 0 right?
[10:04:43] yup: https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Depooling_a_parsercache_host
[10:04:54] Great
[10:04:59] I will upgrade kernel, mariadb etc too
[10:05:16] you can use the upgrade cookbook, it's sooooo easy
[10:05:29] ah yes, that includes the reboot
[10:05:33] I always forget
[10:05:35] > sudo cookbook sre.mysql.upgrade 'pc1014*'
[10:06:07] https://phabricator.wikimedia.org/P71989
[10:07:16] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=All&var-role=All&from=now-1h&to=now
[10:07:21] It looks like it's being drained
[10:09:28] your diff is different than mine when I depooled. It doesn't matter I think but just to be sure: https://phabricator.wikimedia.org/P71989 vs https://phabricator.wikimedia.org/P71926
[10:10:10] Ah
[10:10:18] Because I did depool
[10:10:20] Instead of 0
[10:10:39] that seems to work too (funnily enough, it shouldn't)
[10:11:40] I just fixed it just in case
[10:16:21] Thanks!
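A minimal sketch of the weight-based depool discussed above, assuming the dbctl syntax from the linked wikitech runbook; the host names in angle brackets are placeholders, since the pc3 masters are not named in this log:

    # Set both pc3 masters (eqiad and codfw) to weight 0 -- placeholder host names.
    sudo dbctl instance <pc3-eqiad-master> set-weight 0
    sudo dbctl instance <pc3-codfw-master> set-weight 0
    sudo dbctl config diff                                # review the change, cf. P71989 vs P71926
    sudo dbctl config commit -m "Depool pc3 for maintenance"
    # Once traffic has drained, the cookbook mentioned above handles the package
    # upgrade and the reboot in one go:
    sudo cookbook sre.mysql.upgrade '<pc3-eqiad-master>*'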
[10:29:05] Amir1: pc3 is done and back in production
[10:29:17] \o/
[10:29:37] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=All&var-role=All&from=now-1h&to=now
[10:30:12] https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m latency looks okay too
[10:34:50] yeah, I am going to give it a few minutes and depool pc4 to reorganize pc4 codfw
[11:02:52] jynus: I think m2 backups are finished per the dbbackups.backups table, but can you double check and confirm?
[11:03:22] I believe so, I was checking them earlier, but let me confirm again
[11:03:40] Thanks
[11:03:51] m2 will run tonight at 0 hours
[11:04:07] wanna do a quick run, if you are going to do any maintenance?
[11:04:15] It is just a clone
[11:04:18] So we should be good
[11:04:26] ok!
[11:04:31] Thanks!
[11:05:07] oh, those backups take 7 hours, so it wouldn't be a quick run
[11:10:46] Yeah m2 has otrs........
[11:21:26] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db2233:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:58] ^ expected
[12:14:02] I would appreciate another pair of eyes at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1110758/1/hieradata/hosts/dbproxy2005.yaml#4
[12:14:09] Basically that host/ip are what I sent
[12:14:19] I double checked, but I'd prefer another recheck
[12:25:33] Switching m2 proxy now
[12:25:47] 👀
[12:27:03] Thank you Emperor <3
[12:29:58] NP
[12:59:47] Emperor: o/
[12:59:54] do you have a min for ms-be1090?
[13:01:38] as far as I can see from the BMC's webui, there are 24 JBODs
[13:02:27] so dcops should already have fixed it, I think that the tooling part involves rebooting into the BIOS and setting JBOD manually
[13:06:00] Oh, I was going on the most recent update to the phab ticket
[13:06:23] [and, err, these are meant to be hot-swappable drives, which is rather defeated by having to reboot every time?]
[13:06:37] phab> T382874
[13:06:37] T382874: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874
[13:41:12] elukey: drive does look to be present (oddly, as of 19:02:51 on 10 Jan, about 3 minutes after the reboot)
[13:43:09] PROBLEM - MariaDB sustained replica lag on s8 on db2195 is CRITICAL: 33.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[13:43:23] PROBLEM - MariaDB sustained replica lag on s8 on db2167 is CRITICAL: 23.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[13:45:23] RECOVERY - MariaDB sustained replica lag on s8 on db2167 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2167&var-port=9104
[13:46:09] RECOVERY - MariaDB sustained replica lag on s8 on db2195 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2195&var-port=9104
[13:51:12] Emperor: sorry I got distracted by another thing - yes, it is a pity that we can't use hot-swap at the moment; maybe there is a way via Redfish, but so far I haven't found one. I can try to open a task to dig a bit more, but that SAS controller has a lot of limitations, I am afraid.
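A rough sketch of what checking for the swapped drive over Redfish could look like, purely as an assumption about the approach; the BMC host, credentials and resource IDs below are placeholders, and the exact resource paths vary by vendor:

    # List systems, then storage controllers, then the Drives array on a controller;
    # a hot-swapped disk should appear here even before it has been set as a JBOD.
    curl -sk -u USER:PASS https://BMC_HOST/redfish/v1/Systems/ | jq '.Members'
    curl -sk -u USER:PASS https://BMC_HOST/redfish/v1/Systems/SYSTEM_ID/Storage/ | jq '.Members'
    curl -sk -u USER:PASS https://BMC_HOST/redfish/v1/Systems/SYSTEM_ID/Storage/STORAGE_ID | jq '.Drives'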
[13:51:34] for the monitoring part, I am going to prioritize it, hopefully we'll fix it in a couple of weeks
[13:58:19] elukey: if we can't hot-swap drives on these systems, that would be quite a significant issue, I think.
[13:59:26] (though maybe if we get a megacli-equivalent that actually works, it might be able to do it?)
[14:04:18] Emperor: The hot swap works as intended, it is the JBOD setting that for the moment seems to require more work
[14:04:45] we anticipated some limitations when talking about using SAS or upgrading all ms-bes to the new controller
[14:14:34] elukey: I see that distinction, but from my perspective as service owner, "it's hot-swap but you have to reboot to actually use the new drive" isn't much good.
[14:16:28] Emperor: sure, I agree, but infra foundations tries to offer tools related to what the hw can do; the choice of the hw is between the service owner and dcops
[14:17:17] I am aware that there was a misunderstanding between the two teams; I tried to resolve the problem, but dcops offered the possibility to upgrade the controllers
[14:17:26] and IIRC your team decided against
[14:18:05] now, I'll do my best to make the how swap really working, I just wanted to point out that some high level discussions about "this can't do XYZ" need also to involve dcops
[14:18:14] (namely, it is also a service-owner/dcops responsibility)
[14:18:47] on my side I am going to open a task to investigate this, I'll report it in here shortly, and we could take a decision after it (upgrade to a new controller or live with the limitation)
[14:19:25] *hot swap gets working
[14:26:08] elukey: My understanding was that the controller upgrade was for hosts where the BBU / hardware-RAID was a key requirement (which it isn't here); I didn't think that we wouldn't be able to hot-swap JBODs (because that seems like a ... strange ... restriction). If that turns out to be the case, we will be wanting to think about a new controller.
[14:32:41] okok makes sense
[14:43:30] Emperor: actually I checked hotswapping for backup1011. But for the nearer plane, there was mention of a more hidden one which was more difficult to access.
[14:43:56] it is true that I did it for RAID, not JBOD
[14:43:58] jynus: it's not a physical issue, it's that having hot-swapped a drive you can't make it into a JBOD
[14:44:02] ah
[14:44:05] without a reboot
[14:44:20] sorry, I arrived late to the conversation
[14:45:24] it is true that for backup hosts, hot swapping is Not a requirement also
[15:50:51] Switching es4 eqiad master
[16:39:37] Switchover all the things
[16:46:26] RESOLVED: SystemdUnitFailed: user@499.service on dbprov2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:52] that was me, downtime must have expired
[16:47:22] Emperor: added some info to https://phabricator.wikimedia.org/T377853#10454289, will try to work on it this or next week
[16:47:36] maybe storcli or megactl solves both issues
[16:58:25] Thanks!
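If storcli does turn out to support this controller, converting a hot-swapped drive to a JBOD without a reboot might look roughly like the following; the controller/enclosure/slot numbers are placeholders, and whether "set jbod" actually works on this hardware is exactly what the task above is meant to find out:

    sudo storcli64 /c0 show                  # controller summary, including JBOD capability
    sudo storcli64 /c0/eall/sall show        # locate the newly inserted drive (enclosure/slot)
    sudo storcli64 /c0/eENC/sSLOT set jbod   # expose the replaced drive as a JBOD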
[19:46:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@m1.service on db2160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
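For SystemdUnitFailed alerts like these, a generic first look at the failed unit (the unit name is taken from the 19:46 alert; the check_systemd_state page linked in the alert remains the authoritative procedure):

    sudo systemctl status wmf_auto_restart_prometheus-mysqld-exporter.service
    sudo journalctl -u wmf_auto_restart_prometheus-mysqld-exporter.service --since "2 hours ago"
    # once the cause is understood or fixed, clear the failed state so the alert resolves:
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter.service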