[08:22:03] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:08] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:03] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2239:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:25] should we silence those for 24h with the DBAs out today?
[10:33:19] I don't know what does are
[10:33:24] *those
[10:34:14] although technically db2239 is mine
[10:36:05] I'm handling an UBN, will check it later
[10:39:49] actually that's wrong, the role is mine
[10:40:11] but this is a test host; volans and others were testing something, I was not involved
[10:40:20] I will downtime it, and they can handle it later
[10:41:23] db2239? not me
[10:41:33] unless I'm not recalling it :)
[10:42:02] you were involved somehow
[10:42:30] I was asked if the DBAs could set up a new dbstore, and I said as long as it didn't touch an existing one, no problem
[10:42:43] probably to test some automation or something
[10:43:02] but maybe it was someone else
[10:47:34] lmk if you want me to have a look, not sure I can help but I can try
[10:54:30] I've downtimed it for 15 days
[10:54:48] taking care of something more important right now
[10:55:29] auew
[10:55:31] *sure
[10:57:28] Updating wikireplica views `abuse_filter_action` for T378671
[12:53:34] I rebooted an-redacteddb1001 and it looks like the s8 replication is broken again. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-redacteddb1001&service=MariaDB+Replica+SQL%3A+s8 - I know that m.arostegui was working on it last week, but I didn't think that anything was still ongoing.
[12:54:57] Not urgent. I'll make a ticket and fess up that I didn't check before rebooting it.
[12:57:15] if a schema change was ongoing, you may have just increased the recovery time 10 times
[12:57:32] as it now has to revert the changes and apply them again
[14:05:47] jynus: Understood. I thought it was my chance to get an-redacted1001 rebooted for T376800 before the schema change was started again. m.arostegui said on the 7th that reverting would take around 2 days. Then applying would take another 10-12 days.
[14:07:00] my understanding is that that was being executed, but I may be wrong
[14:40:44] hey folks!
[14:41:02] define folks :-D
[14:41:24] yesterday db2217 paged (broken replication), we fixed the index on a table and left the node depooled for your final verification
[14:41:27] (no DBAs today)
[14:41:38] that answers my question then :D
[14:42:03] but you did what I would have done, so let's wait until tomorrow
[14:42:10] super thanks!
[14:44:48] ok, UBN fixed, taking a break to unstress
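The two replication breakages discussed above (s8 on an-redacteddb1001 after the reboot, and the db2217 page) both come down to a replica whose SQL thread has stopped. Below is a minimal sketch of how such a replica can be inspected, assuming a MariaDB replica reachable via pymysql; the hostname and credentials are placeholders, and this is not the channel's own tooling.

```python
# Rough illustration only: inspect a MariaDB replica's health after a reboot.
# Host and credentials are placeholders, not values from this channel.
import pymysql


def replica_health(host: str, user: str, password: str) -> dict:
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone() or {}  # empty dict if the host is not a replica
    finally:
        conn.close()
    return {
        "io_running": status.get("Slave_IO_Running"),    # should be "Yes"
        "sql_running": status.get("Slave_SQL_Running"),  # "No" => SQL thread broken
        "lag": status.get("Seconds_Behind_Master"),      # NULL while the thread is stopped
        "last_error": status.get("Last_SQL_Error"),      # e.g. a duplicate-key error
    }


if __name__ == "__main__":
    print(replica_health("db2217.example", "monitor", "secret"))
```

On multi-source replicas (one replication connection per section, as on dbstore-style hosts) the equivalent statement is SHOW ALL SLAVES STATUS, which returns one row per connection.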
[16:27:58] jynus: re: https://phabricator.wikimedia.org/T371416#10302257 - shall we set up a plan about what/how to test it?
[16:39:27] I was taking care of that, but sadly the previous UBN delayed me
[16:39:44] that was next on my list
[16:42:50] what I was doing is setting up a host with the previous hw config first, before comparing them
[16:43:04] however, I will need dc ops assistance for taking out a disk later on
[16:51:03] okok super thanks! Lemme know if you need any help
[16:53:25] I will certainly keep you updated
[16:53:34] <3
[16:53:49] (no rush, I was just curious, that's it)
[19:54:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1182:9104 has too large replication lag (11m 40s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1182&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:09:30] PROBLEM - MariaDB sustained replica lag on s2 on db1182 is CRITICAL: 1442 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
[20:19:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1182:9104 has too large replication lag (2m 5s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1182&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:20:30] RECOVERY - MariaDB sustained replica lag on s2 on db1182 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
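The MysqlReplicationLagPtHeartbeat alerts above measure lag through pt-heartbeat rather than Seconds_Behind_Master: the primary keeps writing a timestamp row, and a replica's lag is "now" minus the newest replicated timestamp. The sketch below shows that calculation in its simplest form; it is not the production alert logic, and it assumes a pt-heartbeat table at heartbeat.heartbeat with a ts column, timestamps written in UTC (pt-heartbeat --utc), and placeholder host/credentials.

```python
# Rough sketch of pt-heartbeat-based replication lag: compare the newest
# replicated heartbeat timestamp against the current time. Table location,
# UTC assumption, host and credentials are illustrative only.
from datetime import datetime, timezone

import pymysql


def pt_heartbeat_lag_seconds(host: str, user: str, password: str) -> float:
    """Return replication lag in seconds as seen through pt-heartbeat."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # Newest heartbeat row replicated from the primary.
            cur.execute("SELECT MAX(ts) FROM heartbeat.heartbeat")
            (ts,) = cur.fetchone()
    finally:
        conn.close()
    # pt-heartbeat writes ISO-style timestamps, e.g. 2024-11-11T19:54:48.000510
    written = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - written).total_seconds()


if __name__ == "__main__":
    lag = pt_heartbeat_lag_seconds("db1182.example", "monitor", "secret")
    print(f"replication lag: {lag:.1f}s")
```

The "MariaDB sustained replica lag" check in the log applies thresholds on top of a measurement like this: warning at 5 s and critical at 10 s of sustained lag, which is why 1442 s fired CRITICAL and 0 s recovered.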