[00:05:22] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 55.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:14:48] <jinxer-wm>	 FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (4m 3s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[00:14:48] <jinxer-wm>	 FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (4m 3s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[00:29:22] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:29:48] <jinxer-wm>	 RESOLVED: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (1m 45s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[00:29:48] <jinxer-wm>	 RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (1m 45s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[01:32:30] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 19 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:35:30] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:37:08] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2046:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:45:30] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 126.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[01:52:48] <jinxer-wm>	 FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (3m 3s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[01:54:48] <jinxer-wm>	 FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (3m 28s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[01:57:48] <jinxer-wm>	 RESOLVED: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (1m 39s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[01:59:48] <jinxer-wm>	 RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (1m 39s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[02:01:32] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[03:42:08] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:34] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 99.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[06:36:39] <arnaudb>	 dumps dumping
[06:40:34] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[07:42:08] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:53] <arnaudb>	 > profile::monitoring::notifications_enabled: false → it seems to be falling under the notification bug we have ongoing. should I downtime it until we get to it? (with a comment in hieradata to keep a trace)
[07:44:21] <marostegui>	 es2042?
[07:44:25] <arnaudb>	 yes
[07:44:28] <marostegui>	 You can leave it, I am pooling it now
[07:44:33] <arnaudb>	 oh ack
[07:44:34] <marostegui>	 it is being provisioned
[07:44:45] <marostegui>	 How are we doing on the replication paging thign?
[07:44:49] <marostegui>	 I am worried we are so blind
[07:49:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2042:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:17] <arnaudb>	 paging through alertmanager is ready to be merged, we were already alerting at warn & critical threshold for the past few months so I think we can trust those. As for the paging threshold: since icinga is not as "finely tuneable" on the notifications, we would have the expected outcome. On the icinga side, I saw that there was indeed no paging
[07:49:17] <arnaudb>	 config for this alert anymore, but I'm not well versed in icinga nor or way to configure it, so I've asked tappof if he had some insights about this yesterday. We could merge the alertmanager CR if you want to be on the safe side and iterate over the alert FOR (which I think would be the only needed tweak) if we see that the alert is paging too
[07:49:18] <arnaudb>	 soon/too late. I'm not worried about "paging too late" as its threshold is relatively close to critical
[07:50:06] <marostegui>	 But what was the issue with icinga?
[07:50:12] <arnaudb>	 still pending identification
[07:50:56] <marostegui>	 While I think alertmanager is a nice temporary solution, we really need to find the root cause and fix it. Especially before the break. Migrating to alertmanager is a nice thing, but we shouldn't be forced because icinga isn't doing what it is supposed to do
[07:51:16] <marostegui>	 It is worrying we lost paging like that, maybe o11y should be involved. It is a critical thing before the break
[07:51:18] <arnaudb>	 agreed, its why I asked tappof as I was previously mentionning :)
[07:51:49] <arnaudb>	 because I fear that it will take me too much time if I do it on my own as I'm lacking that part of icinga know how
[07:51:59] <marostegui>	 Yeah, totally, it makes sense to ask them
[07:52:09] <marostegui>	 tappof: You aware this is an unbreak now? :)
[07:52:23] <arnaudb>	 will keep you posted as soon as I get any news marostegui, fwiw its worrying to me as well
[07:52:28] <marostegui>	 thank you :*
[07:58:59] <arnaudb>	 ah marostegui while you're around the chan: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1100114/comment/f9b9b544_7a4693c1/ this question would benefit from your input
[08:00:19] <marostegui>	 I will take a look
[08:00:34] <arnaudb>	 lmk if its lacking context
[08:16:38] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 11.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[08:17:38] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[08:51:02] <marostegui>	 I am gonig to switch es4 codfw master
[09:14:44] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 63.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[09:19:46] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[09:34:29] <marostegui>	 Same with es5
[09:34:30] <marostegui>	 master
[09:43:29] <arnaudb>	 searching for the last time replication paged, I struggled with the volume of scattered IRC logs etc., I've drafted this: https://gitlab.wikimedia.org/arnaudb/tooling/#irc-log-query to help and query irc logs in terminal (only requires fzf as local dependency)
[09:45:32] <marostegui>	 They never stopped paging in irc - not sure about mail though
[09:45:40] <marostegui>	 And of course via phone
[09:47:08] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-mysqld-exporter.service on es2045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:29] <marostegui>	 ^ known
[09:47:35] <marostegui>	 it is being provisioned
[09:48:08] <marostegui>	 es2043 will show up probably too
[10:08:36] <marostegui>	 I am going to reboot all codfw proxies
[10:23:24] <marostegui>	 Amir1: x3 will have a normal topology right?
[10:23:28] <marostegui>	 As in...not like x2
[10:23:58] <Amir1>	 marostegui: yeah but we probably should put it in s8 for and start the split later
[10:24:16] <Amir1>	 (content of x3 are tables are currently in s8)
[10:24:30] * marostegui takes holidays
[10:24:44] <Amir1>	 we should devise a plan how to approach this. It's gonna be fun :P
[10:25:02] <marostegui>	 It is fine, I did that when we splitted s8 from s5
[10:25:07] <marostegui>	 And lots of wikis from s3 to s5
[10:25:14] <marostegui>	 It is just delicate but it should be fine
[10:25:19] <marostegui>	 Famous last words though
[10:25:43] <Amir1>	 haha, yup
[10:26:04] <marostegui>	 Amir1: So basically I can start adding those x3 hosts to s8
[10:26:10] <marostegui>	 And creating x3 concept of cluster in puppet etc
[10:26:11] <marostegui>	 right?
[10:26:19] <marostegui>	 I will add that to my to-do
[10:26:22] <marostegui>	 Is there any timeline?
[10:26:48] <Amir1>	 yup
[10:27:03] <Amir1>	 I'm waiting for WMDE, I sent them a request a year ago but it got lost
[10:27:09] <marostegui>	 XDDD
[10:27:18] <Amir1>	 https://phabricator.wikimedia.org/T351802
[10:27:26] <marostegui>	 but is this blocked on the HW?
[10:27:54] <Amir1>	 no
[10:28:13] <Amir1>	 it's the other way around
[10:28:18] <marostegui>	 excellent
[10:28:20] <Amir1>	 this is blocking proper deployment of x3
[10:29:43] <marostegui>	 Good, I will plan accordingly then
[10:30:55] <Amir1>	 one thing that we need to do with x3 is that it needs replication to the cloud as well since these tables are quite important
[10:31:26] <marostegui>	 woot?
[10:31:39] <marostegui>	 you mean wiki replicas?
[10:31:54] <Amir1>	 yup
[10:32:04] <marostegui>	 Please don't use cloud like that, you scared me
[10:32:08] <marostegui>	 You have no idea to which point
[10:32:35] <Amir1>	 :D sorry next time I use replication the steam point
[10:32:44] <marostegui>	 Just use blockchain
[10:32:52] <Amir1>	 xD
[10:34:43] <marostegui>	 I am going to migrate a couple of more hosts to 10.6.20 and if it all goes, next week I will push it to our repo
[10:35:10] <marostegui>	 I just saw we had a host with replication broken for 16h - this paging thing is terrible
[10:35:12] <Amir1>	 Thanks <3
[10:36:15] <Amir1>	 oh I think icinga is broken, promtheus doesn't page (it alerts) but icinga stopped sending email or page or anything since last week
[10:36:39] <marostegui>	 Amir1: What?
[10:38:34] <Amir1>	 ** RECOVERY alert - db2147 #page/MariaDB Replica Lag: s4 #page is OK *
[10:38:37] <Amir1>	 from  nagios@alert1002.wikimedia.org
[10:38:43] <Amir1>	 these emails stopped coming
[10:39:05] <marostegui>	 That probably matches the fact that we don't have that anymore in icinga, which is what I raised the other day
[10:39:35] <marostegui>	 mmmm wait
[10:39:38] <marostegui>	 Some hosts do have it
[10:39:46] <marostegui>	 Some don't
[10:40:04] <marostegui>	 eg: db1170 doesn't, db2147 does
[10:40:11] <marostegui>	 Can this be related to the dc switchover?
[10:40:21] <Amir1>	 it's very new
[10:40:38] <marostegui>	 Then way db1170 doesn't have the page in icinga but db2147 does
[10:40:41] <volans>	 marostegui: db1170 seems to have it no?
[10:40:43] <volans>	 https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1170&service=MariaDB+Replica+Lag%3A+s7
[10:40:44] <marostegui>	 For any of the checks
[10:41:06] <marostegui>	 volans: I don't see it in the main view of the host
[10:41:11] <volans>	 ah it's in the host titlem, not the desc title
[10:41:16] <marostegui>	 yeah
[10:41:19] <volans>	 weid
[10:41:20] <volans>	 weird
[10:41:22] <marostegui>	 db2147 does have it and db1170 doesn't
[10:48:43] <_joe_>	 I'm trying to understand what you don't see
[10:48:50] <_joe_>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1170&style=detail has the replica lag alerts
[10:49:02] <marostegui>	 _joe_: yes, but not paging
[10:49:08] <volans>	 doesn't have the # page in the description
[10:49:12] <volans>	 that means is not paging
[10:49:17] <marostegui>	 _joe_: https://phabricator.wikimedia.org/T381276#10378887
[10:49:38] <volans>	 that's controlled by $do_paging and $critical in modules/monitoring/manifests/service.pp
[10:49:38] <marostegui>	 And it is indeed not paging, as last sunday it should've
[10:49:38] <_joe_>	 marostegui: ahh ok so the problem is that , ok, I did not understand what you said before
[10:49:52] <_joe_>	 I would say from sampling that it's the dbs in the secundary DC that don't page
[10:50:06] <marostegui>	 _joe_: That is what I suspect too
[10:50:10] <marostegui>	 But that's "new"
[10:50:19] <marostegui>	 Definitely not the case before i went on sabbatical
[10:52:27] <Amir1>	 I don't think we had the issue until last week
[10:52:32] <volans>	 profile::monitoring::is_critical and profile::monitoring::do_paging are both true for db1170 (checked with the hiera lookup in spicerack)
[10:53:38] <marostegui>	 And for db1163 (dc master) we only page for read_only
[10:53:41] <marostegui>	 This is so weird
[10:54:01] <_joe_>	 so
[10:54:13] <_joe_>	 in profile::mariadb::core
[10:54:17] <_joe_>	         mariadb::monitor_replication { $shard:
[10:54:17] <_joe_>	             is_critical => $is_writeable_dc,
[10:54:17] <_joe_>	             source_dc   => $source_dc,
[10:54:17] <_joe_>	         }
[10:54:47] <_joe_>	 and before
[10:54:50] <_joe_>	 $is_writeable_dc = profile::mariadb::section_params::is_writeable_dc($shard)
[10:55:06] <_joe_>	 so I guess replica lag in the non-primary DC is by design?
[10:56:23] <Amir1>	 depends on how much lag, it shouldn't go above 1s
[10:56:38] <_joe_>	 Amir1: I'm saying that icinga is doing what puppet tells it to do
[10:56:48] <_joe_>	 not that it's a good idea, I don't think that's true
[10:57:05] <Amir1>	 I see
[10:59:41] <_joe_>	 I would say we should alert on replica lag for any db in the shards that have 'writeable_dc' set to 'mwprimary'
[10:59:50] <_joe_>	 for everything else I guess it's case by case
[11:02:21] <_joe_>	 btw, this behaviour is not new
[11:02:33] <_joe_>	 it's been coded like that since, uhm, 3 years at least
[11:02:49] <arnaudb>	 ha, i thought so as well but kept doubting
[11:02:51] <_joe_>	 we're just noticing because we've had a few replication broken cases
[11:04:42] <marostegui>	 _joe_: I am unsure that's the case, I am sure we've had pages in secondary DC while we had active-active
[11:04:44] <marostegui>	 I am 100% sure
[11:05:29] <_joe_>	 marostegui: not in the last 2-3 years about replica lag, unless someone magically rewrote git history
[11:05:36] <arnaudb>	 marostegui: I haven't been able to find anything with the tool I've shown you before, maybe if you have a pattern I can look for it, I can backlog all registered IRC logs for a channel since it started archiving and search through it in a matter of a few seconds with it https://asciinema.org/a/wstx1rsOTdSQaJjwRnFR6M0K2
[11:05:59] <marostegui>	 _joe_: I am talking about replication broken (but behaves the same as lag)
[11:05:59] <arnaudb>	 my first angle was to look through git and I landed on the same conclusion that _joe_ reached 
[11:06:23] <_joe_>	 marostegui: is it a separate alert?
[11:06:27] <marostegui>	 yes
[11:06:30] <arnaudb>	 marostegui: as I said I haven't been able to find any paging pattern yet, if you have some I'd be glad to sort through logs :D
[11:07:22] <arnaudb>	 for instance on operations, the last paging mariadb for replication was due to missing processes on 26/10
[11:07:39] <marostegui>	 but for which host? eqiad?
[11:07:40] <arnaudb>	 but this was for io
[11:07:49] <arnaudb>	 20241026.txt: [16:28:15] <icinga-wm>	 PROBLEM - mysqld processes on db1234 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:07:49] <marostegui>	 io?
[11:07:59] <marostegui>	 but that's not page
[11:08:04] <marostegui>	 (and it should=
[11:08:07] <marostegui>	 )
[11:08:14] <arnaudb>	 20241116.txt: [16:20:55] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: metawiki. [Query snipped]
[11:08:15] <arnaudb>	 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:08:17] <arnaudb>	 wrong line sorry
[11:08:22] <arnaudb>	 16/11
[11:08:24] <volans>	 arnaudb: for pages you can just look at https://sal.toolforge.org/production
[11:08:52] <marostegui>	 I am seeing a page from db2144 July 18th for replication lag
[11:08:59] <marostegui>	 And at that time codfw was secondary
[11:09:12] <arnaudb>	 lets check
[11:09:35] <marostegui>	 Lots of them Feb 24th
[11:09:38] <marostegui>	 For codfw too
[11:09:53] <marostegui>	 Or Mar 7th
[11:10:48] <arnaudb>	 https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-data-persistence/20240718.txt
[11:11:21] <arnaudb>	 I don't see any #.page, is it supposed to be paging on operations maybe?
[11:11:44] <marostegui>	 yes
[11:11:46] <arnaudb>	 https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/20240718.txt
[11:11:48] <marostegui>	 all pages go to operations
[11:11:55] <marostegui>	 they don't come to this channel
[11:12:03] <arnaudb>	 [11:58:05] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: x2 #page on db2144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica → this is the alert
[11:12:04] <marostegui>	 they never did
[11:12:13] <arnaudb>	 we do have an alert then! 
[11:12:15] <Seddon>	 Huh... Why did that page ping me
[11:12:20] <marostegui>	 yes, that is and in feb 24 we have plenty
[11:12:30] <arnaudb>	 sorry Seddon hasty copy paste 
[11:12:36] <Seddon>	 All good!
[11:28:14] <arnaudb>	 20241118.txt: [14:43:54] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 #pa.ge on db2216 is OK: OK slave_sql_lag Replication lag: 5.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:28:14] <arnaudb>	 this is the last time it we got a pa.ge notification for a replication issue
[11:32:06] <marostegui>	 But that could be simply because we didn't have any in the ACTIVE dc
[11:32:15] <marostegui>	 The problem seem to be the secondary 
[11:34:08] <Amir1>	 marostegui: FYI I'm starting a schema change on eqiad s8 but explicitly skipping db1167 (sanitarium master) so it shouldn't impact your revision alter table. Is that fine?
[11:35:20] <marostegui>	 yeahj
[11:37:09] <Amir1>	 awesome
[11:39:19] <marostegui>	 I made some recap at https://phabricator.wikimedia.org/T381276#10379062
[11:45:20] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 89.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[11:45:24] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 16.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[11:47:48] <jinxer-wm>	 FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (4m 18s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[11:49:26] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1155 is CRITICAL: 39.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13314
[11:49:48] <jinxer-wm>	 FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (4m 20s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[11:50:45] <Amir1>	 yay?
[11:51:26] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1155 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13314
[11:52:20] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)10 ge (W)5 ge 1.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[11:52:24] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)10 ge (W)5 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[11:57:48] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s3 on db1212 is CRITICAL: 22 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
[12:03:05] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 598.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[12:22:13] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s3 on db1212 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1212&var-port=9104
[12:54:48] <jinxer-wm>	 RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (2m 1s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[12:57:05] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[12:57:48] <jinxer-wm>	 RESOLVED: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (2m 50s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[13:12:25] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 17.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[13:15:25] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)10 ge (W)5 ge 2.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[13:20:21] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 65 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[13:20:25] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 23 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[13:20:29] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s4 on db1155 is CRITICAL: 68.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13314
[13:28:25] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[13:30:21] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[13:30:27] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s4 on db1155 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13314
[13:42:08] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:59:41] <_joe_>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100457 for T381276
[13:59:42] <stashbot>	 T381276: replication breakage is not not paging anymore - https://phabricator.wikimedia.org/T381276
[15:56:22] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 11.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[15:58:24] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[17:13:38] <_joe_>	 the UBN! is fixed I think
[17:42:08] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:44:58] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 71.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[19:52:48] <jinxer-wm>	 FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (3m 42s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[19:55:18] <jinxer-wm>	 FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (1m 39s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:02:48] <jinxer-wm>	 RESOLVED: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (1m 40s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[20:05:18] <jinxer-wm>	 RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (1m 2s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[20:06:03] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[21:09:16] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 17.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[21:10:16] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[21:42:08] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2043:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed