[00:05:20] <dr0ptp4kt>	 !log DEPLOYED Refinery at 4e7a2b32 for changes: pageview allowlist 1305158 (+min.wikiquote) 1305162 (+bol.wikipedia), 1305156 (+isv.wikipedia); 1305980 (pv allowlist -api.wikimedia, sqoop +isvwiki); sqoop 1295064 (+globalimagelinks) 1295069 (+filerevision) using scap, then deployed onto HDFS (manual copyToLocal required additionally)
[00:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:08] <wikibugs>	 (03PS1) 10Ladsgroup: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666)
[00:11:21] <icinga-wm>	 PROBLEM - Host cirrussearch2103 is DOWN: PING CRITICAL - Packet loss = 100%
[00:11:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: 10Ladsgroup)
[00:11:47] <icinga-wm>	 RECOVERY - Host cirrussearch2103 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms
[00:13:17] <icinga-wm>	 PROBLEM - Host cirrussearch2102 is DOWN: PING CRITICAL - Packet loss = 100%
[00:14:47] <icinga-wm>	 RECOVERY - Host cirrussearch2102 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms
[00:15:01] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2103.codfw.wmnet with OS trixie
[00:18:37] <icinga-wm>	 RECOVERY - SSH on urldownloader2005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:20:11] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2102.codfw.wmnet with OS trixie
[00:21:42] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service urldownloader2005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:21:45] <wikibugs>	 (03PS2) 10Ladsgroup: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666)
[00:22:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: 10Ladsgroup)
[00:24:23] <wikibugs>	 (03PS3) 10Ladsgroup: Puppet 8: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666)
[00:27:40] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: 10Ladsgroup)
[00:27:43] <icinga-wm>	 PROBLEM - Host es1039 #page is DOWN: PING CRITICAL - Packet loss = 100%
[00:29:31] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es7 #page on es1035 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:29:33] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es7 #page on es1040 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:29:34] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es7 #page on es1048 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:29:37] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es7 #page on es2039 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:29:54] <cdanis>	 sirenbot: stfu
[00:30:00] <cdanis>	 sirenbot: !ack
[00:30:21] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:30:35] <icinga-wm>	 RECOVERY - Host es1039 #page is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[00:31:33] <icinga-wm>	 PROBLEM - MariaDB read only es7 #page on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[00:32:31] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler es7 on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[00:32:33] <icinga-wm>	 PROBLEM - mysqld processes #page on es1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:33:31] <icinga-wm>	 PROBLEM - MariaDB Events es7 on es1039 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[00:33:33] <icinga-wm>	 PROBLEM - pt-heartbeat-wikimedia process on es1039 is CRITICAL: PROCS CRITICAL: 0 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat
[00:34:31] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es7 #page on es1035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:34:33] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es7 #page on es1040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:34:34] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es7 #page on es1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:34:37] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es7 #page on es2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:35:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:39:07] <Amir1>	 jesus
[00:39:24] <cdanis>	 thing just rebooted, and I think the clock is wrong in SEL
[00:39:33] <cdanis>	 it's been up since
[00:40:11] <Amir1>	 first let me remove it from writes
[00:41:06] <cdanis>	 let me know what I can do, if you need more hands
[00:41:36] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1035 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1306798 (https://phabricator.wikimedia.org/T430765)
[00:41:42] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1306799 (https://phabricator.wikimedia.org/T430765)
[00:42:22] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set es7 eqiad as read-only for maintenance - T430765', diff saved to https://phabricator.wikimedia.org/P94645 and previous config saved to /var/cache/conftool/dbconfig/20260701-004221-ladsgroup.json
[00:42:26] <stashbot>	 T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765
[00:42:28] <sukhe>	 heya folks. I am also around if you need an extra pair of hands. I am assuming we are doing a switchover?
[00:42:30] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074209 (10CDanis) p:05Triage→03High
[00:42:33] <sukhe>	 ok thanks Amir1 
[00:42:36] <Amir1>	 okay, the user impact should be gone now
[00:42:40] <cdanis>	 thanks sukhe 
[00:42:54] <sukhe>	 cdanis: sorry you caught a bad one <3
[00:43:01] <cdanis>	 and thanks Amir1 -- I went to look at the External storage wikitech page and was met with a diagram that mentioned PMTPA
[00:43:11] <Amir1>	 it's removed from writes
[00:43:20] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074213 (10CDanis)
[00:43:24] <sukhe>	 Amir1: can you later also tell us what steps you performed for posterity?
[00:43:26] <Amir1>	 cdanis: yeah, it took me like five minutes to remember how to remove ES from the write pool
[00:43:27] <sukhe>	 https://wikitech.wikimedia.org/wiki/Primary_database_switchover ?
[00:43:37] <sukhe>	 later
[00:44:03] <Amir1>	 it's a change I made that removes it from the RW pool of ES clusters and moves it to the RO ones, so replag wouldn't matter
[00:44:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306795 (owner: 10TrainBranchBot)
[00:44:09] <Amir1>	 sudo dbctl --scope eqiad section es7 ro "Maintenance - T430765"
[00:44:09] <Amir1>	 sudo dbctl --scope codfw section es7 ro "Maintenance - T430765"
[00:44:09] <Amir1>	 sudo dbctl config commit -m "Set es7 eqiad as read-only for maintenance - T430765"
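(Annotation) The three dbctl invocations pasted above can be generated per datacenter scope. The following is a minimal sketch that only builds and prints the command lines and never runs them. The function name `es_section_ro_commands` and its parameters are hypothetical; the dbctl arguments are copied from the pasted commands and nothing beyond them is assumed.

```python
import shlex


def es_section_ro_commands(section, scopes, task, commit_message):
    """Build (but do not run) the dbctl calls quoted in the log above.

    Hypothetical helper: it mirrors the pasted commands only, one
    'section ... ro' call per scope followed by a single config commit.
    """
    reason = f"Maintenance - {task}"
    commands = [
        ["sudo", "dbctl", "--scope", scope, "section", section, "ro", reason]
        for scope in scopes
    ]
    # A single commit publishes the pending changes for every scope.
    commands.append(["sudo", "dbctl", "config", "commit", "-m", commit_message])
    return commands


if __name__ == "__main__":
    for argv in es_section_ro_commands(
        "es7",
        ["eqiad", "codfw"],
        "T430765",
        "Set es7 eqiad as read-only for maintenance - T430765",
    ):
        # shlex.join quotes arguments containing spaces for safe copy-paste.
        print(shlex.join(argv))
```

Printing rather than executing keeps the operator in the loop: each line can be reviewed before it is pasted onto a cumin host.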
[00:44:24] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074215 (10CDanis) 00:42:36 <Amir1> okay, the user impact should be gone now 00:43:11 <Amir1> it's removed from writes  Followups: documentation, documentation, documentation.
[00:44:57] <Amir1>	 okay, now I need to bring it back online
[00:45:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:45:18] <Amir1>	 the switchover would be much easier once the host is online
[00:46:29] <icinga-wm>	 RECOVERY - mysqld processes #page on es1039 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:47:29] <icinga-wm>	 RECOVERY - MariaDB Events es7 on es1039 is OK: OK - All 2 events in ops database are ENABLED https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[00:47:31] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es7 #page on es1035 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:47:32] <icinga-wm>	 RECOVERY - MariaDB Event Scheduler es7 on es1039 is OK: Version 10.11.16-MariaDB-log, Uptime 69s, read_only: True, event_scheduler: True, 14.53 QPS, connection latency: 0.015102s, query latency: 0.001073s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[00:47:33] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es7 #page on es1040 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:47:34] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es7 #page on es1048 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:47:37] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es7 #page on es2039 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:49:29] <icinga-wm>	 RECOVERY - pt-heartbeat-wikimedia process on es1039 is OK: PROCS OK: 3 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat
[00:49:37] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: es7 #page on es2039 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:50:21] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:50:31] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: es7 #page on es1035 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:50:33] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: es7 #page on es1040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:50:34] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: es7 #page on es1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[00:51:03] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074230 (10Ladsgroup) For later: ` Jul 01 00:46:43 es1039 mysqld[5895]: 2026-07-01  0:46:43 6 [Warning] Detected table cache mutex contention at instance 1: 26% waits. Additional table cache instance cannot be a...
[00:51:16] <Amir1>	 okay, now that everything is normal, I'll do the switchover
[00:53:16] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Primary switchover es7 T430765
[00:53:19] <stashbot>	 T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765
[00:53:30] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set es1035 with weight 0 T430765', diff saved to https://phabricator.wikimedia.org/P94646 and previous config saved to /var/cache/conftool/dbconfig/20260701-005329-ladsgroup.json
[00:54:31] <icinga-wm>	 RECOVERY - MariaDB read only es7 #page on es1039 is OK: Version 10.11.16-MariaDB-log, Uptime 489s, read_only: False, event_scheduler: True, 31.83 QPS, connection latency: 0.028759s, query latency: 0.000833s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[00:55:01] <cdanis>	 😌
[00:55:33] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074250 (10ssingh) ` < Amir1> it's a change I made that makes it removed from the RW pool of ES clusters and moves it to RO ones so replag wouldn't matter < Amir1> sudo dbctl --scope eqiad section es7 ro...
[00:57:55] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote es1035 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1306798 (https://phabricator.wikimedia.org/T430765) (owner: 10Gerrit maintenance bot)
[00:58:26] <Amir1>	 !log Starting es7 eqiad failover from es1039 to es1035 - T430765
[00:58:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:29] <stashbot>	 T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765
[01:00:03] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote es1035 to es7 primary T430765', diff saved to https://phabricator.wikimedia.org/P94647 and previous config saved to /var/cache/conftool/dbconfig/20260701-010002-ladsgroup.json
[01:00:37] <sukhe>	 Amir1: for later, last question sorry -- are you following the steps at https://wikitech.wikimedia.org/wiki/Primary_database_switchover or is there something else we should reference?
[01:00:46] <sukhe>	 thinking of time when you or another db won't be around
[01:01:04] <sukhe>	 *dba
[01:01:13] <Amir1>	 we follow the checklist outlined in the ticket
[01:01:14] <Amir1>	 https://phabricator.wikimedia.org/T430765
[01:01:28] <Amir1>	 the checklist is produced by switchmaster (https://switchmaster.toolforge.org/)
[01:01:50] <sukhe>	 thank you, updating the current task
[01:03:26] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074260 (10ssingh) For the checklist on the switchover steps:  ` < Amir1> we follow the checklist outlined in the ticket < Amir1> https://phabricator.wikimedia.org/T430765 < Amir1> the checklist is produced by s...
[01:03:36] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1306799 (https://phabricator.wikimedia.org/T430765) (owner: 10Gerrit maintenance bot)
[01:03:55] <logmsgbot>	 !log ladsgroup@dns1004 START - running authdns-update
[01:05:52] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool es1039 T430765', diff saved to https://phabricator.wikimedia.org/P94648 and previous config saved to /var/cache/conftool/dbconfig/20260701-010551-ladsgroup.json
[01:05:55] <stashbot>	 T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765
[01:05:58] <logmsgbot>	 !log ladsgroup@dns1004 END - running authdns-update
[01:06:53] <Amir1>	 cdanis: sukhe: Pooling es7 back for writes
[01:07:00] <cdanis>	 thank you!
[01:07:09] <sukhe>	 Amir1: <3 please make sure you take time off in lieu of this
[01:07:17] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set es7 eqiad back to read-write - T430765', diff saved to https://phabricator.wikimedia.org/P94649 and previous config saved to /var/cache/conftool/dbconfig/20260701-010716-ladsgroup.json
[01:12:21] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306800
[01:12:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306800 (owner: 10TrainBranchBot)
[01:14:24] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074275 (10Ladsgroup) The switchover is done, the cluster is RW now: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=2026-07-01T00:12:34.145Z&to=2026-07-01T01:09:38.332Z&timezone=utc&var-...
[01:16:02] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074276 (10Ladsgroup) Also it's important to check mariadb logs (the systemd service logs) to make sure things are not firework-y. The crash recovery mechanism of MariaDB is quite robust these days but you never...
[01:16:28] <Amir1>	 I'm still around for a bit to finish my tiff clean up work. Ping me if there are issues
[01:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306800 (owner: 10TrainBranchBot)
[01:24:58] <wikibugs>	 06SRE, 06DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074277 (10Ladsgroup) And heartbeat needs a restart after crash (the `pt-heartbeat-wikimedia` systemd service)
[01:25:49] <Amir1>	 that PMTPA graph is even funnier: https://wikitech.wikimedia.org/wiki/File:External_storage_single_cluster.png
[01:25:56] <Amir1>	 uploaded by 127.0.0.1
[01:45:42] <cdanis>	 I'm pretty sure I know who did that
[02:00:40] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:03:19] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2072.codfw.wmnet with OS trixie
[02:07:35] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 54s)
[02:09:42] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS trixie
[02:09:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:14] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2072.codfw.wmnet with reason: host reimage
[02:26:55] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage
[02:27:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:30:22] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2072.codfw.wmnet with reason: host reimage
[02:31:45] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2107.codfw.wmnet with OS trixie
[02:35:26] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage
[02:44:42] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:et-0/0/0 (Transport: Hurricane Electric (dc4841.sin1) {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:51:27] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2107.codfw.wmnet with reason: host reimage
[02:51:41] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2072.codfw.wmnet with OS trixie
[02:55:21] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2085.codfw.wmnet with OS trixie
[02:59:25] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2107.codfw.wmnet with reason: host reimage
[03:02:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:21:24] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2107.codfw.wmnet with OS trixie
[03:29:29] <icinga-wm>	 PROBLEM - Host es1039 #page is DOWN: PING CRITICAL - Packet loss = 100%
[03:30:37] <jelto>	 ! incidents 
[03:30:44] <jelto>	 !incidents
[03:30:45] <sirenbot>	 8118 (UNACKED)  Host es1039 (paged)
[03:30:45] <sirenbot>	 8112 (RESOLVED)  es1039 (paged)/MariaDB read only es7 (paged)
[03:30:45] <sirenbot>	 8116 (RESOLVED)  es1048 (paged)/MariaDB Replica Lag: es7 (paged)
[03:30:45] <sirenbot>	 8114 (RESOLVED)  es1040 (paged)/MariaDB Replica Lag: es7 (paged)
[03:30:45] <sirenbot>	 8115 (RESOLVED)  es1035 (paged)/MariaDB Replica Lag: es7 (paged)
[03:30:46] <sirenbot>	 8117 (RESOLVED)  es2039 (paged)/MariaDB Replica Lag: es7 (paged)
[03:30:46] <sirenbot>	 8111 (RESOLVED)  es2039 (paged)/MariaDB Replica IO: es7 (paged)
[03:30:46] <sirenbot>	 8108 (RESOLVED)  es1035 (paged)/MariaDB Replica IO: es7 (paged)
[03:30:46] <sirenbot>	 8109 (RESOLVED)  es1048 (paged)/MariaDB Replica IO: es7 (paged)
[03:30:47] <sirenbot>	 8110 (RESOLVED)  es1040 (paged)/MariaDB Replica IO: es7 (paged)
[03:30:47] <sirenbot>	 8113 (RESOLVED)  es1039 (paged)/mysqld processes (paged)
[03:30:48] <sirenbot>	 8107 (RESOLVED)  Host es1039 (paged)
[03:30:59] <jelto>	 !ack 8118
[03:30:59] <sirenbot>	 8118 (ACKED)  Host es1039 (paged)
[03:33:40] <icinga-wm>	 RECOVERY - Host es1039 #page is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[03:33:49] <jelto>	 wow great
[03:33:54] <slyngs>	 Nice
[03:34:38] <icinga-wm>	 PROBLEM - mysqld processes #page on es1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:34:38] <icinga-wm>	 PROBLEM - MariaDB Event Scheduler es7 on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[03:34:49] <jelto>	 !incidents
[03:34:50] <sirenbot>	 8119 (UNACKED)  es1039 (paged)/mysqld processes (paged)
[03:34:50] <sirenbot>	 8118 (RESOLVED)  Host es1039 (paged)
[03:34:50] <sirenbot>	 8112 (RESOLVED)  es1039 (paged)/MariaDB read only es7 (paged)
[03:34:50] <sirenbot>	 8116 (RESOLVED)  es1048 (paged)/MariaDB Replica Lag: es7 (paged)
[03:34:51] <sirenbot>	 8114 (RESOLVED)  es1040 (paged)/MariaDB Replica Lag: es7 (paged)
[03:34:51] <sirenbot>	 8115 (RESOLVED)  es1035 (paged)/MariaDB Replica Lag: es7 (paged)
[03:34:51] <sirenbot>	 8117 (RESOLVED)  es2039 (paged)/MariaDB Replica Lag: es7 (paged)
[03:34:51] <sirenbot>	 8111 (RESOLVED)  es2039 (paged)/MariaDB Replica IO: es7 (paged)
[03:34:51] <sirenbot>	 8108 (RESOLVED)  es1035 (paged)/MariaDB Replica IO: es7 (paged)
[03:34:52] <sirenbot>	 8109 (RESOLVED)  es1048 (paged)/MariaDB Replica IO: es7 (paged)
[03:34:52] <sirenbot>	 8110 (RESOLVED)  es1040 (paged)/MariaDB Replica IO: es7 (paged)
[03:34:53] <sirenbot>	 8113 (RESOLVED)  es1039 (paged)/mysqld processes (paged)
[03:34:53] <sirenbot>	 8107 (RESOLVED)  Host es1039 (paged)
[03:34:58] <jelto>	 !ack 8119
[03:34:59] <sirenbot>	 8119 (ACKED)  es1039 (paged)/mysqld processes (paged)
[03:35:15] <jelto>	 yeah es1039 has 2 min uptime
[03:35:18] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: es7 #page on es1039 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:35:19] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es7 #page on es1039 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:35:24] <icinga-wm>	 PROBLEM - MariaDB read only es7 on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[03:35:31] <jelto>	 !ack
[03:35:32] <sirenbot>	 8120 (ACKED)  es1039 (paged)/MariaDB Replica SQL: es7 (paged)
[03:35:32] <sirenbot>	 8121 (ACKED)  es1039 (paged)/MariaDB Replica IO: es7 (paged)
[03:35:38] <icinga-wm>	 PROBLEM - MariaDB Events es7 on es1039 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler
[03:36:43] <jelto>	 should we try depooling the host?
[03:38:09] <slyngs>	 I think so, probably best to let a dba check it before using it, if it crashed
[03:39:50] <slyngs>	 Is someone working on it? I'll just check the last puppet commit
[03:39:52] <jelto>	 I think the same thing happened a few hours ago, see https://phabricator.wikimedia.org/T430764 and cortobot
[03:41:21] <slyngs>	 Okay, let's depool it 
[03:41:58] <jelto>	 The ticket states "I leave es1039 depooled for HW inspection and what is wrong. Repool when needed."
[03:42:05] <jelto>	 let me check if the host is still depooled
[03:42:18] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: es7 #page on es1039 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[03:42:26] <slyngs>	 There's also this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1306798
[03:42:27] <jelto>	 !incidents
[03:42:28] <sirenbot>	 8119 (ACKED)  es1039 (paged)/mysqld processes (paged)
[03:42:28] <sirenbot>	 8120 (ACKED)  es1039 (paged)/MariaDB Replica SQL: es7 (paged)
[03:42:28] <sirenbot>	 8121 (ACKED)  es1039 (paged)/MariaDB Replica IO: es7 (paged)
[03:42:28] <sirenbot>	 8122 (UNACKED)  es1039 (paged)/MariaDB Replica Lag: es7 (paged)
[03:42:29] <sirenbot>	 8118 (RESOLVED)  Host es1039 (paged)
[03:42:29] <sirenbot>	 8112 (RESOLVED)  es1039 (paged)/MariaDB read only es7 (paged)
[03:42:29] <sirenbot>	 8116 (RESOLVED)  es1048 (paged)/MariaDB Replica Lag: es7 (paged)
[03:42:29] <sirenbot>	 8114 (RESOLVED)  es1040 (paged)/MariaDB Replica Lag: es7 (paged)
[03:42:29] <sirenbot>	 8115 (RESOLVED)  es1035 (paged)/MariaDB Replica Lag: es7 (paged)
[03:42:30] <sirenbot>	 8117 (RESOLVED)  es2039 (paged)/MariaDB Replica Lag: es7 (paged)
[03:42:30] <sirenbot>	 8111 (RESOLVED)  es2039 (paged)/MariaDB Replica IO: es7 (paged)
[03:42:31] <sirenbot>	 8108 (RESOLVED)  es1035 (paged)/MariaDB Replica IO: es7 (paged)
[03:42:31] <sirenbot>	 8109 (RESOLVED)  es1048 (paged)/MariaDB Replica IO: es7 (paged)
[03:42:32] <sirenbot>	 8110 (RESOLVED)  es1040 (paged)/MariaDB Replica IO: es7 (paged)
[03:42:32] <sirenbot>	 8113 (RESOLVED)  es1039 (paged)/mysqld processes (paged)
[03:42:33] <sirenbot>	 8107 (RESOLVED)  Host es1039 (paged)
[03:42:40] <jelto>	 !ack 8118
[03:42:40] <sirenbot>	 Attempt to ack incident 8118 failed.
[03:42:53] <slyngs>	 !8119
[03:43:00] <slyngs>	 !8122
[03:43:13] <jelto>	 !ack 8122
[03:43:14] <sirenbot>	 8122 (ACKED)  es1039 (paged)/MariaDB Replica Lag: es7 (paged)
[03:43:21] <slyngs>	 Weee :-)
[03:43:57] <jelto>	 yeah, that was probably the switch to another master? es1039 -> es1035?