[00:05:20] !log DEPLOYED Refinery at 4e7a2b32 for changes: pageview allowlist 1305158 (+min.wikiquote) 1305162 (+bol.wikipedia), 1305156 (+isv.wikipedia); 1305980 (pv allowlist -api.wikimedia, sqoop +isvwiki); sqoop 1295064 (+globalimagelinks) 1295069 (+filerevision) using scap, then deployed onto HDFS (manual copyToLocal required additionally) [00:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:08] (PS1) Ladsgroup: Puppet 8: Replace legacy facts [puppet] - https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) [00:11:21] PROBLEM - Host cirrussearch2103 is DOWN: PING CRITICAL - Packet loss = 100% [00:11:42] (CR) CI reject: [V:-1] Puppet 8: Replace legacy facts [puppet] - https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: Ladsgroup) [00:11:47] RECOVERY - Host cirrussearch2103 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [00:13:17] PROBLEM - Host cirrussearch2102 is DOWN: PING CRITICAL - Packet loss = 100% [00:14:47] RECOVERY - Host cirrussearch2102 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [00:15:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2103.codfw.wmnet with OS trixie [00:18:37] RECOVERY - SSH on urldownloader2005 is OK: SSH OK - OpenSSH_10.0p2 Debian-7+deb13u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:20:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2102.codfw.wmnet with OS trixie [00:21:42] RESOLVED: [2x] ProbeDown: Service urldownloader2005:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:45] (PS2) Ladsgroup: Puppet 8: Replace legacy facts
[puppet] - https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) [00:22:17] (CR) CI reject: [V:-1] Puppet 8: Replace legacy facts [puppet] - https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: Ladsgroup) [00:24:23] (PS3) Ladsgroup: Puppet 8: Replace legacy facts [puppet] - https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) [00:27:40] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1306797 (https://phabricator.wikimedia.org/T372666) (owner: Ladsgroup) [00:27:43] PROBLEM - Host es1039 #page is DOWN: PING CRITICAL - Packet loss = 100% [00:29:31] PROBLEM - MariaDB Replica IO: es7 #page on es1035 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:29:33] PROBLEM - MariaDB Replica IO: es7 #page on es1040 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:29:34] PROBLEM - MariaDB Replica IO: es7 #page on es1048 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:29:37] PROBLEM - MariaDB Replica IO: es7 #page on es2039 is CRITICAL: CRITICAL
slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es1039.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es1039.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:29:54] sirenbot: stfu [00:30:00] sirenbot: !ack [00:30:21] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:30:35] RECOVERY - Host es1039 #page is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [00:31:33] PROBLEM - MariaDB read only es7 #page on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [00:32:31] PROBLEM - MariaDB Event Scheduler es7 on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [00:32:33] PROBLEM - mysqld processes #page on es1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:33:31] PROBLEM - MariaDB Events es7 on es1039 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [00:33:33] PROBLEM - pt-heartbeat-wikimedia process on es1039 is CRITICAL: PROCS CRITICAL: 0 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat [00:34:31] PROBLEM - MariaDB Replica Lag: es7 #page on es1035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 
604.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:34:33] PROBLEM - MariaDB Replica Lag: es7 #page on es1040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:34:34] PROBLEM - MariaDB Replica Lag: es7 #page on es1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:34:37] PROBLEM - MariaDB Replica Lag: es7 #page on es2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:35:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:39:07] jesus [00:39:24] thing just rebooted, and I think the clock is wrong in SEL [00:39:33] it's been up since [00:40:11] first let me remove it from writes [00:41:06] let me know what I can do, if you need more hands [00:41:36] (PS1) Gerrit maintenance bot: mariadb: Promote es1035 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1306798 (https://phabricator.wikimedia.org/T430765) [00:41:42] (PS1) Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1306799 (https://phabricator.wikimedia.org/T430765) [00:42:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set es7 eqiad as read-only for maintenance - T430765', diff saved to https://phabricator.wikimedia.org/P94645 and previous config saved to /var/cache/conftool/dbconfig/20260701-004221-ladsgroup.json [00:42:26] T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765 [00:42:28] heya folks. I am also around if you need an extra pair of hands.
I am assuming we are doing a switchover? [00:42:30] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074209 (CDanis) p:Triage→High [00:42:33] ok thanks Amir1 [00:42:36] okay, the user impact should be gone now [00:42:40] thanks sukhe [00:42:54] cdanis: sorry you caught a bad one <3 [00:43:01] and thanks Amir1 -- I went to look at the External storage wikitech page and was met with a diagram that mentioned PMTPA [00:43:11] it's removed from writes [00:43:20] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074213 (CDanis) [00:43:24] Amir1: can you later also tell us what steps you performed for posterity? [00:43:26] cdanis: yeah, it took me like five minutes to remember how to remove ES from the write pool [00:43:27] https://wikitech.wikimedia.org/wiki/Primary_database_switchover ? [00:43:37] later [00:44:03] it's a change I made that makes it removed from the RW pool of ES clusters and moves it to RO ones so replag wouldn't matter [00:44:03] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1306795 (owner: TrainBranchBot) [00:44:09] sudo dbctl --scope eqiad section es7 ro "Maintenance - T430765" [00:44:09] sudo dbctl --scope codfw section es7 ro "Maintenance - T430765" [00:44:09] sudo dbctl config commit -m "Set es7 eqiad as read-only for maintenance - T430765" [00:44:24] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074215 (CDanis) 00:42:36 okay, the user impact should be gone now 00:43:11 it's removed from writes Followups: documentation, documentation, documentation.
[00:44:57] okay, now I need to bring it back online [00:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:45:18] the switchover would be much easier once the host is online [00:46:29] RECOVERY - mysqld processes #page on es1039 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [00:47:29] RECOVERY - MariaDB Events es7 on es1039 is OK: OK - All 2 events in ops database are ENABLED https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [00:47:31] RECOVERY - MariaDB Replica IO: es7 #page on es1035 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:47:32] RECOVERY - MariaDB Event Scheduler es7 on es1039 is OK: Version 10.11.16-MariaDB-log, Uptime 69s, read_only: True, event_scheduler: True, 14.53 QPS, connection latency: 0.015102s, query latency: 0.001073s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [00:47:33] RECOVERY - MariaDB Replica IO: es7 #page on es1040 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:47:34] RECOVERY - MariaDB Replica IO: es7 #page on es1048 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:47:37] RECOVERY - MariaDB Replica IO: es7 #page on es2039 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:49:29] RECOVERY - pt-heartbeat-wikimedia process on es1039 is OK: PROCS OK: 3 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat [00:49:37] RECOVERY - MariaDB Replica Lag: 
es7 #page on es2039 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:50:21] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:50:31] RECOVERY - MariaDB Replica Lag: es7 #page on es1035 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:50:33] RECOVERY - MariaDB Replica Lag: es7 #page on es1040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:50:34] RECOVERY - MariaDB Replica Lag: es7 #page on es1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [00:51:03] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074230 (Ladsgroup) For later: ` Jul 01 00:46:43 es1039 mysqld[5895]: 2026-07-01 0:46:43 6 [Warning] Detected table cache mutex contention at instance 1: 26% waits. Additional table cache instance cannot be a... [00:51:16] okay, now that everything is normal.
I do the switchover [00:53:16] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Primary switchover es7 T430765 [00:53:19] T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765 [00:53:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set es1035 with weight 0 T430765', diff saved to https://phabricator.wikimedia.org/P94646 and previous config saved to /var/cache/conftool/dbconfig/20260701-005329-ladsgroup.json [00:54:31] RECOVERY - MariaDB read only es7 #page on es1039 is OK: Version 10.11.16-MariaDB-log, Uptime 489s, read_only: False, event_scheduler: True, 31.83 QPS, connection latency: 0.028759s, query latency: 0.000833s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [00:55:01] 😌 [00:55:33] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074250 (ssingh) ` < Amir1> it's a change I made that makes it removed from the RW pool of ES clusters and moves it to RO ones so replag wouldn't matter chBot) < Amir1> sudo dbctl --scope eqiad section es7 ro...
[00:57:55] (CR) Ladsgroup: [C:+2] mariadb: Promote es1035 to es7 master [puppet] - https://gerrit.wikimedia.org/r/1306798 (https://phabricator.wikimedia.org/T430765) (owner: Gerrit maintenance bot) [00:58:26] !log Starting es7 eqiad failover from es1039 to es1035 - T430765 [00:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:29] T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765 [01:00:03] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Promote es1035 to es7 primary T430765', diff saved to https://phabricator.wikimedia.org/P94647 and previous config saved to /var/cache/conftool/dbconfig/20260701-010002-ladsgroup.json [01:00:37] Amir1: for later, last question sorry -- are you following the steps at https://wikitech.wikimedia.org/wiki/Primary_database_switchover or is there something else we should reference? [01:00:46] thinking of time when you or another db won't be around [01:01:04] *dba [01:01:13] we follow the checklist outlined in the ticket [01:01:14] https://phabricator.wikimedia.org/T430765 [01:01:28] the checklist is produced by switchmaster (https://switchmaster.toolforge.org/ [01:01:50] thank you, updating the current task [01:03:26] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074260 (ssingh) For the checklist on the switchover steps: ` < Amir1> we follow the checklist outlined in the ticket < Amir1> https://phabricator.wikimedia.org/T430765 < Amir1> the checklist is produced by s...
[01:03:36] (CR) Ladsgroup: [C:+2] wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1306799 (https://phabricator.wikimedia.org/T430765) (owner: Gerrit maintenance bot) [01:03:55] !log ladsgroup@dns1004 START - running authdns-update [01:05:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool es1039 T430765', diff saved to https://phabricator.wikimedia.org/P94648 and previous config saved to /var/cache/conftool/dbconfig/20260701-010551-ladsgroup.json [01:05:55] T430765: Switchover es7 master (es1039 -> es1035) - https://phabricator.wikimedia.org/T430765 [01:05:58] !log ladsgroup@dns1004 END - running authdns-update [01:06:53] cdanis: sukhe: Pooling es7 back for writes [01:07:00] thank you! [01:07:09] Amir1: <3 please make sure you take time off in lieu of this [01:07:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Set es7 eqiad back to read-write - T430765', diff saved to https://phabricator.wikimedia.org/P94649 and previous config saved to /var/cache/conftool/dbconfig/20260701-010716-ladsgroup.json [01:12:21] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306800 [01:12:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306800 (owner: TrainBranchBot) [01:14:24] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074275 (Ladsgroup) The switchover is done, the cluster is RW now: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=2026-07-01T00:12:34.145Z&to=2026-07-01T01:09:38.332Z&timezone=utc&var-... [01:16:02] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074276 (Ladsgroup) Also it's important to check mariadb logs (the systemd service logs) to make sure things are not firework-y. The crash recovery mechanism of MariaDB is quite robust these days but you never...
[01:16:28] I'm still around for a bit to finish my tiff clean up work. Ping me if there are issues [01:20:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306800 (owner: TrainBranchBot) [01:24:58] SRE, DBA: es7 primary (es1039.eqiad.wmnet) crashed - https://phabricator.wikimedia.org/T430764#12074277 (Ladsgroup) And heartbeat needs a restart after crash (the `pt-heartbeat-wikimedia` systemd service) [01:25:49] that pmpta graph is even funnier: https://wikitech.wikimedia.org/wiki/File:External_storage_single_cluster.png [01:25:56] uploaded by 127.0.0.1 [01:45:42] I'm pretty sure I know who did that [02:00:40] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:03:19] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2072.codfw.wmnet with OS trixie [02:07:35] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 54s) [02:09:42] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS trixie [02:09:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:14] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2072.codfw.wmnet with reason: host reimage [02:26:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage
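Note for posterity, following the "what steps you performed" question above: the read-only toggle quoted at 00:44:09 can be sketched as a small dry-run helper. This is only a sketch under stated assumptions, not the DBA procedure. The `ro` and `config commit -m` invocations are copied from the log; the `rw` form is an assumption (the log only records the commit message 'Set es7 eqiad back to read-write - T430765', not the command that produced it). The function names are hypothetical, and nothing here executes dbctl; it only builds and prints the command lines.

```python
import shlex
from typing import List


def section_ro_commands(section: str, scopes: List[str],
                        reason: str, commit_message: str) -> List[List[str]]:
    """Build (do not run) the dbctl calls quoted in the log at 00:44:09."""
    cmds = [
        ["sudo", "dbctl", "--scope", scope, "section", section, "ro", reason]
        for scope in scopes
    ]
    # One commit applies the pending changes for all scopes.
    cmds.append(["sudo", "dbctl", "config", "commit", "-m", commit_message])
    return cmds


def section_rw_commands(section: str, scopes: List[str],
                        commit_message: str) -> List[List[str]]:
    """ASSUMPTION: `section <name> rw` mirrors `ro`; not shown in the log."""
    cmds = [
        ["sudo", "dbctl", "--scope", scope, "section", section, "rw"]
        for scope in scopes
    ]
    cmds.append(["sudo", "dbctl", "config", "commit", "-m", commit_message])
    return cmds


if __name__ == "__main__":
    plan = section_ro_commands(
        "es7", ["eqiad", "codfw"],
        "Maintenance - T430765",
        "Set es7 eqiad as read-only for maintenance - T430765",
    )
    # Dry run: print each command so a human can review before pasting.
    for cmd in plan:
        print(shlex.join(cmd))
```

Keeping this as a dry run is deliberate: the commands change production database configuration, so the sketch only prints them for review.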
[02:27:40] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:22] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2072.codfw.wmnet with reason: host reimage [02:31:45] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2107.codfw.wmnet with OS trixie [02:35:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage [02:44:42] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:et-0/0/0 (Transport: Hurricane Electric (dc4841.sin1) {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:51:27] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2107.codfw.wmnet with reason: host reimage [02:51:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2072.codfw.wmnet with OS trixie [02:55:21] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2085.codfw.wmnet with OS trixie [02:59:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2107.codfw.wmnet with reason: host reimage [03:02:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:21:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2107.codfw.wmnet with OS trixie [03:29:29] PROBLEM - Host es1039 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:30:37] ! incidents [03:30:44] !incidents [03:30:45] 8118 (UNACKED) Host es1039 (paged) [03:30:45] 8112 (RESOLVED) es1039 (paged)/MariaDB read only es7 (paged) [03:30:45] 8116 (RESOLVED) es1048 (paged)/MariaDB Replica Lag: es7 (paged) [03:30:45] 8114 (RESOLVED) es1040 (paged)/MariaDB Replica Lag: es7 (paged) [03:30:45] 8115 (RESOLVED) es1035 (paged)/MariaDB Replica Lag: es7 (paged) [03:30:46] 8117 (RESOLVED) es2039 (paged)/MariaDB Replica Lag: es7 (paged) [03:30:46] 8111 (RESOLVED) es2039 (paged)/MariaDB Replica IO: es7 (paged) [03:30:46] 8108 (RESOLVED) es1035 (paged)/MariaDB Replica IO: es7 (paged) [03:30:46] 8109 (RESOLVED) es1048 (paged)/MariaDB Replica IO: es7 (paged) [03:30:47] 8110 (RESOLVED) es1040 (paged)/MariaDB Replica IO: es7 (paged) [03:30:47] 8113 (RESOLVED) es1039 (paged)/mysqld processes (paged) [03:30:48] 8107 (RESOLVED) Host es1039 (paged) [03:30:59] !ack 8118 [03:30:59] 8118 (ACKED) Host es1039 (paged) [03:33:40] RECOVERY - Host es1039 #page is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [03:33:49] wow great [03:33:54] Nice [03:34:38] PROBLEM - mysqld processes #page on es1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [03:34:38] PROBLEM - MariaDB Event Scheduler es7 on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [03:34:49] !incidents [03:34:50] 8119 (UNACKED) es1039 (paged)/mysqld processes (paged) [03:34:50] 8118 (RESOLVED) Host es1039 (paged) [03:34:50] 8112 (RESOLVED) es1039 
(paged)/MariaDB read only es7 (paged) [03:34:50] 8116 (RESOLVED) es1048 (paged)/MariaDB Replica Lag: es7 (paged) [03:34:51] 8114 (RESOLVED) es1040 (paged)/MariaDB Replica Lag: es7 (paged) [03:34:51] 8115 (RESOLVED) es1035 (paged)/MariaDB Replica Lag: es7 (paged) [03:34:51] 8117 (RESOLVED) es2039 (paged)/MariaDB Replica Lag: es7 (paged) [03:34:51] 8111 (RESOLVED) es2039 (paged)/MariaDB Replica IO: es7 (paged) [03:34:51] 8108 (RESOLVED) es1035 (paged)/MariaDB Replica IO: es7 (paged) [03:34:52] 8109 (RESOLVED) es1048 (paged)/MariaDB Replica IO: es7 (paged) [03:34:52] 8110 (RESOLVED) es1040 (paged)/MariaDB Replica IO: es7 (paged) [03:34:53] 8113 (RESOLVED) es1039 (paged)/mysqld processes (paged) [03:34:53] 8107 (RESOLVED) Host es1039 (paged) [03:34:58] !ack 8119 [03:34:59] 8119 (ACKED) es1039 (paged)/mysqld processes (paged) [03:35:15] yeah es1039 has 2 min uptime [03:35:18] PROBLEM - MariaDB Replica SQL: es7 #page on es1039 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [03:35:19] PROBLEM - MariaDB Replica IO: es7 #page on es1039 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [03:35:24] PROBLEM - MariaDB read only es7 on es1039 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [03:35:31] !ack [03:35:32] 8120 (ACKED) es1039 (paged)/MariaDB Replica SQL: es7 (paged) [03:35:32] 8121 (ACKED) es1039 (paged)/MariaDB Replica IO: es7 (paged) [03:35:38] PROBLEM - MariaDB Events es7 on es1039 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [03:36:43] should we try depooling the host? 
[03:38:09] I think so, probably best to let a dba check it before using it, if it crashed [03:39:50] Is someone working on it. I'll just check the last puppet commit [03:39:52] I think the same thing happened a few hours ago, see https://phabricator.wikimedia.org/T430764 and cortobot [03:41:21] Okay, let's depool it [03:41:58] The ticket states "I leave es1039 depooled for HW inspection and what is wrong. Repool when needed." [03:42:05] let me check if the host is still depooled [03:42:18] PROBLEM - MariaDB Replica Lag: es7 #page on es1039 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [03:42:26] There's also this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1306798 [03:42:27] !incidents [03:42:28] 8119 (ACKED) es1039 (paged)/mysqld processes (paged) [03:42:28] 8120 (ACKED) es1039 (paged)/MariaDB Replica SQL: es7 (paged) [03:42:28] 8121 (ACKED) es1039 (paged)/MariaDB Replica IO: es7 (paged) [03:42:28] 8122 (UNACKED) es1039 (paged)/MariaDB Replica Lag: es7 (paged) [03:42:29] 8118 (RESOLVED) Host es1039 (paged) [03:42:29] 8112 (RESOLVED) es1039 (paged)/MariaDB read only es7 (paged) [03:42:29] 8116 (RESOLVED) es1048 (paged)/MariaDB Replica Lag: es7 (paged) [03:42:29] 8114 (RESOLVED) es1040 (paged)/MariaDB Replica Lag: es7 (paged) [03:42:29] 8115 (RESOLVED) es1035 (paged)/MariaDB Replica Lag: es7 (paged) [03:42:30] 8117 (RESOLVED) es2039 (paged)/MariaDB Replica Lag: es7 (paged) [03:42:30] 8111 (RESOLVED) es2039 (paged)/MariaDB Replica IO: es7 (paged) [03:42:31] 8108 (RESOLVED) es1035 (paged)/MariaDB Replica IO: es7 (paged) [03:42:31] 8109 (RESOLVED) es1048 (paged)/MariaDB Replica IO: es7 (paged) [03:42:32] 8110 (RESOLVED) es1040 (paged)/MariaDB Replica IO: es7 (paged) [03:42:32] 8113 (RESOLVED) es1039 (paged)/mysqld processes (paged) [03:42:33] 8107 (RESOLVED) Host es1039 (paged) [03:42:40] !ack 8118 [03:42:40] Attempt to ack incident 8118 failed. 
[03:42:53] !8119 [03:43:00] !8122 [03:43:13] !ack 8122 [03:43:14] 8122 (ACKED) es1039 (paged)/MariaDB Replica Lag: es7 (paged) [03:43:21] Weee :-) [03:43:57] yeah, that was probably the switch to another master? es1039 -> es1035?