[00:02:36] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:48] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:06:22] !log Updated the Wikidata property suggester with data from the 2021-07-12 JSON dump (with pre-applied T132839 workarounds) [00:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:31] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [00:06:32] (03CR) 10Juan90264: [C: 03+1] Adding square logo and wordmark for Wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [00:25:18] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:36] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1463 MB (5% inode=94%): /tmp 1463 MB (5% inode=94%): /var/tmp 1463 MB (5% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [01:01:08] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:28] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [01:16:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:38] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:32] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:26:10] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:24] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:02:02] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:26:40] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:30] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:27:06] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:44:46] PROBLEM - SSH on mw1273.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:02:26] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:26] (03PS2) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 [05:24:01] (03CR) 10jerkins-bot: [V: 04-1] Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 (owner: 10Muehlenhoff) [05:26:34] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:17] (03PS3) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 [05:30:14] (03CR) 10jerkins-bot: [V: 04-1] Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 (owner: 10Muehlenhoff) [05:35:26] (03PS4) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 [05:36:44] (03CR) 10jerkins-bot: [V: 04-1] Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 (owner: 10Muehlenhoff) [06:01:50] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:08:15] (03PS1) 10Muehlenhoff: Use types for apt::pin [puppet] - 10https://gerrit.wikimedia.org/r/704889 [06:09:30] (03CR) 10jerkins-bot: [V: 04-1] Use types for apt::pin [puppet] - 10https://gerrit.wikimedia.org/r/704889 (owner: 10Muehlenhoff) [06:13:32] (03PS2) 10Muehlenhoff: Use types for apt::pin [puppet] - 10https://gerrit.wikimedia.org/r/704889 [06:19:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704889 (owner: 10Muehlenhoff) [06:22:58] (03PS1) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/704890 [06:23:31] (03CR) 10jerkins-bot: [V: 04-1] Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/704890 (owner: 10Muehlenhoff) [06:25:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704889 (owner: 10Muehlenhoff) [06:40:26] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:40:57] (03CR) 10Elukey: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704889 (owner: 10Muehlenhoff) [06:41:00] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:46:22] RECOVERY - SSH on mw1273.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:53:54] (03PS2) 10Muehlenhoff: Support 
new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/704890 [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210716T0700) [07:00:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704890 (owner: 10Muehlenhoff) [07:02:50] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:31] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) >>! In T275873#7215311, @fgiunchedi wrote: >> This is tracked by upstream at https://github.com/prometheus/node_exporter/issues/1892 and their solution is to also ma... [07:18:42] (03PS3) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/704890 [07:21:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704890 (owner: 10Muehlenhoff) [07:25:49] (03Abandoned) 10Muehlenhoff: Support new src: prefix in apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/701536 (owner: 10Muehlenhoff) [07:26:48] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:12] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 2540 MB (9% inode=95%): /tmp 2540 MB (9% inode=95%): /var/tmp 2540 MB (9% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [07:35:21] seems like elastic1039 has lost its drive behind /srv [07:38:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:42:19] dcausse: that's a fun start to Friday! [07:42:50] yes :/ [07:43:26] dcausse: I see https://phabricator.wikimedia.org/T285643, maybe already WIP? [07:43:32] ah that's a known problem (T285643), might just be a downtime that expired [07:43:33] T285643: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 [07:43:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:43:55] elukey: yes looks like it :) [07:43:58] ack :) [07:44:21] this node is already banned from the cluster [07:44:38] going to try to extend the downtime by one week [07:45:05] dcausse: let's hope everything today is as simple as extending a downtime [07:45:32] * dcausse cross fingers :) [07:50:18] * RhinosF1 crosses everything and touches wood at once [07:51:18] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.518e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [07:51:18] that's me ^ [07:56:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:57:26] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:57:32] sigh... I'm not authorized to downtime these services on elastic1039 even though I'm root there :/ [08:01:15] (03PS1) 10Filippo Giunchedi: swift: enable listing_formats on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) [08:02:03] (03CR) 10jerkins-bot: [V: 04-1] swift: enable listing_formats on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:02:32] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:39] dcausse: ow :( I can do the downtime but yeah that should be fixed at the icinga level [08:03:02] dcausse: all services on 1039 ? [08:04:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:04:52] PROBLEM - Host db1127 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:59] godog: should only be the RAID and the disk space but all services is fine too, this host can't do much anyways [08:05:03] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:05:07] (03PS2) 10Filippo Giunchedi: swift: enable listing_formats on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) [08:05:25] dcausse: ack, will do [08:05:34] godog: thanks! 
ticket is T285643 [08:05:35] T285643: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 [08:05:40] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.02757 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:05:47] (03PS1) 10JMeybohm: dragonfly: Don't run pki::get_cert in ensure=absent case [puppet] - 10https://gerrit.wikimedia.org/r/704921 (https://phabricator.wikimedia.org/T286054) [08:06:21] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30234/console" [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:06:24] RECOVERY - Host db1127 is UP: PING OK - Packet loss = 0%, RTA = 3.07 ms [08:06:30] dcausse: sure np! [08:07:00] (03PS1) 10DCausse: rdf-streaming-updater: configure allowed kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 (https://phabricator.wikimedia.org/T265526) [08:07:12] (03CR) 10DCausse: [C: 03+1] flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) (owner: 10JMeybohm) [08:09:54] PROBLEM - MariaDB Replica SQL: s7 on db1127 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:09:58] PROBLEM - MariaDB Replica IO: s7 on db1127 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:10:53] (03CR) 10JMeybohm: [C: 03+2] dragonfly: Don't run pki::get_cert in ensure=absent case [puppet] - 10https://gerrit.wikimedia.org/r/704921 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:11:04] PROBLEM - mysqld processes on db1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:11:56] PROBLEM - MariaDB read only s7 on db1127 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:12:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704890 (owner: 10Muehlenhoff) [08:13:35] (03PS1) 10Jgiannelos: tegola: Enable k8s probes. Fix typos in DB queries. [deployment-charts] - 10https://gerrit.wikimedia.org/r/704923 [08:15:59] (03PS2) 10Jgiannelos: tegola: Enable k8s probes. Fix typos in DB queries. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/704923 [08:19:04] (03PS1) 10Filippo Giunchedi: puppet_compiler: test ssh access to compilers [puppet] - 10https://gerrit.wikimedia.org/r/704924 [08:22:42] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30235/console" [puppet] - 10https://gerrit.wikimedia.org/r/704920 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [08:25:08] PROBLEM - MariaDB Replica Lag: s7 on db1127 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:26:02] (03CR) 10JMeybohm: [C: 03+1] rdf-streaming-updater: configure allowed kafka clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [08:26:06] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:12] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:23] 10SRE, 10ops-codfw: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10MoritzMuehlenhoff) [08:26:24] PROBLEM - puppet last run on kubernetes1009 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:30] PROBLEM - puppet last run on kubernetes1006 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:26:40] 10SRE, 10ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10MoritzMuehlenhoff) [08:27:06] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:02] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:12] PROBLEM - puppet last run on kubernetes1017 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:16] PROBLEM - puppet last run on kubernetes1010 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:18] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:26] PROBLEM - puppet last run on kubernetes2005 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:30] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1127 due to RAM failures T286763', diff saved to https://phabricator.wikimedia.org/P16827 and previous config saved to /var/cache/conftool/dbconfig/20210716-082829-kormat.json [08:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:38] PROBLEM - puppet last run on kubernetes2014 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:38] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:39] T286763: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 [08:28:42] PROBLEM - puppet last run on kubernetes1008 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:28:58] PROBLEM - puppet last run on kubernetes2006 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:08] 10SRE, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10MoritzMuehlenhoff) https://phabricator.wikimedia.org/T286763 is another instance where monitoring would have prevented a DB server from rebooting itself. [08:29:32] PROBLEM - puppet last run on kubernetes1011 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:32] PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:33] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:29:34] oh, sorry for the puppet noise, that was me re-enabling it [08:29:37] (03PS1) 10Kormat: db1127: Disable notificactions. [puppet] - 10https://gerrit.wikimedia.org/r/704926 (https://phabricator.wikimedia.org/T286763) [08:29:58] PROBLEM - puppet last run on kubernetes2007 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:00] PROBLEM - puppet last run on kubernetes2011 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:06] PROBLEM - puppet last run on kubernetes1014 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:16] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:28] (03CR) 10Kormat: [C: 03+2] db1127: Disable notificactions. 
[puppet] - 10https://gerrit.wikimedia.org/r/704926 (https://phabricator.wikimedia.org/T286763) (owner: 10Kormat) [08:30:30] PROBLEM - puppet last run on kubernetes2015 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:34] PROBLEM - puppet last run on kubernetes1005 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:34] PROBLEM - puppet last run on kubernetes2013 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:30:42] PROBLEM - puppet last run on kubernetes2009 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:31:00] PROBLEM - puppet last run on kubernetes2017 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:33:02] ACKNOWLEDGEMENT - MariaDB Replica IO: s7 on db1127 is CRITICAL: CRITICAL slave_io_state could not connect Kormat RAM broken T286763 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:33:02] ACKNOWLEDGEMENT - MariaDB Replica Lag: s7 on db1127 is CRITICAL: CRITICAL slave_sql_lag could not connect Kormat RAM broken T286763 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:33:02] ACKNOWLEDGEMENT - MariaDB Replica SQL: s7 on db1127 is CRITICAL: CRITICAL slave_sql_state could not connect Kormat RAM broken T286763 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:33:02] ACKNOWLEDGEMENT - MariaDB read only s7 on db1127 is CRITICAL: Could not connect to localhost:3306 Kormat RAM broken T286763 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:33:02] ACKNOWLEDGEMENT - mysqld processes on db1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Kormat RAM broken T286763 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:38:36] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:38:36] RECOVERY - puppet last run on kubernetes2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:48] RECOVERY - puppet last run on kubernetes1017 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:40:48] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:41:33] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) There are a couple of puppet patches pending but otherwise things seem to work fine on thanos-fe2001! 
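For context, the db1127 depool logged at 08:28:30 above is normally driven by dbctl on a cumin host before the Icinga acknowledgements are set. A minimal sketch of that workflow, assuming the usual dbctl subcommands (only the host and the commit message come from the log; the exact invocation may have differed):

    # sketch: depool a broken replica and commit the change (run as root on a cumin host)
    dbctl instance db1127 depool
    dbctl config commit -m "Depooling db1127 due to RAM failures T286763"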
[08:45:12] RECOVERY - puppet last run on kubernetes2015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:45:12] RECOVERY - puppet last run on kubernetes2005 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:26] RECOVERY - puppet last run on kubernetes1005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:26] RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:47:49] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10fgiunchedi) Something else I noticed in `node-exporter`, `node_cpu_frequency_hertz` is gone thus the [[ https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&editPanel=29... [08:49:34] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops [08:51:54] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on kubernetes1009 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on kubernetes1014 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:51:54] RECOVERY - puppet last run on kubernetes2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:58] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (33) node(s) change every puppet run: kubernetes2014, kubernetes1007, kubernetes2007, kubernetes2004, kubernetes1004, kubernetes1002, kubernetes1010, kubernetes1015, kubernetes1016, thanos-be1003, kubernetes2006, kubernetes2008, kubernetes1014, kubernetes2002, kubernetes1013, kubernetes2001, kubernetes2017, kubernetes20 [08:52:58] rnetes1008, kubernetes1006, labstore1006, kubernetes2010, kubernetes1011, kubernetes2015, kubernetes2005, kubernetes2011, kubernetes1005, kubernetes1003, kubernetes2016, kubernetes1017, kubernetes2003, kubernetes2009, kubernetes1009 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:53:48] RECOVERY - puppet last run on kubernetes1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:54:06] RECOVERY - puppet last run on kubernetes2017 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:54:06] RECOVERY - puppet last run on kubernetes2011 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:54:30] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T273281 [08:54:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T273281 [08:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:14] RECOVERY - puppet last run on kubernetes1006 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:56:16] RECOVERY - puppet last run on kubernetes2007 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:56:16] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:57:10] RECOVERY - puppet last run on kubernetes2014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:00:44] RECOVERY - puppet last run on kubernetes1010 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:00:44] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:00:44] RECOVERY - puppet last run on kubernetes2003 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:01:28] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T273281 [09:02:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T273281 [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:58] RECOVERY - puppet last run on kubernetes2013 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:03:43] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Lumen Scheduled Maintenance #: 21610453. Should be done by now but not overly worried. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:04:13] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Lumen Scheduled Maintenance #: 21610453. Should be done by now but not overly worried. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:08] RECOVERY - puppet last run on kubernetes1011 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:05:08] RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:14:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:16:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:00] (03PS3) 10Vgutierrez: Remove SSH key for samtar [puppet] - 10https://gerrit.wikimedia.org/r/704779 (https://phabricator.wikimedia.org/T286714) (owner: 10Samtar) [09:23:29] (03CR) 10Vgutierrez: [C: 03+2] Remove SSH key for samtar [puppet] - 10https://gerrit.wikimedia.org/r/704779 (https://phabricator.wikimedia.org/T286714) (owner: 10Samtar) [09:24:43] 10SRE, 10LDAP, 10Patch-For-Review: Remove SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Vgutierrez) 05Open→03Resolved [09:25:36] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Addshore) >>! In T285104#7193129, @Ladsgroup wrote: > The main person working on this is Kunal and he w... [09:26:06] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:25] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:32:24] !log restart rsyslog on kubestage1001.eqiad.wmnet [09:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:38] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:41:35] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Ladsgroup) Yeah, we're looking at it :D It will take a bit of time, there some open questions like LVS... 
[09:47:21] !log cordon kubestage1002.eqiad.wmnet as it currently does not feed logs to logstash [09:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:00] (03CR) 10Effie Mouzeli: [C: 03+2] flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) (owner: 10JMeybohm) [09:54:47] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2007.codfw.wmnet [09:54:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet [09:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Use types for apt::pin [puppet] - 10https://gerrit.wikimedia.org/r/704889 (owner: 10Muehlenhoff) [09:59:06] (03PS1) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [10:02:30] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:24] (03CR) 10David Caro: [C: 03+2] ceph: Update health alert url to runbook [puppet] - 10https://gerrit.wikimedia.org/r/704545 (owner: 10David Caro) [10:04:37] (03CR) 10David Caro: [C: 03+2] ceph: Update dashboard links to tags [puppet] - 10https://gerrit.wikimedia.org/r/704547 (owner: 10David Caro) [10:06:20] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:08:35] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10jijiki) There are no logs that indicate throttling but rather what I see from the graphs . It appears that the CPU does not scale up higher than ~1GHz. Also, the message `[Mon Jul 12 07:17:38 2021] Code: Bad RIP value.` yi... [10:09:27] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:10:15] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: Enable k8s probes. Fix typos in DB queries. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/704923 (owner: 10Jgiannelos) [10:10:25] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [10:13:01] (03PS2) 10Effie Mouzeli: tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 [10:21:37] (03PS2) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [10:22:42] 10SRE, 10SRE Observability: node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768 (10fgiunchedi) [10:23:52] (03CR) 10David Caro: [C: 03+2] wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:23:56] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:01] (03CR) 10David Caro: [C: 03+2] wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:24:06] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:24:13] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:25:39] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: Enable k8s probes. Fix typos in DB queries. [deployment-charts] - 10https://gerrit.wikimedia.org/r/704923 (owner: 10Jgiannelos) [10:26:44] (03Merged) 10jenkins-bot: wmcs: add kubernetes and kubeadm controllers [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702089 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:26:47] (03Merged) 10jenkins-bot: wmcs.toolforge: add k8s worker add/remove cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702090 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:26:49] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Enable k8s probes. Fix typos in DB queries. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/704923 (owner: 10Jgiannelos) [10:27:03] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 (owner: 10Effie Mouzeli) [10:27:23] (03Merged) 10jenkins-bot: wmcs.toolforge: add task-id to k8s worker cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702091 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [10:27:25] (03Merged) 10jenkins-bot: wmcs.ceph: rename the ceph controller to CephClusterController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702929 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:27:27] (03Merged) 10jenkins-bot: wmcs.ceph: add cookbook to bootstrap and add OSDs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702930 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [10:28:57] PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.149 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:29:47] (03Merged) 10jenkins-bot: tegola-vector-tiles: Enable k8s probes. Fix typos in DB queries. [deployment-charts] - 10https://gerrit.wikimedia.org/r/704923 (owner: 10Jgiannelos) [10:29:58] RECOVERY - Query Service HTTP Port on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:31:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/704924 (owner: 10Filippo Giunchedi) [10:35:57] (03PS1) 10Btullis: Update sre.zookeeper.roll-restart to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/704937 (https://phabricator.wikimedia.org/T269925) [10:38:02] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:39:04] (03PS2) 10Btullis: Enable kerberos for btullis [puppet] - 10https://gerrit.wikimedia.org/r/704562 (https://phabricator.wikimedia.org/T285754) [10:41:37] (03CR) 10Btullis: [C: 03+2] Enable kerberos for btullis [puppet] - 10https://gerrit.wikimedia.org/r/704562 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [10:45:50] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:01:16] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:16] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [11:06:42] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [11:08:46] (03CR) 10Zfilipin: [C: 03+1] "> Patch Set 3:" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [11:13:44] (03CR) 10Elukey: [C: 03+1] Update sre.zookeeper.roll-restart to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/704937 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [11:14:01] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch 
buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [11:14:19] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [11:16:46] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [11:17:18] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10cmooney) [11:20:50] (03CR) 10Btullis: [C: 03+2] Update sre.zookeeper.roll-restart to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/704937 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [11:23:24] (03Merged) 10jenkins-bot: Update sre.zookeeper.roll-restart to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/704937 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [11:26:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [11:29:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [11:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:16] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 973 mismatched wikiversions daniel_zahn https://phabricator.wikimedia.org/T286463 https://wikitech.wikimedia.org/wiki/Application_servers [11:33:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps2009.codfw.wmnet with reason: Service profiling tests [11:33:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2009.codfw.wmnet with reason: Service profiling tests [11:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:06] (03CR) 10Effie Mouzeli: [C: 03+2] Rakefile: Fix undefined error variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/704837 (owner: 10JMeybohm) [11:38:08] (03CR) 10Dzahn: [C: 03+1] "lgtm. also since this is not the $active_host in class profile::gitlab it should not create the actual backup set in Bacula (the basics fr" [puppet] - 10https://gerrit.wikimedia.org/r/704801 (https://phabricator.wikimedia.org/T285870) (owner: 10Jelto) [11:38:46] (03Merged) 10jenkins-bot: Rakefile: Fix undefined error variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/704837 (owner: 10JMeybohm) [11:39:06] (03CR) 10Dzahn: "ACK, that's different. everyone seems to be in favor of this anyways though" [puppet] - 10https://gerrit.wikimedia.org/r/692370 (owner: 10Zabe) [11:39:54] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (10aborrero) Sharing a bit our experience @ WMCS with ingress-nginx: > What is the general architecture? Basically you deploy a kubernetes Deployment with a tailored NGINX that is able to p... 
[11:40:59] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 (owner: 10Effie Mouzeli) [11:42:01] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [11:44:02] (03CR) 10Effie Mouzeli: [C: 03+2] common_templates: Don't fail if kafka.allowed_clusters is not defined [deployment-charts] - 10https://gerrit.wikimedia.org/r/704843 (owner: 10JMeybohm) [11:46:01] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [11:46:18] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:37] (03Merged) 10jenkins-bot: common_templates: Don't fail if kafka.allowed_clusters is not defined [deployment-charts] - 10https://gerrit.wikimedia.org/r/704843 (owner: 10JMeybohm) [11:47:04] (03Merged) 10jenkins-bot: flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) (owner: 10JMeybohm) [11:48:42] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: configure allowed kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [11:50:08] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: configure allowed kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [11:51:41] (03PS2) 10Filippo Giunchedi: puppet_compiler: test ssh access to compilers [puppet] - 10https://gerrit.wikimedia.org/r/704924 [11:52:19] (03PS1) 10Dzahn: site/conftool: add mw1426, mw1427, mw1428 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/704945 (https://phabricator.wikimedia.org/T279309) [11:52:31] (03PS3) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [11:53:20] (03CR) 10Dzahn: [C: 03+2] site/conftool: add mw1426, mw1427, mw1428 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/704945 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [11:55:08] (03CR) 10Filippo Giunchedi: [C: 03+2] puppet_compiler: test ssh access to compilers [puppet] - 10https://gerrit.wikimedia.org/r/704924 (owner: 10Filippo Giunchedi) [11:58:40] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:07] (03PS1) 10Effie Mouzeli: Add entries for mwdebug service [dns] - 10https://gerrit.wikimedia.org/r/704948 (https://phabricator.wikimedia.org/T283056) [11:59:55] (03PS2) 10Effie Mouzeli: Add entries for mwdebug service [dns] - 10https://gerrit.wikimedia.org/r/704948 (https://phabricator.wikimedia.org/T283056) [12:01:32] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:36] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[12:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:18] (03PS2) 10DCausse: rdf-streaming-updater: configure allowed kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 (https://phabricator.wikimedia.org/T265526) [12:05:21] (03CR) 10Filippo Giunchedi: "Do you have a preview of the metrics somewhere I could look at?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:11:22] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [12:12:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[1426-1428].eqiad.wmnet with reason: new host [12:12:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[1426-1428].eqiad.wmnet with reason: new host [12:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:43] (03PS1) 10Effie Mouzeli: conftool-data: add tegola-vector-tiles discovery [puppet] - 10https://gerrit.wikimedia.org/r/704949 (https://phabricator.wikimedia.org/T283159) [12:14:28] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [12:14:38] !log mw1426, mw1427, mw1428, rebooting, new API servers moving into production [12:14:40] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (10cmooney) [12:14:42] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 [12:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:11] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw142[6-8].eqiad.wmnet [12:16:11] (03CR) 10Effie Mouzeli: tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 (owner: 10Effie Mouzeli) [12:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:36] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw142[6-8].eqiad.wmnet [12:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:24] !log mw1426,mw1427,mw1428 - scap pull [12:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:11] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 (owner: 10Effie Mouzeli) [12:21:51] (03Merged) 10jenkins-bot: tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 (owner: 10Effie Mouzeli) [12:22:26] (03PS1) 10Dzahn: site/conftool: add mw1429 through mw1433 as appservers, rack B3 [puppet] - 10https://gerrit.wikimedia.org/r/704950 (https://phabricator.wikimedia.org/T279309) [12:23:18] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: configure allowed kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [12:26:10] (03Merged) 10jenkins-bot: rdf-streaming-updater: configure allowed kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/704922 
(https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [12:26:21] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/704952 [12:26:33] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw142[6-8].eqiad.wmnet [12:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:12] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [12:29:11] (03CR) 10Dzahn: [C: 03+2] site/conftool: add mw1429 through mw1433 as appservers, rack B3 [puppet] - 10https://gerrit.wikimedia.org/r/704950 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:30:03] (03CR) 10Dzahn: "ah, 1430 is missing of course, will add it" [puppet] - 10https://gerrit.wikimedia.org/r/704950 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:30:05] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [12:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:36] (03PS1) 10Dzahn: site: add mw1430 to regex for new appserver range [puppet] - 10https://gerrit.wikimedia.org/r/704954 (https://phabricator.wikimedia.org/T279309) [12:32:58] (03PS3) 10Effie Mouzeli: Add entries for mwdebug service [dns] - 10https://gerrit.wikimedia.org/r/704948 (https://phabricator.wikimedia.org/T283056) [12:33:43] (03CR) 10Dzahn: [C: 03+2] site: add mw1430 to regex for new appserver range [puppet] - 10https://gerrit.wikimedia.org/r/704954 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:33:50] (03PS2) 10Dzahn: site: add mw1430 to regex for new appserver range [puppet] - 10https://gerrit.wikimedia.org/r/704954 (https://phabricator.wikimedia.org/T279309) [12:35:08] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw1429.eqiad.wmnet [12:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:28] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw143[0-3].eqiad.wmnet [12:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:34] (03PS1) 10Effie Mouzeli: Add entries for tegola-vector-tiles service [dns] - 10https://gerrit.wikimedia.org/r/704955 (https://phabricator.wikimedia.org/T283159) [12:35:49] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [12:35:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw143[0-3].eqiad.wmnet [12:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1429.eqiad.wmnet [12:36:13] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [12:36:15] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/704952 (owner: 10Jgiannelos) [12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:59] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [12:38:58] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump 
chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/704952 (owner: 10Jgiannelos) [12:39:20] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [12:39:33] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [12:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:42] !log mw1412 through mw1428 - set to active in netbox (T279309) [12:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] T279309: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 [12:43:30] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [12:44:32] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [12:44:37] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: kubelet: enable TTLAfterFinished feature gate [puppet] - 10https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108) [12:44:41] 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1012 - Icinga/HP RAID - 2021-07-16 - https://phabricator.wikimedia.org/T286766 (10dcaro) @RobH can someone take a look? If the server is still in warranty we might want to get a replacement for the battery. Thanks! [12:45:02] 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1012 - Icinga/HP RAID - 2021-07-16 - https://phabricator.wikimedia.org/T286766 (10dcaro) a:05dcaro→03None [12:46:38] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1012 - Icinga/HP RAID - 2021-07-16 - https://phabricator.wikimedia.org/T286766 (10dcaro) [12:47:01] (03PS1) 10DCausse: thanos-swift envoy listener: rewrite HTTP host header [puppet] - 10https://gerrit.wikimedia.org/r/704960 (https://phabricator.wikimedia.org/T265526) [12:47:25] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1012 - Icinga/HP RAID - 2021-07-16 - https://phabricator.wikimedia.org/T286766 (10dcaro) [12:47:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1429.eqiad.wmnet with reason: new host [12:47:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1429.eqiad.wmnet with reason: new host [12:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:57] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [12:48:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1430-1433].eqiad.wmnet with reason: new host [12:48:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1430-1433].eqiad.wmnet with reason: new host [12:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:35] !log mw1429 through mw1433 - initial puppet run, reboot, moving into production as appservers (T279309) [12:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
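The conftool and downtime entries above are the SAL-logged traces of commands run from the cluster-management host; a minimal sketch of what the equivalent invocations look like (the confctl form and the cookbook flags are assumptions, not copied from the log):

  # silence the new hosts in Icinga while they are being brought up (hypothetical flags)
  sudo cookbook sre.hosts.downtime --hours 2 -r "new host" 'mw[1430-1433].eqiad.wmnet'
  # assign a weight but keep the hosts depooled until scap pull and the reboot are done
  sudo confctl select 'name=mw143[0-3].eqiad.wmnet' set/weight=30
  sudo confctl select 'name=mw143[0-3].eqiad.wmnet' set/pooled=no
  # pool them once they serve requests correctly
  sudo confctl select 'name=mw143[0-3].eqiad.wmnet' set/pooled=yes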
[12:49:41] T279309: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 [12:51:41] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:52:13] (03PS2) 10Effie Mouzeli: conftool-data: add mwdebug discovery 1 [puppet] - 10https://gerrit.wikimedia.org/r/704799 (https://phabricator.wikimedia.org/T283056) [12:52:24] (03PS2) 10Effie Mouzeli: conftool-data: add tegola-vector-tiles discovery 1 [puppet] - 10https://gerrit.wikimedia.org/r/704949 (https://phabricator.wikimedia.org/T283159) [12:52:26] (03PS1) 10Jgiannelos: tegola-vector-tiles: Enable swift caching on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/704961 [12:54:02] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1012 - Icinga/HP RAID - 2021-07-16 - https://phabricator.wikimedia.org/T286766 (10dcaro) Silenced the alert for 10 days on alertmanager. [12:55:15] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: Enable swift caching on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/704961 (owner: 10Jgiannelos) [12:55:48] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Enable swift caching on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/704961 (owner: 10Jgiannelos) [12:56:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1429.eqiad.wmnet [12:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:25] (03Merged) 10jenkins-bot: tegola-vector-tiles: Enable swift caching on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/704961 (owner: 10Jgiannelos) [13:02:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw143[0-3].eqiad.wmnet [13:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:07] (03PS2) 10Dzahn: site/conftool: remove mw1276 through mw1279 [puppet] - 10https://gerrit.wikimedia.org/r/704287 (https://phabricator.wikimedia.org/T280203) [13:04:03] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1012 - Icinga/HP RAID - 2021-07-16 - https://phabricator.wikimedia.org/T286766 (10dcaro) [13:04:06] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet - https://phabricator.wikimedia.org/T286748 (10dcaro) [13:04:56] (03CR) 10Dzahn: "next we will need some new canary API servers, to replace mw1276 - mw1279 (Id9596cca8dad791). Feel like picking some for that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:05:49] (03CR) 10Vgutierrez: [C: 03+1] Add entries for mwdebug service [dns] - 10https://gerrit.wikimedia.org/r/704948 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [13:08:29] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:15] (03PS1) 10Jgiannelos: tegola-vector-tiles: Fix config section for caching [deployment-charts] - 10https://gerrit.wikimedia.org/r/704963 [13:11:07] (03PS1) 10Effie Mouzeli: service::catalog: add mwdebug service 2 [puppet] - 10https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056) [13:11:13] (03CR) 10Jgiannelos: "Here is an example from the docs: https://tegola.io/documentation/configuration/#full-config-example" [deployment-charts] - 10https://gerrit.wikimedia.org/r/704963 (owner: 10Jgiannelos) [13:11:57] (03PS1) 10Dzahn: site/conftool: decom mw1270,mw1271,mw1274,mw1275 [puppet] - 10https://gerrit.wikimedia.org/r/704966 [13:13:41] (03PS2) 10Dzahn: site/conftool: decom mw1270,mw1272,mw1273,mw1274,mw1275 [puppet] - 10https://gerrit.wikimedia.org/r/704966 (https://phabricator.wikimedia.org/T280203) [13:16:49] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:16] (03PS2) 10Jgiannelos: tegola-vector-tiles: Fix config section for caching [deployment-charts] - 10https://gerrit.wikimedia.org/r/704963 [13:17:39] PROBLEM - Host kafka-logging2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:03] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:07] PROBLEM - Host authdns2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:17] PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:19] PROBLEM - Host lvs2007 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:25] PROBLEM - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:27] PROBLEM - Host elastic2055 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:27] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:31] PROBLEM - Host ms-fe2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:45] PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:51] PROBLEM - Host ms-be2044 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={thanos-compact,wmf_elasticsearch} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:18:57] PROBLEM - Host ms-be2029 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:57] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:19:09] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:11] Errr [13:19:12] uhoh that seems like a bad timing for a netsplit [13:19:26] The worst [13:19:43] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, 
down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:20:07] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:20:21] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:20:23] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:20:27] i'm not sure if anything has paged yet, but somthing likely should [13:20:29] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:20:29] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.o [13:20:29] Mobileapps_%28service%29 [13:20:31] PROBLEM - Juniper virtual chassis ports on asw-a-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [13:20:50] majavah: type the word in here? [13:20:51] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:20:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:21:19] (03PS2) 10Effie Mouzeli: add mwdebug service to LVS [puppet] - 10https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056) [13:21:45] effie: mind looking at the alerts [13:21:51] XioNoX: hi ! rack A2 in codfw seems to have gone offline [13:21:59] all the affected hosts I see are i there [13:22:07] yup.. 
looks like that [13:22:32] PROBLEM - LVS text codfw port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::1 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:22:33] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [13:22:40] There it goes majavah [13:23:10] lvs2007 is high-traffic1 primary [13:23:23] here too [13:23:27] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2396.codfw.wmnet, mw2302.codfw.wmnet, mw2397.codfw.wmnet, mw2404.codfw.wmnet, mw2399.codfw.wmnet, mw2298.codfw.wmnet, maps2005.codfw.wmnet, mw2291.codfw.wmnet, mw2401.codfw.wmnet, mw2402.codfw.wmnet, mw2308.codfw.wmnet, mw2294.codfw.wmnet, mw2403.codfw.wmnet, mw2400.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [13:23:28] what's the issue? [13:23:35] topranks: ^ [13:23:37] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2396.codfw.wmnet, mw2296.codfw.wmnet, mw2295.codfw.wmnet, mw2302.codfw.wmnet, cp2029.codfw.wmnet, mw2404.codfw.wmnet, mw2293.codfw.wmnet, cp2028.codfw.wmnet, mw2403.codfw.wmnet, maps2005.codfw.wmnet, mw2292.codfw.wmnet, cp2027.codfw.wmnet, mw2401.codfw.wmnet, maps2001.codfw.wmnet, mw2299.codfw.wmnet, mw2294.codfw.wmnet, cp2030.codf [13:23:37] mw2405.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [13:23:38] indeed, lvs2010 took over lvs207 that apparently [13:23:47] https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=7 [13:23:54] checking user impact [13:24:02] i can still load the wikis fine [13:24:09] wikis worked for me from .de [13:24:22] we need someone in the us to check [13:24:35] eqsin upload isn't happy [13:24:36] (03CR) 10Arturo Borrero Gonzalez: "disclaimer: I'm sharing only to collect feedback. This patch is untested." [puppet] - 10https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108) (owner: 10Arturo Borrero Gonzalez) [13:24:37] just a note that some people might still not be here due to a netsplit [13:24:43] there was a small slowdown on uncached requests [13:24:44] https://grafana.wikimedia.org/d/000000479/frontend-traffic [13:24:45] en ok from Canada [13:24:59] * jayme here [13:25:01] i'm here [13:25:01] tried to call Arzhel but did not work [13:25:03] but cached looks good? 
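One way to verify user impact against codfw specifically, independent of which edge GeoDNS hands out, is to pin a test request to text-lb.codfw (a sketch; the --connect-to form and the test URL are arbitrary choices, not from the log):

  # force the request to the codfw text edge over IPv4 and then IPv6
  curl -4 -sI --connect-to ::text-lb.codfw.wikimedia.org: https://en.wikipedia.org/wiki/Main_Page
  curl -6 -sI --connect-to ::text-lb.codfw.wikimedia.org: https://en.wikipedia.org/wiki/Main_Page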
[13:25:03] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:25:04] ah :) [13:25:08] in a train though [13:25:21] XioNoX: it seems limited to rack A2 [13:25:37] PROBLEM - Prometheus k8s cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [13:25:43] there appear to be no db hosts in A2, that's a pleasant surprise [13:25:51] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [13:25:58] a couple swift hosts [13:25:59] isn't topranks around? [13:26:00] XioNoX: do you want me to ring topranks? [13:26:04] I think XioNoX you're on vac, right? [13:26:05] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:26:13] it looks like cp2027 and cp2029 (A4) aren't reachable by lvs2010 [13:26:21] Let me go ahead and do that then :) [13:26:25] 5xx are subduing https://logstash.wikimedia.org/goto/757e5ad9e396e1324b8d037ee575852e [13:26:31] paravoid: it's ok I don't have anything to do in that train [13:26:34] sobanski: yup, thanks [13:26:48] uncached requess recovering, so maybe it was just the automatic failover? [13:27:01] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.649 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:27:11] rescheduling icinga checks on core routers [13:27:21] PROBLEM - configured eth on lvs2009 is CRITICAL: ens2f1np1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:27:32] can someone that's not XioNoX or topranks be IC? [13:27:42] swift latency is obviously elevated but not troubling afaics [13:27:45] * topranks looking [13:28:13] sry was having my lunch. [13:28:13] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:28:30] jynus or mutante can one of you be IC? 
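The "Hosts in IPVS but unknown to PyBal" check compares the kernel's IPVS table with PyBal's own pool state; both sides can be inspected directly on the LVS host (a sketch, assuming PyBal's instrumentation HTTP interface is enabled on its default port 9090):

  sudo ipvsadm -L -n                          # virtual services and real servers as the kernel sees them
  curl -s http://localhost:9090/pools         # pools PyBal is managing
  curl -s http://localhost:9090/pools/<pool>  # per-backend state for one pool (placeholder name)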
[13:28:34] topranks: unacceptable ;) [13:28:34] yes [13:28:37] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:28:38] thanks jynus :) [13:28:40] creating doc [13:29:01] (03CR) 10DCausse: "PCC output https://puppet-compiler.wmflabs.org/compiler1003/30237/" [puppet] - 10https://gerrit.wikimedia.org/r/704960 (https://phabricator.wikimedia.org/T265526) (owner: 10DCausse) [13:29:04] so lvs in codfw are unable to reach row A [13:29:07] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [13:29:14] let's assume that switch is not coming back up [13:29:18] (I'll step away and let you guys work, I'm around though, let me know if I can help in any way :) [13:29:37] is Papaul working? [13:29:38] that may be the leg LVS has into row A's vlan [13:29:45] where is lvs2010 connected in row A? [13:29:45] XioNoX: not this week [13:30:05] Let me know if I should try reaching out to Willy / Rob [13:30:13] PROBLEM - Prometheus k8s cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [13:30:19] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Fix config section for caching [deployment-charts] - 10https://gerrit.wikimedia.org/r/704963 (owner: 10Jgiannelos) [13:30:29] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:30:33] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:36] we are seeing some alerts because of high load on appservers in other racks [13:31:03] yeah nothing on console so the switch is offline/dead [13:31:29] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:42] sobanski: nah, unless we can't live with that rack off, which I hope we don't :) [13:31:49] the "could be worse part. that rack is half empty [13:31:51] ACK [13:32:01] PROBLEM - configured eth on lvs2010 is CRITICAL: ens2f1np1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:32:17] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:32:20] that's lvs2010 link to row A [13:32:49] ns1.wm.o seems to be down too? 
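Querying the nameserver directly is a quick way to tell whether ns1 itself is unreachable or the problem is local to the observer (a sketch):

  dig @ns1.wikimedia.org en.wikipedia.org +short
  dig @ns1.wikimedia.org soa wikimedia.org +norecurse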
[13:33:01] (03Merged) 10jenkins-bot: tegola-vector-tiles: Fix config section for caching [deployment-charts] - 10https://gerrit.wikimedia.org/r/704963 (owner: 10Jgiannelos) [13:33:01] that's authdns2001 then [13:33:03] the list of ports/hosts: [13:33:09] https://www.irccloud.com/pastebin/6c1TX5QI/ [13:33:14] what paged is 2620:0:860:ed1a::1 unreachable, which seems to be still the case, not sure why though [13:33:25] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:33:26] I'm going to redirect ns1 to ns0 [13:33:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2003 is CRITICAL: 45 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [13:34:07] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:34:51] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:35:00] dcausse: elastic2037, elastic2038, elastic2055 have gone offline because the switch in that rack broke. do they need to be deactivated? [13:35:03] ryankemper: ^ [13:35:05] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2002 is CRITICAL: 47 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [13:35:32] mutante: no the cluster should survive without manual intervention [13:36:05] dcausse: great, thaks [13:36:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:17] app/api server response is back to normal [13:36:23] pushing this to the codfw routers https://www.irccloud.com/pastebin/dEQbkCWp/ [13:36:40] I am not following the current status- is topranks researching, did something recover? [13:36:43] godog: we dont need to to anything with ms-be hosts either? [13:36:45] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:37:04] dcausse: ^ 2 masters? [13:37:07] bad luck :) [13:37:10] heh [13:37:12] jynus: mediawiki has recovered [13:37:18] thanks, effie [13:37:34] please comment aloud if not busy fixing stuff [13:37:39] 0:-) [13:37:43] there were no mw hosts in that rack, it must have been due to swift? [13:37:49] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:37:53] mutante: if they are going to be down for the week-end I think we should mark another node as master eligible, prepping a patch [13:37:55] mutante: not really no, losing a row/rack is planned for swift [13:37:59] can someone check ns1? [13:38:07] I can [13:38:08] XioNoX: works for me [13:38:10] not sure if it's the train wifi blocking ping/dns [13:38:16] nice thanks! 
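To confirm how many master-eligible nodes the alerting search cluster still sees after losing the rack, the Elasticsearch cat APIs can be queried through the service endpoint named in the alert (a sketch; the port comes from the alert above and TLS is assumed for it):

  curl -sk 'https://search.svc.codfw.wmnet:9443/_cat/nodes?h=name,node.role,master'
  curl -sk 'https://search.svc.codfw.wmnet:9443/_cat/master'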
[13:38:20] XioNoX: works for me (dig @ ns1) [13:38:22] wfm as well [13:38:30] dcausse: cool :) [13:38:33] godog: yay [13:38:41] is everything fine with the LVSs? [13:38:45] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [13:38:51] (did they failover properly?) [13:39:00] I can't curl -6 http://text-lb.codfw.wikimedia.org from alert1001 [13:39:06] but https works ?! [13:39:35] curl text-lb.codfw works for me, both v4 and v6 [13:39:58] yeah from outside I can reach it too [13:40:01] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [13:40:31] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:41:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:41:45] maps has high 5xx indeed, not sure if related though [13:41:49] RECOVERY - Prometheus k8s cache not updating on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [13:41:53] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:42:04] next step network wise will be to replace the faulty switch. Note that it's also one of the two "spines" of row A, so that means we don't have redundancy anymore on the row A <-> routers links [13:42:13] seems lot of "services" affected, one of those clusters seems to be suffering the most [13:42:27] (kartotherian, proton, restbase [13:42:42] XioNoX: does the curl -6 test from alert1001 ring a bell ? I was looking into what actually paged [13:43:06] godog: nop [13:43:10] looking [13:43:10] i.e. no route to host for http but https works [13:43:12] I am the one with the uncommited DNS changes in netbox [13:44:01] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:44:17] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:44:23] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:04] I'll go for "maps 5xxs are not related" for now but wouldb e nice to get confirmation [13:45:07] there are LVS alerts on lvs2009,lvs2010 but the backend check alerts are > 20hours old. only the "IPVS diff checks" are from this [13:45:21] /q effie [13:45:35] XioNoX, when you have time, we should expect recoveries from most services after your change? 
[13:45:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps2009.codfw.wmnet with reason: Service profiling tests [13:45:40] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps2009.codfw.wmnet with reason: Service profiling tests [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:55] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:45:57] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [13:46:11] so some things are recovering, but others are alerting now [13:46:19] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb2fc193a90: Failed to establish a new connection: [Errno 113] No route to host): /v4/marker/pin-m+ffffff@2x.png https://wikitech.wikimedia.org/wiki/Maps%23Karto [13:46:39] jynus: which change? [13:46:47] https://www.irccloud.com/pastebin/dEQbkCWp/ [13:46:51] hnowlan_: ^ FYI there are troubles in codfw ATM, and maps spewing 5xx [13:47:09] jynus: no, this is done and unrelated, only to fix ns1 [13:47:13] ok [13:47:14] thanks [13:47:18] vgutierrez: wondering if "CRITICAL: Hosts in IPVS but unknown to PyBal:" would be fixed by pybal restart? [13:47:49] godog: ack, looking [13:48:15] hnowlan_: might not be related to rack A2 out of service, but wanted to make sure [13:48:17] not sure mutante, at least the listed cp hosts there aren't reachable by lvs2009/lvs2010 anymore due to row A going down for them [13:48:22] jynus: it comes down to 1/ LVS should have detected the realservers as unreachable and removed it to the pool, and 2/ how non LVS services handles a loss of one of theirs [13:48:43] and with lvs2007 down, restarting pybal on lvs2010 is going to be problematic [13:48:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:48] alert1001 is getting ICMP unreachable back for text-lb.codfw.wikimedia.org via IPv6 [13:49:06] 13:47:46.215058 84:18:88:0d:df:c8 > 2c:ea:7f:46:ac:b4, ethertype IPv6 (0x86dd), length 142: 2620:0:860:ed1a::1 > 2620:0:861:3:208:80:154:88: ICMP6, destination unreachable, unreachable address 2620:0:860:ed1a::1, length 88 [13:49:18] the codfw maps master is under a lot of load which probably explains the 500s [13:49:21] sry if that is obvious, working my way down the devices. 
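The ICMP6 trace quoted above can be reproduced on the alert host with a capture filtered on the codfw text-lb address; printing link-level headers shows which device is emitting the unreachables (a sketch; the interface name is an assumption):

  sudo tcpdump -e -n -i eno1 'icmp6 and host 2620:0:860:ed1a::1'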
[13:49:41] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [13:49:45] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:49:47] ah, maybe not enough realservers in the pool? [13:49:57] or just the opposite [13:50:07] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to trans [13:50:08] med out before a response was received https://wikitech.wikimedia.org/wiki/CX [13:50:16] topranks: yeah saw the same, can't http but https works over ipv6 from alert1001 to text-lb.codfw [13:50:18] lvs is refusing to depool one cp server on each cluster (upload & text) because too many are down [13:50:30] s/lvs/pybal/ [13:50:35] godog: ^ that's probably why [13:50:38] connect to address 2620:0:860:ed1a::1 and port 80: No route to host < but from an appserver, lke mw2310 I can ping6 2620:0:860:ed1a::1 and do get a response [13:50:49] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:33] XioNoX: could be! I can't explain the http vs https difference, yet at least [13:51:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:51] what are the main ongoing user-impacting issues right now? Maps? [13:52:13] maybe http for text-lb too? [13:52:14] jynus: in terms of 5xx yes I'd say so [13:52:22] jynus: I guess mobileapps [13:52:28] thanks for the update [13:52:33] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:53:14] one maps server (maps2006) is getting a lot more load than the others [13:53:27] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:53:34] actually no, I'm wrong, they're all hot [13:53:49] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [13:53:49] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:54:07] lots of recoveries now, did something change? [13:54:10] just told Icinga to recheck all of them one more time without waiting [13:54:16] thanks, mutante [13:54:38] does that mean mobile apps is ok, or still errors? 
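The behaviour described here is PyBal's depool threshold: it refuses to depool a backend if doing so would leave less than the configured fraction of the pool in service, which is why the unreachable cp hosts stay in IPVS. A temporary ".8" threshold for the upload and text services is reverted later in this log; a minimal sketch of the kind of setting involved (the key name and placement are guesses based on the change title, not copied from the patch):

  # per-service LVS configuration, hypothetical snippet
  depool_threshold: ".8"   # keep at least 80% of backends pooled; further depools are refused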
[13:54:39] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:39] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:43] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:55:29] who knows if we can re-enable maps in eqiad? [13:55:43] ^ hnowlan_ ? [13:55:51] jynus: ok in codfw, not recovered yet in eqiad.. hmm [13:55:57] thanks, mutante [13:55:59] XioNoX: yep, we can [13:56:11] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [13:56:35] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:57:28] it's not depooled in DNS? [13:57:34] (03PS1) 10DCausse: elasticsearch@eqiad: change eligible masters for omega [puppet] - 10https://gerrit.wikimedia.org/r/704973 [13:57:50] who knows how to re-enable maps in eqiad then ? :) [13:57:56] currently codfw is serving all traffic - I don't actually know how to shift traffic to eqiad though [13:58:12] (03PS2) 10DCausse: elasticsearch@codfw: change eligible masters for omega [puppet] - 10https://gerrit.wikimedia.org/r/704973 [13:58:29] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned t [13:58:29] ected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:58:35] ACKNOWLEDGEMENT - Host authdns2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:36] ACKNOWLEDGEMENT - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:36] ACKNOWLEDGEMENT - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:36] ACKNOWLEDGEMENT - Host elastic2055 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:36] ACKNOWLEDGEMENT - Host kafka-logging2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:36] ACKNOWLEDGEMENT - Host lvs2007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:36] ACKNOWLEDGEMENT - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:37] ACKNOWLEDGEMENT - Host ms-be2029 is DOWN: PING CRITICAL - 
Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:37] ACKNOWLEDGEMENT - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:38] ACKNOWLEDGEMENT - Host ms-be2044 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:38] ACKNOWLEDGEMENT - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:39] ACKNOWLEDGEMENT - Host ms-fe2005 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:39] ACKNOWLEDGEMENT - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn switch broken in rack A3 [13:58:44] hnowlan_: any idea who might know? who did the switchover for maps? [13:59:41] XioNoX: I don't know :( it's a fairly standard pybal/discovery service, is there a standard procedure for flipping traffic between DCs? [13:59:44] https://config-master.wikimedia.org/discovery/discovery-basic.yaml says kartotherian is depooled on eqiad [14:00:01] RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:00:05] PROBLEM - kartotherian endpoints health on maps2003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:00:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:11] (03PS1) 10Vgutierrez: Revert "lvs: Set depool_threshold to .8 for upload & text" [puppet] - 10https://gerrit.wikimedia.org/r/704816 [14:01:19] ACKNOWLEDGEMENT - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 daniel_zahn switch broken in rack A3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:01:19] ACKNOWLEDGEMENT - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 daniel_zahn switch broken in rack A3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:01:19] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal, AS64605/IPv4: Active - Anycast daniel_zahn switch broken in rack A3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:19] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 2, dormant: 0, excluded: 0, unused: 0: daniel_zahn switch broken in rack A3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:01:29] what about cxserver, that also seems to be the worst availability of all on graphs [14:01:46] that and ores [14:01:47] (03PS2) 10Vgutierrez: Revert "lvs: Set depool_threshold to .8 for upload & text" [puppet] - 10https://gerrit.wikimedia.org/r/704816 [14:03:38] someone from language teams to understand the impact on users of cxserver? [14:03:47] ACKNOWLEDGEMENT - configured eth on lvs2010 is CRITICAL: ens2f1np1 reporting no carrier. daniel_zahn switch broken in rack A2 https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:03:47] ACKNOWLEDGEMENT - configured eth on lvs2009 is CRITICAL: ens2f1np1 reporting no carrier. 
daniel_zahn switch broken in rack A2 https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:03:47] ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw-a-codfw is CRITICAL: CRIT: Down: 7 Unknown: 0 daniel_zahn switch broken in rack A2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:04:17] kart_, maybe? [14:05:17] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10Lea_WMDE) I approve. [14:05:33] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:06:16] (03CR) 10DCausse: "elasticsearch_6@production-search-omega-codfw.service on elastic2051.codfw.wmnet & elastic2038.codfw.wmnet must then be restarted" [puppet] - 10https://gerrit.wikimedia.org/r/704973 (owner: 10DCausse) [14:06:17] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:07:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:29] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:07:29] jynus: Let me look at what's going on. Reading backlog. [14:07:43] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [14:07:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:09] kart_, mosty interested on impact of cxserver on users [14:08:09] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is OK: OK - Certificate kartotherian.discovery.wmnet will expire on Wed 13 Dec 2023 11:06:02 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:08:17] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [14:09:39] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:10:09] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [14:10:13] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd8d19c5f60: Failed to establish a new connection: [Errno 113] No route to host): /?spec https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [14:10:16] we have some nice recoveries [14:10:23] jynus: Sure. Doing some tests, not reported by anything unusual so far, but checking again. [14:10:55] kart_, please check logs- it is possible it could be affecting only a subset of the users (codfw dc) [14:10:58] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10elal) Do you currently have **shell access** (Yes/No) No [14:11:19] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=eqiad [14:11:21] kart_, the context is we had some network issues on codfw [14:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:35] RECOVERY - kartotherian endpoints health on maps2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:11:44] kart_: can you tell me what errors are you seeing ? [14:12:20] kart_: we kave lost a row and cxserver is erroring, but I didn't get a chance to look at the errors [14:14:11] effie: seems like, https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-syslog-2021.07.16?id=i5epr3oBCxLmWkI6hyf6 - not useful for me as of now. [14:14:54] !log hnowlan@puppetmaster1001 conftool action : set/weight=5; selector: name=maps1004.eqiad.wmnet [14:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:59] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:16:45] kart_, the monitoring error says: "/v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received" [14:16:58] (03CR) 10Vgutierrez: [C: 03+2] Revert "lvs: Set depool_threshold to .8 for upload & text" [puppet] - 10https://gerrit.wikimedia.org/r/704816 (owner: 10Vgutierrez) [14:18:01] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wi [14:18:01] ikimedia.org/wiki/CX [14:18:09] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10Aklapper) (Hi, for future reference, please follow https://phabricator.wikimedia.org/project/profile/1564/ when creating such requests - thanks a lot! :) [14:18:31] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10Aklapper) a:05Lea_WMDE→03None [14:18:51] kart_, it is possible that it is not cxserver but restbase what is failing/timeing out [14:19:10] could you check if that could be the issue? [14:19:17] Sure. Checking. [14:19:49] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:20:18] oh, there is now a recovery [14:21:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:41] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:42] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:36] ah. restbase is OK? [14:23:31] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:31] ACKNOWLEDGEMENT - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (e [14:23:31] : 200) daniel_zahn eqiad not pooled https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:24:55] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations, 10netops: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) [14:25:45] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:26:20] ^ kart_ while some things got better we are still getting this error [14:26:25] 10ops-codfw, 10DC-Ops, 10netops: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) [14:26:59] it could be restbase not cxserver, any feedback is welcome [14:27:50] 10ops-codfw, 10DC-Ops, 10netops: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) Please note I'm not putting this request into CyrusOne until after Arzhel confirms they are ready for this step. [14:29:33] Yeah, I'm looking at those tests [14:29:43] (03PS1) 10Jgiannelos: tegola-vector-tiles: Fix scheme for swift endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/704976 [14:29:59] but it is weird, because restbase host shouldn't be affected in theory [14:30:11] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to transl [14:30:11] ed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:30:35] (03CR) 10Dzahn: [C: 03+2] contint: remove erroneous hiera setting for labs [puppet] - 10https://gerrit.wikimedia.org/r/673286 (https://phabricator.wikimedia.org/T277526) (owner: 10Hashar) [14:30:39] log: Running homer against asw-a-codfw virtual-chassis to change the config for all ports on dead switch asw-a2-codfw to disabled. [14:30:47] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet - https://phabricator.wikimedia.org/T286748 (10RobH) This host is out of warranty, but we've regularly seen HP raid controller batteries fail at 3+ years and require replacement. We tend to buy a... [14:31:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:01] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:32:22] (03CR) 10Dzahn: [C: 03+2] beta: add warning motd and link to term of uses [puppet] - 10https://gerrit.wikimedia.org/r/699207 (https://phabricator.wikimedia.org/T100837) (owner: 10Hashar) [14:33:01] topranks: ! Not :? [14:33:07] 10ops-codfw, 10DC-Ops, 10netops: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10cmooney) First log of an issue was this sent from the master switch in the virtual-chassis: Jul 16, 2021 @ 13:15:55.000 %-SNMP_TRAP_LINK_DOWN: ifIndex 927, ifAdminStatus up(1), ifOperStatus down(2),... 
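Disabling the dead member's ports with Homer is normally a diff-then-commit run against the virtual chassis (a sketch; the device match expression and commit message are assumptions):

  homer 'asw-a-codfw*' diff
  homer 'asw-a-codfw*' commit 'Disable ports facing dead asw-a2-codfw (T286787)'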
[14:33:13] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton [14:33:19] You did Log: not !log [14:34:45] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: Fix scheme for swift endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/704976 (owner: 10Jgiannelos) [14:34:56] jynus, https://grafana.wikimedia.org/d/000000068/restbase?viewPanel=18&orgId=1&from=now-24h&to=now in case that is helpful. [14:35:03] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:35:09] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [14:35:45] subbu, thanks [14:36:26] shows page_summary_title and page_relaed_title endpoints having greater 503s. [14:36:30] 10SRE, 10Beta-Cluster-Infrastructure, 10Cloud-VPS, 10Patch-For-Review: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837 (10hashar) 05Open→03Resolved Done! ` $ ssh deployment-deploy01.deployment-prep.eqiad1.wikimedia.cloud Deb... [14:36:40] oh, restabase router external [14:36:51] that could explain why only cxserver affected [14:36:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:57] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:37:19] not sure what 'restbase router external' means but glad you know. :) [14:37:20] (03Merged) 10jenkins-bot: tegola-vector-tiles: Fix scheme for swift endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/704976 (owner: 10Jgiannelos) [14:37:35] !log vgutierrez@lvs2010:~$ sudo -i ifdown ens2f1np1 [14:37:37] my assumption is that it is failing to query mw [14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:22] or external pages for citations [14:38:36] RECOVERY - LVS text codfw port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 623 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:38:39] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:39:07] RhinosF1: too much haste on my side - thanks for the heads up! [14:39:10] vgutierrez, that seeming very good? [14:39:18] !log: Ran homer against asw-a-codfw virtual-chassis to change the config for all ports on dead switch asw-a2-codfw to disabled. [14:39:27] topranks: np [14:39:58] topranks: and no `:` [14:40:17] ah ffs. 
thanks :) [14:40:22] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [14:40:24] !log Ran homer against asw-a-codfw virtual-chassis to change the config for all ports on dead switch asw-a2-codfw to disabled. [14:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:33] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:40:50] !log Running homer to disable et-0/0/0 on cr1-codfw, which connects to currently dead device asw-a2-codfw T286787 [14:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:56] T286787: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 [14:41:03] !log vgutierrez@lvs2009:~$ sudo -i ifdown ens2f1np1 [14:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:15] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:45] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:43:13] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:43:15] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:11] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:13] (03PS1) 10Vgutierrez: lvs: Disable row A NIC on lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/704977 [14:45:37] PROBLEM - LVS shellbox codfw port 4008/tcp - Shellbox- shellbox.svc.codfw.wmnet IPv4 on shellbox.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.51 and port 4008: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:46:26] PROBLEM - LVS search codfw port 9200/tcp - Elasticsearch search for MediaWiki IPv4 #page on search.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.30 and port 9200: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:46:43] PROBLEM - LVS push-notifications codfw port 4104/tcp - Push-notifications service push-notifications.svc.codfw.wmnet IPv4 on push-notifications.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.56 and port 4104: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:46:48] PROBLEM - LVS apaches codfw port 80/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet IPv4 #page on appservers.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.1 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:46:56] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: 
connect to address 10.2.1.22 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:47:05] PROBLEM - LVS apertium codfw port 4737/tcp - Machine Translation service. apertium.svc.codfw.wmnet IPv4 on apertium.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.11 and port 4737: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:47:10] PROBLEM - LVS videoscaler codfw port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.codfw.wmnet IPv4 #page on videoscaler.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.5 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:47:23] PROBLEM - LVS linkrecommendation-external codfw port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.codfw.wmnet IPv4 on linkrecommendation.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.23 and port 4006: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:47:55] vgutierrez: need help? [14:48:03] PROBLEM - LVS blubberoid codfw port 4666/tcp - Blubberoid- blubberoid.svc.codfw.wmnet -https- IPv4 on blubberoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.31 and port 4666: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:48:38] PROBLEM - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::9 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:48:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting [14:48:43] /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:48:45] RECOVERY - LVS push-notifications codfw port 4104/tcp - Push-notifications service push-notifications.svc.codfw.wmnet IPv4 on push-notifications.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:49:00] PROBLEM - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.44 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:49:04] PROBLEM - LVS prometheus codfw port 80/tcp - Prometheus monitoring IPv4 #page on prometheus.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.25 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:49:10] RECOVERY - LVS videoscaler codfw port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.codfw.wmnet IPv4 #page on videoscaler.svc.codfw.wmnet is OK: OK - Certificate jobrunner.discovery.wmnet will expire on Mon 19 May 2025 02:00:11 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:49:11] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:49:43] PROBLEM - LVS similar-users codfw port 4110/tcp - Similar-users/sockpuppet- similar-users.svc.codfw.wmnet IPv4 on similar-users.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.57 and port 4110: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:49:45] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to transl [14:49:45] ed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:50:13] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:50:15] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:50:36] RECOVERY - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:51:01] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:51:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:05] 10SRE, 10ops-codfw, 10DC-Ops, 10netops: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) I've opened support ticket 2022508 with CyrusOne to have them use remote hands to powercycle this. > The switch is a Juniper QFX5100-48S-6Q, labeled asw-a2-codfw, located in U26 (re... [14:52:05] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [14:52:56] RECOVERY - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:53:03] RECOVERY - LVS apertium codfw port 4737/tcp - Machine Translation service.
apertium.svc.codfw.wmnet IPv4 on apertium.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 5945 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:53:05] RECOVERY - LVS shellbox codfw port 4008/tcp - Shellbox- shellbox.svc.codfw.wmnet IPv4 on shellbox.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 368 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:53:05] RECOVERY - LVS similar-users codfw port 4110/tcp - Similar-users/sockpuppet- similar-users.svc.codfw.wmnet IPv4 on similar-users.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:53:17] RECOVERY - LVS linkrecommendation-external codfw port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.codfw.wmnet IPv4 on linkrecommendation.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:53:57] RECOVERY - LVS blubberoid codfw port 4666/tcp - Blubberoid- blubberoid.svc.codfw.wmnet -https- IPv4 on blubberoid.svc.codfw.wmnet is OK: OK - Certificate blubberoid.discovery.wmnet will expire on Sat 02 Aug 2025 03:55:34 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:54:07] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:54:29] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fe3f0a42ef0: Failed to establish a new connection: [Errno 113] No route to host): /?spec https://wikitech.wikimedia.org/wiki/CX [14:54:36] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:54:41] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f94ec926d30: Failed to establish a new connection: [Errno 113] No route to host): /en.wikipedia.org/v1/page/mobile-html/User%3ABSitzmann_%28WMF%29%2FMCS%2FT [14:54:41] ankenstein https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:54:48] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24405 bytes in 0.507 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:54:57] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:01] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:55:05] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from 
storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:25] PROBLEM - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.44 and port 443: No route to host https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [14:55:59] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:56:21] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:27] RECOVERY - configured eth on lvs2010 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:56:27] RECOVERY - configured eth on lvs2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:56:37] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f78b9dc6c50: Failed to establish a new connection: [Errno 113] No route to host): /en.wiktionary.org/v1/page/definition/cat https://wikitech.wikimedia.org/wiki/M [14:56:37] s_%28service%29 [14:56:53] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:59] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [14:57:03] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:23] RECOVERY - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [14:57:36] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Majavah) [14:58:14] RECOVERY - LVS search codfw port 9200/tcp - Elasticsearch search for MediaWiki IPv4 #page on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 614 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:58:50] RECOVERY - LVS prometheus codfw port 80/tcp - Prometheus monitoring IPv4 #page on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10959 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:59:24] PROBLEM - LVS ores codfw port 443/tcp - Objective Revision Evaluation Service. 
ores.svc.codfw.wmnet IPv4 #page on ores.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.10 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:59:56] PROBLEM - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::9 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:00:07] sigh.. the checks are flapping [15:00:17] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:00:31] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:01:01] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:01:52] RECOVERY - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: OK - Certificate wikipedia.com will expire on Thu 19 Aug 2021 08:01:10 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:02:20] PROBLEM - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::9 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:02:59] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:03:05] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fd2f61eaac8: Failed to establish a new connection: [Errno 113] No route to host): /en.wikipedia.org/v1/pdf/Bar/a4/mobile https://wikitech.wikimedia.org/wiki/Proton [15:03:20] RECOVERY - LVS ores codfw port 443/tcp - Objective Revision Evaluation Service. ores.svc.codfw.wmnet IPv4 #page on ores.svc.codfw.wmnet is OK: OK - Certificate ores.discovery.wmnet will expire on Sun 03 Aug 2025 06:31:45 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:04:20] RECOVERY - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:04:49] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:04:59] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:04:59] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:05:17] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:11] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:57] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f3c81677ba8: Failed to establish a new connection: [Errno 113] No route to host): /en.wikipedia.org/v1/pdf/// https://wikitech.wikimedia.org/wiki/Proton [15:07:51] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [15:07:52] RECOVERY - LVS apaches codfw port 80/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet IPv4 #page on appservers.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:08:34] PROBLEM - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.44 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:08:51] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [15:08:55] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:09:13] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:09:49] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:10:36] PROBLEM - LVS prometheus codfw port 80/tcp - Prometheus monitoring IPv4 #page on prometheus.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.25 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:10:51] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:10:51] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:10:53] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:12:23] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:30] RECOVERY - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:12:51] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f8d9c6248d0: Failed to establish a new connection: [Errno 113] No route to host): /_info/version https://wikitech.wikimedia.org/wiki/Proton [15:14:03] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:14:10] !log vgutierrez@lvs2010:~$ sudo -i ifup ens2f1np1 [15:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:21] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:14:41] !log set alert2001 as active in netbox (was staged) - T247966 [15:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:48] T247966: Migrate role::alerting_host to Buster - https://phabricator.wikimedia.org/T247966 [15:14:49] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f25313ff940: Failed to establish a new connection: [Errno 113] No route to host): /en.wikipedia.org/v1/pdf/// https://wikitech.wikimedia.org/wiki/Proton [15:15:03] PROBLEM - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.44 and port 443: No route to host https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [15:15:07] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikim [15:15:07] /wiki/Services/Monitoring/restbase [15:15:36] PROBLEM - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::9 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:16:19] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:16:30] RECOVERY - LVS prometheus codfw port 80/tcp - Prometheus monitoring IPv4 #page on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10959 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:17:01] RECOVERY - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [15:17:55] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f77b8f99080: Failed to establish a new connection: [Errno 113] No route to host): /v2/translate/en/qqq/TestClient https://wikitech.wikimedia.org/wiki/CX [15:18:15] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:18:19] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:18:24] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.22 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:18:25] PROBLEM - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.44 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:18:43] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 [15:18:43] ng: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [15:18:47] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:18:47] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:18:54] can we do anything to the flapping checks? [15:19:34] RECOVERY - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: OK - Certificate wikipedia.com will expire on Thu 19 Aug 2021 08:01:10 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:19:50] it's being looked at, majavah [15:20:17] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:20:22] RECOVERY - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:20:45] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:21:03] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:21:22] PROBLEM - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.232 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:21:41] PROBLEM - LVS proton codfw port 4030/tcp - Proton PDF rendering service. proton.svc.codfw.wmnet IPv4 on proton.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.21 and port 4030: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:21:44] PROBLEM - LVS jobrunner codfw port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.codfw.wmnet IPv4 #page on jobrunner.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.26 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:21:47] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:21:51] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fec48738f98: Failed to establish a new connection: [Errno 113] No route to host): /?spec https://wikitech.wikimedia.org/wiki/CX [15:22:17] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:22:43] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:23:41] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:23:41] RECOVERY - LVS proton codfw port 4030/tcp - Proton PDF rendering service. 
proton.svc.codfw.wmnet IPv4 on proton.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1009 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:23:45] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f8a1437c9e8: Failed to establish a new connection: [Errno 113] No route to host): /v2/page/en/es/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikit [15:23:45] media.org/wiki/CX [15:24:09] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve e [15:24:09] metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) i [15:24:09] AL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:24:21] !log downtime flappy pages in codfw for 40 minutes [15:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:37] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:25:32] there's your answer majavah ^^ for now, as people work on mitigating the underlying issue :-) [15:25:38] RECOVERY - LVS jobrunner codfw port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.codfw.wmnet IPv4 #page on jobrunner.svc.codfw.wmnet is OK: OK - Certificate jobrunner.discovery.wmnet will expire on Mon 19 May 2025 02:00:11 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:25:41] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:25:41] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f1a9d422668: Failed to establish a new connection: [Errno 113] No route to host): /en.wikipedia.org/v1/page/mobile-sections/User%3ABSitzmann_%28WMF%29%2FMCS [15:25:42] 2FFrankenstein https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:26:27] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:26:37] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:26:37] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:26:48] apergos: we all understand. You can blame me for cursing it this morning with a "let's hope everything is as simple as an expired downtime". [15:27:03] tsk tsk... [15:27:16] RECOVERY - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: OK - Certificate wikipedia.com will expire on Thu 19 Aug 2021 08:01:10 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:27:37] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:27:41] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f378778dcc0: Failed to establish a new connection: [Errno 113] No route to host): /en.wikipedia.org/v1/data/css/mobile/pagelib https://wikitech.wikimedia.org/wik [15:27:41] apps_%28service%29 [15:28:12] PROBLEM - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.44 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:28:27] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:37] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:37] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:54] apergos: yeah I should have known when I said that. It went the exact opposite. [15:29:45] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by NewConnectionError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f785817c208: Failed to establish a new connection: [Errno 113] No route to host): /v1/list/mt/en/es https://wikitech.wikimedia.org/wiki/CX [15:29:46] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) Remote hands has completed the powercycle of the switch (via removing all power cables). Both before and after power removal, all LEDs are illuminated, which is no... [15:29:49] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:30:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:33] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:31:19] PROBLEM - configured eth on lvs2010 is CRITICAL: ens2f1np1 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:31:47] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:54] kart_: still around? 
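The 'configured eth' alerts above and below concern ens2f1np1 on the lvs200x hosts, apparently the NIC facing the dead row A switch asw-a2-codfw. A minimal sketch, assuming shell access on the affected lvs host, of how the no-carrier state reported by the check could be confirmed:
    ip -br link show ens2f1np1                 # brief state line; NO-CARRIER appears here when the link is down
    ethtool ens2f1np1 | grep 'Link detected'   # driver-level link status (yes/no)
    cat /sys/class/net/ens2f1np1/carrier       # 1 = carrier, 0 = none; only readable while the interface is admin-up
No carrier is the expected result here for as long as the switch side stays unresponsive, regardless of whether the interface is brought up or down on the host.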
[15:32:06] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24405 bytes in 0.544 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:33:57] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:34:03] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:34:03] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:34:21] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [15:34:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_80: Servers cp2027.codfw.wmnet are marked down but pooled: uploadlb_80: Servers cp2028.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2396.codfw.wmnet, mw2302.codfw.wmnet, mw2308.codfw.wmnet, mw2294.codfw.wmnet, mw2296.codfw.wmnet, mw2297.codfw.wmnet, mw2292.codfw.wmnet, mw2299.codfw.wmnet are marked down but pooled: testlb_8 [15:34:21] rs cp2029.codfw.wmnet are marked down but pooled: api_80: Servers mw2298.codfw.wmnet, mw2294.codfw.wmnet, mw2253.codfw.wmnet, mw2304.codfw.wmnet, mw2399.codfw.wmnet, mw2403.codfw.wmnet, mw2292.codfw.wmnet, mw2299.codfw.wmnet are marked down but pooled: testlb_443: Servers cp2029.codfw.wmnet are marked down but pooled: kartotherian_6533: Servers maps2001.codfw.wmnet are marked down but pooled: uploadlb_443: Servers cp2028.codfw.wmnet are m [15:34:21] wn but pooled: textlb_443: Servers cp2029.codfw.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:34:31] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:47] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2297.codfw.wmnet, mw2253.codfw.wmnet, mw2396.codfw.wmnet, mw2304.codfw.wmnet, mw2302.codfw.wmnet, maps2001.codfw.wmnet, mw2403.codfw.wmnet, mw2399.codfw.wmnet, cp2028.codfw.wmnet, mw2298.codfw.wmnet, maps2005.codfw.wmnet, mw2292.codfw.wmnet, mw2308.codfw.wmnet, mw2299.codfw.wmnet, mw2294.codfw.wmnet, cp2030.codfw.wmnet, mw2296.codf [15:35:47] ) https://wikitech.wikimedia.org/wiki/PyBal [15:35:59] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:07] ^^ that's expected on lvs2010 [15:36:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:15] PROBLEM - restbase 
endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:23] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:37:09] effie: yes. Is issue coming from cxserver or somewhere else? I see lots of other services being affected. I'm not getting anything. [15:37:29] kart_: does cxserver interact with restbase? [15:38:13] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:38:23] !log jiji@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:04] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) a:03Papaul I'll attempt to summarize the IRC discussion. @ayounsi, @cmooney, and myself discussed how it is likely safer to let a single switch sit broken over t... [15:39:49] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:39:52] RECOVERY - LVS docker-registry codfw port 443/tcp - docker registry service IPv4 #page on docker-registry.svc.codfw.wmnet is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:40:11] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [15:40:19] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:51] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10RobH) [15:41:53] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:42:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw2302.codfw.wmnet, mw2308.codfw.wmnet, mw2294.codfw.wmnet, mw2253.codfw.wmnet, mw2297.codfw.wmnet, mw2293.codfw.wmnet, mw2403.codfw.wmnet, mw2299.codfw.wmnet are marked down but pooled: kartotherian_6533: Servers maps2001.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2308.codfw.wmnet, mw2253.codfw.wmnet, mw2401.cod [15:42:09] , mw2405.codfw.wmnet, mw2297.codfw.wmnet, mw2295.codfw.wmnet, mw2293.codfw.wmnet, mw2403.codfw.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:42:19] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:42:28] effie: yes. It fetches page(s) using restbase. [15:43:19] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:43:59] kart_: then that is why it was erroring [15:44:07] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:03] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:47:39] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2297.codfw.wmnet, maps2001.codfw.wmnet, mw2295.codfw.wmnet, mw2302.codfw.wmnet, mw2293.codfw.wmnet, mw2253.codfw.wmnet, mw2401.codfw.wmnet, mw2308.codfw.wmnet, mw2299.codfw.wmnet, mw2294.codfw.wmnet, mw2403.codfw.wmnet, mw2405.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [15:48:30] !log restart pybal on lvs2010 [15:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:51:31] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:51:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:15] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:54:10] (03Abandoned) 10Vgutierrez: lvs: Disable row A NIC on lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/704977 (owner: 10Vgutierrez) [15:55:02] (03PS1) 10Filippo Giunchedi: smokeping: don't poll authdns2001 [puppet] - 10https://gerrit.wikimedia.org/r/704992 (https://phabricator.wikimedia.org/T286787) [15:56:36] (03CR) 10Filippo Giunchedi: [C: 03+2] smokeping: don't poll authdns2001 [puppet] - 10https://gerrit.wikimedia.org/r/704992 (https://phabricator.wikimedia.org/T286787) (owner: 10Filippo Giunchedi) [15:57:15] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:58:53] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:02:13] PROBLEM - configured eth on lvs2009 is CRITICAL: ens2f1np1 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [16:03:37] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [16:05:01] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:05:33] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:08:29] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:10:49] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:11:05] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [16:15:37] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [16:17:03] (03CR) 1020after4: [C: 03+2] selenium: Upgrade WebdriverIO to v7 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [16:17:38] (03CR) 1020after4: [V: 03+2 C: 03+2] selenium: Upgrade WebdriverIO to v7 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [16:20:33] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:21:01] (03PS1) 10Urbanecm: logos/manage.py: Set user-agent on all requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) [16:22:13] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [16:22:44] (03PS1) 10Urbanecm: otrs_wikiwiki: Update logo to use VRT instead of OTRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704996 (https://phabricator.wikimedia.org/T280400) [16:24:03] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:28:49] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [16:29:11] !log restarting pybal on lvs2009 to decrease api depool threshold [16:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:59] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [16:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:45] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:31:11] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [16:33:45] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:34:59] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:36:12] (03CR) 10Bstorm: "I think this is a very good idea with cron jobs overall. However, I think it needs to be enabled on kube-apiserver and kube-controller-man" [puppet] - 10https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108) (owner: 10Arturo Borrero Gonzalez) [16:37:26] (03PS3) 10Bstorm: Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:37:54] (03CR) 10jerkins-bot: [V: 04-1] Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:42:12] (03CR) 10Bstorm: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:42:30] (03CR) 10Bstorm: "Ferran, I think this is all unblocked, and Jbond's suggestions should get you past the -1 caused by Jenkins. 
Are you able to proceed with " [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:49:07] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:49:07] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:59:55] (03CR) 10Bstorm: "My doc reference for that https://v1-18.docs.kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/" [puppet] - 10https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108) (owner: 10Arturo Borrero Gonzalez) [17:06:01] (03CR) 10Bstorm: "Existing users of jobs before the api were using activeDeadlineSeconds and things like that:https://v1-18.docs.kubernetes.io/docs/concepts" [puppet] - 10https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108) (owner: 10Arturo Borrero Gonzalez) [17:10:45] (03PS1) 10Holger Knust: Swap dumper (snapshot1009) and testbed (1013) in preparation for T286069 [puppet] - 10https://gerrit.wikimedia.org/r/705000 (https://phabricator.wikimedia.org/T286069) [17:18:04] (03CR) 10ArielGlenn: [C: 03+2] Swap dumper (snapshot1009) and testbed (1013) in preparation for T286069 [puppet] - 10https://gerrit.wikimedia.org/r/705000 (https://phabricator.wikimedia.org/T286069) (owner: 10Holger Knust) [17:32:29] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] openstack galera: set monitor on failover (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [17:35:43] (03CR) 10Bstorm: openstack galera: set monitor on failover (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [17:49:22] (03PS1) 10ArielGlenn: add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - 10https://gerrit.wikimedia.org/r/705006 [17:49:53] (03CR) 10jerkins-bot: [V: 04-1] add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - 10https://gerrit.wikimedia.org/r/705006 (owner: 10ArielGlenn) [17:50:35] (03PS2) 10ArielGlenn: add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - 10https://gerrit.wikimedia.org/r/705006 (https://phabricator.wikimedia.org/T277629) [17:51:02] (03CR) 10jerkins-bot: [V: 04-1] add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - 10https://gerrit.wikimedia.org/r/705006 (https://phabricator.wikimedia.org/T277629) (owner: 10ArielGlenn) [17:53:13] (03PS3) 10ArielGlenn: add dumps-roots to the dumpsdata roles so people in that group get access [puppet] - 10https://gerrit.wikimedia.org/r/705006 (https://phabricator.wikimedia.org/T277629) [18:00:42] (03PS2) 10Bstorm: openstack galera: set monitor on failover [puppet] - 10https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675) [18:01:17] (03CR) 10Bstorm: openstack galera: set monitor on failover (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [18:09:14] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr It looks like Chris is going to be out for a while. 
Moving this task over to @Jclark-ctr, who should be back Tuesday or Wednesday. Thanks, Willy [18:09:40] (03CR) 10Bstorm: metricsinfra: Add HAProxy for distributing http traffic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [18:10:27] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr It looks like Chris is going to be out for a while. @Jclark-ctr - can you prioritize this one, when you're back next week? Rob has o... [18:16:25] (03CR) 10Bstorm: metricsinfra: Add HAProxy for distributing http traffic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [18:17:56] (03CR) 10Bstorm: metricsinfra: Add HAProxy for distributing http traffic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [18:19:09] !log [Elastic] Given that we will likely have switch A3 out of commission over the weekend, Search team is going to change masters so that we no longer have a master in row A3. New desired config: `B1 (elastic2042), C2 (elastic2047), D2 (elastic2051)`, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/704973 [18:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:15] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [18:21:33] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch@codfw: change eligible masters for omega [puppet] - 10https://gerrit.wikimedia.org/r/704973 (owner: 10DCausse) [18:24:11] (03PS1) 10RobH: copernicium imaging details [puppet] - 10https://gerrit.wikimedia.org/r/705008 (https://phabricator.wikimedia.org/T282272) [18:24:45] !log [Elastic] `puppet-merge`d https://gerrit.wikimedia.org/r/c/operations/puppet/+/704973; ran puppet across `elastic2*` hosts: `sudo cumin 'P{elastic2*}' 'sudo run-puppet-agent'` (puppet run succeeded on all but the 3 nodes taken offline by the switch failure: `elastic[2037-2038,2055].codfw.wmnet`) [18:24:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr Hi @Jclark-ctr - it looks like Chris is going to be out for a while. Dell has one last suggestion in figuring out a solution for...
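(Aside, not part of the channel log: the Gerrit discussion at 16:36–17:06 concerns Kubernetes' TTL-after-finished controller. A minimal sketch, assuming a stock batch/v1 Job on a cluster where the `TTLAfterFinished` feature gate is enabled on kube-apiserver and kube-controller-manager as the 16:36:12 comment requires, is to set `spec.ttlSecondsAfterFinished: 3600` on the Job so the controller garbage-collects it an hour after it finishes; `activeDeadlineSeconds`, mentioned at 17:06:01, instead bounds how long the Job may run and does not clean up the finished object.)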
[18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:17] (03PS2) 10RobH: copernicium imaging details [puppet] - 10https://gerrit.wikimedia.org/r/705008 (https://phabricator.wikimedia.org/T282272) [18:25:50] (03CR) 10Holger Knust: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/705006 (https://phabricator.wikimedia.org/T277629) (owner: 10ArielGlenn) [18:26:16] (03CR) 10RobH: [C: 03+2] copernicium imaging details [puppet] - 10https://gerrit.wikimedia.org/r/705008 (https://phabricator.wikimedia.org/T282272) (owner: 10RobH) [18:27:55] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:28:14] !log [Elastic] Restarted `elasticsearch_6@production-search-omega-codfw.service` on `elastic2051`; will restart on `elastic2038` by powercycling the node from mgmt port given that it is ssh unreachable [18:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:15] !log [Elastic] Kicked off powercycle on `elastic2038`, this will effectively restart its `elasticsearch_6@production-search-omega-codfw.service`. We're back to 3 eligible masters for `codfw-omega` [18:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:20] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10RobH) [18:36:23] (03PS2) 10Brennen Bearnes: explicitly set ansible_python_interpreter to python3 [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/704143 [18:36:36] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] explicitly set ansible_python_interpreter to python3 [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/704143 (owner: 10Brennen Bearnes) [18:42:00] (03PS1) 10RobH: fixing copernicium entries [puppet] - 10https://gerrit.wikimedia.org/r/705009 (https://phabricator.wikimedia.org/T282272) [18:45:11] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:47:10] (03CR) 10RobH: [C: 03+2] fixing copernicium entries [puppet] - 10https://gerrit.wikimedia.org/r/705009 (https://phabricator.wikimedia.org/T282272) (owner: 10RobH) [18:49:01] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:53:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` copernicium.wikimedia.org ` The log can be found in `/var/log... 
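(Aside, not part of the channel log: a minimal way to spot-check the master change logged at 18:19–18:29, assuming the codfw-omega endpoint named in the 18:27:55 alert and the standard Elasticsearch cat API, is `curl -s 'https://search.svc.codfw.wmnet:9443/_cat/nodes?v&h=name,node.role,master'`; `node.role` contains `m` on master-eligible nodes and the `master` column marks the elected master with `*`. The scheme and any authentication in front of that endpoint are assumptions.)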
[19:02:04] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:03:13] (03PS2) 10Majavah: metricsinfra: Add HAProxy for distributing http traffic [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) [19:06:44] (03PS3) 10Majavah: metricsinfra: Add HAProxy for distributing http traffic [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) [19:08:20] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on copernicium.wikimedia.org with reason: REIMAGE [19:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:30] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on copernicium.wikimedia.org with reason: REIMAGE [19:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:09] (03CR) 10Bstorm: "That reduced the code more than I expected! My comment on using puppet yaml here vs in horizon is not blocking for this patch. I think it " [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [19:13:18] (03PS4) 10Majavah: metricsinfra: Add HAProxy for distributing http traffic [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) [19:14:22] (03CR) 10Majavah: "> I think it would be better done as a separate thing anyway, if we decided to do that." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [19:16:29] (03CR) 10Bstorm: [C: 03+2] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/703708 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [19:17:56] (03PS3) 10Majavah: metricsinfra: Remove alertmanager apache proxy [puppet] - 10https://gerrit.wikimedia.org/r/704522 (https://phabricator.wikimedia.org/T286335) [19:19:08] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:19:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['copernicium.wikimedia.org'] ` and were **ALL** successful. [19:22:22] (03PS1) 10RobH: copernicium should be bullseye [puppet] - 10https://gerrit.wikimedia.org/r/705013 (https://phabricator.wikimedia.org/T282272) [19:22:32] (03PS1) 10Majavah: metricsinfra: Add separate alertmanager support [puppet] - 10https://gerrit.wikimedia.org/r/705014 (https://phabricator.wikimedia.org/T286335) [19:22:58] (03CR) 10Bstorm: [C: 03+2] "Looks good. I was going to ask if prometheus still requires it, but that doesn't need to be sorted on this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/704522 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [19:23:49] (03CR) 10RobH: [C: 03+2] copernicium should be bullseye [puppet] - 10https://gerrit.wikimedia.org/r/705013 (https://phabricator.wikimedia.org/T282272) (owner: 10RobH) [19:25:08] (03PS2) 10Bstorm: metricsinfra: Add separate alertmanager support [puppet] - 10https://gerrit.wikimedia.org/r/705014 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [19:36:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` copernicium.wikimedia.org ` The log can be found in `/var/log... [19:42:31] (03CR) 10Bstorm: [C: 03+2] metricsinfra: Add separate alertmanager support [puppet] - 10https://gerrit.wikimedia.org/r/705014 (https://phabricator.wikimedia.org/T286335) (owner: 10Majavah) [19:43:54] (03CR) 10Bstorm: [C: 03+2] toolforge::prometheus: Update PAWS ingress target [puppet] - 10https://gerrit.wikimedia.org/r/704277 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [19:45:08] (03CR) 10Bstorm: [C: 03+2] "They don't do anything, but we don't want them either! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/703618 (owner: 10Majavah) [19:48:21] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on copernicium.wikimedia.org with reason: REIMAGE [19:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on copernicium.wikimedia.org with reason: REIMAGE [19:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:00] (03CR) 10Zabe: [C: 03+1] "I tested this patch locally and it works for me, thanks for fixing this issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704995 (https://phabricator.wikimedia.org/T286797) (owner: 10Urbanecm) [20:13:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10RobH) a:05RobH→03MoritzMuehlenhoff So this has an initial puppet run failure for megacli and bullseye, which I chatted with Moritz about and he is aware. This task is reassigned to...
[20:16:58] PROBLEM - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:33:28] (03PS1) 10Cwhite: logstash: extract syslog object logic from normalize_level [puppet] - 10https://gerrit.wikimedia.org/r/705018 [20:33:30] (03PS1) 10Cwhite: logstash: add gitlab ECS transformations [puppet] - 10https://gerrit.wikimedia.org/r/705019 [20:35:28] (03CR) 10jerkins-bot: [V: 04-1] logstash: extract syslog object logic from normalize_level [puppet] - 10https://gerrit.wikimedia.org/r/705018 (owner: 10Cwhite) [20:39:06] (03PS2) 10Cwhite: logstash: extract syslog object logic from normalize_level [puppet] - 10https://gerrit.wikimedia.org/r/705018 [21:10:40] (03CR) 10Andrew Bogott: [C: 03+1] "I think this will be noisy (I've seen a couple of failovers that weren't due to hitting the connection limit) but we'd probably be better " [puppet] - 10https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [21:13:03] (03CR) 10Bstorm: [C: 03+2] openstack galera: set monitor on failover [puppet] - 10https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [21:17:43] RECOVERY - SSH on gerrit2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:46] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['copernicium.wikimedia.org'] ` Of which those **FAILED**: ` ['copernicium.wikimedia.org'] ` [21:58:35] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:03:07] (03PS2) 10Bstorm: cloud galera: have haproxy shut down sessions when marked [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) [22:04:37] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:04:45] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:06:07] (03CR) 10Bstorm: "Can't find the same doc for Mariadb, but it's *usually* the same behavior https://dev.mysql.com/doc/refman/8.0/en/mysql-tips.html#mysql-re" [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [22:06:29] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:15:34] (03PS1) 10Dave Pifke: webperf: ingest navtiming & coal logs in Logstash [puppet] - 10https://gerrit.wikimedia.org/r/705030 (https://phabricator.wikimedia.org/T285897) [22:22:01] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:59:25] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:45:14] (03PS8) 10Juan90264: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405)
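(Aside, not part of the channel log: the 22:03:07 patch title "cloud galera: have haproxy shut down sessions when marked" suggests HAProxy's `on-marked-down shutdown-sessions` server option, which closes established client connections when a backend is marked down so that clients reconnect and land on the newly selected node, the MySQL/MariaDB auto-reconnect behaviour Bstorm cites at 22:06:07. A sketch of such a backend line, with purely hypothetical host name and port, is `server galera-node1 galera-node1.example.internal:3306 check on-marked-down shutdown-sessions`.)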