[00:04:45] (CR) Bugreporter: "Unresolve. A corresponding talk namespace must be defined, otherwise such page will go nowhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[00:05:09] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071053 (owner: TrainBranchBot)
[00:08:11] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10123906 (Papaul)
[00:09:51] RECOVERY - Check whether ferm is active by checking the default input chain on mw1495 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:18:28] (PS4) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130)
[00:23:49] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:24:27] (PS5) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130)
[00:37:01] (CR) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus)
[00:49:15] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:49:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1,
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:49:47] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:01:56] (CR) Krinkle: logging: Fix local variables leaking into global scope (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński)
[01:07:55] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:55] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:16:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[01:20:47] PROBLEM - mailman list info on lists1004 is CRITICAL:
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:45] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:23:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:27:15] SRE, Editing-team, Fundraising-Backlog, Traffic-Icebox, and 5 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085#10123988 (Pppery) Open→Stalled
[01:39:52] (CR) Scott French: [C:+1] "Thanks, Reuven!"
[cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus)
[01:40:07] (PS4) Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716
[01:45:24] (CR) Krinkle: [C:+1] "Test plan:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński)
[01:47:00] (PS3) Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - https://gerrit.wikimedia.org/r/1069344
[01:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:29] (CR) Krinkle: [C:+1] "The diff looks fine but I don't really trust any static review of this. Let's test this by cherry-picking on mwdebug instead and verifying" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069344 (owner: Bartosz Dziewoński)
[01:48:34] (PS2) Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1070685
[01:54:16] (CR) Krinkle: logging: Simplify extra debug logging configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1070685 (owner: Bartosz Dziewoński)
[01:59:03] (Restored) Krinkle: wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[01:59:06] (PS2) Krinkle: wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[01:59:45] (CR) CI reject: [V:-1] wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] -
https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[02:01:33] (CR) Krinkle: "@bd808@wikimedia.org @abogott@wikimedia.org: It seems the debug file enabled here is similar to what we already have in logging.php with M" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[02:02:08] (PS3) Krinkle: wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[02:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:36:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:33] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:01:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:15:17] !log depooling cp2041 && cp2038 due to high purged lag
[04:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:34] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp(2038|2041).codfw.wmnet
[04:24:48] SRE, Traffic, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10124079 (Vgutierrez) this has been triggered again in cp2038 and cp2041: ` vgutierrez@cumin1002:~$ sudo -i cumin 'cp[2038,2041].codfw.wmnet' 'journalctl -u purged.service --sin...
[04:24:56] !log restarting purged in cp2038 && cp2041 - T334078
[04:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:01] T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078
[04:39:30] SRE, MediaWiki-Engineering, MediaWiki-extensions-BounceHandler, Observability-Metrics, Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10124092 (Krinkle) I've documented the following on Wikitech: https://wikitech.wikimedia.org/wiki/BounceHandler >>! From **[...
[04:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:57:43] !log repool cp2038
[04:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:49] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:05:47] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:30] FIRING: Processor usage over 85%: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[05:12:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:31] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:35] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:21] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[05:22:30] RESOLVED: Processor usage over 85%: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85% - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[05:29:35] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:29:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:39] !log repool cp2041
[05:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:54:45] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:54:45] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:41] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:49] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:31] (CR) Slyngshede: [V:+1 C:+2] P:idp Limit the number of groups pushed to DebMonitor.
[puppet] - https://gerrit.wikimedia.org/r/1070594 (owner: Slyngshede)
[06:17:45] RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[06:18:23] PROBLEM - SSH on gerrit1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:24:09] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:24:47] (PS3) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538)
[06:24:47] (PS1) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[06:25:25] (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:25:28] (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:28:50] (PS4) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538)
[06:28:51] (PS2) C.
Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[06:28:53] (CR) C. Scott Ananian: "Done (well, in the commit message)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:29:28] (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:29:32] (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:30:31] (PS5) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538)
[06:30:31] (PS3) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[06:54:39] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1071024 (owner: Dzahn)
[06:57:18] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2087.codfw.wmnet
[06:57:32] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124177 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi...
[06:57:36] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2087.codfw.wmnet with OS bullseye
[06:57:41] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2087.codfw.wmnet with OS bullseye
[06:57:41] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2087.codfw.wmnet
[06:57:48] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124178 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[06:57:50] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[06:57:52] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124180 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f...
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240906T0700)
[07:00:48] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2087.codfw.wmnet with OS bullseye
[07:01:03] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[07:09:33] (PS1) JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - https://gerrit.wikimedia.org/r/1071071
[07:10:49] RECOVERY - Host gerrit1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.16 ms
[07:12:02] (PS2) JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - https://gerrit.wikimedia.org/r/1071071
[07:12:25] (PS1) Muehlenhoff: Cleanup firewall::service configs for aphlict [puppet] - https://gerrit.wikimedia.org/r/1071072 (https://phabricator.wikimedia.org/T370677)
[07:17:13] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:36] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2087.codfw.wmnet with reason: host reimage
[07:20:00] (CR) Brouberol: [C:+2] airflow: deploy the scheduler via a separate Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol)
[07:20:11] (PS1) Muehlenhoff: Cleanup firewall::service configs for miscweb [puppet] - https://gerrit.wikimedia.org/r/1071073 (https://phabricator.wikimedia.org/T370677)
[07:21:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2087.codfw.wmnet with reason: host reimage
[07:26:00] (PS1) Jelto: deployment_server: add wikidata-query-gui service [puppet] - https://gerrit.wikimedia.org/r/1071075 (https://phabricator.wikimedia.org/T350793)
[07:31:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:32:16] (PS1) Muehlenhoff: Correct firewall services for releases [puppet] - https://gerrit.wikimedia.org/r/1071076 (https://phabricator.wikimedia.org/T370677)
[07:33:24] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:36:03] (PS1)
Brouberol: airflow: fix badly formatted Deployment separation [deployment-charts] - https://gerrit.wikimedia.org/r/1071077 (https://phabricator.wikimedia.org/T368737)
[07:36:22] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:37:31] (CR) Brouberol: [C:+2] airflow: fix badly formatted Deployment separation [deployment-charts] - https://gerrit.wikimedia.org/r/1071077 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol)
[07:39:22] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:39:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:40:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:46:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:49:25] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow
[07:49:35] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 10s)
[07:51:13] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts acmechief1001.eqiad.wmnet
[07:51:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:52:52] (PS1)
Slyngshede: P:idm: Add ecdsa-sha2-nistp256 to allowed key types. [puppet] - https://gerrit.wikimedia.org/r/1071123 (https://phabricator.wikimedia.org/T371956)
[07:55:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:56:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:58:16] (PS1) JMeybohm: rename/renumber kubernetes2020,2033 to wikikube-worker2093,2094 [puppet] - https://gerrit.wikimedia.org/r/1071124 (https://phabricator.wikimedia.org/T372878)
[07:58:49] (CR) Slyngshede: [V:+1] "Added Jesse as reviewer to get input on the sanity of the Puppet code." [puppet] - https://gerrit.wikimedia.org/r/1003442 (owner: Slyngshede)
[07:59:02] (CR) JMeybohm: [C:+2] rename/renumber kubernetes2020,2033 to wikikube-worker2093,2094 [puppet] - https://gerrit.wikimedia.org/r/1071124 (https://phabricator.wikimedia.org/T372878) (owner: JMeybohm)
[07:59:37] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2020.codfw.wmnet
[08:00:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2020.codfw.wmnet
[08:00:20] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2033.codfw.wmnet
[08:00:54] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host kubernetes2033.codfw.wmnet
[08:01:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief1001.eqiad.wmnet
decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:01:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:01:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts acmechief1001.eqiad.wmnet
[08:01:29] SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124321 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002...
[08:03:38] (PS1) Slyngshede: Git: Add missing .gitreview file. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180)
[08:06:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:07:10] (PS1) Tiziano Fogli: opensearch: ignore hosts with unknown team in role_owner [alerts] - https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178)
[08:08:15] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:09:37] (PS1) Slyngshede: P:idp Prometheus blackbox monitoring for IDP.
[puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655)
[08:10:39] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3898/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:11:11] (PS1) Elukey: admin_ng: set disablePSPMutations for AUX [deployment-charts] - https://gerrit.wikimedia.org/r/1071132 (https://phabricator.wikimedia.org/T369491)
[08:13:00] (CR) DCausse: "correct, although I might perhaps be overcautious because I remember we had issues with querying un-mapped fields in the past... but looki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: DCausse)
[08:13:12] (PS5) DCausse: search: use the stem field when searching mul labels [mediawiki-config] - https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401)
[08:13:48] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3899/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:14:56] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3900/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:18:20] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2020 to wikikube-worker2093
[08:18:37] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:18:46] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2033 to wikikube-worker2094
[08:19:37] PROBLEM - BGP status on cr2-codfw is
CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:22] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3901/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:23:39] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:23:45] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:24:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2087.codfw.wmnet with OS bullseye
[08:24:30] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts acmechief2001.codfw.wmnet
[08:24:41] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124359 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[08:25:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede) [08:25:48] (03CR) 10Slyngshede: [C:03+2] Git: Add missing .gitreview file. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede) [08:27:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:27:50] (03Merged) 10jenkins-bot: Git: Add missing .gitreview file. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede) [08:28:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3902/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [08:28:49] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:31:00] !log jayme@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:31:13] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2093 [08:31:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2093 [08:31:35] (03PS1) 10Elukey: services: update Proton's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071134 (https://phabricator.wikimedia.org/T367981) [08:31:37] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2033 to wikikube-worker2094 - jayme@cumin1002" [08:31:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2033 to wikikube-worker2094 - jayme@cumin1002" [08:31:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:31:58] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2094 [08:32:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2020 to wikikube-worker2093 [08:32:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2094 [08:32:22] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes202... [08:32:55] (03CR) 10Btullis: [C:03+2] Enable IPv6 for the envoyproxy on DPE Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/1070949 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis) [08:32:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2033 to wikikube-worker2094 [08:33:16] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes203... 
[08:36:45] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2087.codfw.wmnet [08:36:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2087.codfw.wmnet [08:36:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:36:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:36:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:36:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts acmechief2001.codfw.wmnet [08:37:01] 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124378 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002... [08:38:33] (03CR) 10Elukey: [C:03+2] services: update Proton's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071134 (https://phabricator.wikimedia.org/T367981) (owner: 10Elukey) [08:38:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: kubernetes2035 (renamed to wikikube-worker2087) reporting "Comm Error: Backplane 0" - https://phabricator.wikimedia.org/T374019#10124380 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Reimage worked fine now, thanks! 
[08:40:34] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/proton: sync [08:41:19] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: sync [08:42:08] (03PS3) 10JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071 [08:42:17] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2093.codfw.wmnet [08:42:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124386 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [08:42:34] 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124387 (10MoritzMuehlenhoff) [08:43:04] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2093.codfw.wmnet with OS bullseye [08:43:14] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2093 [08:43:19] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:43:33] 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124388 (10MoritzMuehlenhoff) 05Open→03Resolved All done! [08:43:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... 
[08:45:45] (03CR) 10Filippo Giunchedi: [C:03+1] opensearch: ignore hosts with unknown team in role_owner [alerts] - 10https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli) [08:45:56] (03CR) 10Slyngshede: [V:03+1] "Plan is: rollout blackbox check, then absent Icinga checks and finally remove them from Puppet." [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [08:46:27] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2093 - jayme@cumin1002" [08:46:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2093 - jayme@cumin1002" [08:46:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:46:32] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2093.codfw.wmnet 135.16.192.10.in-addr.arpa 5.3.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:46:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2093.codfw.wmnet 135.16.192.10.in-addr.arpa 5.3.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:46:35] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2093 [08:47:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2093 [08:47:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2093 [08:48:16] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2094.codfw.wmnet [08:48:25] !log elukey@deploy1003 helmfile 
[codfw] START helmfile.d/services/proton: sync [08:48:27] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2094.codfw.wmnet with OS bullseye [08:48:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124417 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [08:48:37] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2094 [08:48:37] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [08:48:46] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [08:49:47] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: sync [08:49:48] (03PS5) 10DCausse: wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) [08:49:48] (03PS5) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [08:49:48] (03PS5) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) [08:49:51] (03CR) 10Btullis: [V:03+1 C:03+2] Add the anycast VIP for radosgw to DPE Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/1070950 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis) [08:51:55] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for 
host wikikube-worker2094 - jayme@cumin1002" [08:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:51:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2094 - jayme@cumin1002" [08:52:00] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:52:00] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2094.codfw.wmnet 224.16.192.10.in-addr.arpa 4.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:52:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2094.codfw.wmnet 224.16.192.10.in-addr.arpa 4.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:52:04] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2094 [08:52:26] (03PS1) 10Brouberol: airflow: configure metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098) [08:52:49] 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10124427 (10elukey) [08:53:43] 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10124429 (10elukey) ` dragonfly-supernode | 1.0.6-2 | bookworm-wikimedia | main | amd64 ` Next steps: - reimage codfw outside the deployment window - let it bake for some days - do the same for eqiad [08:53:47] 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10124430 (10elukey) [08:54:37] 
(03PS2) 10Brouberol: airflow: configure metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098) [08:54:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2094 [08:54:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2094 [08:55:37] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync [08:57:23] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync [09:00:27] (03CR) 10Filippo Giunchedi: [C:03+1] "Very cool! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [09:07:33] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [09:09:04] (03CR) 10Elukey: [C:03+2] "Tested on build2001, worked nicely :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [09:10:22] (03CR) 10Brouberol: [C:03+2] airflow: configure metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [09:12:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:12:37] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2094.codfw.wmnet with reason: host reimage [09:12:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:14:02] (03PS1) 10Muehlenhoff: Fix up Phabricator firewall services, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1071146 [09:14:03] (03PS1) 10Muehlenhoff: Fix up Phabricator firewall services, part 2 [puppet] - 
10https://gerrit.wikimedia.org/r/1071147 (https://phabricator.wikimedia.org/T370677) [09:15:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2094.codfw.wmnet with reason: host reimage [09:16:21] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [09:17:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:18:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071123 (https://phabricator.wikimedia.org/T371956) (owner: 10Slyngshede) [09:20:27] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet [09:24:09] (03PS4) 10Elukey: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) [09:25:27] (03CR) 10Alexandros Kosiaris: [C:03+2] service: Remove php7.2 specific health check [puppet] - 10https://gerrit.wikimedia.org/r/1070993 (owner: 10Alexandros Kosiaris) [09:25:35] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [09:26:34] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet [09:27:50] FIRING: KubernetesCalicoDown: ml-serve2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:28:39] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet [09:35:23] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet [09:37:44] (03PS1) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) [09:38:43] (03Merged) 10jenkins-bot: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [09:42:11] (03PS1) 10Muehlenhoff: debmonitor: Also use adduser on Bullseye to create the system user [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071154 (https://phabricator.wikimedia.org/T372472) [09:42:56] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207 (10klausman) 03NEW [09:45:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2094.codfw.wmnet with OS bullseye [09:46:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... 
[09:46:26] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2007.codfw.wmnet [09:46:33] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071154 (https://phabricator.wikimedia.org/T372472) (owner: 10Muehlenhoff) [09:46:58] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10124631 (10klausman) I already tried a reboot and a complete powercycle to revive the disk, to no avail. [09:47:08] !log homer lsw1-b6-codfw* commit 'T372878' [09:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:10] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [09:47:55] (03CR) 10Btullis: airflow: enable visualizing logs of DAG runs in the webserver UI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [09:48:01] !log homer cr*codfw* commit 'T372878' [09:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:24] 07sre-alert-triage, 10Data-Platform-SRE (2024.09.06 - 2024.09.27): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10124635 (10Gehel) [09:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:17] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:50] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node 
(exit_code=99) Renumbering for host wikikube-worker2094.codfw.wmnet [09:51:08] (03CR) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [09:53:04] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.13.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1071156 [09:53:39] (03CR) 10JMeybohm: [C:04-1] sre.k8s.renumber-node: Run puppet on registry (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 (owner: 10Clément Goubert) [09:55:43] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 355, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:47] (03PS5) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 [09:57:19] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2094.codfw.wmnet [09:57:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2094.codfw.wmnet [09:57:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124728 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... 
[10:00:10] 06SRE, 10CheckUser, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210 (10Dreamy_Jazz) 03NEW [10:00:53] 06SRE, 10CheckUser, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124779 (10Dreamy_Jazz) [10:01:54] (03PS1) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) [10:03:14] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10124781 (10JMeybohm) [10:03:22] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124782 (10Dreamy_Jazz) [10:05:28] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2093.codfw.wmnet with reason: host reimage [10:05:47] (03PS2) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) [10:05:51] (03CR) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [10:06:47] (03CR) 10Btullis: [C:03+1] "Nice. Thanks for that change to the networkpolicy." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [10:06:54] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124795 (10Dreamy_Jazz) [10:06:57] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124797 (10Dreamy_Jazz) [10:07:02] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124798 (10Dreamy_Jazz) [10:08:16] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.13.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1071156 (owner: 10Elukey) [10:08:55] (03CR) 10Muehlenhoff: [C:03+2] debmonitor: Also use adduser on Bullseye to create the system user [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071154 (https://phabricator.wikimedia.org/T372472) (owner: 10Muehlenhoff) [10:09:02] (03CR) 10Brouberol: [C:03+2] airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [10:09:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2093.codfw.wmnet with reason: host reimage [10:10:11] (03Merged) 10jenkins-bot: airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [10:10:45] (03PS1) 10Elukey: Upstream release v8.13.0 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1071159 [10:11:00] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v8.13.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1071159 (owner: 10Elukey) [10:12:55] (03PS20) 10Elukey: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [10:13:11] (03CR) 10Elukey: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:17:22] !log uploaded spicerack_8.13.0 to apt.wikimedia.org bullseye-wikimedia [10:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:37] (03PS1) 10Muehlenhoff: Bump changelog [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071160 [10:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:21:10] (03Abandoned) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 for wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1070973 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:21:56] (03PS1) 10Filippo Giunchedi: mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597) [10:23:10] !log install spicerack 8.13.0 on cumin2002 [10:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:48] PROBLEM - Host db1246 #page is DOWN: PING CRITICAL - Packet loss = 100% [10:24:06] here [10:24:27] oncallers, you need help? 
[10:24:29] I just woke up [10:24:33] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10124846 (10Clement_Goubert) [10:24:35] here [10:24:38] !incidents [10:24:38] 5138 (UNACKED) Host db1246 (paged) - PING - Packet loss = 100% [10:24:42] !ack 5138 [10:24:43] 5138 (ACKED) Host db1246 (paged) - PING - Packet loss = 100% [10:24:52] ok, now looking into what on earth [10:25:03] let's depool it [10:25:06] it's a normal replica [10:25:11] ok [10:25:19] who does it? [10:25:25] on it [10:25:31] ok, thanks [10:25:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1246 (It's sad)', diff saved to https://phabricator.wikimedia.org/P68731 and previous config saved to /var/cache/conftool/dbconfig/20240906-102551-ladsgroup.json [10:26:02] ``` [10:26:05] https://www.irccloud.com/pastebin/pUGGhyXi/ [10:26:13] akosiaris: for future reference, https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica [10:26:14] it's also for dumps [10:26:17] (03CR) 10Muehlenhoff: [C:03+2] Bump changelog [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071160 (owner: 10Muehlenhoff) [10:26:39] RECOVERY - Host db1246 #page is UP: PING WARNING - Packet loss = 50%, RTA = 28.59 ms [10:27:04] I can't ssh into it, it's probably some hw/network issue [10:27:21] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:51] PROBLEM - SSH on db1246 is CRITICAL: connect to address 10.64.48.172 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:27:54] yeah, ssh connection refused immediately [10:27:55] it looks to have come back into single-user / emergency mode [10:28:09] console is at "Give root password for maintenance" point [10:28:31] ah, so it failed the fsck [10:29:21] Amir1: shall I leave investigating the sad system to 
you and/or arnaud.b ? [10:29:44] yeah, if there is a ticket, It'd be amazing [10:29:47] (03CR) 10Clément Goubert: k8s: rename mw232[012], kubernetes2031 to wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:29:50] so I can eat breakfast [10:29:53] and boot up [10:30:01] I'll make one, tag it DBA [10:30:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:33] akosiaris: you mind downtiming that host while I write up a ticket? [10:30:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:31:29] (03PS2) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) [10:31:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:31:47] (03CR) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 to wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:31:49] Emperor: will do [10:32:51] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode [10:33:01] ta [10:33:04] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode [10:33:05] I suppose duration in days? 
[10:33:22] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 5.493 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:33:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.938 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:33:30] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode [10:33:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode [10:33:36] gave it a downtime of 5 days [10:34:02] Yeah, no point it p.aging us over the weekend [10:34:03] T374215 [10:34:04] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [10:38:06] (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:38:42] !log factory reset of sretest2001 [10:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:09] !log uploaded debmonitor-client 0.4.0-2+deb11u1 on bullseye-wikimedia (didn't rebuild the other suites since the fix is specific to Bullseye) T372472 [10:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:12] T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client - https://phabricator.wikimedia.org/T372472 [10:39:27] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:40:24] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set 
policy FORCE_RESTART [10:43:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [10:44:23] (03PS1) 10Clément Goubert: kubernetes: Rename mw233[2-4] [puppet] - 10https://gerrit.wikimedia.org/r/1071164 (https://phabricator.wikimedia.org/T372878) [10:47:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10124967 (10elukey) @Jhancock.wm Hi! I tried to factory reset the sretest2001's BMC, and now I am getting some errors when using the Redfish API (unauthorized etc..). I... [10:52:27] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124969 (10Ladsgroup) In less than a month, wikitech will go inside production and this wou... [10:54:38] (03PS2) 10Filippo Giunchedi: mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597) [10:54:38] (03PS1) 10Filippo Giunchedi: mediawiki: port account creation failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071165 (https://phabricator.wikimedia.org/T350597) [10:59:28] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125008 (10Dreamy_Jazz) >>! In T374210#10124969, @Ladsgroup wrote: > In less than a month,... [10:59:42] 06SRE, 06Traffic, 13Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10125005 (10Vgutierrez) 05Stalled→03In progress a:03Vgutierrez [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240906T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240906T1100). Please do the needful. [11:00:16] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on lists1004.wikimedia.org with reason: T373980 [11:00:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [11:00:25] T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed. - https://phabricator.wikimedia.org/T373980 [11:00:29] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on lists1004.wikimedia.org with reason: T373980 [11:02:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071168 [11:02:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071168 (owner: 10TrainBranchBot) [11:02:18] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:02:30] ^ expected because of the reboot [11:02:56] !log rolling out debmonitor-client 0.4.0-2+deb11u1 on bullseye-wikimedia on bullseye hosts T372472 [11:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:59] T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client - https://phabricator.wikimedia.org/T372472 [11:03:18] wow I picked the right moment to eat [11:03:32] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [11:04:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [11:04:55] 06SRE, 10Bitu, 06Infrastructure-Foundations: Implementation of request flow - 
https://phabricator.wikimedia.org/T335474#10125026 (10SLyngshede-WMF) 05Open→03In progress [11:06:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [11:08:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:43] FIRING: [4x] ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2093.codfw.wmnet with OS bullseye [11:09:53] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125034 (10Dreamy_Jazz) Looking at the stack trace again, I see this isn't actually failing... [11:10:02] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... 
[11:13:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:04] !log installing Linux 5.10.223 on bullseye hosts [11:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:35] (03CR) 10Hnowlan: [C:03+2] k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [11:16:54] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10125060 (10kamila) [11:17:04] (03CR) 10Clément Goubert: [C:03+1] mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [11:17:11] (03CR) 10Clément Goubert: [C:03+1] mediawiki: port account creation failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071165 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [11:20:09] 10ops-eqiad, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10125067 (10ABran-WMF) That bad reboot seems to stem from a hardware issue: ` The system board BP1 PG voltage is within range. Fri Sep 06 2024 10:20:57 The system board BP1 PG voltage is outsid... 
[11:20:34] PROBLEM - mailman3_runners on lists1004 is CRITICAL: PROCS CRITICAL: 15 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:20:51] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:22:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [11:22:48] status [11:23:48] !log homer lsw1-b6-codfw* commit 'T372878' [11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [11:24:24] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2031 to wikikube-worker2095 [11:24:36] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2320 to wikikube-worker2096 [11:24:41] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:25:21] RESOLVED: ProbeDown: Service 
mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:26:46] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2093.codfw.wmnet [11:26:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125115 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f... 
[11:27:54] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2031 to wikikube-worker2095 - hnowlan@cumin1002" [11:28:43] RESOLVED: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071168 (owner: 10TrainBranchBot) [11:30:24] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:30:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2321:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2321 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:32:40] (03CR) 10JMeybohm: [C:03+1] kubernetes: Rename mw233[2-4] [puppet] - 10https://gerrit.wikimedia.org/r/1071164 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [11:32:45] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:32:46] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2096 [11:33:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2096 [11:33:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2332.codfw.wmnet [11:33:52] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2031 to wikikube-worker2095 - hnowlan@cumin1002" [11:33:52] !log 
hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:33:53] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095 [11:34:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2332.codfw.wmnet [11:34:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2320 to wikikube-worker2096 [11:34:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2333.codfw.wmnet [11:34:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125185 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2320 to w... [11:34:37] (03PS1) 10Muehlenhoff: Fix firewall service definitions for CI [puppet] - 10https://gerrit.wikimedia.org/r/1071175 (https://phabricator.wikimedia.org/T370677) [11:34:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095 [11:34:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2333.codfw.wmnet [11:35:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2334.codfw.wmnet [11:35:31] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2321 to wikikube-worker2097 [11:35:35] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2031 to wikikube-worker2095 [11:35:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2334.codfw.wmnet [11:35:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw2321:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:35:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2... [11:35:48] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:37:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:37:40] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2010.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2302.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2062.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2353.codfw.wmnet, mw2394.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker [11:37:40] fw.wmnet, wikikube-worker2087.codfw.wmnet, mw2395.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2037.codfw.wmnet, mw2369.codfw.wmnet, mw2437.codfw.wmnet, mw2445.codfw.wmnet, kubernetes2047.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2425.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2068.codfw.wmnet, mw2357.codfw.wmnet, wikikube-worker2038.codfw.wmnet, wikikube-worker2009. 
[11:37:40] net, wikikube-worker2072.codfw.wmnet, mw2373.codfw.wmnet, kubernetes2049.codfw.wmnet, parse2015.codfw.wmnet, mw2311.codfw.wmnet, parse2011.codfw.wmnet, mw2446.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [11:37:42] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2056.codfw.wmnet, wikikube-worker2071.codfw.wmnet, mw2373.codfw.wmnet, mw2335.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2032.codfw.wmnet, kubernetes2005.codfw.wmnet, wikikube-worker2086.codfw.wmnet, mw2440.codfw.wmnet, mw2366.codfw.wmnet, mw2337.codfw.wmnet, kubernetes2006.codfw.wmnet, mw228 [11:37:42] wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2068.codfw.wmnet, mw2359.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2002.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2302.codfw.wmnet, mw2301.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2445.codfw.wmnet, wikikube-worker2038.codfw.wmnet, wikikube-worker2064.codfw.wmnet, kubernetes2015.codfw.wmnet, mw [11:37:42] fw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2059.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2021.codfw.wmnet, wikikube-worker2070.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [11:38:08] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Rename mw233[2-4] [puppet] - 10https://gerrit.wikimedia.org/r/1071164 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [11:39:12] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2321 to wikikube-worker2097 - hnowlan@cumin1002" [11:39:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2321 to wikikube-worker2097 - 
hnowlan@cumin1002" [11:39:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:39:40] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2097 [11:39:42] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:39:42] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:40:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2097 [11:40:21] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:21] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2322 to wikikube-worker2098 [11:40:36] (03PS2) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) [11:40:36] (03CR) 10Arnaudb: "I've tried to exclusively stick to the existing logic, only replacing the plumbing and wire to limit this iteration's scope" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [11:40:37] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:40:50] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2321 to wikikube-worker2097 [11:40:51] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2321 to w... [11:41:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2332 to wikikube-worker2099 [11:41:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:43:54] (03PS1) 10Muehlenhoff: Revert "codesearch: replace ferm::service with firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/1071176 [11:43:57] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2322 to wikikube-worker2098 - hnowlan@cumin1002" [11:45:57] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:46:07] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2322 to wikikube-worker2098 - hnowlan@cumin1002" [11:46:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:08] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2098 [11:46:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:46:30] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2098 [11:47:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2322 to wikikube-worker2098 [11:47:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2322 to w... [11:48:26] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2095.codfw.wmnet wikikube-worker2096.codfw.wmnet wikikube-worker2097.codfw.wmnet wikikube-worker2098.codfw.wmnet on all recursors [11:48:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2095.codfw.wmnet wikikube-worker2096.codfw.wmnet wikikube-worker2097.codfw.wmnet wikikube-worker2098.codfw.wmnet on all recursors [11:49:36] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2332 to wikikube-worker2099 - cgoubert@cumin1002" [11:49:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2332 to wikikube-worker2099 - cgoubert@cumin1002" [11:49:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:41] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2099 [11:49:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host wikikube-worker2099 [11:50:03] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2097.codfw.wmnet [11:50:16] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2097.codfw.wmnet with OS bullseye [11:50:20] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2096.codfw.wmnet [11:50:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125289 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [11:50:25] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2097 [11:50:27] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [11:50:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125291 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... 
[11:50:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2096.codfw.wmnet with OS bullseye [11:50:31] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:50:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2332 to wikikube-worker2099 [11:50:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2334:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2334 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:50:45] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [11:50:48] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2332 to... [11:50:51] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2095.codfw.wmnet [11:50:52] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2098.codfw.wmnet [11:51:02] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125309 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... 
[11:51:02] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye [11:51:07] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2098.codfw.wmnet with OS bullseye [11:51:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125310 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [11:51:14] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2095 [11:51:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125311 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [11:51:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... 
[11:52:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2333 to wikikube-worker2100 [11:52:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:54:01] (03PS3) 10Sergio Gimeno: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907) [11:54:04] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2097 - hnowlan@cumin1002" [11:54:52] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2097 - hnowlan@cumin1002" [11:54:52] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:54:52] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2097.codfw.wmnet 175.16.192.10.in-addr.arpa 5.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:54:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2097.codfw.wmnet 175.16.192.10.in-addr.arpa 5.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:54:56] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2097 [11:55:16] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:00:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2097 [12:00:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2097 [12:00:25] (03PS1) 10TrainBranchBot: 
Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071178 [12:00:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071178 (owner: 10TrainBranchBot) [12:00:30] what [12:00:39] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [12:00:40] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2096 [12:00:47] jnuche: do you know why the branch cut pretest starts at now? :) [12:01:39] I'm running it manually to see if I can find out why it's been broken for the last couple of days [12:01:54] ah [12:02:04] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2333 to wikikube-worker2100 - cgoubert@cumin1002" [12:02:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2333 to wikikube-worker2100 - cgoubert@cumin1002" [12:02:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:02:10] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2100 [12:02:10] jelto and I were about to restart Gerrit :D [12:02:19] albeit I haven't put it in the deployment calendar [12:02:28] (03CR) 10Ladsgroup: [C:03+1] "LGTM. 
CCing Lego" [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney) [12:02:28] I guess once the tests are running, we have enough time to restart the server [12:02:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2100 [12:02:43] (03CR) 10Alexandros Kosiaris: [C:03+2] deployment_server: Remove buster php-readline stanza [puppet] - 10https://gerrit.wikimedia.org/r/1070994 (owner: 10Alexandros Kosiaris) [12:02:58] go ahead if you need to restart gerrit, I'll just kick it off the job again if I need to [12:03:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2333 to wikikube-worker2100 [12:03:11] *kick off [12:03:11] I'll add a downtime for 15m, one sec [12:03:24] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2333 to... 
[12:03:32] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit.wikimedia.org with reason: Gerrit reboot [12:03:33] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:15:00 on gerrit.wikimedia.org with reason: Gerrit reboot [12:03:55] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit1003.wikimedia.org with reason: Gerrit reboot [12:04:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit1003.wikimedia.org with reason: Gerrit reboot [12:04:20] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2334 to wikikube-worker2101 [12:05:03] hashar: let me know when I should start the reboot cookbook for gerrit1003 [12:05:41] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [12:06:23] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2095 - hnowlan@cumin1002" [12:06:43] jelto: can we do gerrit2002 first? [12:06:50] that is gerrit-replica :) [12:07:08] it was rebooted yesterday already, probably by mutante [12:07:30] and that solved the issue? 
:) [12:07:37] yes [12:07:42] \o/ [12:07:45] lets do gerrit1003 [12:07:46] :) [12:08:05] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2095 - hnowlan@cumin1002" [12:08:05] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:08:05] ok I'll do the reboot now for gerrit1003 [12:08:05] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2095.codfw.wmnet 222.16.192.10.in-addr.arpa 2.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:08:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2095.codfw.wmnet 222.16.192.10.in-addr.arpa 2.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:08:09] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095 [12:08:19] !log upgrade ganeti-test2003 to bookworm for some bullseye->bookworm VM migration tests [12:08:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095 [12:08:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2095 [12:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org [12:09:29] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2096 - hnowlan@cumin1002" [12:09:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2096 - 
hnowlan@cumin1002" [12:09:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:09:34] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2096.codfw.wmnet 173.16.192.10.in-addr.arpa 3.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:09:37] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2096.codfw.wmnet 173.16.192.10.in-addr.arpa 3.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:09:37] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2096 [12:10:08] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:10:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2096 [12:10:37] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2096 [12:12:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:12:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2098 [12:12:23] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2101 [12:12:32] Found reboot since 2024-09-06 12:08:46.859551 for hosts gerrit1003.wikimedia.org [12:12:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2101 [12:12:54] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [12:13:02] gerrit web interface is back already, cookbook still doing checks [12:13:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2334 to wikikube-worker2101 [12:13:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - 
https://phabricator.wikimedia.org/T372878#10125385 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2334 to... [12:14:16] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2099.codfw.wmnet wikikube-worker2100.codfw.wmnet wikikube-worker2101.codfw.wmnet on all recursors [12:14:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2099.codfw.wmnet wikikube-worker2100.codfw.wmnet wikikube-worker2101.codfw.wmnet on all recursors [12:15:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org [12:15:25] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2099.codfw.wmnet [12:15:26] hashar: reboot done [12:15:30] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2099.codfw.wmnet [12:15:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125390 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:15:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125391 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... [12:15:57] jelto: congratulations! 
[12:15:58] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2098 - hnowlan@cumin1002" [12:16:02] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2098 - hnowlan@cumin1002" [12:16:03] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:16:03] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2098.codfw.wmnet 176.16.192.10.in-addr.arpa 6.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:16:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2098.codfw.wmnet 176.16.192.10.in-addr.arpa 6.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:16:06] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2098 [12:16:27] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2029.codfw.wmnet [12:16:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2029.codfw.wmnet [12:16:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2098 [12:16:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2098 [12:16:40] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.k8s.pool-depool-node (exit_code=97) depool for host wikikube-worker2029.codfw.wmnet [12:16:40] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2029.codfw.wmnet [12:16:41] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 
13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125393 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:16:50] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2097.codfw.wmnet with reason: host reimage [12:16:50] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125394 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... [12:17:06] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2029.codfw.wmnet [12:17:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125395 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:17:43] (03CR) 10DCausse: [C:03+2] search: Update Cirrus Saneitizer alert [alerts] - 10https://gerrit.wikimedia.org/r/1071004 (owner: 10Ebernhardson) [12:17:44] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2029.codfw.wmnet [12:17:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125396 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... 
[12:18:03] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2099.codfw.wmnet [12:18:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125398 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:18:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2099.codfw.wmnet with OS bullseye [12:18:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2099 [12:18:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [12:18:35] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:18:52] PROBLEM - Host kubernetes2033 is DOWN: PING CRITICAL - Packet loss = 100% [12:18:54] (03Merged) 10jenkins-bot: search: Update Cirrus Saneitizer alert [alerts] - 10https://gerrit.wikimedia.org/r/1071004 (owner: 10Ebernhardson) [12:19:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2097.codfw.wmnet with reason: host reimage [12:20:30] PROBLEM - Host kubernetes2031 is DOWN: PING CRITICAL - Packet loss = 100% [12:20:30] PROBLEM - Host mw2321 is DOWN: PING CRITICAL - Packet loss = 100% [12:20:56] expected ^ [12:21:20] I'll go clean them up in a minute [12:21:22] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [12:21:53] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Update records for host wikikube-worker2099 - cgoubert@cumin1002" [12:21:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2099 - cgoubert@cumin1002" [12:21:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:21:58] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2099.codfw.wmnet 201.16.192.10.in-addr.arpa 1.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:22:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2099.codfw.wmnet 201.16.192.10.in-addr.arpa 1.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:22:02] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2099 [12:22:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2099 [12:22:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2099 [12:23:58] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2100.codfw.wmnet [12:24:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125427 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... 
[12:24:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2100.codfw.wmnet with OS bullseye [12:24:29] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2100 [12:24:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [12:24:40] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:25:32] PROBLEM - Host mw2332 is DOWN: PING CRITICAL - Packet loss = 100% [12:27:27] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2096.codfw.wmnet with reason: host reimage [12:28:05] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2100 - cgoubert@cumin1002" [12:28:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2100 - cgoubert@cumin1002" [12:28:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:10] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2100.codfw.wmnet 202.16.192.10.in-addr.arpa 2.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:28:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2100.codfw.wmnet 202.16.192.10.in-addr.arpa 2.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:28:14] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2100 [12:28:31] !log cgoubert@cumin1002 END (PASS) - 
Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2100 [12:28:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2100 [12:29:09] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2101.codfw.wmnet [12:29:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125439 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb... [12:29:30] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2101.codfw.wmnet with OS bullseye [12:29:39] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071178 (owner: 10TrainBranchBot) [12:29:40] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2101 [12:29:41] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[12:30:04] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:32:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2096.codfw.wmnet with reason: host reimage [12:33:12] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2101 - cgoubert@cumin1002" [12:33:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2101 - cgoubert@cumin1002" [12:33:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:33:17] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2101.codfw.wmnet 203.16.192.10.in-addr.arpa 3.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:33:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2101.codfw.wmnet 203.16.192.10.in-addr.arpa 3.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:33:21] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2101 [12:33:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2101 [12:33:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2101 
[12:37:16] !log homer cr*codfw* commit 'T372878' [12:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:19] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [12:38:15] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2066.codfw.wmnet [12:38:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2066.codfw.wmnet [12:39:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2099.codfw.wmnet with reason: host reimage [12:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:40:20] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce support for multiple flat networks [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020) [12:40:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071190 [12:40:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071190 (owner: 10TrainBranchBot) [12:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:41:17] (03PS1) 
10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 [12:41:36] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: introduce support for multiple flat networks [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020) [12:41:41] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [12:43:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2099.codfw.wmnet with reason: host reimage [12:43:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2097.codfw.wmnet with OS bullseye [12:43:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125567 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2097.codfw.wm... 
[12:44:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2100.codfw.wmnet with reason: host reimage [12:45:18] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw1486.eqiad.wmnet [12:45:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw1486.eqiad.wmnet [12:45:22] !log homer lsw1-b3-codfw* commit [12:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:56] (03PS1) 10Filippo Giunchedi: graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597) [12:46:15] (03PS4) 10JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071 [12:46:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:47:41] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host kubestage2001.codfw.wmnet [12:47:43] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [12:47:49] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125594 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumbering for host kubestage2... 
[12:48:11] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2097.codfw.wmnet [12:48:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2097.codfw.wmnet [12:48:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2097.codfw.wmnet [12:48:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2100.codfw.wmnet with reason: host reimage [12:48:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125596 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor... [12:48:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [12:48:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:48] (03CR) 10CI reject: [V:04-1] graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi) [12:48:55] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bullseye [12:49:22] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host kubestage2001 [12:49:28] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [12:49:38] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125602 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet... [12:49:45] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:11] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 423, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2101.codfw.wmnet with reason: host reimage [12:51:26] RESOLVED: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:51:28] (03PS1) 10JMeybohm: rename/renumber kubernetes2034 to wikikube-worker2102 [puppet] - 10https://gerrit.wikimedia.org/r/1071194 (https://phabricator.wikimedia.org/T372878) [12:51:38] (03CR) 10Elukey: "Thanks! 
To do things properly we should also update the changelog, with something like "Update maintainer to XXXX"" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (owner: 10Ilias Sarantopoulos) [12:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:52:18] (03CR) 10JMeybohm: [C:03+2] rename/renumber kubernetes2034 to wikikube-worker2102 [puppet] - 10https://gerrit.wikimedia.org/r/1071194 (https://phabricator.wikimedia.org/T372878) (owner: 10JMeybohm) [12:52:32] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kubestage2001 - jayme@cumin1002" [12:52:36] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kubestage2001 - jayme@cumin1002" [12:52:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:37] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestage2001.codfw.wmnet 195.0.192.10.in-addr.arpa 5.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:52:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestage2001.codfw.wmnet 195.0.192.10.in-addr.arpa 5.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:52:40] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host kubestage2001 [12:52:50] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2101.codfw.wmnet with reason: host reimage [12:53:10] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 
and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10125617 (10hnowlan) [12:53:18] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 341, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:53:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubestage2001 [12:53:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kubestage2001 [12:54:03] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2096.codfw.wmnet with OS bullseye [12:54:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [12:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:54:45] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2034 to wikikube-worker2102 [12:55:01] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [12:56:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:56:22] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:57:49] !log 
hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2096.codfw.wmnet [12:57:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2096.codfw.wmnet [12:57:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2096.codfw.wmnet [12:58:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125623 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering... [12:58:19] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2034 to wikikube-worker2102 - jayme@cumin1002" [12:58:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2034 to wikikube-worker2102 - jayme@cumin1002" [12:58:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:39] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2102 [13:00:44] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2102 [13:01:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2034 to wikikube-worker2102 [13:01:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes203... 
[13:02:28] !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2102.codfw.wmnet [13:02:38] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2102.codfw.wmnet with OS bullseye [13:02:46] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125629 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi... [13:02:48] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2102 [13:02:50] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [13:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:06:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2099.codfw.wmnet with OS bullseye [13:06:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 
13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [13:06:57] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:07:10] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2098.codfw.wmnet with reason: host reimage [13:07:25] !incidents [13:07:25] 5138 (ACKED) Host db1246 (paged) - PING - Packet loss = 100% [13:07:25] 5142 (UNACKED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [13:07:26] aha [13:07:31] !ack 5142 [13:07:32] 5142 (ACKED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [13:07:34] thanks [13:07:57] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2046.codfw.wmnet, m [13:07:57] dfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2077.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2366.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2043.codfw.wmnet, 
kubernetes2006.codfw.wmnet, mw2398.codfw.wmnet, wikikube [13:07:57] 002.codfw.wmnet, wikikube-worker2090.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2055.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2016.codfw.wmnet https://wikitech.wikimedia.org/wiki/PyBal [13:08:22] akosiaris: need some help w/ wikifunctions? [13:08:28] I can also lend a hand [13:08:29] (03PS2) 10Filippo Giunchedi: graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597) [13:08:31] and I am oncall heh [13:09:05] I'm here too, mostly! [13:10:00] !oncall [13:10:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:10:21] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2098.codfw.wmnet with reason: host reimage [13:10:38] !oncall-now [13:10:39] Oncall now for team SRE, rotation business_hours: [13:10:39] b.black, a.kosiaris, f.abfur, c.danis [13:10:43] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [13:10:51] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:55] I am trying to understand if this is wf only and it looks like it [13:11:04] so, no harm to the rest of the projects overall [13:11:10] but I didn't expect was pybal alerting [13:11:18] but what* [13:11:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071190 (owner: 10TrainBranchBot) [13:11:44] I see it resolved. Thankfully wf is in its own mw deployment [13:11:50] so it can't hurt the rest of the wikis [13:11:57] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:58] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:12:40] akosiaris: the probe for pybal is Special:BlankPage so if php-fpm can't answer...
[13:12:54] ah no that's just for monitoring [13:13:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [13:14:25] maybe wiki functions shouldn't have its own LVS [13:14:33] (03CR) 10Elukey: [C:03+2] admin_ng: set disablePSPMutations for AUX [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071132 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:14:58] claime: it's the idleconnection thing btw [13:15:05] yeah [13:15:06] it probably killed all connections or something [13:15:51] idleconnection won't depool a realserver if the TCP connection gets closed [13:16:15] it will depool the server if t he TCP connection gets closed and an immediate reconnection fails [13:16:18] *the [13:16:49] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [13:17:06] funnily enough, this is TCP indeed. So what? even apache got backfilled? [13:17:13] ah wait, all pods weren't ready, right? [13:17:15] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[13:17:18] * akosiaris checking [13:18:26] interesting, codfw only [13:18:27] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [13:19:32] yup, both pods were in not ready state [13:19:39] 1 still is [13:20:58] !log homer lsw1-b6-codfw* commit 'T372878' [13:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:01] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:21:57] yup, at least the apache container wasn't ready [13:22:52] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2102 - jayme@cumin1002" [13:22:56] (03PS1) 10Muehlenhoff: ganeti: Install bridge-utils on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1071199 [13:22:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2102 - jayme@cumin1002" [13:22:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:57] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2102.codfw.wmnet 226.16.192.10.in-addr.arpa 6.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:23:00] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2102.codfw.wmnet 226.16.192.10.in-addr.arpa 6.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:23:01] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2102 [13:23:26] http status 414 [13:24:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [13:25:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker2100.codfw.wmnet with OS bullseye [13:25:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [13:25:41] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:25:47] PROBLEM - Host kubernetes2034 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2102 [13:26:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2102 [13:26:53] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 28, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:27:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2101.codfw.wmnet with OS bullseye [13:27:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[13:28:37] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2099.codfw.wmnet [13:28:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2099.codfw.wmnet [13:28:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2099.codfw.wmnet [13:28:58] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2095.codfw.wmnet with OS bullseye [13:28:59] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2095.codfw.wmnet [13:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:29:52] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2100.codfw.wmnet [13:29:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2100.codfw.wmnet [13:29:55] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2100.codfw.wmnet [13:30:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125698 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... 
[13:31:05] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2095.codfw.wm... [13:31:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125700 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor... [13:31:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125715 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [13:31:38] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125716 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... 
[13:32:42] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2101.codfw.wmnet [13:32:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2101.codfw.wmnet [13:32:45] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2101.codfw.wmnet [13:32:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125733 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... [13:32:57] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125734 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo... 
[13:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:36:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bullseye [13:36:07] aha, so again [13:36:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with... [13:36:19] I 'll set up a silence [13:37:18] should we scale up the deployment? 
[13:38:01] (03PS1) 10Jforrester: Fix typo in browser vendor prefix [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071202 (https://phabricator.wikimedia.org/T374180) [13:38:19] cdanis: from apache logs: {"timestamp": "2024-09-06T12:55:08", "RequestTime": "101", "Client-IP": "127.0.0.1", "Handle/Status": "-/414" yadada [13:38:23] 414 is URI too long [13:38:26] ah [13:38:28] heh [13:38:28] I don't think this is a capacity issue [13:38:30] yeah [13:38:32] fair enough [13:38:49] I 'll file a task for aw team though [13:38:57] thanks <3 [13:39:01] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125742 (10Jdforrester-WMF) >>! In T374210#10125034, @Dreamy_Jazz wrote: > As such, I think... [13:39:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:40:36] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125761 (10Dreamy_Jazz) >>! In T374210#10125742, @Jdforrester-WMF wrote: >>>! In T374210#10... 
[13:40:39] 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125763 (10Dreamy_Jazz) [13:41:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:44:49] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2102.codfw.wmnet with reason: host reimage [13:46:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:46:21] (03CR) 10CDanis: [C:03+1] admin_ng: set disablePSPMutations for AUX [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071132 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [13:46:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [13:48:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2102.codfw.wmnet with reason: host reimage [13:49:01] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:49:51] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10125807 (10elukey) I've released spicerack 8.13.0 that collects the latest changes for the redfish module, and inst... 
[13:51:59] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database bdrwiki (T371759) [13:52:01] T371759: Prepare and check storage layer for bdrwiki - https://phabricator.wikimedia.org/T371759 [13:52:04] !log homer lsw1-a6-codfw* commit 'T372878' [13:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:52:10] !log btullis@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database bdrwiki (T371759) [13:53:05] (03CR) 10Muehlenhoff: P:idp Prometheus blackbox monitoring for IDP. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede) [13:56:01] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [13:56:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [13:56:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host kubestage2001.codfw.wmnet [13:56:10] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125845 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host kubestage2001.... [13:58:46] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [13:58:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [13:59:03] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [13:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:59:46] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bookworm [13:59:47] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:30] task: https://phabricator.wikimedia.org/T374241 [14:01:45] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:02:10] (03CR) 10Cwhite: [C:03+1] opensearch: ignore hosts with unknown team in role_owner [alerts] - 10https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli) [14:02:47] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:27] (03PS1) 10Brouberol: airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) [14:05:41] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:05:47] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:55] ah, this is probably me ^, lemme fix that [14:05:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 337, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:07:43] !log silence alerts based on alertname=PHPFPMTooBusy,deployment=mw-wikifunctions,site=codfw T374241 [14:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:46] T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241 [14:09:53] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:09:58] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:10:24] !log restart pybal on lvs1019 [14:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:59] (03CR) 10Filippo Giunchedi: [C:03+2] opensearch: ignore hosts with unknown team in role_owner [alerts] - 10https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli) [14:12:01] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 419, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2102.codfw.wmnet with OS bullseye [14:13:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2102.codfw.wmne... [14:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2.018s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:15:21] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:51] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:55] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:17] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker2102.codfw.wmnet [14:17:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2102.codfw.wmnet [14:17:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2102.codfw.wmnet [14:17:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125940 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host wikikube-worke... [14:18:26] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10125945 (10JMeybohm) [14:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2.018s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:20:51] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [14:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:21:33] PROBLEM - people.wikimedia.org requires authentication on people1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:22:06] FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:25] RECOVERY - people.wikimedia.org requires authentication on people1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:22:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [14:24:11] PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:25:20] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database bdrwiki (T371759) [14:25:23] T371759: Prepare and check storage layer for bdrwiki - https://phabricator.wikimedia.org/T371759 [14:27:06] RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:55] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1059.eqiad.wmnet [14:27:57] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1059.eqiad.wmnet [14:28:15] !log repool kubernetes1059 T365993 [14:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:28] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - 
https://phabricator.wikimedia.org/T365993 [14:30:17] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10126008 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:34:47] (03PS2) 10Brouberol: airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) [14:34:55] (03CR) 10Elukey: "Forgot to mention the last time (sorry) but we may think about refactoring both cookbooks with SREBatchRunnerBase, that offers restart/reb" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb) [14:35:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247 (10cmooney) 03NEW p:05Triage→03Medium [14:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:33] (03PS1) 10Kamila Součková: kubernetes: rename mw2430 to wikikube-worker2103 [puppet] - 10https://gerrit.wikimedia.org/r/1071221 (https://phabricator.wikimedia.org/T372878) [14:40:39] (03CR) 10Bking: [C:03+2] wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:41:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:41:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bookworm [14:42:03] RECOVERY 
- PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:42:36] !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2095.codfw.wmnet [14:42:50] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye [14:42:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126049 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe... [14:42:54] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2095.codfw.wmnet with OS bullseye [14:42:55] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2095.codfw.wmnet [14:43:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126050 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [14:43:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126052 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [14:43:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126053 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering... 
[14:44:05] (03PS2) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 [14:44:06] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye [14:44:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [14:45:47] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10126055 (10elukey) Tried to update Wikitech and https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet, the documentation... [14:46:35] (03PS3) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 [14:47:19] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:47:19] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:47:19] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:47:48] (03PS1) 10Muehlenhoff: Re-add and absent data.yaml entry for manuel-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1071222 (https://phabricator.wikimedia.org/T373927) [14:47:49] (03CR) 10Ilias Sarantopoulos: "You're right! I added a changelog entry for all the affected images" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (owner: 10Ilias Sarantopoulos) [14:48:27] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10126061 (10elukey) Next and last step - wait for the new conftool release, and then close! [14:49:12] (03CR) 10Dzahn: [C:03+2] planet: drop firewall rule for http from localhost [puppet] - 10https://gerrit.wikimedia.org/r/1071024 (owner: 10Dzahn) [14:49:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:49:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse) [14:50:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:50:46] (03PS1) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) [14:51:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database bdrwiki (T371759) [14:51:06] T371759: Prepare and check storage layer for bdrwiki - 
https://phabricator.wikimedia.org/T371759 [14:51:09] (03PS2) 10Kgraessle: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823) [14:51:33] (03PS6) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) [14:52:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2098.codfw.wmnet with OS bullseye [14:52:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126073 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [14:54:36] (03CR) 10Muehlenhoff: [C:03+2] Re-add and absent data.yaml entry for manuel-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1071222 (https://phabricator.wikimedia.org/T373927) (owner: 10Muehlenhoff) [14:55:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:55:32] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249 (10Clement_Goubert) 03NEW [14:56:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:57:34] (03CR) 10Btullis: "Looks good. 
I just want to check something, because I remember that aqu set up some statsd related mappings on the analytics instance some" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [14:59:33] (03CR) 10Cathal Mooney: [C:03+1] Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi) [15:00:24] (03CR) 10Scott French: [C:03+1] kubernetes: rename mw2430 to wikikube-worker2103 [puppet] - 10https://gerrit.wikimedia.org/r/1071221 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [15:00:29] (03PS1) 10Ladsgroup: tables-catalog: Another batch of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071227 (https://phabricator.wikimedia.org/T363581) [15:00:36] (03CR) 10Cathal Mooney: [C:03+1] "I guess at some point we should look at upstream ganeti and move away from the use of this. Until it's no longer packaged in debian I gue" [puppet] - 10https://gerrit.wikimedia.org/r/1071199 (owner: 10Muehlenhoff) [15:01:23] (03CR) 10Cathal Mooney: [C:03+1] "sorry ignore me - as you said it's our internal tooling not ganeti that needs it." [puppet] - 10https://gerrit.wikimedia.org/r/1071199 (owner: 10Muehlenhoff) [15:02:56] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2098.codfw.wmnet [15:02:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2098.codfw.wmnet [15:02:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2098.codfw.wmnet [15:03:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126147 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering... 
[15:04:51] (03PS2) 10Ladsgroup: tables-catalog: Another batch of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071227 (https://phabricator.wikimedia.org/T363581) [15:04:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10126164 (10Jhancock.wm) I'm honestly not sure. I don't see anything missing but I'm also not sure what may have been there before the reset. I've emailed Richard at S... [15:04:57] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Another batch of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071227 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:05:07] (03CR) 10Btullis: "Here they are, for reference:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol) [15:05:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:13] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2430.codfw.wmnet [15:07:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2430.codfw.wmnet [15:08:12] (03PS1) 10Alexandros Kosiaris: ats: Revert the /api/ changes on the CDN side [puppet] - 10https://gerrit.wikimedia.org/r/1071229 (https://phabricator.wikimedia.org/T364400) [15:08:56] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw2430 to wikikube-worker2103 [puppet] - 10https://gerrit.wikimedia.org/r/1071221 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [15:10:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:11:05] (03CR) 10Alexandros Kosiaris: [C:03+2] ats: Revert the /api/ changes on the CDN side [puppet] - 10https://gerrit.wikimedia.org/r/1071229 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris) [15:11:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:13:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: Move traffic off lvs1017 to lvs1020 to troubleshooot faulty link [15:13:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: Move traffic off lvs1017 to lvs1020 to troubleshooot faulty link [15:14:32] !log disabling PyBal on lvs1017 to shift traffic to lvs1020 and allow work to fix faulty fibre link T374247 [15:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:35] T374247: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247 [15:14:44] (03CR) 10Alexandros Kosiaris: [C:03+2] ats: Fix issue with /api/ pointing to /w/rest.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070274 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris) [15:14:59] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:15:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247#10126210 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c63ff66a-28d3-4567-b7cc-a03c0da01345) set by cmooney@cumin1002 for 2:0... 
[15:15:40] ^^ bgp alert is pybal, my bad didn't downtime the CRs [15:15:45] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: hw troubleshooting: host won't boot lists backplane error for pay-lb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374054#10126212 (10Jhancock.wm) looks like us shutting down the server to move it fixed the error. Can you take a look and co... [15:15:52] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2430 to wikikube-worker2103 [15:15:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:15:59] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:16:10] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:18:08] (03PS1) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) [15:18:50] (03PS2) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) [15:19:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10126233 (10Eevans) >>! In T373097#10121448, @MatthewVernon wrote: > There are 4 swift servers in `C4` - ms-be2058 ms-be2064 ms... 
[15:19:40] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2430 to wikikube-worker2103 - kamila@cumin1002" [15:21:29] (03CR) 10CI reject: [V:04-1] keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [15:22:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2430 to wikikube-worker2103 - kamila@cumin1002" [15:22:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:09] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2103 [15:23:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:23:36] (03PS3) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) [15:23:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2103 [15:24:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2430 to wikikube-worker2103 [15:24:36] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2430 to wi... 
[15:24:39] PROBLEM - Host mw2320 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10126245 (10Jhancock.wm) @klausman this one isn't under warranty and I don't have an exact match for the drive. will a 1.92Tb drive work... [15:25:32] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [15:26:37] (03PS4) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) [15:27:16] !log dzahn@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum [15:27:26] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:27:50] !log rolling restarts on durum machines [15:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:03] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2103.codfw.wmnet on all recursors [15:28:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2103.codfw.wmnet on all recursors [15:28:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10126265 (10Eevans) >>! In T373101#10121463, @MatthewVernon wrote: > There are some impact Swift servers: > - ms-be2054 and ms-be2078 and than... 
[15:29:31] !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2103.codfw.wmnet [15:29:47] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126281 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by kamila@cumin1002 Renumber... [15:29:49] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:13] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:49] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:31:13] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:31:32] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:31:36] (03CR) 10Andrew Bogott: "seems good! I'm always amazed at how many files we have to touch for something like this :(" [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [15:31:38] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10126290 (10Eevans) >>! In T373102#10121495, @MatthewVernon wrote: > These racks have the following Swift/Ceph nodes: > - ms-f... 
[15:32:09] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez) [15:32:39] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2103.codfw.wmnet with OS bullseye [15:32:49] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2103 [15:32:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [15:34:34] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:34:54] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:35:05] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:35:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10126317 (10klausman) >>! In T374207#10126245, @Jhancock.wm wrote: > @klausman this one isn't under warranty and I don't have an exact m... 
[15:35:23] (03CR) 10JHathaway: [C:03+1] "looks good, some minor suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/1003442 (owner: 10Slyngshede) [15:35:47] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:05] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:36:47] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:37:05] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:37:05] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:37:51] (03PS1) 10Ilias Sarantopoulos: ml-services: re-deploy prod articlequality and update staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071232 (https://phabricator.wikimedia.org/T360455) [15:39:44] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2103 - kamila@cumin1002" [15:39:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2103 - kamila@cumin1002" [15:39:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:49] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2103.codfw.wmnet 60.16.192.10.in-addr.arpa 0.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:39:52] !log kamila@cumin1002 END (PASS) - 
Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2103.codfw.wmnet 60.16.192.10.in-addr.arpa 0.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:39:53] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2103 [15:40:15] (03CR) 10BryanDavis: "I think all of this is rearranging the deck chairs on the Titanic. Effie is working on T292707 and the child task T371374 that will be cha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński) [15:40:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2103 [15:40:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2103 [15:40:27] PROBLEM - Host mw2322 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:33] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [15:41:01] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:41:01] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:42:04] the check.wikimedia-dns.org would be because durum hosts are booted [15:42:13] the other ones should have no relation [15:42:33] !log install spicerack 8.13.0 on cumin1002 [15:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:42] 10ops-codfw, 06DC-Ops, 06serviceops: Comm Error: backplane 0 when reimaging wikikube-worker2095 - https://phabricator.wikimedia.org/T374258 (10hnowlan) 03NEW [15:42:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247#10126347 (10cmooney) 05Open→03Resolved Ok we have replaced the optic in lvs1017 (same model as the one taken from lvs1019 
for the record),... [15:42:43] !log enabling PyBal on lvs1017 to make primary again after repairing faulty fiber link T374247 [15:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:46] T374247: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247 [15:43:01] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:43:01] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:43:04] (03PS21) 10Elukey: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [15:43:33] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1017.eqiad.wmnet [15:43:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1017.eqiad.wmnet [15:43:55] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@ad2c434] (releasing): (no justification provided) [15:44:21] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10126366 (10Jhancock.wm) I've made an RMA request with Dell. Should be here early next week. 
[15:44:37] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@ad2c434] (releasing): (no justification provided) (duration: 00m 41s) [15:44:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10126370 (10Jclark-ctr) ganeti1039 b2 u4 cableid 4893 port 2 [15:45:35] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [15:47:39] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:48:41] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:49:28] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:50:11] (03PS2) 10Jdlrobson: Enable appearance menu for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) [15:50:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [15:50:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt) [15:50:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 
(https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson) [15:52:59] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:53:33] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:53:49] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:53:51] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:53:51] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:54:31] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:49] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:51] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:54:51] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:55:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071236 [15:55:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071236 (owner: 10TrainBranchBot) [15:58:22] (03PS1) 10EoghanGaffney: lists: Mask 
mailman3 service on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/1071237 [15:58:26] (03PS1) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [15:58:50] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [15:58:53] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:58:58] (03CR) 10Elukey: [C:03+2] sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:59:01] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:02] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2103.codfw.wmnet with reason: host reimage [15:59:11] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:27] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:00:11] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:00:23] PROBLEM - Host mw2430 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:29] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:00:53] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: 
UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:01:01] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2103.codfw.wmnet with reason: host reimage [16:02:55] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:02:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10126458 (10elukey) My bad, it was because my factory reset for some reason didn't restore the ADMIN password to its original state. Thanks for the follow up! [16:04:07] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10126459 (10elukey) Great news, the first version of the Supermicro support in provision is live on cumin nodes (nam... 
[16:04:23] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2095.codfw.wmnet with OS bullseye [16:04:33] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126461 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [16:04:57] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:04:59] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:05:59] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:05:59] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:06:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum [16:09:32] (03PS1) 10Jdrewniak: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) [16:14:00] Southparkfan: does your grafana login work now? 
[16:16:05] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126495 (10hnowlan) [16:21:57] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071236 (owner: 10TrainBranchBot) [16:24:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2103.codfw.wmnet with OS bullseye [16:25:03] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126550 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker2103.codfw.wmnet with OS bullseye co... [16:29:13] (03PS3) 10JHathaway: vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) [16:29:52] (03CR) 10JHathaway: "done!" [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: 10JHathaway) [16:30:17] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2103.codfw.wmnet [16:30:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2103.codfw.wmnet [16:30:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2103.codfw.wmnet [16:30:30] 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126581 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by kamila@cumin1002 Renumbering for host wikikube-worker2103.codfw.wmnet com... 
[16:32:01] (03CR) 10CI reject: [V:04-1] vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: 10JHathaway) [16:32:53] (03PS4) 10JHathaway: vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) [16:33:27] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:42:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 335, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:42:56] (03CR) 10Dzahn: [C:03+2] Fix firewall service definitions for CI [puppet] - 10https://gerrit.wikimedia.org/r/1071175 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [16:46:14] !log ran homer on cr*codfw* for T372878 [16:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:18] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:48:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 417, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:30] mutante: yes, works fine - thanks :) [16:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:52:08] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10126668 (10kamila) [16:52:44] Southparkfan: great to hear, then it was indeed the LDAP sync [16:54:21] someone else was renamed and this broke the script, Filippo/o11y is tracking it in 
https://phabricator.wikimedia.org/T374173#10124175 [16:55:34] (03PS2) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [16:55:52] yep, he fixed one sync run so it works for you. but there is also follow-up ticket for next time [16:55:57] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [16:58:19] signing the actual volunteer NDA was probably the least complex part here [16:58:47] (03PS3) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [16:59:09] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [16:59:54] but it works now, much thanks. 
finally part of the NDA community - only waiting on Netbox access to be fixed, but I/F is working on it [17:01:28] Southparkfan: every once in a while we need someone like you to ask for it to test the process, heh [17:02:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071245 [17:02:51] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071245 (owner: 10TrainBranchBot) [17:04:11] (03PS4) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:04:32] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:06:12] (03PS5) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:06:33] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:08:11] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:08:45] (03PS6) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:09:07] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:11:28] (03PS7) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:11:49] (03CR) 10CI 
reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:12:38] (03PS1) 10Kamila Součková: kubernetes: rename mw2431 to wikikube-worker2104 [puppet] - 10https://gerrit.wikimedia.org/r/1071246 (https://phabricator.wikimedia.org/T372878) [17:13:10] (03PS8) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:16:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:17:09] (03CR) 10Dzahn: [C:03+2] "looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/1071175 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [17:19:08] (03CR) 10Dzahn: [C:03+2] Correct firewall services for releases [puppet] - 10https://gerrit.wikimedia.org/r/1071076 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [17:19:29] (03PS9) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:22:12] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:23:39] (03PS10) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:26:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:27:27] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10126728 (10Dwisehaupt) Host OS installed and built out with basics. 
Awaiting the completion of T374269 to finish config and testing. [17:28:18] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10126731 (10Dwisehaupt) payments2006 built out and mariadb cloned out. Awaiting completion of T374269 to finish config and testing. [17:28:26] (03CR) 10Scott French: [C:03+1] kubernetes: rename mw2431 to wikikube-worker2104 [puppet] - 10https://gerrit.wikimedia.org/r/1071246 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [17:34:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071245 (owner: 10TrainBranchBot) [17:35:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:22] (03CR) 10Dzahn: [C:03+2] "looks good. https://releases.wikimedia.org/ still works and httpb from deployment can talk to releases1003" [puppet] - 10https://gerrit.wikimedia.org/r/1071076 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [17:36:42] (03CR) 10Legoktm: "From what I remember, there's only supposed to be one runner per queue, and having e.g. two outbound runners might be an issue? 
Idk if tha" [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney) [17:37:34] (03PS11) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:38:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:40:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:41:19] (03PS12) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:42:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:42:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:51:53] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:52:30] (03PS13) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [17:52:52] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [17:54:47] PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [17:55:04] !log Import corto 0.3.1-1 into bookworm-wikimedia apt archive [17:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10126769 (10Jhancock.wm) 05Open→03Resolved cool. I'll do that. thanks! [17:58:44] what's going on with the asw2-d-eqiad VC? [18:00:26] (03PS14) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [18:02:56] (03PS1) 10Scott French: admin: normalize swfrench dot files across hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071247 [18:07:00] (03PS21) 10BCornwall: Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) [18:08:26] bblack cdanis: ^ one of the VCPs seems to be flapping for months, and the linked Wikitech page requires a high prio task for netops - thoughts? 
[18:08:47] (03CR) 10Scott French: [C:03+2] admin: normalize swfrench dot files across hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071247 (owner: 10Scott French) [18:08:56] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3903/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [18:08:59] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:10:04] nothing user-facing, but resilience loss isn't ideal either [18:12:18] !log Import ncmonitor 1.2.1-1 into bookworm-wikimedia apt archive [18:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:16:55] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:20:32] (03PS1) 10Dreamy Jazz: Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) [18:20:56] FIRING: [2x] RoutinatorRsyncErrors: 
Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:22:34] (03CR) 10BCornwall: [V:03+1 C:03+2] "I added the config line of the gdrive-creds.json file. I'm taking the liberty of just merging this in since it's a really low-risk additio" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [18:23:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz) [18:25:57] (03PS1) 10Jforrester: tests: Disable all Beta Cluster CI testing, all failing [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071253 (https://phabricator.wikimedia.org/T374242) [18:26:11] (03PS1) 10Jforrester: Don't pass empty type/returnType to zobject lookup when undefined [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199) [18:26:31] (03PS2) 10Jforrester: Don't pass empty type/returnType to zobject lookup when undefined [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199) [18:26:55] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [18:27:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:28:59] (03PS15) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) [18:29:00] (03CR) 10Kosta Harlan: [C:03+1] Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz) [18:29:57] PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [18:30:59] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [18:31:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:34] (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking) [18:31:41] (03PS1) 10Scott French: admin: tweak swfrench dot files [puppet] - 10https://gerrit.wikimedia.org/r/1071255 [18:32:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2200.codfw.wmnet with reason: Maintenance [18:32:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:32:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2 days, 7:00:00 on db2200.codfw.wmnet with reason: Maintenance [18:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:34:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:36:24] (03CR) 10Scott French: [C:03+2] admin: tweak swfrench dot files [puppet] - 10https://gerrit.wikimedia.org/r/1071255 (owner: 10Scott French) [18:36:29] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10126900 (10BCornwall) 05Open→03Resolved Corto's now running on alert1001 :) [18:39:06] Southparkfan: thanks, filed T374272 [18:39:06] T374272: asw2-d-eqiad vcp links flapping - https://phabricator.wikimedia.org/T374272 [18:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:40:40] cdanis: cool - as far as I know there's nothing critical in D4 either [18:42:22] although its leaf is runnin g on one link now, hopefully the other one doesn't flame out [18:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki 
mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:47:31] (03PS1) 10Ebernhardson: cirrus: Also exclude labtestwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071258 [18:50:51] (03CR) 10Ebernhardson: [C:03+2] cirrus: Also exclude labtestwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071258 (owner: 10Ebernhardson) [18:51:58] (03Merged) 10jenkins-bot: cirrus: Also exclude labtestwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071258 (owner: 10Ebernhardson) [18:57:14] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:57:20] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:59:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071253 (https://phabricator.wikimedia.org/T374242) (owner: 10Jforrester) [19:00:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199) (owner: 10Jforrester) [19:00:56] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:01:03] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:06:56] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@6ca00a7] (releasing): (no 
justification provided) [19:07:40] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@6ca00a7] (releasing): (no justification provided) (duration: 00m 43s) [19:10:48] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10127037 (10VRiley-WMF) Ah, the blinking light did activate. I have swapped the HDD, and it should be good to go. Let us know if there is anything else we can help with. Th... [19:11:11] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10127039 (10VRiley-WMF) 05Open→03Resolved [19:31:51] (03CR) 10Bartosz Dziewoński: "Yeah, looks like the whole wikitech.php file is about to be removed (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/105933" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński) [19:31:56] (03Abandoned) 10Bartosz Dziewoński: wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński) [19:45:25] (03CR) 10Dzahn: [C:03+2] Cleanup firewall::service configs for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/1071072 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [20:02:07] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, wikikube- [20:02:07] 84.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2076.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2010.codfw.wmnet, 
mw2351.codfw.wmnet, mw2425.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, wikikube-worker2062.cod [20:02:07] , kubernetes2016.codfw.wmnet, mw2394.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2440.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2014.codfw.wmnet, wikikube-worker2101.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal [20:02:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2424.codfw.wmnet, parse2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2099.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw [20:02:35] fw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2313.codfw.wmnet, mw2302.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2397.codfw.wmnet, mw2314.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker2014.codfw.wmnet, parse2012.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worke [20:02:35] dfw.wmnet, kubernetes2044.codfw.wmnet, mw2336.codfw.wmnet, parse2014.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2426.codfw.wmnet, mw2371.codfw.wmnet, wikikube-w https://wikitech.wikimedia.org/wiki/PyBal [20:03:46] (03CR) 10Dzahn: [C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1071072 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [20:04:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:04:35] RECOVERY - 
PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:06:17] (03CR) 10Dzahn: [C:03+1] "looks reasonable to me. I would have probably used a selector but nothing wrong with this that I could argue about :)" [puppet] - 10https://gerrit.wikimedia.org/r/1071237 (owner: 10EoghanGaffney) [20:07:37] (03CR) 10Dzahn: [C:03+2] Cleanup firewall::service configs for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1071073 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [20:13:26] (03CR) 10Dzahn: [C:03+2] "/etc/nftables/input/10_miscweb-http-envoy.nft]/ensure: removed - no problem, os-reports.wikimedia.org is up, as an example" [puppet] - 10https://gerrit.wikimedia.org/r/1071073 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [20:15:25] (03CR) 10Dzahn: [C:03+2] Revert "codesearch: replace ferm::service with firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/1071176 (owner: 10Muehlenhoff) [20:21:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 58.12s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:23:48] (03CR) 10Dzahn: [C:03+2] "noop confirmed on codesearch9 - had no effect until we actually switch the firewall provider" [puppet] - 10https://gerrit.wikimedia.org/r/1071176 (owner: 10Muehlenhoff) [20:26:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:26:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3m 43s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:30:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:31:03] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in - https://phabricator.wikimedia.org/T325406#10127186 (10jhathaway) [20:39:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 7.187s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:39:22] (03PS2) 10Jforrester: Use default width/height on gallery to avoid parser instance [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071265 (https://phabricator.wikimedia.org/T374146) [20:40:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071265 (https://phabricator.wikimedia.org/T374146) (owner: 10Jforrester) 
[20:40:49] (03PS1) 10Jforrester: ZObjectStore::findZTesterResult: Trim our own error so we don't break logstash [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071266 (https://phabricator.wikimedia.org/T374241) [20:44:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 7.187s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:45:45] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2079.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wiki [20:45:45] ker2036.codfw.wmnet, parse2009.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, parse2018.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker2027.codfw.wmnet, [20:45:45] odfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2090.c https://wikitech.wikimedia.org/wiki/PyBal [20:46:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, 
wikikube-worker2086.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-wor [20:46:09] codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw2431.codfw.wmnet, wikikube-worker2022.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, mw2419.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wik [20:46:09] rker2090.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2014.codfw.wmnet, wikikube-worker2062.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2353.codfw.wmnet, mw2449.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal [20:47:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 4m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:48:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:48:45] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:50:38] (03CR) 10Ebernhardson: [C:03+1] "curious..I also remember having issues in the past querying un-mapped fields. 
But indeed the MediaSearch query is querying it, and i did a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse) [20:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:52:14] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1071146/3905/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1071146 (owner: 10Muehlenhoff) [20:52:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 4m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:57:57] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10127280 (10Dwisehaupt) 05Open→03Resolved Host is built and config will continue in T372933 [20:58:19] (03CR) 10Dzahn: [V:03+1 C:03+2] "looks good, only effect is on ssh between phab servers and it just does the resolve. (we are still using ferm here)" [puppet] - 10https://gerrit.wikimedia.org/r/1071146 (owner: 10Muehlenhoff) [20:59:08] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: hw troubleshooting: host won't boot lists backplane error for pay-lb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374054#10127286 (10Dwisehaupt) 05Open→03Resolved Thanks. It's back online and up. Hopefully it has a transient error. 
[21:02:43] (03PS4) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) [21:03:04] (03CR) 10CI reject: [V:04-1] phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:03:21] (03CR) 10Dzahn: "how about this instead https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071028" [puppet] - 10https://gerrit.wikimedia.org/r/1071147 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff) [21:04:37] (03PS5) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) [21:11:17] (03CR) 10Cwhite: [C:03+2] logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651) (owner: 10Cwhite) [21:11:27] (03PS2) 10Cwhite: logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651) [21:17:08] (03CR) 10Cwhite: [V:03+2 C:03+2] logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651) (owner: 10Cwhite) [21:36:03] jjj [21:36:09] oops [21:56:46] 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10127348 (10colewhite) 05Open→03In progress a:03colewhite [22:14:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:16:55] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:21:44] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:35:03] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:35:08] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:37:42] (03PS2) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) [22:37:47] (03PS3) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) [22:37:58] (03CR) 10Jdlrobson: [C:04-1] "I think this is okay to merge once we've setup the messages and defined audience correctly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [22:39:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński) [22:39:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 (owner: 10Bartosz Dziewoński) [22:40:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:41:44] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [22:53:16] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:06:56] PROBLEM - SSH on aphlict1002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:07:56] RECOVERY - SSH on aphlict1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:08:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - 
Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:10:18] (03CR) 10Jdlrobson: [C:04-1] Add Web search experiment quickSurvey on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [23:11:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:15:56] (03PS2) 10Jdrewniak: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) [23:16:25] FIRING: SystemdUnitFailed: user@499.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:20:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:29:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:31:25] RESOLVED: SystemdUnitFailed: user@499.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:00] (03PS3) 10Jdrewniak: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) [23:38:30] (03PS1) 10TrainBranchBot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071279 [23:38:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071279 (owner: 10TrainBranchBot) [23:38:40] (03CR) 10Jdlrobson: [C:03+1] "This looks good to deploy to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [23:39:47] (03PS4) 10Jdlrobson: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [23:40:04] (03CR) 10Jdlrobson: [C:03+1] Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak) [23:43:07] Hello folks, it's Friday afternoon, but I'm wondering if it's ok to deploy a beta-cluster only config change? [23:55:26] nvm