[00:04:45] <wikibugs>	 (CR) Bugreporter: "Unresolve. A corresponding talk namespace must be defined, otherwise such page will go nowhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[00:05:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071053 (owner: TrainBranchBot)
[00:08:11] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10123906 (Papaul)
[00:09:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1495 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:18:28] <wikibugs>	 (PS4) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130)
[00:23:49] <icinga-wm>	 PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:24:27] <wikibugs>	 (PS5) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130)
[00:37:01] <wikibugs>	 (CR) RLazarus: sre.switchdc.mediawiki: Wait for k8s maintenance jobs to stop (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus)
[00:49:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:49:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:49:47] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:51:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:01:56] <wikibugs>	 (CR) Krinkle: logging: Fix local variables leaking into global scope (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński)
[01:07:55] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:17] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:16:20] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[01:20:47] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:55] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:45] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:55] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:23:41] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:27:15] <wikibugs>	 SRE, Editing-team, Fundraising-Backlog, Traffic-Icebox, and 5 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085#10123988 (Pppery) Open→Stalled
[01:39:52] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Reuven!" [cookbooks] - https://gerrit.wikimedia.org/r/1070673 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus)
[01:40:07] <wikibugs>	 (PS4) Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716
[01:45:24] <wikibugs>	 (CR) Krinkle: [C:+1] "Test plan:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński)
[01:47:00] <wikibugs>	 (PS3) Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - https://gerrit.wikimedia.org/r/1069344
[01:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:29] <wikibugs>	 (CR) Krinkle: [C:+1] "The diff looks fine but I don't really trust any static review of this. Let's test this by cherry-picking on mwdebug instead and verifying" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069344 (owner: Bartosz Dziewoński)
[01:48:34] <wikibugs>	 (PS2) Bartosz Dziewoński: logging: Simplify extra debug logging configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1070685
[01:54:16] <wikibugs>	 (CR) Krinkle: logging: Simplify extra debug logging configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1070685 (owner: Bartosz Dziewoński)
[01:59:03] <wikibugs>	 (Restored) Krinkle: wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[01:59:06] <wikibugs>	 (PS2) Krinkle: wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[01:59:45] <wikibugs>	 (CR) CI reject: [V:-1] wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[02:01:33] <wikibugs>	 (CR) Krinkle: "@bd808@wikimedia.org @abogott@wikimedia.org: It seems the debug file enabled here is similar to what we already have in logging.php with M" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[02:02:08] <wikibugs>	 (PS3) Krinkle: wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[02:20:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:36:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:23] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:33] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:57:37] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:01:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:15:17] <vgutierrez>	 !log depooling cp2041 && cp2038 due to high purged lag
[04:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:34] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp(2038|2041).codfw.wmnet
[04:24:48] <wikibugs>	 SRE, Traffic, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10124079 (Vgutierrez) this has been triggered again in cp2038 and cp2041: ` vgutierrez@cumin1002:~$ sudo -i cumin 'cp[2038,2041].codfw.wmnet' 'journalctl -u purged.service --sin...
[04:24:56] <vgutierrez>	 !log restarting purged in cp2038 && cp2041 - T334078
[04:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:01] <stashbot>	 T334078: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078
[04:39:30] <wikibugs>	 SRE, MediaWiki-Engineering, MediaWiki-extensions-BounceHandler, Observability-Metrics, Grafana: Bouncehandler is broken - https://phabricator.wikimedia.org/T338761#10124092 (Krinkle) I've documented the following on Wikitech: https://wikitech.wikimedia.org/wiki/BounceHandler  >>! From **[...
[04:51:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:57:43] <vgutierrez>	 !log repool cp2038
[04:57:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:49] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:05:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:30] <jinxer-wm>	 FIRING: Processor usage over 85%: Alert for device cr1-codfw.wikimedia.org - Processor usage over 85%   - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[05:12:33] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:21] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[05:22:30] <jinxer-wm>	 RESOLVED: Processor usage over 85%: Device cr1-codfw.wikimedia.org recovered from Processor usage over 85%   - https://alerts.wikimedia.org/?q=alertname%3DProcessor+usage+over+85%25
[05:29:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:29:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:37] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:39] <vgutierrez>	 !log repool cp2041
[05:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:54:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:54:45] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:41] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:49] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:31] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] P:idp Limit the number of groups pushed to DebMonitor. [puppet] - https://gerrit.wikimedia.org/r/1070594 (owner: Slyngshede)
[06:17:45] <icinga-wm>	 RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[06:18:23] <icinga-wm>	 PROBLEM - SSH on gerrit1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:24:09] <icinga-wm>	 PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:24:47] <wikibugs>	 (PS3) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538)
[06:24:47] <wikibugs>	 (PS1) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[06:25:25] <wikibugs>	 (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:25:28] <wikibugs>	 (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:28:50] <wikibugs>	 (PS4) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538)
[06:28:51] <wikibugs>	 (PS2) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[06:28:53] <wikibugs>	 (CR) C. Scott Ananian: "Done (well, in the commit message)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:29:28] <wikibugs>	 (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:29:32] <wikibugs>	 (CR) CI reject: [V:-1] Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538) (owner: C. Scott Ananian)
[06:30:31] <wikibugs>	 (PS5) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on most wikis which use it [mediawiki-config] - https://gerrit.wikimedia.org/r/1070975 (https://phabricator.wikimedia.org/T363538)
[06:30:31] <wikibugs>	 (PS3) C. Scott Ananian: Elevate pseudo-namespace MOS to a real namespace on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1071067 (https://phabricator.wikimedia.org/T363538)
[06:54:39] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1071024 (owner: Dzahn)
[06:57:18] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2087.codfw.wmnet
[06:57:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124177 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi...
[06:57:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2087.codfw.wmnet with OS bullseye
[06:57:41] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2087.codfw.wmnet with OS bullseye
[06:57:41] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2087.codfw.wmnet
[06:57:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124178 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[06:57:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[06:57:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124180 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f...
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240906T0700)
[07:00:48] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2087.codfw.wmnet with OS bullseye
[07:01:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[07:09:33] <wikibugs>	 (PS1) JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - https://gerrit.wikimedia.org/r/1071071
[07:10:49] <icinga-wm>	 RECOVERY - Host gerrit1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.16 ms
[07:12:02] <wikibugs>	 (PS2) JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - https://gerrit.wikimedia.org/r/1071071
[07:12:25] <wikibugs>	 (PS1) Muehlenhoff: Cleanup firewall::service configs for aphlict [puppet] - https://gerrit.wikimedia.org/r/1071072 (https://phabricator.wikimedia.org/T370677)
[07:17:13] <icinga-wm>	 PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2087.codfw.wmnet with reason: host reimage
[07:20:00] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: deploy the scheduler via a separate Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1070619 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol)
[07:20:11] <wikibugs>	 (PS1) Muehlenhoff: Cleanup firewall::service configs for miscweb [puppet] - https://gerrit.wikimedia.org/r/1071073 (https://phabricator.wikimedia.org/T370677)
[07:21:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2087.codfw.wmnet with reason: host reimage
[07:26:00] <wikibugs>	 (PS1) Jelto: deployment_server: add wikidata-query-gui service [puppet] - https://gerrit.wikimedia.org/r/1071075 (https://phabricator.wikimedia.org/T350793)
[07:31:37] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:32:16] <wikibugs>	 (PS1) Muehlenhoff: Correct firewall services for releases [puppet] - https://gerrit.wikimedia.org/r/1071076 (https://phabricator.wikimedia.org/T370677)
[07:33:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:36:03] <wikibugs>	 (PS1) Brouberol: airflow: fix badly formatted Deployment separation [deployment-charts] - https://gerrit.wikimedia.org/r/1071077 (https://phabricator.wikimedia.org/T368737)
[07:36:22] <icinga-wm>	 PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:37:31] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: fix badly formatted Deployment separation [deployment-charts] - https://gerrit.wikimedia.org/r/1071077 (https://phabricator.wikimedia.org/T368737) (owner: Brouberol)
[07:39:22] <icinga-wm>	 RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:39:34] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:40:12] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:46:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:49:25] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow
[07:49:35] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 10s)
[07:51:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts acmechief1001.eqiad.wmnet
[07:51:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:52:52] <wikibugs>	 (PS1) Slyngshede: P:idm: Add ecdsa-sha2-nistp256 to allowed key types. [puppet] - https://gerrit.wikimedia.org/r/1071123 (https://phabricator.wikimedia.org/T371956)
[07:55:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:56:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:58:16] <wikibugs>	 (PS1) JMeybohm: rename/renumber kubernetes2020,2033 to wikikube-worker2093,2094 [puppet] - https://gerrit.wikimedia.org/r/1071124 (https://phabricator.wikimedia.org/T372878)
[07:58:49] <wikibugs>	 (CR) Slyngshede: [V:+1] "Added Jesse as reviewer to get input on the sanity of the Puppet code." [puppet] - https://gerrit.wikimedia.org/r/1003442 (owner: Slyngshede)
[07:59:02] <wikibugs>	 (CR) JMeybohm: [C:+2] rename/renumber kubernetes2020,2033 to wikikube-worker2093,2094 [puppet] - https://gerrit.wikimedia.org/r/1071124 (https://phabricator.wikimedia.org/T372878) (owner: JMeybohm)
[07:59:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2020.codfw.wmnet
[08:00:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:15] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2020.codfw.wmnet
[08:00:20] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2033.codfw.wmnet
[08:00:54] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host kubernetes2033.codfw.wmnet
[08:01:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:01:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:01:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts acmechief1001.eqiad.wmnet
[08:01:29] <wikibugs>	 SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124321 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002...
[08:03:38] <wikibugs>	 (PS1) Slyngshede: Git: Add missing .gitreview file. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180)
[08:06:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:07:10] <wikibugs>	 (PS1) Tiziano Fogli: opensearch: ignore hosts with unknown team in role_owner [alerts] - https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178)
[08:08:15] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:09:37] <wikibugs>	 (PS1) Slyngshede: P:idp Prometheus blackbox monitoring for IDP. [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655)
[08:10:39] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3898/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:11:11] <wikibugs>	 (PS1) Elukey: admin_ng: set disablePSPMutations for AUX [deployment-charts] - https://gerrit.wikimedia.org/r/1071132 (https://phabricator.wikimedia.org/T369491)
[08:13:00] <wikibugs>	 (CR) DCausse: "correct, although I might perhaps be overcautious because I remember we had issues with querying un-mapped fields in the past... but looki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: DCausse)
[08:13:12] <wikibugs>	 (PS5) DCausse: search: use the stem field when searching mul labels [mediawiki-config] - https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401)
[08:13:48] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3899/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:14:56] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3900/console" [puppet] - https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: Slyngshede)
[08:18:20] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2020 to wikikube-worker2093
[08:18:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:18:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2033 to wikikube-worker2094
[08:19:37] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:11] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:22] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3901/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede)
[08:23:39] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:23:45] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:24:29] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2087.codfw.wmnet with OS bullseye
[08:24:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts acmechief2001.codfw.wmnet
[08:24:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124359 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[08:25:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede)
[08:25:48] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Git: Add missing .gitreview file. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede)
[08:27:45] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:27:50] <wikibugs>	 (03Merged) 10jenkins-bot: Git: Add missing .gitreview file. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1071126 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede)
[08:28:00] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3902/console" [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede)
[08:28:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:31:00] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:31:13] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2093
[08:31:26] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2093
[08:31:35] <wikibugs>	 (03PS1) 10Elukey: services: update Proton's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071134 (https://phabricator.wikimedia.org/T367981)
[08:31:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2033 to wikikube-worker2094 - jayme@cumin1002"
[08:31:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2033 to wikikube-worker2094 - jayme@cumin1002"
[08:31:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:31:58] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2094
[08:32:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2020 to wikikube-worker2093
[08:32:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2094
[08:32:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes202...
[08:32:55] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Enable IPv6 for the envoyproxy on DPE Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/1070949 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis)
[08:32:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2033 to wikikube-worker2094
[08:33:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes203...
[08:36:45] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2087.codfw.wmnet
[08:36:47] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2087.codfw.wmnet
[08:36:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:36:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: acmechief2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:36:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:36:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts acmechief2001.codfw.wmnet
[08:37:01] <wikibugs>	 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124378 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002...
[08:38:33] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: update Proton's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071134 (https://phabricator.wikimedia.org/T367981) (owner: 10Elukey)
[08:38:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: kubernetes2035 (renamed to wikikube-worker2087) reporting "Comm Error: Backplane 0" - https://phabricator.wikimedia.org/T374019#10124380 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Reimage worked fine now, thanks!
[08:40:34] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/proton: sync
[08:41:19] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: sync
[08:42:08] <wikibugs>	 (03PS3) 10JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071
[08:42:17] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2093.codfw.wmnet
[08:42:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124386 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi...
[08:42:34] <wikibugs>	 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124387 (10MoritzMuehlenhoff)
[08:43:04] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2093.codfw.wmnet with OS bullseye
[08:43:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2093
[08:43:19] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:43:33] <wikibugs>	 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 06Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10124388 (10MoritzMuehlenhoff) 05Open→03Resolved All done!
[08:43:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[08:45:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] opensearch: ignore hosts with unknown team in role_owner [alerts] - 10https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli)
[08:45:56] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "Plan is: rollout blackbox check, then absent Icinga checks and finally remove them from Puppet." [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede)
[08:46:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2093 - jayme@cumin1002"
[08:46:31] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2093 - jayme@cumin1002"
[08:46:32] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:46:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2093.codfw.wmnet 135.16.192.10.in-addr.arpa 5.3.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:46:35] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2093.codfw.wmnet 135.16.192.10.in-addr.arpa 5.3.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:46:35] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2093
[08:47:53] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2093
[08:47:53] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2093
[08:48:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2094.codfw.wmnet
[08:48:25] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/proton: sync
[08:48:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2094.codfw.wmnet with OS bullseye
[08:48:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124417 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi...
[08:48:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2094
[08:48:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[08:48:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:49:47] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: sync
[08:49:48] <wikibugs>	 (03PS5) 10DCausse: wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009)
[08:49:48] <wikibugs>	 (03PS5) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009)
[08:49:48] <wikibugs>	 (03PS5) 10DCausse: wdqs: do not add categories on main and scholarly endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009)
[08:49:51] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add the anycast VIP for radosgw to DPE Ceph servers [puppet] - 10https://gerrit.wikimedia.org/r/1070950 (https://phabricator.wikimedia.org/T330153) (owner: 10Btullis)
[08:51:55] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2094 - jayme@cumin1002"
[08:51:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:51:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2094 - jayme@cumin1002"
[08:52:00] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:00] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2094.codfw.wmnet 224.16.192.10.in-addr.arpa 4.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:52:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2094.codfw.wmnet 224.16.192.10.in-addr.arpa 4.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:52:04] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2094
[08:52:26] <wikibugs>	 (03PS1) 10Brouberol: airflow: configure metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098)
[08:52:49] <wikibugs>	 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10124427 (10elukey)
[08:53:43] <wikibugs>	 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10124429 (10elukey) ` dragonfly-supernode | 1.0.6-2 | bookworm-wikimedia | main | amd64 `  Next steps: - reimage codfw outside the deployment window - let it bake for some days - do the same for eqiad
[08:53:47] <wikibugs>	 06SRE, 06serviceops: Migrate dragonfly-supernodes to Bookworm - https://phabricator.wikimedia.org/T332011#10124430 (10elukey)
[08:54:37] <wikibugs>	 (03PS2) 10Brouberol: airflow: configure metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098)
[08:54:37] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2094
[08:54:37] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2094
[08:55:37] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync
[08:57:23] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync
[09:00:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Very cool! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede)
[09:07:33] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol)
[09:09:04] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Tested on build2001, worked nicely :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey)
[09:10:22] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: configure metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071138 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol)
[09:12:05] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:12:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2094.codfw.wmnet with reason: host reimage
[09:12:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:14:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix up Phabricator firewall services, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1071146
[09:14:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix up Phabricator firewall services, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1071147 (https://phabricator.wikimedia.org/T370677)
[09:15:01] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2094.codfw.wmnet with reason: host reimage
[09:16:21] <jinxer-wm>	 FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[09:17:45] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:18:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071123 (https://phabricator.wikimedia.org/T371956) (owner: 10Slyngshede)
[09:20:27] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet
[09:24:09] <wikibugs>	 (03PS4) 10Elukey: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410)
[09:25:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] service: Remove php7.2 specific health check [puppet] - 10https://gerrit.wikimedia.org/r/1070993 (owner: 10Alexandros Kosiaris)
[09:25:35] <wikibugs>	 (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey)
[09:26:34] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet
[09:27:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:28:39] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2005.codfw.wmnet
[09:35:23] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2005.codfw.wmnet
[09:37:44] <wikibugs>	 (03PS1) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737)
[09:38:43] <wikibugs>	 (03Merged) 10jenkins-bot: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey)
[09:42:11] <wikibugs>	 (03PS1) 10Muehlenhoff: debmonitor: Also use adduser on Bullseye to create the system user [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071154 (https://phabricator.wikimedia.org/T372472)
[09:42:56] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207 (10klausman) 03NEW
[09:45:58] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2094.codfw.wmnet with OS bullseye
[09:46:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[09:46:26] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2007.codfw.wmnet
[09:46:33] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071154 (https://phabricator.wikimedia.org/T372472) (owner: 10Muehlenhoff)
[09:46:58] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10124631 (10klausman) I already tried a reboot and a complete powercycle to revive the disk, to no avail.
[09:47:08] <jayme>	 !log homer lsw1-b6-codfw* commit 'T372878'
[09:47:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:10] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[09:47:55] <wikibugs>	 (03CR) 10Btullis: airflow: enable visualizing logs of DAG runs in the webserver UI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol)
[09:48:01] <jayme>	 !log homer cr*codfw* commit 'T372878'
[09:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:24] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2024.09.06 - 2024.09.27): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10124635 (10Gehel)
[09:48:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:50:17] <icinga-wm>	 PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:50:50] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2094.codfw.wmnet
[09:51:08] <wikibugs>	 (03CR) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol)
[09:53:04] <wikibugs>	 (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.13.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1071156
[09:53:39] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] sre.k8s.renumber-node: Run puppet on registry (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922 (owner: 10Clément Goubert)
[09:55:43] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 355, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:56:47] <wikibugs>	 (03PS5) 10Clément Goubert: sre.k8s.renumber-node: Run puppet on registry [cookbooks] - 10https://gerrit.wikimedia.org/r/1070922
[09:57:19] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2094.codfw.wmnet
[09:57:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2094.codfw.wmnet
[09:57:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10124728 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f...
[10:00:10] <wikibugs>	 06SRE, 10CheckUser, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210 (10Dreamy_Jazz) 03NEW
[10:00:53] <wikibugs>	 06SRE, 10CheckUser, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124779 (10Dreamy_Jazz)
[10:01:54] <wikibugs>	 (03PS1) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878)
[10:03:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10124781 (10JMeybohm)
[10:03:22] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124782 (10Dreamy_Jazz)
[10:05:28] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2093.codfw.wmnet with reason: host reimage
[10:05:47] <wikibugs>	 (03PS2) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737)
[10:05:51] <wikibugs>	 (03CR) 10Brouberol: airflow: enable visualizing logs of DAG runs in the webserver UI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol)
[10:06:47] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Thanks for that change to the networkpolicy." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol)
[10:06:54] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124795 (10Dreamy_Jazz)
[10:06:57] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124797 (10Dreamy_Jazz)
[10:07:02] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124798 (10Dreamy_Jazz)
[10:08:16] <wikibugs>	 (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.13.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1071156 (owner: 10Elukey)
[10:08:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] debmonitor: Also use adduser on Bullseye to create the system user [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071154 (https://phabricator.wikimedia.org/T372472) (owner: 10Muehlenhoff)
[10:09:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol)
[10:09:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2093.codfw.wmnet with reason: host reimage
[10:10:11] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: enable visualizing logs of DAG runs in the webserver UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071153 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol)
[10:10:45] <wikibugs>	 (03PS1) 10Elukey: Upstream release v8.13.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1071159
[10:11:00] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v8.13.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1071159 (owner: 10Elukey)
[10:12:55] <wikibugs>	 (03PS20) 10Elukey: sre.hosts.provision: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372)
[10:13:11] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: add BIOS/Mgmt-console support for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[10:17:22] <elukey>	 !log uploaded spicerack_8.13.0 to apt.wikimedia.org bullseye-wikimedia
[10:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump changelog [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071160
[10:20:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:21:10] <wikibugs>	 (03Abandoned) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 for wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1070973 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan)
[10:21:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597)
[10:23:10] <elukey>	 !log install spicerack 8.13.0 on cumin2002
[10:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:48] <icinga-wm>	 PROBLEM - Host db1246 #page is DOWN: PING CRITICAL - Packet loss = 100%
[10:24:06] <Emperor>	 here
[10:24:27] <Emperor>	 oncallers, you need help?
[10:24:29] <Amir1>	 I just woke up
[10:24:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10124846 (10Clement_Goubert)
[10:24:35] <akosiaris>	 here
[10:24:38] <akosiaris>	 !incidents
[10:24:38] <sirenbot>	 5138 (UNACKED)  Host db1246 (paged) - PING  - Packet loss = 100%
[10:24:42] <akosiaris>	 !ack 5138
[10:24:43] <sirenbot>	 5138 (ACKED)  Host db1246 (paged) - PING  - Packet loss = 100%
[10:24:52] <akosiaris>	 ok, now looking into what on earth
[10:25:03] <Amir1>	 let's depool it
[10:25:06] <Amir1>	 it's a normal replica
[10:25:11] <akosiaris>	 ok
[10:25:19] <akosiaris>	 who does it?
[10:25:25] <Amir1>	 on it
[10:25:31] <akosiaris>	 ok, thanks
[10:25:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1246 (It's sad)', diff saved to https://phabricator.wikimedia.org/P68731 and previous config saved to /var/cache/conftool/dbconfig/20240906-102551-ladsgroup.json
[10:26:05] <Amir1>	 https://www.irccloud.com/pastebin/pUGGhyXi/
[10:26:13] <Emperor>	 akosiaris: for future reference, https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica
[10:26:14] <Amir1>	 it's also for dumps
[10:26:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Bump changelog [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/1071160 (owner: 10Muehlenhoff)
[10:26:39] <icinga-wm>	 RECOVERY - Host db1246 #page is UP: PING WARNING - Packet loss = 50%, RTA = 28.59 ms
[10:27:04] <Amir1>	 I can't ssh into it, it's probably some hw/network issue
[10:27:21] <icinga-wm>	 RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:27:51] <icinga-wm>	 PROBLEM - SSH on db1246 is CRITICAL: connect to address 10.64.48.172 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:27:54] <akosiaris>	 yeah, ssh connection refused immediately
[10:27:55] <Emperor>	 it looks to have come back into single-user / emergency mode
[10:28:09] <Emperor>	 console is at "Give root password for maintenance" point
[10:28:31] <akosiaris>	 ah, so it failed the fsck 
[10:29:21] <Emperor>	 Amir1: shall I leave investigating the sad system to you and/or arnaud.b ?
[10:29:44] <Amir1>	 yeah, if there is a ticket, it'd be amazing
[10:29:47] <wikibugs>	 (03CR) 10Clément Goubert: k8s: rename mw232[012], kubernetes2031 to wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan)
[10:29:50] <Amir1>	 so I can eat breakfast
[10:29:53] <Amir1>	 and boot up
[10:30:01] <Emperor>	 I'll make one, tag it DBA
[10:30:26] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:30:28] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:30:33] <Emperor>	 akosiaris: you mind downtiming that host while I write up a ticket?
[10:30:49] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:31:29] <wikibugs>	 (03PS2) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878)
[10:31:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:31:47] <wikibugs>	 (03CR) 10Hnowlan: k8s: rename mw232[012], kubernetes2031 to wikikube-workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan)
[10:31:49] <akosiaris>	 Emperor: will do
[10:32:51] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode
[10:33:01] <Emperor>	 ta
[10:33:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode
[10:33:05] <akosiaris>	 I suppose duration in days?
[10:33:22] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 5.493 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:24] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.938 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:33:30] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode
[10:33:33] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on db1246.eqiad.wmnet with reason: Server failed, rebooted in emergency/single user mode
[10:33:36] <akosiaris>	 gave it a downtime of 5 days
[10:34:02] <Emperor>	 Yeah, no point it p.aging us over the weekend
[10:34:03] <Emperor>	 T374215
[10:34:04] <stashbot>	 T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215
[10:38:06] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan)
[10:38:42] <elukey>	 !log factory reset of sretest2001
[10:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:09] <moritzm>	 !log uploaded debmonitor-client 0.4.0-2+deb11u1 on bullseye-wikimedia (didn't rebuild the other suites since the fix is specific to Bullseye) T372472
[10:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:12] <stashbot>	 T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client - https://phabricator.wikimedia.org/T372472
[10:39:27] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[10:40:24] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[10:43:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet
[10:44:23] <wikibugs>	 (03PS1) 10Clément Goubert: kubernetes: Rename mw233[2-4] [puppet] - 10https://gerrit.wikimedia.org/r/1071164 (https://phabricator.wikimedia.org/T372878)
[10:47:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10124967 (10elukey) @Jhancock.wm Hi! I tried to factory reset the sretest2001's BMC, and now I am getting some errors when using the Redfish API (unauthorized etc..). I...
[10:52:27] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10124969 (10Ladsgroup) In less than a month, wikitech will go inside production and this wou...
[10:54:38] <wikibugs>	 (03PS2) 10Filippo Giunchedi: mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597)
[10:54:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: mediawiki: port account creation failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071165 (https://phabricator.wikimedia.org/T350597)
[10:59:28] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125008 (10Dreamy_Jazz) >>! In T374210#10124969, @Ladsgroup wrote: > In less than a month,...
[10:59:42] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078#10125005 (10Vgutierrez) 05Stalled→03In progress a:03Vgutierrez
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240906T0700)
[11:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240906T1100). Please do the needful.
[11:00:16] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on lists1004.wikimedia.org with reason: T373980
[11:00:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[11:00:25] <stashbot>	 T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed. - https://phabricator.wikimedia.org/T373980
[11:00:29] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on lists1004.wikimedia.org with reason: T373980
[11:02:03] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071168
[11:02:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071168 (owner: 10TrainBranchBot)
[11:02:18] <icinga-wm>	 PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[11:02:30] <jelto>	 ^ expected because of the reboot
[11:02:56] <moritzm>	 !log rolling out debmonitor-client 0.4.0-2+deb11u1 on bullseye-wikimedia on bullseye hosts T372472
[11:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:59] <stashbot>	 T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client - https://phabricator.wikimedia.org/T372472
[11:03:18] <arnaudb>	 wow I picked the right moment to eat
[11:03:32] <icinga-wm>	 RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[11:04:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet
[11:04:55] <wikibugs>	 06SRE, 10Bitu, 06Infrastructure-Foundations: Implementation of request flow - https://phabricator.wikimedia.org/T335474#10125026 (10SLyngshede-WMF) 05Open→03In progress
[11:06:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[11:08:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:09:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2093.codfw.wmnet with OS bullseye
[11:09:53] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125034 (10Dreamy_Jazz) Looking at the stack trace again, I see this isn't actually failing...
[11:10:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[11:13:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:04] <moritzm>	 !log installing Linux 5.10.223 on bullseye hosts
[11:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:35] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] k8s: rename mw232[012], kubernetes2031 to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1071158 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan)
[11:16:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10125060 (10kamila)
[11:17:04] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki: port login failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071161 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi)
[11:17:11] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki: port account creation failures alert from icinga/statsd [alerts] - 10https://gerrit.wikimedia.org/r/1071165 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi)
[11:20:09] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10125067 (10ABran-WMF) That bad reboot seems to stem from a hardware issue: `  The system board BP1 PG voltage is within range.  Fri Sep 06 2024 10:20:57  The system board BP1 PG voltage is outsid...
[11:20:34] <icinga-wm>	 PROBLEM - mailman3_runners on lists1004 is CRITICAL: PROCS CRITICAL: 15 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:20:51] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:21:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:22:48] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[11:22:48] <icinga-wm>	 status
[11:23:48] <jayme>	 !log homer lsw1-b6-codfw* commit 'T372878'
[11:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:52] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[11:24:24] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2031 to wikikube-worker2095
[11:24:36] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2320 to wikikube-worker2096
[11:24:41] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:25:21] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:26:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:26:46] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2093.codfw.wmnet
[11:26:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125115 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering f...
[11:27:54] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2031 to wikikube-worker2095 - hnowlan@cumin1002"
[11:28:43] <jinxer-wm>	 RESOLVED: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:28:51] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071168 (owner: 10TrainBranchBot)
[11:30:24] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:30:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2321:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2321 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:32:40] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] kubernetes: Rename mw233[2-4] [puppet] - 10https://gerrit.wikimedia.org/r/1071164 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert)
[11:32:45] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:32:46] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2096
[11:33:40] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2096
[11:33:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2332.codfw.wmnet
[11:33:52] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2031 to wikikube-worker2095 - hnowlan@cumin1002"
[11:33:52] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:33:53] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095
[11:34:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2332.codfw.wmnet
[11:34:19] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2320 to wikikube-worker2096
[11:34:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2333.codfw.wmnet
[11:34:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125185 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2320 to w...
[11:34:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix firewall service definitions for CI [puppet] - 10https://gerrit.wikimedia.org/r/1071175 (https://phabricator.wikimedia.org/T370677)
[11:34:56] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095
[11:34:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2333.codfw.wmnet
[11:35:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2334.codfw.wmnet
[11:35:31] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2321 to wikikube-worker2097
[11:35:35] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2031 to wikikube-worker2095
[11:35:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2334.codfw.wmnet
[11:35:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw2321:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:35:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2...
[11:35:48] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:37:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:37:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2010.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2302.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2062.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2353.codfw.wmnet, mw2394.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker
[11:37:40] <icinga-wm>	 fw.wmnet, wikikube-worker2087.codfw.wmnet, mw2395.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2037.codfw.wmnet, mw2369.codfw.wmnet, mw2437.codfw.wmnet, mw2445.codfw.wmnet, kubernetes2047.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2425.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2068.codfw.wmnet, mw2357.codfw.wmnet, wikikube-worker2038.codfw.wmnet, wikikube-worker2009.
[11:37:40] <icinga-wm>	 net, wikikube-worker2072.codfw.wmnet, mw2373.codfw.wmnet, kubernetes2049.codfw.wmnet, parse2015.codfw.wmnet, mw2311.codfw.wmnet, parse2011.codfw.wmnet, mw2446.codfw.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal
[11:37:42] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2056.codfw.wmnet, wikikube-worker2071.codfw.wmnet, mw2373.codfw.wmnet, mw2335.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2032.codfw.wmnet, kubernetes2005.codfw.wmnet, wikikube-worker2086.codfw.wmnet, mw2440.codfw.wmnet, mw2366.codfw.wmnet, mw2337.codfw.wmnet, kubernetes2006.codfw.wmnet, mw228
[11:37:42] <icinga-wm>	 wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2023.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2068.codfw.wmnet, mw2359.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2002.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2302.codfw.wmnet, mw2301.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2445.codfw.wmnet, wikikube-worker2038.codfw.wmnet, wikikube-worker2064.codfw.wmnet, kubernetes2015.codfw.wmnet, mw
[11:37:42] <icinga-wm>	 fw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2059.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2021.codfw.wmnet, wikikube-worker2070.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal
[11:38:08] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] kubernetes: Rename mw233[2-4] [puppet] - 10https://gerrit.wikimedia.org/r/1071164 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert)
[11:39:12] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2321 to wikikube-worker2097 - hnowlan@cumin1002"
[11:39:39] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2321 to wikikube-worker2097 - hnowlan@cumin1002"
[11:39:39] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:39:40] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2097
[11:39:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:39:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:40:10] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2097
[11:40:21] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:21] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2322 to wikikube-worker2098
[11:40:36] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191)
[11:40:36] <wikibugs>	 (03CR) 10Arnaudb: "I've tried to exclusively stick to the existing logic, only replacing the plumbing and wire to limit this iteration's scope" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb)
[11:40:37] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:40:50] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2321 to wikikube-worker2097
[11:40:51] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:41:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2321 to w...
[11:41:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2332 to wikikube-worker2099
[11:41:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:43:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "codesearch: replace ferm::service with firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/1071176
[11:43:57] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2322 to wikikube-worker2098 - hnowlan@cumin1002"
[11:45:57] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:46:07] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2322 to wikikube-worker2098 - hnowlan@cumin1002"
[11:46:08] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:46:08] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2098
[11:46:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:46:30] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2098
[11:47:09] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2322 to wikikube-worker2098
[11:47:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2322 to w...
[11:48:26] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2095.codfw.wmnet wikikube-worker2096.codfw.wmnet wikikube-worker2097.codfw.wmnet wikikube-worker2098.codfw.wmnet on all recursors
[11:48:29] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2095.codfw.wmnet wikikube-worker2096.codfw.wmnet wikikube-worker2097.codfw.wmnet wikikube-worker2098.codfw.wmnet on all recursors
[11:49:36] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2332 to wikikube-worker2099 - cgoubert@cumin1002"
[11:49:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2332 to wikikube-worker2099 - cgoubert@cumin1002"
[11:49:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:49:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2099
[11:49:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2099
[11:50:03] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2097.codfw.wmnet
[11:50:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2097.codfw.wmnet with OS bullseye
[11:50:20] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2096.codfw.wmnet
[11:50:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125289 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe...
[11:50:25] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2097
[11:50:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125290 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi...
[11:50:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125291 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe...
[11:50:30] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2096.codfw.wmnet with OS bullseye
[11:50:31] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:50:33] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2332 to wikikube-worker2099
[11:50:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw2334:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2334 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:50:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi...
[11:50:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2332 to...
[11:50:51] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2095.codfw.wmnet
[11:50:52] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2098.codfw.wmnet
[11:51:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125309 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe...
[11:51:02] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye
[11:51:07] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2098.codfw.wmnet with OS bullseye
[11:51:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125310 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe...
[11:51:14] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2095
[11:51:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125311 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi...
[11:51:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi...
[11:52:00] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2333 to wikikube-worker2100
[11:52:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 25% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:54:01] <wikibugs>	 (03PS3) 10Sergio Gimeno: EventStreamConfig and stream registration for homepage modules analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062416 (https://phabricator.wikimedia.org/T370907)
[11:54:04] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2097 - hnowlan@cumin1002"
[11:54:52] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2097 - hnowlan@cumin1002"
[11:54:52] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:54:52] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2097.codfw.wmnet 175.16.192.10.in-addr.arpa 5.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:54:56] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2097.codfw.wmnet 175.16.192.10.in-addr.arpa 5.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:54:56] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2097
[11:55:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:00:21] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2097
[12:00:21] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2097
[12:00:25] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071178
[12:00:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071178 (owner: 10TrainBranchBot)
[12:00:30] <hashar>	 what
[12:00:39] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[12:00:40] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2096
[12:00:47] <hashar>	 jnuche: do you know why the branch cut pretest is starting now? :)
[12:01:39] <jnuche>	 I'm running it manually to see if I can find out why it's been broken for the last couple of days
[12:01:54] <hashar>	 ah
[12:02:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2333 to wikikube-worker2100 - cgoubert@cumin1002"
[12:02:09] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2333 to wikikube-worker2100 - cgoubert@cumin1002"
[12:02:09] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:02:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2100
[12:02:10] <hashar>	 jelto and I were about to restart Gerrit :D
[12:02:19] <hashar>	 albeit I haven't put it in the deployment calendar
[12:02:28] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "LGTM. CCing Lego" [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney)
[12:02:28] <hashar>	 I guess once the tests are running, we have enough time to restart the server
[12:02:30] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2100
[12:02:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] deployment_server: Remove buster php-readline stanza [puppet] - 10https://gerrit.wikimedia.org/r/1070994 (owner: 10Alexandros Kosiaris)
[12:02:58] <jnuche>	 go ahead if you need to restart gerrit, I'll just kick it off the job again if I need to
[12:03:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2333 to wikikube-worker2100
[12:03:11] <jnuche>	 *kick off
[12:03:11] <jelto>	 I'll add a downtime for 15m, one sec
[12:03:24] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2333 to...
[12:03:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit.wikimedia.org with reason: Gerrit reboot
[12:03:33] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:15:00 on gerrit.wikimedia.org with reason: Gerrit reboot
[12:03:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit1003.wikimedia.org with reason: Gerrit reboot
[12:04:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit1003.wikimedia.org with reason: Gerrit reboot
[12:04:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2334 to wikikube-worker2101
[12:05:03] <jelto>	 hashar: let me know when I should start the reboot cookbook for gerrit1003
[12:05:41] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[12:06:23] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2095 - hnowlan@cumin1002"
[12:06:43] <hashar>	 jelto: can we do gerrit2002 first?
[12:06:50] <hashar>	 that is gerrit-replica :)
[12:07:08] <jelto>	 it was rebooted yesterday already, probably by mutante
[12:07:30] <hashar>	 and that solved the issue? :)
[12:07:37] <jelto>	 yes
[12:07:42] <hashar>	 \o/
[12:07:45] <hashar>	 let's do gerrit1003
[12:07:46] <hashar>	 :)
[12:08:05] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2095 - hnowlan@cumin1002"
[12:08:05] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:08:05] <jelto>	 ok I'll do the reboot now for gerrit1003
[12:08:05] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2095.codfw.wmnet 222.16.192.10.in-addr.arpa 2.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:08:08] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2095.codfw.wmnet 222.16.192.10.in-addr.arpa 2.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:08:09] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2095
[12:08:19] <moritzm>	 !log upgrade ganeti-test2003 to bookworm for some bullseye->bookworm VM migration tests
[12:08:40] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2095
[12:08:40] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2095
[12:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org
[12:09:29] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2096 - hnowlan@cumin1002"
[12:09:33] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2096 - hnowlan@cumin1002"
[12:09:33] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:09:34] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2096.codfw.wmnet 173.16.192.10.in-addr.arpa 3.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:09:37] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2096.codfw.wmnet 173.16.192.10.in-addr.arpa 3.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:09:37] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2096
[12:10:08] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:10:36] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2096
[12:10:37] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2096
[12:12:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:12:23] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2098
[12:12:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2101
[12:12:32] <jelto>	 Found reboot since 2024-09-06 12:08:46.859551 for hosts gerrit1003.wikimedia.org
[12:12:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2101
[12:12:54] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[12:13:02] <jelto>	 gerrit web interface is back already, cookbook still doing checks
[12:13:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2334 to wikikube-worker2101
[12:13:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125385 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2334 to...
[12:14:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2099.codfw.wmnet wikikube-worker2100.codfw.wmnet wikikube-worker2101.codfw.wmnet on all recursors
[12:14:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2099.codfw.wmnet wikikube-worker2100.codfw.wmnet wikikube-worker2101.codfw.wmnet on all recursors
[12:15:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org
[12:15:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2099.codfw.wmnet
[12:15:26] <jelto>	 hashar: reboot done
[12:15:30] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2099.codfw.wmnet
[12:15:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125390 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb...
[12:15:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125391 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin...
[12:15:57] <hashar>	 jelto: congratulations!
[12:15:58] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2098 - hnowlan@cumin1002"
[12:16:02] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2098 - hnowlan@cumin1002"
[12:16:03] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:16:03] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2098.codfw.wmnet 176.16.192.10.in-addr.arpa 6.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:16:06] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2098.codfw.wmnet 176.16.192.10.in-addr.arpa 6.7.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:16:06] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2098
[12:16:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2029.codfw.wmnet
[12:16:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2029.codfw.wmnet
[12:16:33] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2098
[12:16:33] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2098
[12:16:40] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.k8s.pool-depool-node (exit_code=97) depool for host wikikube-worker2029.codfw.wmnet
[12:16:40] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2029.codfw.wmnet
[12:16:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125393 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb...
[12:16:50] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2097.codfw.wmnet with reason: host reimage
[12:16:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125394 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin...
[12:17:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2029.codfw.wmnet
[12:17:20] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125395 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb...
[12:17:43] <wikibugs>	 (03CR) 10DCausse: [C:03+2] search: Update Cirrus Saneitizer alert [alerts] - 10https://gerrit.wikimedia.org/r/1071004 (owner: 10Ebernhardson)
[12:17:44] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2029.codfw.wmnet
[12:17:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125396 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin...
[12:18:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2099.codfw.wmnet
[12:18:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125398 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb...
[12:18:18] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2099.codfw.wmnet with OS bullseye
[12:18:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2099
[12:18:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[12:18:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:18:52] <icinga-wm>	 PROBLEM - Host kubernetes2033 is DOWN: PING CRITICAL - Packet loss = 100%
[12:18:54] <wikibugs>	 (03Merged) 10jenkins-bot: search: Update Cirrus Saneitizer alert [alerts] - 10https://gerrit.wikimedia.org/r/1071004 (owner: 10Ebernhardson)
[12:19:22] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2097.codfw.wmnet with reason: host reimage
[12:20:30] <icinga-wm>	 PROBLEM - Host kubernetes2031 is DOWN: PING CRITICAL - Packet loss = 100%
[12:20:30] <icinga-wm>	 PROBLEM - Host mw2321 is DOWN: PING CRITICAL - Packet loss = 100%
[12:20:56] <claime>	 expected ^
[12:21:20] <claime>	 I'll go clean them up in a minute
[12:21:22] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse)
[12:21:53] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2099 - cgoubert@cumin1002"
[12:21:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2099 - cgoubert@cumin1002"
[12:21:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:21:58] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2099.codfw.wmnet 201.16.192.10.in-addr.arpa 1.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:22:01] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2099.codfw.wmnet 201.16.192.10.in-addr.arpa 1.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:22:02] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2099
[12:22:47] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2099
[12:22:47] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2099
[12:23:58] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2100.codfw.wmnet
[12:24:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125427 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb...
[12:24:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2100.codfw.wmnet with OS bullseye
[12:24:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2100
[12:24:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[12:24:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:25:32] <icinga-wm>	 PROBLEM - Host mw2332 is DOWN: PING CRITICAL - Packet loss = 100%
[12:27:27] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2096.codfw.wmnet with reason: host reimage
[12:28:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2100 - cgoubert@cumin1002"
[12:28:10] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2100 - cgoubert@cumin1002"
[12:28:10] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:28:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2100.codfw.wmnet 202.16.192.10.in-addr.arpa 2.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:28:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2100.codfw.wmnet 202.16.192.10.in-addr.arpa 2.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:28:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2100
[12:28:31] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2100
[12:28:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2100
[12:29:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2101.codfw.wmnet
[12:29:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125439 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 Renumb...
[12:29:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2101.codfw.wmnet with OS bullseye
[12:29:39] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071178 (owner: 10TrainBranchBot)
[12:29:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2101
[12:29:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[12:30:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:30:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:32:53] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2096.codfw.wmnet with reason: host reimage
[12:33:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2101 - cgoubert@cumin1002"
[12:33:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2101 - cgoubert@cumin1002"
[12:33:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:33:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2101.codfw.wmnet 203.16.192.10.in-addr.arpa 3.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:33:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2101.codfw.wmnet 203.16.192.10.in-addr.arpa 3.0.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:33:21] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2101
[12:33:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2101
[12:33:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2101
[12:37:16] <claime>	 !log homer cr*codfw* commit 'T372878'
[12:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:19] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[12:38:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2066.codfw.wmnet
[12:38:17] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2066.codfw.wmnet
[12:39:39] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2099.codfw.wmnet with reason: host reimage
[12:40:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:40:20] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce support for multiple flat networks [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020)
[12:40:26] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071190
[12:40:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071190 (owner: 10TrainBranchBot)
[12:41:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:41:17] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191
[12:41:36] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: introduce support for multiple flat networks [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020)
[12:41:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071189 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[12:43:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2099.codfw.wmnet with reason: host reimage
[12:43:04] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2097.codfw.wmnet with OS bullseye
[12:43:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125567 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2097.codfw.wm...
[12:44:53] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2100.codfw.wmnet with reason: host reimage
[12:45:18] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host mw1486.eqiad.wmnet
[12:45:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host mw1486.eqiad.wmnet
[12:45:22] <hnowlan>	 !log homer lsw1-b3-codfw* commit
[12:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:56] <wikibugs>	 (03PS1) 10Filippo Giunchedi: graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597)
[12:46:15] <wikibugs>	 (03PS4) 10JMeybohm: renumber-node: Allow the cookbook to run for kubestage nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1071071
[12:46:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:47:41] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host kubestage2001.codfw.wmnet
[12:47:43] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet
[12:47:49] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125594 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumbering for host kubestage2...
[12:48:11] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2097.codfw.wmnet
[12:48:13] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2097.codfw.wmnet
[12:48:14] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2097.codfw.wmnet
[12:48:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2100.codfw.wmnet with reason: host reimage
[12:48:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125596 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor...
[12:48:23] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet
[12:48:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw1476:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597) (owner: 10Filippo Giunchedi)
[12:48:55] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bullseye
[12:49:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host kubestage2001
[12:49:28] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[12:49:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet...
[12:49:45] <icinga-wm>	 PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:50:11] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 423, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:50:18] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2101.codfw.wmnet with reason: host reimage
[12:51:26] <jinxer-wm>	 RESOLVED: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search#Saneitizer_(background_repair_process) - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[12:51:28] <wikibugs>	 (03PS1) 10JMeybohm: rename/renumber kubernetes2034 to wikikube-worker2102 [puppet] - 10https://gerrit.wikimedia.org/r/1071194 (https://phabricator.wikimedia.org/T372878)
[12:51:38] <wikibugs>	 (03CR) 10Elukey: "Thanks! To do things properly we should also update the changelog, with something like "Update maintainer to XXXX"" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (owner: 10Ilias Sarantopoulos)
[12:51:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:52:18] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] rename/renumber kubernetes2034 to wikikube-worker2102 [puppet] - 10https://gerrit.wikimedia.org/r/1071194 (https://phabricator.wikimedia.org/T372878) (owner: 10JMeybohm)
[12:52:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kubestage2001 - jayme@cumin1002"
[12:52:36] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host kubestage2001 - jayme@cumin1002"
[12:52:37] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:52:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestage2001.codfw.wmnet 195.0.192.10.in-addr.arpa 5.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:52:40] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestage2001.codfw.wmnet 195.0.192.10.in-addr.arpa 5.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:52:40] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host kubestage2001
[12:52:50] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2101.codfw.wmnet with reason: host reimage
[12:53:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10125617 (10hnowlan)
[12:53:18] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 341, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:53:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubestage2001
[12:53:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host kubestage2001
[12:54:03] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2096.codfw.wmnet with OS bullseye
[12:54:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[12:54:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:54:45] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2034 to wikikube-worker2102
[12:55:01] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[12:56:08] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:56:22] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:57:49] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2096.codfw.wmnet
[12:57:51] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2096.codfw.wmnet
[12:57:51] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2096.codfw.wmnet
[12:58:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125623 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering...
[12:58:19] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2034 to wikikube-worker2102 - jayme@cumin1002"
[12:58:39] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2034 to wikikube-worker2102 - jayme@cumin1002"
[12:58:39] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:58:39] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2102
[13:00:44] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2102
[13:01:22] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2034 to wikikube-worker2102
[13:01:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes203...
[13:02:28] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2102.codfw.wmnet
[13:02:38] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2102.codfw.wmnet with OS bullseye
[13:02:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125629 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by jayme@cumin1002 Renumberi...
[13:02:48] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2102
[13:02:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125630 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[13:04:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:05:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:06:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2099.codfw.wmnet with OS bullseye
[13:06:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[13:06:57] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:07:10] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2098.codfw.wmnet with reason: host reimage
[13:07:25] <claime>	 !incidents
[13:07:25] <sirenbot>	 5138 (ACKED)  Host db1246 (paged) - PING  - Packet loss = 100%
[13:07:25] <sirenbot>	 5142 (UNACKED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[13:07:26] <akosiaris>	 aha
[13:07:31] <claime>	 !ack 5142
[13:07:32] <sirenbot>	 5142 (ACKED)  ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw)
[13:07:34] <akosiaris>	 thanks
[13:07:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2046.codfw.wmnet, m
[13:07:57] <icinga-wm>	 dfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2077.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2366.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2043.codfw.wmnet, kubernetes2006.codfw.wmnet, mw2398.codfw.wmnet, wikikube
[13:07:57] <icinga-wm>	 002.codfw.wmnet, wikikube-worker2090.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2055.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2016.codfw.wmnet https://wikitech.wikimedia.org/wiki/PyBal
[13:08:22] <fabfur>	 akosiaris: need some help w/ wikifunctions? 
[13:08:28] <cdanis>	 I can also lend a hand
[13:08:29] <wikibugs>	 (03PS2) 10Filippo Giunchedi: graphite: remove mw graphite-based alerts [puppet] - 10https://gerrit.wikimedia.org/r/1071193 (https://phabricator.wikimedia.org/T350597)
[13:08:31] <cdanis>	 and I am oncall heh
[13:09:05] <bblack>	 I'm here too, mostly!
[13:10:00] <fabfur>	 !oncall
[13:10:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:10:21] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:10:29] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2098.codfw.wmnet with reason: host reimage
[13:10:38] <fabfur>	 !oncall-now
[13:10:39] <sirenbot>	 Oncall now for team SRE, rotation business_hours:
[13:10:39] <sirenbot>	 b.black, a.kosiaris, f.abfur, c.danis
[13:10:43] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage
[13:10:51] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:10:55] <akosiaris>	 I am trying to understand if this is wf only and it looks like it
[13:11:04] <akosiaris>	 so, no harm to the rest of the projects overall
[13:11:10] <akosiaris>	 but what I didn't expect was pybal alerting
[13:11:22] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071190 (owner: 10TrainBranchBot)
[13:11:44] <akosiaris>	 I see it resolved. Thankfully wf is in its own mw deployment
[13:11:50] <akosiaris>	 so it can't hurt the rest of the wikis
[13:11:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:11:58] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:12:40] <claime>	 akosiaris: the probe for pybal is Special:BlankPage so if php-fpm can't answer...
[13:12:54] <claime>	 ah no that's just for monitoring
[13:13:22] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage
[13:14:25] <akosiaris>	 maybe wikifunctions shouldn't have its own LVS
[13:14:33] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: set disablePSPMutations for AUX [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071132 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[13:14:58] <akosiaris>	 claime: it's the idleconnection thing btw
[13:15:05] <claime>	 yeah
[13:15:06] <akosiaris>	 it probably killed all connections or something
[13:15:51] <vgutierrez>	 idleconnection won't depool a realserver if the TCP connection gets closed
[13:16:15] <vgutierrez>	 it will depool the server if the TCP connection gets closed and an immediate reconnection fails
[13:16:49] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[13:17:06] <akosiaris>	 funnily enough, this is TCP indeed. So what? even apache got backfilled?
[13:17:13] <akosiaris>	 ah wait, all pods weren't ready, right?
[13:17:15] <logmsgbot>	 !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[13:17:18] * akosiaris checking
[13:18:26] <akosiaris>	 interesting, codfw only
[13:18:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[13:19:32] <akosiaris>	 yup, both pods were in not ready state
[13:19:39] <akosiaris>	 1 still is
[13:20:58] <claime>	 !log homer lsw1-b6-codfw* commit 'T372878'
[13:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:01] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[13:21:57] <akosiaris>	 yup, at least the apache container wasn't ready
[13:22:52] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2102 - jayme@cumin1002"
[13:22:56] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti: Install bridge-utils on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1071199
[13:22:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2102 - jayme@cumin1002"
[13:22:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:22:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2102.codfw.wmnet 226.16.192.10.in-addr.arpa 6.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:23:00] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2102.codfw.wmnet 226.16.192.10.in-addr.arpa 6.2.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:23:01] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2102
[13:23:26] <akosiaris>	 http status 414
[13:24:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[13:25:21] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2100.codfw.wmnet with OS bullseye
[13:25:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[13:25:41] <icinga-wm>	 PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:25:47] <icinga-wm>	 PROBLEM - Host kubernetes2034 is DOWN: PING CRITICAL - Packet loss = 100%
[13:26:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2102
[13:26:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2102
[13:26:53] <icinga-wm>	 RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 28, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:27:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2101.codfw.wmnet with OS bullseye
[13:27:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[13:28:37] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2099.codfw.wmnet
[13:28:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2099.codfw.wmnet
[13:28:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2099.codfw.wmnet
[13:28:58] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2095.codfw.wmnet with OS bullseye
[13:28:59] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2095.codfw.wmnet
[13:29:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:29:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2100.codfw.wmnet
[13:29:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2100.codfw.wmnet
[13:29:55] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2100.codfw.wmnet
[13:30:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125698 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo...
[13:31:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker2095.codfw.wm...
[13:31:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125700 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering for host wikikube-wor...
[13:31:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125715 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo...
[13:31:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125716 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo...
[13:32:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2101.codfw.wmnet
[13:32:44] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2101.codfw.wmnet
[13:32:45] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2101.codfw.wmnet
[13:32:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125733 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo...
[13:32:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125734 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumbering for host wikikube-wo...
[13:34:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:35:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:36:06] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bullseye
[13:36:07] <akosiaris>	 aha, so again
[13:36:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125748 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with...
[13:36:19] <akosiaris>	 I'll set up a silence
[13:37:18] <cdanis>	 should we scale up the deployment?
[13:38:01] <wikibugs>	 (03PS1) 10Jforrester: Fix typo in browser vendor prefix [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071202 (https://phabricator.wikimedia.org/T374180)
[13:38:19] <akosiaris>	 cdanis: from apache logs: {"timestamp": "2024-09-06T12:55:08", "RequestTime": "101", "Client-IP": "127.0.0.1", "Handle/Status": "-/414" yadada
[13:38:23] <akosiaris>	 414 is URI too long
[13:38:26] <cdanis>	 ah
[13:38:28] <cdanis>	 heh
[13:38:28] <akosiaris>	 I don't think this is a capacity issue
[13:38:30] <cdanis>	 yeah
[13:38:32] <cdanis>	 fair enough
[13:38:49] <akosiaris>	 I'll file a task for aw team though
[13:38:57] <cdanis>	 thanks <3
[13:39:01] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125742 (10Jdforrester-WMF) >>! In T374210#10125034, @Dreamy_Jazz wrote: > As such, I think...
[13:39:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:40:36] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125761 (10Dreamy_Jazz) >>! In T374210#10125742, @Jdforrester-WMF wrote: >>>! In T374210#10...
[13:40:39] <wikibugs>	 06SRE, 10CheckUser, 06DBA, 07Wikimedia-production-error: Error connecting to db1237 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection timed out on labswiki - https://phabricator.wikimedia.org/T374210#10125763 (10Dreamy_Jazz)
[13:41:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:44:49] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2102.codfw.wmnet with reason: host reimage
[13:46:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:46:21] <wikibugs>	 (03CR) 10CDanis: [C:03+1] admin_ng: set disablePSPMutations for AUX [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071132 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey)
[13:46:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet
[13:48:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2102.codfw.wmnet with reason: host reimage
[13:49:01] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:49:51] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10125807 (10elukey) I've released spicerack 8.13.0 that collects the latest changes for the redfish module, and inst...
[13:51:59] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database bdrwiki (T371759)
[13:52:01] <stashbot>	 T371759: Prepare and check storage layer for bdrwiki - https://phabricator.wikimedia.org/T371759
[13:52:04] <jayme>	 !log homer lsw1-a6-codfw* commit 'T372878'
[13:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:07] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[13:52:10] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database bdrwiki (T371759)
[13:53:05] <wikibugs>	 (03CR) 10Muehlenhoff: P:idp Prometheus blackbox monitoring for IDP. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071131 (https://phabricator.wikimedia.org/T3670655) (owner: 10Slyngshede)
[13:56:01] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet
[13:56:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet
[13:56:04] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host kubestage2001.codfw.wmnet
[13:56:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125845 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host kubestage2001....
[13:58:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet
[13:58:47] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet
[13:59:03] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[13:59:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:59:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bookworm
[13:59:47] <icinga-wm>	 RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:01:30] <akosiaris>	 task: https://phabricator.wikimedia.org/T374241
[14:01:45] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 25% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:02:10] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] opensearch: ignore hosts with unknown team in role_owner [alerts] - 10https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli)
[14:02:47] <icinga-wm>	 PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:04:27] <wikibugs>	 (03PS1) 10Brouberol: airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098)
[14:05:41] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:05:47] <icinga-wm>	 RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:05:55] <akosiaris>	 ah, this is probably me ^, lemme fix that
[14:05:59] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 337, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:07:43] <akosiaris>	 !log silence alerts based on alertname=PHPFPMTooBusy,deployment=mw-wikifunctions,site=codfw T374241
[14:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:46] <stashbot>	 T374241: wikifunctions.org failures in codfw with 414 error - https://phabricator.wikimedia.org/T374241
[14:09:53] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse)
[14:09:58] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070958 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse)
[14:10:24] <akosiaris>	 !log restart pybal on lvs1019
[14:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] opensearch: ignore hosts with unknown team in role_owner [alerts] - 10https://gerrit.wikimedia.org/r/1071128 (https://phabricator.wikimedia.org/T374178) (owner: 10Tiziano Fogli)
[14:12:01] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 419, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:13:04] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2102.codfw.wmnet with OS bullseye
[14:13:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2102.codfw.wmne...
[14:15:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2.018s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:15:21] <jinxer-wm>	 FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:15:51] <jinxer-wm>	 RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:17:17] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2102.codfw.wmnet
[14:17:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2102.codfw.wmnet
[14:17:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2102.codfw.wmnet
[14:17:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10125940 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by jayme@cumin1002 Renumbering for host wikikube-worke...
[14:18:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10125945 (10JMeybohm)
[14:20:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2.018s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:20:51] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage
[14:20:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:21:33] <icinga-wm>	 PROBLEM - people.wikimedia.org requires authentication on people1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:22:06] <jinxer-wm>	 FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:22:25] <icinga-wm>	 RECOVERY - people.wikimedia.org requires authentication on people1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:22:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:23:28] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage
[14:24:11] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:25:20] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database bdrwiki (T371759)
[14:25:23] <stashbot>	 T371759: Prepare and check storage layer for bdrwiki - https://phabricator.wikimedia.org/T371759
[14:27:06] <jinxer-wm>	 RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:55] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubernetes1059.eqiad.wmnet
[14:27:57] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubernetes1059.eqiad.wmnet
[14:28:15] <akosiaris>	 !log repool kubernetes1059 T365993
[14:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:28] <stashbot>	 T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[14:30:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10126008 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[14:34:47] <wikibugs>	 (03PS2) 10Brouberol: airflow: broaden collected metrics and tag them correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098)
[14:34:55] <wikibugs>	 (03CR) 10Elukey: "Forgot to mention the last time (sorry) but we may think about refactoring both cookbooks with SREBatchRunnerBase, that offers restart/reb" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb)
[14:35:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Link errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247 (10cmooney) 03NEW p:05Triage→03Medium
[14:36:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:33] <wikibugs>	 (03PS1) 10Kamila Součková: kubernetes: rename mw2430 to wikikube-worker2103 [puppet] - 10https://gerrit.wikimedia.org/r/1071221 (https://phabricator.wikimedia.org/T372878)
[14:40:39] <wikibugs>	 (03CR) 10Bking: [C:03+2] wdqs: better isolation of categories components [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse)
[14:41:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:41:50] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bookworm
[14:42:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:42:36] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2095.codfw.wmnet
[14:42:50] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye
[14:42:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126049 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by hnowlan@cumin1002 Renumbe...
[14:42:54] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2095.codfw.wmnet with OS bullseye
[14:42:55] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2095.codfw.wmnet
[14:43:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126050 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi...
[14:43:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126052 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[14:43:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126053 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering...
[14:44:05] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191
[14:44:06] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2095.codfw.wmnet with OS bullseye
[14:44:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi...
[14:45:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10126055 (10elukey) Tried to update Wikitech and https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet, the documentation...
[14:46:35] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: amd-pytorch: change image ownership to ml team [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191
[14:47:19] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:47:19] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:47:19] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[14:47:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Re-add and absent data.yaml entry for manuel-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1071222 (https://phabricator.wikimedia.org/T373927)
[14:47:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "You're right! I added a changelog entry for all the affected images" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1071191 (owner: 10Ilias Sarantopoulos)
[14:48:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10126061 (10elukey) Next and last step - wait for the new conftool release, and then close!
[14:49:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] planet: drop firewall rule for http from localhost [puppet] - 10https://gerrit.wikimedia.org/r/1071024 (owner: 10Dzahn)
[14:49:35] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse)
[14:49:36] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070955 (https://phabricator.wikimedia.org/T374009) (owner: 10DCausse)
[14:50:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:50:46] <wikibugs>	 (03PS1) 10Kgraessle: Enable AutoModerator on ukwik [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823)
[14:51:03] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database bdrwiki (T371759)
[14:51:06] <stashbot>	 T371759: Prepare and check storage layer for bdrwiki - https://phabricator.wikimedia.org/T371759
[14:51:09] <wikibugs>	 (03PS2) 10Kgraessle: Enable AutoModerator on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071223 (https://phabricator.wikimedia.org/T373823)
[14:51:33] <wikibugs>	 (03PS6) 10DCausse: wdqs: common module and profile should not define categories_endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1070956 (https://phabricator.wikimedia.org/T374009)
[14:52:04] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2098.codfw.wmnet with OS bullseye
[14:52:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126073 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[14:54:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Re-add and absent data.yaml entry for manuel-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1071222 (https://phabricator.wikimedia.org/T373927) (owner: 10Muehlenhoff)
[14:55:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:55:32] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249 (10Clement_Goubert) 03NEW
[14:56:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:57:34] <wikibugs>	 (03CR) 10Btullis: "Looks good. I just want to check something, because I remember that aqu set up some statsd related mappings on the analytics instance some" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol)
[14:59:33] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi)
[15:00:24] <wikibugs>	 (03CR) 10Scott French: [C:03+1] kubernetes: rename mw2430 to wikikube-worker2103 [puppet] - 10https://gerrit.wikimedia.org/r/1071221 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[15:00:29] <wikibugs>	 (03PS1) 10Ladsgroup: tables-catalog: Another batch of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071227 (https://phabricator.wikimedia.org/T363581)
[15:00:36] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "I guess at some point we should look at upstream ganeti and move away from the use of this.  Until it's no longer packaged in debian I gue" [puppet] - 10https://gerrit.wikimedia.org/r/1071199 (owner: 10Muehlenhoff)
[15:01:23] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "sorry ignore me - as you said it's our internal tooling not ganeti that needs it." [puppet] - 10https://gerrit.wikimedia.org/r/1071199 (owner: 10Muehlenhoff)
[15:02:56] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2098.codfw.wmnet
[15:02:58] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2098.codfw.wmnet
[15:02:59] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2098.codfw.wmnet
[15:03:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126147 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by hnowlan@cumin1002 Renumbering...
[15:04:51] <wikibugs>	 (03PS2) 10Ladsgroup: tables-catalog: Another batch of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071227 (https://phabricator.wikimedia.org/T363581)
[15:04:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10126164 (10Jhancock.wm) I'm honestly not sure. I don't see anything missing but I'm also not sure what may have been there before the reset.   I've emailed Richard at S...
[15:04:57] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Another batch of core tables [puppet] - 10https://gerrit.wikimedia.org/r/1071227 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[15:05:07] <wikibugs>	 (03CR) 10Btullis: "Here they are, for reference:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071213 (https://phabricator.wikimedia.org/T369098) (owner: 10Brouberol)
[15:05:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2430.codfw.wmnet
[15:07:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2430.codfw.wmnet
[15:08:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ats: Revert the /api/ changes on the CDN side [puppet] - 10https://gerrit.wikimedia.org/r/1071229 (https://phabricator.wikimedia.org/T364400)
[15:08:56] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw2430 to wikikube-worker2103 [puppet] - 10https://gerrit.wikimedia.org/r/1071221 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[15:10:45] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:11:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] ats: Revert the /api/ changes on the CDN side [puppet] - 10https://gerrit.wikimedia.org/r/1071229 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris)
[15:11:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:13:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: Move traffic off lvs1017 to lvs1020 to troubleshoot faulty link
[15:13:54] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: Move traffic off lvs1017 to lvs1020 to troubleshoot faulty link
[15:14:32] <topranks>	 !log disabling PyBal on lvs1017 to shift traffic to lvs1020 and allow work to fix faulty fibre link T374247
[15:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:35] <stashbot>	 T374247: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247
[15:14:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] ats: Fix issue with /api/ pointing to /w/rest.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070274 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris)
[15:14:59] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247#10126210 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c63ff66a-28d3-4567-b7cc-a03c0da01345) set by cmooney@cumin1002 for 2:0...
[15:15:40] <topranks>	 ^^ the BGP alert is PyBal; my bad, I didn't downtime the CRs
[15:15:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: hw troubleshooting: host won't boot lists backplane error for pay-lb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374054#10126212 (10Jhancock.wm) looks like us shutting down the server to move it fixed the error. Can you take a look and co...
[15:15:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2430 to wikikube-worker2103
[15:15:59] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:59] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:16:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:18:08] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020)
[15:18:50] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020)
[15:19:09] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10126233 (10Eevans) >>! In T373097#10121448, @MatthewVernon wrote: > There are 4 swift servers in `C4` - ms-be2058 ms-be2064 ms...
[15:19:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2430 to wikikube-worker2103 - kamila@cumin1002"
[15:21:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[15:22:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2430 to wikikube-worker2103 - kamila@cumin1002"
[15:22:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:22:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2103
[15:23:05] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:23:36] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020)
[15:23:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2103
[15:24:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2430 to wikikube-worker2103
[15:24:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2430 to wi...
[15:24:39] <icinga-wm>	 PROBLEM - Host mw2320 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10126245 (10Jhancock.wm) @klausman  this one isn't under warranty and I don't have an exact match for the drive. will a 1.92Tb drive work...
[15:25:32] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[15:26:37] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: keystone: hooks: create security group rule for additional instance CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020)
[15:27:16] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum
[15:27:26] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:27:50] <mutante>	 !log rolling restarts on durum machines
[15:27:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2103.codfw.wmnet on all recursors
[15:28:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2103.codfw.wmnet on all recursors
[15:28:34] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, and 2 others: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10126265 (10Eevans) >>! In T373101#10121463, @MatthewVernon wrote: > There are some impact Swift servers: >  - ms-be2054 and ms-be2078 and than...
[15:29:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2103.codfw.wmnet
[15:29:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126281 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by kamila@cumin1002 Renumber...
[15:29:49] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:30:13] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:30:49] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:31:13] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:31:32] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:31:36] <wikibugs>	 (03CR) 10Andrew Bogott: "seems good! I'm always amazed at how many files we have to touch for something like this :(" [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[15:31:38] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06collaboration-services, and 3 others: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10126290 (10Eevans) >>! In T373102#10121495, @MatthewVernon wrote: > These racks have the following Swift/Ceph nodes: >  - ms-f...
[15:32:09] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071230 (https://phabricator.wikimedia.org/T374020) (owner: 10Arturo Borrero Gonzalez)
[15:32:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2103.codfw.wmnet with OS bullseye
[15:32:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2103
[15:32:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[15:34:34] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:34:54] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:35:05] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:35:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10126317 (10klausman) >>! In T374207#10126245, @Jhancock.wm wrote: > @klausman  this one isn't under warranty and I don't have an exact m...
[15:35:23] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "looks good, some minor suggestions" [puppet] - 10https://gerrit.wikimedia.org/r/1003442 (owner: 10Slyngshede)
[15:35:47] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:05] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:36:47] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:37:05] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:37:05] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:37:51] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: re-deploy prod articlequality and update staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071232 (https://phabricator.wikimedia.org/T360455)
[15:39:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2103 - kamila@cumin1002"
[15:39:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2103 - kamila@cumin1002"
[15:39:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:39:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2103.codfw.wmnet 60.16.192.10.in-addr.arpa 0.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:39:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2103.codfw.wmnet 60.16.192.10.in-addr.arpa 0.6.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:39:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2103
[15:40:15] <wikibugs>	 (03CR) 10BryanDavis: "I think all of this is rearranging the deck chairs on the Titanic. Effie is working on T292707 and the child task T371374 that will be cha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński)
[15:40:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2103
[15:40:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2103
[15:40:27] <icinga-wm>	 PROBLEM - Host mw2322 is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:33] <icinga-wm>	 PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:01] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:41:01] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:42:04] <mutante>	 the check.wikimedia-dns.org alert would be because the durum hosts are being rebooted
[15:42:13] <mutante>	 the other ones should have no relation
[15:42:33] <elukey>	 !log install spicerack 8.13.0 on cumin1002
[15:42:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:42] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Comm Error: backplane 0 when reimaging wikikube-worker2095 - https://phabricator.wikimedia.org/T374258 (10hnowlan) 03NEW
[15:42:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247#10126347 (10cmooney) 05Open→03Resolved Ok we have replaced the optic in lvs1017 (same model as the one taken from lvs1019 for the record),...
[15:42:43] <topranks>	 !log enabling PyBal on lvs1017 to make primary again after repairing faulty fiber link T374247
[15:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:46] <stashbot>	 T374247: LInk errors from lvs1017 to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T374247
[15:43:01] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:43:01] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:43:04] <wikibugs>	 (03PS21) 10Elukey: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372)
[15:43:33] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1017.eqiad.wmnet
[15:43:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1017.eqiad.wmnet
[15:43:55] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@ad2c434] (releasing): (no justification provided)
[15:44:21] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2198 - https://phabricator.wikimedia.org/T374095#10126366 (10Jhancock.wm) I've made an RMA request with Dell. Should be here early next week.
[15:44:37] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@ad2c434] (releasing): (no justification provided) (duration: 00m 41s)
[15:44:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10126370 (10Jclark-ctr) ganeti1039 b2  u4  cableid 4893 port 2
[15:45:35] <icinga-wm>	 RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[15:47:39] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:48:41] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:49:28] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:50:11] <wikibugs>	 (03PS2) 10Jdlrobson: Enable appearance menu for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020)
[15:50:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson)
[15:50:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071037 (https://phabricator.wikimedia.org/T373703) (owner: 10Physikerwelt)
[15:50:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020) (owner: 10Jdlrobson)
[15:52:59] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:53:33] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:53:49] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:53:51] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:53:51] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:54:31] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:54:49] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:54:51] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:54:51] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:55:25] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071236
[15:55:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071236 (owner: 10TrainBranchBot)
[15:58:22] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Mask mailman3 service on non-active host [puppet] - 10https://gerrit.wikimedia.org/r/1071237
[15:58:26] <wikibugs>	 (03PS1) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[15:58:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[15:58:53] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:58:58] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:59:01] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:59:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2103.codfw.wmnet with reason: host reimage
[15:59:11] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:59:27] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:00:11] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:00:23] <icinga-wm>	 PROBLEM - Host mw2430 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:29] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:00:53] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:01:01] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:02:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2103.codfw.wmnet with reason: host reimage
[16:02:55] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:02:57] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:03:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10126458 (10elukey) My bad, it was because my factory reset for some reason didn't restore the ADMIN password to its original state. Thanks for the follow up!
[16:04:07] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10126459 (10elukey) Great news, the first version of the Supermicro support in provision is live on cumin nodes (nam...
[16:04:23] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2095.codfw.wmnet with OS bullseye
[16:04:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126461 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[16:04:57] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:04:59] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:05:59] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:05:59] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:06:11] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum
[16:09:32] <wikibugs>	 (03PS1) 10Jdrewniak: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039)
[16:14:00] <mutante>	 Southparkfan: does your grafana login work now?
[16:16:05] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126495 (10hnowlan)
[16:21:57] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071236 (owner: 10TrainBranchBot)
[16:24:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2103.codfw.wmnet with OS bullseye
[16:25:03] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126550 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker2103.codfw.wmnet with OS bullseye co...
[16:29:13] <wikibugs>	 (03PS3) 10JHathaway: vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257)
[16:29:52] <wikibugs>	 (03CR) 10JHathaway: "done!" [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: 10JHathaway)
[16:30:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2103.codfw.wmnet
[16:30:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2103.codfw.wmnet
[16:30:20] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.renumber-node (exit_code=0) Renumbering for host wikikube-worker2103.codfw.wmnet
[16:30:30] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10126581 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by kamila@cumin1002 Renumbering for host wikikube-worker2103.codfw.wmnet com...
[16:32:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257) (owner: 10JHathaway)
[16:32:53] <wikibugs>	 (03PS4) 10JHathaway: vrts_aliases: add retry logic [puppet] - 10https://gerrit.wikimedia.org/r/1070671 (https://phabricator.wikimedia.org/T368257)
[16:33:27] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[16:42:51] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 335, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:42:56] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Fix firewall service definitions for CI [puppet] - 10https://gerrit.wikimedia.org/r/1071175 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[16:46:14] <kamila_>	 !log ran homer on cr*codfw* for T372878
[16:46:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:18] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[16:48:55] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 417, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:50:30] <Southparkfan>	 mutante: yes, works fine - thanks :)
[16:51:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:52:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T374249#10126668 (10kamila)
[16:52:44] <mutante>	 Southparkfan: great to hear, then it was indeed the LDAP sync 
[16:54:21] <Southparkfan>	 someone else was renamed and this broke the script, Filippo/o11y is tracking it in https://phabricator.wikimedia.org/T374173#10124175
[16:55:34] <wikibugs>	 (03PS2) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[16:55:52] <mutante>	 yep, he fixed one sync run so it works for you. but there is also a follow-up ticket for next time
[16:55:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[16:58:19] <Southparkfan>	 signing the actual volunteer NDA was probably the least complex part here
[16:58:47] <wikibugs>	 (03PS3) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[16:59:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[16:59:54] <Southparkfan>	 but it works now, much thanks. finally part of the NDA community - only waiting on Netbox access to be fixed, but I/F is working on it
[17:01:28] <mutante>	 Southparkfan: every once in a while we need someone like you to ask for it to test the process, heh
[17:02:50] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071245
[17:02:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071245 (owner: 10TrainBranchBot)
[17:04:11] <wikibugs>	 (03PS4) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:04:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:06:12] <wikibugs>	 (03PS5) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:06:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:08:11] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[17:08:45] <wikibugs>	 (03PS6) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:09:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:11:28] <wikibugs>	 (03PS7) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:11:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:12:38] <wikibugs>	 (03PS1) 10Kamila Součková: kubernetes: rename mw2431 to wikikube-worker2104 [puppet] - 10https://gerrit.wikimedia.org/r/1071246 (https://phabricator.wikimedia.org/T372878)
[17:13:10] <wikibugs>	 (03PS8) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:16:44] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:17:09] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/1071175 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[17:19:08] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Correct firewall services for releases [puppet] - 10https://gerrit.wikimedia.org/r/1071076 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[17:19:29] <wikibugs>	 (03PS9) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:22:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:23:39] <wikibugs>	 (03PS10) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:26:34] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:27:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frban2002 - https://phabricator.wikimedia.org/T369931#10126728 (10Dwisehaupt) Host OS installed and built out with basics. Awaiting the completion of T374269 to finish config and testing.
[17:28:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10126731 (10Dwisehaupt) payments2006 built out and mariadb cloned out. Awaiting completion of T374269 to finish config and testing.
[17:28:26] <wikibugs>	 (03CR) 10Scott French: [C:03+1] kubernetes: rename mw2431 to wikikube-worker2104 [puppet] - 10https://gerrit.wikimedia.org/r/1071246 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[17:34:06] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071245 (owner: 10TrainBranchBot)
[17:35:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:36:22] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "looks good. https://releases.wikimedia.org/  still works and httpb from deployment can talk to releases1003" [puppet] - 10https://gerrit.wikimedia.org/r/1071076 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[17:36:42] <wikibugs>	 (03CR) 10Legoktm: "From what I remember, there's only supposed to be one runner per queue, and having e.g. two outbound runners might be an issue? Idk if tha" [puppet] - 10https://gerrit.wikimedia.org/r/1071049 (owner: 10EoghanGaffney)
[17:37:34] <wikibugs>	 (03PS11) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:38:29] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:40:45] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:41:19] <wikibugs>	 (03PS12) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:42:05] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:42:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:51:53] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:52:30] <wikibugs>	 (03PS13) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[17:52:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[17:54:47] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[17:55:04] <brett>	 !log Import corto 0.3.1-1 into bookworm-wikimedia apt archive
[17:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: spinning disk failure for ml-serve2005.codfw.wmnet - https://phabricator.wikimedia.org/T374207#10126769 (10Jhancock.wm) 05Open→03Resolved cool. I'll do that. thanks!
[17:58:44] <Southparkfan>	 what's going on with the asw2-d-eqiad VC?
[18:00:26] <wikibugs>	 (03PS14) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[18:02:56] <wikibugs>	 (03PS1) 10Scott French: admin: normalize swfrench dot files across hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071247
[18:07:00] <wikibugs>	 (03PS21) 10BCornwall: Create corto deployment/configuration [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789)
[18:08:26] <Southparkfan>	 bblack cdanis: ^ one of the VCPs seems to have been flapping for months, and the linked Wikitech page requires a high prio task for netops - thoughts?
[18:08:47] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin: normalize swfrench dot files across hosts [puppet] - 10https://gerrit.wikimedia.org/r/1071247 (owner: 10Scott French)
[18:08:56] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3903/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall)
[18:08:59] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:10:04] <Southparkfan>	 nothing user-facing, but resilience loss isn't ideal either
[18:12:18] <brett>	 !log Import ncmonitor 1.2.1-1 into bookworm-wikimedia apt archive
[18:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:03] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:16:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:17:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:20:32] <wikibugs>	 (03PS1) 10Dreamy Jazz: Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021)
[18:20:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[18:22:34] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "I added the config line of the gdrive-creds.json file. I'm taking the liberty of just merging this in since it's a really low-risk additio" [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall)
[18:23:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz)
[18:25:57] <wikibugs>	 (03PS1) 10Jforrester: tests: Disable all Beta Cluster CI testing, all failing [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071253 (https://phabricator.wikimedia.org/T374242)
[18:26:11] <wikibugs>	 (03PS1) 10Jforrester: Don't pass empty type/returnType to zobject lookup when undefined [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199)
[18:26:31] <wikibugs>	 (03PS2) 10Jforrester: Don't pass empty type/returnType to zobject lookup when undefined [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199)
[18:26:55] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[18:27:45] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:28:59] <wikibugs>	 (03PS15) 10Bking: statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416)
[18:29:00] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Define wgCheckUserCentralIndexRangesToExclude to exclude WMCS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071251 (https://phabricator.wikimedia.org/T373021) (owner: 10Dreamy Jazz)
[18:29:57] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[18:30:59] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[18:31:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:31:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statistics hosts: enable CPUWeight (cgroupsv2) [puppet] - 10https://gerrit.wikimedia.org/r/1071238 (https://phabricator.wikimedia.org/T372416) (owner: 10Bking)
[18:31:41] <wikibugs>	 (03PS1) 10Scott French: admin: tweak swfrench dot files [puppet] - 10https://gerrit.wikimedia.org/r/1071255
[18:32:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2200.codfw.wmnet with reason: Maintenance
[18:32:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:32:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2200.codfw.wmnet with reason: Maintenance
[18:33:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:34:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:36:24] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin: tweak swfrench dot files [puppet] - 10https://gerrit.wikimedia.org/r/1071255 (owner: 10Scott French)
[18:36:29] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10126900 (10BCornwall) 05Open→03Resolved Corto's now running on alert1001 :)
[18:39:06] <cdanis>	 Southparkfan: thanks, filed T374272
[18:39:06] <stashbot>	 T374272: asw2-d-eqiad vcp links flapping - https://phabricator.wikimedia.org/T374272
[18:40:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:40:40] <Southparkfan>	 cdanis: cool - as far as I know there's nothing critical in D4 either 
[18:42:22] <Southparkfan>	 although its leaf is running on one link now, hopefully the other one doesn't flame out
[18:45:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:47:31] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Also exclude labtestwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071258
[18:50:51] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] cirrus: Also exclude labtestwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071258 (owner: 10Ebernhardson)
[18:51:58] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Also exclude labtestwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071258 (owner: 10Ebernhardson)
[18:57:14] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:57:20] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:59:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071253 (https://phabricator.wikimedia.org/T374242) (owner: 10Jforrester)
[19:00:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071254 (https://phabricator.wikimedia.org/T374199) (owner: 10Jforrester)
[19:00:56] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:01:03] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:06:56] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@6ca00a7] (releasing): (no justification provided)
[19:07:40] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@6ca00a7] (releasing): (no justification provided) (duration: 00m 43s)
[19:10:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10127037 (10VRiley-WMF) Ah, the blinking light did activate. I have swapped the HDD, and it should be good to go. Let us know if there is anything else we can help with. Th...
[19:11:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10127039 (10VRiley-WMF) 05Open→03Resolved
[19:31:51] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "Yeah, looks like the whole wikitech.php file is about to be removed (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/105933" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński)
[19:31:56] <wikibugs>	 (03Abandoned) 10Bartosz Dziewoński: wikitech: Replace `ldap-s-1-debug.log` hack with MW_DEBUG_LOCAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński)
[19:45:25] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Cleanup firewall::service configs for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/1071072 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[20:02:07] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, wikikube-
[20:02:07] <icinga-wm>	 84.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2076.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, mw2425.codfw.wmnet, wikikube-worker2030.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, wikikube-worker2062.cod
[20:02:07] <icinga-wm>	 , kubernetes2016.codfw.wmnet, mw2394.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2440.codfw.wmnet, mw2419.codfw.wmnet, wikikube-worker2014.codfw.wmnet, wikikube-worker2101.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal
[20:02:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2424.codfw.wmnet, parse2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2099.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw
[20:02:35] <icinga-wm>	 fw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2313.codfw.wmnet, mw2302.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2397.codfw.wmnet, mw2314.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker2014.codfw.wmnet, parse2012.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worke
[20:02:35] <icinga-wm>	 dfw.wmnet, kubernetes2044.codfw.wmnet, mw2336.codfw.wmnet, parse2014.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2426.codfw.wmnet, mw2371.codfw.wmnet, wikikube-w https://wikitech.wikimedia.org/wiki/PyBal
[20:03:46] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1071072 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[20:04:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:04:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:06:17] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "looks reasonable to me. I would have probably used a selector but nothing wrong with this that I could argue about :)" [puppet] - 10https://gerrit.wikimedia.org/r/1071237 (owner: 10EoghanGaffney)
[20:07:37] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Cleanup firewall::service configs for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1071073 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[20:13:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "/etc/nftables/input/10_miscweb-http-envoy.nft]/ensure: removed - no problem, os-reports.wikimedia.org is up, as an example" [puppet] - 10https://gerrit.wikimedia.org/r/1071073 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[20:15:25] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "codesearch: replace ferm::service with firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/1071176 (owner: 10Muehlenhoff)
[20:21:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 58.12s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:23:48] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "noop confirmed on codesearch9 - had no effect until we actually switch the firewall provider" [puppet] - 10https://gerrit.wikimedia.org/r/1071176 (owner: 10Muehlenhoff)
[20:26:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:26:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3m 43s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:30:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:31:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in - https://phabricator.wikimedia.org/T325406#10127186 (10jhathaway)
[20:39:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 7.187s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:39:22] <wikibugs>	 (03PS2) 10Jforrester: Use default width/height on gallery to avoid parser instance [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071265 (https://phabricator.wikimedia.org/T374146)
[20:40:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/UploadWizard] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071265 (https://phabricator.wikimedia.org/T374146) (owner: 10Jforrester)
[20:40:49] <wikibugs>	 (03PS1) 10Jforrester: ZObjectStore::findZTesterResult: Trim our own error so we don't break logstash [extensions/WikiLambda] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1071266 (https://phabricator.wikimedia.org/T374241)
[20:44:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 7.187s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:45:45] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2079.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, wiki
[20:45:45] <icinga-wm>	 ker2036.codfw.wmnet, parse2009.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, parse2018.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2022.codfw.wmnet, wikikube-worker2027.codfw.wmnet, 
[20:45:45] <icinga-wm>	 odfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2090.c https://wikitech.wikimedia.org/wiki/PyBal
[20:46:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-wor
[20:46:09] <icinga-wm>	 codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2044.codfw.wmnet, mw2431.codfw.wmnet, wikikube-worker2022.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.codfw.wmnet, mw2419.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wik
[20:46:09] <icinga-wm>	 rker2090.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2014.codfw.wmnet, wikikube-worker2062.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2353.codfw.wmnet, mw2449.codfw.wmn https://wikitech.wikimedia.org/wiki/PyBal
[20:47:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 4m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:48:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:48:45] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:50:38] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "curious..I also remember having issues in the past querying un-mapped fields. But indeed the MediaSearch query is querying it, and i did a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401) (owner: 10DCausse)
[20:51:56] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:52:14] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1071146/3905/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1071146 (owner: 10Muehlenhoff)
[20:52:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 4m 0s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:57:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10127280 (10Dwisehaupt) 05Open→03Resolved Host is built and config will continue in T372933
[20:58:19] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "looks good, only effect is on ssh between phab servers and it just does the resolve. (we are still using ferm here)" [puppet] - 10https://gerrit.wikimedia.org/r/1071146 (owner: 10Muehlenhoff)
[20:59:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: hw troubleshooting: host won't boot lists backplane error for pay-lb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T374054#10127286 (10Dwisehaupt) 05Open→03Resolved Thanks. It's back online and up. Hopefully it was a transient error.
[21:02:43] <wikibugs>	 (03PS4) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677)
[21:03:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[21:03:21] <wikibugs>	 (03CR) 10Dzahn: "how about this instead https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071028" [puppet] - 10https://gerrit.wikimedia.org/r/1071147 (https://phabricator.wikimedia.org/T370677) (owner: 10Muehlenhoff)
[21:04:37] <wikibugs>	 (03PS5) 10Dzahn: phabricator: syntax fixes for firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1071028 (https://phabricator.wikimedia.org/T370677)
[21:11:17] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651) (owner: 10Cwhite)
[21:11:27] <wikibugs>	 (03PS2) 10Cwhite: logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651)
[21:17:08] <wikibugs>	 (03CR) 10Cwhite: [V:03+2 C:03+2] logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651) (owner: 10Cwhite)
[21:36:03] <jhathaway>	 jjj
[21:36:09] <jhathaway>	 oops
[21:56:46] <wikibugs>	 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10127348 (10colewhite) 05Open→03In progress a:03colewhite
[22:14:10] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:16:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:20:56] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:21:44] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[22:35:03] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:35:08] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:37:42] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406)
[22:37:47] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406)
[22:37:58] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "I think this is okay to merge once we've setup the messages and defined audience correctly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak)
[22:39:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: 10Bartosz Dziewoński)
[22:39:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069716 (owner: 10Bartosz Dziewoński)
[22:40:14] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:41:44] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[22:53:16] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:06:56] <icinga-wm>	 PROBLEM - SSH on aphlict1002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:07:56] <icinga-wm>	 RECOVERY - SSH on aphlict1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:08:18] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:10:18] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] Add Web search experiment quickSurvey on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak)
[23:11:18] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:15:56] <wikibugs>	 (03PS2) 10Jdrewniak: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039)
[23:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:20:18] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:29:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:31:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@499.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:37:00] <wikibugs>	 (03PS3) 10Jdrewniak: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039)
[23:38:30] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071279
[23:38:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071279 (owner: 10TrainBranchBot)
[23:38:40] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "This looks good to deploy to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak)
[23:39:47] <wikibugs>	 (03PS4) 10Jdlrobson: Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak)
[23:40:04] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Add Web search experiment quickSurvey on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071241 (https://phabricator.wikimedia.org/T373039) (owner: 10Jdrewniak)
[23:43:07] <jan_drewniak>	 Hello folks, it's Friday afternoon, but I'm wondering if it's ok to deploy a beta-cluster-only config change?
[23:55:26] <jan_drewniak>	 nvm