[00:09:42] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1074730 (owner: TrainBranchBot)
[00:15:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[02:38:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:14:56] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:17] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:29:00] <wikibugs>	 (PS1) Slyngshede: C:idm:deployment: Add structlog dependency. [puppet] - https://gerrit.wikimedia.org/r/1074846
[06:30:15] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1074846 (owner: Slyngshede)
[06:31:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Doh, good catch:-)" [puppet] - https://gerrit.wikimedia.org/r/1074713 (https://phabricator.wikimedia.org/T374443) (owner: Elukey)
[06:32:00] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10166449 (ABran-WMF) those servers are a bit sensitive, @wiki_willy do you think this would be manageable to check if we have a spare disk during this week?
[06:32:26] <wikibugs>	 (CR) Slyngshede: [C:+2] C:idm:deployment: Add structlog dependency. [puppet] - https://gerrit.wikimedia.org/r/1074846 (owner: Slyngshede)
[06:35:14] <wikibugs>	 (CR) Slyngshede: [C:+2] Audit log for permission requests validation. [software/bitu] - https://gerrit.wikimedia.org/r/1071849 (owner: Slyngshede)
[06:38:24] <wikibugs>	 (Merged) jenkins-bot: Audit log for permission requests validation. [software/bitu] - https://gerrit.wikimedia.org/r/1071849 (owner: Slyngshede)
[06:40:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppet checkout on pybaltest [puppet] - https://gerrit.wikimedia.org/r/1047509 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[06:42:47] <wikibugs>	 (PS2) Muehlenhoff: Failover idp.w.o [dns] - https://gerrit.wikimedia.org/r/1074167
[06:43:23] <wikibugs>	 (PS4) Slyngshede: Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162
[06:45:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Failover idp.w.o [dns] - https://gerrit.wikimedia.org/r/1074167 (owner: Muehlenhoff)
[06:47:12] <wikibugs>	 (CR) Slyngshede: [C:+2] Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162 (owner: Slyngshede)
[06:48:51] <wikibugs>	 (PS3) Slyngshede: UI for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[06:49:36] <wikibugs>	 (Merged) jenkins-bot: Grant permissions: Hookup LDAP permission granting. [software/bitu] - https://gerrit.wikimedia.org/r/1073162 (owner: Slyngshede)
[06:51:32] <wikibugs>	 (PS4) Slyngshede: Account blocking [software/bitu] - https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820)
[06:55:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] icinga: Enable profile::auto_restarts::service for keyholder-proxy [puppet] - https://gerrit.wikimedia.org/r/1074358 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:58:40] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update rec-api image [deployment-charts] - https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387)
[06:58:58] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] profile::puppetserver: fix SHA1 path for labsprivate [puppet] - https://gerrit.wikimedia.org/r/1074713 (https://phabricator.wikimedia.org/T374443) (owner: Elukey)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:30] <wikibugs>	 (CR) Elukey: [C:+1] kafka::broker: Add the external-services DNS name to the certs [puppet] - https://gerrit.wikimedia.org/r/1074411 (https://phabricator.wikimedia.org/T374729) (owner: JMeybohm)
[07:07:49] <wikibugs>	 (PS1) Elukey: requestctl: change comment for post_docroot.yaml [labs/private] - https://gerrit.wikimedia.org/r/1074849
[07:08:07] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] requestctl: change comment for post_docroot.yaml [labs/private] - https://gerrit.wikimedia.org/r/1074849 (owner: Elukey)
[07:09:51] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10166477 (elukey) >>! In T374443#10161254, @MoritzMuehlenhoff wrote: >>>! In T374443#10161219, @elukey wrote: >> The move was d...
[07:14:39] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: update rec-api image [deployment-charts] - https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) (owner: Kevin Bazira)
[07:14:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:13] <wikibugs>	 (PS1) Muehlenhoff: Remove puppetmaster1001 from list of keytab sync hosts [puppet] - https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023)
[07:20:42] <wikibugs>	 (PS1) Slyngshede: C:idm Enable audit logging for testing. [puppet] - https://gerrit.wikimedia.org/r/1074853
[07:22:01] <wikibugs>	 (CR) Elukey: [C:+1] Remove puppetmaster1001 from list of keytab sync hosts [puppet] - https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023) (owner: Muehlenhoff)
[07:22:06] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4074/console" [puppet] - https://gerrit.wikimedia.org/r/1074853 (owner: Slyngshede)
[07:22:45] <wikibugs>	 (CR) Elukey: [C:+1] mw_rc_irc: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1074429 (owner: Muehlenhoff)
[07:23:13] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4076/co" [puppet] - https://gerrit.wikimedia.org/r/1074853 (owner: Slyngshede)
[07:24:21] <wikibugs>	 (PS7) Elukey: sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443)
[07:24:27] <wikibugs>	 (CR) Elukey: sre.hosts.decommission: update/remove puppet-related constants (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: Elukey)
[07:24:56] <wikibugs>	 (PS2) Slyngshede: C:idm Enable audit logging for testing. [puppet] - https://gerrit.wikimedia.org/r/1074853
[07:25:57] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4077/co" [puppet] - https://gerrit.wikimedia.org/r/1074853 (owner: Slyngshede)
[07:26:58] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4078/co" [puppet] - https://gerrit.wikimedia.org/r/1074853 (owner: Slyngshede)
[07:27:53] <wikibugs>	 (CR) Hashar: [C:-2] "> gate-and-submit will run against the rebased version of the change, right?" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1073461 (owner: Hashar)
[07:28:56] <wikibugs>	 (PS2) Muehlenhoff: Remove puppetmaster1001 from list of keytab sync hosts [puppet] - https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023)
[07:30:39] <wikibugs>	 (CR) Arnaudb: [C:+1] sre.switchdc.databases: update Phabricator more (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:30:57] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppetmaster1001 from list of keytab sync hosts [puppet] - https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023) (owner: Muehlenhoff)
[07:37:09] <wikibugs>	 (CR) Jelto: [V:-1] "pcc fails with" [puppet] - https://gerrit.wikimedia.org/r/1074493 (owner: Dzahn)
[07:45:12] <wikibugs>	 (PS1) Muehlenhoff: puppetdb: Move JVM config out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798)
[07:47:43] <wikibugs>	 (CR) CI reject: [V:-1] puppetdb: Move JVM config out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:49:00] <wikibugs>	 (CR) Elukey: [C:+2] sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: Elukey)
[07:49:34] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:52:29] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts poolcounter1004.eqiad.wmnet
[07:53:27] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts poolcounter1004.eqiad.wmnet
[07:55:15] <wikibugs>	 (PS1) Elukey: profile::lvs::realserver: update poolcounter hosts [puppet] - https://gerrit.wikimedia.org/r/1074949 (https://phabricator.wikimedia.org/T332015)
[07:58:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1074949 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[07:59:34] <wikibugs>	 (CR) Elukey: [C:+2] profile::lvs::realserver: update poolcounter hosts [puppet] - https://gerrit.wikimedia.org/r/1074949 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[08:01:01] <wikibugs>	 (CR) Brouberol: [C:+1] Enable the rclone backup schedule on db1208 [puppet] - https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: Btullis)
[08:02:05] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10166552 (MoritzMuehlenhoff)
[08:03:11] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update rec-api image [deployment-charts] - https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) (owner: Kevin Bazira)
[08:04:08] <wikibugs>	 (Merged) jenkins-bot: ml-services: update rec-api image [deployment-charts] - https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) (owner: Kevin Bazira)
[08:08:07] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] kubernetes: rename mw2424 -> wikikube-worker2124 [puppet] - https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) (owner: Effie Mouzeli)
[08:08:44] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:12:09] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1074853 (owner: Slyngshede)
[08:12:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2424 to wikikube-worker2124
[08:12:29] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] C:idm Enable audit logging for testing. [puppet] - https://gerrit.wikimedia.org/r/1074853 (owner: Slyngshede)
[08:12:34] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[08:12:41] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:13:19] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:20] <elukey>	 !log elukey@puppetmaster1001:~$ sudo puppet cert destroy performance.discovery.wmnet
[08:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:44] <wikibugs>	 (PS1) Muehlenhoff: Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023)
[08:16:45] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2424 to wikikube-worker2124 - jiji@cumin1002"
[08:16:55] <wikibugs>	 (PS2) Muehlenhoff: Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023)
[08:17:40] <wikibugs>	 (CR) Elukey: [C:+1] Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023) (owner: Muehlenhoff)
[08:18:09] <moritzm>	 !log installing systemd bugfix updates from Bookworm point release
[08:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:32] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:21:03] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2424 to wikikube-worker2124 - jiji@cumin1002"
[08:21:03] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:21:04] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2124
[08:21:16] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2124
[08:21:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023) (owner: Muehlenhoff)
[08:21:39] <wikibugs>	 (CR) JMeybohm: [C:+2] kafka::broker: Add the external-services DNS name to the certs [puppet] - https://gerrit.wikimedia.org/r/1074411 (https://phabricator.wikimedia.org/T374729) (owner: JMeybohm)
[08:21:55] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2424 to wikikube-worker2124
[08:24:26] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] icinga: remove frlog2001 and frpm2001 for decom [puppet] - https://gerrit.wikimedia.org/r/1074538 (https://phabricator.wikimedia.org/T375239) (owner: Dwisehaupt)
[08:24:47] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for puppetmaster1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002
[08:25:22] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] icinga: add cert expiration monitoring for apis [puppet] - https://gerrit.wikimedia.org/r/1074552 (https://phabricator.wikimedia.org/T348725) (owner: Dwisehaupt)
[08:25:49] <hashar>	 !log Updated CI job operations-puppet-tests-bullseye to image rebuild for Puppet 7 # T330490
[08:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:53] <stashbot>	 T330490: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490
[08:26:19] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for puppetmaster1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002
[08:26:39] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for puppetmaster2001.codfw.wmnet: Renew puppet certificate - elukey@cumin1002
[08:26:45] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2124.codfw.wmnet on all recursors
[08:26:49] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2124.codfw.wmnet on all recursors
[08:32:52] <wikibugs>	 (PS1) JMeybohm: Revert "Disable paging for mw-wikifunctions" [puppet] - https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231)
[08:33:54] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Disable paging for mw-wikifunctions" [puppet] - https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) (owner: JMeybohm)
[08:34:27] <wikibugs>	 (PS2) JMeybohm: Revert "Disable paging for mw-wikifunctions" [puppet] - https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231)
[08:34:45] <wikibugs>	 (PS3) JMeybohm: Revert "Disable paging for mw-wikifunctions" [puppet] - https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231)
[08:38:09] <wikibugs>	 (CR) JMeybohm: [C:+2] Revert "Disable paging for mw-wikifunctions" [puppet] - https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) (owner: JMeybohm)
[08:39:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166669 (ayounsi) a:ayounsi
[08:42:11] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Acknowledged" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1073461 (owner: Hashar)
[08:45:10] <wikibugs>	 (CR) Hashar: [V:+2 C:+2] "Done" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1073461 (owner: Hashar)
[08:49:28] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job poolcounter_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:53:13] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job poolcounter_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:54:10] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166734 (ayounsi) Opened high priority JTAC case 2024-0923-266479 and attached logs/debug output.
[08:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:57:07] <wikibugs>	 (CR) Stevemunene: [C:+1] Enable the rclone backup schedule on db1208 [puppet] - https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: Btullis)
[08:57:25] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts poolcounter2003.codfw.wmnet
[08:57:36] <jayme>	 !log roll-restarting all kafka clusters for certificate changes - T374729
[08:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:40] <stashbot>	 T374729: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729
[09:00:49] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:01:01] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[09:03:31] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Enable the rclone backup schedule on db1208 [puppet] - https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: Btullis)
[09:04:56] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2124 - jiji@cumin1002"
[09:05:16] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2124 - jiji@cumin1002"
[09:05:16] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:05:17] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2124.codfw.wmnet 79.0.192.10.in-addr.arpa 9.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:05:20] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2124.codfw.wmnet 79.0.192.10.in-addr.arpa 9.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:05:21] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2124
[09:05:26] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[09:05:38] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2124
[09:05:38] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2124
[09:07:41] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:07:41] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts poolcounter2003.codfw.wmnet
[09:10:08] <wikibugs>	 (CR) Elukey: sre.network.tls: start from scratch if CSR is missing (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: Ayounsi)
[09:12:26] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/weight=10; selector: name=registry2005.codfw.wmnet
[09:12:47] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=registry2005.codfw.wmnet
[09:16:24] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts poolcounter2004.codfw.wmnet
[09:18:49] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw
[09:20:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad
[09:20:39] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[09:20:51] <wikibugs>	 (PS1) Elukey: role::poolcounter::server: cleanup after Bookworm migration [puppet] - https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015)
[09:21:26] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw
[09:21:49] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad
[09:21:53] <wikibugs>	 (PS2) Elukey: role::poolcounter::server: cleanup after Bookworm migration [puppet] - https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015)
[09:21:55] <wikibugs>	 (PS1) Ayounsi: Enable and scrape gNMIc api Prometheus endpoint [puppet] - https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361)
[09:22:23] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=no; selector: name=registry200(3|4).codfw.wmnet
[09:23:54] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) (owner: Ayounsi)
[09:23:57] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2124.codfw.wmnet with reason: host reimage
[09:26:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: poolcounter2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002"
[09:27:45] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2124.codfw.wmnet with reason: host reimage
[09:28:17] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet
[09:29:34] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: poolcounter2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002"
[09:29:34] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:29:35] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts poolcounter2004.codfw.wmnet
[09:30:23] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10166770 (elukey) Open→Resolved
[09:34:01] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[09:34:41] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10166787 (MoritzMuehlenhoff)
[09:35:22] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10166778 (elukey) Open→Resolved a:elukey
[09:35:38] <wikibugs>	 (CR) Elukey: [C:+2] role::poolcounter::server: cleanup after Bookworm migration [puppet] - https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015) (owner: Elukey)
[09:40:48] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti1039 - ganeti1052 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1074957 (https://phabricator.wikimedia.org/T365650)
[09:40:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:41:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:41:46] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:41:50] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:42:24] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add ganeti1039 - ganeti1052 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1074957 (https://phabricator.wikimedia.org/T365650) (owner: Muehlenhoff)
[09:42:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:42:46] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad
[09:46:07] <wikibugs>	 (PS1) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[09:46:27] <wikibugs>	 (CR) CI reject: [V:-1] Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[09:46:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad
[09:48:18] <wikibugs>	 (PS1) Slyngshede: Block User: Add LDAP blocking/unblocking. [software/bitu] - https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820)
[09:48:18] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2124.codfw.wmnet with OS bullseye
[09:48:41] <wikibugs>	 (PS6) Ayounsi: sre.network.tls: start from scratch if CSR is missing [cookbooks] - https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179)
[09:49:02] <wikibugs>	 (PS2) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[09:49:04] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:49:22] <wikibugs>	 (CR) CI reject: [V:-1] Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[09:50:16] <wikibugs>	 (CR) Ayounsi: sre.network.tls: start from scratch if CSR is missing (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: Ayounsi)
[09:51:04] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad
[09:51:24] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10166816 (dcaro) >>! In T372814#10165304, @Jclark-ctr wrote: > @Andrew  i see this ticket is in my name. is there something i need to do for this?...
[09:52:05] <wikibugs>	 (PS3) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[09:52:56] <wikibugs>	 (PS4) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[09:53:02] <effie>	 !log homer cr*codfw* commit 'T372878'
[09:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:07] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[09:54:29] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10166817 (dcaro) a:Jclark-ctr→dcaro
[09:58:40] <wikibugs>	 (PS5) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[09:58:40] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2124.codfw.wmnet
[09:59:33] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2124.codfw.wmnet
[09:59:34] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2124.codfw.wmnet
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1000)
[10:00:17] <wikibugs>	 (PS6) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[10:00:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10166843 (MoritzMuehlenhoff) >>! In T365650#10165298, @Jclark-ctr wrote: > @MoritzMuehlenhoff  can you update puppet  site.pp is mis...
[10:01:01] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4082/co" [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[10:02:57] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 293, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:05:51] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372 (fnegri) NEW
[10:06:01] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10166884 (fnegri)
[10:06:33] <dcausse>	 jouncebot: nowandnext
[10:06:34] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1000)
[10:06:34] <jouncebot>	 In 2 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1300)
[10:07:10] <wikibugs>	 (CR) Elukey: [C:+1] "LGTM, maybe test-cookbook it before merging so we are sure it works (if not already done!)" [cookbooks] - https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: Ayounsi)
[10:08:21] <wikibugs>	 ops-eqiad, DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10166889 (fnegri) ipmi-sel confirms a "Thermal Trip" both for June 20th and Sep 21st:  ` fnegri@cloudvirt1063:~$ sudo ipmi-sel ID  | Date        | Time     | Name             |...
[10:08:23] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-update: enable calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[10:08:45] <moritzm>	 !log rolling out debmonitor-client updates T216832
[10:08:48] <wikibugs>	 (PS1) Btullis: Remove the secrets after testing is complete [puppet] - https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692)
[10:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:49] <stashbot>	 T216832: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832
[10:09:34] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4083/co" [puppet] - https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[10:09:38] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-update: enable calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[10:10:38] <wikibugs>	 (CR) Ayounsi: [C:+2] "Thanks ! already tested, except the very last PS which I don't think requires testing as it's quite minor." [cookbooks] - https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: Ayounsi)
[10:11:00] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:11:23] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:12:17] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host registry1005.eqiad.wmnet
[10:12:18] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[10:12:36] <wikibugs>	 (PS1) Btullis: Clean up the test secrets after testing [puppet] - https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692)
[10:13:23] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4084/console" [puppet] - https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[10:14:46] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:14:54] <wikibugs>	 (CR) Jcrespo: [C:+1] "I found only 2 discrepancies: pc2007 is not marked as a master on puppet (may not be an issue as it may not be yet fully setup) CC Amir. A" [dns] - https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: Scott French)
[10:15:09] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:15:30] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM registry1005.eqiad.wmnet - elukey@cumin1002"
[10:15:35] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM registry1005.eqiad.wmnet - elukey@cumin1002"
[10:15:35] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:15:35] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache registry1005.eqiad.wmnet on all recursors
[10:15:38] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) registry1005.eqiad.wmnet on all recursors
[10:16:03] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM registry1005.eqiad.wmnet - elukey@cumin1002"
[10:16:07] <wikibugs>	 (PS6) Btullis: Add an hdfs_file type and provider [puppet] - https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692)
[10:16:07] <wikibugs>	 (PS7) Btullis: Add some test secrets to the hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692)
[10:16:07] <wikibugs>	 (PS2) Btullis: Remove the secrets after testing is complete [puppet] - https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692)
[10:16:08] <wikibugs>	 (PS2) Btullis: Clean up the test secrets after testing [puppet] - https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692)
[10:16:08] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM registry1005.eqiad.wmnet - elukey@cumin1002"
[10:16:54] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4085/console" [puppet] - https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) (owner: Btullis)
[10:17:43] <wikibugs>	 (PS1) Elukey: Set puppet 7 for registry1005 [puppet] - https://gerrit.wikimedia.org/r/1074965 (https://phabricator.wikimedia.org/T375374)
[10:18:11] <wikibugs>	 (CR) Elukey: [C:+2] Set puppet 7 for registry1005 [puppet] - https://gerrit.wikimedia.org/r/1074965 (https://phabricator.wikimedia.org/T375374) (owner: Elukey)
[10:18:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry1005.eqiad.wmnet with OS bookworm
[10:22:45] <jinxer-wm>	 FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[10:23:30] <wikibugs>	 (PS3) Btullis: Absent the secrets after testing is complete [puppet] - https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692)
[10:23:30] <wikibugs>	 (PS3) Btullis: Clean up the test secrets after testing [puppet] - https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692)
[10:23:36] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 375, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:23:57] <wikibugs>	 (Merged) jenkins-bot: sre.network.tls: start from scratch if CSR is missing [cookbooks] - https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: Ayounsi)
[10:25:41] <effie>	 !log homer lsw1-a6-codfw* commit 'T372878'
[10:25:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:45] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[10:27:41] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10166945 (MoritzMuehlenhoff) Open→Resolved Updated deb has been rolled out fleetwide, closing.
[10:27:49] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage
[10:29:52] <jinxer-wm>	 FIRING: ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:23] <jinxer-wm>	 RESOLVED: ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:52] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage
[10:33:38] <wikibugs>	 (CR) Elukey: [C:+1] Also move the apt::pin under the buster conditional [puppet] - https://gerrit.wikimedia.org/r/1073467 (https://phabricator.wikimedia.org/T374928) (owner: Muehlenhoff)
[10:33:43] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2425.codfw.wmnet
[10:34:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2124.codfw.wmnet
[10:34:15] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2124.codfw.wmnet
[10:34:17] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2425.codfw.wmnet
[10:34:28] <wikibugs>	 (CR) Elukey: [C:+1] On Bookworm create the system user using systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/1073469 (https://phabricator.wikimedia.org/T374928) (owner: Muehlenhoff)
[10:35:45] <jinxer-wm>	 FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[10:36:58] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Also move the apt::pin under the buster conditional [puppet] - https://gerrit.wikimedia.org/r/1073467 (https://phabricator.wikimedia.org/T374928) (owner: Muehlenhoff)
[10:37:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] On Bookworm create the system user using systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/1073469 (https://phabricator.wikimedia.org/T374928) (owner: Muehlenhoff)
[10:37:39] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Add a datahubsearch cluster and assign the correct hosts [puppet] - https://gerrit.wikimedia.org/r/1074389 (https://phabricator.wikimedia.org/T374932) (owner: Btullis)
[10:37:45] <jinxer-wm>	 RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[10:38:04] <Dreamy_Jazz>	 !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[10:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:45] <jinxer-wm>	 RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[10:43:02] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:43:39] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2425.codfw.wmnet with reason: reimage
[10:43:42] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2425.codfw.wmnet with reason: reimage
[10:43:54] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:49:40] <wikibugs>	 (PS1) Elukey: Revert "Set puppet 7 for registry1005" [puppet] - https://gerrit.wikimedia.org/r/1074972
[10:51:52] <jynus>	 !log starting db master table checks on s1 (db1163, db2203) T375186
[10:51:54] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host registry1005.eqiad.wmnet with OS bookworm
[10:51:54] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host registry1005.eqiad.wmnet
[10:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:56] <stashbot>	 T375186: databases preswitchover checks - https://phabricator.wikimedia.org/T375186
[10:52:30] <wikibugs>	 (CR) Muehlenhoff: [C:+1] Revert "Set puppet 7 for registry1005" [puppet] - https://gerrit.wikimedia.org/r/1074972 (owner: Elukey)
[10:53:28] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[10:53:35] <wikibugs>	 (CR) CI reject: [V:-1] cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[10:53:43] <wikibugs>	 (CR) Elukey: [C:+2] Revert "Set puppet 7 for registry1005" [puppet] - https://gerrit.wikimedia.org/r/1074972 (owner: Elukey)
[10:53:53] <wikibugs>	 (PS3) DCausse: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195)
[10:54:11] <wikibugs>	 (CR) DCausse: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[10:56:44] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on registry1005.eqiad.wmnet with reason: WIP - working on puppet runs
[10:56:47] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on registry1005.eqiad.wmnet with reason: WIP - working on puppet runs
[10:57:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host registry1005.eqiad.wmnet
[10:59:05] <wikibugs>	 (PS1) Muehlenhoff: Switch registry1005 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1074973 (https://phabricator.wikimedia.org/T375374)
[10:59:34] <wikibugs>	 (CR) Elukey: [C:+1] Switch registry1005 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1074973 (https://phabricator.wikimedia.org/T375374) (owner: Muehlenhoff)
[11:00:03] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch registry1005 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1074973 (https://phabricator.wikimedia.org/T375374) (owner: Muehlenhoff)
[11:02:12] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host registry1005.eqiad.wmnet
[11:03:56] <wikibugs>	 (CR) DCausse: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[11:13:03] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[11:14:21] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: DCausse)
[11:15:28] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[11:15:38] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:16:09] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[11:16:20] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:17:02] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[11:17:06] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:20:30] <wikibugs>	 (PS1) Muehlenhoff: Fix site.pp after adding new Ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/1074978
[11:21:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix site.pp after adding new Ganeti hosts [puppet] - https://gerrit.wikimedia.org/r/1074978 (owner: Muehlenhoff)
[11:41:20] <wikibugs>	 (PS3) Muehlenhoff: envoy: Add support for passing an array of sets to the firewall service [puppet] - https://gerrit.wikimedia.org/r/1072690
[11:41:28] <wikibugs>	 (PS4) Muehlenhoff: envoy: Add support for passing an array of sets to the firewall service [puppet] - https://gerrit.wikimedia.org/r/1072690
[11:45:39] <moritzm>	 !log installing cups security updates
[11:45:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:22] <wikibugs>	 (CR) Jaime Nuche: [C:+1] "> Then when using scap3 for deployment, Puppet was made to NOT install the Jenkins package since it is not prepared by Puppet. I guess it " [puppet] - https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: Hashar)
[11:47:04] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167117 (VRiley-WMF) With this information, I'm going to reach back out to Dell.
[11:49:01] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:49:11] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:51:01] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 375, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:51:11] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 293, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:51:39] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2425.codfw.wmnet with reason: reimage
[11:51:42] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2425.codfw.wmnet with reason: reimage
[11:52:19] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2425.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[11:54:02] <wikibugs>	 (PS1) Arnaudb: mariadb: toggle notifications for db2229 [puppet] - https://gerrit.wikimedia.org/r/1074986 (https://phabricator.wikimedia.org/T375186)
[11:54:03] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: toggle notifications for db2229 [puppet] - https://gerrit.wikimedia.org/r/1074986 (https://phabricator.wikimedia.org/T375186) (owner: Arnaudb)
[11:54:24] <wikibugs>	 (PS1) Effie Mouzeli: kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878)
[11:55:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:55:12] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:55:19] <wikibugs>	 (CR) Jforrester: "Neat!" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1073461 (owner: Hashar)
[11:55:51] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2425.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[11:57:10] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1072690 (owner: Muehlenhoff)
[11:59:03] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 375, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:59:12] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 293, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:59:52] <wikibugs>	 (CR) Kamila Součková: [C:+1] kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) (owner: Effie Mouzeli)
[12:00:37] <wikibugs>	 (CR) Clément Goubert: [C:+1] kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) (owner: Effie Mouzeli)
[12:00:50] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) (owner: Effie Mouzeli)
[12:02:34] <wikibugs>	 (CR) Muehlenhoff: [C:+2] bacula::storage: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1074427 (owner: Muehlenhoff)
[12:02:58] <moritzm>	 effie: I'll merge your patch along
[12:03:46] <effie>	 cheers thanx
[12:04:15] <moritzm>	 merged
[12:05:12] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Enable and scrape gNMIc api Prometheus endpoint [puppet] - https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) (owner: Ayounsi)
[12:05:36] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] elasticsearch: monitor snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[12:09:12] <wikibugs>	 (CR) Muehlenhoff: [C:+2] No longer include config-master on Puppet 5 frontends [puppet] - https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: Muehlenhoff)
[12:10:45] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2425.codfw.wmnet
[12:10:46] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2425.codfw.wmnet
[12:12:30] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2425 to wikikube-worker2125
[12:12:40] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[12:12:56] <jynus>	 !log restarting replication on pc1013 after crash T375382
[12:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:04] <stashbot>	 T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[12:13:15] <jynus>	 ^ heads up _joe_ moritzm this could have caused some mw errors
[12:13:30] <_joe_>	 ack, thanks
[12:14:01] <jynus>	 in the past it used to be very loggy, but I think it wasn't noticed that much this time
[12:14:29] <moritzm>	 ok
[12:14:29] <icinga-wm>	 PROBLEM - config-master.wikimedia.org requires authentication on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:14:33] <jynus>	 I think especially as it was only down for 9 seconds
[12:14:44] <moritzm>	 the config-master alert should be harmless
[12:15:07] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:15:13] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:15:54] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2425 to wikikube-worker2125 - jiji@cumin1002"
[12:17:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: apache2.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:17:32] <wikibugs>	 (CR) Filippo Giunchedi: Add monitoring to network devices gRPC endpoints (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1074435 (owner: Ayounsi)
[12:17:34] <wikibugs>	 (CR) Ayounsi: [C:+2] Enable and scrape gNMIc api Prometheus endpoint [puppet] - https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) (owner: Ayounsi)
[12:17:37] <wikibugs>	 (PS1) Muehlenhoff: Revert "No longer include config-master on Puppet 5 frontends" [puppet] - https://gerrit.wikimedia.org/r/1074994
[12:18:01] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2425 to wikikube-worker2125 - jiji@cumin1002"
[12:18:02] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:18:02] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2125
[12:18:17] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2125
[12:18:56] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2425 to wikikube-worker2125
[12:20:32] <wikibugs>	 (PS1) Slyngshede: C:idm setup structlogger instance. [puppet] - https://gerrit.wikimedia.org/r/1074998
[12:21:26] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4086/console" [puppet] - https://gerrit.wikimedia.org/r/1074998 (owner: Slyngshede)
[12:22:02] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2125.codfw.wmnet on all recursors
[12:22:05] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2125.codfw.wmnet on all recursors
[12:22:19] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4087/console" [puppet] - https://gerrit.wikimedia.org/r/1074998 (owner: Slyngshede)
[12:24:15] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4088/co" [puppet] - https://gerrit.wikimedia.org/r/1074998 (owner: Slyngshede)
[12:24:42] <wikibugs>	 ops-eqiad, DBA, DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167250 (ABran-WMF) as @jcrespo  found on P69389 this crash is due to a memory issue on channel:0 slot:1
[12:24:43] <wikibugs>	 (CR) Slyngshede: [V:+1 C:+2] C:idm setup structlogger instance. [puppet] - https://gerrit.wikimedia.org/r/1074998 (owner: Slyngshede)
[12:26:05] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - https://gerrit.wikimedia.org/r/1075000
[12:26:14] <wikibugs>	 (PS2) DCausse: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - https://gerrit.wikimedia.org/r/1075000
[12:26:50] <wikibugs>	 ops-eqiad, DBA, DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167256 (ABran-WMF) This confirm the position of the stick that is in error in DIMM slot A9: {F57531822} {F57531824}
[12:27:04] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2125.codfw.wmnet
[12:27:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: apache2.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:27] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2125.codfw.wmnet with OS bullseye
[12:27:37] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2125
[12:27:43] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[12:27:54] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - https://gerrit.wikimedia.org/r/1075000 (owner: DCausse)
[12:29:33] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - https://gerrit.wikimedia.org/r/1075000 (owner: DCausse)
[12:29:48] <wikibugs>	 (PS3) Brouberol: cloudnative-pg-cluster: facilitate the import of an external database [deployment-charts] - https://gerrit.wikimedia.org/r/1074968 (https://phabricator.wikimedia.org/T374950)
[12:32:19] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:32:33] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:33:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:35:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:35:56] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2125 - jiji@cumin1002"
[12:36:00] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2125 - jiji@cumin1002"
[12:36:00] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:36:00] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2125.codfw.wmnet 81.0.192.10.in-addr.arpa 1.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:36:03] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2125.codfw.wmnet 81.0.192.10.in-addr.arpa 1.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:36:04] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2125
[12:36:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2125
[12:36:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2125
[12:44:16] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1075005
[12:50:59] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167324 (VRiley-WMF) After working with Dell and explaining the issue, they can confirm that there is no hardware issues in the TSR report. I did provide them the image that @Jclark-ct...
[12:54:13] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: use kafka "external-services" fqdn with use_all_dns_ips [deployment-charts] - https://gerrit.wikimedia.org/r/1075006
[12:54:17] <logmsgbot>	 !log mnz@deploy1003 Started deploy [airflow-dags/research@3e2d3b8]: deploy reference risk DAG
[12:54:23] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2125.codfw.wmnet with reason: host reimage
[12:54:52] <logmsgbot>	 !log mnz@deploy1003 Finished deploy [airflow-dags/research@3e2d3b8]: deploy reference risk DAG (duration: 00m 59s)
[12:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:58:01] <wikibugs>	 (CR) DCausse: "Tested with one job & kafka-main in I0cc7640 and worked ok." [deployment-charts] - https://gerrit.wikimedia.org/r/1075006 (owner: DCausse)
[12:58:04] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2125.codfw.wmnet with reason: host reimage
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:11] <Lucas_WMDE>	 I can’t deploy anyway, so good ^^
[13:02:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1069991 (owner: EoghanGaffney)
[13:03:03] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10167367 (MoritzMuehlenhoff)
[13:13:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:15:12] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:15:16] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:15:24] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:17:12] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:17:16] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:17:24] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:17:30] <wikibugs>	 (PS1) Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283)
[13:18:11] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2125.codfw.wmnet with OS bullseye
[13:19:18] <wikibugs>	 (CR) CI reject: [V:-1] Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:21:02] <effie>	 !log homer cr*codfw* commit 'T372878'
[13:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:06] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[13:21:24] <effie>	 !log homer lsw1-a6-codfw* commit 'T372878'
[13:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:15] <wikibugs>	 (PS5) Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200)
[13:23:33] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: Ssingh)
[13:23:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:24:06] <wikibugs>	 (CR) CI reject: [V:-1] dnsrecursor: add optional setting of extended-resolution-errors [puppet] - https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: Ssingh)
[13:25:16] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10167416 (Volans) Thanks for the summary @ssingh. I have a local proposal that will send out when ready. There is one main point to decide and...
[13:27:32] <wikibugs>	 (PS2) Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283)
[13:29:22] <wikibugs>	 (CR) CI reject: [V:-1] Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:29:28] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 291, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:29:58] <wikibugs>	 (PS3) Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283)
[13:30:44] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on cr3-ulsfo with reason: waiting for JTAC
[13:30:58] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cr3-ulsfo with reason: waiting for JTAC
[13:31:04] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10167444 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a9eff4bb-15d3-41a4-8dd6-65ccc0663c06) set by ayounsi@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their serv...
[13:31:47] <wikibugs>	 (CR) CI reject: [V:-1] Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:31:50] <wikibugs>	 (PS4) Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283)
[13:33:41] <wikibugs>	 (CR) CI reject: [V:-1] Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:33:47] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2125.codfw.wmnet
[13:33:49] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2125.codfw.wmnet
[13:33:49] <wikibugs>	 (PS5) Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283)
[13:33:51] <logmsgbot>	 !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2125.codfw.wmnet
[13:35:32] <wikibugs>	 (CR) Btullis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:35:39] <wikibugs>	 (CR) CI reject: [V:-1] Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:36:07] <wikibugs>	 (PS3) Hashar: contint: define component/ci only once [puppet] - https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278)
[13:37:19] <wikibugs>	 (PS1) Stevemunene: hdfs: add new an-workers to insetup role [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788)
[13:37:30] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 373, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:38:07] <wikibugs>	 (CR) CI reject: [V:-1] contint: define component/ci only once [puppet] - https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: Hashar)
[13:38:19] <wikibugs>	 (PS6) Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283)
[13:38:56] <hashar>	 00:00:54.145   1) profile::configmaster on debian-11-x86_64 test compilation with default parameters is expected to compile into a catalogue without dependency cycles
[13:38:56] <hashar>	 00:00:54.145        error during compilation: Function lookup() did not find a value for the name 'profile::configmaster::server_name' (file: /srv/workspace/puppet/modules/profile/manifests/configmaster.pp, line: 8) on node 4c9703cf2f06.integration.eqiad1.wikimedia.cloud
[13:39:06] <hashar>	 something is broken in the puppet specs
[13:40:05] <sukhe>	 yeah
[13:40:08] <sukhe>	 see -sre
[13:40:09] <wikibugs>	 (CR) CI reject: [V:-1] Add a new partition recipe for k8s workers with local storage [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:40:10] <sukhe>	 sending a patch
[13:41:02] <wikibugs>	 (PS4) Bartosz Dziewoński: SSO domain shouldn't have a mobile version [mediawiki-config] - https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272)
[13:41:33] <hashar>	 sukhe: thank you! )
[13:42:01] <wikibugs>	 (CR) Bartosz Dziewoński: "Done. I also tweaked the logic to avoid repeating the domain name more times than necessary." [mediawiki-config] - https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: Bartosz Dziewoński)
[13:42:30] <wikibugs>	 (PS1) Ssingh: spec: remove profile_configmaster_spec.rb [puppet] - https://gerrit.wikimedia.org/r/1075013
[13:42:44] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:42:46] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:43:44] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:43:46] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:44:43] <wikibugs>	 (CR) Btullis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:49:07] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167485 (Papaul) @Jclark-ctr @ABran-WMF  @VRiley-WMF can I take over this task and try to re-image it?
[13:49:45] <wikibugs>	 (CR) Btullis: hdfs: add new an-workers to insetup role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[13:50:28] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr3-ulsfo
[13:50:36] <wikibugs>	 (CR) Muehlenhoff: spec: remove profile_configmaster_spec.rb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[13:50:49] <wikibugs>	 (PS1) Giuseppe Lavagetto: service_proxy: Add a listener for the http interface of graphite [puppet] - https://gerrit.wikimedia.org/r/1075016 (https://phabricator.wikimedia.org/T374887)
[13:50:55] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo
[13:51:03] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr4-ulsfo
[13:51:23] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo
[13:52:20] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr3-eqsin
[13:52:34] <wikibugs>	 (PS2) Stevemunene: hdfs: add new an-workers to insetup role [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788)
[13:52:54] <wikibugs>	 (CR) Majavah: "i think it should be possible to fix the tests instead of removing them, see inline" [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[13:52:54] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin
[13:53:48] <wikibugs>	 (CR) Stevemunene: hdfs: add new an-workers to insetup role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[13:54:44] <wikibugs>	 (CR) Brouberol: [C:+1] hdfs: add new an-workers to insetup role [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[13:55:49] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqsin
[13:56:15] <wikibugs>	 (CR) Hashar: "CI fails due to a temporary glitch in the rspec tests." [puppet] - https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: Hashar)
[13:56:23] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqsin
[13:56:38] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-drmrs
[13:56:58] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs
[13:57:27] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-drmrs
[13:57:45] <wikibugs>	 (PS1) Giuseppe Lavagetto: ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887)
[13:57:47] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs
[13:57:48] <wikibugs>	 (CR) Btullis: [C:+1] hdfs: add new an-workers to insetup role [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[13:58:27] <wikibugs>	 (CR) CI reject: [V:-1] ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: Giuseppe Lavagetto)
[13:58:28] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-b12-drmrs
[13:58:35] <wikibugs>	 (CR) Btullis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[13:58:41] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b12-drmrs
[13:59:07] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-b13-drmrs
[13:59:17] <wikibugs>	 (CR) Stevemunene: [C:+2] hdfs: add new an-workers to insetup role [puppet] - https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: Stevemunene)
[13:59:20] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs
[13:59:47] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry1005.eqiad.wmnet with OS bookworm
[14:00:32] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-esams
[14:00:54] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams
[14:02:17] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-bw27-esams
[14:02:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams
[14:02:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:02:44] <jinxer-wm>	 Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[14:02:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:03:30] <wikibugs>	 (CR) Ssingh: "No strong opinions either way, I will just update the spec." [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[14:03:43] <logmsgbot>	 !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@3e2d3b8]: Deploy latest DAGs to analytics Airflow instance. T369868.
[14:03:56] <stashbot>	 T369868: Improve handling of delete, restore, and merge from incremental update - https://phabricator.wikimedia.org/T369868
[14:04:32] <logmsgbot>	 !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@3e2d3b8]: Deploy latest DAGs to analytics Airflow instance. T369868. (duration: 00m 48s)
[14:06:20] <wikibugs>	 (CR) Ssingh: "error during compilation: Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Httpd] is already de" [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[14:06:35] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10167577 (MoritzMuehlenhoff) p:Triage→Medium
[14:06:51] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167571 (Volans) AFAIK `pc1015` should be the candidate host if we want to fail it over, from `dbctl`: `         "note": "Hot spare for pc4 and cold spare for pc3", `
[14:07:15] <wikibugs>	 (CR) Ssingh: "^ Running it locally, seems like there is more work required." [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[14:07:23] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Let's just remove it, not sure if it's actually still useful for anything." [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[14:12:47] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage
[14:13:13] <wikibugs>	 (PS2) Ssingh: spec: remove profile_configmaster_spec.rb [puppet] - https://gerrit.wikimedia.org/r/1075013
[14:13:49] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-by27-esams
[14:13:55] <wikibugs>	 (CR) Ssingh: "I tried fixing it but since this blocks CI, I am removing it. If someone has a fix, please feel free to update it 😊" [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[14:14:02] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams
[14:14:10] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-esams
[14:14:30] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams
[14:15:38] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-codfw
[14:15:49] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw
[14:16:04] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage
[14:16:20] <wikibugs>	 (CR) Ssingh: [C:+2] spec: remove profile_configmaster_spec.rb [puppet] - https://gerrit.wikimedia.org/r/1075013 (owner: Ssingh)
[14:16:47] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-codfw
[14:16:58] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw
[14:17:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[14:17:44] <jinxer-wm>	 Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ...
[14:17:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:17:46] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167645 (jcrespo) good catch, let's then start by moving replication from pc4 to: pc3: pc1013 -> pc1015, in the earliest binlog possible, for warmup (this should be a noop), and later we can patch/run dbct...
[14:18:16] <wikibugs>	 (PS6) Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200)
[14:19:35] <wikibugs>	 (CR) Filippo Giunchedi: "thirdparty/otelcol-contrib isn't a thing in bookworm and will need to be added prior to this patch" [puppet] - https://gerrit.wikimedia.org/r/1074434 (owner: Herron)
[14:21:47] <wikibugs>	 (CR) Filippo Giunchedi: [C:-1] "I tested this in Pontoon and I'm getting invalid configuration:" [puppet] - https://gerrit.wikimedia.org/r/1072586 (owner: Herron)
[14:24:32] <wikibugs>	 (PS1) Jcrespo: mariadb: Move pc1015 configuration to master of pc3 section [puppet] - https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382)
[14:24:54] <wikibugs>	 (CR) Ssingh: "Turning this on only for Wikimedia DNS. We will turn this on for internal recursors next week. I am pretty sure this should be fine but no" [puppet] - https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: Ssingh)
[14:25:21] <wikibugs>	 (PS1) Herron: apt: add thirdparty/otelcol-contrib bookworm component [puppet] - https://gerrit.wikimedia.org/r/1075025
[14:25:54] <wikibugs>	 (CR) Ssingh: [C:+2] dnsrecursor: add optional setting of extended-resolution-errors [puppet] - https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: Ssingh)
[14:26:13] <wikibugs>	 (CR) Mforns: hieradata::services_proxy::envoy.yaml: fix duplicated port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: Mforns)
[14:27:33] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] apt: add thirdparty/otelcol-contrib bookworm component [puppet] - https://gerrit.wikimedia.org/r/1075025 (owner: Herron)
[14:27:42] <wikibugs>	 (CR) Herron: [C:+2] apt: add thirdparty/otelcol-contrib bookworm component [puppet] - https://gerrit.wikimedia.org/r/1075025 (owner: Herron)
[14:29:32] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167674 (ABran-WMF) sure! you can reimage it @Papaul
[14:30:04] <wikibugs>	 (CR) Bking: [C:+1] "+1 to merge once the change passes CI. Partman is a "guess and check" type application so there may be more iterations ;)" [puppet] - https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: Btullis)
[14:30:15] <jynus>	 !log restarting and moving replication source of pc1015 T375382
[14:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:27] <stashbot>	 T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[14:30:48] <sukhe>	 !log sudo cumin 'O:wikidough' 'run-puppet-agent'
[14:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:31] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host registry1005.eqiad.wmnet with OS bookworm
[14:33:54] <wikibugs>	 (03PS9) 10Herron: thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586
[14:36:51] <wikibugs>	 (03PS1) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[14:37:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[14:37:15] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: Move pc1015 configuration to master of pc3 section [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[14:37:51] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: Move pc1015 configuration to master of pc3 section (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[14:38:04] <wikibugs>	 (03PS2) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[14:38:13] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:18] <wikibugs>	 (03PS3) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[14:38:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[14:38:50] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[14:40:28] <wikibugs>	 (03PS4) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[14:41:14] <wikibugs>	 (03PS1) 10MVernon: hiera: specify cluster for apus nodes [puppet] - 10https://gerrit.wikimedia.org/r/1075027 (https://phabricator.wikimedia.org/T279621)
[14:42:39] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[14:45:48] <wikibugs>	 (03PS5) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408)
[14:47:03] <icinga-wm>	 PROBLEM - Host pc1013 #page is DOWN: PING CRITICAL - Packet loss = 100%
[14:47:19] <sukhe>	 hi
[14:47:21] <vgutierrez>	 !incidents
[14:47:21] <sukhe>	 !incidents
[14:47:21] <sirenbot>	 5267 (ACKED)  Host pc1013 (paged) - PING  - Packet loss = 100%
[14:47:21] <sirenbot>	 5267 (ACKED)  Host pc1013 (paged) - PING  - Packet loss = 100%
[14:47:25] <volans>	 jynus: our friend came back
[14:47:26] <denisse>	 Here.
[14:47:29] * Emperor here
[14:47:29] <sukhe>	 !ack 5267
[14:47:30] <sirenbot>	 5267 (ACKED)  Host pc1013 (paged) - PING  - Packet loss = 100%
[14:47:45] <_joe_>	 volans: wdym?
[14:47:50] <volans>	 I guess we might have to force the failover earlier than expected...
[14:48:02] <volans>	 _joe_: it had already failed
[14:48:04] <wikibugs>	 (03CR) 10Jcrespo: "heads up" [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[14:48:05] <_joe_>	 volans: are you handling the alert?
[14:48:12] <sukhe>	 https://sal.toolforge.org/log/QZTMHpIBFk7ipym_lMyU
[14:48:24] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: name=registry1005.eqiad.wmnet
[14:48:38] <Emperor>	 Is it me, or did that p.age everyone immediately rather than just the oncall folk?
[14:48:39] <moritzm>	 there's a DIMM error in SEL
[14:48:56] <jynus>	 yeah: T374215
[14:48:56] <stashbot>	 T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215
[14:49:00] <jynus>	 not that
[14:49:01] <_joe_>	 Emperor: no idea because I'm oncall
[14:49:02] <sukhe>	 Emperor: weird because I ACKed it here even before it paged on the app
[14:49:04] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc3 on pc2013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1013.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1013.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:49:10] <kamila_>	 Emperor: didn't page me 
[14:49:12] <jynus>	 T375382
[14:49:13] <stashbot>	 T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[14:49:15] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.93% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:49:17] <volans>	 _joe_: we were discussing if we should failover or not in the DP meeting and the consensus was to try to failover but without a rush checking with the DBAs that are OOO today, but I guess at this point we have to failover sooner than expected
[14:49:17] <_joe_>	 but I responded at the first page
[14:49:21] <moritzm>	 surprisingly it logged a successful self-heal earlier
[14:49:33] <jynus>	 we expected it to keep working for longer, until we failed it over
[14:49:38] <vgutierrez>	 Emperor: I didn't get paged by splunk, just IRC hashtag
[14:49:43] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=registry1005.eqiad.wmnet
[14:49:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński)
[14:49:48] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: name=registry1005.eqiad.wmnet
[14:49:49] <_joe_>	 if there's a dimm error I guess we have no alternative
[14:49:59] <jynus>	 it is not booting up?
[14:50:20] <_joe_>	 is anyone trying to boot it?
[14:50:22] <moritzm>	 console is dead
[14:50:26] <moritzm>	 I'll powercycle it
[14:50:30] <_joe_>	 yep
[14:50:33] <Emperor>	 oh, yes, sorry, I'm an idiot and got emailed by nagios rather than p.aged by splunk
[14:50:46] <sukhe>	 is there anything we need to do in the meantime?
[14:50:46] <jynus>	 I'd prefer to boot it up and later failover than do it without
[14:50:56] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true,weight=10; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad
[14:50:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167799 (10Papaul) @ABran-WMF osorry forgot to ask, are we re-imaging with Bullseye?
[14:51:16] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: Move pc1015 configuration to master of pc3 section (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[14:51:16] <_joe_>	 jynus: ack, moritz has powercycled it AIUI
[14:51:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frlog2001 - https://phabricator.wikimedia.org/T375239#10167791 (10Jhancock.wm) a:03Papaul
[14:51:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frpm2001 - https://phabricator.wikimedia.org/T375297#10167797 (10Jhancock.wm) a:03Papaul
[14:51:31] <jynus>	 let's see if it comes back, it will be faster
[14:51:35] <moritzm>	 !log powercycle pc1013 (DIMM error in DIMM_A9)
[14:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:39] <jynus>	 if not I am preparing pc1015
[14:51:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10167805 (10Jhancock.wm) I forgot to hit submit on my last update. pay-lb2001 was moved on Friday.   The two latest decons have left us with another...
[14:52:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:52:15] <jynus>	 how bad it is for mediawiki errors?
[14:52:21] <jynus>	 I see
[14:52:26] <_joe_>	 yeah
[14:52:45] <jynus>	 merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075024
[14:52:46] <moritzm>	 it's booting now, took a while to get POST checks
[14:52:53] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn)
[14:52:57] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm)
[14:52:58] <jynus>	 sometimes it asks for an enter: moritzm
[14:53:05] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mariadb: Move pc1015 configuration to master of pc3 section [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[14:53:34] <wikibugs>	 (03PS4) 10Hashar: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485)
[14:53:43] <moritzm>	 grub is up now
[14:53:48] <jynus>	 root@puppetmaster1001:~$ puppet-merge: To ensure consistent locking please run puppet-merge from: puppetserver1001.eqiad.wmnet
[14:53:49] <moritzm>	 and system is booting
[14:53:53] <jynus>	 help with this ^
[14:54:02] <volans>	 jynus: just go to puppetserver1001
[14:54:05] <_joe_>	 jynus: go to puppetserver1001 :)
[14:54:08] <volans>	 same UI as before
[14:54:08] <sukhe>	 jynus: just run from puppetserver
[14:54:11] <jynus>	 ok, I am stupid
[14:54:12] <sukhe>	 nothing else changed
[14:54:13] <icinga-wm>	 RECOVERY - Host pc1013 #page is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[14:54:15] <jinxer-wm>	 FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.63% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:54:18] <sukhe>	 ok great
[14:54:21] <moritzm>	 pc1013 is back
[14:54:24] <sukhe>	 thanks moritzm <3
[14:54:32] <_joe_>	 well the server is up
[14:54:36] <moritzm>	 the question is whether this is stable enough or will re-appear
[14:54:37] <_joe_>	 mariadb isn't I guess
[14:54:40] <jynus>	 moritzm: it will
[14:54:49] <_joe_>	 jynus: are you starting the database?
[14:54:59] <jynus>	 I am on it
[14:55:04] <_joe_>	 ack
[14:55:22] <arnaudb>	 Sep 23 10:58:45 pc1013 kernel: MCE: Killing mysqld:1332 due to hardware memory corruption fault at 7f4e020fd5c0
[14:55:31] <arnaudb>	 last line of the previous boot kernel log
[14:55:37] <jynus>	 same thing that happened before
[14:55:38] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:55:41] <arnaudb>	 yep no surprise here
[14:55:46] <_joe_>	 !incidents
[14:55:47] <sirenbot>	 5267 (ACKED)  Host pc1013 (paged) - PING  - Packet loss = 100%
[14:55:51] <jynus>	 service up, outage should fix now
[14:55:57] <jynus>	 but I would like to do the failover asap
[14:56:01] <jynus>	 so it won't happen again
[14:56:05] <volans>	 +1
[14:56:10] <_joe_>	 yeah I think it's sensible at this point, +1
[14:56:15] <arnaudb>	 thanks jynus 
[14:56:16] <denisse>	 +1
[14:56:25] <jynus>	 I need some help as it is currently depooled on 2 sections
[14:56:34] <jynus>	 I am not familiar with day to day dbctl operations
[14:56:49] <_joe_>	 jynus: I can try to help, and so can volans I guess
[14:56:52] <kamila_>	 +1
[14:56:54] <arnaudb>	 i can too
[14:56:54] <volans>	 I would go with dbctl edit
[14:57:04] <volans>	 and just adjust it at your will or I can if you prefer
[14:57:10] <volans>	 and you check it before committing
[14:57:10] <jynus>	 if the rest can confirm mw okness
[14:57:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:57:17] <jynus>	 while we go through the failover
[14:57:25] <_joe_>	 jynus: ack
[14:57:34] <jynus>	 volans: either would work
[14:57:44] <arnaudb>	 server is unresponsive again
[14:57:47] <jynus>	 :-(
[14:57:50] <_joe_>	 sigh
[14:57:56] <_joe_>	 ok
[14:57:59] * volans preparing dbctl edit
[14:58:01] <jynus>	 yep, it crashed again
[14:58:02] <volans>	 to submit for review
[14:58:06] <_joe_>	 yep
[14:58:07] <arnaudb>	 ack volans 
[14:58:12] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs: allow 3 new federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074199 (https://phabricator.wikimedia.org/T364233) (owner: 10Ryan Kemper)
[14:58:26] <_joe_>	 ok, I'll monitor mediawiki
[14:58:39] <jynus>	 I would like to restart pc1015 once before pooling it
[14:58:41] <jynus>	 doing it now
[14:58:48] <jynus>	 to apply puppet changes
[14:59:14] <_joe_>	 oh you mean mariadb, not the whole server
[14:59:15] <jinxer-wm>	 RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.36% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:59:18] <moritzm>	 meh, pc1013 is OOW since less than three months...
[14:59:18] <arnaudb>	 jynus: maybe an upgrade cookbook would be nice ?
[14:59:20] <wikibugs>	 (03PS10) 10Herron: thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586
[14:59:28] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:41] <jynus>	 arnaudb: that's ok, it is the restart after the puppet config that needs to be done
[14:59:42] <arnaudb>	 (depending on the production impact)
[14:59:46] <arnaudb>	 ack
[15:00:27] <jynus>	 _joe_: the issue is that pc1015 was a hot spare for pc4, not pc5, so it is a longer process
[15:00:50] <_joe_>	 so with the server unresponsive, we're bound to have more slowdowns in mediawiki
[15:01:03] <volans>	 jynus: try  dbctl config diff and check the output
[15:01:04] <_joe_>	 jynus: take your time
[15:01:28] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#10167841 (10brouberol) `cirrus-streaming-updater` is replacing the list of brokers by the external services service name: https://gerrit.wi...
[15:01:28] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks great, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074968 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol)
[15:01:35] <_joe_>	 right now pc1013 responds to network, so it refuses connections and that is fast. The problem is when it's down, we have a pretty generous connection timeout
[15:01:40] <jynus>	 volans: looks good, let me be sure pc1015 is ok
[15:01:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse)
[15:01:45] <volans>	 sure
[15:01:59] <arnaudb>	 lgtm volans looks like what we do in other switchovers
[15:02:11] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add a presto cluster and assign the relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis)
[15:02:13] <jynus>	 volans: we are good, commit
[15:02:20] <jynus>	 and we now fix codfw replication
[15:02:22] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add an airflow cluster and assign relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis)
[15:02:28] <volans>	 you can also check dbctl -s eqiad section pc3 get and  dbctl instance pc1015 get
[15:02:37] <volans>	 ok committing
[15:02:39] <jynus>	 just commit, it is ok
[15:02:56] <jynus>	 we may have to tune candidate master et al
[15:02:58] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: facilitate the import of an external database [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074968 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol)
[15:03:01] <jynus>	 but that's not important
[15:03:09] <volans>	 {done}
[15:03:21] <logmsgbot>	 !log volans@cumin1002 dbctl commit (dc=all): 'emergency failover pc3 to pc1015', diff saved to https://phabricator.wikimedia.org/P69396 and previous config saved to /var/cache/conftool/dbconfig/20240923-150320-volans.json
[15:03:28] <jynus>	 I see the users coming in
[15:03:31] <arnaudb>	 response time looks ok again
[15:03:41] <_joe_>	 arnaudb: see my explanation above
[15:03:45] <volans>	 the cache is cold though
[15:03:59] <jynus>	 volans: as a note for myself we need to switch pc3-codfw to replicate from pc1015-bin.099184 |    33086
[15:04:04] <volans>	 it was a hot spare for pc4... we got unlucky
[15:04:07] <arnaudb>	 ack _joe_ I missed it in the scroll thanks!
[15:04:26] <wikibugs>	 (03CR) 10Herron: "good catch thanks! the updated PS (and after sorting out the otelcol-contrib component) has thanos-query looking much better on phi-titan-" [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron)
[15:04:29] <jynus>	 _joe_: mw better?
[15:04:53] <_joe_>	 jynus: it was better as soon as it could get a connection refused from pc1013
[15:04:56] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:08] <jynus>	 I will switch pc3-codfw
[15:05:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[15:05:14] <jynus>	 arnaudb: can you handle orchestrator
[15:05:22] <arnaudb>	 yep
[15:05:32] <jynus>	 and tendril if possible to update the master
[15:05:33] <_joe_>	 moritzm: planned obsolescence!
[15:05:45] <_joe_>	 sorry I just saw your comment about OOW :)
[15:05:49] <jynus>	 I will update pc2013 replication
[15:06:05] <arnaudb>	 I'll paste my edit log here to ensure everything is squared
[15:06:06] <_joe_>	 jynus: <3
[15:06:31] <jynus>	 normally after an uncorrectable error, the memory stick just disables itself
[15:06:40] <wikibugs>	 (03CR) 10Santiago Faci: hieradata::services_proxy::envoy.yaml: fix duplicated port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns)
[15:06:53] <jynus>	 in this case it crashed every time it reached the bit (after X minutes after buffer pool load)
[15:07:21] <wikibugs>	 (03PS3) 10Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287
[15:07:38] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc3 on pc2013 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:07:44] <moritzm>	 yeah, but the ones logged are the uncorrectable multi-bit failures
[15:08:02] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc3 on pc2013 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:08:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 (owner: 10Krinkle)
[15:08:54] <jynus>	 pc2013 should be fine now more or less
[15:09:12] <jynus>	 I will setup the circular replication
[15:09:20] <jynus>	 and then will help with monitoring
[15:09:46] <jynus>	 we have to silence pc1013 too
[15:09:59] <jynus>	 if someone can send the patch to disable monitoring there
[15:10:04] <jynus>	 on hiera
[15:10:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron)
[15:10:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] titan: add opentelemetry collector [puppet] - 10https://gerrit.wikimedia.org/r/1074434 (owner: 10Herron)
[15:10:57] <moritzm>	 arnaudb: can you open a task with ops-eqiad added to look into pc1013? while it's OOW, in many cases we have parts from decommissioned, but not yet recycled servers we can swap in
[15:11:30] <arnaudb>	 moritzm: sure, aside of T375382 right?
[15:11:31] <stashbot>	 T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382
[15:11:52] <wikibugs>	 (03CR) 10Herron: [C:03+2] titan: add opentelemetry collector [puppet] - 10https://gerrit.wikimedia.org/r/1074434 (owner: 10Herron)
[15:12:12] <moritzm>	 oh, sorry I had missed that task, then no need
[15:12:27] <arnaudb>	 ack, was unsure it needed one, I'll mention it then!
[15:12:29] <jynus>	 circular replication setup
[15:12:41] <volans>	 thx
[15:12:51] <jynus>	 sadly we will have an empty cache, as volans mentioned, which is why I was waiting for it to warm up
[15:12:59] <jynus>	 (before the incident)
[15:14:59] <logmsgbot>	 !log stevemunene@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: allow 3 new endpoints T364233 T368085 T374195
[15:15:01] <jynus>	 cleaning up heartbeat table so orchestrator and monitoring get better
[15:15:06] <stashbot>	 T364233: add https://imagehash-sparql.wmcloud.org/sparql endpoint to wikidata federated query whitelists - https://phabricator.wikimedia.org/T364233
[15:15:07] <stashbot>	 T368085: Allow federated queries with Dbnary (kaiko.getalp.org) - https://phabricator.wikimedia.org/T368085
[15:15:07] <stashbot>	 T374195: Add https://metabase.wikibase.cloud/query/sparql to the Wikidata Federated Query Whitelist - https://phabricator.wikimedia.org/T374195
[15:15:26] <arnaudb>	 jynus: as far as orch goes, I should tag both hosts as what? co-master? master?
[15:15:40] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:15:47] <jynus>	 orchwise? yep, it is a circular replication, active-active all the time
[15:15:50] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:16:02] <jynus>	 should show now 0 seconds
[15:16:04] <icinga-wm>	 PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167874 (10ABran-WMF) @MoritzMuehlenhoff mentionned that we might have spare parts available for this server from decommssioned, but not yet recycled servers : @wiki_willy  I'm not sure...
[15:16:10] <arnaudb>	 I'll tag them as co-master then
[15:16:34] <jynus>	 should I prepare the disable notifications of pc1013?
[15:16:41] <arnaudb>	 I think you can
[15:16:45] <jynus>	 doing
[15:16:56] <arnaudb>	 I'll struggle with orchestrator command line for a bit
[15:17:02] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:17:16] <jynus>	 it's ok, those are not immediate issues
[15:17:32] <jynus>	 orch look ok to me now
[15:17:48] <jynus>	 I don't think there is anything to do there, other than handle pc1013
[15:18:41] <arnaudb>	 this is already on its way as it's been depooled and dc-ops have been mentioned to see if we have some memory stick available 
[15:18:42] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 1.958 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:18:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:19:30] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:19:54] <volans>	 see also in private my alternative proposal :)
[15:20:23] <wikibugs>	 (03PS1) 10Herron: opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034
[15:20:50] <logmsgbot>	 !log stevemunene@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: allow 3 new endpoints T364233 T368085 T374195 (duration: 05m 51s)
[15:20:54] <wikibugs>	 (03CR) 10CDanis: [C:03+1] opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034 (owner: 10Herron)
[15:20:57] <stashbot>	 T364233: add https://imagehash-sparql.wmcloud.org/sparql endpoint to wikidata federated query whitelists - https://phabricator.wikimedia.org/T364233
[15:20:57] <stashbot>	 T368085: Allow federated queries with Dbnary (kaiko.getalp.org) - https://phabricator.wikimedia.org/T368085
[15:20:58] <stashbot>	 T374195: Add https://metabase.wikibase.cloud/query/sparql to the Wikidata Federated Query Whitelist - https://phabricator.wikimedia.org/T374195
[15:21:06] <icinga-wm>	 RECOVERY - Host cr3-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.53 ms
[15:21:16] <wikibugs>	 (03PS2) 10Herron: opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034
[15:21:38] <wikibugs>	 (03CR) 10CDanis: [C:03+1] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron)
[15:22:06] <wikibugs>	 (03CR) 10Herron: [C:03+2] opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034 (owner: 10Herron)
[15:23:23] <jynus>	 how's mediawiki uncached performance/parsercache performance, is it ok?
[15:23:50] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382)
[15:24:31] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[15:24:39] <jynus>	 there is like a 33% increase in parses: https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1
[15:26:37] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[15:26:48] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382)
[15:26:58] <wikibugs>	 (03CR) 10Jcrespo: [V:03+2 C:03+2] mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo)
[15:29:48] <jynus>	 we should be in a bit of a degraded performance for a few hours
[15:30:07] <jouncebot>	 jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1530).
[15:30:11] <jynus>	 arnaudb: did you update zarcillo, should I?
[15:30:44] <arnaudb>	 I'll do it jynus 
[15:31:17] <jynus>	 sadly, we hit the bug where setting a pc host as master removes its monitoring
[15:31:54] <jynus>	 whatever the puppet config is, it should be switched to whatever x2 has
[15:33:51] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: service_proxy: Add a listener for the http interface of graphite [puppet] - 10https://gerrit.wikimedia.org/r/1075016 (https://phabricator.wikimedia.org/T374887)
[15:33:51] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723)
[15:33:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723)
[15:33:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040
[15:35:17] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart
[15:35:43] <arnaudb>	 temporarily misconfigured zarcillo: https://phabricator.wikimedia.org/P69397 
[15:36:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[15:36:49] <jynus>	 no issues, arnaudb
[15:36:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[15:37:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto)
[15:37:44] <jynus>	 it would be hard for automation to get it, and even if it got it, it only affects the grouping of metrics, not the metrics themselves
[15:38:18] <jynus>	 so the issue is in modules/profile/manifests/mariadb/parsercache.pp
[15:38:53] <jynus>	 it should be like the core ones
[15:40:13] <jynus>	 funnily, it was fixed in the past: https://phabricator.wikimedia.org/rOPUP79104d15efe2bbc049abc7c7dd90584d06bed0be
[15:41:13] <wikibugs>	 (03CR) 10CDanis: git: add replicated_local_repo define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[15:43:19] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] Renamed log field for pipeline migration (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur)
[15:44:02] <wikibugs>	 (03CR) 10Herron: [C:03+2] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron)
[15:45:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1419.eqiad.wmnet, mw1434.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1462.eqiad.wmnet, mw1415.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, parse1021.eqiad.wmnet, mw1435.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1408.eqiad.wmnet, kubernetes1012.eqiad
[15:45:00] <icinga-wm>	 mw1465.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1483.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1024.eqiad.wmnet, mw1464.eqiad.wmnet, mw1381.eqiad.wmnet, mw1352.eqiad.wmnet, parse1018.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, mw1376.eqiad.wmnet, kubernetes1026.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1387.eqia
[15:45:00] <icinga-wm>	  mw1378.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1449.eqiad.wmnet, mw1461.eqiad.wmnet, mw1357.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1467.eqiad.wmnet, kubernetes1020.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal
[15:45:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, parse1013.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1367.eqiad.wmnet, mw1442.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1386.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1462.eqiad.wmnet, mw1415.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1391.eqiad.wmnet, mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw
[15:45:00] <icinga-wm>	 ad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1395.eqiad.wmnet, mw1465.eqiad.wmnet, mw1466.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1419.eqiad.wmnet, kubernetes1059.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1360.eqiad.wmnet, wikikube-worker1001.eqiad.wmnet, parse1012.eqiad.wmnet, wikikube-worker1024.eqiad.wmnet, mw1468.eqiad.wmnet, parse1006.eqiad.wmnet, kubernetes1028.
[15:45:00] <icinga-wm>	 net, wikikube-worker1010.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1024.eqiad.wmnet, kubernetes1062.eqiad.wmnet, mw1464.eqiad.wmnet, parse1021.eqiad.wmnet, mw1431.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal
[15:45:43] <sukhe>	 what's up
[15:46:03] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm
[15:46:14] <cdanis>	 sukhe: looks like eventstreams is not-up (again)
[15:46:18] <sukhe>	 I am still in a meeting so I haven't read the backlog. I can quit the meeting in five
[15:46:38] <cdanis>	 likely T375146
[15:46:55] <cdanis>	 https://phabricator.wikimedia.org/T375146
[15:47:53] <wikibugs>	 06SRE, 06DBA: Parsercache primary master databases should monitor replication - https://phabricator.wikimedia.org/T375395 (10jcrespo) 03NEW
[15:47:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10168087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm
[15:48:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168089 (10jcrespo) I've created T375395 to reflect that, despite being promoted from a replica to a master, and from passive to active, it now has less monitoring than before. I think parsercache should hav...
[15:50:54] <sukhe>	 yeah :|
[15:51:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:51:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:52:53] <wikibugs>	 (03CR) 10Bking: [C:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis)
[15:53:10] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[15:53:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723)
[15:53:57] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723)
[15:53:57] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040
[15:53:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: git: add replicated_local_repo define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[15:56:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[15:56:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[15:56:48] <dcausse>	 jouncebot: nowandnext
[15:56:48] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1530)
[15:56:48] <jouncebot>	 In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700)
[15:56:48] <jouncebot>	 In 1 hour(s) and 3 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700)
[15:56:58] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes mw2424 and mw2425 - https://phabricator.wikimedia.org/T375398 (10jijiki) 03NEW
[15:57:14] <wikibugs>	 (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use kafka "external-services" fqdn with use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse)
[15:57:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto)
[15:58:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10168164 (10elukey) I took a look at puppetserver1002 and even after the change for the 35 workers, the memory used was almost 95%. The heap size usage stops aroun...
[15:58:31] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: use kafka "external-services" fqdn with use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse)
[15:59:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1479.eqiad.wmnet, mw1388.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1408.eqiad.wmnet, mw1389.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1014.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, mw1483.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker100
[15:59:00] <icinga-wm>	 wmnet, mw1458.eqiad.wmnet, parse1006.eqiad.wmnet, mw1381.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1376.eqiad.wmnet, kubernetes1035.eqiad.wmnet, kubernetes1026.eqiad.wmnet, parse1014.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, kubernetes1062.eqiad.wmnet, mw1378.eqiad.wmnet, mw1449.eqiad.wmnet, mw1461.eqiad.wmnet, wikikube-worker1018.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1027.eqia
[15:59:00] <icinga-wm>	  mw1414.eqiad.wmnet, wikikube-worker1019.eqiad.wmnet, mw1485.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, mw1396.eqiad.wmnet, mw1463.eqiad.wmnet, parse1023.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[15:59:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers kubernetes1010.eqiad.wmnet, parse1011.eqiad.wmnet, mw1433.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1386.eqiad.wmnet, mw1479.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, parse1009.eqiad.wmnet, mw1405.eqiad.wmnet, mw1399.eqiad.wmnet, mw1435.eqi
[15:59:00] <icinga-wm>	 , mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1466.eqiad.wmnet, mw1483.eqiad.wmnet, mw1419.eqiad.wmnet, mw1469.eqiad.wmnet, mw1486.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1356.eqiad.wmnet, mw1458.eqiad.wmnet, mw1371.eqiad.wmnet, parse1012.eqiad.wmnet, mw1468.eqiad.wmnet, kubernetes1028.eqiad.wmnet, wikikube-worker10
[15:59:00] <icinga-wm>	 .wmnet, kubernetes1031.eqiad.wmnet, kubernetes1024.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, wikikube-worker1031.eqiad.wmnet, mw1376.e https://wikitech.wikimedia.org/wiki/PyBal
[15:59:33] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[15:59:49] <logmsgbot>	 !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:59:51] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[16:01:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168177 (10jcrespo) {P69398}
[16:01:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:01:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:03:09] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[16:03:31] <logmsgbot>	 !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:04:07] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true,weight=10; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad
[16:05:17] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[16:05:31] <wikibugs>	 (03PS1) 10Elukey: conftool: add registry1005 to the docker-registry pool [puppet] - 10https://gerrit.wikimedia.org/r/1075050 (https://phabricator.wikimedia.org/T332016)
[16:05:33] <logmsgbot>	 !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:06:27] <wikibugs>	 (03CR) 10Elukey: [C:03+2] conftool: add registry1005 to the docker-registry pool [puppet] - 10https://gerrit.wikimedia.org/r/1075050 (https://phabricator.wikimedia.org/T332016) (owner: 10Elukey)
[16:08:13] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad
[16:08:22] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad
[16:08:48] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=registry1003.eqiad.wmnet,service=docker-registry,dc=eqiad
[16:10:39] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10168223 (10Ladsgroup) a:05Ladsgroup→03None It should be done by the per...
[16:12:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168246 (10jcrespo)
[16:13:18] <wikibugs>	 (03CR) 10CDanis: git: add replicated_local_repo define (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto)
[16:18:23] <wikibugs>	 (03PS2) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T375401)
[16:18:36] <wikibugs>	 (03PS3) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T375401)
[16:19:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson)
[16:20:43] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena)
[16:20:44] <wikibugs>	 (03PS2) 10DCausse: rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373)
[16:21:07] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10168311 (10Jhancock.wm)
[16:21:11] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10168315 (10RobH) Ongoing conversations via email with support, they've moved onto scheduling an onsite.  Sent all location details over along with a proposed maint window of October 2nd.  (Everyth...
[16:25:15] <wikibugs>	 06SRE, 06DBA: Parsercache primary master databases should monitor replication - https://phabricator.wikimedia.org/T375395#10168325 (10jcrespo) p:05Triage→03Low May not be needed if pc is rearchitectured at: T373037
[16:28:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168332 (10jcrespo) p:05Medium→03High was unbreak now, high now that issues has been mitigated after pc1013 failover.
[16:28:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10168338 (10Jhancock.wm) a:03Jhancock.wm
[16:29:44] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10168337 (10Jhancock.wm) @elukey hey these are the two new super micro servers I installed last week. I thought it went through without a hitch but something in the BMC didn't take.   logging-hd2004 logging-hd2005 sre...
[16:31:57] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 4 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[16:33:15] <wikibugs>	 (03PS2) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755)
[16:33:20] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal)
[16:33:39] <wikibugs>	 (03CR) 10Milimetric: "just some style thoughts" [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[16:35:02] <wikibugs>	 (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[16:36:25] <wikibugs>	 (03PS3) 10Ebernhardson: cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335)
[16:37:29] <wikibugs>	 (03CR) 10Ladsgroup: "I think that was an oversight. I will fix it." [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French)
[16:37:49] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2424 - https://phabricator.wikimedia.org/T375270#10168371 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm logged into the server and not seeing any issues. looks like it might have healed itself. no memory errors pointing to another issue like that...
[16:38:32] <logmsgbot>	 !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1246.eqiad.wmnet with OS bookworm
[16:38:36] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[16:38:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10168377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm executed with errors: - db1246 (**FAIL**)   -...
[16:39:06] <wikibugs>	 (03PS1) 10Ladsgroup: pc2017: Set it to master [puppet] - 10https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355)
[16:39:32] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson)
[16:39:46] <wikibugs>	 (03Abandoned) 10DErenrich: Add citation-needed-api to toolforge's prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1039850 (https://phabricator.wikimedia.org/T363371) (owner: 10DErenrich)
[16:40:31] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson)
[16:40:48] <wikibugs>	 (03CR) 10Ladsgroup: "I490f73b05d39c41d7b3b2b" [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French)
[16:42:27] <wikibugs>	 (03CR) 10Ottomata: "VERY COOL!" [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis)
[16:43:08] <logmsgbot>	 !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host db1246.eqiad.wmnet
[16:43:23] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:43:30] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:47:05] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10168415 (10BCornwall) Great write-up! I heartily disagree about self-documentation, though. While having clear, understandable code is a must, so too must the user operation: Nobody should have to t...
[16:49:28] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[16:49:33] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:52:59] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 8 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[16:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:56:43] <wikibugs>	 (03CR) 10Dreamy Jazz: "Thanks for the comments. Addressing these now." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[16:58:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10168452 (10Ottomata) Approved
[16:59:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168456 (10wiki_willy) ++ @Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers  >>! In T375382#10167873, @ABran-WMF wrote: > @MoritzMuehlenhoff me...
[16:59:16] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim)
[16:59:27] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim)
[17:00:04] <wikibugs>	 (03Abandoned) 10Jdlrobson: Drop support for non-Codex message box styles in Vector 2022 and Vector [skins/Vector] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074282 (https://phabricator.wikimedia.org/T360668) (owner: 10Jdlrobson)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700)
[17:00:04] <jouncebot>	 ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700).
[17:00:07] <wikibugs>	 (03PS1) 10Ebernhardson: Revert "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075054
[17:02:19] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] Revert "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075054 (owner: 10Ebernhardson)
[17:02:51] <wikibugs>	 (03PS2) 10Jdlrobson: Do not apply table styling rules to Main page [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245)
[17:03:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) (owner: 10Jdlrobson)
[17:03:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075054 (owner: 10Ebernhardson)
[17:04:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10168469 (10wiki_willy) Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr?  Please also keep in mind this server is due to be refreshed in Q2, so a new system will be o...
[17:05:49] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[17:06:02] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[17:06:10] <wikibugs>	 (03CR) 10Jforrester: "CI is complaining that there's no graphite for Beta Cluster, which is irritating." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: 10Giuseppe Lavagetto)
[17:07:27] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:07:32] <sukhe>	 hmm
[17:10:41] <wikibugs>	 (03PS1) 10Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585)
[17:10:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf)
[17:13:29] <wikibugs>	 (03PS1) 10Ebernhardson: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055
[17:14:03] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms
[17:14:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10168489 (10VRiley-WMF) Hey @ABran-WMF as it turns out, we don't happen to have any 2TB to use as a replacement. However, we do have plenty of 4TB drives that should work. Is it okay to move forward with...
[17:14:08] <wikibugs>	 (03PS2) 10Ebernhardson: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055
[17:14:33] <wikibugs>	 (03PS3) 10Ebernhardson: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055 (https://phabricator.wikimedia.org/T374335)
[17:15:13] <wikibugs>	 (03PS3) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486)
[17:15:16] <wikibugs>	 (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[17:16:29] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10168622 (10phaultfinder)
[17:34:57] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[17:35:02] <sukhe>	 hmm ok
[17:35:25] <sukhe>	 virtual chassis, I have no idea where to go from here but let's try
[17:36:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T375401) (owner: 10Jdlrobson)
[17:41:37] <wikibugs>	 (03PS3) 10Ebernhardson: cirrus: Remove unused Regex pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808)
[17:42:06] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "Verified in our dashboards (https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters) this pool counter is now unused." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson)
[17:45:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10168717 (10phaultfinder)
[17:49:27] <sukhe>	 ^ these are known as per papau.l. both asw-{c,d}-codfw are being decommissioned
[17:50:54] <wikibugs>	 (03PS10) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:54:42] <wikibugs>	 (03CR) 10Ebernhardson: "realized i wont be available for the full deploy window, this will likely be rescheduled for thursday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson)
[17:55:24] <wikibugs>	 (03PS11) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:56:37] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4092/co" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:56:54] <wikibugs>	 (03CR) 10BCornwall: "PS10 and PS11 addresses a double-redirect for wikimediafoundation.org that failed pcc" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor)
[17:58:11] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 702496384 and 48 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:58:55] <sukhe>	 fun
[17:59:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375314#10168759 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Swapped out cable. Closing for now.
[18:07:38] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[18:08:08] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] "Looks good, I don't have +2, but I'm ok to merge." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz)
[18:12:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168776 (10VRiley-WMF) Hi! We do have a spare DIMM that we can swap at anytime for this unit. Please let us know when is the best time to proceed with this. Thanks!
[18:21:47] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:21:53] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:24:37] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:24:40] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+1] Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich)
[18:24:43] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:28:47] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:30:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10168850 (10phaultfinder)
[18:31:37] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:33:58] <wikibugs>	 (03PS1) 10Ssingh: P:dns::recursor: set allow_extended_errors to true [puppet] - 10https://gerrit.wikimedia.org/r/1075062
[18:35:01] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4093/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075062 (owner: 10Ssingh)
[18:35:11] <wikibugs>	 (03PS2) 10Ssingh: P:dns::recursor: set allow_extended_errors to true [puppet] - 10https://gerrit.wikimedia.org/r/1075062 (https://phabricator.wikimedia.org/T375414)
[18:36:02] <wikibugs>	 (03CR) 10Ssingh: "Will merge after the switchover." [puppet] - 10https://gerrit.wikimedia.org/r/1075062 (https://phabricator.wikimedia.org/T375414) (owner: 10Ssingh)
[18:38:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:43:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:44:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:46:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 727858512 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:48:13] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:48:30] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:49:19] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418 (10Papaul) 03NEW
[18:49:25] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10168978 (10Papaul) p:05Triage→03Medium
[18:50:15] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419 (10Papaul) 03NEW
[18:50:31] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10168991 (10Papaul) p:05Triage→03Medium
[18:56:41] <icinga-wm>	 PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:04:45] <wikibugs>	 (03PS1) 10Scott French: mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962)
[19:04:45] <wikibugs>	 (03CR) 10Scott French: "Realized today that I forgot to send this one, which is actually needed for Tuesday :) Thanks in advance for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French)
[19:05:11] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:08:43] <icinga-wm>	 RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:09:11] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10169021 (10Eevans) >>! In T375309#10168415, @BCornwall wrote: > Great write-up! I heartily disagree about self-documentation, though. While having clear, understandable code is a must, so too must t...
[19:09:15] <wikibugs>	 (03PS4) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679)
[19:15:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10169025 (10phaultfinder)
[19:15:10] <wikibugs>	 (03PS5) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679)
[19:32:16] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] Declare streams in support of the reconciliation mechanism for Dumps 2.0. (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[19:33:05] <wikibugs>	 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305#10169043 (10jhathaway) Do we have to put the license in every file? The link you mentioned only says "consider". Just seems to be a bit tedious.
[19:33:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:38:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:41:39] <wikibugs>	 (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[19:45:36] <wikibugs>	 (03PS1) 10Ebernhardson: Let PageEntitySerializer.canonicalPageURL accept PageReference [extensions/EventBus] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070675 (https://phabricator.wikimedia.org/T372904) (owner: 10Peter Fischer)
[19:45:37] <wikibugs>	 (03CR) 10Ebernhardson: "Is this intended to be against the master branch? I was pondering abandoning since this is against 1.43.0-wmf.21  and .23 is the minimum d" [extensions/EventBus] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070675 (https://phabricator.wikimedia.org/T372904) (owner: 10Peter Fischer)
[19:46:31] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: Add tdd to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1075070 (https://phabricator.wikimedia.org/T375422)
[19:47:29] <wikibugs>	 (03PS3) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755)
[19:49:16] <wikibugs>	 (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[19:49:53] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Add tdd to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1075070 (https://phabricator.wikimedia.org/T375422) (owner: 10Gerrit maintenance bot)
[19:56:42] <wikibugs>	 (03CR) 10Ottomata: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T2000).
[20:00:05] <jouncebot>	 derenrich and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:24] <derenrich>	 o/
[20:00:51] <derenrich>	 (this is my first patch so excuse any errors)
[20:01:42] <Jdlrobson>	 p/
[20:08:59] <toyofuku>	 Gonna deploy Jon's patches
[20:09:58] <toyofuku>	 Unfortunately I do not have the power to manually +2 the backport patch so I'll do the two config deploys first, then the longer backport
[20:10:31] <Jdlrobson>	 np
[20:10:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim)
[20:10:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) (owner: 10Jdlrobson)
[20:11:59] <wikibugs>	 (03Merged) 10jenkins-bot: Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim)
[20:12:01] <wikibugs>	 (03Merged) 10jenkins-bot: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) (owner: 10Jdlrobson)
[20:12:27] <logmsgbot>	 !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1072600|Remove ProofreadPage dark mode namespaces exception]], [[gerrit:1074490|Promote dark mode for anons on tier 1 wikis (T374679)]]
[20:12:32] <stashbot>	 T374679: Check which projects are ready for dark mode for anons - https://phabricator.wikimedia.org/T374679
[20:13:55] <wikibugs>	 (03PS1) 10Papaul: Remove old switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1075078 (https://phabricator.wikimedia.org/T375419)
[20:14:54] * jan_drewniak toyofuku: ping me when you're done and I'll do derenrich's patch
[20:15:18] * jan_drewniak (I keep pressing shift-enter because that's my slack setup...)
[20:15:22] <toyofuku>	 Sounds good!!  Thank you ☺️
[20:20:50] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Remove old switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1075078 (https://phabricator.wikimedia.org/T375419) (owner: 10Papaul)
[20:21:30] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis)
[20:23:41] <logmsgbot>	 !log toyofuku@deploy1003 jdlrobson, toyofuku, ebrahim: Backport for [[gerrit:1072600|Remove ProofreadPage dark mode namespaces exception]], [[gerrit:1074490|Promote dark mode for anons on tier 1 wikis (T374679)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:23:45] <stashbot>	 T374679: Check which projects are ready for dark mode for anons - https://phabricator.wikimedia.org/T374679
[20:23:52] <toyofuku>	 Jdlrobson: ready for testing~
[20:24:05] <Jdlrobson>	 on it
[20:24:47] <Jdlrobson>	 toyofuku: that's good to go
[20:25:04] <logmsgbot>	 !log toyofuku@deploy1003 jdlrobson, toyofuku, ebrahim: Continuing with sync
[20:27:35] <wikibugs>	 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305#10169272 (10Eevans) >>! In T375305#10169043, @jhathaway wrote: > Do we have to put the license in every file? The link you mentioned only says "consider". Just seems to b...
[20:30:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10169282 (10phaultfinder)
[20:32:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10169305 (10phaultfinder)
[20:35:33] <toyofuku>	 This deploy feels very slow - is it just me?
[20:36:56] <Jdlrobson>	 yes
[20:37:14] <Jdlrobson>	 and am a bit worried that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1074545 hasn't been started yet
[20:37:56] <toyofuku>	 Yeah - what's strange is it's the deploy steps itself that are slow, not like test infra
[20:38:07] <toyofuku>	 So the backport could potentially take an eternity
[20:38:18] <Jdlrobson>	 toyofuku: hmm  my config change seems to be live?
[20:38:26] <toyofuku>	 But like, I'm more concerned about _why_ the deploy is so slow
[20:38:36] <toyofuku>	 Yeah we're at 60% of servers rn
[20:38:41] <Jdlrobson>	 ok gotcha
[20:39:03] <toyofuku>	 But this doesn't feel like an issue with my internet connection, so I'm curious who to tag to make sure our prod infra is not in need of any attention
[20:39:08] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[20:40:00] <toyofuku>	 slow deploy could be either slow network or slow machines, both of which wouldn't be ideal
[20:40:07] <jan_drewniak>	 toyofuku: it could be because it affects message keys. I'm not familiar with the details, but I know message related deploys are slower than deploys that don't involve messages (probably because of the message cache?)
[20:40:09] <logmsgbot>	 !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072600|Remove ProofreadPage dark mode namespaces exception]], [[gerrit:1074490|Promote dark mode for anons on tier 1 wikis (T374679)]] (duration: 27m 41s)
[20:40:13] <stashbot>	 T374679: Check which projects are ready for dark mode for anons - https://phabricator.wikimedia.org/T374679
[20:40:16] <toyofuku>	 ahhhh
[20:40:19] <toyofuku>	 I do remember that
[20:40:21] <wikibugs>	 (03PS2) 10Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562
[20:40:23] <toyofuku>	 Well, that one's done
[20:40:30] <toyofuku>	 Gonna quickly do the next one
[20:40:34] <toyofuku>	 We might go over
[20:40:39] <toyofuku>	 Who would be the right person to tag for that?
[20:40:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) (owner: 10Jdlrobson)
[20:41:23] <Jdlrobson>	 I guess the security window is next so we should ping Reedy sbassett and maryum as an FYI that we might go over.
[20:41:55] <toyofuku>	 Eta 30 mins on merging that patch 🥲
[20:42:56] <toyofuku>	 Reedy: sbasset: maryum: I don't know how to tag you in irc so hopefully this works, but we're in the middle of a backport deploy that is likely to extend into the security window - would that be alright with you all?
[20:43:22] <toyofuku>	 sbassett: misspelled your handle myb
[20:56:12] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[20:56:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1039.eqiad.wmnet with OS bookworm
[20:56:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1039.eqiad.wmnet with OS bookworm
[20:56:27] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:56:50] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T2100).
[21:01:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1040.eqiad.wmnet with OS bookworm
[21:02:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169394 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1040.eqiad.wmnet with OS bookworm
[21:02:41] <toyofuku>	 As mentioned before, we're still in the middle of a backport - this can presumably be aborted if we need to, but since it's taking a long time it would be great to finish so we don't have to start over later
[21:04:21] <Jdlrobson>	 :(
[21:04:47] <Jdlrobson>	 looks like it's almost done
[21:10:00] <wikibugs>	 (03Merged) 10jenkins-bot: Do not apply table styling rules to Main page [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) (owner: 10Jdlrobson)
[21:10:14] <logmsgbot>	 !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1074545|Do not apply table styling rules to Main page (T375245)]]
[21:10:18] <stashbot>	 T375245: Links are unreadable on main page in dark mode - https://phabricator.wikimedia.org/T375245
[21:10:32] <toyofuku>	 let's see how long the actual deploy takes
[21:10:51] <Jdlrobson>	 o_o
[21:11:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1041.eqiad.wmnet with OS bookworm
[21:11:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1041.eqiad.wmnet with OS bookworm
[21:12:50] <logmsgbot>	 !log toyofuku@deploy1003 jdlrobson, toyofuku: Backport for [[gerrit:1074545|Do not apply table styling rules to Main page (T375245)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:13:02] <toyofuku>	 Jdlrobson: ready for testing!
[21:13:35] <Jdlrobson>	 toyofuku: on it
[21:15:17] <Jdlrobson>	 toyofuku: LGTM!
[21:15:27] <toyofuku>	 proceeding
[21:15:29] <logmsgbot>	 !log toyofuku@deploy1003 jdlrobson, toyofuku: Continuing with sync
[21:17:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2020.codfw.wmnet
[21:18:46] <toyofuku>	 okay it def was the message cache bc this one is going much faster
[21:20:58] <logmsgbot>	 !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074545|Do not apply table styling rules to Main page (T375245)]] (duration: 10m 44s)
[21:21:03] <stashbot>	 T375245: Links are unreadable on main page in dark mode - https://phabricator.wikimedia.org/T375245
[21:21:20] <toyofuku>	 Jdlrobson: we're all done!
[21:21:27] <toyofuku>	 Thanks for your patience everyone
[21:21:43] <toyofuku>	 Jan_drewniak: all yours but we ran way over so might want to make sure it's okay to proceed
[21:21:52] <Jdlrobson>	 thanks toyofuku !
[21:23:14] <toyofuku>	 happy to help ☺️
[21:24:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich)
[21:25:26] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.categories-reload (exit_code=97) reloading categories to wdqs2020.codfw.wmnet
[21:28:41] <wikibugs>	 (03PS1) 10Btullis: Update the partman configuration for k8s with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075081 (https://phabricator.wikimedia.org/T365283)
[21:29:15] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update the partman configuration for k8s with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075081 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis)
[21:30:51] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[21:34:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[21:38:04] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm
[21:38:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old switch stack  - pt1979@cumin2002"
[21:38:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old switch stack  - pt1979@cumin2002"
[21:38:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:39:36] <jan_drewniak>	 thanks toyofuku we'll take care of derenrich's patch tomorrow
[21:39:54] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1046.eqiad.wmnet with OS bookworm
[21:40:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1046.eqiad.wmnet with OS bookworm
[21:46:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1045.eqiad.wmnet with OS bookworm
[21:46:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169514 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1045.eqiad.wmnet with OS bookworm
[21:50:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2020.codfw.wmnet
[21:51:44] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[21:55:37] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage
[22:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:05:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1045.eqiad.wmnet with reason: host reimage
[22:06:25] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1040.eqiad.wmnet with reason: host reimage
[22:06:25] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:06:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 789556784 and 108 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:08:25] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:08:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1046.eqiad.wmnet with reason: host reimage
[22:09:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1045.eqiad.wmnet with reason: host reimage
[22:11:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 130048 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:13:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1046.eqiad.wmnet with reason: host reimage
[22:13:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1041.eqiad.wmnet with reason: host reimage
[22:17:29] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1041.eqiad.wmnet with reason: host reimage
[22:20:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10169564 (10Papaul)
[22:21:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169567 (10Papaul)
[22:21:53] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1040.eqiad.wmnet with reason: host reimage
[22:24:17] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:24:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:24:34] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1045.eqiad.wmnet with OS bookworm
[22:24:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1045.eqiad.wmnet with OS bookworm completed:...
[22:27:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:28:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:28:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1046.eqiad.wmnet with OS bookworm
[22:28:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1046.eqiad.wmnet with OS bookworm completed:...
[22:31:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1048.eqiad.wmnet with OS bookworm
[22:31:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169612 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1048.eqiad.wmnet with OS bookworm
[22:32:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:32:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1039.eqiad.wmnet with OS bookworm
[22:32:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169613 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1039.eqiad.wmnet with OS bookworm
[22:32:54] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:32:55] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1041.eqiad.wmnet with OS bookworm
[22:33:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169614 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1041.eqiad.wmnet with OS bookworm completed:...
[22:37:17] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:37:41] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:37:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1040.eqiad.wmnet with OS bookworm
[22:37:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1040.eqiad.wmnet with OS bookworm completed:...
[22:40:49] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1047.eqiad.wmnet with OS bookworm
[22:40:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169623 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1047.eqiad.wmnet with OS bookworm
[22:44:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169624 (10Jclark-ctr)
[22:46:02] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1048.eqiad.wmnet with reason: host reimage
[22:49:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1048.eqiad.wmnet with reason: host reimage
[22:50:07] <wikibugs>	 10SRE-swift-storage: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448 (10prabhat) 03NEW
[22:53:50] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1047.eqiad.wmnet with reason: host reimage
[22:56:25] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1049.eqiad.wmnet with OS bookworm
[22:56:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1049.eqiad.wmnet with OS bookworm
[22:57:39] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1047.eqiad.wmnet with reason: host reimage
[23:02:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1050.eqiad.wmnet with OS bookworm
[23:02:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1050.eqiad.wmnet with OS bookworm
[23:04:09] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:04:28] <wikibugs>	 10SRE-swift-storage: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10169654 (10Pppery) This isn't a problem with the imageinfo API. The file itself has just somehow disappeared (the UI shows it broken too).
[23:05:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:05:26] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:05:27] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1048.eqiad.wmnet with OS bookworm
[23:05:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1048.eqiad.wmnet with OS bookworm completed:...
[23:06:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169658 (10Jclark-ctr)
[23:08:26] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Cannot move Commons File:Dhruve_Sehgal_in_2021.png - https://phabricator.wikimedia.org/T372924#10169661 (10Pppery) 05Open→03Resolved a:03Robertsky Nobody is going to track down what happened a month ago - it's well known and tracked elsew...
[23:09:07] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1052.eqiad.wmnet with OS bookworm
[23:09:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm
[23:10:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1049.eqiad.wmnet with reason: host reimage
[23:11:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:13:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1039.eqiad.wmnet with reason: host reimage
[23:14:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1051.eqiad.wmnet with OS bookworm
[23:14:18] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1049.eqiad.wmnet with reason: host reimage
[23:14:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1051.eqiad.wmnet with OS bookworm
[23:14:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:14:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1047.eqiad.wmnet with OS bookworm
[23:14:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169679 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1047.eqiad.wmnet with OS bookworm completed:...
[23:15:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169680 (10Jclark-ctr)
[23:16:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2020.codfw.wmnet
[23:16:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1050.eqiad.wmnet with reason: host reimage
[23:17:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1039.eqiad.wmnet with reason: host reimage
[23:18:45] <wikibugs>	 (03PS1) 10Papaul: ADD db2146 to use db.cfg for testing [puppet] - 10https://gerrit.wikimedia.org/r/1075085 (https://phabricator.wikimedia.org/T374215)
[23:21:02] <wikibugs>	 (03CR) 10Papaul: [C:03+2] ADD db2146 to use db.cfg for testing [puppet] - 10https://gerrit.wikimedia.org/r/1075085 (https://phabricator.wikimedia.org/T374215) (owner: 10Papaul)
[23:21:15] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1050.eqiad.wmnet with reason: host reimage
[23:23:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1052.eqiad.wmnet with reason: host reimage
[23:24:06] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10169691 (10Eevans) >>! In T370786#10023319, @hnowlan wrote: > One of the big challenges I can see here is the use of compound words - currently we use lazy names like incident-create and incident...
[23:27:34] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1052.eqiad.wmnet with reason: host reimage
[23:29:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:33:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:34:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1049.eqiad.wmnet with OS bookworm
[23:34:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1049.eqiad.wmnet with OS bookworm completed:...
[23:34:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:34:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169706 (10Jclark-ctr)
[23:34:55] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:34:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1039.eqiad.wmnet with OS bookworm
[23:35:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1039.eqiad.wmnet with OS bookworm completed:...
[23:35:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169708 (10Jclark-ctr)
[23:35:56] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:38:22] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1075087
[23:42:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:43:02] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:43:03] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1050.eqiad.wmnet with OS bookworm
[23:43:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1050.eqiad.wmnet with OS bookworm completed:...
[23:43:43] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451 (10Eevans) 03NEW
[23:51:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169742 (10Papaul)