[00:09:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1074730 (owner: 10TrainBranchBot) [00:15:26] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:56] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:56] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:14:56] FIRING: [3x] SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:29:00] (03PS1) 10Slyngshede: C:idm:deployment: Add structlog dependency. 
[puppet] - 10https://gerrit.wikimedia.org/r/1074846 [06:30:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1074846 (owner: 10Slyngshede) [06:31:59] (03CR) 10Muehlenhoff: [C:03+1] "Doh, good catch:-)" [puppet] - 10https://gerrit.wikimedia.org/r/1074713 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [06:32:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10166449 (10ABran-WMF) those servers are a bit sensitive, @wiki_willy do you think this would be manageable to check if we have a spare disk during this week? [06:32:26] (03CR) 10Slyngshede: [C:03+2] C:idm:deployment: Add structlog dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1074846 (owner: 10Slyngshede) [06:35:14] (03CR) 10Slyngshede: [C:03+2] Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 (owner: 10Slyngshede) [06:38:24] (03Merged) 10jenkins-bot: Audit log for permission requests validation. [software/bitu] - 10https://gerrit.wikimedia.org/r/1071849 (owner: 10Slyngshede) [06:40:25] (03CR) 10Muehlenhoff: [C:03+2] Remove puppet checkout on pybaltest [puppet] - 10https://gerrit.wikimedia.org/r/1047509 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [06:42:47] (03PS2) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1074167 [06:43:23] (03PS4) 10Slyngshede: Grant permissions: Hookup LDAP permission granting. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073162 [06:45:54] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1074167 (owner: 10Muehlenhoff) [06:47:12] (03CR) 10Slyngshede: [C:03+2] Grant permissions: Hookup LDAP permission granting. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073162 (owner: 10Slyngshede) [06:48:51] (03PS3) 10Slyngshede: UI for account blocking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [06:49:36] (03Merged) 10jenkins-bot: Grant permissions: Hookup LDAP permission granting. [software/bitu] - 10https://gerrit.wikimedia.org/r/1073162 (owner: 10Slyngshede) [06:51:32] (03PS4) 10Slyngshede: Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1074178 (https://phabricator.wikimedia.org/T359820) [06:55:36] (03CR) 10Muehlenhoff: [C:03+2] icinga: Enable profile::auto_restarts::service for keyholder-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1074358 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [06:58:40] (03PS1) 10Kevin Bazira: ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) [06:58:58] (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetserver: fix SHA1 path for labsprivate [puppet] - 10https://gerrit.wikimedia.org/r/1074713 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:02:30] (03CR) 10Elukey: [C:03+1] kafka::broker: Add the external-services DNS name to the certs [puppet] - 10https://gerrit.wikimedia.org/r/1074411 (https://phabricator.wikimedia.org/T374729) (owner: 10JMeybohm) [07:07:49] (03PS1) 10Elukey: requestctl: change comment for post_docroot.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1074849 [07:08:07] (03CR) 10Elukey: [V:03+2 C:03+2] requestctl: change comment for post_docroot.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/1074849 (owner: 10Elukey) [07:09:51] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10166477 (10elukey) >>! In T374443#10161254, @MoritzMuehlenhoff wrote: >>>! In T374443#10161219, @elukey wrote: >> The move was d... [07:14:39] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) (owner: 10Kevin Bazira) [07:14:56] FIRING: SystemdUnitFailed: exim4-base.service on mw2424:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:13] (03PS1) 10Muehlenhoff: Remove puppetmaster1001 from list of keytab sync hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023) [07:20:42] (03PS1) 10Slyngshede: C:idm Enable audit logging for testing. 
[puppet] - 10https://gerrit.wikimedia.org/r/1074853 [07:22:01] (03CR) 10Elukey: [C:03+1] Remove puppetmaster1001 from list of keytab sync hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023) (owner: 10Muehlenhoff) [07:22:06] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4074/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074853 (owner: 10Slyngshede) [07:22:45] (03CR) 10Elukey: [C:03+1] mw_rc_irc: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074429 (owner: 10Muehlenhoff) [07:23:13] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4076/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074853 (owner: 10Slyngshede) [07:24:21] (03PS7) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) [07:24:27] (03CR) 10Elukey: sre.hosts.decommission: update/remove puppet-related constants (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [07:24:56] (03PS2) 10Slyngshede: C:idm Enable audit logging for testing. 
[puppet] - 10https://gerrit.wikimedia.org/r/1074853 [07:25:57] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4077/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074853 (owner: 10Slyngshede) [07:26:58] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4078/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074853 (owner: 10Slyngshede) [07:27:53] (03CR) 10Hashar: [C:04-2] "> gate-and-submit will run against the rebased version of the change, right?" [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 (owner: 10Hashar) [07:28:56] (03PS2) 10Muehlenhoff: Remove puppetmaster1001 from list of keytab sync hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023) [07:30:39] (03CR) 10Arnaudb: [C:03+1] sre.switchdc.databases: update Phabricator more (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [07:30:57] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1001 from list of keytab sync hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074850 (https://phabricator.wikimedia.org/T368023) (owner: 10Muehlenhoff) [07:37:09] (03CR) 10Jelto: [V:04-1] "pcc fails with" [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [07:45:12] (03PS1) 10Muehlenhoff: puppetdb: Move JVM config out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) [07:47:43] (03CR) 10CI reject: [V:04-1] puppetdb: Move JVM config out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:49:00] (03CR) 10Elukey: [C:03+2] sre.hosts.decommission: update/remove 
puppet-related constants [cookbooks] - 10https://gerrit.wikimedia.org/r/1074101 (https://phabricator.wikimedia.org/T374443) (owner: 10Elukey) [07:49:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:52:29] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts poolcounter1004.eqiad.wmnet [07:53:27] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts poolcounter1004.eqiad.wmnet [07:55:15] (03PS1) 10Elukey: profile::lvs::realserver: update poolcounter hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074949 (https://phabricator.wikimedia.org/T332015) [07:58:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1074949 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [07:59:34] (03CR) 10Elukey: [C:03+2] profile::lvs::realserver: update poolcounter hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074949 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [08:01:01] (03CR) 10Brouberol: [C:03+1] Enable the rclone backup schedule on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [08:02:05] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10166552 (10MoritzMuehlenhoff) [08:03:11] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) (owner: 10Kevin Bazira) [08:04:08] (03Merged) 10jenkins-bot: ml-services: update rec-api image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074847 (https://phabricator.wikimedia.org/T374387) (owner: 10Kevin Bazira) [08:08:07] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: rename mw2424 -> wikikube-worker2124 [puppet] - 
10https://gerrit.wikimedia.org/r/1074410 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [08:08:44] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [08:12:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1074853 (owner: 10Slyngshede) [08:12:13] !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2424 to wikikube-worker2124 [08:12:29] (03CR) 10Slyngshede: [V:03+1 C:03+2] C:idm Enable audit logging for testing. [puppet] - 10https://gerrit.wikimedia.org/r/1074853 (owner: 10Slyngshede) [08:12:34] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [08:12:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:13:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:20] !log elukey@puppetmaster1001:~$ sudo puppet cert destroy performance.discovery.wmnet [08:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:44] (03PS1) 10Muehlenhoff: Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023) [08:16:45] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2424 to wikikube-worker2124 - jiji@cumin1002" [08:16:55] (03PS2) 10Muehlenhoff: Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023) [08:17:40] (03CR) 10Elukey: [C:03+1] Remove puppetmaster1001/puppetmaster2001 from 
tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023) (owner: 10Muehlenhoff) [08:18:09] !log installing systemd bugfix updates from Bookworm point release [08:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:32] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [08:21:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2424 to wikikube-worker2124 - jiji@cumin1002" [08:21:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:21:04] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2124 [08:21:16] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2124 [08:21:30] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster1001/puppetmaster2001 from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1074950 (https://phabricator.wikimedia.org/T368023) (owner: 10Muehlenhoff) [08:21:39] (03CR) 10JMeybohm: [C:03+2] kafka::broker: Add the external-services DNS name to the certs [puppet] - 10https://gerrit.wikimedia.org/r/1074411 (https://phabricator.wikimedia.org/T374729) (owner: 10JMeybohm) [08:21:55] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2424 to wikikube-worker2124 [08:24:26] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: remove frlog2001 and frpm2001 for decom [puppet] - 10https://gerrit.wikimedia.org/r/1074538 (https://phabricator.wikimedia.org/T375239) (owner: 10Dwisehaupt) [08:24:47] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for puppetmaster1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [08:25:22] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: add cert expiration 
monitoring for apis [puppet] - 10https://gerrit.wikimedia.org/r/1074552 (https://phabricator.wikimedia.org/T348725) (owner: 10Dwisehaupt) [08:25:49] !log Updated CI job operations-puppet-tests-bullseye to image rebuild for Puppet 7 # T330490 [08:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:53] T330490: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 [08:26:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for puppetmaster1001.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [08:26:39] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for puppetmaster2001.codfw.wmnet: Renew puppet certificate - elukey@cumin1002 [08:26:45] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2124.codfw.wmnet on all recursors [08:26:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2124.codfw.wmnet on all recursors [08:32:52] (03PS1) 10JMeybohm: Revert "Disable paging for mw-wikifunctions" [puppet] - 10https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) [08:33:54] (03CR) 10CI reject: [V:04-1] Revert "Disable paging for mw-wikifunctions" [puppet] - 10https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) (owner: 10JMeybohm) [08:34:27] (03PS2) 10JMeybohm: Revert "Disable paging for mw-wikifunctions" [puppet] - 10https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) [08:34:45] (03PS3) 10JMeybohm: Revert "Disable paging for mw-wikifunctions" [puppet] - 10https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) [08:38:09] (03CR) 10JMeybohm: [C:03+2] Revert "Disable paging for mw-wikifunctions" [puppet] - 10https://gerrit.wikimedia.org/r/1074951 (https://phabricator.wikimedia.org/T374231) (owner: 10JMeybohm) [08:39:05] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - 
https://phabricator.wikimedia.org/T375345#10166669 (10ayounsi) a:03ayounsi [08:42:11] (03CR) 10Lucas Werkmeister (WMDE): "Acknowledged" [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 (owner: 10Hashar) [08:45:10] (03CR) 10Hashar: [V:03+2 C:03+2] "Done" [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 (owner: 10Hashar) [08:49:28] FIRING: JobUnavailable: Reduced availability for job poolcounter_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:53:13] RESOLVED: JobUnavailable: Reduced availability for job poolcounter_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:54:10] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166734 (10ayounsi) Opened high priority JTAC case 2024-0923-266479 and attached logs/debug output. 
[08:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:57:07] (03CR) 10Stevemunene: [C:03+1] Enable the rclone backup schedule on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [08:57:25] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts poolcounter2003.codfw.wmnet [08:57:36] !log roll-restarting all kafka clusters for certificate changes - T374729 [08:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:40] T374729: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729 [09:00:49] !log jiji@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:01:01] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [09:03:31] (03CR) 10Btullis: [V:03+1 C:03+2] Enable the rclone backup schedule on db1208 [puppet] - 10https://gerrit.wikimedia.org/r/1074382 (https://phabricator.wikimedia.org/T372908) (owner: 10Btullis) [09:04:56] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2124 - jiji@cumin1002" [09:05:16] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2124 - jiji@cumin1002" [09:05:16] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:05:17] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2124.codfw.wmnet 79.0.192.10.in-addr.arpa 
9.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:05:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2124.codfw.wmnet 79.0.192.10.in-addr.arpa 9.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:05:21] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2124 [09:05:26] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:05:38] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2124 [09:05:38] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2124 [09:07:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:07:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts poolcounter2003.codfw.wmnet [09:10:08] (03CR) 10Elukey: sre.network.tls: start from scratch if CSR is missing (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: 10Ayounsi) [09:12:26] !log jayme@cumin1002 conftool action : set/weight=10; selector: name=registry2005.codfw.wmnet [09:12:47] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=registry2005.codfw.wmnet [09:16:24] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts poolcounter2004.codfw.wmnet [09:18:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [09:20:15] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [09:20:39] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:20:51] (03PS1) 10Elukey: role::poolcounter::server: cleanup after Bookworm migration [puppet] - 
10https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015) [09:21:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-codfw [09:21:49] !log jayme@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [09:21:53] (03PS2) 10Elukey: role::poolcounter::server: cleanup after Bookworm migration [puppet] - 10https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015) [09:21:55] (03PS1) 10Ayounsi: Enable and scrape gNMIc api Prometheus endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) [09:22:23] !log jayme@cumin1002 conftool action : set/pooled=no; selector: name=registry200(3|4).codfw.wmnet [09:23:54] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) (owner: 10Ayounsi) [09:23:57] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2124.codfw.wmnet with reason: host reimage [09:26:27] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: poolcounter2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [09:27:45] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2124.codfw.wmnet with reason: host reimage [09:28:17] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet [09:29:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: poolcounter2004.codfw.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [09:29:34] !log elukey@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [09:29:35] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts poolcounter2004.codfw.wmnet [09:30:23] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move puppet-merge (bash script) to puppetserver1001 - https://phabricator.wikimedia.org/T374443#10166770 (10elukey) 05Open→03Resolved [09:34:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:34:41] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10166787 (10MoritzMuehlenhoff) [09:35:22] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10166778 (10elukey) 05Open→03Resolved a:03elukey [09:35:38] (03CR) 10Elukey: [C:03+2] role::poolcounter::server: cleanup after Bookworm migration [puppet] - 10https://gerrit.wikimedia.org/r/1074953 (https://phabricator.wikimedia.org/T332015) (owner: 10Elukey) [09:40:48] (03PS1) 10Muehlenhoff: Add ganeti1039 - ganeti1052 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1074957 (https://phabricator.wikimedia.org/T365650) [09:40:52] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:46] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:46] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:41:50] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:42:24] 
(03CR) 10Muehlenhoff: [C:03+2] Add ganeti1039 - ganeti1052 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1074957 (https://phabricator.wikimedia.org/T365650) (owner: 10Muehlenhoff) [09:42:46] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:42:46] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:44:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [09:46:07] (03PS1) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [09:46:27] (03CR) 10CI reject: [V:04-1] Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [09:46:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [09:48:18] (03PS1) 10Slyngshede: Block User: Add LDAP blocking/unblocking. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1074960 (https://phabricator.wikimedia.org/T359820) [09:48:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2124.codfw.wmnet with OS bullseye [09:48:41] (03PS6) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) [09:49:02] (03PS2) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [09:49:04] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:49:22] (03CR) 10CI reject: [V:04-1] Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [09:50:16] (03CR) 10Ayounsi: sre.network.tls: start from scratch if CSR is missing (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: 10Ayounsi) [09:51:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [09:51:24] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10166816 (10dcaro) >>! In T372814#10165304, @Jclark-ctr wrote: > @Andrew i see this ticket is in my name. is there something i need to do for this?... 
[09:52:05] (03PS3) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [09:52:56] (03PS4) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [09:53:02] !log homer cr*codfw* commit 'T372878' [09:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:07] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [09:54:29] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10166817 (10dcaro) a:05Jclark-ctr→03dcaro [09:58:40] (03PS5) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [09:58:40] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2124.codfw.wmnet [09:59:33] !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker2124.codfw.wmnet [09:59:34] !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2124.codfw.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1000) [10:00:17] (03PS6) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [10:00:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10166843 (10MoritzMuehlenhoff) >>! 
In T365650#10165298, @Jclark-ctr wrote: > @MoritzMuehlenhoff can you update puppet site.pp is mis... [10:01:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4082/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:02:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 293, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:05:51] 10ops-eqiad, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372 (10fnegri) 03NEW [10:06:01] 10ops-eqiad, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10166884 (10fnegri) [10:06:33] jouncebot: nowandnext [10:06:34] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1000) [10:06:34] In 2 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1300) [10:07:10] (03CR) 10Elukey: [C:03+1] "LGTM, maybe test-cookbook it before merging so we are sure it works (it not already done!)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: 10Ayounsi) [10:08:21] 10ops-eqiad, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10166889 (10fnegri) ipmi-sel confirms a "Thermal Trip" both for June 20th and Sep 21st: ` fnegri@cloudvirt1063:~$ sudo ipmi-sel ID | Date | Time | Name |... 
[10:08:23] (03CR) 10DCausse: [C:03+2] cirrus-streaming-update: enable calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [10:08:45] !log rolling out debmonitor-client updates T216832 [10:08:48] (03PS1) 10Btullis: Remove the secrets after testing is complete [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) [10:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:49] T216832: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 [10:09:34] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4083/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:09:38] (03Merged) 10jenkins-bot: cirrus-streaming-update: enable calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074090 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [10:10:38] (03CR) 10Ayounsi: [C:03+2] "Thanks ! already tested, except the very last PS which I don't think requires testing as it's quite minor." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: 10Ayounsi) [10:11:00] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [10:11:23] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:12:17] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host registry1005.eqiad.wmnet [10:12:18] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [10:12:36] (03PS1) 10Btullis: Clean up the test secrets after testing [puppet] - 10https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) [10:13:23] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4084/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:14:46] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:14:54] (03CR) 10Jcrespo: [C:03+1] "I found only 2 discrepancies: pc2007 is not marked as a master on puppet (may not be an issue as it may not be yet fully setup) CC Amir. 
A" [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [10:15:09] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:15:30] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM registry1005.eqiad.wmnet - elukey@cumin1002" [10:15:35] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM registry1005.eqiad.wmnet - elukey@cumin1002" [10:15:35] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:15:35] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache registry1005.eqiad.wmnet on all recursors [10:15:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) registry1005.eqiad.wmnet on all recursors [10:16:03] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM registry1005.eqiad.wmnet - elukey@cumin1002" [10:16:07] (03PS6) 10Btullis: Add an hdfs_file type and provider [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) [10:16:07] (03PS7) 10Btullis: Add some test secrets to the hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1074959 (https://phabricator.wikimedia.org/T323692) [10:16:07] (03PS2) 10Btullis: Remove the secrets after testing is complete [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) [10:16:08] (03PS2) 10Btullis: Clean up the test secrets after testing [puppet] - 10https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) [10:16:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM registry1005.eqiad.wmnet - elukey@cumin1002" [10:16:54] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4085/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [10:17:43] (03PS1) 10Elukey: Set puppet 7 for registry1005 [puppet] - 10https://gerrit.wikimedia.org/r/1074965 (https://phabricator.wikimedia.org/T375374) [10:18:11] (03CR) 10Elukey: [C:03+2] Set puppet 7 for registry1005 [puppet] - 10https://gerrit.wikimedia.org/r/1074965 (https://phabricator.wikimedia.org/T375374) (owner: 10Elukey) [10:18:27] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry1005.eqiad.wmnet with OS bookworm [10:22:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [10:23:30] (03PS3) 10Btullis: Absent the secrets after testing is complete [puppet] - 10https://gerrit.wikimedia.org/r/1074963 (https://phabricator.wikimedia.org/T323692) [10:23:30] (03PS3) 10Btullis: Clean up the test secrets after testing [puppet] - 10https://gerrit.wikimedia.org/r/1074964 (https://phabricator.wikimedia.org/T323692) [10:23:36] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 375, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:23:57] (03Merged) 10jenkins-bot: sre.network.tls: start from scratch if CSR is missing [cookbooks] - 10https://gerrit.wikimedia.org/r/1074383 (https://phabricator.wikimedia.org/T375179) (owner: 10Ayounsi) [10:25:41] !log 
homer lsw1-a6-codfw* commit 'T372878' [10:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:45] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [10:27:41] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10166945 (10MoritzMuehlenhoff) 05Open→03Resolved Updated deb has been rolled out fleetwide, closing. [10:27:49] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage [10:29:52] FIRING: ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:23] RESOLVED: ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:52] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage [10:33:38] (03CR) 10Elukey: [C:03+1] Also move the apt::pin under the buster conditional [puppet] - 10https://gerrit.wikimedia.org/r/1073467 (https://phabricator.wikimedia.org/T374928) (owner: 10Muehlenhoff) [10:33:43] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2425.codfw.wmnet [10:34:13] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2124.codfw.wmnet [10:34:15] !log jiji@cumin1002 END (PASS) - 
Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2124.codfw.wmnet [10:34:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2425.codfw.wmnet [10:34:28] (03CR) 10Elukey: [C:03+1] On Bookworm create the system user using systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1073469 (https://phabricator.wikimedia.org/T374928) (owner: 10Muehlenhoff) [10:35:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:36:58] (03CR) 10Muehlenhoff: [C:03+2] Also move the apt::pin under the buster conditional [puppet] - 10https://gerrit.wikimedia.org/r/1073467 (https://phabricator.wikimedia.org/T374928) (owner: 10Muehlenhoff) [10:37:31] (03CR) 10Muehlenhoff: [C:03+2] On Bookworm create the system user using systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1073469 (https://phabricator.wikimedia.org/T374928) (owner: 10Muehlenhoff) [10:37:39] (03CR) 10Btullis: [V:03+1 C:03+2] Add a datahubsearch cluster and assign the correct hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074389 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [10:37:45] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [10:38:04] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:45] 
RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:43:02] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:43:39] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2425.codfw.wmnet with reason: reimage [10:43:42] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2425.codfw.wmnet with reason: reimage [10:43:54] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:49:40] (03PS1) 10Elukey: Revert "Set puppet 7 for registry1005" [puppet] - 10https://gerrit.wikimedia.org/r/1074972 [10:51:52] !log starting db master table checks on s1 (db1163, db2203) T375186 [10:51:54] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host registry1005.eqiad.wmnet with OS bookworm [10:51:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host registry1005.eqiad.wmnet [10:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:56] T375186: databases preswitchover checks - https://phabricator.wikimedia.org/T375186 [10:52:30] (03CR) 10Muehlenhoff: [C:03+1] Revert "Set puppet 7 for registry1005" [puppet] - 10https://gerrit.wikimedia.org/r/1074972 (owner: 10Elukey) [10:53:28] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [10:53:35] (03CR) 10CI reject: [V:04-1] cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 
(https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [10:53:43] (03CR) 10Elukey: [C:03+2] Revert "Set puppet 7 for registry1005" [puppet] - 10https://gerrit.wikimedia.org/r/1074972 (owner: 10Elukey) [10:53:53] (03PS3) 10DCausse: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) [10:54:11] (03CR) 10DCausse: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [10:56:44] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on registry1005.eqiad.wmnet with reason: WIP - working on puppet runs [10:56:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on registry1005.eqiad.wmnet with reason: WIP - working on puppet runs [10:57:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host registry1005.eqiad.wmnet [10:59:05] (03PS1) 10Muehlenhoff: Switch registry1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1074973 (https://phabricator.wikimedia.org/T375374) [10:59:34] (03CR) 10Elukey: [C:03+1] Switch registry1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1074973 (https://phabricator.wikimedia.org/T375374) (owner: 10Muehlenhoff) [11:00:03] (03CR) 10Muehlenhoff: [C:03+2] Switch registry1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1074973 (https://phabricator.wikimedia.org/T375374) (owner: 10Muehlenhoff) [11:02:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host registry1005.eqiad.wmnet [11:03:56] (03CR) 10DCausse: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [11:13:03] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: disable legacy network 
policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [11:14:21] (03Merged) 10jenkins-bot: cirrus-streaming-updater: disable legacy network policies for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074091 (https://phabricator.wikimedia.org/T373195) (owner: 10DCausse) [11:15:28] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:15:38] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:16:09] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:16:20] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:17:02] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:17:06] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:20:30] (03PS1) 10Muehlenhoff: Fix site.pp after adding new Ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074978 [11:21:25] (03CR) 10Muehlenhoff: [C:03+2] Fix site.pp after adding new Ganeti hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074978 (owner: 10Muehlenhoff) [11:41:20] (03PS3) 10Muehlenhoff: envoy: Add support for passing an array of sets to the firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1072690 [11:41:28] (03PS4) 10Muehlenhoff: envoy: Add support for passing an array of sets to the firewall service [puppet] - 10https://gerrit.wikimedia.org/r/1072690 [11:45:39] !log installing cups security updates [11:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:22] (03CR) 10Jaime Nuche: [C:03+1] "> Then when using scap3 for deployment, Puppet was made to NOT install the Jenkins package since it is not prepared by Puppet. 
I guess it " [puppet] - 10https://gerrit.wikimedia.org/r/1074461 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [11:47:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167117 (10VRiley-WMF) With this information, I'm going to reach back out to Dell. [11:49:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:49:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:01] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 375, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 293, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:39] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on mw2425.codfw.wmnet with reason: reimage [11:51:42] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2425.codfw.wmnet with reason: reimage [11:52:19] !log jiji@cumin1002 START - Cookbook sre.hosts.provision for host mw2425.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [11:54:02] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2229 [puppet] - 10https://gerrit.wikimedia.org/r/1074986 (https://phabricator.wikimedia.org/T375186) [11:54:03] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2229 [puppet] - 10https://gerrit.wikimedia.org/r/1074986 (https://phabricator.wikimedia.org/T375186) (owner: 10Arnaudb) [11:54:24] (03PS1) 10Effie Mouzeli: kubernetes: rename mw2425 -> 
wikikube-worker2125 [puppet] - 10https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) [11:55:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:55:12] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:55:19] (03CR) 10Jforrester: "Neat!" [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1073461 (owner: 10Hashar) [11:55:51] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2425.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [11:57:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff) [11:59:03] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 375, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:59:12] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 293, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:59:52] (03CR) 10Kamila Součková: [C:03+1] kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - 10https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [12:00:37] (03CR) 10Clément Goubert: [C:03+1] kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - 10https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) [12:00:50] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: rename mw2425 -> wikikube-worker2125 [puppet] - 10https://gerrit.wikimedia.org/r/1074987 (https://phabricator.wikimedia.org/T372878) (owner: 10Effie Mouzeli) 
[12:02:34] (03CR) 10Muehlenhoff: [C:03+2] bacula::storage: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074427 (owner: 10Muehlenhoff) [12:02:58] effie: I'll merge your patch along [12:03:46] cheers thanx [12:04:15] merged [12:05:12] (03CR) 10Filippo Giunchedi: [C:03+1] Enable and scrape gNMIc api Prometheus endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) (owner: 10Ayounsi) [12:05:36] (03CR) 10Filippo Giunchedi: [C:03+1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [12:09:12] (03CR) 10Muehlenhoff: [C:03+2] No longer include config-master on Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1074151 (https://phabricator.wikimedia.org/T374443) (owner: 10Muehlenhoff) [12:10:45] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2425.codfw.wmnet [12:10:46] !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2425.codfw.wmnet [12:12:30] !log jiji@cumin1002 START - Cookbook sre.hosts.rename from mw2425 to wikikube-worker2125 [12:12:40] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [12:12:56] !log restarting replication on pc1013 after crash T375382 [12:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:04] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382 [12:13:15] ^ heads up _joe_ moritzm this could have caused some mw errors [12:13:30] <_joe_> ack, thanks [12:14:01] in the past it used to be very loggy, but I think it wasn't noticed that much this time [12:14:29] ok [12:14:29] PROBLEM - config-master.wikimedia.org requires authentication on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:14:33] I think specially as it was only down for 9 seconds [12:14:44] the config-master alert should be harmless [12:15:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:13] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:54] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2425 to wikikube-worker2125 - jiji@cumin1002" [12:17:25] FIRING: SystemdUnitFailed: apache2.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:32] (03CR) 10Filippo Giunchedi: Add monitoring to network devices gRPC endpoints (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1074435 (owner: 10Ayounsi) [12:17:34] (03CR) 10Ayounsi: [C:03+2] Enable and scrape gNMIc api Prometheus endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1074954 (https://phabricator.wikimedia.org/T375361) (owner: 10Ayounsi) [12:17:37] (03PS1) 10Muehlenhoff: Revert "No longer include config-master on Puppet 5 frontends" [puppet] - 10https://gerrit.wikimedia.org/r/1074994 [12:18:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2425 to wikikube-worker2125 - jiji@cumin1002" [12:18:02] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:18:02] !log jiji@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2125 [12:18:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2125 [12:18:56] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2425 to wikikube-worker2125 [12:20:32] (03PS1) 10Slyngshede: C:idm setup structlogger instance. [puppet] - 10https://gerrit.wikimedia.org/r/1074998 [12:21:26] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4086/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074998 (owner: 10Slyngshede) [12:22:02] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2125.codfw.wmnet on all recursors [12:22:05] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2125.codfw.wmnet on all recursors [12:22:19] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4087/console" [puppet] - 10https://gerrit.wikimedia.org/r/1074998 (owner: 10Slyngshede) [12:24:15] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4088/co" [puppet] - 10https://gerrit.wikimedia.org/r/1074998 (owner: 10Slyngshede) [12:24:42] 10ops-eqiad, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167250 (10ABran-WMF) as @jcrespo found on P69389 this crash is due to a memory issue on channel:0 slot:1 [12:24:43] (03CR) 10Slyngshede: [V:03+1 C:03+2] C:idm setup structlogger instance. 
[puppet] - 10https://gerrit.wikimedia.org/r/1074998 (owner: 10Slyngshede) [12:26:05] (03PS1) 10DCausse: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075000 [12:26:14] (03PS2) 10DCausse: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075000 [12:26:50] 10ops-eqiad, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167256 (10ABran-WMF) This confirm the position of the stick that is in error in DIMM slot A9: {F57531822} {F57531824} [12:27:04] !log jiji@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2125.codfw.wmnet [12:27:25] RESOLVED: SystemdUnitFailed: apache2.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:27:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2125.codfw.wmnet with OS bullseye [12:27:37] !log jiji@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2125 [12:27:43] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [12:27:54] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075000 (owner: 10DCausse) [12:29:33] (03Merged) 10jenkins-bot: cirrus-streaming-updater: test use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075000 (owner: 10DCausse) [12:29:48] (03PS3) 10Brouberol: cloudnative-pg-cluster: facilitate the import of an external database [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074968 (https://phabricator.wikimedia.org/T374950) [12:32:19] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:32:33] !log dcausse@deploy1003 helmfile [staging] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [12:33:13] FIRING: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:43] RESOLVED: JobUnavailable: Reduced availability for job gnmic in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:56] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2125 - jiji@cumin1002" [12:36:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2125 - jiji@cumin1002" [12:36:00] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:36:00] !log jiji@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2125.codfw.wmnet 81.0.192.10.in-addr.arpa 1.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:36:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2125.codfw.wmnet 81.0.192.10.in-addr.arpa 1.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:36:04] !log jiji@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2125 [12:36:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2125 [12:36:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2125 [12:44:16] (03PS1) 10PipelineBot: wikifeeds: 
pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075005 [12:50:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167324 (10VRiley-WMF) After working with Dell and explaining the issue, they can confirm that there is no hardware issues in the TSR report. I did provide them the image that @Jclark-ct... [12:54:13] (03PS1) 10DCausse: cirrus-streaming-updater: use kafka "external-services" fqdn with use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 [12:54:17] !log mnz@deploy1003 Started deploy [airflow-dags/research@3e2d3b8]: deploy reference risk DAG [12:54:23] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2125.codfw.wmnet with reason: host reimage [12:54:52] !log mnz@deploy1003 Finished deploy [airflow-dags/research@3e2d3b8]: deploy reference risk DAG (duration: 00m 59s) [12:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:58:01] (03CR) 10DCausse: "Tested with one job & kafka-main in I0cc7640 and worked ok." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse) [12:58:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2125.codfw.wmnet with reason: host reimage [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:11] I can’t deploy anyway, so good ^^ [13:02:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1069991 (owner: 10EoghanGaffney) [13:03:03] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10167367 (10MoritzMuehlenhoff) [13:13:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:12] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:15:16] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:15:24] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:17:12] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:17:16] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:17:24] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:17:30] (03PS1) 10Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) [13:18:11] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2125.codfw.wmnet with OS bullseye [13:19:18] (03CR) 10CI reject: [V:04-1] Add a new partition recipe for k8s workers with local storage [puppet] - 
10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:21:02] !log homer cr*codfw* commit 'T372878' [13:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:06] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [13:21:24] !log homer lsw1-a6-codfw* commit 'T372878' [13:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:15] (03PS5) 10Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) [13:23:33] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [13:23:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:24:06] (03CR) 10CI reject: [V:04-1] dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [13:25:16] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10167416 (10Volans) Thanks for the summary @ssingh. I have a local proposal that will send out when ready. There is one main point to decide and... 
[13:27:32] (03PS2) 10Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) [13:29:22] (03CR) 10CI reject: [V:04-1] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:29:28] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 291, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:29:58] (03PS3) 10Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) [13:30:44] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on cr3-ulsfo with reason: waiting for JTAC [13:30:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cr3-ulsfo with reason: waiting for JTAC [13:31:04] 06SRE, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10167444 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a9eff4bb-15d3-41a4-8dd6-65ccc0663c06) set by ayounsi@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their serv... 
[13:31:47] (03CR) 10CI reject: [V:04-1] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:31:50] (03PS4) 10Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) [13:33:41] (03CR) 10CI reject: [V:04-1] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:33:47] !log jiji@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2125.codfw.wmnet [13:33:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2125.codfw.wmnet [13:33:49] (03PS5) 10Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) [13:33:51] !log jiji@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=1) Renumbering for host wikikube-worker2125.codfw.wmnet [13:35:32] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:35:39] (03CR) 10CI reject: [V:04-1] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:36:07] (03PS3) 10Hashar: contint: define component/ci only once [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) [13:37:19] (03PS1) 10Stevemunene: hdfs: add new an-workers to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) [13:37:30] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 373, down: 0, 
shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:07] (03CR) 10CI reject: [V:04-1] contint: define component/ci only once [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [13:38:19] (03PS6) 10Btullis: Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) [13:38:56] 00:00:54.145 1) profile::configmaster on debian-11-x86_64 test compilation with default parameters is expected to compile into a catalogue without dependency cycles [13:38:56] 00:00:54.145 error during compilation: Function lookup() did not find a value for the name 'profile::configmaster::server_name' (file: /srv/workspace/puppet/modules/profile/manifests/configmaster.pp, line: 8) on node 4c9703cf2f06.integration.eqiad1.wikimedia.cloud [13:39:06] something is broken in the puppet specs [13:40:05] yeah [13:40:08] see -sre [13:40:09] (03CR) 10CI reject: [V:04-1] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:40:10] sending a patch [13:41:02] (03PS4) 10Bartosz Dziewoński: SSO domain shouldn't have a mobile version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) [13:41:33] sukhe: thank you! ) [13:42:01] (03CR) 10Bartosz Dziewoński: "Done. I also tweaked the logic to avoid repeating the domain name more times than necessary." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [13:42:30] (03PS1) 10Ssingh: spec: remove profile_configmaster_spec.rb [puppet] - 10https://gerrit.wikimedia.org/r/1075013 [13:42:44] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:43:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:43:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:44:43] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:49:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167485 (10Papaul) @Jclark-ctr @ABran-WMF @VRiley-WMF can I take over this task and try to re-image it? 
[13:49:45] (03CR) 10Btullis: hdfs: add new an-workers to insetup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:50:28] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr3-ulsfo [13:50:36] (03CR) 10Muehlenhoff: spec: remove profile_configmaster_spec.rb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [13:50:49] (03PS1) 10Giuseppe Lavagetto: service_proxy: Add a listener for the http interface of graphite [puppet] - 10https://gerrit.wikimedia.org/r/1075016 (https://phabricator.wikimedia.org/T374887) [13:50:55] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo [13:51:03] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr4-ulsfo [13:51:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo [13:52:20] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr3-eqsin [13:52:34] (03PS2) 10Stevemunene: hdfs: add new an-workers to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) [13:52:54] (03CR) 10Majavah: "i think it should be possible to fix the tests instead of removing them, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [13:52:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin [13:53:48] (03CR) 10Stevemunene: hdfs: add new an-workers to insetup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:54:44] (03CR) 10Brouberol: [C:03+1] hdfs: add new an-workers to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:55:49] !log ayounsi@cumin1002 START 
- Cookbook sre.network.tls for network device cr2-eqsin [13:56:15] (03CR) 10Hashar: "CI fails due to a temporary glitch in the rspec tests." [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [13:56:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqsin [13:56:38] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-drmrs [13:56:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs [13:57:27] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-drmrs [13:57:45] (03PS1) 10Giuseppe Lavagetto: ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) [13:57:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs [13:57:48] (03CR) 10Btullis: [C:03+1] hdfs: add new an-workers to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:58:27] (03CR) 10CI reject: [V:04-1] ExtensionDistributor: reach graphite via the service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: 10Giuseppe Lavagetto) [13:58:28] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-b12-drmrs [13:58:35] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [13:58:41] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b12-drmrs [13:59:07] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-b13-drmrs [13:59:17] (03CR) 10Stevemunene: [C:03+2] hdfs: add new an-workers to insetup role [puppet] - 
10https://gerrit.wikimedia.org/r/1075012 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:59:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b13-drmrs [13:59:47] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host registry1005.eqiad.wmnet with OS bookworm [14:00:32] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-esams [14:00:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams [14:02:17] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-bw27-esams [14:02:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-bw27-esams [14:02:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [14:02:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ... [14:02:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:03:30] (03CR) 10Ssingh: "No strong opinions either way, I will just update the spec." [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [14:03:43] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@3e2d3b8]: Deploy latest DAGs to analytics Airflow instance. T369868. [14:03:56] T369868: Improve handling of delete, restore, and merge from incremental update - https://phabricator.wikimedia.org/T369868 [14:04:32] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@3e2d3b8]: Deploy latest DAGs to analytics Airflow instance. T369868. 
(duration: 00m 48s) [14:06:20] (03CR) 10Ssingh: "error during compilation: Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Httpd] is already de" [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [14:06:35] 06SRE, 06Infrastructure-Foundations, 06serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10167577 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:06:51] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167571 (10Volans) AFAIK `pc1015` should be the candidate host if we want to fail it over, from `dbctl`: ` "note": "Hot spare for pc4 and cold spare for pc3", ` [14:07:15] (03CR) 10Ssingh: "^ Running it locally, seems like there is more work required." [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [14:07:23] (03CR) 10Muehlenhoff: [C:03+1] "Let's just remove it, not sure if it's actually still useful for anything." [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [14:12:47] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage [14:13:13] (03PS2) 10Ssingh: spec: remove profile_configmaster_spec.rb [puppet] - 10https://gerrit.wikimedia.org/r/1075013 [14:13:49] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-by27-esams [14:13:55] (03CR) 10Ssingh: "I tried fixing it but since this blocks CI, I am removing it. 
If someone has a fix, please feel free to update it 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [14:14:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-by27-esams [14:14:10] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-esams [14:14:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams [14:15:38] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr1-codfw [14:15:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw [14:16:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on registry1005.eqiad.wmnet with reason: host reimage [14:16:20] (03CR) 10Ssingh: [C:03+2] spec: remove profile_configmaster_spec.rb [puppet] - 10https://gerrit.wikimedia.org/r/1075013 (owner: 10Ssingh) [14:16:47] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cr2-codfw [14:16:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw [14:17:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [14:17:44] Deployment mw-jobrunner.eqiad.main in mw-jobrunner at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner&var-deployment=mw-jobrunner.eqiad.main - ... 
[14:17:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:17:46] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167645 (10jcrespo) good catch, let's then start by moving replication from pc4 to: pc3: pc1013 -> pc1015, in the earliest binlog possible, for warmup (this should be a noop), and later we can patch/run dbct... [14:18:16] (03PS6) 10Ssingh: dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) [14:19:35] (03CR) 10Filippo Giunchedi: "thirdparty/otelcol-contrib isn't a thing in bookworm and will need to be added prior to this patch" [puppet] - 10https://gerrit.wikimedia.org/r/1074434 (owner: 10Herron) [14:21:47] (03CR) 10Filippo Giunchedi: [C:04-1] "I tested this in Pontoon and I'm getting invalid configuration:" [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [14:24:32] (03PS1) 10Jcrespo: mariadb: Move pc1015 configuration to master of pc3 section [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) [14:24:54] (03CR) 10Ssingh: "Turning this on only for Wikimedia DNS. We will turn this on for internal recursors next week. 
I am pretty sure this should be fine but no" [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [14:25:21] (03PS1) 10Herron: apt: add thirdparty/otelcol-contrib bookworm component [puppet] - 10https://gerrit.wikimedia.org/r/1075025 [14:25:54] (03CR) 10Ssingh: [C:03+2] dnsrecursor: add optional setting of extended-resolution-errors [puppet] - 10https://gerrit.wikimedia.org/r/1074196 (https://phabricator.wikimedia.org/T375200) (owner: 10Ssingh) [14:26:13] (03CR) 10Mforns: hieradata::services_proxy::envoy.yaml: fix duplicated port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [14:27:33] (03CR) 10Filippo Giunchedi: [C:03+1] apt: add thirdparty/otelcol-contrib bookworm component [puppet] - 10https://gerrit.wikimedia.org/r/1075025 (owner: 10Herron) [14:27:42] (03CR) 10Herron: [C:03+2] apt: add thirdparty/otelcol-contrib bookworm component [puppet] - 10https://gerrit.wikimedia.org/r/1075025 (owner: 10Herron) [14:29:32] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167674 (10ABran-WMF) sure! you can reimage it @Papaul [14:30:04] (03CR) 10Bking: [C:03+1] "+1 to merge once the change passes CI. 
Partman is a "guess and check" type application so there may be more iterations ;)" [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [14:30:15] !log restarting and moving replication source of pc1015 T375382 [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:27] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382 [14:30:48] !log sudo cumin 'O:wikidough' 'run-puppet-agent' [14:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host registry1005.eqiad.wmnet with OS bookworm [14:33:54] (03PS9) 10Herron: thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 [14:36:51] (03PS1) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [14:37:10] (03CR) 10CI reject: [V:04-1] Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:37:15] (03CR) 10Arnaudb: [C:03+1] mariadb: Move pc1015 configuration to master of pc3 section [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [14:37:51] (03CR) 10Arnaudb: [C:03+1] mariadb: Move pc1015 configuration to master of pc3 section (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [14:38:04] (03PS2) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [14:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:18] (03PS3) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [14:38:47] (03CR) 10CI reject: [V:04-1] Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:38:50] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [14:40:28] (03PS4) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [14:41:14] (03PS1) 10MVernon: hiera: specify cluster for apus nodes [puppet] - 10https://gerrit.wikimedia.org/r/1075027 (https://phabricator.wikimedia.org/T279621) [14:42:39] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:45:48] (03PS5) 10JMeybohm: Initial commit of containerd puppet code [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) [14:47:03] PROBLEM - Host pc1013 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:47:19] hi [14:47:21] !incidents [14:47:21] !incidents [14:47:21] 5267 (ACKED) Host pc1013 (paged) - PING - Packet loss = 100% [14:47:21] 5267 (ACKED) Host pc1013 (paged) - PING - Packet loss = 100% [14:47:25] jynus: our friend came back [14:47:26] Here. [14:47:29] * Emperor here [14:47:29] !ack 5267 [14:47:30] 5267 (ACKED) Host pc1013 (paged) - PING - Packet loss = 100% [14:47:45] <_joe_> volans: wdym? [14:47:50] I guess we might have to force the failover earlier than expected... 
[14:48:02] _joe_: it had already failed [14:48:04] (03CR) 10Jcrespo: "heads up" [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [14:48:05] <_joe_> volans: are you handling the alert? [14:48:12] https://sal.toolforge.org/log/QZTMHpIBFk7ipym_lMyU [14:48:24] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: name=registry1005.eqiad.wmnet [14:48:38] Is it me, or did that p.age everyone immediately rather than just the oncall folk? [14:48:39] there's a DIMM error in SEL [14:48:56] yeah: T374215 [14:48:56] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [14:49:00] not that [14:49:01] <_joe_> Emperor: no idea because I'm oncall [14:49:02] Emperor: weird because I ACKed it here even before it paged on the app [14:49:04] PROBLEM - MariaDB Replica IO: pc3 on pc2013 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc1013.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc1013.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:49:10] Emperor: didn't page me [14:49:12] T375382 [14:49:13] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382 [14:49:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.93% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:49:17] _joe_: we were discussing if we should failover or not in the DP meeting and the consensus was to try to failover but without a rush checking with the DBAs that are OOO today, but I guess at this point we have to failover sooner than expected [14:49:17] <_joe_> but I responded at the first page [14:49:21] surprisingly it logged a successful self-heal earlier [14:49:33] we expected it
to work for longer until we fail it over [14:49:38] Emperor: I didn't get paged by splunk, just IRC hashtag [14:49:43] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=registry1005.eqiad.wmnet [14:49:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074418 (https://phabricator.wikimedia.org/T375272) (owner: 10Bartosz Dziewoński) [14:49:48] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: name=registry1005.eqiad.wmnet [14:49:49] <_joe_> if there's a dimm error I guess we have no alternative [14:49:59] it is not booting up? [14:50:20] <_joe_> is anyone trying to boot it? [14:50:22] console is dead [14:50:26] I'll powercycle it [14:50:30] <_joe_> yep [14:50:33] oh, yes, sorry, I'm an idiot and got emailed by nagios rather than p.aged by splunk [14:50:46] is there anything we need to do in the meantime? [14:50:46] I'd prefer to boot it up and later failover than do it without [14:50:56] !log elukey@puppetserver1001 conftool action : set/pooled=true,weight=10; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad [14:50:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10167799 (10Papaul) @ABran-WMF sorry forgot to ask, are we re-imaging with Bullseye?
[14:51:16] (03CR) 10Arnaudb: [C:03+1] mariadb: Move pc1015 configuration to master of pc3 section (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [14:51:16] <_joe_> jynus: ack, moritz has powercycled it AIUI [14:51:20] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frlog2001 - https://phabricator.wikimedia.org/T375239#10167791 (10Jhancock.wm) a:03Papaul [14:51:30] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: Decommission frack hosts: frpm2001 - https://phabricator.wikimedia.org/T375297#10167797 (10Jhancock.wm) a:03Papaul [14:51:31] let's see if it comes back, it will be faster [14:51:35] !log powercycle pc1013 (DIMM error in DIMM_A9) [14:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] if not I am preparing pc1015 [14:51:56] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10167805 (10Jhancock.wm) I forgot to hit submit on my last update. pay-lb2001 was moved on Friday. The two latest decons have left us with another... [14:52:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:52:15] how bad is it for mediawiki errors?
[14:52:21] I see [14:52:26] <_joe_> yeah [14:52:45] merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1075024 [14:52:46] it's booting now, took a while to get POST checks [14:52:53] (03CR) 10JHathaway: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [14:52:57] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:52:58] sometimes it asks for an enter: moritzm [14:53:05] (03CR) 10Jcrespo: [C:03+2] mariadb: Move pc1015 configuration to master of pc3 section [puppet] - 10https://gerrit.wikimedia.org/r/1075024 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [14:53:34] (03PS4) 10Hashar: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) [14:53:43] grub is up now [14:53:48] root@puppetmaster1001:~$ puppet-merge: To ensure consistent locking please run puppet-merge from: puppetserver1001.eqiad.wmnet [14:53:49] and system is booting [14:53:53] help with this ^ [14:54:02] jynus: just go to puppetserver1001 [14:54:05] <_joe_> jynus: go to puppetserver1001 :) [14:54:08] same UI as before [14:54:08] jynus: just run from puppetserver [14:54:11] ok, I am stupid [14:54:12] nothing else changed [14:54:13] RECOVERY - Host pc1013 #page is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:54:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.63% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:54:18] ok great [14:54:21] pc1013 is back [14:54:24] thanks moritzm <3 [14:54:32] <_joe_> well the server is up [14:54:36] the question is whether this is stable enough or will re-appear [14:54:37] <_joe_> mariadb isn't I guess [14:54:40] moritzm: it will [14:54:49] <_joe_> jynus: are you starting the database?
[14:54:59] I am on it [14:55:04] <_joe_> ack [14:55:22] Sep 23 10:58:45 pc1013 kernel: MCE: Killing mysqld:1332 due to hardware memory corruption fault at 7f4e020fd5c0 [14:55:31] last line of the previous boot kernel log [14:55:37] same thing that happened before [14:55:38] PROBLEM - MariaDB Replica Lag: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:55:41] yep no surprise here [14:55:46] <_joe_> !incidents [14:55:47] 5267 (ACKED) Host pc1013 (paged) - PING - Packet loss = 100% [14:55:51] service up, outage should fix now [14:55:57] but I would like to do the failover asap [14:56:01] so it won't happen again [14:56:05] +1 [14:56:10] <_joe_> yeah I think it's sensible at this point, +1 [14:56:15] thanks jynus [14:56:16] +1 [14:56:25] I need some help as it is currently depooled on 2 sections [14:56:34] I am not familiar with day to day dbctl operations [14:56:49] <_joe_> jynus: I can try to help, and so can volans I guess [14:56:52] +1 [14:56:54] i can too [14:56:54] I would go with dbctl "edit [14:57:04] and just adjust it at yur will or I can if you prefer [14:57:10] and you check it before committing [14:57:10] if the rest can confirm mw okness [14:57:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:57:17] while we go through the failover [14:57:25] <_joe_> jynus: ack [14:57:34] volans: either would work [14:57:44] server is unresponsive again [14:57:47] :-( [14:57:50] <_joe_> sigh [14:57:56] <_joe_> ok [14:57:59] * volans preparing dbctl edit [14:58:01] yep, it crashed again [14:58:02] to submit for review [14:58:06] <_joe_> yep 
[14:58:07] ack volans [14:58:12] (03CR) 10Ryan Kemper: [C:03+2] wdqs: allow 3 new federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1074199 (https://phabricator.wikimedia.org/T364233) (owner: 10Ryan Kemper) [14:58:26] <_joe_> ok, I'll monitor mediawiki [14:58:39] I would like to restart pc1015 once before pooling it [14:58:41] doing it now [14:58:48] to apply puppet changes [14:59:14] <_joe_> oh you mean mariadb, not the whole server [14:59:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.36% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:59:18] meh, pc1013 is OOW since less than three months... [14:59:18] jynus: maybe an upgrade cookbook would be nice ? [14:59:20] (03PS10) 10Herron: thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 [14:59:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:41] arnaudb: that's ok, it is the restart after the puppet config that needs to be done [14:59:42] (depending on the production impact) [14:59:46] ack [15:00:27] _joe_: the issue is that pc1015 was a hot spare for pc4, not pc5, so it its a longer process [15:00:50] <_joe_> so with the server unresponsive, we're bound to have more slowdowns in mediawiki [15:01:03] jynus: try dbctl config diff and check the output [15:01:04] <_joe_> jynus: take your time [15:01:28] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#10167841 (10brouberol) `cirrus-streaming-updater` is replacing the list of brokers by the external services service name: https://gerrit.wi... 
[15:01:28] (03CR) 10Btullis: [C:03+1] "Looks great, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074968 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [15:01:35] <_joe_> right now pc1013 responds to network, so it refuses connections and that is fast. The problem is when it's down, we have a pretty generous connection timeout [15:01:40] volans: looks good, let me be sure pc1015 is ok [15:01:41] (03CR) 10Brouberol: [C:03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse) [15:01:45] sure [15:01:59] lgtm volans looks like what we do in other switchovers [15:02:11] (03CR) 10Brouberol: [C:03+1] Add a presto cluster and assign the relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074430 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [15:02:13] volans: we are good, commit [15:02:20] and we now fix codfw replication [15:02:22] (03CR) 10Brouberol: [C:03+1] Add an airflow cluster and assign relevant hosts [puppet] - 10https://gerrit.wikimedia.org/r/1074391 (https://phabricator.wikimedia.org/T374932) (owner: 10Btullis) [15:02:28] you can also check dbctl -s eqiad section pc3 get and dbctl instance pc1015 get [15:02:37] ok committing [15:02:39] just commit, it is ok [15:02:56] we may have to tune cadidate master et al [15:02:58] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: facilitate the import of an external database [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074968 (https://phabricator.wikimedia.org/T374950) (owner: 10Brouberol) [15:03:01] but that's not important [15:03:09] {done} [15:03:21] !log volans@cumin1002 dbctl commit (dc=all): 'emergency failover pc3 to pc1015', diff saved to https://phabricator.wikimedia.org/P69396 and previous config saved to /var/cache/conftool/dbconfig/20240923-150320-volans.json [15:03:28] I see the users coming in [15:03:31] response time looks ok again [15:03:41] <_joe_> arnaudb: see my explanation above [15:03:45] 
the cache is cold though [15:03:59] volans: as a note for myself we need to switch pc3-codfw to replicate from pc1015-bin.099184 | 33086 [15:04:04] it was a hot spare for pc4... we got unlucky [15:04:07] ack _joe_ I missed it in the scroll thanks! [15:04:26] (03CR) 10Herron: "good catch thanks! the updated PS (and after sorting out the otelcol-contrib component) has thanos-query looking much better on phi-titan-" [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [15:04:29] _joe_: mw better? [15:04:53] <_joe_> jynus: it was better as soon as it could get a connection refused from pc1013 [15:04:56] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:08] I will switch pc3-codfw [15:05:09] (03CR) 10Brouberol: [C:03+1] "Nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [15:05:14] arnaudb: can you handle orchestrator [15:05:22] yep [15:05:32] and tendril if possible to update the master [15:05:33] <_joe_> moritzm: planned obsolescence! 
[15:05:45] <_joe_> sorry I just saw your comment about OOW :) [15:05:49] I will update pc2013 replication [15:06:05] I'll paste my edit log here to ensure everything is squared [15:06:06] <_joe_> jynus: <3 [15:06:31] normally after an uncorrectable error, the memory stick just disables itself [15:06:40] (03CR) 10Santiago Faci: hieradata::services_proxy::envoy.yaml: fix duplicated port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074248 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [15:06:53] in this case it crashed every time it reached the bit (after X minutes after buffer pool load) [15:07:21] (03PS3) 10Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 [15:07:38] RECOVERY - MariaDB Replica Lag: pc3 on pc2013 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:07:44] yeah, but the ones logged are the uncorrectable multi-bit failures [15:08:02] RECOVERY - MariaDB Replica IO: pc3 on pc2013 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:08:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1071287 (owner: 10Krinkle) [15:08:54] pc2013 should be fine now more or less [15:09:12] I will setup the circular replication [15:09:20] and then will help with monitoring [15:09:46] we have to silence pc1013 too [15:09:59] if someone can send the patch to disable monitoring there [15:10:04] on hiera [15:10:08] (03CR) 10Filippo Giunchedi: [C:03+1] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [15:10:11] (03CR) 10Filippo Giunchedi: [C:03+1] titan: add opentelemetry collector [puppet] - 
10https://gerrit.wikimedia.org/r/1074434 (owner: 10Herron) [15:10:57] arnaudb: can you open a task with ops-eqiad added to look into pc1013? while it's OOW, in many cases we have parts from decommssioned, but not yet recycled servers we can swap in [15:11:30] moritzm: sure, aside of T375382 right? [15:11:31] T375382: Post pc1013 crash - https://phabricator.wikimedia.org/T375382 [15:11:52] (03CR) 10Herron: [C:03+2] titan: add opentelemetry collector [puppet] - 10https://gerrit.wikimedia.org/r/1074434 (owner: 10Herron) [15:12:12] oh, sorry I had missed that task, then no need [15:12:27] ack, was unsure it needed one, I'll mention it then! [15:12:29] circular replication setup [15:12:41] thx [15:12:51] sadly we will have an empty cache, as volans mentioned, which is why I was waiting for it to warm up [15:12:59] (before the incident) [15:14:59] !log stevemunene@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: allow 3 new endpoints T364233 T368085 T374195 [15:15:01] cleaning up heartbeat table so orchestator and monitoring gets better [15:15:06] T364233: add https://imagehash-sparql.wmcloud.org/sparql endpoint to wikidata federated query whitelists - https://phabricator.wikimedia.org/T364233 [15:15:07] T368085: Allow federated queries with Dbnary (kaiko.getalp.org) - https://phabricator.wikimedia.org/T368085 [15:15:07] T374195: Add https://metabase.wikibase.cloud/query/sparql to the Wikidata Federated Query Whitelist - https://phabricator.wikimedia.org/T374195 [15:15:26] jynus: as far as orch goes, I should tag both hosts as what? co-master? master? [15:15:40] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:15:47] orchwise? 
yep, it is a circular replication, active-active all the time [15:15:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:02] should show now 0 seconds [15:16:04] PROBLEM - Host cr3-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10167874 (10ABran-WMF) @MoritzMuehlenhoff mentionned that we might have spare parts available for this server from decommssioned, but not yet recycled servers : @wiki_willy I'm not sure... [15:16:10] I'll tag them as co-master then [15:16:34] should I prepare the disable notifications of pc1013? [15:16:41] I think you can [15:16:45] doing [15:16:56] I'll struggle with orchestrator command line for a bit [15:17:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:17:16] it's ok, those are not immediate issues [15:17:32] orch look ok to me now [15:17:48] I don't think there is nothing to do there, other than handle pc1013 [15:18:41] this is already on its way as it's been depooled and dc-ops have been mentionned to see if we have some memory stick available [15:18:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 1.958 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:52] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:19:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:19:54] see also in private my alternative proposal :) [15:20:23] (03PS1) 10Herron: opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034 [15:20:50] !log stevemunene@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: allow 3 new endpoints T364233 T368085 T374195 (duration: 05m 51s) [15:20:54] (03CR) 10CDanis: [C:03+1] opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034 (owner: 10Herron) [15:20:57] T364233: add https://imagehash-sparql.wmcloud.org/sparql endpoint to wikidata federated query whitelists - https://phabricator.wikimedia.org/T364233 [15:20:57] T368085: Allow federated queries with Dbnary (kaiko.getalp.org) - https://phabricator.wikimedia.org/T368085 [15:20:58] T374195: Add https://metabase.wikibase.cloud/query/sparql to the Wikidata Federated Query Whitelist - https://phabricator.wikimedia.org/T374195 [15:21:06] RECOVERY - Host cr3-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.53 ms [15:21:16] (03PS2) 10Herron: opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034 [15:21:38] (03CR) 10CDanis: [C:03+1] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [15:22:06] (03CR) 10Herron: [C:03+2] opentelemetry::collector: add package dependency for config file [puppet] - 10https://gerrit.wikimedia.org/r/1075034 (owner: 10Herron) [15:23:23] hows mediawiki uncached performance/parsercache performace, is it ok? 
[15:23:50] (03PS1) 10Jcrespo: mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) [15:24:31] (03CR) 10Arnaudb: [C:03+1] mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [15:24:39] there is like a 33% increase in parses: https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1 [15:26:37] (03CR) 10Jcrespo: [C:03+2] mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [15:26:48] (03PS2) 10Jcrespo: mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) [15:26:58] (03CR) 10Jcrespo: [V:03+2 C:03+2] mariadb: Disable pc1013 notifications [puppet] - 10https://gerrit.wikimedia.org/r/1075036 (https://phabricator.wikimedia.org/T375382) (owner: 10Jcrespo) [15:29:48] we should be in a bit of a degraded performance for a few hours [15:30:07] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1530). [15:30:11] arnaudb: ddi you update zarcillo, should I? 
[15:30:44] I'll do it jynus [15:31:17] sadly, we hit the bug where setting a pc host as master removes its monitoring [15:31:54] whatever the puppet config is, it should be switched to whatever x2 has [15:33:51] (03PS2) 10Giuseppe Lavagetto: service_proxy: Add a listener for the http interface of graphite [puppet] - 10https://gerrit.wikimedia.org/r/1075016 (https://phabricator.wikimedia.org/T374887) [15:33:51] (03PS1) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [15:33:52] (03PS1) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [15:33:54] (03PS1) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [15:35:17] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [15:35:43] temporarily misconfigured zarcillo: https://phabricator.wikimedia.org/P69397 [15:36:48] (03CR) 10CI reject: [V:04-1] conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:36:49] no issues, arnaudb [15:36:53] (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:37:01] (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto) [15:37:44] it would be hard for automation to get it, and even if it got it, it only affects the grouping of metrics, not the metrics themselves [15:38:18] so the issue is in modules/profile/manifests/mariadb/parsercache.pp [15:38:53] it should be like the core ones [15:40:13] funnily, it was fixed in the past: 
https://phabricator.wikimedia.org/rOPUP79104d15efe2bbc049abc7c7dd90584d06bed0be [15:41:13] (03CR) 10CDanis: git: add replicated_local_repo define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:43:19] (03CR) 10Vgutierrez: [C:03+1] Renamed log field for pipeline migration (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [15:44:02] (03CR) 10Herron: [C:03+2] thanos-query: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1072586 (owner: 10Herron) [15:45:00] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1419.eqiad.wmnet, mw1434.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1462.eqiad.wmnet, mw1415.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, parse1021.eqiad.wmnet, mw1435.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1408.eqiad.wmnet, kubernetes1012.eqiad [15:45:00] mw1465.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1483.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1024.eqiad.wmnet, mw1464.eqiad.wmnet, mw1381.eqiad.wmnet, mw1352.eqiad.wmnet, parse1018.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, mw1376.eqiad.wmnet, kubernetes1026.eqiad.wmnet, wikikube-worker1020.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1387.eqia [15:45:00] mw1378.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1449.eqiad.wmnet, mw1461.eqiad.wmnet, mw1357.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1467.eqiad.wmnet, kubernetes1020.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [15:45:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, parse1013.eqiad.wmnet, kubernetes1025.eqiad.wmnet, 
mw1367.eqiad.wmnet, mw1442.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1386.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1462.eqiad.wmnet, mw1415.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1391.eqiad.wmnet, mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw [15:45:00] ad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1395.eqiad.wmnet, mw1465.eqiad.wmnet, mw1466.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1419.eqiad.wmnet, kubernetes1059.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1360.eqiad.wmnet, wikikube-worker1001.eqiad.wmnet, parse1012.eqiad.wmnet, wikikube-worker1024.eqiad.wmnet, mw1468.eqiad.wmnet, parse1006.eqiad.wmnet, kubernetes1028. [15:45:00] net, wikikube-worker1010.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1024.eqiad.wmnet, kubernetes1062.eqiad.wmnet, mw1464.eqiad.wmnet, parse1021.eqiad.wmnet, mw1431.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [15:45:43] what's up [15:46:03] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [15:46:14] sukhe: looks like eventstreams is not-up (again) [15:46:18] I am still in a meeting so I haven't read the backlog. 
I can quit the meeting in five [15:46:38] likely T375146 [15:46:55] https://phabricator.wikimedia.org/T375146 [15:47:53] 06SRE, 06DBA: Parsercache primary master databases should monitor replication - https://phabricator.wikimedia.org/T375395 (10jcrespo) 03NEW [15:47:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10168087 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm [15:48:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168089 (10jcrespo) I've created T375395 to reflect that, despite being prometed from a replica to a master, and from passive to active, it now has less monitoring than before. I think parsercache should hav... [15:50:54] yeah :| [15:51:00] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:51:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:52:53] (03CR) 10Bking: [C:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [15:53:10] (03CR) 10Vgutierrez: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:53:56] (03PS2) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [15:53:57] (03PS2) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) [15:53:57] (03PS2) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [15:53:59] (03CR) 
10Giuseppe Lavagetto: git: add replicated_local_repo define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:56:33] (03CR) 10CI reject: [V:04-1] conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:56:35] (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [15:56:48] jouncebot: nowandnext [15:56:48] For the next 0 hour(s) and 3 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1530) [15:56:48] In 1 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700) [15:56:48] In 1 hour(s) and 3 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700) [15:56:58] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes mw2424 and mw2425 - https://phabricator.wikimedia.org/T375398 (10jijiki) 03NEW [15:57:14] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use kafka "external-services" fqdn with use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse) [15:57:23] (03CR) 10CI reject: [V:04-1] profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 (owner: 10Giuseppe Lavagetto) [15:58:26] 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10168164 (10elukey) I took a look to puppetserver1002 and even aftet the change for the 35 workers, the memory used was almost 95%. The heap size usage stops aroun... 
[15:58:31] (03Merged) 10jenkins-bot: cirrus-streaming-updater: use kafka "external-services" fqdn with use_all_dns_ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075006 (owner: 10DCausse) [15:59:00] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers parse1011.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1479.eqiad.wmnet, mw1388.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1408.eqiad.wmnet, mw1389.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1014.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, mw1483.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker100 [15:59:00] wmnet, mw1458.eqiad.wmnet, parse1006.eqiad.wmnet, mw1381.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1376.eqiad.wmnet, kubernetes1035.eqiad.wmnet, kubernetes1026.eqiad.wmnet, parse1014.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, kubernetes1062.eqiad.wmnet, mw1378.eqiad.wmnet, mw1449.eqiad.wmnet, mw1461.eqiad.wmnet, wikikube-worker1018.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1027.eqia [15:59:00] mw1414.eqiad.wmnet, wikikube-worker1019.eqiad.wmnet, mw1485.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, mw1396.eqiad.wmnet, mw1463.eqiad.wmnet, parse1023.eqiad https://wikitech.wikimedia.org/wiki/PyBal [15:59:00] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers kubernetes1010.eqiad.wmnet, parse1011.eqiad.wmnet, mw1433.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1367.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1386.eqiad.wmnet, mw1479.eqiad.wmnet, mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, parse1009.eqiad.wmnet, mw1405.eqiad.wmnet, mw1399.eqiad.wmnet, mw1435.eqi [15:59:00] , mw1424.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1466.eqiad.wmnet, 
mw1483.eqiad.wmnet, mw1419.eqiad.wmnet, mw1469.eqiad.wmnet, mw1486.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1356.eqiad.wmnet, mw1458.eqiad.wmnet, mw1371.eqiad.wmnet, parse1012.eqiad.wmnet, mw1468.eqiad.wmnet, kubernetes1028.eqiad.wmnet, wikikube-worker10 [15:59:00] .wmnet, kubernetes1031.eqiad.wmnet, kubernetes1024.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, wikikube-worker1031.eqiad.wmnet, mw1376.e https://wikitech.wikimedia.org/wiki/PyBal [15:59:33] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [15:59:49] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:59:51] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:01:51] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168177 (10jcrespo) {P69398} [16:01:59] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:01:59] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:03:09] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:03:31] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:04:07] !log elukey@puppetserver1001 conftool action : set/pooled=true,weight=10; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad [16:05:17] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:05:31] (03PS1) 10Elukey: conftool: add registry1005 to the docker-registry pool [puppet] - 10https://gerrit.wikimedia.org/r/1075050 (https://phabricator.wikimedia.org/T332016) [16:05:33] !log 
dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:06:27] (03CR) 10Elukey: [C:03+2] conftool: add registry1005 to the docker-registry pool [puppet] - 10https://gerrit.wikimedia.org/r/1075050 (https://phabricator.wikimedia.org/T332016) (owner: 10Elukey) [16:08:13] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad [16:08:22] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=registry1005.eqiad.wmnet,service=docker-registry,dc=eqiad [16:08:48] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=registry1003.eqiad.wmnet,service=docker-registry,dc=eqiad [16:10:39] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - https://phabricator.wikimedia.org/T373969#10168223 (10Ladsgroup) a:05Ladsgroup→03None It should be done by the per... 
[16:12:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168246 (10jcrespo) [16:13:18] (03CR) 10CDanis: git: add replicated_local_repo define (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [16:18:23] (03PS2) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T375401) [16:18:36] (03PS3) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T375401) [16:19:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) [16:20:43] (03CR) 10Ottomata: [C:03+1] config: remove eventbus instrumentation setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062430 (https://phabricator.wikimedia.org/T363587) (owner: 10Gmodena) [16:20:44] (03PS2) 10DCausse: rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) [16:21:07] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10168311 (10Jhancock.wm) [16:21:11] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: cp307[12] thermal issues - https://phabricator.wikimedia.org/T374986#10168315 (10RobH) Ongoing conversations via email with support, they've moved onto scheduling an onsite. Sent all location details over along with a proposed maint window of October 2nd. (Everyth... 
[16:25:15] 06SRE, 06DBA: Parsercache primary master databases should monitor replication - https://phabricator.wikimedia.org/T375395#10168325 (10jcrespo) p:05Triage→03Low May not be needed if pc is rearchitectured at: T373037 [16:28:41] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168332 (10jcrespo) p:05Medium→03High was unbreak now, high now that issues has been mitigated after pc1013 failover. [16:28:49] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10168338 (10Jhancock.wm) a:03Jhancock.wm [16:29:44] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10168337 (10Jhancock.wm) @elukey hey these are the two new super micro servers I installed last week. I thought it went through without a hitch but something in the BMC didn't take. logging-hd2004 logging-hd2005 sre... [16:31:57] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 4 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:33:15] (03PS2) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) [16:33:20] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Disable regex steam hadoop ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074102 (https://phabricator.wikimedia.org/T361498) (owner: 10Joal) [16:33:39] (03CR) 10Milimetric: "just some style thoughts" [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [16:35:02] (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. 
(033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [16:36:25] (03PS3) 10Ebernhardson: cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) [16:37:29] (03CR) 10Ladsgroup: "I think that was an oversight. I will fix it." [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [16:37:49] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2424 - https://phabricator.wikimedia.org/T375270#10168371 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm logged into the server and not seeing any issues. looks like it might have healed itself. no memory errors pointing to another issue like that... [16:38:32] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1246.eqiad.wmnet with OS bookworm [16:38:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:38:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10168377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host db1246.eqiad.wmnet with OS bookworm executed with errors: - db1246 (**FAIL**) -... 
[16:39:06] (03PS1) 10Ladsgroup: pc2017: Set it to master [puppet] - 10https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355) [16:39:32] (03CR) 10Ebernhardson: [C:03+2] cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [16:39:46] (03Abandoned) 10DErenrich: Add citation-needed-api to toolforge's prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1039850 (https://phabricator.wikimedia.org/T363371) (owner: 10DErenrich) [16:40:31] (03Merged) 10jenkins-bot: cirrus: Read from public and private streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1073566 (https://phabricator.wikimedia.org/T374335) (owner: 10Ebernhardson) [16:40:48] (03CR) 10Ladsgroup: "I490f73b05d39c41d7b3b2b" [dns] - 10https://gerrit.wikimedia.org/r/1073897 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [16:42:27] (03CR) 10Ottomata: "VERY COOL!" [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [16:43:08] !log pt1979@cumin1002 START - Cookbook sre.hosts.dhcp for host db1246.eqiad.wmnet [16:43:23] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:43:30] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:47:05] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10168415 (10BCornwall) Great write-up! I heartily disagree about self-documentation, though. While having clear, understandable code is a must, so too must the user operation: Nobody should have to t... 
[16:49:28] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:49:33] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:52:59] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 8 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:56:43] (03CR) 10Dreamy Jazz: "Thanks for the comments. Addressing these now." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [16:58:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to stat1007 for cyndywikime - https://phabricator.wikimedia.org/T375060#10168452 (10Ottomata) Approved [16:59:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168456 (10wiki_willy) ++ @Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers >>! In T375382#10167873, @ABran-WMF wrote: > @MoritzMuehlenhoff me... 
[16:59:16] (03CR) 10Jdlrobson: [C:03+1] Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [16:59:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [17:00:04] (03Abandoned) 10Jdlrobson: Drop support for non-Codex message box styles in Vector 2022 and Vector [skins/Vector] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074282 (https://phabricator.wikimedia.org/T360668) (owner: 10Jdlrobson) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700) [17:00:04] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T1700). 
[17:00:07] (03PS1) 10Ebernhardson: Revert "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075054 [17:02:19] (03CR) 10Ebernhardson: [C:03+2] Revert "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075054 (owner: 10Ebernhardson) [17:02:51] (03PS2) 10Jdlrobson: Do not apply table styling rules to Main page [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) [17:03:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) (owner: 10Jdlrobson) [17:03:17] (03Merged) 10jenkins-bot: Revert "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075054 (owner: 10Ebernhardson) [17:04:47] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10168469 (10wiki_willy) Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be o... [17:05:49] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:06:02] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:06:10] (03CR) 10Jforrester: "CI is complaining that there's no graphite for Beta Cluster, which is irritating." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075017 (https://phabricator.wikimedia.org/T374887) (owner: 10Giuseppe Lavagetto) [17:07:27] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:07:32] hmm [17:10:41] (03PS1) 10Stoyofuku-wmf: Deploy donate link to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) [17:10:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074550 (https://phabricator.wikimedia.org/T373585) (owner: 10Stoyofuku-wmf) [17:13:29] (03PS1) 10Ebernhardson: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055 [17:14:03] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [17:14:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es1022 - https://phabricator.wikimedia.org/T375257#10168489 (10VRiley-WMF) Hey @ABran-WMF as it turns out, we don't happen to have any 2TB to use as a replacment. However, we do have plenty of 4TB drives that should work. Is it okay to move forward with... 
[17:14:08] (03PS2) 10Ebernhardson: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055 [17:14:33] (03PS3) 10Ebernhardson: Revert^2 "cirrus: Read from public and private streams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075055 (https://phabricator.wikimedia.org/T374335) [17:15:13] (03PS3) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) [17:15:16] (03CR) 10Dreamy Jazz: [WikiReplicas] Hide autoblock targets in the globalblocks table (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [17:16:29] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:30:12] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10168622 (10phaultfinder) [17:34:57] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:35:02] hmm ok [17:35:25] virtual chassis, I have no idea where to go from here but let's try [17:36:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T375401) (owner: 10Jdlrobson) [17:41:37] (03PS3) 10Ebernhardson: cirrus: Remove unused Regex pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) [17:42:06] (03CR) 10Ebernhardson: [C:03+1] "Verified in our dashboards (https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters) this pool counter is now unused." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [17:45:26] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10168717 (10phaultfinder) [17:49:27] ^ these are known as per papau.l. both asw-{c,d}-codfw are being decommissioned [17:50:54] (03PS10) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:54:42] (03CR) 10Ebernhardson: "realized i wont be available for the full deploy window, this will likely be rescheduled for thursday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) [17:55:24] (03PS11) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:56:37] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4092/co" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:56:54] (03CR) 10BCornwall: "PS10 and PS11 addresses a double-redirect for wikimediafoundation.org that failed pcc" [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [17:58:11] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 702496384 and 48 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:58:55] fun [17:59:11] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 51152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:03:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:48] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375314#10168759 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Swapped out cable. Closing for now. [18:07:38] (03CR) 10Milimetric: [C:03+1] [WikiReplicas] Hide autoblock targets in the globalblocks table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [18:08:08] (03CR) 10Milimetric: [C:03+1] "Looks good, I don't have +2, but I'm ok to merge." [puppet] - 10https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: 10Dreamy Jazz) [18:12:16] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10168776 (10VRiley-WMF) Hi! We do have a spare DIMM that we can swap at anytime for this unit. Please let us know when is the best time to proceed with this. Thanks! 
[18:21:47] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:24:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:24:40] (03CR) 10Jdrewniak: [C:03+1] Add a 0-coverage QuickSurvey to enwiki to advertise the Add A Fact Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich) [18:24:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:47] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:30:11] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10168850 (10phaultfinder) [18:31:37] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:33:58] (03PS1) 10Ssingh: P:dns::recursor: set allow_extended_errors to true [puppet] - 10https://gerrit.wikimedia.org/r/1075062 [18:35:01] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4093/co" [puppet] - 10https://gerrit.wikimedia.org/r/1075062 (owner: 10Ssingh) [18:35:11] (03PS2) 10Ssingh: P:dns::recursor: set allow_extended_errors to true [puppet] - 10https://gerrit.wikimedia.org/r/1075062 (https://phabricator.wikimedia.org/T375414) [18:36:02] (03CR) 10Ssingh: "Will merge after the switchover." 
[puppet] - 10https://gerrit.wikimedia.org/r/1075062 (https://phabricator.wikimedia.org/T375414) (owner: 10Ssingh) [18:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:43:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:44:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:46:13] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 727858512 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:48:13] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:48:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:49:19] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418 (10Papaul) 03NEW [18:49:25] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10168978 (10Papaul) p:05Triage→03Medium [18:50:15] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419 (10Papaul) 03NEW [18:50:31] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10168991 (10Papaul) p:05Triage→03Medium [18:56:41] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:04:45] (03PS1) 10Scott French: mw-(api-ext|web): scale back to 75% at p95 targets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) [19:04:45] (03CR) 10Scott French: "Realized today that I forgot to send this one, which is actually needed for Tuesday :) Thanks in advance for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075056 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [19:05:11] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:08:43] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:11] 06SRE-OnFire, 10Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10169021 (10Eevans) >>! In T375309#10168415, @BCornwall wrote: > Great write-up! I heartily disagree about self-documentation, though. While having clear, understandable code is a must, so too must t... [19:09:15] (03PS4) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) [19:15:09] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10169025 (10phaultfinder) [19:15:10] (03PS5) 10Jdlrobson: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) [19:32:16] (03CR) 10Gmodena: [C:03+1] Declare streams in support of the reconciliation mechanism for Dumps 2.0. (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [19:33:05] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305#10169043 (10jhathaway) Do we have to put the license in every file? The link you mentioned only says "consider". Just seems to be a bit tedious. 
[19:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:41:39] (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [19:45:36] (03PS1) 10Ebernhardson: Let PageEntitySerializer.canonicalPageURL accept PageReference [extensions/EventBus] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070675 (https://phabricator.wikimedia.org/T372904) (owner: 10Peter Fischer) [19:45:37] (03CR) 10Ebernhardson: "Is this intended to be against the master branch? I was pondering abandoning since this is against 1.43.0-wmf.21 and .23 is the minimum d" [extensions/EventBus] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070675 (https://phabricator.wikimedia.org/T372904) (owner: 10Peter Fischer) [19:46:31] (03PS1) 10Gerrit maintenance bot: Add tdd to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1075070 (https://phabricator.wikimedia.org/T375422) [19:47:29] (03PS3) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) [19:49:16] (03CR) 10Xcollazo: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [19:49:53] (03CR) 10Ladsgroup: [C:03+2] Add tdd to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1075070 (https://phabricator.wikimedia.org/T375422) (owner: 10Gerrit maintenance bot) [19:56:42] (03CR) 10Ottomata: Declare streams in support of the reconciliation mechanism for Dumps 2.0. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073855 (https://phabricator.wikimedia.org/T368755) (owner: 10Xcollazo) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T2000). [20:00:05] derenrich and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:24] o/ [20:00:51] (this is my first patch so excuse any errors) [20:01:42] p/ [20:08:59] Gonna deploy Jon's patches [20:09:58] Unfortunately I do not have the power to manually +2 the backport patch so I'll do the two config deploys first, then the longer backport [20:10:31] np [20:10:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [20:10:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) (owner: 10Jdlrobson) [20:11:59] (03Merged) 10jenkins-bot: Remove ProofreadPage dark mode namespaces exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072600 (owner: 10Ebrahim) [20:12:01] (03Merged) 10jenkins-bot: Promote dark mode for anons on tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074490 (https://phabricator.wikimedia.org/T374679) (owner: 10Jdlrobson) [20:12:27] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1072600|Remove ProofreadPage dark mode namespaces exception]], [[gerrit:1074490|Promote dark mode for anons on tier 1 wikis (T374679)]] [20:12:32] T374679: Check which projects are ready for dark mode for anons - https://phabricator.wikimedia.org/T374679 [20:13:55] (03PS1) 10Papaul: Remove old switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1075078 (https://phabricator.wikimedia.org/T375419) [20:14:54] * jan_drewniak toyofuku: ping me when you're done and I'll do derenrich's patch [20:15:18] * jan_drewniak (I keep pressing shift-enter because that's my slack setup...) [20:15:22] Sounds good!! 
Thank you ☺️ [20:20:50] (03CR) 10Papaul: [C:03+2] Remove old switch stack [puppet] - 10https://gerrit.wikimedia.org/r/1075078 (https://phabricator.wikimedia.org/T375419) (owner: 10Papaul) [20:21:30] (03CR) 10Btullis: [C:03+2] Add a new partition recipe for k8s workers with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075007 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [20:23:41] !log toyofuku@deploy1003 jdlrobson, toyofuku, ebrahim: Backport for [[gerrit:1072600|Remove ProofreadPage dark mode namespaces exception]], [[gerrit:1074490|Promote dark mode for anons on tier 1 wikis (T374679)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:45] T374679: Check which projects are ready for dark mode for anons - https://phabricator.wikimedia.org/T374679 [20:23:52] Jdlrobson: ready for testing~ [20:24:05] on it [20:24:47] toyofuku: that's good to go [20:25:04] !log toyofuku@deploy1003 jdlrobson, toyofuku, ebrahim: Continuing with sync [20:27:35] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Corto: Licensing & copyright information - https://phabricator.wikimedia.org/T375305#10169272 (10Eevans) >>! In T375305#10169043, @jhathaway wrote: > Do we have to put the license in every file? The link you mentioned only says "consider". Just seems to b... [20:30:06] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375218#10169282 (10phaultfinder) [20:32:46] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10169305 (10phaultfinder) [20:35:33] This deploy feels very slow - is it just me? 
[20:36:56] yes [20:37:14] and am a bit worried that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1074545 hasn't been started yet [20:37:56] Yeah - what's strange is it's the deploy steps itself that are slow, not like test infra [20:38:07] So the backport could potentially take an eternity [20:38:18] toyofuku: hmm my config change seems to be live? [20:38:26] But like, I'm more concerned about _why_ the deploy is so slow [20:38:36] Yeah we're at 60% of servers rn [20:38:41] ok gotcha [20:39:03] But this doesn't feel like an issue with my internet connection so curious who to tag to make sure our prod infra is not in need of any attention [20:39:08] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [20:40:00] slow deploy could be either slow network or slow machines both of which wouldn't be ideal [20:40:07] toyofuku: it could be because it affects message keys. I'm not familiar with the details, but I know message related deploys are slower than deploys that don't involve messages (probably because of the message cache?) [20:40:09] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1072600|Remove ProofreadPage dark mode namespaces exception]], [[gerrit:1074490|Promote dark mode for anons on tier 1 wikis (T374679)]] (duration: 27m 41s) [20:40:13] T374679: Check which projects are ready for dark mode for anons - https://phabricator.wikimedia.org/T374679 [20:40:16] ahhhh [20:40:19] I do remember that [20:40:21] (03PS2) 10Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 [20:40:23] Well, that one's done [20:40:30] Gonna quickly do the next one [20:40:34] We might go over [20:40:39] Who would be the right person to tag for that? 
[20:40:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) (owner: 10Jdlrobson) [20:41:23] I guess the security window is next so we should ping Reedy sbassett and maryum as an FYI that we might go over. [20:41:55] Eta 30 mins on merging that patch 🥲 [20:42:56] Reedy: sbasset: maryum: I don't know how to tag you in irc so hopefully this works, but we're in the middle of a backport deploy that is likely to extend into the security window - would that be alright with you all? [20:43:22] sbassett: misspelled your handle myb [20:56:12] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [20:56:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1039.eqiad.wmnet with OS bookworm [20:56:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1039.eqiad.wmnet with OS bookworm [20:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:56:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [21:00:04] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240923T2100). 
[21:01:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1040.eqiad.wmnet with OS bookworm [21:02:06] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169394 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1040.eqiad.wmnet with OS bookworm [21:02:41] As mentioned before, we're still in the middle of a backport - this can presumably be aborted if we need to, but since it's taking a long time it would be great to finish so we don't have to start over later [21:04:21] :( [21:04:47] looks like it's almost done [21:10:00] (Merged) jenkins-bot: Do not apply table styling rules to Main page [extensions/WikimediaMessages] (wmf/1.43.0-wmf.23) - https://gerrit.wikimedia.org/r/1074545 (https://phabricator.wikimedia.org/T375245) (owner: Jdlrobson) [21:10:14] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1074545|Do not apply table styling rules to Main page (T375245)]] [21:10:18] T375245: Links are unreadable on main page in dark mode - https://phabricator.wikimedia.org/T375245 [21:10:32] let's see how long the actual deploy takes [21:10:51] o_o [21:11:39] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1041.eqiad.wmnet with OS bookworm [21:11:50] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169432 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1041.eqiad.wmnet with OS bookworm [21:12:50] !log toyofuku@deploy1003 jdlrobson, toyofuku: Backport for [[gerrit:1074545|Do not apply table styling rules to Main page (T375245)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:02] Jdlrobson: ready for testing!
[21:13:35] toyofuku: on it [21:15:17] toyofuku: LGTM! [21:15:27] proceeding [21:15:29] !log toyofuku@deploy1003 jdlrobson, toyofuku: Continuing with sync [21:17:54] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2020.codfw.wmnet [21:18:46] okay it def was the message cache bc this one is going much faster [21:20:58] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1074545|Do not apply table styling rules to Main page (T375245)]] (duration: 10m 44s) [21:21:03] T375245: Links are unreadable on main page in dark mode - https://phabricator.wikimedia.org/T375245 [21:21:20] Jdlrobson: we're all done! [21:21:27] Thanks for your patience everyone [21:21:43] Jan_drewniak: all yours but we ran way over so might want to make sure it's okay to proceed [21:21:52] thanks toyofuku ! [21:23:14] happy to help ☺️ [21:24:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074311 (owner: 10DErenrich) [21:25:26] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.categories-reload (exit_code=97) reloading categories to wdqs2020.codfw.wmnet [21:28:41] (03PS1) 10Btullis: Update the partman configuration for k8s with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075081 (https://phabricator.wikimedia.org/T365283) [21:29:15] (03CR) 10Btullis: [C:03+2] Update the partman configuration for k8s with local storage [puppet] - 10https://gerrit.wikimedia.org/r/1075081 (https://phabricator.wikimedia.org/T365283) (owner: 10Btullis) [21:30:51] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1001.eqiad.wmnet with OS bookworm [21:34:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:38:04] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host 
dse-k8s-worker1001.eqiad.wmnet with OS bookworm [21:38:12] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old switch stack - pt1979@cumin2002" [21:38:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old switch stack - pt1979@cumin2002" [21:38:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:39:36] thanks toyofuku we'll take care of derenrich's patch tomorrow [21:39:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1046.eqiad.wmnet with OS bookworm [21:40:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1046.eqiad.wmnet with OS bookworm [21:46:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1045.eqiad.wmnet with OS bookworm [21:46:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169514 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1045.eqiad.wmnet with OS bookworm [21:50:21] !log bking@cumin2002 START - Cookbook sre.wdqs.categories-reload reloading categories to wdqs2020.codfw.wmnet [21:51:44] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [21:55:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1001.eqiad.wmnet with reason: host reimage [22:03:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:05:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1045.eqiad.wmnet with reason: host reimage [22:06:25] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1040.eqiad.wmnet with reason: host reimage [22:06:25] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:06:37] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 789556784 and 108 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:08:25] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:08:34] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1046.eqiad.wmnet with reason: host reimage [22:09:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1045.eqiad.wmnet with reason: host reimage [22:11:42] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 130048 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:13:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1046.eqiad.wmnet with reason: host reimage [22:13:57] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1041.eqiad.wmnet with reason: host reimage [22:17:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
ganeti1041.eqiad.wmnet with reason: host reimage [22:20:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10169564 (10Papaul) [22:21:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169567 (10Papaul) [22:21:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1040.eqiad.wmnet with reason: host reimage [22:24:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:24:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:24:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1045.eqiad.wmnet with OS bookworm [22:24:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1045.eqiad.wmnet with OS bookworm completed:... 
[22:27:44] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:28:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:28:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1046.eqiad.wmnet with OS bookworm [22:28:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1046.eqiad.wmnet with OS bookworm completed:... [22:31:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1048.eqiad.wmnet with OS bookworm [22:31:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169612 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1048.eqiad.wmnet with OS bookworm [22:32:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:32:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1039.eqiad.wmnet with OS bookworm [22:32:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169613 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1039.eqiad.wmnet with OS bookworm [22:32:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:32:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1041.eqiad.wmnet with OS bookworm [22:33:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169614 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1041.eqiad.wmnet with OS bookworm completed:... [22:37:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:37:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:37:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1040.eqiad.wmnet with OS bookworm [22:37:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1040.eqiad.wmnet with OS bookworm completed:... 
[22:40:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1047.eqiad.wmnet with OS bookworm [22:40:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169623 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1047.eqiad.wmnet with OS bookworm [22:44:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169624 (10Jclark-ctr) [22:46:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1048.eqiad.wmnet with reason: host reimage [22:49:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1048.eqiad.wmnet with reason: host reimage [22:50:07] 10SRE-swift-storage: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448 (10prabhat) 03NEW [22:53:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1047.eqiad.wmnet with reason: host reimage [22:56:25] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1049.eqiad.wmnet with OS bookworm [22:56:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1049.eqiad.wmnet with OS bookworm [22:57:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1047.eqiad.wmnet with reason: host reimage [23:02:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1050.eqiad.wmnet with OS bookworm [23:02:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 
to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1050.eqiad.wmnet with OS bookworm [23:04:09] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:04:28] 10SRE-swift-storage: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448#10169654 (10Pppery) This isn't a problem with the imageinfo API. The file itself has just somehow disappeared (the UI shows it broken too). [23:05:26] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:05:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1048.eqiad.wmnet with OS bookworm [23:05:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1048.eqiad.wmnet with OS bookworm completed:... 
[23:06:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169658 (10Jclark-ctr) [23:08:26] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Cannot move Commons File:Dhruve_Sehgal_in_2021.png - https://phabricator.wikimedia.org/T372924#10169661 (10Pppery) 05Open→03Resolved a:03Robertsky Nobody is going to track down what happened a month ago - it's well known and tracked elsew... [23:09:07] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1052.eqiad.wmnet with OS bookworm [23:09:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1052.eqiad.wmnet with OS bookworm [23:10:36] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1049.eqiad.wmnet with reason: host reimage [23:11:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:13:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1039.eqiad.wmnet with reason: host reimage [23:14:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1051.eqiad.wmnet with OS bookworm [23:14:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1049.eqiad.wmnet with reason: host reimage [23:14:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1051.eqiad.wmnet with OS bookworm [23:14:35] !log jclark@cumin1002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:14:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1047.eqiad.wmnet with OS bookworm [23:14:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169679 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1047.eqiad.wmnet with OS bookworm completed:... [23:15:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169680 (10Jclark-ctr) [23:16:34] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.categories-reload (exit_code=0) reloading categories to wdqs2020.codfw.wmnet [23:16:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1050.eqiad.wmnet with reason: host reimage [23:17:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1039.eqiad.wmnet with reason: host reimage [23:18:45] (03PS1) 10Papaul: ADD db2146 to use db.cfg for testing [puppet] - 10https://gerrit.wikimedia.org/r/1075085 (https://phabricator.wikimedia.org/T374215) [23:21:02] (03CR) 10Papaul: [C:03+2] ADD db2146 to use db.cfg for testing [puppet] - 10https://gerrit.wikimedia.org/r/1075085 (https://phabricator.wikimedia.org/T374215) (owner: 10Papaul) [23:21:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1050.eqiad.wmnet with reason: host reimage [23:23:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1052.eqiad.wmnet with reason: host reimage [23:24:06] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - 
https://phabricator.wikimedia.org/T370786#10169691 (10Eevans) >>! In T370786#10023319, @hnowlan wrote: > One of the big challenges I can see here is the use of compound words - currently we use lazy names like incident-create and incident... [23:27:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1052.eqiad.wmnet with reason: host reimage [23:29:22] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:33:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:34:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1049.eqiad.wmnet with OS bookworm [23:34:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1049.eqiad.wmnet with OS bookworm completed:... 
[23:34:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:34:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169706 (10Jclark-ctr) [23:34:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:34:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1039.eqiad.wmnet with OS bookworm [23:35:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169707 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1039.eqiad.wmnet with OS bookworm completed:... 
[23:35:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169708 (10Jclark-ctr) [23:35:56] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:38:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1075087 [23:42:36] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:43:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:43:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1050.eqiad.wmnet with OS bookworm [23:43:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10169713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host ganeti1050.eqiad.wmnet with OS bookworm completed:... [23:43:43] 06SRE-OnFire, 10Incident Tooling: Corto: Access model (MVP only) - https://phabricator.wikimedia.org/T375451 (10Eevans) 03NEW [23:51:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169742 (10Papaul)