[00:15:05] (03PS1) 10DDesouza: Reader Survey: Partially undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) [00:16:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [00:30:06] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104401 (owner: 10TrainBranchBot) [00:38:10] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104743 [00:38:10] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104743 (owner: 10TrainBranchBot) [00:53:37] (03PS1) 10Tim Starling: In shellbox set allowUrlFiles=true [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) [00:57:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1104743 (owner: 10TrainBranchBot) [01:04:35] (03CR) 10Scott French: "Ah, good catch! 
So, you'll need to bump the `version` in `Chart.yaml` as well, in order for deployments of this chart to pick up the chang" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [01:06:28] (03PS2) 10Tim Starling: In shellbox set allowUrlFiles=true [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) [01:06:42] (03CR) 10Tim Starling: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [01:08:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1104745 [01:08:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1104745 (owner: 10TrainBranchBot) [01:12:32] RECOVERY - Host ripe-atlas-eqsin is UP: PING WARNING - Packet loss = 71%, RTA = 0.47 ms [01:18:31] (03CR) 10Scott French: [C:03+1] "That should do the trick. Thanks, Tim!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [01:18:56] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [01:19:37] (03CR) 10Scott French: [C:03+2] In shellbox set allowUrlFiles=true [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [01:21:10] (03Merged) 10jenkins-bot: In shellbox set allowUrlFiles=true [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104744 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling) [01:23:28] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [01:23:40] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [01:26:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1104745 (owner: 10TrainBranchBot) [01:27:10] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [01:27:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [01:30:20] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [01:30:46] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [01:33:11] !log deployed shellbox-video to pick up config change for T292322 [01:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:15] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 [01:47:34] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/7f6584eae5accab8db4f596234c0e30714fce29ed44f68ecd2745ae41d03433f/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:07:34] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:08:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.8 [core] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1104747 (https://phabricator.wikimedia.org/T375667) [02:08:05] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.8 [core] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1104747 (https://phabricator.wikimedia.org/T375667) (owner: 10TrainBranchBot) [02:35:15] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [02:35:15] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [02:35:44] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.8 [core] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1104747 (https://phabricator.wikimedia.org/T375667) (owner: 10TrainBranchBot) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T0300) [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:15:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T0400) [04:01:47] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104751 (https://phabricator.wikimedia.org/T375667) [04:01:49] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104751 (https://phabricator.wikimedia.org/T375667) (owner: 10TrainBranchBot) [04:02:36] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104751 (https://phabricator.wikimedia.org/T375667) (owner: 10TrainBranchBot) [04:03:01] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.8 refs T375667 [04:03:05] T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667 [04:21:32] PROBLEM - Improperly 
owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:31:32] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T0500) [05:21:22] (03PS1) 10KartikMistry: Update Recommendation API to 2024-12-16-203402-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104754 (https://phabricator.wikimedia.org/T382278) [05:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10408201 (10phaultfinder) [06:35:15] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [06:35:15] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [06:54:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T0700) [07:00:05] marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T0700). nyaa~ [07:13:50] (03CR) 10Muehlenhoff: [C:03+2] Remove tarlogic1 from admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/1104725 (owner: 10Muehlenhoff) [07:14:53] (03CR) 10Muehlenhoff: [C:03+2] Copy puppet git hooks to puppetserver module [puppet] - 10https://gerrit.wikimedia.org/r/1104626 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [07:15:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [07:33:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [07:34:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [07:42:00] (03PS1) 10Muehlenhoff: Deprecate system::role for remaining WMCS roles [puppet] - 10https://gerrit.wikimedia.org/r/1104945 [07:42:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:42:48] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:42:56] 
PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:51:16] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53070 bytes in 8.573 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:51:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:51:46] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:53:20] !log installing expat security updates [07:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2093.codfw.wmnet [08:00:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2093.codfw.wmnet [08:01:00] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2092.codfw.wmnet [08:01:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2092.codfw.wmnet [08:02:26] * kart_ is here.. 
[08:02:58] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2092.codfw.wmnet with OS bookworm [08:02:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102733 (https://phabricator.wikimedia.org/T380928) (owner: 10KartikMistry) [08:03:01] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2093.codfw.wmnet with OS bookworm [08:03:17] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2092 [08:03:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2092 [08:03:20] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2093 [08:03:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2093 [08:03:42] (03Merged) 10jenkins-bot: Enable the Contribute menu in 5th group of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102733 (https://phabricator.wikimedia.org/T380928) (owner: 10KartikMistry) [08:04:39] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1102733|Enable the Contribute menu in 5th group of wikis (T380928)]] [08:04:43] T380928: Enable the Contribute menu in 5th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T380928 [08:06:40] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:07:18] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:04] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling 
restart_daemons on A:wcqs-public [08:17:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [08:18:15] !log kartik@deploy2002 kartik: Backport for [[gerrit:1102733|Enable the Contribute menu in 5th group of wikis (T380928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:18:19] T380928: Enable the Contribute menu in 5th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T380928 [08:19:30] !log kartik@deploy2002 kartik: Continuing with sync [08:19:35] 10SRE-tools, 06Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10408311 (10Volans) [08:19:57] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [08:20:11] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2093.codfw.wmnet with reason: host reimage [08:20:15] (03PS1) 10Slyngshede: Release v0.1.5 [software/bitu] - 10https://gerrit.wikimedia.org/r/1104946 [08:21:07] (03CR) 10Volans: [C:03+1] "I've performed the post-merge steps required by https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Renaming/Deleting_a_cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) (owner: 10Klausman) [08:23:10] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2092.codfw.wmnet with reason: host reimage [08:23:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2093.codfw.wmnet with reason: host reimage [08:26:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2092.codfw.wmnet with reason: host reimage [08:29:19] (03PS1) 10Jelto: miscweb: bump all design image versions [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1104948 (https://phabricator.wikimedia.org/T382230) [08:34:22] (03PS2) 10Volans: sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) [08:34:22] (03PS1) 10Volans: ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) [08:34:23] (03PS1) 10Volans: ownership: Data Platform cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) [08:34:24] (03PS1) 10Volans: ownership: Data Persistence cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) [08:34:26] (03PS1) 10Volans: ownership: Traffic cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) [08:34:27] (03PS1) 10Volans: ownership: ServiceOps cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) [08:34:29] (03PS1) 10Volans: ownership: Collaboration Services cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) [08:34:31] FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:33] (03PS1) 10Volans: ownership: Observability cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) [08:34:38] (03PS1) 10Volans: ownership: WMCS cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104956 (https://phabricator.wikimedia.org/T379258) [08:35:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling 
restart_daemons on A:wdqs-all [08:36:46] (03PS1) 10Hashar: devtools: fix hiera after host renaming [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) [08:37:45] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102733|Enable the Contribute menu in 5th group of wikis (T380928)]] (duration: 33m 05s) [08:37:49] T380928: Enable the Contribute menu in 5th group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T380928 [08:39:31] FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:41:58] (03PS1) 10Slyngshede: P:idm enable account managers LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1104958 (https://phabricator.wikimedia.org/T359820) [08:42:12] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:42:56] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:43:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one typo inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:43:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2093.codfw.wmnet with OS bookworm [08:43:42] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:43:53] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:44:42] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:46:08] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:46:22] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:46:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2092.codfw.wmnet with OS bookworm [08:48:36] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:49:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1104958 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [08:49:41] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:50:13] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:50:38] (03CR) 10Volans: "In addition to the linked task see also my email with subject "Introducing cookbook ownership" for more context and details." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104956 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:52:03] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2092.codfw.wmnet [08:52:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2092.codfw.wmnet [08:52:11] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2093.codfw.wmnet [08:52:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2093.codfw.wmnet [08:52:25] (03CR) 10AOkoth: [C:03+1] miscweb: bump all design image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104948 (https://phabricator.wikimedia.org/T382230) (owner: 10Jelto) [08:54:27] (03CR) 10FNegri: [C:03+1] ownership: WMCS cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104956 (https://phabricator.wikimedia.org/T379258) (owner: 
10Volans) [08:54:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2087.codfw.wmnet [08:55:10] (03CR) 10Jelto: [C:03+2] miscweb: bump all design image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104948 (https://phabricator.wikimedia.org/T382230) (owner: 10Jelto) [08:55:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2087.codfw.wmnet [08:55:55] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2084.codfw.wmnet [08:56:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104598 (https://phabricator.wikimedia.org/T379002) (owner: 10DCausse) [08:56:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2084.codfw.wmnet [08:56:31] (03Merged) 10jenkins-bot: miscweb: bump all design image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104948 (https://phabricator.wikimedia.org/T382230) (owner: 10Jelto) [09:00:01] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2084.codfw.wmnet with OS bookworm [09:00:02] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2087.codfw.wmnet with OS bookworm [09:03:03] (03PS2) 10Volans: ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) [09:03:14] (03CR) 10Volans: ownership: Infrastructure Foundations cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:05:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:07:02] (03CR) 10Elukey: [C:03+1] ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:08:09] I am hacking/live debugging `scap clean` since it did not manage to clean the old branch [09:08:53] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2084.codfw.wmnet with OS bookworm [09:09:30] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2084.codfw.wmnet with OS bookworm [09:13:05] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:13:11] (03CR) 10Volans: [C:03+2] ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:13:31] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:13:49] (03CR) 10Slyngshede: [C:03+2] P:idm enable account managers LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1104958 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [09:14:43] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [09:15:13] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:15:31] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [09:16:05] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:18:06] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [09:21:16] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:23:09] (03PS3) 10DCausse: 
rdf-streaming-updater: add wdqs udpater streams in event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) [09:23:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:27:27] the overnight train-presync failed on the deployment server [09:27:35] !log T378097: reindexing all lexemes [09:27:37] neither the timer nor the service are marked as failure [09:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:39] T378097: Investigation: why do statements on Senses and Forms not show up in searches using haswbstatement - https://phabricator.wikimedia.org/T378097 [09:27:49] which I am happily ignoring [09:27:58] so I am going to run scap sync-world [09:28:11] (the job failed to deploy to Kubernetes after reaching some timeout) [09:32:01] (03PS1) 10Elukey: charts: add Prometheus statsd support to Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104961 (https://phabricator.wikimedia.org/T216826) [09:34:09] !log hashar@deploy2002 Started scap sync-world: Overnight deployment timed out deploying to Kubernetes as usual - T375667 [09:34:13] T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667 [09:34:58] (03PS2) 10Elukey: charts: add Prometheus statsd support to Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104961 (https://phabricator.wikimedia.org/T216826) [09:38:11] * hashar shakes fists at Helm [09:40:27] (03PS1) 10Elukey: services: update production settings for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104963 (https://phabricator.wikimedia.org/T216826) [09:40:40] (03CR) 10Filippo Giunchedi: [C:03+1] "Very nice! 
Thank you" [software] - 10https://gerrit.wikimedia.org/r/1104727 (https://phabricator.wikimedia.org/T381680) (owner: 10Scott French) [09:41:22] !log hashar@deploy2002 Started scap sync-world: Kubernetes cluster was unreachable (timeout) - T375667 [09:41:26] T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667 [09:42:09] (03PS3) 10Volans: ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) [09:44:49] !log hashar@deploy2002 Finished scap sync-world: Kubernetes cluster was unreachable (timeout) - T375667 (duration: 03m 27s) [09:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:27] (03CR) 10Volans: [C:03+2] ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:56:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:52] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: default to -15d for sidecar min_time [puppet] - 10https://gerrit.wikimedia.org/r/1104630 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:57:59] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: refactor common functionality [puppet] - 10https://gerrit.wikimedia.org/r/1104631 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) 
[09:59:23] (03Merged) 10jenkins-bot: ownership: Infrastructure Foundations cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104949 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [10:08:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104963 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [10:10:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1104946 (owner: 10Slyngshede) [10:11:26] 06SRE, 10SRE-swift-storage, 06Commons: Interieur - 's-Gravenhage - 20089866 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381893#10408618 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This week's rclone sync job has fixed this for us (I've purged the commons... [10:11:48] 06SRE, 10SRE-swift-storage, 06Commons: Interieur - 's-Gravenhage - 20085391 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381891#10408624 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This week's rclone sync job has fixed this for us (I've purged the commons... 
[10:14:43] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on wikikube-worker2084.codfw.wmnet with reason: Test downtime to troubleshoot failed cookbook [10:14:45] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:15:00 on wikikube-worker2084.codfw.wmnet with reason: Test downtime to troubleshoot failed cookbook [10:25:03] (03PS1) 10Volans: icinga: fix icinga-status bug [puppet] - 10https://gerrit.wikimedia.org/r/1104967 [10:25:38] (03CR) 10CI reject: [V:04-1] icinga: fix icinga-status bug [puppet] - 10https://gerrit.wikimedia.org/r/1104967 (owner: 10Volans) [10:28:50] (03PS1) 10Muehlenhoff: Blacklist squashfs [puppet] - 10https://gerrit.wikimedia.org/r/1104968 [10:30:22] (03PS2) 10Volans: icinga: fix icinga-status bug [puppet] - 10https://gerrit.wikimedia.org/r/1104967 [10:33:14] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab1004.wikimedia.org with reason: Test downtime to troubleshoot failed cookbook [10:33:16] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:15:00 on gitlab1004.wikimedia.org with reason: Test downtime to troubleshoot failed cookbook [10:33:36] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:33:49] (03PS1) 10Muehlenhoff: Enable management of cn=wmf for production IDMs [puppet] - 10https://gerrit.wikimedia.org/r/1104970 [10:33:54] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:34:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1104967 (owner: 10Volans) [10:35:15] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [10:35:15] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [10:37:13] (03CR) 10Filippo Giunchedi: [C:03+1] icinga: fix icinga-status bug [puppet] - 10https://gerrit.wikimedia.org/r/1104967 (owner: 10Volans) [10:42:30] (03CR) 10Volans: [C:03+2] icinga: fix icinga-status bug [puppet] - 10https://gerrit.wikimedia.org/r/1104967 (owner: 10Volans) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1100) [11:02:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2084 [11:02:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2084 [11:02:48] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2087 [11:02:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2087 [11:04:13] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104975 [11:04:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [11:06:08] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:04] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:12:42] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [11:14:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [11:15:12] FIRING: [2x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:20:32] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2084.codfw.wmnet with reason: host reimage [11:21:15] (03CR) 10Slyngshede: [C:03+2] Release v0.1.5 [software/bitu] - 10https://gerrit.wikimedia.org/r/1104946 (owner: 10Slyngshede) [11:22:15] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2087.codfw.wmnet with reason: host reimage [11:23:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2084.codfw.wmnet with reason: host reimage [11:27:22] (03Merged) 10jenkins-bot: Release v0.1.5 [software/bitu] - 10https://gerrit.wikimedia.org/r/1104946 (owner: 10Slyngshede) [11:27:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2087.codfw.wmnet with reason: host reimage [11:30:04] (03CR) 10Elukey: [C:03+1] Blacklist squashfs [puppet] - 10https://gerrit.wikimedia.org/r/1104968 (owner: 10Muehlenhoff) [11:31:59] (03CR) 10Elukey: [C:03+1] benthos: webrequest_live: fix unittest failure [puppet] - 
10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [11:36:48] (03PS1) 10Slyngshede: Upgrade IDM to version 0.1.5 [dns] - 10https://gerrit.wikimedia.org/r/1104978 [11:40:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1104978 (owner: 10Slyngshede) [11:40:01] (03CR) 10Volans: "Ownership question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [11:40:05] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:29] (03CR) 10Slyngshede: [C:03+2] Upgrade IDM to version 0.1.5 [dns] - 10https://gerrit.wikimedia.org/r/1104978 (owner: 10Slyngshede) [11:43:05] PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:43:09] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:43:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2084.codfw.wmnet with OS bookworm [11:44:05] RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:44:56] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Add an ownership field to cookbooks. 
- https://phabricator.wikimedia.org/T379258#10408820 (10Volans) [11:45:36] (03CR) 10Giuseppe Lavagetto: ownership: Traffic cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [11:46:31] (03CR) 10Volans: ownership: Traffic cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [11:46:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2087.codfw.wmnet with OS bookworm [11:47:02] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [11:47:42] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10408826 (10cmooney) Thanks for the diagram @Papaul. Overall looks fine thanks. **FR-Tech** Probably worth catching up with the frack guys to go over the setup but based on your... 
[11:47:54] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1104970 (owner: 10Muehlenhoff) [11:48:18] (03CR) 10JMeybohm: [C:03+1] "k8s wise this lgtm but I don't know about the statsd exporter config" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104961 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [11:49:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [11:51:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2084.codfw.wmnet [11:51:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2084.codfw.wmnet [11:51:23] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2087.codfw.wmnet [11:51:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2087.codfw.wmnet [11:55:12] FIRING: [3x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:57:35] (03CR) 10Muehlenhoff: [C:03+2] Enable signups.validators.IsUsernameEmail validator [puppet] - 10https://gerrit.wikimedia.org/r/1104651 (https://phabricator.wikimedia.org/T382226) (owner: 10Muehlenhoff) [11:58:03] (03CR) 10Hnowlan: [C:03+1] "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1104961 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [12:02:27] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2082.codfw.wmnet [12:03:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2082.codfw.wmnet [12:03:10] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2083.codfw.wmnet [12:03:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2083.codfw.wmnet [12:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10409010 (10phaultfinder) [12:05:00] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [12:05:00] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [12:05:15] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [12:05:15] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [12:05:18] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2082.codfw.wmnet with OS bookworm [12:05:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2083.codfw.wmnet with OS bookworm [12:05:38] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2082 [12:05:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2082 [12:05:38] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2083 [12:05:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2083 [12:09:13] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:20] (03PS1) 10Hnowlan: Revert^2 "kubernetes: add mw-videoscaler to scap deployments" [puppet] - 10https://gerrit.wikimedia.org/r/1104985 [12:10:42] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10409036 (10BTullis) [12:11:21] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10409039 (10BTullis) a:03BTullis I'll pick this up. 
[12:21:45] (03CR) 10Cathal Mooney: [C:03+2] Disable SSH password auth on all devices [homer/public] - 10https://gerrit.wikimedia.org/r/1091725 (https://phabricator.wikimedia.org/T379464) (owner: 10Ayounsi) [12:22:17] (03Merged) 10jenkins-bot: Disable SSH password auth on all devices [homer/public] - 10https://gerrit.wikimedia.org/r/1091725 (https://phabricator.wikimedia.org/T379464) (owner: 10Ayounsi) [12:23:55] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2083.codfw.wmnet with reason: host reimage [12:24:12] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2082.codfw.wmnet with reason: host reimage [12:27:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2083.codfw.wmnet with reason: host reimage [12:31:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2082.codfw.wmnet with reason: host reimage [12:39:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2083.codfw.wmnet with OS bookworm [12:48:15] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2082.codfw.wmnet with OS bookworm [12:51:34] (03PS1) 10Jsn.sherman: Enable AutoModerator on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104992 (https://phabricator.wikimedia.org/T382286) [12:51:38] !log jelto@cumin1002 START - Cookbook 
sre.k8s.pool-depool-node pool for host wikikube-worker2082.codfw.wmnet [12:51:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2082.codfw.wmnet [12:51:50] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2083.codfw.wmnet [12:51:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2083.codfw.wmnet [12:52:31] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2081.codfw.wmnet [12:53:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2081.codfw.wmnet [12:53:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2080.codfw.wmnet [12:53:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2080.codfw.wmnet [12:54:34] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2080.codfw.wmnet with OS bookworm [12:54:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2081.codfw.wmnet with OS bookworm [12:54:54] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2080 [12:54:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2080 [12:55:00] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2081 [12:55:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2081 [12:55:29] (03CR) 10Ladsgroup: [C:03+1] "Needs manual rebase?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [12:57:28] (03PS2) 10Volans: ownership: Data Persistence cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) [12:58:15] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1300) [13:00:26] (03PS2) 10Volans: ownership: Traffic cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) [13:00:39] (03PS2) 10Volans: ownership: ServiceOps cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) [13:00:46] (03PS2) 10Volans: ownership: Collaboration Services cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) [13:00:54] (03PS2) 10Volans: ownership: Observability cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) [13:01:02] (03PS2) 10Volans: ownership: WMCS cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104956 (https://phabricator.wikimedia.org/T379258) [13:01:20] (03PS2) 10Volans: ownership: Data Platform cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) [13:05:15] RESOLVED: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [13:05:15] RESOLVED: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [13:06:03] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:09:46] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10409212 (10Andrew) a:03Andrew Can't promise that I'll finish this task but I'm currently working on making an httrack copy of wikitech [13:11:15] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:12:34] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2080.codfw.wmnet with reason: host reimage [13:13:01] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2081.codfw.wmnet with reason: host reimage [13:13:23] (03CR) 10Volans: [C:03+2] ownership: WMCS cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104956 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:13:42] (03CR) 10Volans: [C:03+2] ownership: Observability cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:16:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2080.codfw.wmnet with reason: host reimage [13:19:07] (03Merged) 10jenkins-bot: ownership: WMCS cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104956 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) 
[13:19:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2081.codfw.wmnet with reason: host reimage [13:19:52] (03Merged) 10jenkins-bot: ownership: Observability cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104955 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [13:21:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [13:21:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [13:32:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2080.codfw.wmnet with OS bookworm [13:37:11] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:23] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2081.codfw.wmnet with OS bookworm [13:39:16] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:37] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2081.codfw.wmnet [13:39:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2081.codfw.wmnet [13:39:54] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2080.codfw.wmnet [13:39:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2080.codfw.wmnet [13:41:26] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2078-2079].codfw.wmnet [13:44:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [13:45:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2078-2079].codfw.wmnet [13:46:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2078.codfw.wmnet with OS bookworm [13:46:24] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2079.codfw.wmnet with OS bookworm [13:46:43] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2078 [13:46:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2078 [13:46:44] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2079 [13:46:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2079 [13:48:30] (03PS1) 10Muehlenhoff: Fix two typos 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1104999 [13:50:33] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:33] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:02] (03CR) 10Slyngshede: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1104999 (owner: 10Muehlenhoff) [13:56:32] (03CR) 10Muehlenhoff: [C:03+2] Fix two typos [software/bitu] - 10https://gerrit.wikimedia.org/r/1104999 (owner: 10Muehlenhoff) [13:57:49] (03CR) 10Elukey: [C:03+2] charts: add Prometheus statsd support to Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104961 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [13:57:58] (03CR) 10Elukey: [C:03+2] services: update production settings for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104963 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [13:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10409372 (10phaultfinder) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1400). [14:00:05] dcausse and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:12] o/ [14:00:50] I can’t deploy today, sorry [14:01:00] * Lucas_WMDE needs to set up new YubiKey [14:01:14] np! 
[14:01:23] I can deploy [14:02:12] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2024-12-16-203402-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104754 (https://phabricator.wikimedia.org/T382278) (owner: 10KartikMistry) [14:02:24] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:02:41] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:03:41] (03Merged) 10jenkins-bot: Update Recommendation API to 2024-12-16-203402-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104754 (https://phabricator.wikimedia.org/T382278) (owner: 10KartikMistry) [14:04:05] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2079.codfw.wmnet with reason: host reimage [14:04:13] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2078.codfw.wmnet with reason: host reimage [14:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [14:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104598 (https://phabricator.wikimedia.org/T379002) (owner: 10DCausse) [14:04:45] MichaelG_WMF: o/ deploying my two config patches, please ping me once you're around and I'll deploy your patch [14:05:04] (03Merged) 10jenkins-bot: rdf-streaming-updater: add wdqs udpater streams in event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [14:05:06] (03Merged) 10jenkins-bot: cirrussearch: increase shard count for cebwiki_content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104598 (https://phabricator.wikimedia.org/T379002) (owner: 10DCausse) [14:05:33] !log 
dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1099727|rdf-streaming-updater: add wdqs udpater streams in event stream config (T374919)]], [[gerrit:1104598|cirrussearch: increase shard count for cebwiki_content (T379002)]] [14:05:38] T374919: Adapt the rdf-streaming-updater flink job to use wikimedia-eventutilities-flink - https://phabricator.wikimedia.org/T374919 [14:05:38] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002 [14:06:43] RESOLVED: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [14:06:44] RESOLVED: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:07:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2079.codfw.wmnet with reason: host reimage [14:08:40] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:10:08] dcausse: I'm around! Sorry, I was distracted for a moment [14:10:22] no worries :) [14:10:46] My config change does in practice change nothing for production, it just sets the default explicitly. The only functional change is for one beta wiki. [14:11:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2078.codfw.wmnet with reason: host reimage [14:12:11] MichaelG_WMF: ack, do you still need to do some sanity check on the debug servers because of the wmf-config/ext-GrowthExperiments.php change? [14:13:06] Can do. 
But unless I misspelled `false`, I'm not seeing how anything could possibly go wrong from that [14:14:31] FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:55] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1099727|rdf-streaming-updater: add wdqs udpater streams in event stream config (T374919)]], [[gerrit:1104598|cirrussearch: increase shard count for cebwiki_content (T379002)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:15:00] T374919: Adapt the rdf-streaming-updater flink job to use wikimedia-eventutilities-flink - https://phabricator.wikimedia.org/T374919 [14:15:00] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002 [14:15:56] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:17:45] !log dcausse@deploy2002 dcausse: Continuing with sync [14:19:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [14:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:20:07] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:21:36] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1104945 (owner: 10Muehlenhoff) [14:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10409439 (10phaultfinder) [14:24:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104992 (https://phabricator.wikimedia.org/T382286) (owner: 10Jsn.sherman) [14:25:41] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099727|rdf-streaming-updater: add wdqs udpater streams in event stream config (T374919)]], [[gerrit:1104598|cirrussearch: increase shard count for cebwiki_content (T379002)]] (duration: 20m 07s) [14:25:46] T374919: Adapt the rdf-streaming-updater flink job to use wikimedia-eventutilities-flink - https://phabricator.wikimedia.org/T374919 [14:25:46] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002 [14:26:22] MichaelG_WMF: shipping your patch now [14:26:33] (03PS2) 10Aklapper: Phabricator: Add "video/mp4" to files.viewable-mime-types [puppet] - 10https://gerrit.wikimedia.org/r/1101481 (https://phabricator.wikimedia.org/T309222) [14:26:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [14:27:18] @dcausse Yay! 
[14:27:19] !log T375641: reindexing all EntitySchema pages on testwikidatawiki [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:23] T375641: [ES-M3]: Implement label and aliases search for EntitySchemas via the wbsearchentities API - https://phabricator.wikimedia.org/T375641 [14:27:28] (03Merged) 10jenkins-bot: beta: enable updating link-suggestions from read-mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102909 (https://phabricator.wikimedia.org/T378536) (owner: 10Michael Große) [14:27:41] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:27:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2079.codfw.wmnet with OS bookworm [14:27:58] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1102909|beta: enable updating link-suggestions from read-mode (T378536)]] [14:28:02] T378536: Surfacing structured tasks: Create a proof of concept solution for generating Add Link suggestions on-the-fly - https://phabricator.wikimedia.org/T378536 [14:31:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2078.codfw.wmnet with OS bookworm [14:31:41] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:32:15] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2078-2079].codfw.wmnet [14:32:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2078-2079].codfw.wmnet [14:33:06] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2076-2077].codfw.wmnet [14:34:13] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync 
[14:34:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2076-2077].codfw.wmnet [14:34:27] !log dcausse@deploy2002 migr, dcausse: Backport for [[gerrit:1102909|beta: enable updating link-suggestions from read-mode (T378536)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:34:30] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:34:30] T378536: Surfacing structured tasks: Create a proof of concept solution for generating Add Link suggestions on-the-fly - https://phabricator.wikimedia.org/T378536 [14:34:53] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2076.codfw.wmnet with OS bookworm [14:34:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2077.codfw.wmnet with OS bookworm [14:35:02] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2077 [14:35:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2077 [14:35:13] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2076 [14:35:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2076 [14:35:43] MichaelG_WMF: should be on testservers if you want to do quick sanity check [14:35:56] thanks, I'll have a look! [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:22] @dcausse seems to still work just fine on my side! 
[14:38:41] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:38:46] MichaelG_WMF: ack, continuing with sync [14:38:56] (03CR) 10Kgraessle: [C:03+1] Enable AutoModerator on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104992 (https://phabricator.wikimedia.org/T382286) (owner: 10Jsn.sherman) [14:39:06] !log dcausse@deploy2002 migr, dcausse: Continuing with sync [14:39:31] FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:40] MichaelG_WMF: out of curiosity is 'migr' (from logmsgbot) your nickname? [14:40:10] (03CR) 10CDanis: [C:03+2] benthos: webrequest_live: fix unittest failure [puppet] - 10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [14:40:53] @dcausse I think that is the shell name based on the schema from WMDE? It is just the first two letters of my first and surname. [14:41:41] @dcausse: I took over my phabricator account from my time at WMDE. [14:41:49] ok, was wondering where this is coming from because I can't seem to find where you've put this info for scap to be aware of it [14:42:15] I’m guessing it pulls it out of puppet (or LDAP?) based on the commit author email? 
[14:42:31] Good question [14:42:33] modules/admin/data/data.yaml has an account named migr with that @w.o address [14:42:34] ah possibly [14:42:40] ok [14:42:57] (that’s the one file path in puppet.git that I have memorized :D) [14:43:30] it is also what is listed as my "Username" in my Gerrit settings [14:43:41] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10409570 (10Papaul) @cmooney thank you for the review. For the fundraising rack i have a separate racking and installation task coming up where I will put all the details. Like yo... [14:46:24] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102909|beta: enable updating link-suggestions from read-mode (T378536)]] (duration: 18m 26s) [14:46:29] T378536: Surfacing structured tasks: Create a proof of concept solution for generating Add Link suggestions on-the-fly - https://phabricator.wikimedia.org/T378536 [14:47:15] MichaelG_WMF: ack, the deploy should be done [14:47:44] !log Run extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php for arwiki cawiki frwiki mediawikiwiki orwiki wawiki wawiktionary wikidatawiki zhwiki (T378829) [14:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:47] T378829: Run Flow migration script at *Phase 2a* wikis - https://phabricator.wikimedia.org/T378829 [14:47:58] @dcausse Thanks! [14:48:03] yw! [14:48:19] !log closing the UTC afternoon backport window [14:48:19] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10409583 (10Papaul) @cmooney also just keep in mind that this task is to have an overview of how things we be need connected. The type of optic and other elements will be based on t... 
[14:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:48] 07Puppet, 06Infrastructure-Foundations, 06Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980#10409584 (10hashar) 05Open→03Declined My intent was to remove the `umask` parameter (T338277) which was completed.... [14:50:30] (03PS2) 10Muehlenhoff: Deprecate system::role for remaining WMCS roles [puppet] - 10https://gerrit.wikimedia.org/r/1104945 [14:50:45] (03CR) 10Muehlenhoff: Deprecate system::role for remaining WMCS roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1104945 (owner: 10Muehlenhoff) [14:52:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2077.codfw.wmnet with reason: host reimage [14:52:48] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2076.codfw.wmnet with reason: host reimage [14:52:58] (03PS2) 10Muehlenhoff: Enable management of cn=wmf for production IDMs [puppet] - 10https://gerrit.wikimedia.org/r/1104970 [14:55:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2077.codfw.wmnet with reason: host reimage [14:57:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2076.codfw.wmnet with reason: host reimage [15:05:25] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10409640 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm we will get another ticket if this gets triggered again. we can link... 
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:49] (03PS1) 10Eevans: restbase: Upgrade Cassandra to 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1105011 (https://phabricator.wikimedia.org/T380420) [15:14:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2077.codfw.wmnet with OS bookworm [15:15:34] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105011 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [15:17:41] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2076.codfw.wmnet with OS bookworm [15:21:06] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2076-2077].codfw.wmnet [15:21:07] (03PS1) 10Eevans: restbase: cleanup decommissioned hosts [puppet] - 10https://gerrit.wikimedia.org/r/1105015 [15:21:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2076-2077].codfw.wmnet [15:22:06] (03CR) 10Eevans: [C:03+2] restbase: Upgrade Cassandra to 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1105011 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [15:22:19] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10409722 (10BTullis) Thanks @Volans for your work on this. I think that it will be very helpful. With reference your point here... > The choice to be able to mark on... 
[15:22:49] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2072-2073].codfw.wmnet [15:23:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2072-2073].codfw.wmnet [15:26:18] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2072.codfw.wmnet with OS bookworm [15:26:20] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2073.codfw.wmnet with OS bookworm [15:26:38] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2072 [15:26:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2072 [15:26:39] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2073 [15:26:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2073 [15:27:21] (03PS1) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [15:27:43] (03CR) 10CI reject: [V:04-1] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [15:28:43] (03PS1) 10Scott French: shellbox: release image 2024-12-17-061932 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105017 (https://phabricator.wikimedia.org/T292322) [15:29:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10409789 (10phaultfinder) [15:30:23] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:30:42] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching 
A:restbase-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [15:30:46] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [15:30:50] (03CR) 10Hnowlan: [C:03+1] shellbox: release image 2024-12-17-061932 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105017 (https://phabricator.wikimedia.org/T292322) (owner: 10Scott French) [15:31:15] (03CR) 10Scott French: [C:03+1] Revert^2 "kubernetes: add mw-videoscaler to scap deployments" [puppet] - 10https://gerrit.wikimedia.org/r/1104985 (owner: 10Hnowlan) [15:31:18] (03PS2) 10Slyngshede: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 [15:31:57] (03PS1) 10Andrew Bogott: cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) [15:33:20] (03CR) 10Scott French: [C:03+2] shellbox: release image 2024-12-17-061932 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105017 (https://phabricator.wikimedia.org/T292322) (owner: 10Scott French) [15:33:54] (03CR) 10CI reject: [V:04-1] cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [15:35:04] (03Merged) 10jenkins-bot: shellbox: release image 2024-12-17-061932 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105017 (https://phabricator.wikimedia.org/T292322) (owner: 10Scott French) [15:36:45] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1105015 (owner: 10Eevans) [15:39:12] (03PS1) 10Lucas Werkmeister (WMDE): Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1105021 [15:41:32] (03PS2) 10DDesouza: Reader Survey: Partially undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) [15:44:41] (03PS3) 
10DDesouza: Reader Survey: Partially undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) [15:44:45] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2073.codfw.wmnet with reason: host reimage [15:45:53] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [15:46:25] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2072.codfw.wmnet with reason: host reimage [15:47:25] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:47:46] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [15:47:59] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [15:48:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2073.codfw.wmnet with reason: host reimage [15:48:20] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [15:48:36] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [15:48:57] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:50:24] (03PS1) 10DDesouza: Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) [15:51:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2072.codfw.wmnet with reason: host reimage [15:51:58] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:52:20] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [15:52:26] (03CR) 10Eevans: "I think that's just test data. 
I assume any valid hostname there would work, and that these are (were) real is because it's copypasta?" [puppet] - 10https://gerrit.wikimedia.org/r/1105015 (owner: 10Eevans) [15:52:43] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [15:53:04] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [15:53:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [15:53:36] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [15:55:02] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10409862 (10Jhancock.wm) [15:55:12] FIRING: [3x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:56:16] (03CR) 10Eevans: [C:03+1] ownership: Data Persistence cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [15:57:35] (03CR) 10Hnowlan: [C:03+2] entrypoint.sh: use full thumbor path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101097 (owner: 10AntiCompositeNumber) [15:57:48] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:58:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:59:03] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:59:10] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for 
Ammarpad - https://phabricator.wikimedia.org/T381851#10409881 (10Arnoldokoth) a:03thcipriani [15:59:26] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [15:59:57] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1600). [16:00:16] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:00:48] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:01:14] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:01:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:02:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [16:02:32] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update [16:02:47] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update [16:02:49] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:03:02] !log brennen@deploy2002 Started deploy [phabricator/deployment@53251a4]: deploy phab2002 for T382346 [16:03:06] T382346: Deploy Phabricator/Phorge 2024-12-17 - https://phabricator.wikimedia.org/T382346 [16:03:29] !log brennen@deploy2002 Finished deploy [phabricator/deployment@53251a4]: deploy phab2002 for T382346 (duration: 00m 27s) [16:03:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:04:24] (03CR) 10Clément Goubert: [C:03+1] ownership: ServiceOps cookbooks 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:05:06] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10409926 (10cmooney) Thanks @papaul for the feedback. >>! In T382219#10409570, @Papaul wrote: > "Why do we have 40FBase-LR4 modules listed?" Because the msw2 will connect to the ms... [16:08:17] !log brennen@deploy2002 Started deploy [phabricator/deployment@53251a4]: deploy phab1004 for T382346 [16:08:21] T382346: Deploy Phabricator/Phorge 2024-12-17 - https://phabricator.wikimedia.org/T382346 [16:08:25] (03Merged) 10jenkins-bot: entrypoint.sh: use full thumbor path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101097 (owner: 10AntiCompositeNumber) [16:08:41] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update [16:08:55] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update [16:09:09] !log brennen@deploy2002 Finished deploy [phabricator/deployment@53251a4]: deploy phab1004 for T382346 (duration: 00m 52s) [16:09:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2073.codfw.wmnet with OS bookworm [16:10:25] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2072.codfw.wmnet with OS bookworm [16:12:50] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2072-2073].codfw.wmnet [16:12:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2072-2073].codfw.wmnet [16:13:08] 06SRE, 
06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10409969 (10cmooney) Just a few notes on this. Firstly we are now getting the AE/LAG interface stats for our core routers since they were all upgraded to a more recent Ju... [16:16:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [16:17:05] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [16:17:36] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [16:17:52] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:18:02] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:18:23] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [16:18:27] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10409983 (10Arnoldokoth) Thanks @BTullis [16:18:36] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:18:52] (03Abandoned) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1104740 (owner: 10Pppery) [16:19:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:19:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:19:54] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [16:20:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [16:20:49] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:21:18] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Add an ownership field to cookbooks. 
- https://phabricator.wikimedia.org/T379258#10409987 (10Volans) Thanks @BTullis for your feedback. I'm aware of those use cases and the related SIGs or other form of working groups. The problem with working g... [16:21:22] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [16:22:29] !log deployed shellbox 2024-12-17-061932 for T292322 [16:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:33] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 [16:22:33] (03CR) 10Andrew Bogott: [C:04-2] "whoops, that is not what 'includes' does" [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [16:29:17] (03PS15) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [16:29:37] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [16:30:20] (03PS1) 10Elukey: charts: improve Kartotherian metrics and monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) [16:31:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2004-dev to codfw - jhancock@cumin2002" [16:31:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2004-dev to codfw - jhancock@cumin2002" [16:31:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephosd2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:34:12] 
(03PS2) 10Andrew Bogott: cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) [16:36:22] (03CR) 10CI reject: [V:04-1] cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [16:40:53] (03PS3) 10Andrew Bogott: cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) [16:40:59] (03CR) 10Andrew Bogott: cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [16:43:05] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [16:45:19] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [16:52:54] (03PS1) 10FNegri: Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP" [puppet] - 10https://gerrit.wikimedia.org/r/1105036 (https://phabricator.wikimedia.org/T381373) [16:53:43] (03PS16) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [16:54:02] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [16:54:06] (03PS4) 10Andrew Bogott: cloudbackup: work around a postgresql bug by adjusting shared_buffers [puppet] - 10https://gerrit.wikimedia.org/r/1105020 
(https://phabricator.wikimedia.org/T381548) [16:54:26] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4707/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [16:56:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [16:57:20] (03CR) 10Scott French: [C:03+1] ownership: ServiceOps cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:58:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1700). [17:00:05] tgr, MatmaRex, and Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:01:07] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2004-dev'] [17:01:18] o/ [17:01:33] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10410142 (10Jhancock.wm) a:03Jhancock.wm [17:02:37] (03PS5) 10Scott French: hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040) [17:02:37] (03PS3) 10Scott French: hieradata: switch all "migration" releases to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040) [17:02:53] hi [17:02:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2004-dev'] [17:05:37] 👋 [17:06:40] (03CR) 10Majavah: Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1105036 (https://phabricator.wikimedia.org/T381373) (owner: 10FNegri) [17:08:01] tgr|away: this would be a good use case for an httpbb test -- have you written one of those before? 
[17:09:17] rzl: I think I did once [17:09:26] if you look in modules/profile/files/httpbb/appserver/test_redirects.yaml you can see some others -- these are blackbox tests we can run to verify the apache config is doing what we think it's doing [17:09:45] it's relatively unimportant functionality though [17:10:18] (the .well-known URI is used by some password managers to provide a direct link for changing the password) [17:10:19] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [17:10:36] anyway I can add a test if you feel it's worth it [17:11:08] (03PS1) 10Herron: thanos-store: enable caching bucket [puppet] - 10https://gerrit.wikimedia.org/r/1105037 (https://phabricator.wikimedia.org/T368953) [17:11:32] like with any other apache config change especially templated, we'll be running the full test suite anyway to make sure this didn't have adverse effects -- might as well also test that it's doing the right thing :) [17:12:30] (we don't have great test coverage in there but it's not because we don't wish it were better -- it's like any other sparsely tested code, we just try to add more as we go) [17:16:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:45] PROBLEM - Host kubernetes2056 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:09] (03PS3) 10Gergő Tisza: Fix protocol for .well-known/change-password Apache rule [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) [17:18:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:19:06] MatmaRex: on beta I still know less but thank you for getting these reviewed :) ideally I'd like to have httpbb test coverage for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100534 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100531 too, I'm not sure offhand if there are already tests for those cases [17:20:00] I guess https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100534 doesn't have a review yet, were you unable to find someone? [17:21:39] rzl: re 1100534, we should probably skip that one today, ideally we'd have some apache expert review that, if we have such a person. i don't know who that would be [17:22:11] it works on beta, but i have no idea if the extra may have some effect on performance, or something… [17:22:33] FIRING: KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2056.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:22:38] rzl: re 1100531, do you mean tests that would test the portals specifically on the beta cluster? i think there are probably tests somewhere to verify that they work in production [17:23:10] in production is all I need to see [17:24:00] rzl: added a test [17:24:12] aha yeah there's a test for https://www.wikimedia.org/wiki/Main_Page -> https://foundation.wikimedia.org/wiki/Main_Page so that should be good [17:24:13] yeah, that probably exists, but i have no idea where to find it… i vaguely recall seeing some alerts in this channel once when the portals were broken [17:25:38] (03PS2) 10Elukey: charts: improve Kartotherian metrics and monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) [17:26:04] rzl: btw, mostly unrelated question.
do you happen to know if the %{ENV:RW_PROTO} business is still needed? i saw that in lots of apache config files when working on these changes. can it be replaced with just 'https', since we don't allow http access any more, or is it needed for some internal testing, or something else? [17:27:53] MatmaRex: I don't know offhand, sorry [17:28:11] no problem [17:28:40] I would try _j.oe_ for both that question and the apache review if he's available -- I assume he's off for today but I *think* he's not gone for the holiday break yet [17:29:15] he's a reviewer on all the changes, but he hasn't commented [17:29:20] (03PS3) 10Elukey: charts: improve Kartotherian metrics and monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) [17:29:48] tgr|away: thanks -- this asserts that it redirects to http: but you wanted https: right? [17:30:00] (only noticed when I ran the test and found it's already passing) [17:30:17] (03CR) 10Elukey: "I checked the current metrics via https://graphite.wikimedia.org/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [17:30:31] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [17:30:34] oops, sorry [17:30:37] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [17:30:43] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [17:30:54] (03PS4) 10Gergő Tisza: Fix protocol for .well-known/change-password Apache rule [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) [17:30:56] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [17:32:29] perfect https://www.irccloud.com/pastebin/bk0yTRW5/ [17:32:46] (03CR) 10RLazarus: [C:03+2] Fix protocol for .well-known/change-password Apache rule [puppet] - 
10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza) [17:36:01] waiting for puppet on the deploy host, then I'll scap this out [17:39:57] PROBLEM - Host wikikube-worker2186 is DOWN: PING CRITICAL - Packet loss = 100% [17:41:21] PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:27] tgr|away: aha -- there aren't any diffs to scap, I couldn't figure out why at first but I see we don't actually use that file from the Puppet repo in the k8s world, it's one of the ones we had to reimplement [17:45:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/mediawiki/templates/lamp/_site_helpers.tpl#114 [17:46:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [17:46:09] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [17:46:44] in the meantime the change is testable at mwdebug2001, but it'll only be on the bare metal hosts (basically just debug) until we change it in the charts repo as well [17:47:07] rzl: would a patch for that also be something for this window, or is there a separate deploy process for that? 
[17:47:33] FIRING: [2x] KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:47:57] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:16] officially it goes in the MediaWiki Infrastructure window instead but I'm happy to ship it out either way and I'm not picky about which [17:48:44] we're low on time though, the puppet window isn't usually this busy or this complex :) let me start on MatmaRex's if that's okay [17:49:29] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:34] of course [17:49:35] if we can find a quiet spot in the day we might also be able to ship yours out ad hoc [17:50:19] MatmaRex: will you be testing these individually or all together? [17:51:35] RECOVERY - Host elastic2080 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [17:51:44] rzl: as you prefer. each change works by itself, but it may be easier to do them together [17:51:47] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [17:52:00] (except the last one, as noted earlier, we should probably skip that one) [17:52:23] fwiw, i tested them individually on the beta cluster two weeks ago [17:52:28] elastic2079 and 80 were me bumping those cables while working in the rack [17:52:51] okay, cool -- let's also do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100531 separately since it also affects prod, but I can merge 30, 32, 33 all at once [17:52:55] papaul: thanks <3 [17:53:16] (03CR) 10RLazarus: [C:03+2] MediaWiki: Ensure nice 404 instead of php-fpm 404 on auth domain [puppet] - 10https://gerrit.wikimedia.org/r/1100530 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [17:53:26] rzl: you're welcome [17:53:37] (03CR) 10RLazarus: [C:03+2] MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] -
10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [17:54:00] oh except I didn't realize they're all parented -- yeah okay let's do em all [17:54:16] simpler than rebasing [17:54:33] (03CR) 10RLazarus: [C:03+2] MediaWiki: Define wikimedia.org portal on beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) (owner: 10Bartosz Dziewoński) [17:55:01] (03CR) 10RLazarus: [C:03+2] MediaWiki: Remove duplicate ErrorDocument 404 from beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100533 (owner: 10Bartosz Dziewoński) [17:55:12] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [17:57:57] RECOVERY - Host wikikube-worker2186 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [17:58:17] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 196, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:19] RECOVERY - Host kubernetes2056 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [17:58:21] RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:59:32] swfrench-wmf: hanging over just a few minutes if that's okay [17:59:56] (just waiting for puppet on the deploy host now) [17:59:57] rzl: absolutely! take your time [18:00:05] swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1800). 
[18:02:33] RESOLVED: [2x] KubernetesCalicoDown: kubernetes2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:03:07] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:04:01] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:04:19] - ServerAlias *.wikimedia.org [18:04:19] + ServerAlias *.wikimedia.org [18:04:34] haha this does end up having a diff as rendered, because we drop a trailing space after the ".org" [18:04:40] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1100530, 1100531, 1100532, 1100533 [18:11:25] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [18:11:29] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [18:13:44] o/ [18:13:51] sorry I missed the puppet window :( [18:14:03] (03PS17) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:14:22] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:14:24] Lucas_WMDE: no worries -- I was going to ping you separately anyhow, your thing doesn't really need to be in the window [18:15:09] officially that should be an access request -- do you mind filing a ticket with https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ and the clinic duty person will get back to you? 
[18:15:31] ok, I’ll see if I can find my old task to copy+paste from ;) [18:15:32] thanks! [18:15:42] it'll be a quick and easy one since it's just updating the key and not getting new access approved [18:15:57] (but probably mention that in the text just so you don't have to take the long way around) [18:16:57] okay, 'tis the sport to have the enginer hoist with his own petard [18:17:27] scap for MatmaRex's change helpfully warns me that the httpbb tests from tgr|away's change are now failing 🤦 because of course the tests got out there successfully but the config change didn't [18:17:47] :o [18:17:51] so I'm going to go ahead and continue because I know what's going on, but I'll expedite fixing that so it doesn't happen to everyone else who tries to scap [18:18:10] in the meantime, MatmaRex, go ahead and test on k8s-mwdebug [18:18:27] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T382354 (10Lucas_Werkmeister_WMDE) 03NEW [18:18:29] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4708/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:18:35] rzl: ^ [18:18:58] Lucas_WMDE: thanks! arnoldokoth is on clinic duty this week, he'll take care of you [18:19:04] rzl: it's only the wikimedia.org portal change that could impact prod, right? 
[18:19:05] (03PS2) 10Lucas Werkmeister (WMDE): Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1105021 (https://phabricator.wikimedia.org/T382354) [18:19:13] that sounds ominous :D [18:19:31] isn’t “he will… take care of you” what darth sidious says to the separatist leaders 🤔 [18:19:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [18:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [18:19:46] your request will be... handled in the appropriate manner [18:19:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:02] MatmaRex: well, "could" is a big word, but that's the only impact I foresee, yeah :D [18:20:09] (anyway, portals are still online on k8s-mwdebug) [18:20:15] I expect an… adequate response [18:20:30] ^ that httpbb alert is me, as above [18:20:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:42] v-- so are all of those [18:20:45] (03CR) 10Hnowlan: [C:03+1] charts: improve Kartotherian metrics and monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) 
[18:20:54] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1100530, 1100531, 1100532, 1100533 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:20:58] !log rzl@deploy2002 rzl: Continuing with sync [18:21:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:21:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T382354#10410364 (10Lucas_Werkmeister_WMDE) [18:21:42] Lucas_WMDE Haha. Yes, I will handle your request. [18:23:25] (03PS18) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:23:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:24:31] FIRING: [11x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:24:58] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4709/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:25:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on 
cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:25:21] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:25:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:25:26] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1100530, 1100531, 1100532, 1100533 (duration: 22m 25s) [18:25:39] MatmaRex: all set, thanks for your patience [18:25:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:26:00] swfrench-wmf: thanks for your patience as well -- I'll have one more chart patch to fix that httpbb mismatch but feel free to go ahead in the meantime [18:26:01] rzl: thank you for deploying, sorry i took up more than all of the window :) [18:26:25] swfrench-wmf: when scap warns you about httpbb errors for https://en.wikipedia.org/.well-known/change-password you can safely continue [18:27:14] rzl: ack, thank you! 
[18:27:36] (03PS19) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:27:40] (03CR) 10Scott French: [C:03+2] mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:28:12] (03PS1) 10RLazarus: Fix protocol for .well-known/change-password Apache rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105048 (https://phabricator.wikimedia.org/T381625) [18:28:16] (03PS20) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:29:12] (03Merged) 10jenkins-bot: mediawiki: add remaining migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082863 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:30:19] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:30:21] (03CR) 10RLazarus: "This dupes https://gerrit.wikimedia.org/r/1101462 into the MediaWiki Helm chart." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1105048 (https://phabricator.wikimedia.org/T381625) (owner: 10RLazarus) [18:31:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T382354#10410399 (10Arnoldokoth) [18:32:31] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:32:38] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:32:47] (03CR) 10AOkoth: [C:03+2] Add new lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1105021 (https://phabricator.wikimedia.org/T382354) (owner: 10Lucas Werkmeister (WMDE)) [18:33:29] (03PS21) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:34:25] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:34:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:35:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T382354#10410404 (10Arnoldokoth) @Lucas_Werkmeister_WMDE This should be good to go. Feel free to resolve once you verify. [18:35:29] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:35:47] arnoldokoth: I’ll go make dinner and then come back to test when puppet should have run everywhere, okay? ^^ [18:36:10] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [18:36:16] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [18:36:27] Lucas_WMDE: Np. [18:37:14] and thanks for merging! 
[18:37:18] (03PS22) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:37:26] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [18:37:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [18:38:04] (03PS23) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:39:12] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [18:39:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [18:40:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [18:40:08] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [18:40:12] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:40:44] (03PS24) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:41:33] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:41:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:42:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:42:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:42:49] (03CR) 10Scott French: [C:03+2] hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:43:40] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:44:15] rzl: once I run puppet on the deployment server, I'll be running 
scap as well. is there a patch you'd like me to pick up at the same time? [18:44:50] swfrench-wmf: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1105048 is the fix -- if you'd like to bring that along for the ride, it's welcome, but not required [18:45:24] (I added tgr|away as reviewer, but all it really needs is an eyeball to confirm it looks the same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1101462) [18:45:41] jouncebot: nowandnext [18:45:41] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1800) [18:45:41] In 0 hour(s) and 14 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1900) [18:45:44] and then the verification is just that scap's httpbb run doesn't complain at you [18:46:39] if that doesn't sound fun, we can also just roll back adding the test to httpbb [18:47:19] (03CR) 10Scott French: [C:03+1] Fix protocol for .well-known/change-password Apache rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105048 (https://phabricator.wikimedia.org/T381625) (owner: 10RLazarus) [18:47:28] (03PS25) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:47:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455#10410457 (10Andrew) [18:48:02] rzl: thanks for the pointer to the associated puppet patch - that does indeed match :) [18:48:27] so yeah, +1 but if you'd prefer to wait that's fine too [18:48:38] (03CR) 10CI reject: [V:04-1] P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:48:46] (03CR) 10Bartosz Dziewoński: "I filed T382358 to find out." 
[puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza) [18:48:46] I'm waiting on puppet agent on the active deployment host, so I have time [18:48:52] I'm happy if you are -- I'll +2 and then be nearby while you scap, if that sounds good? [18:48:57] (03PS26) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:49:04] rzl: SGTM [18:49:08] (03CR) 10RLazarus: [C:03+2] Fix protocol for .well-known/change-password Apache rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105048 (https://phabricator.wikimedia.org/T381625) (owner: 10RLazarus) [18:49:25] (03PS4) 10Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) [18:52:37] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4710/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [18:53:08] (03Merged) 10jenkins-bot: Fix protocol for .well-known/change-password Apache rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105048 (https://phabricator.wikimedia.org/T381625) (owner: 10RLazarus) [18:53:49] thcipriani: apologies in advance if we run over into the train window -- my fault, not swfrench-wmf's who's just helping clean up my mess :) [18:55:13] to be fair, it's wild that we have to have this configured in two places :) [18:56:15] rzl: the chart change is live - is there anything you wanted to check in the diffs before deploying? sites-config ConfigMap looks reasonable to me [18:56:34] (03PS27) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [18:56:34] rzl: good reminder for me to update the calendar, dancy is on train this week. 
Thanks for the heads up! [18:57:19] swfrench-wmf: nah, if you don't get an httpbb warning when it hits the testservers, that's all I need [18:57:36] rzl: great! off we go [18:57:46] !log swfrench@deploy2002 Started scap sync-world: Deployment to populate remaining migration release files - T377040 [18:57:50] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:00:05] thcipriani and thcipriani: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T1900) [19:00:16] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4711/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [19:00:27] bah still updating calendar [19:01:10] o/ [19:01:15] Ping me when I can press the train button. [19:01:32] dancy: ack, will do - thanks for your patience! [19:01:38] no prob [19:04:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for lucaswerkmeister-wmde (new SSH key) - https://phabricator.wikimedia.org/T382354#10410541 (10Lucas_Werkmeister_WMDE) 05Open→03Resolved a:03Arnoldokoth Works like a charm, thank you! [19:04:56] rzl: checks are clean now, thanks! 
[19:05:03] swfrench-wmf: perfect thank you [19:09:22] !log swfrench@deploy2002 Finished scap sync-world: Deployment to populate remaining migration release files - T377040 (duration: 11m 35s) [19:09:26] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:09:38] (CR) Scott French: [C:+2] mediawiki: remove migration release overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040) (owner: Scott French) [19:10:15] (Merged) jenkins-bot: mediawiki: remove migration release overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1082864 (https://phabricator.wikimedia.org/T377040) (owner: Scott French) [19:12:05] dancy: all yours - thanks again! [19:12:16] Gracias [19:12:43] (PS1) TrainBranchBot: group0 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105056 (https://phabricator.wikimedia.org/T375667) [19:12:44] (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105056 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot) [19:13:27] (Merged) jenkins-bot: group0 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105056 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot) [19:14:19] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker1081 - https://phabricator.wikimedia.org/T381878#10410569 (Jclark-ctr) a:Jclark-ctr Confirmed: Service Request 202767674 was successfully submitted. 
[19:14:39] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10410581 (Jclark-ctr) a:VRiley-WMF [19:17:43] (PS28) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 [19:17:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10410668 (Jclark-ctr) I see an-presto1005 was just changed to decom status in netbox waiting for decom ticket to remove from rack should resolve power issue [19:19:44] (CR) CI reject: [V:-1] P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins) [19:19:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:20:18] (PS29) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 [19:20:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:21:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:21:25] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10410689 (Jclark-ctr) So that was my mistake i have found out from dell that it only supports 6x dimms for cpu2. 10x dimm for cpu1. Al... 
[19:21:30] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10410690 (Jclark-ctr) In progress→Resolved [19:23:06] (CR) CDobbins: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4712/co" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins) [19:23:33] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:24:31] FIRING: [11x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:24:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:25:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:25:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:25:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:29:54] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: 
group0 to 1.44.0-wmf.8 refs T375667 [19:29:58] T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667 [20:05:11] (CR) FNegri: [C:+1] Deprecate system::role for remaining WMCS roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1104945 (owner: Muehlenhoff) [20:12:19] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [20:15:06] (CR) FNegri: "I'm confused by the amount of increase needed: how big is the database that Backy2 works on? 4GB is much more than I would expect for a da" [puppet] - https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: Andrew Bogott) [20:17:11] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:20:14] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [20:23:41] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrading to Cassandra 4.1.7 - T380420 - eevans@cumin1002 [20:23:45] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [20:40:19] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [20:44:30] thanks for the follow-up rzl! Sorry, I was afk. [20:45:01] tgr|away: no worries! 
everything should be out and live now, lmk if anything doesn't look right [20:45:37] (verified that it now works correctly in production) [20:50:03] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [20:50:31] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [20:51:38] (PS1) Urbanecm: uzwiki: Update tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1105063 (https://phabricator.wikimedia.org/T370165) [20:57:07] (PS1) Ebernhardson: Revert "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105065 (https://phabricator.wikimedia.org/T377128) [20:57:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105065 (https://phabricator.wikimedia.org/T377128) (owner: Ebernhardson) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241217T2100). [21:00:05] danisztls, JSherman, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:13] hey hey [21:00:15] i can deploy today [21:00:28] I'm here [21:00:33] hi JSherman [21:00:42] danisztls: ebernhardson: hey, around too? [21:01:10] o/ [21:01:18] hey [21:01:31] hi [21:01:46] urbanecm: howdy [21:02:14] (CR) Urbanecm: [C:+2] Enable AutoModerator on azwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1104992 (https://phabricator.wikimedia.org/T382286) (owner: Jsn.sherman) [21:02:19] hey! 
[21:02:31] (CR) Urbanecm: [C:+2] Reader Survey: Partially undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza) [21:02:51] (CR) Urbanecm: [C:+2] Revert "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105065 (https://phabricator.wikimedia.org/T377128) (owner: Ebernhardson) [21:03:09] (Merged) jenkins-bot: Enable AutoModerator on azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1104992 (https://phabricator.wikimedia.org/T382286) (owner: Jsn.sherman) [21:03:17] (Merged) jenkins-bot: Reader Survey: Partially undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza) [21:03:22] (CR) TrainBranchBot: [C:+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1104741 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza) [21:03:22] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105065 (https://phabricator.wikimedia.org/T377128) (owner: Ebernhardson) [21:03:35] (Merged) jenkins-bot: Revert "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105065 (https://phabricator.wikimedia.org/T377128) (owner: Ebernhardson) [21:04:15] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1104741|Reader Survey: Partially undeploy (T378660)]], [[gerrit:1104992|Enable AutoModerator on azwiki (T382286)]], [[gerrit:1105065|Revert "cirrus: Enable mlr-2024 for select wikis" (T377128)]] [21:04:19] (PS2) DDesouza: Reader Survey: Undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) [21:04:22] T378660: Quicksurvey deployment for Reader Survey - 
https://phabricator.wikimedia.org/T378660 [21:04:23] T382286: Enable AutoModerator on azwiki - https://phabricator.wikimedia.org/T382286 [21:04:23] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:04:35] (CR) Urbanecm: [C:+2] uzwiki: Update tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1105063 (https://phabricator.wikimedia.org/T370165) (owner: Urbanecm) [21:05:22] (Merged) jenkins-bot: uzwiki: Update tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1105063 (https://phabricator.wikimedia.org/T370165) (owner: Urbanecm) [21:07:57] (CR) BryanDavis: [C:+1] php: Allow provisioning MediaWiki with PHP 8.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: BryanDavis) [21:10:50] !log urbanecm@deploy2002 urbanecm, ebernhardson, dani, jsn: Backport for [[gerrit:1104741|Reader Survey: Partially undeploy (T378660)]], [[gerrit:1104992|Enable AutoModerator on azwiki (T382286)]], [[gerrit:1105065|Revert "cirrus: Enable mlr-2024 for select wikis" (T377128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:56] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:10:57] T382286: Enable AutoModerator on azwiki - https://phabricator.wikimedia.org/T382286 [21:10:57] T377128: Import recent MLR models built by MjoLniR in production and test them - https://phabricator.wikimedia.org/T377128 [21:12:02] danisztls: ebernhardson: JSherman: can you check your patches, please? [21:12:43] verified good on my end [21:13:41] urbanecm: looks good [21:14:02] ty [21:15:48] ebernhardson: what about you? [21:19:40] (PS1) Ottomata: Disable varnish handling of /beacon/event on cp1100 [puppet] - https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) [21:20:15] ebernhardson: ping? 
[21:20:56] (CR) Ottomata: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4713/co" [puppet] - https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [21:21:04] (CR) Ottomata: Disable varnish handling of /beacon/event on cp1100 [puppet] - https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata) [21:26:18] (PS1) Ottomata: Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) [21:26:44] (CR) Ottomata: [C:-1] "Do not merge until https://gerrit.wikimedia.org/r/c/operations/puppet/+/1105076 is merged and verified to work." [puppet] - https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata) [21:27:07] !log urbanecm@deploy2002 Sync cancelled. 
[21:27:41] (PS1) Urbanecm: Revert^2 "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105079 (https://phabricator.wikimedia.org/T377128) [21:27:42] ebernhardson: 😞 [21:27:48] (CR) Urbanecm: [C:+2] Revert^2 "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105079 (https://phabricator.wikimedia.org/T377128) (owner: Urbanecm) [21:28:03] reverting, no response for ~15 mins [21:29:18] urbanecm: doh, sorry been distracted [21:29:21] urbanecm: its fine [21:29:38] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1104741|Reader Survey: Partially undeploy (T378660)]], [[gerrit:1104992|Enable AutoModerator on azwiki (T382286)]], [[gerrit:1105063|uzwiki: Update tagline (T370165)]] [21:29:45] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:29:45] T382286: Enable AutoModerator on azwiki - https://phabricator.wikimedia.org/T382286 [21:29:46] T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165 [21:30:05] ebernhardson: i already pressed the reverting buttons. feel free to deploy yourself once my sync finishes. 
[21:30:48] urbanecm: ok, no worries [21:32:31] (PS2) DDesouza: Reader Survey: Undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) [21:35:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:35:35] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:35:53] !log urbanecm@deploy2002 urbanecm, dani, jsn: Backport for [[gerrit:1104741|Reader Survey: Partially undeploy (T378660)]], [[gerrit:1104992|Enable AutoModerator on azwiki (T382286)]], [[gerrit:1105063|uzwiki: Update tagline (T370165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:36:00] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:36:00] T382286: Enable AutoModerator on azwiki - https://phabricator.wikimedia.org/T382286 [21:36:01] T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165 [21:36:02] !log urbanecm@deploy2002 urbanecm, dani, jsn: Continuing with sync [21:37:02] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377 (Jdrewniak) NEW [21:40:09] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10411168 (LMora-WMF) [21:40:21] (CR) Scott French: "Thanks, Filippo!" 
[software] - https://gerrit.wikimedia.org/r/1104727 (https://phabricator.wikimedia.org/T381680) (owner: Scott French) [21:40:26] (CR) Scott French: [C:+2] ops-maint-gcal.js: truncate message details [software] - https://gerrit.wikimedia.org/r/1104727 (https://phabricator.wikimedia.org/T381680) (owner: Scott French) [21:41:53] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104741|Reader Survey: Partially undeploy (T378660)]], [[gerrit:1104992|Enable AutoModerator on azwiki (T382286)]], [[gerrit:1105063|uzwiki: Update tagline (T370165)]] (duration: 12m 14s) [21:42:00] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:42:00] T382286: Enable AutoModerator on azwiki - https://phabricator.wikimedia.org/T382286 [21:42:00] T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165 [21:42:09] (Merged) jenkins-bot: ops-maint-gcal.js: truncate message details [software] - https://gerrit.wikimedia.org/r/1104727 (https://phabricator.wikimedia.org/T381680) (owner: Scott French) [21:42:14] done and synced [21:42:20] thanks urbanecm:! 
[21:42:30] JSherman: danisztls: ^^ [21:42:32] no problem [21:42:36] over to you ebernhardson if needed [21:43:20] urbanecm: ty [21:43:38] np [21:52:53] (PS1) Ebernhardson: Revert^3 "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105084 [21:53:28] (CR) TrainBranchBot: [C:+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105084 (owner: Ebernhardson) [21:55:12] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [21:58:22] (Merged) jenkins-bot: Revert^3 "cirrus: Enable mlr-2024 for select wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105084 (owner: Ebernhardson) [21:58:53] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1105084|Revert^3 "cirrus: Enable mlr-2024 for select wikis"]] [22:03:29] (PS1) Krinkle: webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 [22:05:19] (CR) CI reject: [V:-1] webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 (owner: Krinkle) [22:05:39] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1105084|Revert^3 "cirrus: Enable mlr-2024 for select wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:06:19] (PS2) Krinkle: webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 [22:07:42] (CR) CI reject: [V:-1] webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 (owner: Krinkle) [22:07:44] (PS3) Krinkle: webperf: Move NavtimingStaleBeacon alert from 
per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 [22:07:50] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [22:09:05] (CR) CI reject: [V:-1] webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 (owner: Krinkle) [22:10:08] (PS4) Krinkle: webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - https://gerrit.wikimedia.org/r/1105087 [22:13:17] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105084|Revert^3 "cirrus: Enable mlr-2024 for select wikis"]] (duration: 14m 23s) [22:13:19] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:19:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [22:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [22:20:14] (CR) Krinkle: "Please bear in mind, that the perf team on longer exists, and the navtiming service remains unowned per https://www.mediawiki.org/wiki/Dev" [alerts] - https://gerrit.wikimedia.org/r/1105087 (owner: Krinkle) [22:40:19] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:47:40] !log dancy@deploy2002 Installing scap version "4.134.0" for 2 host(s) [22:49:24] !log dancy@deploy2002 Installation of scap version "4.134.0" completed for 2 hosts [22:49:54] !log dancy@deploy2002 Started scap sync-world: Testing scap 4.134.0 [22:53:13] !log dancy@deploy2002 Finished scap sync-world: Testing scap 4.134.0 (duration: 03m 18s) [23:06:19] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [23:10:19] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [23:24:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:32:39] SRE, DC-Ops, procurement: codfw:expansion: Console/management wiring - https://phabricator.wikimedia.org/T382383 (Papaul) NEW