[00:07:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1082300 (owner: TrainBranchBot)
[00:08:34] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1082308
[00:08:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1082308 (owner: TrainBranchBot)
[00:34:15] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1082308 (owner: TrainBranchBot)
[00:51:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:54:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[00:56:17] (PS1) CDanis: coredns: PreferDualStack [deployment-charts] - https://gerrit.wikimedia.org/r/1082312
[02:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:47] SRE, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10252631 (tstarling) The MW installer has a feature allowing the user to check a box to subscribe to mediawiki-announce. I tried to test it, sin...
[03:47:38] (CR) BBlack: [C:+1] "Looks sane to human eyes!" [puppet] - https://gerrit.wikimedia.org/r/1082241 (owner: Ssingh)
[03:48:46] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:codfw:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371434#10252637 (Papaul) Open→Resolved This is complete
[03:48:53] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10252634 (Papaul) Open→Resolved This is complete.
[04:51:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[04:59:16] SRE, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10252671 (tstarling) The slow request was ` curl https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/anonymous_s...
[05:15:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3007.wikimedia.org
[05:20:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3007.wikimedia.org
[05:31:34] (CR) Arnaudb: [C:+1] mysql: refactor this currently unused module [software/spicerack] - https://gerrit.wikimedia.org/r/1082245 (owner: Volans)
[05:32:39] (CR) Arnaudb: [C:+1] "that was fast!" [software/spicerack] - https://gerrit.wikimedia.org/r/1082246 (owner: Volans)
[05:35:15] (CR) Arnaudb: [C:+2] mysql_legacy: get systemd status for instance [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) (owner: Arnaudb)
[05:45:33] (Merged) jenkins-bot: mysql_legacy: get systemd status for instance [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) (owner: Arnaudb)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:19:47] (PS1) KartikMistry: Update cxserver to 2024-10-23-055433-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082326
[06:33:02] (PS1) Slyngshede: Revert "P:ircstream temporarily disable alerting" [puppet] - https://gerrit.wikimedia.org/r/1082327
[06:34:47] (CR) Slyngshede: "It made little difference disabling the Prometheus scraper. We still see 30 clients dropping of ever 20 minutes." [puppet] - https://gerrit.wikimedia.org/r/1082327 (owner: Slyngshede)
[06:35:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet
[06:35:34] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10252720 (ops-monitoring-bot) Draining ganeti2012.codfw.wmnet of running VMs
[06:36:47] (PS1) STran: Add source wiki to contributions on Special:GlobalContributions [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292)
[06:38:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet
[06:40:46] (CR) Muehlenhoff: "Zooming out to the time before the change was made, I don't see a real difference in the graph, though?" [puppet] - https://gerrit.wikimedia.org/r/1082327 (owner: Slyngshede)
[06:41:55] Quick cxserver deployment..
[06:42:01] (CR) Santhosh: [C:+1] Update cxserver to 2024-10-23-055433-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082326 (owner: KartikMistry)
[06:42:35] (CR) KartikMistry: [C:+2] Update cxserver to 2024-10-23-055433-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082326 (owner: KartikMistry)
[06:42:37] (CR) EarlyWarningBot: "Failed command: "composer run --timeout=0 phpunit:parallel:database --"" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[06:43:34] (Merged) jenkins-bot: Update cxserver to 2024-10-23-055433-production [deployment-charts] - https://gerrit.wikimedia.org/r/1082326 (owner: KartikMistry)
[06:44:11] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:44:34] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:44:55] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:45:23] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:45:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[06:46:40] (CR) EarlyWarningBot: "Failed command: "composer run --timeout=0 phpunit:parallel:database --"" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[06:47:22] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:47:58] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:48:49] !log Updated cxserver to 2024-10-23-055433-production
[06:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:26] (CR) Muehlenhoff: "Looks good, two nits inline" [puppet] - https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: Dzahn)
[06:51:50] (CR) STran: "Failing because it's dependent on another change being backported (I think)" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[06:54:36] (CR) CI reject: [V:-1] Add source wiki to contributions on Special:GlobalContributions [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[06:59:36] SRE, Maps, Traffic: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10252733 (Ckepper) Awesome, thank you - works like a charm 👍
[07:00:04] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T0700).
[07:00:05] Tran: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:04] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, Patch-For-Review: Grant bd808 membership in the contint-roots and contint-docker groups - https://phabricator.wikimedia.org/T377792#10252751 (Bmueller) Approved - thanks @Dzahn!
[07:05:47] (PS1) Jelto: gitlab::runner: stop runner on gitlab-runner2003 for reimage [puppet] - https://gerrit.wikimedia.org/r/1082329 (https://phabricator.wikimedia.org/T377374)
[07:07:34] (CR) Jelto: [C:+2] gitlab::runner: stop runner on gitlab-runner2003 for reimage [puppet] - https://gerrit.wikimedia.org/r/1082329 (https://phabricator.wikimedia.org/T377374) (owner: Jelto)
[07:10:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2003.codfw.wmnet with OS bullseye
[07:11:09] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host gitlab-runner2003
[07:11:52] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[07:12:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2002.codfw.wmnet to drbd
[07:12:44] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10252861 (ops-monitoring-bot) VM ml-etcd2002.codfw.wmnet switching disk type to drbd
[07:13:05] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1082327 (owner: Slyngshede)
[07:15:15] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner2003 - jelto@cumin1002"
[07:15:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner2003 - jelto@cumin1002"
[07:15:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:15:19] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache gitlab-runner2003.codfw.wmnet 93.32.192.10.in-addr.arpa 3.9.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:15:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-runner2003.codfw.wmnet 93.32.192.10.in-addr.arpa 3.9.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:15:22] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host gitlab-runner2003
[07:15:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gitlab-runner2003
[07:15:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host gitlab-runner2003
[07:22:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2002.codfw.wmnet to drbd
[07:22:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet
[07:22:54] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10252863 (ops-monitoring-bot) Draining ganeti2012.codfw.wmnet of running VMs
[07:23:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet
[07:23:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2002.codfw.wmnet to plain
[07:23:57] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10252864 (ops-monitoring-bot) VM ml-etcd2002.codfw.wmnet switching disk type to plain
[07:24:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2002.codfw.wmnet to plain
[07:24:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet
[07:25:01] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10252868 (ops-monitoring-bot) Draining ganeti2012.codfw.wmnet of running VMs
[07:27:00] (CR) Muehlenhoff: [C:+2] Remove ganeti2011 from active Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1082240 (https://phabricator.wikimedia.org/T376594) (owner: Muehlenhoff)
[07:32:11] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage
[07:32:40] (CR) Brouberol: [C:+1] Add new kafka-jumbo nodes to site.pp and preseed.yaml [puppet] - https://gerrit.wikimedia.org/r/1082249 (https://phabricator.wikimedia.org/T377874) (owner: Btullis)
[07:33:19] !log installing perf updates on bookworm nodes
[07:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage
[07:37:57] (PS1) Jelto: Revert "gitlab::runner: stop runner on gitlab-runner2003 for reimage" [puppet] - https://gerrit.wikimedia.org/r/1082331 (https://phabricator.wikimedia.org/T377374)
[07:38:57] (CR) Jelto: [C:+2] Revert "gitlab::runner: stop runner on gitlab-runner2003 for reimage" [puppet] - https://gerrit.wikimedia.org/r/1082331 (https://phabricator.wikimedia.org/T377374) (owner: Jelto)
[07:39:54] (CR) Brouberol: airflow: define an optional airflow-kerberos Deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: Brouberol)
[07:41:17] (PS8) Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875)
[07:44:20] SRE, serviceops: host rdb1014 is down - https://phabricator.wikimedia.org/T376961#10252886 (MoritzMuehlenhoff) Resolved→Open >>! In T376961#10225247, @akosiaris wrote: > I 'll resolve, although something tells me we 'll soon see this again. You jinxed it :-) rdb1014 is again down since three d...
[07:46:00] (CR) Slyngshede: [C:+2] Revert "P:ircstream temporarily disable alerting" [puppet] - https://gerrit.wikimedia.org/r/1082327 (owner: Slyngshede)
[07:52:31] (CR) Kosta Harlan: Add source wiki to contributions on Special:GlobalContributions (1 comment) [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[07:52:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2003.codfw.wmnet with OS bullseye
[07:53:09] (CR) Brouberol: [C:+1] "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/1082163 (https://phabricator.wikimedia.org/T377745) (owner: Gmodena)
[07:53:44] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10252919 (MatthewVernon) Yes, per [[ https://www.sqlite.org/lang_vacuum.html | the docs ]], "The VACUUM command works by copying the...
[07:53:44] (CR) Kosta Harlan: "Yes, I think if you backport I95a5b88ec81583e16ccf8e58cdb8e12e00aae5bf, this will work. I believe you should be able to use `scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[07:54:32] is the deployment window still open?
[07:54:49] Oh shoot
[07:54:57] I got my hour mixed
[07:55:12] feel free to extend
[07:55:17] thanks hashar
[07:55:20] 🙇 thank you
[07:55:32] (PS1) Jelto: docker_registry_ha::registry: update gitlab-runner2003 IP [puppet] - https://gerrit.wikimedia.org/r/1082332 (https://phabricator.wikimedia.org/T377374)
[07:55:33] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[07:55:35] the idea of the window is to have a chance to have a deployer present
[07:56:00] and if a deployment doesn't conflict with an ongoing other operation, I guess it is fine to deploy a config :)
[07:56:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: reboot
[07:56:40] These are patches going into core/CheckUser, not config. Is that alright?
[07:56:42] (CR) MVernon: [C:+1] restbase203[6-8]: initial setup [puppet] - https://gerrit.wikimedia.org/r/1082301 (https://phabricator.wikimedia.org/T377896) (owner: Eevans)
[07:56:54] (CR) MVernon: [C:+1] Configure restbase10[34-42] for partition reuse [puppet] - https://gerrit.wikimedia.org/r/1082302 (https://phabricator.wikimedia.org/T354227) (owner: Eevans)
[07:56:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: reboot
[07:57:38] Tran: yes, should be fine
[07:57:57] Since I'm extending the window is there anything I need to do when I'm done to close it out?
[07:58:02] (going to start my backports now)
[07:58:23] (CR) Slyngshede: [C:+2] P:idp ensure defaults for Redis is present for all deployment. [puppet] - https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[07:59:09] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10252925 (MatthewVernon) That's quite a big problem if the OS can't find the disks on storage servers; did the test node have a different disk controlle...
[07:59:19] (CR) TrainBranchBot: [C:+2] "Approved by stran@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[07:59:23] (CR) Kosta Harlan: [C:+1] Add source wiki to contributions on Special:GlobalContributions [extensions/CheckUser] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[08:01:37] Tran: I think just add the `!log UTC morning deploys done` message when you finish
[08:01:51] :+1
[08:01:59] per https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#General_advice_before_you_start
[08:02:00] (CR) Ayounsi: "replied to the specific question, let me know if I should also review the patch." [cookbooks] - https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[08:02:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[08:02:47] (PS1) Jelto: gitlab::runner: stop runner on gitlab-runner2004 for reimage [puppet] - https://gerrit.wikimedia.org/r/1082333 (https://phabricator.wikimedia.org/T377374)
[08:04:25] (CR) Jelto: [C:+2] gitlab::runner: stop runner on gitlab-runner2004 for reimage [puppet] - https://gerrit.wikimedia.org/r/1082333 (https://phabricator.wikimedia.org/T377374) (owner: Jelto)
[08:06:05] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: new JDK - jmm@cumin2002
[08:07:22] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bullseye
[08:07:48] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host gitlab-runner2004
[08:08:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[08:08:29] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:09:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet
[08:09:14] (PS1) Jelto: docker_registry_ha::registry: update gitlab-runner2004 IP [puppet] - https://gerrit.wikimedia.org/r/1082334 (https://phabricator.wikimedia.org/T377374)
[08:09:16] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1082209 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[08:11:48] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner2004 - jelto@cumin1002"
[08:11:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host gitlab-runner2004 - jelto@cumin1002"
[08:11:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:11:52] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache gitlab-runner2004.codfw.wmnet 71.48.192.10.in-addr.arpa 1.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:11:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-runner2004.codfw.wmnet 71.48.192.10.in-addr.arpa 1.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:11:56] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host gitlab-runner2004
[08:12:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host gitlab-runner2004
[08:12:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host gitlab-runner2004
[08:12:39] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10252959 (MatthewVernon) lshw picks up a `MegaRAID 12GSAS/PCIe Secure SAS39xx` which [[ https://linux-hardware.org/?id=pci:1000-10e2-1590-032a | the int...
[08:14:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet
[08:17:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet
[08:18:30] FIRING: JobUnavailable: Reduced availability for job gitlab_runner in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:32] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10252976 (MatthewVernon) I had a quick look at the web UI of the BMC, and saw the following under storage: {F57636115} ...which looks to my inexpert eye...
[08:20:00] (CR) Ayounsi: "Happy to review it but I don't know grub enough. Do you have any pointers on how you came up with it?" [puppet] - https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: JHathaway)
[08:22:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet
[08:22:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet
[08:23:22] (PS16) Andrea Denisse: alert: Ensure vopsbot database is synced from active to passive hosts [puppet] - https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143)
[08:23:23] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1082325/4357/" [puppet] - https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: Andrea Denisse)
[08:26:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: new JDK - jmm@cumin2002
[08:28:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet
[08:28:41] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage
[08:28:46] (CR) WMDE-leszek: [C:+1] wikidata-query-gui: add releases for commons, query-main and scholarly (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1082167 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[08:29:10] !log installing Java 11 security updates
[08:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:26] SRE, SRE-Unowned, Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10253022 (elukey) >>! In T376014#10251224, @Ottomata wrote: >> if/when we'll decide to move to Eventstreams...
[08:32:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage
[08:33:54] (PS1) Jelto: Revert "gitlab::runner: stop runner on gitlab-runner2004 for reimage" [puppet] - https://gerrit.wikimedia.org/r/1082412 (https://phabricator.wikimedia.org/T377374)
[08:36:07] (CR) Jelto: [C:+2] Revert "gitlab::runner: stop runner on gitlab-runner2004 for reimage" [puppet] - https://gerrit.wikimedia.org/r/1082412 (https://phabricator.wikimedia.org/T377374) (owner: Jelto)
[08:36:44] (PS9) Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875)
[08:41:57] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10253093 (MatthewVernon) does `megacli` work? That's the tool (from the `megacli` package) that I...
[08:43:12] Backport failed due to failing tests, I don't think it's related to my patch but investigating now.
[08:43:27] (CR) Cathal Mooney: [C:+1] vlan migration report: add one example host per group [software/netbox-extras] - https://gerrit.wikimedia.org/r/1081061 (owner: Ayounsi)
[08:44:02] (CR) Cathal Mooney: [C:+1] Netbox: run the vlan_migration report every 2 hours [puppet] - https://gerrit.wikimedia.org/r/1081071 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[08:47:10] (PS1) Elukey: sre.hosts.provision: fix hw raid detection for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1082422 (https://phabricator.wikimedia.org/T371400)
[08:48:19] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:48:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:49:57] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1082422 (https://phabricator.wikimedia.org/T371400) (owner: Elukey)
[08:50:37] (CR) Gmodena: [C:+2] charts: airflow: alert only on task failure [deployment-charts] - https://gerrit.wikimedia.org/r/1082163 (https://phabricator.wikimedia.org/T377745) (owner: Gmodena)
[08:51:29] FIRING: [16x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:51:51] (Merged) jenkins-bot: charts: airflow: alert only on task failure [deployment-charts] - https://gerrit.wikimedia.org/r/1082163 (https://phabricator.wikimedia.org/T377745) (owner: Gmodena)
[08:51:53] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10253127 (jcrespo) Nope, megacli doesn't work. That's the one option I tried first, before going o...
[08:52:15] FIRING: [2x] JobUnavailable: Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:52:31] (PS7) Vgutierrez: liberica: provide a liberica module [puppet] - https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127)
[08:52:31] (PS4) Vgutierrez: profile: Provide a liberica profile [puppet] - https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127)
[08:54:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[08:54:56] (CR) Elukey: [C:+2] sre.hosts.provision: fix hw raid detection for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1082422 (https://phabricator.wikimedia.org/T371400) (owner: Elukey)
[08:55:20] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10253136 (jcrespo) >>! In T377853#10251006, @jcrespo wrote: > perccli and storecli are not exactly...
[08:56:33] (PS10) Brouberol: airflow: define an optional airflow-kerberos Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875)
[08:57:53] (CR) Brouberol: [C:+2] airflow: analytics: alert only on task failure [puppet] - https://gerrit.wikimedia.org/r/1082001 (https://phabricator.wikimedia.org/T377745) (owner: Gmodena)
[08:58:00] I think
[08:58:01] https://phabricator.wikimedia.org/T377912 needs more eyes on it (possibly an emergency)
[08:58:32] (CR) STran: "recheck" [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: STran)
[08:59:49] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10253144 (elukey) >>! In T371400#10252976, @MatthewVernon wrote: > I had a quick look at the web UI of the BMC, and saw the following under storage: >...
[09:01:33] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, observability: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10253146 (MatthewVernon) Perhaps relevantly, I was screenshotting the BMC storage page on another...
[09:02:40] I'll backport my changes in another window, still looking at the failing test
[09:02:43] !log UTC morning deploys done
[09:02:43] (PS1) Slyngshede: P:idp Add default Redis values to profile. [puppet] - https://gerrit.wikimedia.org/r/1082423 (https://phabricator.wikimedia.org/T377728)
[09:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:06] (CR) Slyngshede: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1082423 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[09:04:15] (CR) Hnowlan: [C:+2] modules/admin: Add bd808 to contint-roots and contint-docker groups [puppet] - https://gerrit.wikimedia.org/r/1082105 (https://phabricator.wikimedia.org/T377792) (owner: BryanDavis)
[09:04:38] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1082423 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[09:05:01] (CR) Slyngshede: [C:+2] P:idp Add default Redis values to profile. [puppet] - https://gerrit.wikimedia.org/r/1082423 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[09:05:23] (CR) Elukey: [C:+1] remote: add dry_run getter for RemoteHosts [software/spicerack] - https://gerrit.wikimedia.org/r/1082244 (owner: Volans)
[09:07:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bullseye
[09:10:11] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, Patch-For-Review: Grant bd808 membership in the contint-roots and contint-docker groups - https://phabricator.wikimedia.org/T377792#10253156 (hnowlan) Open→Resolved a:hnowlan Merged!
[09:10:14] (03PS1) 10Muehlenhoff: Add ganeti2039/ganeti2040 as Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082424 (https://phabricator.wikimedia.org/T376594) [09:11:28] (03CR) 10Elukey: [C:03+1] "Small typo in the commit msg, really nice refactor :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082245 (owner: 10Volans) [09:13:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10253170 (10cmooney) >>! In T377381#10252289, @Jclark-ctr wrote: > @cmooney Step 1: Firewall Installation & Cabling is complet... [09:13:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082424 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [09:14:33] (03CR) 10Elukey: [C:03+1] mysql_legacy: add cursor method to Instance class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082246 (owner: 10Volans) [09:16:33] (03CR) 10Elukey: [C:03+1] k8s.upgrade-cluster: Black format and sort imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:17:12] (03CR) 10Clément Goubert: "LGTM, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:17:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10253199 (10MatthewVernon) >>! In T371400#10253144, @elukey wrote: > If so apologies, but these are the first nodes with HW raid that we get, some adjus... 
[09:18:22] (03CR) 10Volans: [C:03+2] remote: add dry_run getter for RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082244 (owner: 10Volans) [09:18:53] (03PS2) 10Volans: mysql: refactor this currently unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082245 [09:20:06] (03CR) 10Volans: mysql: refactor this currently unused module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082245 (owner: 10Volans) [09:22:15] FIRING: [2x] JobUnavailable: Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:22:38] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti2039/ganeti2040 as Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082424 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [09:24:13] !log volans@cumin1002 START - Cookbook sre.mysql.pool db1185 gradually with 4 steps - Testing new cookbook [09:24:15] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185 gradually with 4 steps - Testing new cookbook [09:26:47] (03CR) 10Elukey: [C:03+1] "Left a couple of nits, nothing big, review and decide what to do :) Feel free to merge afterwards!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:27:38] (03Merged) 10jenkins-bot: remote: add dry_run getter for RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082244 (owner: 10Volans) [09:28:27] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/cadvisor: lookup extra metrics from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082178 (https://phabricator.wikimedia.org/T377804) (owner: 10Tiziano Fogli) [09:29:27] !log volans@cumin1002 START - Cookbook sre.mysql.depool db1185 - Testing new cookbook [09:29:34] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1185 - Testing new cookbook [09:29:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:30:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10253253 (10Clement_Goubert) p:05Triage→03Medium a:05akosiaris→03Jclark-ctr [09:30:47] !log volans@cumin1002 START - Cookbook sre.mysql.pool db1185 gradually with 4 steps - Testing new cookbook [09:31:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet [09:31:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:32:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10253273 (10akosiaris) >>! In T376961#10252886, @MoritzMuehlenhoff wrote: >>>! In T376961#10225247, @akosiaris wrote: >> I 'll resolve, alth... 
[09:33:00] (plz ignore the message i posted previously) [09:34:11] !log volans@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185 gradually with 4 steps - Testing new cookbook [09:34:17] (03CR) 10Volans: [C:03+2] mysql: refactor this currently unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082245 (owner: 10Volans) [09:35:13] jouncebot: nowandnext [09:35:14] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [09:35:14] In 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1000) [09:36:47] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] wmcs: puppetserver: introduce apt pin for openjdk (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082201 (https://phabricator.wikimedia.org/T377803) (owner: 10Arturo Borrero Gonzalez) [09:38:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet [09:42:05] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [09:46:29] (03Merged) 10jenkins-bot: mysql: refactor this currently unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082245 (owner: 10Volans) [09:48:05] (03CR) 10Volans: [C:03+2] mysql_legacy: add cursor method to Instance class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082246 (owner: 10Volans) [09:48:28] (03PS1) 10Jgiannelos: changeprop: Fix broken identation caused by trailing whitespace trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082430 [09:49:37] (03PS2) 10Jgiannelos: changeprop: Fix broken identation caused by trailing whitespace trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082430 (https://phabricator.wikimedia.org/T372749) [09:50:19] (03PS12) 10Volans: sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) [09:52:53] (03CR) 
10Hnowlan: [C:03+1] changeprop: Fix broken identation caused by trailing whitespace trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082430 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [09:54:12] (03CR) 10Jgiannelos: [C:03+2] changeprop: Fix broken identation caused by trailing whitespace trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082430 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [09:54:44] (03CR) 10Elukey: [C:03+1] "Left a minor nit, looks good in my opinion!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:55:17] (03Merged) 10jenkins-bot: changeprop: Fix broken identation caused by trailing whitespace trim [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082430 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [09:55:47] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [09:56:44] (03PS13) 10Volans: sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) [09:59:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [09:59:34] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [09:59:45] (03CR) 10Slyngshede: Account blocking: Publically available log of all block and unblocks. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1000) [10:03:13] !log Restarted MediaModeration scanning script for commonswiki - https://wikitech.wikimedia.org/wiki/MediaModeration [10:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:34] (03PS1) 10Gmodena: refinery: gobblin: add webrequest_frontend. [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) [10:04:46] (03PS1) 10Jgiannelos: changeprop: Fix typo in rerendered_pcs_endpoints url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082435 [10:05:01] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:06:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [10:08:07] (03CR) 10Hnowlan: [C:03+1] changeprop: Fix typo in rerendered_pcs_endpoints url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082435 (owner: 10Jgiannelos) [10:09:28] (03CR) 10Tiziano Fogli: alert: Ensure vopsbot database is synced from active to passive hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [10:10:02] (03CR) 10Arnaudb: [C:03+1] sre.mysql.pool: add two new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [10:10:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [10:10:21] (03CR) 
10Jgiannelos: [C:03+2] changeprop: Fix typo in rerendered_pcs_endpoints url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082435 (owner: 10Jgiannelos) [10:10:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [10:11:40] (03Merged) 10jenkins-bot: changeprop: Fix typo in rerendered_pcs_endpoints url [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082435 (owner: 10Jgiannelos) [10:12:04] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:12:23] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:13:15] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:13:17] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:13:27] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:13:48] (03CR) 10Volans: [C:03+2] sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [10:14:29] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:15:39] (03PS1) 10Jgiannelos: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082437 [10:19:08] 10ops-codfw, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10253422 (10aborrero) a:03Papaul hey @Papaul and/or @Jhancock.wm per https://phabricator.wikimedia.org/T377568#10247882 you should be receiving mem... 
[10:19:25] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10253430 (10aborrero) [10:20:13] (03Merged) 10jenkins-bot: sre.mysql.pool: add two new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: 10Volans) [10:23:57] jouncebot: nowandnext [10:23:57] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1000) [10:23:57] In 0 hour(s) and 36 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1100) [10:35:47] (03CR) 10JMeybohm: wikidata-query-gui: move query.wikidata.org into separate values file (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:35:55] (03CR) 10JMeybohm: [C:04-1] wikidata-query-gui: add releases for commons, query-main and scholarly (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082167 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:36:38] (03PS1) 10Hnowlan: sessionstore: complete migration to envoy tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082441 (https://phabricator.wikimedia.org/T363996) [10:38:30] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082312 (owner: 10CDanis) [10:39:22] (03CR) 10Elukey: [C:03+1] sessionstore: complete migration to envoy tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082441 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [10:40:38] (03CR) 10Effie Mouzeli: [C:03+1] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082437 (owner: 10Jgiannelos) [10:41:01] (03CR) 10Jgiannelos: [C:03+2] changeprop: Bump chart version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1082437 (owner: 10Jgiannelos) [10:42:18] (03Merged) 10jenkins-bot: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082437 (owner: 10Jgiannelos) [10:43:05] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [10:43:36] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:45:17] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:45:43] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:46:33] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.15.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082443 [10:49:04] (03PS9) 10JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) [10:49:05] (03PS25) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [10:49:14] (03CR) 10JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:49:19] (03CR) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:50:08] (03PS1) 10Superzerocool: nlwiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082444 (https://phabricator.wikimedia.org/T377930) [10:51:11] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:53:30] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply 
[10:56:15] (03CR) 10JMeybohm: [C:03+2] k8s.pool-depool-node: Add support for multiple nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:59:34] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.15.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082443 (owner: 10Volans) [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1100). [11:01:27] (03PS2) 10Slyngshede: Account blocking: Publically available log of all block and unblocks. [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) [11:01:36] (03CR) 10Mvolz: [C:03+2] Update Zotero to node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082237 (owner: 10Mvolz) [11:01:38] (03CR) 10Slyngshede: Account blocking: Publically available log of all block and unblocks. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [11:01:59] (03Merged) 10jenkins-bot: k8s.pool-depool-node: Add support for multiple nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1082191 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [11:02:38] (03Merged) 10jenkins-bot: Update Zotero to node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082237 (owner: 10Mvolz) [11:04:41] jouncebot: nowandnext [11:04:42] For the next 0 hour(s) and 55 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1100) [11:04:42] In 1 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1300) [11:04:54] (03PS1) 10Dreamy Jazz: recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082445 (https://phabricator.wikimedia.org/T377932) [11:05:09] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:05:33] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:05:52] (03PS2) 10Máté Szabó: recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082445 (https://phabricator.wikimedia.org/T377932) (owner: 10Dreamy Jazz) [11:08:04] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [11:09:00] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:09:04] (03CR) 10Hnowlan: [C:03+2] thumbor: add mcrouter config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078386 (owner: 10Hnowlan) [11:09:22] RESOLVED: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on 
non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [11:09:31] (03CR) 10Dreamy Jazz: [C:03+2] recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082445 (https://phabricator.wikimedia.org/T377932) (owner: 10Dreamy Jazz) [11:09:37] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:09:46] (03PS2) 10Muehlenhoff: Deprecate system::role for initial batch of serviceops services [puppet] - 10https://gerrit.wikimedia.org/r/1076160 [11:10:06] (03Merged) 10jenkins-bot: thumbor: add mcrouter config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078386 (owner: 10Hnowlan) [11:10:07] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.15.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082443 (owner: 10Volans) [11:11:18] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:11:45] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:12:19] (03PS1) 10Volans: Upstream release v8.15.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1082448 [11:17:08] (03PS1) 10Jgiannelos: changeprop: Configure PCS URI to be the discovery name of the service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082450 [11:17:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [11:19:04] (03PS2) 10Jgiannelos: changeprop: Configure PCS URI to be the discovery name of the service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082450 (https://phabricator.wikimedia.org/T372749) [11:19:04] Dreamy_Jazz: I'm done fyi in case you wanted to use the rest of the window [11:19:21] Thanks. Will do. 
[11:20:19] (03CR) 10Jgiannelos: "Discovered while checking for metrics on PCS level. We didn't get more traffic as it was expected." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082450 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [11:20:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082445 (https://phabricator.wikimedia.org/T377932) (owner: 10Dreamy Jazz) [11:21:07] FIRING: [3x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [11:23:26] (03CR) 10Volans: [C:03+2] Upstream release v8.15.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1082448 (owner: 10Volans) [11:27:24] (03PS3) 10Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) [11:28:53] (03Abandoned) 10Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082167 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:30:12] (03CR) 10Slyngshede: [C:03+2] Account blocking: Publically available log of all block and unblocks. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [11:30:46] (03CR) 10Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:31:18] (03PS4) 10Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) [11:33:35] (03Merged) 10jenkins-bot: Upstream release v8.15.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1082448 (owner: 10Volans) [11:33:35] (03Merged) 10jenkins-bot: Account blocking: Publically available log of all block and unblocks. [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) (owner: 10Slyngshede) [11:33:38] (03PS1) 10Muehlenhoff: Add component/jdk8 for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082451 [11:38:14] (03PS5) 10Jelto: wikidata-query-gui: add releases for commons, query-main and scholarly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) [11:39:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet [11:41:04] (03Merged) 10jenkins-bot: recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082445 (https://phabricator.wikimedia.org/T377932) (owner: 10Dreamy Jazz) [11:41:57] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1082445|recentchanges: Use current time for imported revision category changes (T377932)]] [11:42:02] T377932: ImportTemporaryUserIntegrationTest::testShouldSuccessfullyUpdateCategoryMembershipInRecentChanges is flaky - 
https://phabricator.wikimedia.org/T377932 [11:44:39] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1082445|recentchanges: Use current time for imported revision category changes (T377932)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:44:43] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [11:45:33] (03PS1) 10Muehlenhoff: Move ganeti2012 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1082454 (https://phabricator.wikimedia.org/T376594) [11:47:48] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:23] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082445|recentchanges: Use current time for imported revision category changes (T377932)]] (duration: 07m 26s) [11:49:36] T377932: ImportTemporaryUserIntegrationTest::testShouldSuccessfullyUpdateCategoryMembershipInRecentChanges is flaky - https://phabricator.wikimedia.org/T377932 [11:50:28] (03CR) 10Muehlenhoff: [C:03+2] Move ganeti2012 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1082454 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [11:51:10] (03CR) 10Muehlenhoff: [C:03+2] Add component/jdk8 for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1082451 (owner: 10Muehlenhoff) [11:54:08] (03PS1) 10Muehlenhoff: Also remove ganeti2012 from list of active nodes used by ferm [puppet] - 10https://gerrit.wikimedia.org/r/1082455 [11:54:17] 06SRE, 10ChangeProp, 06collaboration-services, 06Infrastructure-Foundations, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10253657 (10jijiki) [11:56:11] (03CR) 10Muehlenhoff: [C:03+2] Also remove ganeti2012 from list of active nodes used by ferm [puppet] - 
10https://gerrit.wikimedia.org/r/1082455 (owner: 10Muehlenhoff) [12:11:33] 06SRE, 10ChangeProp, 06cloud-services-team, 06collaboration-services, and 11 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10253695 (10jijiki) [12:14:49] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for initial batch of serviceops services [puppet] - 10https://gerrit.wikimedia.org/r/1076160 (owner: 10Muehlenhoff) [12:16:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2039.codfw.wmnet to cluster codfw and group C [12:16:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2039.codfw.wmnet to cluster codfw and group C [12:18:27] (03PS1) 10Muehlenhoff: Add ganeti2039/2040 to list of Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082458 (https://phabricator.wikimedia.org/T376594) [12:21:24] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti2039/2040 to list of Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1082458 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [12:21:29] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:25:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2039.codfw.wmnet to cluster codfw and group C [12:26:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2039.codfw.wmnet to cluster codfw and group C [12:33:33] (03CR) 10JMeybohm: [C:03+2] Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:33:36] (03CR) 10JMeybohm: [C:03+2] k8s.upgrade-cluster: Support stacked hardware control 
planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:33:39] (03CR) 10JMeybohm: [C:03+2] k8s.upgrade-cluster: Black format and sort imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:34:09] (03CR) 10Gergő Tisza: "test error was T377932" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [12:34:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, just a few typos inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 (owner: 10Slyngshede) [12:34:29] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10253766 (10aborrero) 05In progress→03Stalled Turns out, to enable PTR creation support, per {T377740} we would need to eit... 
[12:34:34] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942 (10phaultfinder) 03NEW [12:35:21] (03PS1) 10Volans: orchestrator: fix bug with older requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082459 [12:37:54] (03CR) 10JMeybohm: [C:03+1] "Can't speak to the individual configurations, but helm-wise this LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082166 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:39:21] (03Merged) 10jenkins-bot: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:39:22] (03Merged) 10jenkins-bot: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:39:27] (03Merged) 10jenkins-bot: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:46:01] (03PS1) 10Gergő Tisza: SessionManager: Add more logging when unpersisting invalid sessions [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082464 (https://phabricator.wikimedia.org/T372702) [12:46:22] (03PS1) 10Gergő Tisza: Log unexpected central session lookup misses [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082465 (https://phabricator.wikimedia.org/T372702) [12:47:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082464 (https://phabricator.wikimedia.org/T372702) (owner: 10Gergő Tisza) [12:47:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 23 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082465 (https://phabricator.wikimedia.org/T372702) (owner: 10Gergő Tisza) [12:48:11] (03CR) 10Btullis: [C:03+1] "Good stuff, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [12:50:12] (03CR) 10Lucas Werkmeister (WMDE): "On the deployment calendar this is currently scheduled before https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCampaignEvent" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [12:50:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10253808 (10ayounsi) a:05ayounsi→03Papaul [12:50:47] 06SRE, 06Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028#10253811 (10ayounsi) 05Open→03Stalled [12:51:34] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638#10253832 (10ayounsi) 05Open→03Stalled a:05ayounsi→03None [12:52:04] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10253836 (10ayounsi) a:03Papaul [12:52:42] (03CR) 10Btullis: [V:03+1 C:03+2] Add new kafka-jumbo nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082249 (https://phabricator.wikimedia.org/T377874) (owner: 10Btullis) [12:53:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2040.codfw.wmnet to cluster codfw and group C [12:55:56] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install kafka-jumbo10[16-18] - 
https://phabricator.wikimedia.org/T377874#10253850 (10BTullis) a:05BTullis→03None [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1300). [13:00:05] Daimona, tgr, and Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:25] I can theoretically deploy but it would be much better if someone else could ^^ [13:01:23] o/ [13:01:33] 👋 I can self-service if we have time for my patches in this window [13:01:35] (03CR) 10Elukey: [C:03+1] orchestrator: fix bug with older requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082459 (owner: 10Volans) [13:02:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2040.codfw.wmnet to cluster codfw and group C [13:02:18] (03CR) 10Volans: [C:03+2] orchestrator: fix bug with older requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082459 (owner: 10Volans) [13:02:32] (03CR) 10Ssingh: [C:03+2] wmflib::service: Set depool_threshold as a float [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [13:03:56] Tran: I would say feel free to go ahead with your changes if no one else is deploying yet [13:04:08] (03CR) 10Volans: wmflib::service: Set depool_threshold as a float (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [13:04:29] Hi! [13:05:30] @Daimona can you deploy? 
[13:06:26] (03CR) 10Ssingh: [C:03+2] wmflib::service: Set depool_threshold as a float (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082238 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [13:06:38] (03CR) 10Brouberol: [C:03+2] airflow: define an optional airflow-kerberos Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082207 (https://phabricator.wikimedia.org/T375875) (owner: 10Brouberol) [13:08:05] (03PS3) 10Daimona Eaytoy: Enable CampaignEvents collaboration list in testwiki and test2wiki (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) [13:08:11] (03CR) 10Daimona Eaytoy: "> On the deployment calendar this is currently scheduled before https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCampaignEve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [13:08:16] (03PS40) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [13:09:08] !log running agent on A:lvs to roll out CR 1082238 [13:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:23] No, I'm not a deployer. But Lucas did raise a point about localhost:6009 not being equivalent to query-main. I'm guessing we should get in touch with Search to discuss this. [13:09:40] The backport should be good to go regardless, but not the config patch. 
[13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:37] I guess I’ll try to deploy after all [13:11:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:11:41] it’s just gonna be a bit slow probably [13:12:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:12:20] !log installing qemu security updates [13:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082277 (https://phabricator.wikimedia.org/T377746) (owner: 10Daimona Eaytoy) [13:12:43] Thank you Lucas_WMDE, it's very appreciated. take your time [13:12:48] (03Merged) 10jenkins-bot: orchestrator: fix bug with older requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082459 (owner: 10Volans) [13:13:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:13:12] (03PS1) 10Volans: service: change depool_threshold field to float [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082470 (https://phabricator.wikimedia.org/T377127) [13:13:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:13:57] dammit, I used the wrong browser profile [13:14:00] how do I log out of logstash? 
[13:14:24] (03CR) 10Volans: "Please prepare always before hand also the related patch for Spicerack or releasing the puppet one will break any cookbook that uses the s" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082470 (https://phabricator.wikimedia.org/T377127) (owner: 10Volans) [13:14:40] Lucas_WMDE: I can deploy if needed (sorry was looking elsewhere) [13:15:07] tgr|away: I already started the first scap backport, but you can take over after that [13:15:15] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs and A:ulsfo and A:lvs [13:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs and A:ulsfo and A:lvs [13:20:29] (03CR) 10Ssingh: [C:03+1] "Thanks for patch and for the reminder. We will take care of it in the future." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082470 (https://phabricator.wikimedia.org/T377127) (owner: 10Volans) [13:20:44] @HouseOfM: I've been scanning documentation but couldn't find a definitive answer. I think we should hold off. 
[13:21:09] (03CR) 10Hnowlan: [C:03+1] changeprop: Configure PCS URI to be the discovery name of the service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082450 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [13:21:17] ok, it's frustrating but understandable [13:22:15] FIRING: [2x] JobUnavailable: Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:22:57] Is the only thing I could find. I'm guessing the current approach could be fine, but I'd want a confirmation before we move forward. [13:23:09] ^meant to link to https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service [13:23:40] (03CR) 10Daimona Eaytoy: [C:04-1] "Holding off for now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [13:24:03] (03Merged) 10jenkins-bot: WikiProjectIDLookup: use SparqlClient and make endpoint configurable [extensions/WikimediaCampaignEvents] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082277 (https://phabricator.wikimedia.org/T377746) (owner: 10Daimona Eaytoy) [13:24:28] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1082277|WikiProjectIDLookup: use SparqlClient and make endpoint configurable (T377746)]] [13:24:33] T377746: Make Sparql endpoint configurable - https://phabricator.wikimedia.org/T377746 [13:24:41] (03PS1) 10Jgiannelos: pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 [13:25:07] Daimona, HouseOfM: IMHO the URL is probably correct for now, but I think search team might not be thrilled to find out about another internal use case of WDQS that they know nothing about [13:25:27] Yeah that's exactly how I feel based on what I read so far. 
[13:25:29] like, at some point that port will need to be updated for the query-main / query-scholarly split, and they’ll need to know all the places that need updating [13:25:56] (03PS2) 10Jgiannelos: pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 [13:26:19] I'd err on the side of forgiveness rather than permission, but I understand [13:26:20] (03CR) 10Ssingh: [C:03+1] "Apologies for missing this; noted for future." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082470 (https://phabricator.wikimedia.org/T377127) (owner: 10Volans) [13:26:23] Daimona: will there be anything to test for the backport alone once it’s on mwdebug? [13:26:39] or should I just send it through right away [13:26:57] I don't think so, the feature is still 100% disabled/hidden in prod [13:26:59] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs and not A:ulsfo and A:lvs [13:27:01] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Backport for [[gerrit:1082277|WikiProjectIDLookup: use SparqlClient and make endpoint configurable (T377746)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:27:01] (03CR) 10Volans: [C:03+2] service: change depool_threshold field to float [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082470 (https://phabricator.wikimedia.org/T377127) (owner: 10Volans) [13:27:02] ack [13:27:04] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Continuing with sync [13:27:07] good timing :P [13:27:13] (03PS3) 10Jgiannelos: pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 [13:27:43] (03PS4) 10Jgiannelos: pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 [13:28:12] (03PS41) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 
(https://phabricator.wikimedia.org/T377129) [13:28:13] Daimona, HouseOfM, if you add a new client targeting wdqs-internal, would you mind adding a task similar to T374021 (parent task T374453) so that we don't forget to update your service? [13:28:14] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [13:28:14] T374453: Migration of traffic to new Split Endpoints for WDQS - https://phabricator.wikimedia.org/T374453 [13:29:10] dcausse: will do [13:29:19] thanks! :) [13:29:55] (03PS5) 10Jgiannelos: pcs: Configure prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 (https://phabricator.wikimedia.org/T372749) [13:29:56] (03PS5) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [13:30:24] (03CR) 10Jgiannelos: [C:03+2] changeprop: Configure PCS URI to be the discovery name of the service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082450 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [13:30:32] Oh hi dcausse :D I was just wondering what would be the best way to get in touch with y'all :) [13:31:10] Is this something you think it would be worth discussing in more detail before we move forward? [13:31:25] (03PS6) 10Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) [13:31:26] (03Merged) 10jenkins-bot: changeprop: Configure PCS URI to be the discovery name of the service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082450 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [13:31:36] Daimona: hi!
just saw the conversation and thought that might be a good time to ask :) [13:31:43] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082277|WikiProjectIDLookup: use SparqlClient and make endpoint configurable (T377746)]] (duration: 07m 15s) [13:31:45] (I'm multitasking badly and trying to understand what to do, so apologies if what I'm saying sounds more stupid than usual) [13:31:57] tgr|away: over to you, I think [13:32:00] T377746: Make Sparql endpoint configurable - https://phabricator.wikimedia.org/T377746 [13:32:35] (with dcausse here, maybe Daimona’s config change will be ready for deployment later in the window still, but let’s start with those backports first IMHO) [13:32:52] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [13:32:56] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082269 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [13:32:58] (03CR) 10Herron: [C:03+1] "LGTM barring Tiziano's open comment" [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [13:32:59] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [13:33:06] I +2ed the backports already to speed up the deployment :) [13:33:17] To simplify, @dcausse, what endpoint should we be using right now?
[13:33:24] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [13:33:37] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [13:33:47] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:34:06] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:34:19] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:34:45] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:34:49] HouseOfM: (cc Daimona) sadly it does not exist yet for wdqs-internal but should be available soon, what you could do in the meantime is test that your use-case fits with wdqs-main only (testing your queries with https://wdqs-main.wikidata.org) [13:35:01] 06SRE, 10ChangeProp, 06cloud-services-team, 06collaboration-services, and 11 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10254048 (10bking) Forgive the drive-by comment, but at the 6-month anniversary of this ticket, it might be... [13:36:39] dcausse: and is it okay if they reuse the proxy for the internal full endpoint in the meantime, like in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1082082/3/wmf-config/CommonSettings.php ? 
(assuming the phab task to later move it to the split endpoints is created) [13:36:55] Daimona: the main concern I have is that wdqs is not the most robust service we have, so if the use-case you have that relies on it is a bit critical that might not be the best backend to use to source the data you need [13:36:56] (03Merged) 10jenkins-bot: service: change depool_threshold field to float [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082470 (https://phabricator.wikimedia.org/T377127) (owner: 10Volans) [13:37:38] Lucas_WMDE: yes should be fine, we should migrate them alongside the constraint checks [13:37:45] (03PS1) 10Muehlenhoff: Switch ganeti2040 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082474 [13:37:57] !log installing gdk-pixbuf security updates [13:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [13:38:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082269 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [13:38:53] dcausse: it depends what you mean by "critical", it's for an ancillary part of our extension that is yet to be deployed so we might be ok [13:39:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:39:41] dcausse: thankfully our use case is pretty simple (just one query), low traffic (results are cached for 1h), and not critical (we tolerate stale data for up to 1w, plus it's an experimental feature). And it should work with the -main endpoint happily (though I'm going to double-check).
Nonetheless, I'd be happy to have a longer / more detailed conversation with y'all if you think that would be helpful. [13:39:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:40:14] yikes, zuul predicts 24 more minutes for those backports :S [13:40:35] Daimona: I say we go ahead, and then discuss any necessary changes outside of this arena. thoughts? [13:40:42] HouseOfM: "critical" in a sense that if wdqs fails would that affect the global stability/availability of the site because of your use-case? [13:41:02] dcausse: oh absolutely not [13:42:00] (03CR) 10AOkoth: [C:03+1] docker_registry_ha::registry: update gitlab-runner2003 IP [puppet] - 10https://gerrit.wikimedia.org/r/1082332 (https://phabricator.wikimedia.org/T377374) (owner: 10Jelto) [13:42:01] Isn't the zuul ETA implemented as `return rand( 20, 60 )`? [13:42:35] HouseOfM: thanks, another quick question: are the queries run by your extension written in the code or provided by the users? [13:42:37] (03CR) 10AOkoth: [C:03+1] docker_registry_ha::registry: update gitlab-runner2004 IP [puppet] - 10https://gerrit.wikimedia.org/r/1082334 (https://phabricator.wikimedia.org/T377374) (owner: 10Jelto) [13:42:47] It's a single query written in the code [13:42:54] ^ [13:43:00] https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaCampaignEvents/+/76080e38ca4e6cb7b8e8ebc88f445d7abbc8d55f/src/WikiProject/WikiProjectIDLookup.php#100 [13:43:00] cool, then all good to me :) [13:43:12] ok, then let’s go ahead with that config change while the backports go through gate-and-submit?
[13:43:28] yes please Lucas_WMDE [13:43:47] thanks for all of the input dcausse [13:43:48] OK, thanks :) I'll restore it in the calendar and remove my CR-1 [13:43:54] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2040 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082474 (owner: 10Muehlenhoff) [13:44:11] And will file the subtask as an action item unless you've already done that @HouseOfM [13:44:12] ok [13:44:27] Daimona: I'm doing it niw [13:44:29] now* [13:45:09] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy: SQL query quote escape - https://phabricator.wikimedia.org/T376712#10254082 (10ABran-WMF) 05Open→03Declined see T368881#10254014 [13:45:29] Noice, thank you, and thanks dcausse for the quick feedback :) [13:47:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [13:48:47] (03CR) 10Jelto: [C:03+2] docker_registry_ha::registry: update gitlab-runner2003 IP [puppet] - 10https://gerrit.wikimedia.org/r/1082332 (https://phabricator.wikimedia.org/T377374) (owner: 10Jelto) [13:48:51] (03CR) 10Jelto: [C:03+2] docker_registry_ha::registry: update gitlab-runner2004 IP [puppet] - 10https://gerrit.wikimedia.org/r/1082334 (https://phabricator.wikimedia.org/T377374) (owner: 10Jelto) [13:50:16] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: mariadb: systemctl status accessor in mysql_legacy - https://phabricator.wikimedia.org/T377129#10254105 (10ABran-WMF) 05Open→03Resolved code is implemented, needs to be tested under T374191 [13:51:42] (03CR) 10Daimona Eaytoy: "Removing CR-1 per updated description of T376055." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [13:53:25] ah. 
“backport is locked by tgr” [13:53:30] FIRING: [2x] JobUnavailable: Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:53:31] then the config change has to wait a bit after all ^^ [13:54:11] I missed those TrainBranchBot / wikibugs messages of 13:38 UTC [13:54:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs and not A:ulsfo and A:lvs [13:55:26] Lucas_WMDE: yeah it's still waiting for zuul [13:55:40] yeah, I just thought you were still away, sorry ^^ [13:57:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:58:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [13:59:15] Lucas_WMDE: the irssi away plugin I'm using broke a while ago, and I can't muster the will to debug it, so I'm just permanently away now [13:59:32] I don't look at IRC very often so I guess it's sort of accurate [13:59:39] heh [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1400) [14:00:18] backport window is still ongoing, sorry [14:00:42] and if the Wikifunctions people don’t need the whole window, there are still several more backports we’d love to deploy in the remaining time [14:00:54] (but we can also just let the current backport go through and then do the rest later) [14:01:01] with all the patches to backport, that is to be expected :-] [14:01:17] +1 on extending, and thanks for all the deployments! 
[14:01:29] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:57] tgr|away: I have switched to https://www.irccloud.com/ and thus chat from a browser tab [14:02:04] (03PS17) 10Andrea Denisse: alert: Ensure vopsbot database is synced from active to passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) [14:02:30] the UI is decent and as I remember the service provides a bouncer so you can attach irssi to it [14:02:57] (03PS1) 10Clément Goubert: php*-cli: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T341555) [14:03:08] (03CR) 10Andrea Denisse: alert: Ensure vopsbot database is synced from active to passive hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [14:04:16] hashar: yeah I tried. Wasn't great either, though can't recall the details. Still better than having to fix fifteen year old perl plugins I guess.
[14:04:56] it could be chatrooms over eggdrop bots (which is written in TCL) [14:07:10] (03Merged) 10jenkins-bot: Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082265 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [14:07:12] (03CR) 10Eevans: [C:03+2] restbase203[6-8]: initial setup [puppet] - 10https://gerrit.wikimedia.org/r/1082301 (https://phabricator.wikimedia.org/T377896) (owner: 10Eevans) [14:07:17] yay 1/2 [14:07:18] (03Merged) 10jenkins-bot: Auth: pass accountType to authevents log stream [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082269 (https://phabricator.wikimedia.org/T341650) (owner: 10Gergő Tisza) [14:07:21] yay 2/2 [14:07:21] (03CR) 10Eevans: [C:03+2] Configure restbase10[34-42] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1082302 (https://phabricator.wikimedia.org/T354227) (owner: 10Eevans) [14:07:50] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1082265|Auth: pass accountType to authevents log stream (T341650 T375510 T375505)]], [[gerrit:1082269|Auth: pass accountType to authevents log stream (T341650 T375510 T375505)]] [14:08:09] T341650: Update authentication metrics for IP masking - https://phabricator.wikimedia.org/T341650 [14:08:10] T375510: Temp accounts Grafana Dashboard: Rate of account creation - https://phabricator.wikimedia.org/T375510 [14:08:10] T375505: Temp accounts Grafana Dashboard: Rate of temporary account creation - https://phabricator.wikimedia.org/T375505 [14:08:14] (03CR) 10Ssingh: [C:03+2] P:dns:auth: alert if a change was submitted but authdns-update was not run [puppet] - 10https://gerrit.wikimedia.org/r/1082241 (owner: 10Ssingh) [14:10:10] (03PS3) 10Eevans: Configure restbase10[34-42] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1082302 (https://phabricator.wikimedia.org/T377896) [14:10:15] !log tgr@deploy2002 tgr: Backport for [[gerrit:1082265|Auth: pass 
accountType to authevents log stream (T341650 T375510 T375505)]], [[gerrit:1082269|Auth: pass accountType to authevents log stream (T341650 T375510 T375505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:19] hashar: granted, tcl was worse. I wrote some Wikipedia plugins for eggbot back when huwiki was still using IRC. [14:10:55] (03CR) 10Clément Goubert: "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1069136 (owner: 10Clément Goubert) [14:11:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10254230 (10MoritzMuehlenhoff) [14:13:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10254250 (10Eevans) a:05Eevans→03None [14:13:46] (03CR) 10Eevans: [C:03+2] Configure restbase10[34-42] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1082302 (https://phabricator.wikimedia.org/T377896) (owner: 10Eevans) [14:14:29] !log sudo cumin 'A:dnsbox' 'run-puppet-agent' [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:39] !log tgr@deploy2002 tgr: Continuing with sync [14:17:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:18:17] !log sudo cumin 'O:alerting_host' 'run-puppet-agent' [14:18:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:14] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082265|Auth: pass accountType to authevents log stream (T341650 T375510 T375505)]], [[gerrit:1082269|Auth: pass accountType to authevents log stream (T341650 T375510 T375505)]] (duration: 13m 23s) [14:21:29] T341650: Update authentication 
metrics for IP masking - https://phabricator.wikimedia.org/T341650 [14:21:29] T375510: Temp accounts Grafana Dashboard: Rate of account creation - https://phabricator.wikimedia.org/T375510 [14:21:29] T375505: Temp accounts Grafana Dashboard: Rate of temporary account creation - https://phabricator.wikimedia.org/T375505 [14:21:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:21:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:21:54] yay [14:22:09] is anyone from Wikifunctions claiming their window? [14:22:12] otherwise I would keep deploying [14:22:19] (03PS2) 10Clément Goubert: php*-cli: Add helper scripts for mwcron, mwscript [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) [14:22:28] (assuming Daimona / HouseOfM and Tran are still around) [14:22:34] 👋 [14:22:36] yup [14:22:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:23:13] (03CR) 10Volans: interactive: Ring the bell by default in ask_input (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1069136 (owner: 10Clément Goubert) [14:23:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:23:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:24:22] (03CR) 10Lucas Werkmeister (WMDE): "Please add `Depends-On: I95a5b88ec81583e16ccf8e58cdb8e12e00aae5bf` in that case so CI will also pass (`scap backport` will also know what " [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:24:43] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [14:24:51] alright, let’s start with the config change [14:25:08] Tran: I also left a comment on the CheckUser backport [14:25:27] (03Merged) 10jenkins-bot: Enable CampaignEvents collaboration list in testwiki and test2wiki (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082082 (https://phabricator.wikimedia.org/T376055) (owner: 10Daimona Eaytoy) [14:25:52] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.15.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082482 [14:25:52] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1082082|Enable CampaignEvents collaboration list in testwiki and test2wiki (v2) (T376055)]] [14:25:59] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.15.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082482 (owner: 10Volans) [14:26:13] T376055: Release Collaboration List MVP to testwiki and test2wiki - https://phabricator.wikimedia.org/T376055 [14:27:18] (03PS2) 10STran: Add source wiki to contributions on Special:GlobalContributions [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) [14:27:57] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:28:14] (03PS2) 10JHathaway: efi: add script install grub on all efi sys parts [puppet] - 10https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) [14:28:19] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1082082|Enable CampaignEvents collaboration 
list in testwiki and test2wiki (v2) (T376055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:31] Daimona, HouseOfM: please test :) [14:28:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:29:07] btw, re “Isn't the zuul ETA implemented as `return rand( 20, 60 )`?” – it looks like for those MW core backports it was relatively accurate in the end, they did take over 30 minutes in total :S [14:29:12] (which is unfortunate for Tran…) [14:29:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:29:23] 🥳 [14:30:12] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10254335 (10MoritzMuehlenhoff) [14:30:17] (03CR) 10JHathaway: "Apologies for not adding more context. I updated the patch with a bit more explanation as well as the resources and testing methodology, l" [puppet] - 10https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [14:30:30] It's no chance that I've been dreaming of parallel PHPUnit for a few years now... Hopefully the days of hour-long waits will soon be over. [14:30:41] yup [14:31:29] (03Abandoned) 10Muehlenhoff: mw_rc_irc: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074429 (owner: 10Muehlenhoff) [14:31:50] (03CR) 10Btullis: [C:03+1] "Nice. Let's give it a try." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [14:33:37] Ohhh damn I forgot that test2wiki is in group1, so we still see the HTTP error there. testwiki is working now though, so I'm not worried. Since it's still a test wiki after all, I'm happy just leaving it semi-broken for a few hours. [14:33:45] ah, ok ^^ [14:33:51] * Lucas_WMDE wasn’t aware of that either [14:33:55] good to deploy then? 
[14:34:03] I didn't want to add yet another backport for wmf.27 :D [14:34:07] :'D [14:34:18] Yeah LGTM. @HouseOfM anything on your side? [14:34:27] (03PS1) 10Hnowlan: admin: update tandic ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1082483 (https://phabricator.wikimedia.org/T300383) [14:34:35] if they’re scheduled together, two backports aren’t really slower than one :P [14:34:49] but I sure would prefer not to backport that separately to wmf.27 now ^^ [14:35:04] (03CR) 10STran: Add source wiki to contributions on Special:GlobalContributions (031 comment) [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:35:12] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.15.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1082482 (owner: 10Volans) [14:35:50] Yeah no need to, it can stay as-is. The only difference is basically that the error message changes background from red to gray :P [14:36:17] I wrote that backwards but you get the idea [14:37:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:04] alright, let’s go ahead with it [14:39:06] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [14:39:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:40:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:42:03] Daimona, HouseOfM : we have some documentation about using WDQS as a backend for other features:https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Technical_interactions cc: dcausse [14:43:40] !log lucaswerkmeister-wmde@deploy2002 
Finished scap sync-world: Backport for [[gerrit:1082082|Enable CampaignEvents collaboration list in testwiki and test2wiki (v2) (T376055)]] (duration: 17m 47s) [14:43:50] T376055: Release Collaboration List MVP to testwiki and test2wiki - https://phabricator.wikimedia.org/T376055 [14:43:58] alright [14:44:04] Tran: do you want to self-service your backports? [14:44:14] 👍 no problem [14:44:17] starting my backport now then [14:44:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:44:57] 👍 [14:45:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:45:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by stran@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:46:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:46:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:46:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:47:07] (03PS1) 10Volans: Upstream release v8.15.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1082485 [14:47:13] (03CR) 10Volans: [C:03+2] Upstream release v8.15.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1082485 (owner: 10Volans) [14:47:14] gehel: Ooooooh thanks, I didn't see that! Maybe the page could include either the URL of the endpoint (`localhost:6009`) or the service name (`wdqs-internal`)? Those are the things I was looking for. 
[14:48:17] (03PS1) 10Volans: setup.py: pin prospector [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 [14:49:01] And BTW, I read the last section and believe we comply with all the requirements, so yay! [14:49:06] (03CR) 10Volans: "Fixed in I870b9c0c7c3c18f0f7df35de8115dabbd7aa335e" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1069136 (owner: 10Clément Goubert) [14:49:38] (03CR) 10Tiziano Fogli: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [14:51:33] (03PS1) 10Muehlenhoff: Switch ganeti2039 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1082490 [14:51:36] I also just retested on testwiki and it's looking good. Thanks @Lucas_WMDE for this and the other million config changes you helped with :D [14:51:42] yay \o/ [14:52:37] Tran: can you ping me when you’re done? I have something else to deploy [14:52:50] (03Merged) 10jenkins-bot: Support template overrides in ContributionsPager [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082203 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:52:52] (03Merged) 10jenkins-bot: Add source wiki to contributions on Special:GlobalContributions [extensions/CheckUser] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082328 (https://phabricator.wikimedia.org/T356292) (owner: 10STran) [14:52:59] yay, CI finished [14:53:19] !log stran@deploy2002 Started scap sync-world: Backport for [[gerrit:1082203|Support template overrides in ContributionsPager (T356292)]], [[gerrit:1082328|Add source wiki to contributions on Special:GlobalContributions (T356292)]] [14:53:23] T356292: Return global contributions by temporary accounts given an IP address or range - https://phabricator.wikimedia.org/T356292 [14:53:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on rdb1014.eqiad.wmnet with reason: Hardware issue [14:53:57] !log cgoubert@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on rdb1014.eqiad.wmnet with reason: Hardware issue [14:54:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10254455 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0d71122b-3e94-47c7-a121-4dda9db372d8) set by cgoubert@cumin1002... [14:55:45] !log stran@deploy2002 stran: Backport for [[gerrit:1082203|Support template overrides in ContributionsPager (T356292)]], [[gerrit:1082328|Add source wiki to contributions on Special:GlobalContributions (T356292)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:57:31] (03Merged) 10jenkins-bot: Upstream release v8.15.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1082485 (owner: 10Volans) [14:57:35] (03CR) 10Clément Goubert: interactive: Ring the bell by default in ask_input (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1069136 (owner: 10Clément Goubert) [14:59:27] The full feature can't be tested since it depends on a bugfix hitting group1 in a few hours (Special:GlobalContributions is only available on meta) but I poked around and didn't see any regressions elsewhere. Going to continue. 
[14:59:31] !log stran@deploy2002 stran: Continuing with sync [14:59:42] sounds good [15:00:32] I wonder how hard it would be to temporarily bump a wiki to another version in mw-debug-k8s [15:00:45] I think I used to do that once or twice by editing the versions JSON file on the bare-metal mwdebug hosts [15:00:58] but it does sound like a fairly niche use case [15:01:19] 🫣 [15:02:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:47] (03CR) 10CI reject: [V:04-1] setup.py: pin prospector [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans) [15:03:40] hm, it sounds like that might actually be possible with T276994 [15:03:41] T276994: Provide an mwdebug functionality on kubernetes - https://phabricator.wikimedia.org/T276994 [15:03:48] if the right config is somewhere below /srv/mediawiki [15:03:58] * Lucas_WMDE hasn’t tried out mw-experimental yet [15:04:12] !log stran@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082203|Support template overrides in ContributionsPager (T356292)]], [[gerrit:1082328|Add source wiki to contributions on Special:GlobalContributions (T356292)]] (duration: 10m 53s) [15:04:26] T356292: Return global contributions by temporary accounts given an IP address or range - https://phabricator.wikimedia.org/T356292 [15:05:21] Should be done @Lucas_WMDE> [15:05:25] thanks! 
[15:08:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:08:36] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1082483 (https://phabricator.wikimedia.org/T300383) (owner: 10Hnowlan) [15:09:05] * Lucas_WMDE deploying some more [15:10:01] (03PS2) 10Volans: tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 [15:10:23] !log uploaded spicerack_8.15.1 to apt.wikimedia.org bullseye-wikimedia [15:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:28] !log lucaswerkmeister-wmde Deployed security patch for T377912 [15:16:19] (03PS6) 10Bking: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) [15:16:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:17:06] (03CR) 10Andrea Denisse: [C:03+2] alert: Ensure vopsbot database is synced from active to passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [15:17:52] !log UTC afternoon backport+config window done [15:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:57] jouncebot: nowandnext [15:17:57] No deployments scheduled for the next 1 hour(s) and 42 minute(s) [15:17:57] In 1 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1700) [15:18:22] tgr|away: IMHO you could deploy those session logging backports now if you like (or wait until the later window where you scheduled them) [15:18:35] * Lucas_WMDE done deploying [15:19:02] (03CR) 10Btullis: airflow: enable Kerberos auth on the API backend 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:19:46] !log dduvall@deploy2002 Started deploy [releng/jenkins-deploy@d8e345f] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/94 [15:20:38] !log dduvall@deploy2002 Finished deploy [releng/jenkins-deploy@d8e345f] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/94 (duration: 01m 05s) [15:22:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [15:23:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10254615 (10cmooney) [15:24:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10254623 (10cmooney) [15:25:04] (03CR) 10CI reject: [V:04-1] tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans) [15:25:20] (03PS3) 10Volans: tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 [15:28:06] !log uploaded openjdk-8 8u422-b05-1~deb12u0 for component/jdk for bookworm-wikimedia (bootstrap build since openjdk-8 needs openjdk-8 to build) [15:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:02] (03CR) 10Btullis: [C:03+1] "Nice work Brian." 
[puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:30:27] (03PS7) 10Ryan Kemper: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:31:09] (03PS8) 10Ryan Kemper: statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:31:42] (03CR) 10Ryan Kemper: [C:03+1] statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:34:27] I am going to restart the CI Jenkins to reload some plugins [15:35:56] !log Restarted CI Jenkins [15:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:07] (03CR) 10Bking: [C:03+2] statistics::explorer hosts: better visibility into processes [puppet] - 10https://gerrit.wikimedia.org/r/1081987 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [15:40:06] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10254745 (10Ottomata) > reconnecting using the last event id's timestamp may be lossy I think it shouldn't be... 
[15:41:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10254748 (10cmooney) [15:42:16] !log dduvall@deploy2002 Started deploy [releng/jenkins-deploy@e1c56d1] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/95 [15:42:55] !log dduvall@deploy2002 Finished deploy [releng/jenkins-deploy@e1c56d1] (releasing): Deploying https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/95 (duration: 00m 53s) [15:43:33] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082308 (owner: 10TrainBranchBot) [15:43:33] (03PS8) 10Vgutierrez: liberica: provide a liberica module [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) [15:44:22] (03CR) 10CI reject: [V:04-1] liberica: provide a liberica module [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) (owner: 10Vgutierrez) [15:45:24] duh... a missing SPDX header... 
I wonder why utils/run_ci_locally.sh didn't catch that :) [15:46:31] (03PS9) 10Vgutierrez: liberica: provide a liberica module [puppet] - 10https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127) [15:46:31] (03PS6) 10Vgutierrez: profile: Provide a liberica profile [puppet] - 10https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127) [15:47:59] (03CR) 10Ssingh: "Once the doc string is added, we will merge this tomorrow (Thu)" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:48:34] 10ops-codfw, 06SRE, 06DC-Ops: lsw-d[18]-codfw missing console port info in netbox - https://phabricator.wikimedia.org/T376917#10254802 (10Jhancock.wm) a:03Jhancock.wm [15:48:53] (03PS1) 10Btullis: Add new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082497 (https://phabricator.wikimedia.org/T377878) [15:51:06] !log btullis@deploy2002 Started deploy [airflow-dags/analytics_test@ba61f77]: T351388 [15:51:11] T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues - https://phabricator.wikimedia.org/T351388 [15:51:20] !log btullis@deploy2002 Finished deploy [airflow-dags/analytics_test@ba61f77]: T351388 (duration: 00m 31s) [15:51:45] !log btullis@deploy2002 Started deploy [airflow-dags/analytics@ba61f77]: T351388 [15:51:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082498 [15:51:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082498 (owner: 10TrainBranchBot) [15:52:49] !log btullis@deploy2002 Finished deploy [airflow-dags/analytics@ba61f77]: T351388 (duration: 01m 08s) [15:53:28] !log btullis@deploy2002 Started deploy [airflow-dags/search@ba61f77]: T351388 [15:53:42] (03PS1) 10Elukey: tox: add Jenkins settings to reduce its execution 
time [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082500 [15:53:42] (03PS1) 10Elukey: tests: fix outstanding CI issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082501 [15:53:53] !log btullis@deploy2002 Finished deploy [airflow-dags/search@ba61f77]: T351388 (duration: 00m 29s) [15:54:20] !log btullis@deploy2002 Started deploy [airflow-dags/research@ba61f77]: T351388 [15:55:03] !log btullis@deploy2002 Finished deploy [airflow-dags/research@ba61f77]: T351388 (duration: 00m 45s) [15:55:04] (03CR) 10Elukey: [C:03+1] tests: fix outstanding CI issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082501 (owner: 10Elukey) [15:55:15] !log btullis@deploy2002 Started deploy [airflow-dags/platform_eng@ba61f77]: T351388 [15:55:43] !log btullis@deploy2002 Finished deploy [airflow-dags/platform_eng@ba61f77]: T351388 (duration: 00m 31s) [15:55:59] !log btullis@deploy2002 Started deploy [airflow-dags/analytics_product@ba61f77]: T351388 [15:57:12] !log btullis@deploy2002 Finished deploy [airflow-dags/analytics_product@ba61f77]: T351388 (duration: 01m 15s) [15:57:28] T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues - https://phabricator.wikimedia.org/T351388 [15:57:55] (03PS2) 10Elukey: tox: add Jenkins settings to reduce its execution time [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082500 [15:57:55] (03PS2) 10Elukey: tests: fix outstanding CI issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082501 [16:03:10] (03PS1) 10Urbanecm: StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082503 (https://phabricator.wikimedia.org/T377907) [16:04:00] (03CR) 10CI reject: [V:04-1] tox: add Jenkins settings to reduce its execution time [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082500 (owner: 
10Elukey) [16:06:03] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: Decommission ganeti2009/ganeti2010 - https://phabricator.wikimedia.org/T377741#10254930 (10Jhancock.wm) 05Open→03Resolved [16:06:23] (03PS2) 10Btullis: Add new an-worker nodes to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1082497 (https://phabricator.wikimedia.org/T377878) [16:07:09] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4361/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082497 (https://phabricator.wikimedia.org/T377878) (owner: 10Btullis) [16:07:31] jouncebot: nowandnext [16:07:31] No deployments scheduled for the next 0 hour(s) and 52 minute(s) [16:07:31] In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1700) [16:09:44] !log hnowlan@cumin1002 START - Cookbook sre.discovery.service-route depool sessionstore in codfw: sessionstore mesh migration T363996 [16:10:18] (03CR) 10Scott French: [C:03+1] sessionstore: complete migration to envoy tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082441 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:10:23] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996 [16:14:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:14:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: sessionstore mesh migration T363996 [16:15:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:15:43] (03CR) 10Volans: "this is against the debian branch, not master ;)" [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082500 
(owner: 10Elukey) [16:15:57] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996 [16:16:18] (03CR) 10Hnowlan: [C:03+2] sessionstore: complete migration to envoy tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082441 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:17:21] (03Merged) 10jenkins-bot: sessionstore: complete migration to envoy tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082441 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:17:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:19:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:20:39] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: apply [16:20:49] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [16:21:22] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: apply [16:21:44] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [16:24:17] (03CR) 10CI reject: [V:04-1] StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082503 (https://phabricator.wikimedia.org/T377907) (owner: 10Urbanecm) [16:25:03] !log hnowlan@cumin1002 START - Cookbook sre.discovery.service-route pool sessionstore in codfw: sessionstore mesh migration T363996 [16:25:35] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996 [16:27:53] (03CR) 10Elukey: "sigh" [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082500 (owner: 10Elukey) [16:30:07] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) 
pool sessionstore in codfw: sessionstore mesh migration T363996 [16:30:40] (03PS1) 10Ssingh: P:dns:auth: use correct path for npre_command and authdns-update check [puppet] - 10https://gerrit.wikimedia.org/r/1082510 [16:30:42] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996 [16:31:07] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [16:31:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4362/co" [puppet] - 10https://gerrit.wikimedia.org/r/1082510 (owner: 10Ssingh) [16:31:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:32:11] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082498 (owner: 10TrainBranchBot) [16:32:23] (03CR) 10Ssingh: [V:03+1] "Thanks to @vgutierrez@wikimedia.org for pointing out that the path was incorrect." 
[puppet] - 10https://gerrit.wikimedia.org/r/1082510 (owner: 10Ssingh) [16:32:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:33:39] (03PS1) 10Ssingh: wikimedia-dns.org: dummy change to test authdns-update alert [dns] - 10https://gerrit.wikimedia.org/r/1082511 [16:33:46] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns:auth: use correct path for npre_command and authdns-update check [puppet] - 10https://gerrit.wikimedia.org/r/1082510 (owner: 10Ssingh) [16:34:23] (03Abandoned) 10Elukey: tox: add Jenkins settings to reduce its execution time [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1082500 (owner: 10Elukey) [16:35:13] !log sudo cumin 'O:alerting_host or O:dnsbox' 'run-puppet-agent' [16:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:19] (03CR) 10Ahmon Dancy: "recheck" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082503 (https://phabricator.wikimedia.org/T377907) (owner: 10Urbanecm) [16:39:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10255195 (10phaultfinder) [16:39:46] going to test an alert for cases when a change was submitted to the DNS repository but authdns-update was not run [16:39:52] please disregard alert spam for the next few mins [16:40:08] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: dummy change to test authdns-update alert [dns] - 10https://gerrit.wikimedia.org/r/1082511 (owner: 10Ssingh) [16:52:27] !log restart ircecho on alerting hosts [16:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:00] dancy: do you plan to deploy the GE backport? or should i? 
[16:56:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns1005 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns1006 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns2005 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns2004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:22] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns2006 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns4004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns4003 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns3004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns3003 is OK: DNS git repository and local zone files are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns6002 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:24] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns6001 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:25] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns7001 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:25] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns7002 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:26] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns5004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:26] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns5003 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:56:28] urbanecm: I'll leave it to you! [16:57:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:58:10] urbanecm: Testing seems to be a problem. 
[16:58:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:58:30] RECOVERY - check if authdns-update was run after a change was submitted to dns.git on dns1004 is OK: DNS git repository and local zone files are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:59:23] dancy: given phan complains about a change that was not change, i'd say it'd fail on an empty commit too [16:59:44] lemme test it, and if it fails, we can forcemerge and fix CI later [16:59:57] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082498 (owner: 10TrainBranchBot) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1700) [17:00:16] (03PS1) 10Urbanecm: [DNM] Test CI [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082516 [17:00:22] let's see ^^ [17:00:43] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network eleemnts from Netbox - https://phabricator.wikimedia.org/T377996 (10cmooney) 03NEW p:05Triage→03Medium [17:00:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:01:19] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10255334 (10cmooney) Just a note to say that fundraising no longer use any VM infra, so every assigned IP I believe belongs to just a single server. 
[17:01:22] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network eleemnts from Netbox - https://phabricator.wikimedia.org/T377996#10255335 (10cmooney) [17:01:29] 06SRE, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10255336 (10cmooney) [17:01:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:01:45] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10255340 (10cmooney) [17:02:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:02:20] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10255341 (10cmooney) [17:02:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:03:13] 06SRE, 06Infrastructure-Foundations, 10netops: Manange fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10255344 (10cmooney) [17:06:04] dancy: okay, phan fails here too. i'll forcemerge, do you mind filing a task for "CI broken in current train branch"? [17:06:21] (03CR) 10Hnowlan: [V:03+2 C:03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1082483 (https://phabricator.wikimedia.org/T300383) (owner: 10Hnowlan) [17:07:59] urbanecm: I'm afk at the moment but I can when I get back (<30 minutes) [17:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082518 [17:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082518 (owner: 10TrainBranchBot) [17:12:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383#10255385 (10hnowlan) Key updated - please let me know if it works. [17:21:37] jouncebot: nowandnext [17:21:38] For the next 0 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1700) [17:21:38] In 0 hour(s) and 38 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1800) [17:22:28] (03CR) 10CI reject: [V:04-1] [DNM] Test CI [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082516 (owner: 10Urbanecm) [17:27:22] (03PS1) 10Bking: search-platform: Fix runbook link for RdfStreamingUpdaterSpaceUsageTooHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1082522 (https://phabricator.wikimedia.org/T375109) [17:30:29] (03CR) 10Urbanecm: [C:03+2] StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082503 (https://phabricator.wikimedia.org/T377907) (owner: 10Urbanecm) [17:30:35] (03CR) 10Urbanecm: [V:03+2 C:03+2] StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082503 (https://phabricator.wikimedia.org/T377907) (owner: 10Urbanecm) 
[17:31:18] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1082503|StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events (T377907)]] [17:31:52] T377907: [regression-wmf.28] mobile - link recommendation task cannot load editor - https://phabricator.wikimedia.org/T377907 [17:33:51] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1082503|StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events (T377907)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:34:53] (03PS4) 10Elukey: tests: fix outstanding CI issues [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082486 (owner: 10Volans) [17:38:29] !log urbanecm@deploy2002 urbanecm: Continuing with sync [17:38:49] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1082518 (owner: 10TrainBranchBot) [17:39:25] (03CR) 10CI reject: [V:04-1] tox: add Jenkins settings to reduce its execution time [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1082524 (owner: 10Elukey) [17:39:29] (03PS1) 10Cathal Mooney: Update static reverse PTR records for frack records codfw [dns] - 10https://gerrit.wikimedia.org/r/1082525 (https://phabricator.wikimedia.org/T374176) [17:43:13] (03PS2) 10Bking: search-platform: Fix runbook link for RdfStreamingUpdaterSpaceUsageTooHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1082522 (https://phabricator.wikimedia.org/T375109) [17:43:15] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082503|StructuredTaskMobileArticleTarget: Fix history hacks to avoid firing events (T377907)]] (duration: 11m 56s) [17:43:21] * urbanecm done [17:43:35] T377907: [regression-wmf.28] mobile - link recommendation task cannot load editor - https://phabricator.wikimedia.org/T377907 [17:46:42] urbanecm: Do you have a link to a blank commit job that fails? 
[17:46:44] (03PS3) 10MacFan4000: ExtensionDistributor: Mark 1.43 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082256 (https://phabricator.wikimedia.org/T372322) [17:47:02] dancy: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1082516 [17:47:12] or https://integration.wikimedia.org/ci/job/mwext-php74-phan/49277/ if you want the build specifically [17:47:17] thx [17:51:25] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10255525 (10elukey) >>! In T376014#10254745, @Ottomata wrote: >> reconnecting using the last event id's timest... [17:51:45] urbanecm: https://phabricator.wikimedia.org/T378003 [17:51:49] ty! [17:51:51] (03PS3) 10Bking: RdfStreamingUpdaterSpaceUsageTooHigh: move alert from search-platform to data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1082522 (https://phabricator.wikimedia.org/T375109) [17:52:11] dancy: fwiw, backport done, regression is no longer there, so i removed the task as a blocker [17:52:18] Thanks! [17:59:42] RECOVERY - Host rdb1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:00:05] dancy and jeena: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T1800). 
[18:01:06] (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082531 (https://phabricator.wikimedia.org/T375659) [18:01:08] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082531 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:01:29] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:01:53] (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082531 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:05:39] (03PS1) 10Ssingh: P:dns:auth: increase clush timeout for command execution [puppet] - 10https://gerrit.wikimedia.org/r/1082533 [18:06:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4363/console" [puppet] - 10https://gerrit.wikimedia.org/r/1082533 (owner: 10Ssingh) [18:07:37] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082538 (https://phabricator.wikimedia.org/T375659) [18:07:38] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082538 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:07:54] Rolling the train back due to error counts rising. I'll file a report shortly. 
[18:08:06] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns:auth: increase clush timeout for command execution [puppet] - 10https://gerrit.wikimedia.org/r/1082533 (owner: 10Ssingh) [18:08:55] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082538 (https://phabricator.wikimedia.org/T375659) (owner: 10TrainBranchBot) [18:09:14] !log running agent on A:dnsbox [18:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:10:31] Filed https://phabricator.wikimedia.org/T378006 [18:12:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:13:02] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:15:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:11] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:19:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:25] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:21:29] httpbb failure is a 500 on https://it.wikiquote.org/wiki/Pagina_principale, I'm assuming due to T378006 but haven't dug [18:21:30] T378006: Cannot declare class CacheTime, because the name is already in use in CacheTime.php - https://phabricator.wikimedia.org/T378006 [18:21:43] That makes the most sense. [18:22:16] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/1082522 (https://phabricator.wikimedia.org/T375109) (owner: 10Bking) [18:23:19] (03PS1) 10Ahmon Dancy: Adjust return type documentation on SuggestedEdits [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082539 [18:24:02] (03PS2) 10Ahmon Dancy: Adjust return type documentation on SuggestedEdits [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082539 (https://phabricator.wikimedia.org/T378003) [18:26:39] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.28 refs T375659 [18:27:13] T375659: 1.43.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T375659 [18:28:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:29:52] (03PS1) 10Scott French: shellbox: pin all instances at live image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082317 (https://phabricator.wikimedia.org/T375243) [18:29:53] (03PS1) 10Scott French: shellbox-syntaxhighlight: upgrade to 2024-10-15-214239 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082318 
(https://phabricator.wikimedia.org/T375243) [18:29:55] (03PS2) 10Scott French: shellbox: upgrade to 2024-10-15-214239 (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082319 (https://phabricator.wikimedia.org/T375243) [18:31:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:33:02] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:33:42] (03CR) 10Scott French: "Thanks in advance for the review, Hugh. If you could take a look at this and the next two patches in the series, that would be greatly app" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082317 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [18:35:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:35:25] FIRING: [7x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:39:37] (03CR) 10Bking: [C:03+2] RdfStreamingUpdaterSpaceUsageTooHigh: move alert from search-platform to data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1082522 
(https://phabricator.wikimedia.org/T375109) (owner: 10Bking) [18:40:49] (03Merged) 10jenkins-bot: RdfStreamingUpdaterSpaceUsageTooHigh: move alert from search-platform to data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1082522 (https://phabricator.wikimedia.org/T375109) (owner: 10Bking) [18:44:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082539 (https://phabricator.wikimedia.org/T378003) (owner: 10Ahmon Dancy) [18:45:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:46:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:45] (03PS1) 10Ssingh: P:dns:auth: exit on non-zero code from clush [puppet] - 10https://gerrit.wikimedia.org/r/1082541 [18:52:59] dancy: Any idea when the train may be moving again? 
[18:53:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [18:53:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [18:54:12] Niharika: i'd say whenever the train blocker gets resolved :) [18:54:21] T378006 [18:54:21] T378006: Cannot declare class CacheTime, because the name is already in use in CacheTime.php - https://phabricator.wikimedia.org/T378006 [18:54:29] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1082540 is apparently having some CI troubles [18:54:56] Niharika: The train is blocked on https://phabricator.wikimedia.org/T378006, which is currently being worked in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1082540. Looks like there are testing problems to be resolved. The person working on it said that he'd be travelling for an hour or so, so I think it's gonna be a while. [18:55:05] Ah. I saw the patch but not the CI troubles. Alright, I'll rein in my excitement a while longer. [18:55:14] Thanks urbanecm and dancy [18:58:10] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10255848 (10Ottomata) > when the client resumes from last-event-timestamp whatever came from the last-consume... [18:58:27] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10255850 (10jhathaway) For most bots I see in use, the common convention seems to be looking for messages with a `!` prefix. Name spacing only seems to occur on the function level. This i... 
[19:00:36] (03CR) 10Dzahn: [V:03+1] gerrit: use systemd::sysuser, reserved UID/GID, new name for daemon user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:00:38] (03PS6) 10Dzahn: gerrit: use systemd::sysuser, reserved UID/GID, new name for daemon user [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) [19:02:23] (03CR) 10Ssingh: [C:03+2] P:dns:auth: exit on non-zero code from clush [puppet] - 10https://gerrit.wikimedia.org/r/1082541 (owner: 10Ssingh) [19:02:28] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10255873 (10Ottomata) > Exactly yes, I didn't explain myself clearly. But at this point, if I need to subscrib... [19:02:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [19:03:37] (03CR) 10Dzahn: "To fully automate this (make autosync work) you will have to apply the quickdatacopy on both sides, active and passive. 
Not only on passiv" [puppet] - 10https://gerrit.wikimedia.org/r/1082325 (https://phabricator.wikimedia.org/T375143) (owner: 10Andrea Denisse) [19:04:11] (03Merged) 10jenkins-bot: Adjust return type documentation on SuggestedEdits [extensions/GrowthExperiments] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082539 (https://phabricator.wikimedia.org/T378003) (owner: 10Ahmon Dancy) [19:04:50] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1082539|Adjust return type documentation on SuggestedEdits (T378003)]] [19:05:21] T378003: mwext-php74-phan CI job failing on mediawiki/extensions/GrowthExperiments - https://phabricator.wikimedia.org/T378003 [19:09:39] !log dummy authdns-update run [19:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:13] (03CR) 10Dzahn: "Thank you very much, Amir! I tested and this isn't wrong but there is unfortunately more to it. I am still getting a ""Access denied for" [puppet] - 10https://gerrit.wikimedia.org/r/1080781 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [19:12:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:13:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:13:29] !log dancy@deploy2002 dancy: Backport for [[gerrit:1082539|Adjust return type documentation on SuggestedEdits (T378003)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:13:33] !log dancy@deploy2002 dancy: Continuing with sync [19:13:46] T378003: mwext-php74-phan CI job failing on mediawiki/extensions/GrowthExperiments - https://phabricator.wikimedia.org/T378003 [19:15:25] FIRING: [8x] SystemdUnitFailed: 
httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:16:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:18:10] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082539|Adjust return type documentation on SuggestedEdits (T378003)]] (duration: 13m 20s) [19:19:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:24:34] (03CR) 10Ssingh: [C:03+1] "Can't see anything obvious missing even if I don't fully understand the context!" [dns] - 10https://gerrit.wikimedia.org/r/1082525 (https://phabricator.wikimedia.org/T374176) (owner: 10Cathal Mooney) [19:25:57] 06SRE, 06Infrastructure-Foundations, 10Mail: postfix mx puppetry - https://phabricator.wikimedia.org/T325395#10255974 (10jhathaway) 05Open→03Resolved [19:26:10] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix MTA Profile - https://phabricator.wikimedia.org/T325398#10255976 (10jhathaway) 05Open→03Resolved [19:27:36] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mta-outbound-infra - https://phabricator.wikimedia.org/T325402#10255981 (10jhathaway) 05Open→03Invalid architecture was dropped in favor of only having mx-in and mx-out hosts. 
[19:27:43] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#10255984 (10jhathaway) 05Open→03Resolved [19:28:14] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mta-inbound-infra - https://phabricator.wikimedia.org/T325401#10255978 (10jhathaway) 05Open→03Invalid architecture was dropped in favor of only having mx-in and mx-out hosts. [19:29:00] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in - https://phabricator.wikimedia.org/T325406#10255987 (10jhathaway) 05Open→03Resolved [19:30:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:31:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:33:02] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:33:06] (03CR) 10Dzahn: "Using https://phabricator.wikimedia.org/T377643#10255989 to follow-up and debug." 
[puppet] - 10https://gerrit.wikimedia.org/r/1080781 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [19:36:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:37:42] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160#10256013 (10Dzahn) Thank you @jhathaway, cool. I think we... [19:38:49] (03PS1) 10AOkoth: aptrepo: upgrade gitlab-ce and gitlab-runner to 17.3 [puppet] - 10https://gerrit.wikimedia.org/r/1082547 (https://phabricator.wikimedia.org/T378016) [19:39:26] (03Abandoned) 10BBlack: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/758511 (owner: 10BBlack) [19:39:26] (03Abandoned) 10BBlack: Remove disc-appservers-ro from mock_etc geo file [dns] - 10https://gerrit.wikimedia.org/r/1054660 (owner: 10BBlack) [19:39:26] (03Abandoned) 10BBlack: Switch appservers-ro to active/passive [dns] - 10https://gerrit.wikimedia.org/r/1054659 (owner: 10BBlack) [19:39:26] (03Abandoned) 10BBlack: Add disc-appservers-ro to mock_etc metafo [dns] - 10https://gerrit.wikimedia.org/r/1054658 (owner: 10BBlack) [19:39:27] (03Abandoned) 10BBlack: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/551252 (owner: 10BBlack) [19:39:59] (03PS1) 10Ssingh: tox.ini: add Python 3.11 to interpreters [dns] - 10https://gerrit.wikimedia.org/r/1082548 [19:40:14] (03Abandoned) 10BBlack: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/700323 (owner: 10Giuseppe Lavagetto) [19:44:01] (03PS1) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) [19:44:07] (03PS2) 10Abijeet Patro: tables-catalog: 
Add translate_cache table [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) [19:45:25] RESOLVED: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:46:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:48:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160#10256047 (10Dzahn) a:05Dzahn→03None [19:49:14] (03PS40) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [19:49:55] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 10Znuny: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160#10256044 (10Dzahn) 05Open→03Resolved a:03Dzahn `... 
[19:55:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:56:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T2000). Please do the needful. [20:00:06] tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:01] (03PS1) 10Dzahn: aptrepo: allow versions > 17.3 < 17.4 for gitlab-ce and gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/1082552 (https://phabricator.wikimedia.org/T378016) [20:02:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [20:03:02] (03CR) 10Dzahn: [C:03+1] "nitpick: I would call this something like "allow versions > 17.3", but yes :) thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1082547 (https://phabricator.wikimedia.org/T378016) (owner: 10AOkoth) [20:03:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10256091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [20:03:39] (03Abandoned) 10Dzahn: aptrepo: allow versions > 17.3 < 17.4 for gitlab-ce and gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/1082552 (https://phabricator.wikimedia.org/T378016) (owner: 10Dzahn) [20:04:03] (03CR) 10Dzahn: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 17.3 [puppet] - 10https://gerrit.wikimedia.org/r/1082547 (https://phabricator.wikimedia.org/T378016) (owner: 10AOkoth) 
[20:04:43] tgr: if you're around, are you able to self-deploy? [20:05:27] dancy: is it OK to go on with the backports or do you want to reserve more time for following up on the train blocker? [20:05:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:05:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:06:27] tgr: it's okay to proceed with backports [20:08:10] thx. cjming yeah I can deploy [20:08:28] (03PS15) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [20:08:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:08:57] cool [20:09:03] (03CR) 10CI reject: [V:04-1] elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:10:24] (03PS16) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [20:10:39] (03PS1) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1082553 (https://phabricator.wikimedia.org/T367204) [20:15:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:17:00] 10ops-drmrs, 10ops-eqiad, 06DC-Ops: Clean up old drmrs-eqiad circuit CRT-009240 - https://phabricator.wikimedia.org/T370023#10256140 (10wiki_willy) Hi @RobH & @VRiley-WMF - can you provide an update on this one? We're still paying for the cross-connect, until it gets decom'd. 
[20:17:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082464 (https://phabricator.wikimedia.org/T372702) (owner: 10Gergő Tisza) [20:17:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082465 (https://phabricator.wikimedia.org/T372702) (owner: 10Gergő Tisza) [20:17:33] (03PS17) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [20:18:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:19:40] 10ops-drmrs, 10ops-eqiad, 06DC-Ops: Clean up old drmrs-eqiad circuit CRT-009240 - https://phabricator.wikimedia.org/T370023#10256159 (10RobH) [20:23:38] 10ops-drmrs, 10ops-eqiad, 06DC-Ops: Clean up old drmrs-eqiad circuit CRT-009240 - https://phabricator.wikimedia.org/T370023#10256177 (10RobH) [20:23:38] (03PS18) 10Bking: elasticsearch: monitor snapshot repository [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) [20:23:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074540 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:28:26] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021 (10jhathaway) 03NEW [20:28:33] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#10256211 (10jhathaway) p:05Triage→03Medium [20:29:06] 06SRE, 06Data-Platform-SRE, 06serviceops: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 - https://phabricator.wikimedia.org/T373490#10256231 
(10Ottomata) [20:29:16] (03PS5) 10Dzahn: site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 [20:29:19] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in-lists - https://phabricator.wikimedia.org/T325404#10256214 (10jhathaway) p:05Low→03Medium [20:29:31] 06SRE, 06Infrastructure-Foundations, 10Mail: MTA Provisioning - https://phabricator.wikimedia.org/T325403#10256235 (10jhathaway) [20:29:32] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in-lists - https://phabricator.wikimedia.org/T325404#10256233 (10jhathaway) [20:29:32] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#10256234 (10jhathaway) [20:31:08] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out-lists - https://phabricator.wikimedia.org/T325405#10256236 (10jhathaway) [20:31:10] 10ops-drmrs, 10ops-eqiad, 06SRE, 06DC-Ops: Clean up old drmrs-eqiad circuit CRT-009240 - https://phabricator.wikimedia.org/T370023#10256238 (10RobH) [20:31:13] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out-lists - https://phabricator.wikimedia.org/T325405#10256239 (10jhathaway) [20:31:15] (03CR) 10Dzahn: [V:04-1] "Unless this is soon going to be removed anyways, as I just realized. 
I am confused though because the host is here in site.pp and unknown" [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [20:31:16] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#10256240 (10jhathaway) [20:31:18] 06SRE, 06Infrastructure-Foundations, 10Mail: MTA Provisioning - https://phabricator.wikimedia.org/T325403#10256241 (10jhathaway) [20:32:22] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [20:32:34] (03CR) 10Dzahn: [V:04-1 C:04-1] "-1 for this specific PS, but you get the idea..." [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [20:33:07] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim null client config with a Postfix null client config - https://phabricator.wikimedia.org/T325408#10256255 (10jhathaway) [20:36:09] 06SRE, 06Infrastructure-Foundations, 10Mail: MTA provisioning - https://phabricator.wikimedia.org/T325403#10256261 (10jhathaway) [20:36:17] 06SRE, 06Infrastructure-Foundations, 10Mail: MTA provisioning - https://phabricator.wikimedia.org/T325403#10256265 (10jhathaway) 05Open→03Resolved [20:41:19] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [20:45:21] 06SRE, 06Infrastructure-Foundations, 10Mail: Decom Exim based mx{1001,2001}.wikimedia.org - https://phabricator.wikimedia.org/T325409#10256361 (10jhathaway) [20:45:26] 06SRE, 06Infrastructure-Foundations, 10Mail: Decom Exim based mx{1001,2001}.wikimedia.org - https://phabricator.wikimedia.org/T325409#10256362 (10jhathaway) 05Open→03Resolved [20:46:22] (03PS41) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 
(https://phabricator.wikimedia.org/T367204) [20:46:23] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20241023 [20:47:20] (03Abandoned) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1082553 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [20:48:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256380 (10cmooney) [20:49:37] (03Merged) 10jenkins-bot: SessionManager: Add more logging when unpersisting invalid sessions [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082464 (https://phabricator.wikimedia.org/T372702) (owner: 10Gergő Tisza) [20:49:40] (03Merged) 10jenkins-bot: Log unexpected central session lookup misses [extensions/CentralAuth] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1082465 (https://phabricator.wikimedia.org/T372702) (owner: 10Gergő Tisza) [20:50:02] (03PS1) 10Varnent: Update Office Wiki favicon to use wmf.ico and also delete now unused office.ico file. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) [20:50:10] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1082464|SessionManager: Add more logging when unpersisting invalid sessions (T372702)]], [[gerrit:1082465|Log unexpected central session lookup misses (T372702)]] [20:50:33] T372702: editors are repeatedly getting logged out (August 2024) - https://phabricator.wikimedia.org/T372702 [20:51:00] 06SRE, 06Infrastructure-Foundations, 10Mail: Integration tests - https://phabricator.wikimedia.org/T358355#10256386 (10jhathaway) [20:52:43] !log tgr@deploy2002 tgr: Backport for [[gerrit:1082464|SessionManager: Add more logging when unpersisting invalid sessions (T372702)]], [[gerrit:1082465|Log unexpected central session lookup misses (T372702)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:53:46] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20241023 [20:55:31] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release 20241023 [20:58:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256417 (10cmooney) [20:58:40] (03CR) 10BCornwall: [V:03+1 C:03+1] "tox runs happily on 3.11.2." 
[dns] - 10https://gerrit.wikimedia.org/r/1082548 (owner: 10Ssingh) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T2100) [21:00:31] 06SRE, 06Infrastructure-Foundations, 10Mail: Integration tests - https://phabricator.wikimedia.org/T358355#10256420 (10jhathaway) 05Open→03Resolved They are still a bit rough in places, but resolving for now: https://gitlab.wikimedia.org/jhathaway/mx-tests [21:00:37] !log tgr@deploy2002 tgr: Continuing with sync [21:00:42] (03CR) 10BCornwall: [V:03+1 C:03+1] tox.ini: add Python 3.11 to interpreters (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1082548 (owner: 10Ssingh) [21:02:11] (03PS42) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [21:02:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256437 (10Jclark-ctr) @cmooney all cables have been connected for Step 2: Initial cabling for the new devices for switches a... [21:02:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release 20241023 [21:02:43] dzahn@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [21:05:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10256451 (10cmooney) >>! In T377381#10256437, @Jclark-ctr wrote: > @cmooney all cables have been connected for Step 2: Initial... 
[21:05:18] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082464|SessionManager: Add more logging when unpersisting invalid sessions (T372702)]], [[gerrit:1082465|Log unexpected central session lookup misses (T372702)]] (duration: 15m 07s) [21:05:38] T372702: editors are repeatedly getting logged out (August 2024) - https://phabricator.wikimedia.org/T372702 [21:07:35] !log UTC late deploys done [21:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028 (10jhathaway) 03NEW [21:10:41] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10256498 (10jhathaway) p:05Triage→03Low [21:12:11] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on phabricator servers with Postfix - https://phabricator.wikimedia.org/T378029 (10jhathaway) 03NEW [21:12:36] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim on phabricator servers with Postfix - https://phabricator.wikimedia.org/T378029#10256514 (10jhathaway) p:05Triage→03Low [21:16:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [21:16:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10256517 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye executed... [21:22:34] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release 20241023 [21:35:46] (03CR) 10Scott French: "Thanks, claime!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1082478 (https://phabricator.wikimedia.org/T377958) (owner: 10Clément Goubert) [21:38:42] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10256552 (10Eevans) >>! In T370786#10255850, @jhathaway wrote: > For most bots I see in use, the common convention seems to be looking for messages with a `!` prefix. Name spacing only se... [21:39:34] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030 (10RobH) 03NEW [21:40:06] (03PS2) 10Reedy: CacheTme: Add forward namespaced alias [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082565 (https://phabricator.wikimedia.org/T378006) [21:41:32] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10256576 (10RobH) a:03bking Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. Thi... 
[21:41:33] 10ops-eqiad, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10256581 (10RobH) [21:46:11] 10ops-codfw, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031 (10RobH) 03NEW [21:46:32] 10ops-eqiad, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10256603 (10RobH) [21:47:18] 10ops-codfw, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10256609 (10RobH) [21:48:19] 10ops-codfw, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10256620 (10RobH) a:03bking Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. T... [21:50:18] (03PS1) 10Urbanecm: throttle: Add exemption for WikiArabia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082568 (https://phabricator.wikimedia.org/T377957) [21:50:28] jouncebot: nowandnext [21:50:28] For the next 0 hour(s) and 9 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241023T2100) [21:50:28] In 8 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0600) [21:50:28] In 8 hour(s) and 9 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0600) [21:51:18] (03CR) 10Urbanecm: [C:03+2] throttle: Add exemption for WikiArabia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082568 (https://phabricator.wikimedia.org/T377957) (owner: 10Urbanecm) [21:51:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082568 
(https://phabricator.wikimedia.org/T377957) (owner: 10Urbanecm) [21:52:09] (03Merged) 10jenkins-bot: throttle: Add exemption for WikiArabia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082568 (https://phabricator.wikimedia.org/T377957) (owner: 10Urbanecm) [21:52:37] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1082568|throttle: Add exemption for WikiArabia (T377957)]] [21:52:45] T377957: Temporary lift IP cap for WikiArabia 2024 at Muscat on 24-28 October 2024 - https://phabricator.wikimedia.org/T377957 [21:56:14] (03CR) 10Urbanecm: "The old URI will be probably requested by users who have it still cached. For officewiki, I guess that is fine?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082559 (https://phabricator.wikimedia.org/T378026) (owner: 10Varnent) [21:57:08] (03PS2) 10Superzerocool: nlwiki, commonswiki, wikidata: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082444 (https://phabricator.wikimedia.org/T377930) [21:59:43] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1082568|throttle: Add exemption for WikiArabia (T377957)]] (duration: 07m 06s) [21:59:48] T377957: Temporary lift IP cap for WikiArabia 2024 at Muscat on 24-28 October 2024 - https://phabricator.wikimedia.org/T377957 [22:01:29] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:08:24] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [22:21:52] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [22:23:46] jouncebot: nowandnext [22:23:46] No 
deployments scheduled for the next 7 hour(s) and 36 minute(s) [22:23:46] In 7 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0600) [22:23:47] In 7 hour(s) and 36 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241024T0600) [22:24:40] (03CR) 10Reedy: [C:03+2] CacheTme: Add forward namespaced alias [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082565 (https://phabricator.wikimedia.org/T378006) (owner: 10Reedy) [22:26:07] (03CR) 10Arlolra: [C:03+1] pcs: Configure prometheus metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082473 (https://phabricator.wikimedia.org/T372749) (owner: 10Jgiannelos) [22:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:30:25] (03PS2) 10Scott French: shellbox: pin all instances at live image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082317 (https://phabricator.wikimedia.org/T375243) [22:30:27] (03PS2) 10Scott French: shellbox-syntaxhighlight: upgrade to 2024-10-15-214239 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082318 (https://phabricator.wikimedia.org/T375243) [22:30:28] (03PS3) 10Scott French: shellbox: upgrade to 2024-10-15-214239 (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082319 (https://phabricator.wikimedia.org/T375243) [22:32:10] 10ops-codfw, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034 (10RobH) 03NEW [22:32:12] I'm seeing lots of `Database servers in cluster26 are overloaded. 
` in mediawiki logs [22:32:29] Same for cluster28 [22:32:58] Looks like an event that may have passed. [22:33:03] 10ops-codfw, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10256737 (10RobH) [22:33:29] 10ops-codfw, 06DC-Ops, 06Discovery-Search: Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10256733 (10RobH) a:03bking [22:34:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:42:52] (03CR) 10Reedy: [C:03+2] "Test failure seems unrelated..." [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082565 (https://phabricator.wikimedia.org/T378006) (owner: 10Reedy) [22:43:44] (03CR) 10Reedy: [C:03+2] "Ah, T377932" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082565 (https://phabricator.wikimedia.org/T378006) (owner: 10Reedy) [22:44:00] (03PS1) 10Reedy: recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082575 (https://phabricator.wikimedia.org/T377932) [22:44:13] (03PS3) 10Reedy: CacheTme: Add forward namespaced alias [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082565 (https://phabricator.wikimedia.org/T378006) [22:44:39] (03CR) 10Reedy: [C:03+2] recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082575 (https://phabricator.wikimedia.org/T377932) (owner: 10Reedy) [22:55:13] 06SRE, 10Sustainability (Incident Followup): create a place (whiteboard) where SRE advertises current site status / things for awareness - https://phabricator.wikimedia.org/T378038 (10Dzahn) 03NEW 
[22:56:39] 06SRE, 10Sustainability (Incident Followup): create a place (whiteboard) where SRE advertises current site status / things for awareness - https://phabricator.wikimedia.org/T378038#10256817 (10Dzahn) [22:58:06] 06SRE, 10Sustainability (Incident Followup): create a place (whiteboard) where SRE advertises current site status / things for awareness - https://phabricator.wikimedia.org/T378038#10256822 (10Dzahn) Any SRE can feel free to edit the ticket description if I missed something or to clarify. This was just a follo... [23:10:55] 06SRE: exception raised for "sre.dns.admin show" - https://phabricator.wikimedia.org/T378039#10256847 (10Dzahn) [23:13:04] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: exception raised for "sre.dns.admin show" - https://phabricator.wikimedia.org/T378039#10256848 (10Dzahn) [23:14:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10256855 (10Jclark-ctr) Confirmed: Service Request 199807744 was successfully submitted. 
[23:15:12] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383#10256850 (10Dzahn) a:05jhathaway→03TAndic [23:16:33] (03Merged) 10jenkins-bot: recentchanges: Use current time for imported revision category changes [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082575 (https://phabricator.wikimedia.org/T377932) (owner: 10Reedy) [23:18:13] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1082264/4366/" [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [23:23:10] (03Merged) 10jenkins-bot: CacheTme: Add forward namespaced alias [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1082565 (https://phabricator.wikimedia.org/T378006) (owner: 10Reedy) [23:24:23] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10256863 (10Eevans) >>! In T377827#10252919, @MatthewVernon wrote: > Yes, per [[ https://www.sqlite.org/lang_vacuum.html | the docs ]],... [23:26:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed on ms-be1075 - https://phabricator.wikimedia.org/T377109#10256865 (10Jclark-ctr) ` jclark@ms-be1075:~$ for disk in $(lsblk -dn -o NAME); do echo "Device: /dev/$disk"; udevadm info -q property -n /dev/$disk | grep -E "ID_SERIAL|ID_PATH"; d... 
[23:26:33] (03PS1) 10Zabe: s8: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082579 (https://phabricator.wikimedia.org/T183490) [23:29:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:29:17] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383#10256869 (10TAndic) 05Open→03Resolved Thank you @hnowlan ! Everything is working. In case it helps someone in the future, I needed to make minor adjustments to my... [23:29:51] (03CR) 10Dzahn: [V:03+1 C:03+2] "CC: Hashar. Complete noop on existing prod servers, but doing this on the new gerrit2003 machine which is masked and not getting any traff" [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [23:30:35] (03CR) 10Dzahn: [V:03+1 C:03+2] "using the UID/GID you already reserved in the past but we didn't get to use yet." [puppet] - 10https://gerrit.wikimedia.org/r/1082264 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [23:31:59] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed on ms-be1075 - https://phabricator.wikimedia.org/T377109#10256879 (10Jclark-ctr) located sdt serial in idrac hardware inventory slot 19 SerialNumber WSD79SJJ ` | Device: /dev/sdt ID_SERIAL=ST8000NM012A-2KE131_WSD79SJJ ID_SERIAL_SHORT=W... [23:34:07] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383#10256882 (10Dzahn) It's perfect that you marked it as resolved. Thanks for confirming. P.S. There is a small typo in there. It's just "eqiad" instead of "equiad". 
[23:34:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed on ms-be1075 - https://phabricator.wikimedia.org/T377109#10256883 (10Jclark-ctr) 05Open→03Resolved [23:34:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [23:38:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082580 [23:38:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1082580 (owner: 10TrainBranchBot) [23:38:34] (03CR) 10Ssingh: tox.ini: add Python 3.11 to interpreters (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1082548 (owner: 10Ssingh) [23:39:11] (03PS2) 10Ssingh: tox.ini: add Python 3.11 to interpreters (and remove 3.7) [dns] - 10https://gerrit.wikimedia.org/r/1082548 [23:39:30] !log reedy@deploy2002 Started scap sync-world: T378006 [23:39:39] T378006: Cannot declare class CacheTime, because the name is already in use in CacheTime.php - https://phabricator.wikimedia.org/T378006 [23:39:58] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: exception raised for "sre.dns.admin show" - https://phabricator.wikimedia.org/T378039#10256892 (10ssingh) Thanks for filing this task! It's a known issue as documented in T365454#10179477 as well. That being said and in the meantime, I am curious to hear if... [23:44:26] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: exception raised for "sre.dns.admin show" - https://phabricator.wikimedia.org/T378039#10256896 (10Dzahn) >>! In T378039#10256892, @ssingh wrote: > That being said and in the meantime, I am curious to hear if you have a suggestion on how to improve this text.... 
[23:46:39] !log reedy@deploy2002 Finished scap sync-world: T378006 (duration: 07m 09s) [23:46:44] T378006: Cannot declare class CacheTime, because the name is already in use in CacheTime.php - https://phabricator.wikimedia.org/T378006 [23:47:14] dancy: ^ Train should be good to attempt to try and roll forward tomorrow or so [23:47:59] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply openjdk upgrade (11.0.25+9-1~deb11u1) - eevans@cumin1002 [23:48:29] Yay! Thanks to everyone involved in fixin stuff [23:48:53] I have another patch in master that could be considered some further hardening [23:48:56] Tim is having a look at it atm [23:50:49] (03CR) 10Tacsipacsi: "This patch changes no files. I guess this is not what you wanted to do?" [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [23:57:13] dzahn@cumin2002 dzahn: The backup on gitlab2002 is complete, ready to proceed with upgrade.