[00:38:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1101960 [00:38:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1101960 (owner: 10TrainBranchBot) [00:50:52] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, xfer wdqs scholarly 2023(public)->2026(internal)) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2027.codfw.wmnet w/ force delete existing files, repooling both afterwards [00:50:56] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [00:56:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1101960 (owner: 10TrainBranchBot) [01:08:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1101966 [01:08:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1101966 (owner: 10TrainBranchBot) [01:19:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [01:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [01:20:58] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10395667 (10Dwisehaupt) @Papaul 
@Jhancock.wm I'm getting to building these hosts now (so many other things were pre-reqs) and they are starting out ok. Except fransw2002 is not reac... [01:29:52] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1101966 (owner: 10TrainBranchBot) [01:31:27] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10395679 (10Papaul) @Dwisehaupt yes the host is still racked in C7 we are waiting for civic2001 to be decom so we can move it into C8/U16. For the issue about the host is not reacha... [01:36:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, xfer wdqs scholarly 2023(public)->2026(internal)) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2027.codfw.wmnet w/ force delete existing files, repooling both afterwards [01:36:32] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [01:40:08] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 206636680 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:41:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 42592 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:49:33] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10395689 (10Dwisehaupt) @Papaul Thanks. I'll take a look at civi2001. I believe we need to keep it in place through the end of the month (just in case for big english) and then we c... 
[01:52:32] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/d486464159cce853466b996ebd3d3e2d81d20cbb42a6103376b28d0acc67c450/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:12:32] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:06] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 8406 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [03:06:42] 
RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:18:10] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10395739 (10Papaul) @Dwisehaupt there is no rush on our end. You can take your time on that, [04:07:35] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [04:13:46] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: CRITICAL - Host Unreachable (2001:df2:e500:201:103:102:166:20) [05:19:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [05:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [06:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:33:05] (03PS1) 10Func: ve.ui.CodeMirror.v6: Use plugin callback to load the actual module [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102141 (https://phabricator.wikimedia.org/T374072) [06:35:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102141 (https://phabricator.wikimedia.org/T374072) (owner: 10Func) [06:37:31] (03PS1) 10Func: styles: Avoid misalignments when line numbering is disabled [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102142 (https://phabricator.wikimedia.org/T381714) [06:38:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102142 (https://phabricator.wikimedia.org/T381714) (owner: 10Func) [06:50:29] (03PS1) 10Kevin Bazira: APIGW: Add configuration to expose LW isvc article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102150 (https://phabricator.wikimedia.org/T371897) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T0700) [07:13:38] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10395916 (10Ammarpad) >>! In T381851#10394860, @Scott_French wrote: > Thanks, @Ammarpad - It would be great if you could please confirm your SSH public key via a second authenticated channel.... 
[07:36:46] (03PS1) 10Kevin Bazira: httpbb: add post deployment tests for the article-country endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1102201 (https://phabricator.wikimedia.org/T371897) [07:45:03] (03CR) 10Gmodena: data-engineering: add alerts for dumps2 flink app. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [07:47:10] (03PS7) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [07:59:27] (03PS1) 10Novem Linguae: Follow-up I9df39fdcc: Convert missed 'this' to 'el' [extensions/PageTriage] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102205 (https://phabricator.wikimedia.org/T381741) [08:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T0800). nyaa~ [08:00:05] Func: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/PageTriage] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102205 (https://phabricator.wikimedia.org/T381741) (owner: 10Novem Linguae) [08:00:15] o/ [08:01:09] if I'm not too late, I'm going to add one right now [08:08:00] (03PS1) 10Jelto: Rename kubernetes[2011-2014] to wikikube-worker[2180-2183] [puppet] - 10https://gerrit.wikimedia.org/r/1102206 (https://phabricator.wikimedia.org/T377877) [08:14:36] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/archiva 8812 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [08:33:18] think we'll still have the backport? [08:33:58] (03PS1) 10Slyngshede: Release v0.1.4 [software/bitu] - 10https://gerrit.wikimedia.org/r/1102213 [08:47:55] (03CR) 10Alexandros Kosiaris: [C:04-1] "LGTM, but put something (even if it is commented) to showcase the structure of the option in modules/mesh/values.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [08:49:20] (03CR) 10Alexandros Kosiaris: "Same comment as the parent change. Something in values.yaml, even if an example stanza commented to allow a reader to quickly reason about" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [08:58:49] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [08:59:40] (03PS1) 10Marostegui: control-mariadb-client-10.6-bookworm: Added to repo [software] - 10https://gerrit.wikimedia.org/r/1102215 (https://phabricator.wikimedia.org/T380073) [09:00:55] (03CR) 10Jelto: [C:03+1] "this looks good to me from the GitLab side. But I have little knowledge what data the blunderbuss service provides. Please keep in mind th" [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [09:01:13] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes[2011-2014] to wikikube-worker[2180-2183] [puppet] - 10https://gerrit.wikimedia.org/r/1102206 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [09:01:54] (03CR) 10Marostegui: [C:03+2] control-mariadb-client-10.6-bookworm: Added to repo [software] - 10https://gerrit.wikimedia.org/r/1102215 (https://phabricator.wikimedia.org/T380073) (owner: 10Marostegui) [09:02:21] (03Merged) 10jenkins-bot: control-mariadb-client-10.6-bookworm: Added to repo [software] - 10https://gerrit.wikimedia.org/r/1102215 (https://phabricator.wikimedia.org/T380073) (owner: 10Marostegui) [09:02:22] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance idm-test1001:9100) - https://phabricator.wikimedia.org/T381947 (10LSobanski) 03NEW [09:04:27] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2011-2014].codfw.wmnet [09:04:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2136.codfw.wmnet with reason: maintenance [09:04:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2136.codfw.wmnet with reason: maintenance [09:05:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136 to upgrade MariaDB 10.11 T378940', diff saved to 
https://phabricator.wikimedia.org/P71694 and previous config saved to /var/cache/conftool/dbconfig/20241211-090538-marostegui.json [09:05:42] T378940: Compile and package MariaDB 10.11.10 and 10.6.20 - https://phabricator.wikimedia.org/T378940 [09:06:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2011-2014].codfw.wmnet [09:08:49] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2011-2014].codfw.wmnet [09:08:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2011-2014].codfw.wmnet [09:09:52] (03CR) 10Jelto: [C:03+2] Rename kubernetes[2011-2014] to wikikube-worker[2180-2183] [puppet] - 10https://gerrit.wikimedia.org/r/1102206 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [09:10:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71695 and previous config saved to /var/cache/conftool/dbconfig/20241211-091029-root.json [09:11:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10396114 (10Marostegui) a:05ABran-WMF→03Jhancock.wm [09:13:32] (03PS1) 10Brouberol: ceph-csi: remove un-necessary network policies allowing kube api egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102221 (https://phabricator.wikimedia.org/T381264) [09:14:29] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2011 to wikikube-worker2180 [09:14:50] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:16:15] (03PS1) 10Marostegui: installserver: Do not reimage es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1102222 [09:17:12] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - 
kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [09:17:12] status [09:17:22] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [09:17:22] status [09:18:27] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2011 to wikikube-worker2180 - jelto@cumin1002" [09:18:59] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1102222 (owner: 10Marostegui) [09:19:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2011 to wikikube-worker2180 - jelto@cumin1002" [09:19:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:19:05] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2180 [09:19:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2180 [09:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [09:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [09:20:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2011 to wikikube-worker2180 [09:20:36] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2012 to wikikube-worker2181 [09:20:56] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:21:02] (03CR) 10Gmodena: [C:03+2] dse-k8s-services: rename mw-dumps helmfiles. (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [09:22:07] (03Merged) 10jenkins-bot: dse-k8s-services: rename mw-dumps helmfiles. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [09:24:31] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2012 to wikikube-worker2181 - jelto@cumin1002" [09:24:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes2014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2014 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:25:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2012 to wikikube-worker2181 - jelto@cumin1002" [09:25:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:25:00] !log jelto@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2181 [09:25:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2181 [09:25:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71696 and previous config saved to /var/cache/conftool/dbconfig/20241211-092535-root.json [09:26:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2012 to wikikube-worker2181 [09:26:37] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2013 to wikikube-worker2182 [09:26:57] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:30:39] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2013 to wikikube-worker2182 - jelto@cumin1002" [09:30:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2013 to wikikube-worker2182 - jelto@cumin1002" [09:30:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:30:57] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2182 [09:30:59] (03PS1) 10Marostegui: production-m1.sql.erb: Upgrade grants [puppet] - 10https://gerrit.wikimedia.org/r/1102226 (https://phabricator.wikimedia.org/T367380) [09:31:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2182 [09:31:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2013 to wikikube-worker2182 [09:32:18] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2014 to wikikube-worker2183 [09:32:27] !log elukey@cumin1002 END (FAIL) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [09:32:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-lab1002.eqiad.wmnet with OS bookworm [09:32:34] (03CR) 10Btullis: [C:03+1] "Great! Thanks for looking into this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102221 (https://phabricator.wikimedia.org/T381264) (owner: 10Brouberol) [09:32:39] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:33:13] (03CR) 10Brouberol: [C:03+2] ceph-csi: remove un-necessary network policies allowing kube api egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102221 (https://phabricator.wikimedia.org/T381264) (owner: 10Brouberol) [09:35:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [09:35:34] (03PS1) 10Marostegui: report_users.sh: Add dbproxy2005 IP [software] - 10https://gerrit.wikimedia.org/r/1102228 (https://phabricator.wikimedia.org/T367380) [09:36:13] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2014 to wikikube-worker2183 - jelto@cumin1002" [09:36:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2014 to wikikube-worker2183 - jelto@cumin1002" [09:36:32] (03CR) 10Marostegui: "This is a NOOP - grants added to the DB" [puppet] - 10https://gerrit.wikimedia.org/r/1102226 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [09:36:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:36:32] !log jelto@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2183 [09:36:33] (03CR) 10Marostegui: [C:03+2] production-m1.sql.erb: Upgrade grants [puppet] - 10https://gerrit.wikimedia.org/r/1102226 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [09:36:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2183 [09:36:49] (03CR) 10Marostegui: [C:03+2] report_users.sh: Add dbproxy2005 IP [software] - 10https://gerrit.wikimedia.org/r/1102228 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [09:37:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2014 to wikikube-worker2183 [09:37:40] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2180.codfw.wmnet wikikube-worker2181.codfw.wmnet wikikube-worker2182.codfw.wmnet wikikube-worker2183.codfw.wmnet on all recursors [09:37:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2180.codfw.wmnet wikikube-worker2181.codfw.wmnet wikikube-worker2182.codfw.wmnet wikikube-worker2183.codfw.wmnet on all recursors [09:39:52] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2180.codfw.wmnet with OS bookworm [09:40:02] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2180 [09:40:05] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2181.codfw.wmnet with OS bookworm [09:40:14] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2182.codfw.wmnet with OS bookworm [09:40:25] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2182 [09:40:29] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2183.codfw.wmnet with OS bookworm [09:40:32] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:40:41] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2136 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71697 and previous config saved to /var/cache/conftool/dbconfig/20241211-094040-root.json [09:42:07] (03PS1) 10Marostegui: wmnet: Update m1-master.codfw.wmnet CNAME [dns] - 10https://gerrit.wikimedia.org/r/1102233 (https://phabricator.wikimedia.org/T367380) [09:44:17] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2180 - jelto@cumin1002" [09:44:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2180 - jelto@cumin1002" [09:44:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:44:21] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2180.codfw.wmnet 109.32.192.10.in-addr.arpa 9.0.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:44:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2180.codfw.wmnet 109.32.192.10.in-addr.arpa 9.0.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:44:25] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2180 [09:44:27] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:44:27] (03CR) 10Marostegui: [C:03+2] wmnet: Update m1-master.codfw.wmnet CNAME [dns] - 10https://gerrit.wikimedia.org/r/1102233 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [09:44:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2180 [09:44:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2180 [09:44:51] !log jelto@cumin1002 
START - Cookbook sre.hosts.move-vlan for host wikikube-worker2181 [09:46:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:46:50] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2182.codfw.wmnet 28.48.192.10.in-addr.arpa 8.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:46:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2182.codfw.wmnet 28.48.192.10.in-addr.arpa 8.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:46:54] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2182 [09:47:21] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:47:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2182 [09:47:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2182 [09:48:15] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2183 [09:49:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098045 (https://phabricator.wikimedia.org/T377809) (owner: 10Joely Rooke WMDE) [09:50:55] (03PS1) 10Marostegui: mariadb: Update dbproxy200(1,5) notes [puppet] - 10https://gerrit.wikimedia.org/r/1102237 [09:51:00] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2181 - jelto@cumin1002" [09:51:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2181 - 
jelto@cumin1002" [09:51:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:51:06] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2181.codfw.wmnet 110.32.192.10.in-addr.arpa 0.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:51:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2181.codfw.wmnet 110.32.192.10.in-addr.arpa 0.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:51:09] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2181 [09:51:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2181 [09:51:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2181 [09:51:37] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:51:50] (03PS1) 10Marostegui: report_users.sh: Change variable [software] - 10https://gerrit.wikimedia.org/r/1102238 [09:52:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10396280 (10JMeybohm) >>! In T379717#10395266, @VRiley-WMF wrote: > Can we proceed with swapping th... 
[09:52:27] (03CR) 10Marostegui: [C:03+2] mariadb: Update dbproxy200(1,5) notes [puppet] - 10https://gerrit.wikimedia.org/r/1102237 (owner: 10Marostegui) [09:52:46] (03CR) 10Marostegui: [C:03+2] report_users.sh: Change variable [software] - 10https://gerrit.wikimedia.org/r/1102238 (owner: 10Marostegui) [09:55:21] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2183 - jelto@cumin1002" [09:55:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2183 - jelto@cumin1002" [09:55:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:55:26] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2183.codfw.wmnet 29.48.192.10.in-addr.arpa 9.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:55:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2183.codfw.wmnet 29.48.192.10.in-addr.arpa 9.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:55:30] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2183 [09:55:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71698 and previous config saved to /var/cache/conftool/dbconfig/20241211-095546-root.json [09:56:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2183 [09:56:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2183 [09:58:46] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@416a3c0]: Backfill webrequest 
actor metrics rollup hourly 2024 12 [09:59:49] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@416a3c0]: Backfill webrequest actor metrics rollup hourly 2024 12 (duration: 01m 02s) [10:01:23] (03CR) 10JMeybohm: charts: Add kartotherian (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [10:02:23] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2180.codfw.wmnet with reason: host reimage [10:03:58] (03PS1) 10Elukey: dockerfile: fix upstream_version filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102240 [10:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:04:57] (03PS1) 10Marostegui: report_users.sh: Use cumin2024 [software] - 10https://gerrit.wikimedia.org/r/1102241 [10:06:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2180.codfw.wmnet with reason: host reimage [10:06:42] (03CR) 10Marostegui: [C:03+2] report_users.sh: Use cumin2024 [software] - 10https://gerrit.wikimedia.org/r/1102241 (owner: 10Marostegui) [10:08:06] (03PS1) 10Marostegui: production-m2.sql.erb: Replaced dbproxy2002 with dbproxy2006 [puppet] - 10https://gerrit.wikimedia.org/r/1102242 (https://phabricator.wikimedia.org/T367380) [10:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:48] 
(03CR) 10Marostegui: [C:03+2] production-m2.sql.erb: Replaced dbproxy2002 with dbproxy2006 [puppet] - 10https://gerrit.wikimedia.org/r/1102242 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:10:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71699 and previous config saved to /var/cache/conftool/dbconfig/20241211-101051-root.json [10:12:43] (03PS1) 10Marostegui: report_users.sh: Add dbproxy2006 IP [software] - 10https://gerrit.wikimedia.org/r/1102244 (https://phabricator.wikimedia.org/T367380) [10:12:49] (03PS9) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [10:12:49] (03PS5) 10Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) [10:12:50] (03PS5) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [10:12:55] (03CR) 10Elukey: charts: Add kartotherian (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [10:13:28] (03CR) 10Marostegui: [C:03+2] report_users.sh: Add dbproxy2006 IP [software] - 10https://gerrit.wikimedia.org/r/1102244 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:14:00] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2183.codfw.wmnet with reason: host reimage [10:15:21] (03PS1) 10Marostegui: wmnet: Promote dbproxy2006 to m2 master [dns] - 10https://gerrit.wikimedia.org/r/1102245 (https://phabricator.wikimedia.org/T367380) [10:16:22] (03PS1) 10Marostegui: mariadb: Update dbproxy200(2,6) notes [puppet] - 10https://gerrit.wikimedia.org/r/1102246 [10:17:05] (03CR) 
10Marostegui: [C:03+2] mariadb: Update dbproxy200(2,6) notes [puppet] - 10https://gerrit.wikimedia.org/r/1102246 (owner: 10Marostegui) [10:17:32] (03CR) 10Marostegui: [C:03+2] wmnet: Promote dbproxy2006 to m2 master [dns] - 10https://gerrit.wikimedia.org/r/1102245 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:17:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2183.codfw.wmnet with reason: host reimage [10:17:52] 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10396354 (10Krinkle) [10:25:26] (03PS1) 10Brouberol: airflow-ml: define DNS records [dns] - 10https://gerrit.wikimedia.org/r/1102249 (https://phabricator.wikimedia.org/T380258) [10:25:33] jouncebot: nowandnext [10:25:34] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [10:25:34] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1100) [10:26:19] (03CR) 10Dreamy Jazz: [C:03+2] Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101913 (owner: 10Cwhite) [10:26:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2180.codfw.wmnet with OS bookworm [10:27:28] (03PS1) 10Marostegui: production-m3.sql.erb: Replace dbproxy2003 with dbproxy2007 [puppet] - 10https://gerrit.wikimedia.org/r/1102250 (https://phabricator.wikimedia.org/T367380) [10:29:25] (03PS1) 10Marostegui: wmnet: Promote dbproxy2007 to m3-codfw master [dns] - 10https://gerrit.wikimedia.org/r/1102251 (https://phabricator.wikimedia.org/T367380) [10:30:03] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - 
https://phabricator.wikimedia.org/T380893#10396433 (10Andrew) These hosts have a somewhat unusual vlan setup, so my guess is something is tripping on that -- paging @cmooney for m... [10:30:29] (03PS1) 10Marostegui: report_users.sh: Add dbproxy2007 IP [software] - 10https://gerrit.wikimedia.org/r/1102253 (https://phabricator.wikimedia.org/T367380) [10:31:08] (03PS1) 10Brouberol: airflow-ml: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) [10:31:13] (03PS1) 10Brouberol: deployment_server: define airflow-ml users [puppet] - 10https://gerrit.wikimedia.org/r/1102255 (https://phabricator.wikimedia.org/T380258) [10:31:15] (03PS1) 10Brouberol: airflow-ml: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1102256 (https://phabricator.wikimedia.org/T380258) [10:31:17] (03PS1) 10Brouberol: airflow-ml: define CAS config [puppet] - 10https://gerrit.wikimedia.org/r/1102257 (https://phabricator.wikimedia.org/T380258) [10:31:19] (03PS1) 10Brouberol: openldap: define new offloaded airflow-ml-ops group [puppet] - 10https://gerrit.wikimedia.org/r/1102258 (https://phabricator.wikimedia.org/T380258) [10:32:31] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2182.codfw.wmnet with OS bookworm [10:33:20] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2182.codfw.wmnet with OS bookworm [10:33:23] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2182 [10:33:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2182 [10:33:55] (03CR) 10Marostegui: [C:03+2] production-m3.sql.erb: Replace dbproxy2003 with dbproxy2007 [puppet] - 10https://gerrit.wikimedia.org/r/1102250 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:34:08] (03CR) 10Marostegui: [C:03+2] report_users.sh: Add dbproxy2007 
IP [software] - 10https://gerrit.wikimedia.org/r/1102253 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:34:12] (03CR) 10Marostegui: [C:03+2] wmnet: Promote dbproxy2007 to m3-codfw master [dns] - 10https://gerrit.wikimedia.org/r/1102251 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:34:34] (03Merged) 10jenkins-bot: report_users.sh: Add dbproxy2007 IP [software] - 10https://gerrit.wikimedia.org/r/1102253 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:37:23] (03PS1) 10Marostegui: mariadb: Update dbproxy200(3,7) notes [puppet] - 10https://gerrit.wikimedia.org/r/1102259 [10:39:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2183.codfw.wmnet with OS bookworm [10:43:40] (03PS1) 10Marostegui: production-m5.sql.erb: Upgrade dbproxy grants [puppet] - 10https://gerrit.wikimedia.org/r/1102260 (https://phabricator.wikimedia.org/T367380) [10:45:20] (03CR) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) (owner: 10Fabfur) [10:45:27] (03PS5) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) [10:46:02] (03CR) 10Marostegui: [C:03+2] mariadb: Update dbproxy200(3,7) notes [puppet] - 10https://gerrit.wikimedia.org/r/1102259 (owner: 10Marostegui) [10:46:10] (03CR) 10Marostegui: [C:03+2] production-m5.sql.erb: Upgrade dbproxy grants [puppet] - 10https://gerrit.wikimedia.org/r/1102260 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:46:23] (03Merged) 10jenkins-bot: Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101913 (owner: 10Cwhite) [10:49:44] (03PS1) 10Marostegui: report_users.sh: Add dbproxy2008 IP [software] - 10https://gerrit.wikimedia.org/r/1102261 
(https://phabricator.wikimedia.org/T367380) [10:49:47] (03PS1) 10Marostegui: wmnet: Promote dbproxy2008 to m3-codfw master [dns] - 10https://gerrit.wikimedia.org/r/1102262 (https://phabricator.wikimedia.org/T367380) [10:51:49] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2182.codfw.wmnet with reason: host reimage [10:53:09] (03CR) 10Marostegui: [C:03+2] report_users.sh: Add dbproxy2008 IP [software] - 10https://gerrit.wikimedia.org/r/1102261 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:53:22] (03CR) 10Marostegui: [C:03+2] wmnet: Promote dbproxy2008 to m3-codfw master [dns] - 10https://gerrit.wikimedia.org/r/1102262 (https://phabricator.wikimedia.org/T367380) (owner: 10Marostegui) [10:54:36] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [10:54:44] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1101913|Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats"]] [10:55:04] (03PS1) 10Marostegui: mariadb: Update dbproxy2004,dbproxy2008 notes [puppet] - 10https://gerrit.wikimedia.org/r/1102263 [10:55:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2182.codfw.wmnet with reason: host reimage [10:55:46] (03CR) 10Marostegui: [C:03+2] mariadb: Update dbproxy2004,dbproxy2008 notes [puppet] - 10https://gerrit.wikimedia.org/r/1102263 (owner: 10Marostegui) [10:58:46] !log merging https://gerrit.wikimedia.org/r/c/operations/dns/+/1100084 to direct Argentina, Chile, Uruguay to magru (T359054) [10:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:25] !log dreamyjazz@deploy2002 dreamyjazz, cwhite: Backport for [[gerrit:1101913|Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats"]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:59:57] (03CR) 10Fabfur: [C:03+2] Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) (owner: 10Fabfur) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1100) [11:00:08] (03PS6) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) [11:00:19] (03CR) 10Fabfur: [V:03+2 C:03+2] Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) (owner: 10Fabfur) [11:03:21] jouncebot: now [11:03:21] For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1100) [11:03:51] !log dreamyjazz@deploy2002 dreamyjazz, cwhite: Continuing with sync [11:09:06] (03CR) 10Klausman: [C:03+1] APIGW: Add configuration to expose LW isvc article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102150 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [11:09:06] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101913|Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats"]] (duration: 14m 22s) [11:11:39] jouncebot: nowandnext [11:11:40] For the next 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1100) [11:11:40] In 0 hour(s) and 48 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1200) [11:11:48] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2181.codfw.wmnet with OS bookworm [11:12:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage 
for host wikikube-worker2181.codfw.wmnet with OS bookworm [11:12:19] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2181 [11:12:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2181 [11:13:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [11:13:26] (03PS2) 10Brouberol: airflow-ml: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) [11:13:26] (03PS1) 10Brouberol: airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102268 (https://phabricator.wikimedia.org/T380258) [11:13:56] (03PS2) 10Brouberol: airflow-ml: define DNS records [dns] - 10https://gerrit.wikimedia.org/r/1102249 (https://phabricator.wikimedia.org/T380258) [11:14:20] (03Merged) 10jenkins-bot: Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [11:14:37] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1099213|Prep pilot wiki config for IRS (T374105)]] [11:14:42] T374105: Incident Reporting System - MVP - https://phabricator.wikimedia.org/T374105 [11:15:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2182.codfw.wmnet with OS bookworm [11:17:27] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1099213|Prep pilot wiki config for IRS (T374105)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:17:45] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#10396579 
(10Fabfur) Argentina, Chile and Uruguay now lands on magru by default [11:20:22] !log mszabo@deploy2002 mszabo: Continuing with sync [11:25:02] (03CR) 10Btullis: [C:03+1] "Looks good to me." [dns] - 10https://gerrit.wikimedia.org/r/1102249 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:25:40] (03CR) 10Btullis: [C:03+1] deployment_server: define airflow-ml users [puppet] - 10https://gerrit.wikimedia.org/r/1102255 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:25:41] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099213|Prep pilot wiki config for IRS (T374105)]] (duration: 11m 04s) [11:25:45] T374105: Incident Reporting System - MVP - https://phabricator.wikimedia.org/T374105 [11:26:13] (03CR) 10Btullis: [C:03+1] airflow-ml: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1102256 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:27:08] (03CR) 10Btullis: [C:03+1] "Looks good to me, but let's get someone on I/F to check that they're happy with it, too." [puppet] - 10https://gerrit.wikimedia.org/r/1102257 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:28:33] (03CR) 10Slyngshede: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1102257 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:28:48] (03CR) 10JMeybohm: [C:03+1] "lgtm" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102240 (owner: 10Elukey) [11:28:58] (03CR) 10Btullis: [C:03+1] openldap: define new offloaded airflow-ml-ops group [puppet] - 10https://gerrit.wikimedia.org/r/1102258 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:29:25] (03CR) 10Btullis: [C:03+1] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102268 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:29:47] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2181.codfw.wmnet with reason: host reimage [11:31:18] (03CR) 10Btullis: airflow-ml: define helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:32:28] (03PS1) 10Hnowlan: kubernetes: include idle_timeout and tcp_keepalive in service mesh data [puppet] - 10https://gerrit.wikimedia.org/r/1102272 (https://phabricator.wikimedia.org/T371701) [11:33:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2181.codfw.wmnet with reason: host reimage [11:35:26] (03CR) 10Elukey: [C:03+2] dockerfile: fix upstream_version filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102240 (owner: 10Elukey) [11:37:32] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:41:14] (03Merged) 10jenkins-bot: dockerfile: fix upstream_version filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102240 (owner: 10Elukey) [11:43:48] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4664/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102272 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:44:07] (03CR) 10JMeybohm: [C:03+1] kubernetes: include idle_timeout and tcp_keepalive in service mesh data [puppet] - 10https://gerrit.wikimedia.org/r/1102272 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:44:09] (03PS1) 10Elukey: Release version 4.0.3 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102276 [11:44:17] (03CR) 10JMeybohm: [V:03+1 C:03+1] kubernetes: include idle_timeout and tcp_keepalive in service mesh data [puppet] - 10https://gerrit.wikimedia.org/r/1102272 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [11:51:53] (03CR) 10Brouberol: airflow-ml: define helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:52:56] (03CR) 10Brouberol: airflow-ml: define helmfile and values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:53:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2181.codfw.wmnet with OS bookworm [11:54:44] !log homer 'lsw1-d6-codfw*' commit 'T377877' [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:47] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [11:55:21] (03CR) 10Brouberol: [C:03+2] airflow-ml: define DNS records [dns] - 10https://gerrit.wikimedia.org/r/1102249 
(https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:56:14] !log homer 'lsw1-c1-codfw*' commit 'T377877' [11:56:15] (03CR) 10Brouberol: [C:03+2] deployment_server: define airflow-ml users [puppet] - 10https://gerrit.wikimedia.org/r/1102255 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [11:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:53] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2180-2183].codfw.wmnet [11:57:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2180-2183].codfw.wmnet [11:58:08] (03CR) 10Bartosz Dziewoński: [C:03+1] Fix protocol for .well-known/change-password Apache rule [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza) [11:59:17] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381967 (10Jelto) 03NEW [12:00:05] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1200). Please do the needful. 
[12:00:52] (03PS3) 10Hnowlan: mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) [12:01:45] (03CR) 10Brouberol: [C:03+2] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102268 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [12:02:07] (03PS1) 10KartikMistry: Update cxserver to 2024-12-10-132417-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102278 (https://phabricator.wikimedia.org/T369815) [12:02:20] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101839 (owner: 10PipelineBot) [12:04:02] (03PS2) 10Brouberol: airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102268 (https://phabricator.wikimedia.org/T380258) [12:04:02] (03PS3) 10Brouberol: airflow-ml: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) [12:04:02] (03PS1) 10Brouberol: airflow-ml: define kubernetes namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102280 (https://phabricator.wikimedia.org/T380258) [12:04:03] (03CR) 10Hnowlan: "Given that this comes from the puppet data for the listeners, does it really belong in values.yaml? 
Most other mesh listener options aren'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:04:35] (03CR) 10Hnowlan: [C:03+2] kubernetes: include idle_timeout and tcp_keepalive in service mesh data [puppet] - 10https://gerrit.wikimedia.org/r/1102272 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:04:38] (03PS1) 10Jelto: Rename kubernetes20(17|21|22|24) to wikikube-worker[2184-2187] [puppet] - 10https://gerrit.wikimedia.org/r/1102281 (https://phabricator.wikimedia.org/T377877) [12:04:59] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:05:02] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:05:19] (03CR) 10Btullis: [C:03+1] airflow-ml: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [12:06:28] (03CR) 10Btullis: [C:03+1] airflow-ml: define kubernetes namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102280 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [12:06:33] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101839 (owner: 10PipelineBot) [12:08:11] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:08:37] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:11:03] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:11:06] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:11:29] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:11:35] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:11:56] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:12:42] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:12:57] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:13:02] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [12:14:39] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:15:13] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:18:07] (03PS2) 10Abijeet Patro: Translate: Enable message group subscription for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) [12:18:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:18:40] (03CR) 10CI reject: [V:04-1] Translate: Enable message group subscription for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102283 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:21:31] 06SRE, 10SRE-swift-storage, 06Commons, 10Thumbor, 06Traffic: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#10396801 (10TheDJ) 05Open→03Declined Most likely a device/browser level issue. No... 
[12:22:20] (03PS1) 10Btullis: dse-k8s: Add a namespace for llm-inference work by the ML team [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102284 (https://phabricator.wikimedia.org/T377266) [12:23:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10396810 (10MatthewVernon) 05Resolved→03Open @elukey ms-be2085 is still missing its spinning drives, I'm afraid. I tried setting them to JBOD via the... [12:23:32] 06SRE, 06Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097#10396814 (10TheDJ) @BCornwall is this still an issue ? [12:23:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10396815 (10MatthewVernon) p:05Medium→03High [12:27:39] 06SRE, 10SRE-swift-storage, 06Traffic-Icebox, 07affects-Kiwix-and-openZIM, 07Wikimedia-Performance-recommendation: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217#10396817 (10TheDJ) @MatthewVernon This still needs to happen right ? 
[12:39:43] (03PS4) 10Hnowlan: mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) [12:41:15] (03PS1) 10Btullis: dse-k8s: Add token for the llm-inference namespace [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) [12:43:36] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4665/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [12:45:09] (03PS2) 10Btullis: dse-k8s: Add tokens for the llm-inference namespace [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) [12:47:24] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4666/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [12:47:35] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [12:47:41] (03CR) 10Hnowlan: mediawiki: get mercurius label from mediawiki image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:47:44] (03CR) 10Hnowlan: [C:03+2] mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:48:37] Doing quick cxserver deployment.. 
[12:48:58] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-12-10-132417-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102278 (https://phabricator.wikimedia.org/T369815) (owner: 10KartikMistry) [12:49:54] (03Merged) 10jenkins-bot: mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:50:53] (03Merged) 10jenkins-bot: Update cxserver to 2024-12-10-132417-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102278 (https://phabricator.wikimedia.org/T369815) (owner: 10KartikMistry) [12:52:10] (03PS3) 10Hnowlan: mesh.configuration: dummy commit for 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101917 [12:54:21] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:54:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [12:54:44] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:54:48] (03PS6) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [12:54:48] (03PS3) 10Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) [12:56:01] (03CR) 10Hnowlan: "This is coming from the puppet mesh configuration (where is is documented) and can't be configured at the chart level, so I don't think it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:57:00] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version 
[12:59:45] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [13:00:11] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [13:00:38] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [13:01:10] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [13:02:29] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes20(17|21|22|24) to wikikube-worker[2184-2187] [puppet] - 10https://gerrit.wikimedia.org/r/1102281 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [13:02:40] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102268 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:02:44] (03CR) 10Brouberol: [C:03+2] airflow-ml: define kubernetes namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102280 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:02:51] (03CR) 10Brouberol: [C:03+2] airflow-ml: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:03:07] (03CR) 10Brouberol: [C:03+2] dse-k8s: Add tokens for the llm-inference namespace [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [13:03:32] (03CR) 10Brouberol: [C:03+1] dse-k8s: Add a namespace for llm-inference work by the ML team [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102284 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [13:03:48] (03CR) 10Brouberol: [C:03+1] dse-k8s: Add tokens for the llm-inference namespace [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [13:04:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host 
gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [13:04:23] !log Updated cxserver to 2024-12-10-132417-production (T369815) [13:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:27] T369815: Enable in content Translation the new languages Google Translate supports in June 2024 - https://phabricator.wikimedia.org/T369815 [13:05:33] (03PS1) 10Hnowlan: base: fix typo in CHANGELOG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102307 [13:06:28] (03Merged) 10jenkins-bot: airflow-ml: define kubernetes namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102280 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:06:42] (03Merged) 10jenkins-bot: airflow-ml: register namespaces in cloudnative/ceph operator tenant namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102268 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:07:05] (03Merged) 10jenkins-bot: airflow-ml: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102254 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:08:03] 06SRE, 10SRE-swift-storage, 06Traffic-Icebox, 07affects-Kiwix-and-openZIM, 07Wikimedia-Performance-recommendation: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217#10396995 (10MatthewVernon) @TheDJ we're still emitting old-style ETags. [13:08:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:09:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:11:13] (03PS2) 10Brouberol: airflow-ml: define CAS config [puppet] - 10https://gerrit.wikimedia.org/r/1102257 (https://phabricator.wikimedia.org/T380258) [13:11:13] (03PS2) 10Brouberol: openldap: define new offloaded airflow-ml-ops group [puppet] - 10https://gerrit.wikimedia.org/r/1102258 (https://phabricator.wikimedia.org/T380258) [13:11:13] (03PS2) 10Brouberol: airflow-ml: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1102256 (https://phabricator.wikimedia.org/T380258) [13:11:55] (03CR) 10Klausman: [V:03+2 C:03+2] httpbb: add post deployment tests for the article-country endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1102201 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [13:13:05] (03CR) 10Brouberol: [C:03+2] airflow-ml: define CAS config [puppet] - 10https://gerrit.wikimedia.org/r/1102257 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:13:42] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2017,2021-2022,2024].codfw.wmnet [13:17:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Stuck/bugged BMC on ml-lab1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381902#10397020 (10klausman) The management interface works now, for unclear reasons. Maybe it just took forever to recover from reset(s)? It's all ver... 
[13:17:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:18:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [13:18:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2017,2021-2022,2024].codfw.wmnet [13:19:27] (03CR) 10Jelto: [C:03+2] Rename kubernetes20(17|21|22|24) to wikikube-worker[2184-2187] [puppet] - 10https://gerrit.wikimedia.org/r/1102281 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [13:19:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [13:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [13:21:01] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2017 to wikikube-worker2184 [13:21:22] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:21:36] (03PS1) 10Brouberol: airflow-ml: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102308 (https://phabricator.wikimedia.org/T380258) [13:25:11] (03CR) 10Brouberol: [C:03+2] airflow-ml: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102308 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:25:19] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
Renaming kubernetes2017 to wikikube-worker2184 - jelto@cumin1002" [13:25:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2017 to wikikube-worker2184 - jelto@cumin1002" [13:25:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:25:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2184 [13:25:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2184 [13:26:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:26:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2017 to wikikube-worker2184 [13:27:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [13:27:55] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2021 to wikikube-worker2185 [13:28:16] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:28:48] (03CR) 10Brouberol: [C:03+2] openldap: define new offloaded airflow-ml-ops group [puppet] - 10https://gerrit.wikimedia.org/r/1102258 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:29:10] (03CR) 10Brouberol: [C:03+2] "The underlying application was deployed:" [puppet] - 10https://gerrit.wikimedia.org/r/1102256 (https://phabricator.wikimedia.org/T380258) (owner: 10Brouberol) [13:31:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:31:50] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2021 to 
wikikube-worker2185 - jelto@cumin1002" [13:32:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2021 to wikikube-worker2185 - jelto@cumin1002" [13:32:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:32:44] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2185 [13:32:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2185 [13:33:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2021 to wikikube-worker2185 [13:34:19] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2022 to wikikube-worker2186 [13:34:40] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:39:16] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10397071 (10SLyngshede-WMF) To trigger webauthn for select users, we'll just reuse the groovy script from u2f and set the mfa-method field in LDAP to mfa-webauthn ` cas.authn.mfa... 
[13:39:30] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2022 to wikikube-worker2186 - jelto@cumin1002" [13:40:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2022 to wikikube-worker2186 - jelto@cumin1002" [13:40:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:44] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2186 [13:40:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2186 [13:41:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2022 to wikikube-worker2186 [13:42:23] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2024 to wikikube-worker2187 [13:42:44] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:45:42] (03CR) 10Btullis: [V:03+1 C:03+2] dse-k8s: Add tokens for the llm-inference namespace [puppet] - 10https://gerrit.wikimedia.org/r/1102287 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [13:45:54] (03PS1) 10Brouberol: airflow: enable the support of multiple executors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102312 (https://phabricator.wikimedia.org/T362788) [13:46:44] (03PS2) 10Brouberol: airflow: enable the support of multiple executors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102312 (https://phabricator.wikimedia.org/T362788) [13:47:16] (03PS2) 10Btullis: dse-k8s: Add a namespace for llm-inference work by the ML team [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102284 (https://phabricator.wikimedia.org/T377266) [13:53:13] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming kubernetes2024 to wikikube-worker2187 - jelto@cumin1002" [13:53:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2024 to wikikube-worker2187 - jelto@cumin1002" [13:53:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:35] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2187 [13:53:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2187 [13:54:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2024 to wikikube-worker2187 [13:57:22] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2184.codfw.wmnet wikikube-worker2185.codfw.wmnet wikikube-worker2186.codfw.wmnet wikikube-worker2187.codfw.wmnet on all recursors [13:57:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2184.codfw.wmnet wikikube-worker2185.codfw.wmnet wikikube-worker2186.codfw.wmnet wikikube-worker2187.codfw.wmnet on all recursors [13:59:32] (03PS1) 10Btullis: dse-k8s: Add tokens for mw-content-history-reconcile-enrich namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1400). [14:00:05] Func, arlolra, and joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:10] o/ [14:00:13] hi [14:00:26] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2184.codfw.wmnet with OS bookworm [14:00:29] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2185.codfw.wmnet with OS bookworm [14:00:30] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2186.codfw.wmnet with OS bookworm [14:00:32] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2187.codfw.wmnet with OS bookworm [14:00:37] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2184 [14:00:40] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2186 [14:00:42] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2187 [14:00:50] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:01:42] (03CR) 10Btullis: [C:03+2] dse-k8s: Add a namespace for llm-inference work by the ML team [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102284 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [14:02:20] (03PS2) 10Btullis: dse-k8s: Add tokens for mw-content-history-reconcile-enrich namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) [14:02:46] * TheresNoTime can deploy [14:03:33] (03CR) 10Samtar: [C:03+2] "start deploy" [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102141 (https://phabricator.wikimedia.org/T374072) (owner: 10Func) [14:03:50] (03CR) 10Samtar: [C:03+2] "start deploy" [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102142 (https://phabricator.wikimedia.org/T381714) (owner: 10Func) [14:04:09] while they're merging, joelyrookewmde I'll do yours first [14:04:22] okie dokie [14:04:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098045 
(https://phabricator.wikimedia.org/T377809) (owner: 10Joely Rooke WMDE) [14:04:39] 07sre-alert-triage, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Alert in need of triage: SmartNotHealthy (instance stat1011:9100) - https://phabricator.wikimedia.org/T380835#10397141 (10BTullis) p:05Triage→03Medium a:03BTullis [14:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:04:41] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4668/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) (owner: 10Btullis) [14:04:50] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2186 - jelto@cumin1002" [14:04:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2186 - jelto@cumin1002" [14:04:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:54] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2186.codfw.wmnet 180.48.192.10.in-addr.arpa 0.8.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:04:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2186.codfw.wmnet 180.48.192.10.in-addr.arpa 0.8.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:04:58] !log jelto@cumin1002 
START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2186 [14:04:59] (03Merged) 10jenkins-bot: dse-k8s: Add a namespace for llm-inference work by the ML team [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102284 (https://phabricator.wikimedia.org/T377266) (owner: 10Btullis) [14:05:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2186 [14:05:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2186 [14:05:19] (03Merged) 10jenkins-bot: Remove feature flag which controls wikibase item link location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098045 (https://phabricator.wikimedia.org/T377809) (owner: 10Joely Rooke WMDE) [14:05:29] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:05:39] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1098045|Remove feature flag which controls wikibase item link location (T377809)]] [14:05:43] T377809: Cleanup "Move wikidata item link into Other Projects sidebar" - https://phabricator.wikimedia.org/T377809 [14:06:11] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:06:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:07:48] * Lucas_WMDE also around if needed [14:07:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:07:52] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2184.codfw.wmnet 41.32.192.10.in-addr.arpa 1.4.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:07:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2184.codfw.wmnet 41.32.192.10.in-addr.arpa 1.4.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:07:55] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2184 [14:08:05] (03CR) 10Brouberol: [C:03+1] dse-k8s: Add tokens for mw-content-history-reconcile-enrich namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) (owner: 10Btullis) [14:08:25] !log btullis@cumin1002 START - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash rolling restart_daemons on A:apifeatureusage [14:08:32] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:08:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2184 [14:09:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2184 [14:09:22] o/ arlo is around watching this irc channel from my laptop. 
[14:09:25] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2185 [14:09:26] !log samtar@deploy2002 samtar, joelyrookewmde: Backport for [[gerrit:1098045|Remove feature flag which controls wikibase item link location (T377809)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:29] joelyrookewmde: ready for testing ^ [14:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:33] he has a backport scheduled in this window if anyone is around. [14:10:02] *looking* [14:10:10] subbu: ack, will be doing that one next probably [14:10:19] ty [14:10:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:10:56] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2187.codfw.wmnet 87.48.192.10.in-addr.arpa 7.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:10:57] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:11:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2187.codfw.wmnet 87.48.192.10.in-addr.arpa 7.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:11:00] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2187 [14:11:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.apifeatureusage.roll-restart-reboot-logstash (exit_code=0) rolling restart_daemons on A:apifeatureusage [14:11:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2187 [14:11:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2187 
[14:11:46] looks goof to me [14:11:51] good* [14:12:00] !log samtar@deploy2002 samtar, joelyrookewmde: Continuing with sync [14:12:56] PROBLEM - Host ms-be2085 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:28] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2185 - jelto@cumin1002" [14:14:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [14:14:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2185 - jelto@cumin1002" [14:14:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:14:32] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2185.codfw.wmnet 89.32.192.10.in-addr.arpa 9.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:14:35] (03Merged) 10jenkins-bot: ve.ui.CodeMirror.v6: Use plugin callback to load the actual module [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102141 (https://phabricator.wikimedia.org/T374072) (owner: 10Func) [14:14:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2185.codfw.wmnet 89.32.192.10.in-addr.arpa 9.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:14:36] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2185 [14:14:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10397163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm [14:14:39] !log 
btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on archiva1002.wikimedia.org with reason: Adding new disk [14:14:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on archiva1002.wikimedia.org with reason: Adding new disk [14:14:59] (03CR) 10Giuseppe Lavagetto: [C:03+2] Release version 4.0.3 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102276 (owner: 10Elukey) [14:15:25] (03Merged) 10jenkins-bot: styles: Avoid misalignments when line numbering is disabled [extensions/CodeMirror] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102142 (https://phabricator.wikimedia.org/T381714) (owner: 10Func) [14:15:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2185 [14:15:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2185 [14:17:54] RECOVERY - Host ms-be2085 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [14:18:11] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098045|Remove feature flag which controls wikibase item link location (T377809)]] (duration: 12m 32s) [14:18:15] T377809: Cleanup "Move wikidata item link into Other Projects sidebar" - https://phabricator.wikimedia.org/T377809 [14:18:23] joelyrookewmde: live on prod [14:18:32] (03Merged) 10jenkins-bot: Release version 4.0.3 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102276 (owner: 10Elukey) [14:19:04] Func: will do your two backports now [14:19:10] ok [14:19:38] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2085.codfw.wmnet with OS bullseye [14:19:43] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1102141|ve.ui.CodeMirror.v6: Use plugin callback to load the actual module (T374072)]], [[gerrit:1102142|styles: Avoid misalignments when line numbering is disabled (T381714)]] [14:19:48] T374072: CodeMirror 6 + 2017 
wikitext editor race conditions - https://phabricator.wikimedia.org/T374072 [14:19:49] T381714: Width of the cm-content element not set when line numbering is disabled in the 2017 wikitext editor - https://phabricator.wikimedia.org/T381714 [14:22:02] thanks!! [14:22:16] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2186.codfw.wmnet with reason: host reimage [14:22:59] !log samtar@deploy2002 samtar, func: Backport for [[gerrit:1102141|ve.ui.CodeMirror.v6: Use plugin callback to load the actual module (T374072)]], [[gerrit:1102142|styles: Avoid misalignments when line numbering is disabled (T381714)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:05] Func: ready for testing ^ [14:23:12] looking [14:24:59] TheresNoTime: looks good [14:25:04] !log samtar@deploy2002 samtar, func: Continuing with sync [14:25:47] (03CR) 10Gmodena: [C:03+1] "LGTM. Left you question re naming convnetions." [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) (owner: 10Btullis) [14:25:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2186.codfw.wmnet with reason: host reimage [14:25:58] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2184.codfw.wmnet with reason: host reimage [14:28:17] (03PS3) 10Btullis: dse-k8s: Add tokens for mw-content-history-reconcile-enrich namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) [14:28:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2187.codfw.wmnet with reason: host reimage [14:28:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2184.codfw.wmnet with reason: host reimage [14:28:46] (03CR) 10Btullis: dse-k8s: Add tokens for mw-content-history-reconcile-enrich namespaces (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) (owner: 10Btullis) [14:30:20] (03CR) 10Hnowlan: [C:03+1] APIGW: Add configuration to expose LW isvc article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102150 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [14:30:26] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102141|ve.ui.CodeMirror.v6: Use plugin callback to load the actual module (T374072)]], [[gerrit:1102142|styles: Avoid misalignments when line numbering is disabled (T381714)]] (duration: 10m 42s) [14:30:31] T374072: CodeMirror 6 + 2017 wikitext editor race conditions - https://phabricator.wikimedia.org/T374072 [14:30:31] T381714: Width of the cm-content element not set when line numbering is disabled in the 2017 wikitext editor - https://phabricator.wikimedia.org/T381714 [14:30:36] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4669/co" [puppet] - 10https://gerrit.wikimedia.org/r/1102314 (https://phabricator.wikimedia.org/T381322) (owner: 10Btullis) [14:30:37] Func: both live on prod [14:30:43] (03CR) 10Hnowlan: [C:03+1] admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:30:45] subbu: will do arlo's now [14:30:49] thanks [14:30:52] thanks [14:30:52] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load - https://phabricator.wikimedia.org/T381980#10397235 (10Lucas_Werkmeister_WMDE) Clickable link: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ W... 
[14:31:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [14:31:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2187.codfw.wmnet with reason: host reimage [14:32:14] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load - https://phabricator.wikimedia.org/T381980#10397243 (10Reedy) p:05Triage→03High [14:32:17] (03CR) 10Kevin Bazira: [C:03+2] APIGW: Add configuration to expose LW isvc article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102150 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [14:32:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2185.codfw.wmnet with reason: host reimage [14:32:37] (03Merged) 10jenkins-bot: Add Atieno's public key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [14:32:56] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1101577|Add Atieno's public key]] [14:33:42] (03Merged) 10jenkins-bot: APIGW: Add configuration to expose LW isvc article-country [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102150 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [14:33:47] subbu: will this patch need testing at all? 
[14:33:52] nope [14:33:56] ack :) [14:34:06] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [14:35:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2185.codfw.wmnet with reason: host reimage [14:36:17] !log samtar@deploy2002 arlolra, samtar: Backport for [[gerrit:1101577|Add Atieno's public key]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:21] !log samtar@deploy2002 arlolra, samtar: Continuing with sync [14:36:23] (03PS1) 10Jelto: trafficserver: add dedicated mapping for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) [14:38:32] (03CR) 10Xcollazo: data-engineering: add alerts for dumps2 flink app. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [14:39:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2085.codfw.wmnet with reason: host reimage [14:41:44] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101577|Add Atieno's public key]] (duration: 08m 47s) [14:41:49] subbu: live :) [14:41:59] thanks [14:42:07] !log done UTC afternoon backport window [14:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2186.codfw.wmnet with OS bookworm [14:48:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2184.codfw.wmnet with OS bookworm [14:48:52] (03CR) 10Gmodena: data-engineering: add alerts for dumps2 flink app. 
(031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [14:51:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2187.codfw.wmnet with OS bookworm [14:55:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2185.codfw.wmnet with OS bookworm [14:56:23] !log homer 'lsw1-d5-codfw*' commit 'T377877' [14:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:27] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [14:56:56] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:57:25] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:57:45] !log homer 'lsw1-c3-codfw*' commit 'T377877' [14:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm [14:58:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10397345 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm... 
[14:59:58] !log homer 'lsw1-d3-codfw*' commit 'T377877' [15:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1500) [15:02:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2085.codfw.wmnet with OS bullseye [15:02:55] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2184-2187].codfw.wmnet [15:02:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2184-2187].codfw.wmnet [15:03:28] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381967#10397383 (10Jelto) [15:04:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] trafficserver: add dedicated mapping for querybuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [15:11:20] (03PS1) 10Elukey: Updating docker-pkg to 4.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1102325 [15:11:47] (03CR) 10Elukey: [V:03+2 C:03+2] Updating docker-pkg to 4.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1102325 (owner: 10Elukey) [15:13:04] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@9305554]: Update to 4.0.3 [15:13:34] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@9305554]: Update to 4.0.3 (duration: 00m 37s) [15:13:48] (03PS6) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [15:13:59] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [15:15:34] !log bking@cumin2002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [15:15:37] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [15:15:48] FIRING: PuppetFailure: Puppet has failed on wdqs1025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:15:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [15:19:25] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:19:50] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:20:15] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:20:23] (03PS1) 10Elukey: jaeger: fix builder changelog to remove warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102329 [15:21:01] (03CR) 10Elukey: [V:03+2 C:03+2] jaeger: fix builder changelog to remove warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102329 (owner: 10Elukey) [15:21:42] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:22:30] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:23:48] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:23:54] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:23:57] (03PS1) 10CDanis: upstream_version test: be a bit more specific [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102330 [15:24:02] (03CR) 10Btullis: [V:03+1 C:03+2] dse-k8s: Add tokens for mw-content-history-reconcile-enrich namespaces [puppet] - 10https://gerrit.wikimedia.org/r/1102314 
(https://phabricator.wikimedia.org/T381322) (owner: 10Btullis) [15:24:18] (03PS1) 10Elukey: spark: update 3.3 build's changelog to fix warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102331 [15:24:36] (03CR) 10Elukey: [V:03+2 C:03+2] spark: update 3.3 build's changelog to fix warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102331 (owner: 10Elukey) [15:25:25] (03CR) 10Elukey: [C:03+1] upstream_version test: be a bit more specific [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102330 (owner: 10CDanis) [15:27:12] (03PS1) 10Hnowlan: mediawiki: shorten mercurius job name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102332 (https://phabricator.wikimedia.org/T371701) [15:27:18] (03CR) 10CDanis: [C:03+2] upstream_version test: be a bit more specific [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102330 (owner: 10CDanis) [15:27:21] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1102333 (https://phabricator.wikimedia.org/T381993) [15:29:32] (03CR) 10Hnowlan: [C:03+2] mediawiki: shorten mercurius job name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102332 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:30:32] (03Merged) 10jenkins-bot: upstream_version test: be a bit more specific [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1102330 (owner: 10CDanis) [15:33:41] (03Merged) 10jenkins-bot: mediawiki: shorten mercurius job name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102332 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:34:55] (03PS7) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [15:35:08] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [15:36:03] !log fabfur@cumin1002 conftool action : 
set/pooled=yes; selector: name=cp3066.esams.wmnet [15:36:30] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:36:35] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:36:47] FIRING: HelmReleaseBadStatus: Helm release mw-videoscaler/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-videoscaler - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:37:03] ^ just fixed [15:37:23] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [15:38:22] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:38:46] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:38:59] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on archiva1002.wikimedia.org with reason: Adding new disk [15:39:12] (03PS1) 10Marostegui: mariadb: Make db2235 m5 master [puppet] - 10https://gerrit.wikimedia.org/r/1102339 (https://phabricator.wikimedia.org/T373579) [15:39:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on archiva1002.wikimedia.org with reason: Adding new disk [15:41:12] (03CR) 10Marostegui: [C:03+2] mariadb: Make db2235 m5 master [puppet] - 10https://gerrit.wikimedia.org/r/1102339 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [15:41:15] (03CR) 10Itamar Givon: [C:03+1] trafficserver: add dedicated mapping for querybuilder [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [15:41:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-videoscaler/main on k8s@eqiad in state failed - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-videoscaler - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:44:04] (03PS1) 10Marostegui: dbproxy2004,dbproxy2008: Add db2235 [puppet] - 10https://gerrit.wikimedia.org/r/1102341 (https://phabricator.wikimedia.org/T373579) [15:44:58] (03CR) 10Marostegui: [C:03+2] dbproxy2004,dbproxy2008: Add db2235 [puppet] - 10https://gerrit.wikimedia.org/r/1102341 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [15:45:19] (03PS8) 10DCausse: wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) [15:45:41] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:45:47] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:46:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:46:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:50:09] (03CR) 10Bking: [C:03+2] wdqs: add graph_name in query logs [puppet] - 10https://gerrit.wikimedia.org/r/1084193 (https://phabricator.wikimedia.org/T376134) (owner: 10DCausse) [15:54:50] (03PS1) 10Marostegui: mariadb: Disable master on db2135 [puppet] - 10https://gerrit.wikimedia.org/r/1102342 [15:56:15] (03CR) 10Marostegui: [C:03+2] mariadb: Disable master on db2135 [puppet] - 10https://gerrit.wikimedia.org/r/1102342 (owner: 10Marostegui) [15:57:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime 
for 1:00:00 on db2135.codfw.wmnet with reason: maintenance [15:58:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2135.codfw.wmnet with reason: maintenance [16:02:58] (03CR) 10Ottomata: [C:03+2] "I'd like to make progress on this while I have time. Being bold and merging. If there are still changes needed please comment and I will" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:03:45] (03Merged) 10jenkins-bot: mediawiki.org/beacon/event/index.php - use EventBus->send [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:10:41] !log otto@deploy2002 Started scap sync-world: Backport for [[gerrit:1063222|mediawiki.org/beacon/event/index.php - use EventBus->send (T353817)]] [16:10:41] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:45] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [16:10:47] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:49] (03CR) 10Xcollazo: "Metrics LGTM, but I am unfamiliar with syntax." [alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [16:11:24] (03CR) 10Xcollazo: [C:03+1] data-engineering: add alerts for dumps2 flink app. 
[alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [16:12:56] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1102346 [16:13:40] (03CR) 10Scott French: [C:03+2] shellbox-video: allow egress to swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101944 (https://phabricator.wikimedia.org/T292322) (owner: 10Scott French) [16:14:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:14:44] (03Merged) 10jenkins-bot: shellbox-video: allow egress to swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101944 (https://phabricator.wikimedia.org/T292322) (owner: 10Scott French) [16:16:21] !log otto@deploy2002 otto: Backport for [[gerrit:1063222|mediawiki.org/beacon/event/index.php - use EventBus->send (T353817)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:16:25] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [16:16:37] !log otto@deploy2002 otto: Continuing with sync [16:18:53] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:28] (03CR) 10Alexandros Kosiaris: "Documentation was my angle fwiw. 
Someone trying to reason about this shouldn't have to look into what puppet puts in for the listener" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:20:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:43] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:31] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:21:39] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:22:17] !log otto@deploy2002 Finished scap sync-world: Backport for [[gerrit:1063222|mediawiki.org/beacon/event/index.php - use EventBus->send (T353817)]] (duration: 11m 36s) [16:22:21] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [16:23:21] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:24:33] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:24:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:25:26] (03PS2) 10Herron: pyrra: switch liftwing away from increase5m metrics [puppet] - 10https://gerrit.wikimedia.org/r/1102346 (https://phabricator.wikimedia.org/T302995) [16:25:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:25:48] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply 
[16:28:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.304s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:30:15] (03PS6) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [16:32:24] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:32:30] (03PS7) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [16:33:03] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1058224120 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:33:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.286s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:34:13] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:35:19] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:35:41] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002 (10phaultfinder) 03NEW [16:38:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 14248 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:42:57] !log 
bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:43:30] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:46:11] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:47:19] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:48:35] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:49:23] (03PS4) 10Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) [16:49:26] (03CR) 10Scott French: [C:03+1] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:54:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10397999 (10MatthewVernon) 05Open→03Resolved ms-be2085 now sorted, thanks to @elukey, so closing again. [16:54:33] (03CR) 10Scott French: "The new comments in mesh.configuration, together with a slight wording change in the mesh CHANGELOG (see comment on parent patch) indicate" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:56:32] (03PS8) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [16:57:05] (03PS1) 10Urbanecm: [Growth] Make the typage campaign not specific to 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102350 (https://phabricator.wikimedia.org/T380405) [17:01:01] (03CR) 10Hnowlan: "I've added documentation above each of the sections in the template where we add the values." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:05:14] (03CR) 10Elukey: [C:03+1] pyrra: switch liftwing away from increase5m metrics [puppet] - 10https://gerrit.wikimedia.org/r/1102346 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:05:14] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10398045 (10Scott_French) [17:05:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10398046 (10Scott_French) Great, thank you very much @Ammarpad and @Jdlrobson. [17:05:29] (03PS1) 10Clément Goubert: wikikube: Decommission 8 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1102352 (https://phabricator.wikimedia.org/T379788) [17:08:00] (03CR) 10Herron: [C:03+2] pyrra: switch liftwing away from increase5m metrics [puppet] - 10https://gerrit.wikimedia.org/r/1102346 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:09:37] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2047,2066,2085-2086,2180-2183].codfw.wmnet [17:09:48] (03PS5) 10Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) [17:11:29] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10398064 (10VRiley-WMF) If I recall correctly, last time this happened we ended up replacing two drives. When would it be okay to carry out this activity? 
[17:12:18] (03CR) 10Hnowlan: [C:03+2] mesh.configuration: dummy commit for 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101917 (owner: 10Hnowlan) [17:13:23] (03Merged) 10jenkins-bot: mesh.configuration: dummy commit for 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101917 (owner: 10Hnowlan) [17:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:14:32] (03PS2) 10DCausse: rdf-streaming-updater: add wdqs udpater streams in event stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099727 (https://phabricator.wikimedia.org/T374919) [17:16:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2047,2066,2085-2086,2180-2183].codfw.wmnet [17:18:18] (03CR) 10Jsn.sherman: [C:03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101937 (https://phabricator.wikimedia.org/T381000) (owner: 10Kgraessle) [17:19:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [17:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [17:21:22] (03CR) 10Clément Goubert: [C:03+2] wikikube: Decommission 8 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1102352 (https://phabricator.wikimedia.org/T379788) (owner: 10Clément Goubert) [17:24:40] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017 (10RobH) 03NEW [17:25:28] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10398169 (10RobH) @ayounsi or @cmooney: These two switches will arrive in December. Would one of you be able to update this task with the cabling directions to... 
[17:25:42] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10398171 (10RobH) [17:26:04] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10398173 (10RobH) [17:26:11] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10398174 (10RobH) [17:28:04] (03CR) 10Hnowlan: [C:03+2] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:28:11] (03CR) 10CI reject: [V:04-1] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:30:25] (03PS9) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [17:31:53] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [17:31:55] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97) [17:32:00] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[2047,2066,2085-2086].codfw.wmnet [17:33:55] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10398196 (10jhathaway) 05Open→03In progress [17:34:26] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10398198 (10jhathaway) 05In progress→03Resolved [17:34:38] (03PS1) 10Clément Goubert: wikikube: Decom wikikube-worker2086 [puppet] - 
10https://gerrit.wikimedia.org/r/1102357 (https://phabricator.wikimedia.org/T379788) [17:34:53] (03PS2) 10Clément Goubert: wikikube: Decom wikikube-worker2086 [puppet] - 10https://gerrit.wikimedia.org/r/1102357 (https://phabricator.wikimedia.org/T379788) [17:35:35] (03PS1) 10Giuseppe Lavagetto: Update code to the last two MRs [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1102360 [17:35:40] (03CR) 10Clément Goubert: [C:03+2] wikikube: Decom wikikube-worker2086 [puppet] - 10https://gerrit.wikimedia.org/r/1102357 (https://phabricator.wikimedia.org/T379788) (owner: 10Clément Goubert) [17:35:59] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Update code to the last two MRs [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1102360 (owner: 10Giuseppe Lavagetto) [17:37:46] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "UI improvements, add uncomitted changes warning - oblivian@cumin1002" [17:37:48] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: UI improvements, add uncomitted changes warning - oblivian@cumin1002 [17:38:19] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: UI improvements, add uncomitted changes warning - oblivian@cumin1002 [17:38:20] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "UI improvements, add uncomitted changes warning - oblivian@cumin1002" [17:40:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101937 (https://phabricator.wikimedia.org/T381000) (owner: 10Kgraessle) [17:41:19] PROBLEM - BGP status on 
lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:19] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:36] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10398276 (10Jhancock.wm) @Dwisehaupt @Papaul got the cable reconnected and confirmed it pings. if there are any issues with it, lmk and I'll take care of it asap. [17:45:58] BGP alerts are jasmine_ and I decommissioning k8s nodes [17:47:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 34086720 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:48:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 85192 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:48:39] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load - https://phabricator.wikimedia.org/T381980#10398289 (10Wargo) And now? [17:48:39] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [17:49:26] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10398295 (10Dwisehaupt) @Jhancock.wm Thanks! I can confirm that I'm in. 
[17:52:34] (03PS2) 10Hnowlan: base: fix pin on base.meta [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102307 [17:53:36] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2047,2066,2085-2086].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [17:54:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2047,2066,2085-2086].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [17:54:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:54:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[2047,2066,2085-2086].codfw.wmnet [17:55:00] !log homer 'lsw1-a6-codfw' commit 'T379788' [17:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:04] T379788: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788 [17:56:53] (03CR) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:57:01] homer 'lsw1-b6-codfw*' commit 'T379788' [17:57:12] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:57:23] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:26] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:32] 10ops-codfw, 06SRE, 06DC-Ops, 
10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381967#10398377 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:58:33] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T373133#10398381 (10VRiley-WMF) [17:58:47] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:58:50] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:59:16] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [17:59:19] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T1800) [18:00:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[2180-2183].codfw.wmnet [18:03:12] PROBLEM - BGP status on lsw1-c1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:04:28] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:04:30] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:05:13] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:05:14] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply 
[18:05:32] (03PS1) 10Herron: alertmanager: remove manually defined sli missing alert in favor or pyrra provided alert [alerts] - 10https://gerrit.wikimedia.org/r/1102366 (https://phabricator.wikimedia.org/T302995) [18:05:42] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:05:44] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:06:14] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:06:17] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [18:06:24] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:06] (03CR) 10Herron: alertmanager: remove manually defined sli missing alert in favor or pyrra provided alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1102366 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:11] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [18:11:14] (03PS10) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [18:15:01] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2180-2183].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [18:16:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2180-2183].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002" [18:16:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:16:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[2180-2183].codfw.wmnet [18:17:05] !log homer 'lsw1-c1-codfw*' commit 'T379788' [18:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:09] T379788: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788 [18:18:12] RECOVERY - BGP status on lsw1-c1-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:18:21] !log homer 'lsw1-d6-codfw*' commit 'T379788' [18:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:24] RECOVERY - BGP status on lsw1-d6-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:25:54] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: reimage puppetmasters to puppetservers - https://phabricator.wikimedia.org/T345067#10398512 (10jhathaway) [18:40:46] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh! Yeah, I think this should address the "duplicate modules upon vendoring" issue now that base.helper 1.1.4 exists." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1102307 (owner: 10Hnowlan) [18:42:15] 10ops-codfw, 06DC-Ops, 06serviceops: Decommission kubernetes20[07-14].codfw.wmnet - https://phabricator.wikimedia.org/T379788#10398556 (10jasmine_) a:05jasmine_→03None [18:49:24] 10ops-magru, 06Traffic: magru temp check - https://phabricator.wikimedia.org/T382026 (10RobH) 03NEW p:05Triage→03Medium [18:56:15] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10398622 (10RobH) [19:13:06] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10398712 (10VRiley-WMF) a:03VRiley-WMF [19:27:04] 06SRE, 06Infrastructure-Foundations: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10398751 (10Scott_French) a:05Scott_French→03None Great, thank you @NBaca-WMF. Alright, it seems like there are two different issues intertwined here: **Page annotations opt-outs**... [19:37:38] (03PS1) 10Eevans: aqs: Upgrade Cassandra to 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1102377 (https://phabricator.wikimedia.org/T380420) [19:40:47] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10398772 (10Eevans) >>! In T381742#10398064, @VRiley-WMF wrote: > If I recall correctly, last time this happened we ended up replacing two drives. We did. The original drive that had failed, and another that... 
[19:44:35] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1102377 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [19:57:38] (03CR) 10Scott French: [C:03+1] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [20:10:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/PageTriage] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102205 (https://phabricator.wikimedia.org/T381741) (owner: 10Novem Linguae) [20:22:18] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10398927 (10VRiley-WMF) [20:24:19] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10398939 (10VRiley-WMF) [20:42:06] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on aqs1014.eqiad.wmnet with reason: Hardware replacement [20:42:21] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on aqs1014.eqiad.wmnet with reason: Hardware replacement [20:44:14] FIRING: [2x] ProbeDown: Service aqs1014-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:57] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10399020 (10VRiley-WMF) Replaced serial number S4KVNA0MB04873 (Slot 6) With S4KVNA0MB04856 [20:47:09] FIRING: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:26] (03CR) 10Brouberol: [C:03+1] "It was probably a mistake of mine. I should have pinned the minor version, not the patch one. Thanks for the fix!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102307 (owner: 10Hnowlan) [20:56:50] ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 12, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T382033 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:56:56] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T382033 (10ops-monitoring-bot) 03NEW [20:57:09] RESOLVED: [4x] ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T2100). [21:00:05] katherine_g and NovemLinguae: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:00:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:00:13] here [21:00:16] o/ [21:00:25] hey katie :) [21:00:30] hi :) [21:00:48] pagetriage backport today. got a bug [21:01:04] 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10399059 (10Scott_French) I was able to reproduce the Arelion issue with https://groups.google.com/u/0/a/wikimedia.org/g/ops-maintenance/c/TGXNGkB-gSo (yes, this is a reminder for a mainten... [21:01:08] * TheresNoTime can deploy if needed [21:01:24] yes please. no deployers at the last backport i attended :P [21:01:40] yes please [21:01:49] katherine_g: I'll start with yours then :) [21:01:57] thanks! [21:02:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101937 (https://phabricator.wikimedia.org/T381000) (owner: 10Kgraessle) [21:02:52] (03Merged) 10jenkins-bot: Enable AutoModerator on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101937 (https://phabricator.wikimedia.org/T381000) (owner: 10Kgraessle) [21:03:03] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#10399064 (10Dwisehaupt) 05Open→03Resolved Host is built out and in the configuration stages which is covered in other tasks. Closing. 
[21:03:12] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1101937|Enable AutoModerator on bnwiki (T381000)]] [21:03:15] T381000: Enable AutoModerator on bnwiki - https://phabricator.wikimedia.org/T381000 [21:03:33] (03CR) 10Samtar: [C:03+2] "start merge for deploy" [extensions/PageTriage] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102205 (https://phabricator.wikimedia.org/T381741) (owner: 10Novem Linguae) [21:03:39] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10399071 (10Dwisehaupt) 05Open→03Resolved Hosts are built out and in the configuration stages which is covered in other tasks. Closing. [21:05:14] I'm good to sync [21:08:01] !log samtar@deploy2002 kgraessle, samtar: Backport for [[gerrit:1101937|Enable AutoModerator on bnwiki (T381000)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:10] katherine_g: hadn't yet properly hit the test servers - could you just double-check and then I'll sync? :) [21:08:51] yep- we're good [21:08:57] thanks! 
:) [21:08:59] !log samtar@deploy2002 kgraessle, samtar: Continuing with sync [21:09:14] np [21:10:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:10:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:11:39] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS bullseye [21:11:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [21:11:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10399103 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic... [21:11:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10399104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic... 
[21:11:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10399106 (10Jclark-ctr) [21:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:14:13] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101937|Enable AutoModerator on bnwiki (T381000)]] (duration: 11m 01s) [21:14:17] T381000: Enable AutoModerator on bnwiki - https://phabricator.wikimedia.org/T381000 [21:14:18] katherine_g: done :) live on prod [21:14:55] thanks! [21:15:03] NovemLinguae: another ~8 minutes for your patch to merge [21:15:16] 👍 [21:19:44] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [21:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [21:21:25] (03Merged) 10jenkins-bot: Follow-up I9df39fdcc: Convert missed 'this' to 'el' [extensions/PageTriage] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1102205 (https://phabricator.wikimedia.org/T381741) (owner: 10Novem Linguae) [21:21:47] merged [21:22:02] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1102205|Follow-up I9df39fdcc: Convert 
missed 'this' to 'el' (T381741)]] [21:22:06] T381741: Toolbar tag flyout: changing tag groups is broken - https://phabricator.wikimedia.org/T381741 [21:22:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1012.eqiad.wmnet with OS bullseye [21:22:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10399135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudelastic1012... [21:22:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1011.eqiad.wmnet with OS bullseye [21:22:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10399136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudelastic1011... [21:25:58] !log samtar@deploy2002 novemlinguae, samtar: Backport for [[gerrit:1102205|Follow-up I9df39fdcc: Convert missed 'this' to 'el' (T381741)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:04] NovemLinguae: on mwdebug for testing ^ [21:26:39] tested, works :) [21:26:52] !log samtar@deploy2002 novemlinguae, samtar: Continuing with sync [21:32:04] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102205|Follow-up I9df39fdcc: Convert missed 'this' to 'el' (T381741)]] (duration: 10m 01s) [21:32:08] T381741: Toolbar tag flyout: changing tag groups is broken - https://phabricator.wikimedia.org/T381741 [21:32:10] NovemLinguae: done, live on prod :) [21:32:32] awesome. thank you very much for your time [21:32:40] TheresNoTime ;-) [21:32:57] np! 
:D [21:33:36] !log done UTC late backport window [21:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:54] (03PS8) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [21:35:04] (03CR) 10Kamila Součková: [WIP, DNM] create sre.k8s.roll-reimage-nodes (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [21:59:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10399164 (10Jclark-ctr) i have updated firmwares and dell sees no issues. these where ordered with 512 memory and are listing the correct amou... [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241211T2200) [22:05:41] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T382033#10399184 (10Jclark-ctr) a:03VRiley-WMF @VRiley-WMF looks like it came back T362841 same drive SDG [22:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Stuck/bugged BMC on ml-lab1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381902#10399191 (10Jclark-ctr) 05Open→03Resolved [22:19:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10399201 (10wiki_willy) @Jclark-ctr - there's nothing that I'm aware of. 
If there's no additional info in the original procurement task or an... [22:42:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10399231 (10Eevans) Status: Rebuilding... [22:52:52] !log removing three files for legal compliance [22:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:10] !log removing 4 files for legal compliance [23:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:31] !log removing 7 files for legal compliance [23:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log