[00:06:02] (PS2) Brennen Bearnes: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285)
[00:08:03] (CR) CI reject: [V:-1] dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: Brennen Bearnes)
[00:21:22] (PS3) Brennen Bearnes: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285)
[00:28:28] SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10414598 (Bugreporter)
[00:38:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105455
[00:38:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105455 (owner: TrainBranchBot)
[00:41:54] SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10414621 (colewhite) In progress→Resolved Zuul is effectively migrated at this point and the...
[01:04:03] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10414678 (Jhancock.wm) Andrew, getting this error now in the installer. [!!] Partition disks Failed to run preseeded command Ex...
[01:04:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105455 (owner: TrainBranchBot)
[01:05:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm
[01:05:10] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414681 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm
[01:08:26] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105457
[01:08:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105457 (owner: TrainBranchBot)
[01:17:22] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414687 (Jhancock.wm) @elukey we're having an issue with this last server. es1043 keeps going to the puppetmaster server for it's certificate...
[01:38:21] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105457 (owner: TrainBranchBot)
[01:48:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[01:48:55] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm...
[02:17:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:18:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[02:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[02:28:41] !log krinkle@deploy2002 Started deploy [statsv/statsv@2ee86ea]: Add dogstatsd support
[02:28:50] !log krinkle@deploy2002 Finished deploy [statsv/statsv@2ee86ea]: Add dogstatsd support (duration: 00m 18s)
[02:34:41] (PS1) Krinkle: webperf: Enable --dogstatsd on statsv.py [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837)
[02:35:08] (CR) Krinkle: "support for `--dogstatsd` has been deployed. https://sal.toolforge.org/log/rli-3JMBKFqumxvt8GBJ" [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:14:51] PROBLEM - SSH on bast3007 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:16:07] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:21:51] RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:32:53] SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10414778 (aliu) Despite the slightly-invalid SVG code, the SVG still renders in browser if I render its code.
[03:40:07] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [03:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:07] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [05:39:07] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [06:05:15] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [06:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [06:31:31] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on 
alert1002 is CRITICAL: 1.001e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [06:38:07] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [06:40:07] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T0700) [07:00:05] marostegui, Amir1, and arnaudb: gettimeofday() says it's time for Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T0700) [07:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:18:00] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2190.codfw.wmnet with OS bookworm [07:18:03] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2190 [07:18:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2190 [07:37:15] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2190.codfw.wmnet with reason: host reimage [07:40:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2190.codfw.wmnet with reason: host reimage [07:54:40] Doing early +2 for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1105341 for wangombe_g's patch. [07:54:46] wangombe_g: ^^ [07:55:01] noted. [07:55:07] (03CR) 10KartikMistry: [C:03+2] Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [07:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:40] RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:00:07] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? 
UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T0800). nyaa~ [08:00:07] wangombe_g: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:25] FIRING: SystemdUnitFailed: user@0.service on elastic2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2190.codfw.wmnet with OS bookworm [08:01:24] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2190.codfw.wmnet [08:01:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2190.codfw.wmnet [08:02:18] I'm deploying wangombe_g's patches.. [08:02:45] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420#10414901 (10Jelto) The host responses normally and a reimage worked. Thanks @Jhancock.wm for the quick help! 
[08:03:07] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:03:56] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2061-2062].codfw.wmnet
[08:05:16] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:06:20] SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10414905 (Volans) FYI the links at the bottom of https://integration.wikimedia.org/zuul/ ( Job Stats...
[08:07:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2061-2062].codfw.wmnet
[08:08:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2061.codfw.wmnet with OS bookworm
[08:08:58] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2062.codfw.wmnet with OS bookworm
[08:09:13] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2061
[08:09:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2061
[08:09:17] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2062
[08:09:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2062
[08:10:13] SRE, CAS-SSO, Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10414909 (SLyngshede-WMF) ` cas.theme.default-theme-name=wikimedia # WebAuthN cas.authn.mfa.web-authn.core.application-id=https://idp-test.wikimedia.org cas.authn.mfa.web-authn...
[08:12:48] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:14:10] (Merged) jenkins-bot: Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:15:22] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105341|Event logging: pass empty object to translation property (T364460)]]
[08:15:27] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:27:10] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2061.codfw.wmnet with reason: host reimage
[08:27:20] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2062.codfw.wmnet with reason: host reimage
[08:28:01] !log kartik@deploy2002 wangombe, kartik: Backport for [[gerrit:1105341|Event logging: pass empty object to translation property (T364460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:28:05] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:28:19] wangombe_g: Please test
[08:28:26] on it
[08:31:18] testing on Special:translate is done kart_
[08:31:30] Nice.
[08:31:42] Deploying and +2ing 2nd patch as well.
[08:31:45] !log kartik@deploy2002 wangombe, kartik: Continuing with sync
[08:32:01] (CR) KartikMistry: [C:+2] Event logging: update schemaId [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:32:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2061.codfw.wmnet with reason: host reimage
[08:36:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2062.codfw.wmnet with reason: host reimage
[08:37:14] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105341|Event logging: pass empty object to translation property (T364460)]] (duration: 21m 52s)
[08:37:18] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:38:03] wangombe_g: first patch is done.
[08:38:14] wangombe_g: will wait for CI for 2nd patch now..
[08:38:29] yes. Thanks. Awaiting the second
[08:38:34] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:43:07] (CR) Gmodena: [C:+2] dse-k8s: content-history: add kafka cluster domain [deployment-charts] - https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: Gmodena)
[08:44:14] (Merged) jenkins-bot: dse-k8s: content-history: add kafka cluster domain [deployment-charts] - https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: Gmodena)
[08:49:30] SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10414948 (MatthewVernon) Similar issues have been reported before (e.g. T375324); in the case in point, this object is stored in swift as `text/plain`, w...
[08:51:24] (Merged) jenkins-bot: Event logging: update schemaId [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:51:57] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105283|Event logging: update schemaId (T364460)]]
[08:52:01] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:52:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2061.codfw.wmnet with OS bookworm
[08:53:42] wangombe_g: patch is merged..
[08:54:12] testing
[08:54:29] wangombe_g: no no. yet to reach on the testservers :D
[08:54:52] :D alright!
[08:54:53] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:55:09] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw [08:55:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2062.codfw.wmnet with OS bookworm [08:56:11] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [08:56:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw [08:57:50] !log kartik@deploy2002 kartik, wangombe: Backport for [[gerrit:1105283|Event logging: update schemaId (T364460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:57:55] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [08:58:07] wangombe_g: you can test now :) [08:59:56] on it [09:01:31] Works as intended. Thanks. [09:01:48] I've finished testing [09:02:59] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2061-2062].codfw.wmnet [09:03:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2061-2062].codfw.wmnet [09:06:41] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2059-2060].codfw.wmnet [09:07:41] cool. 
Deploying wangombe_g [09:07:49] !log kartik@deploy2002 kartik, wangombe: Continuing with sync [09:07:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2059-2060].codfw.wmnet [09:08:35] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad [09:09:05] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2060.codfw.wmnet with OS bookworm [09:09:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2059.codfw.wmnet with OS bookworm [09:09:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2060 [09:09:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2060 [09:09:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2059 [09:09:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2059 [09:09:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad [09:10:11] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [09:12:15] !log upgrading mwdebug* to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u4 T382077 [09:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:55] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[09:17:17] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105283|Event logging: update schemaId (T364460)]] (duration: 25m 20s) [09:17:22] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [09:17:43] Done! [09:20:05] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [09:23:51] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [09:23:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1069.eqiad.wmnet with OS bullseye [09:24:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [09:26:41] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1069.eqiad.wmnet [09:26:54] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2060.codfw.wmnet with reason: host reimage [09:26:57] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2059.codfw.wmnet with reason: host reimage [09:28:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1069.eqiad.wmnet [09:28:36] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1296-1300].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:28:44] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1301-1304].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [09:30:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS 
bookworm [09:30:26] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1301.eqiad.wmnet with OS bookworm [09:31:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2060.codfw.wmnet with reason: host reimage [09:33:11] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [09:33:30] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1067.eqiad.wmnet with OS bullseye [09:33:45] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:33:45] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:20] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye [09:35:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2059.codfw.wmnet with reason: host reimage [09:35:52] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [09:39:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [09:39:31] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1067.eqiad.wmnet with OS bullseye [09:39:57] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host 
an-worker1067.eqiad.wmnet with OS bullseye [09:40:13] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [09:44:59] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10415035 (10fgiunchedi) Thank you all for looking into this -- let's indeed see how `3m` (or larger) goes and if that is satisfactory! >>! In T382396#10413490,... [09:45:43] 10SRE-swift-storage, 06Commons, 07SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10415040 (10TheDJ) I vaguely remember that this happened for invalid svgs when MediaWiki did not yet supply the content type to swift, and instead we relie... [09:46:14] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [09:48:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [09:49:23] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [09:50:36] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:50:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2060.codfw.wmnet with OS bookworm [09:50:54] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [09:50:58] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1301.eqiad.wmnet with reason: host reimage [09:51:46] !log btullis@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on an-worker1067.eqiad.wmnet with reason: host reimage [09:52:15] (03PS1) 10Volans: api: allow to skip the START log to SAL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655) [09:53:19] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454#10415055 (10Volans) a:03Volans [09:53:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:53:47] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9041 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [09:53:49] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:54:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage [09:54:30] (03PS1) 10TChin: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) [09:54:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2059.codfw.wmnet with OS bookworm [09:55:46] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [09:57:26] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2059-2060].codfw.wmnet [09:57:28] !log jelto@cumin1002 END (PASS) 
- Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2059-2060].codfw.wmnet [09:58:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1301.eqiad.wmnet with reason: host reimage [09:58:23] 06SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T382477 (10Etienne20) 03NEW [09:58:23] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2057-2058].codfw.wmnet [09:59:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2057-2058].codfw.wmnet [10:00:47] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2057.codfw.wmnet with OS bookworm [10:00:48] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2058.codfw.wmnet with OS bookworm [10:00:57] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1067.eqiad.wmnet with reason: host reimage [10:01:06] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2057 [10:01:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2057 [10:01:07] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2058 [10:01:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2058 [10:03:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [10:04:54] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:36] (03CR) 10Sergio Gimeno: [C:03+1] "Just a 
question, lgtm." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: 10Urbanecm) [10:08:48] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [10:12:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [10:13:42] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10415112 (10BTullis) I have created the principal for Jimmy. ` btullis@krb1001:~$ sudo manage_principals.py get jly get_principal: Principal does not exist wh... [10:13:52] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:14:16] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1296.eqiad.wmnet with OS bookworm [10:14:48] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1043.eqiad.wmnet with reason: host reimage [10:15:37] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [10:16:02] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1297.eqiad.wmnet with OS bookworm [10:16:41] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2058.codfw.wmnet with OS bookworm [10:17:52] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:18:20] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10415120 
(10BTullis) 05Open→03Resolved The `data.yaml` file already reflects the fact that a kerberos principal should be available for this account, s... [10:18:20] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2057.codfw.wmnet with reason: host reimage [10:18:32] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [10:18:32] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1067.eqiad.wmnet with OS bullseye [10:18:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1301.eqiad.wmnet with OS bookworm [10:18:42] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2058.codfw.wmnet with OS bookworm [10:18:45] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2058 [10:18:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2058 [10:18:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1043.eqiad.wmnet with reason: host reimage [10:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [10:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [10:19:52] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - 
kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:19:56] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1067.eqiad.wmnet [10:20:25] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1302.eqiad.wmnet with OS bookworm [10:21:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2057.codfw.wmnet with reason: host reimage [10:22:42] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1067.eqiad.wmnet [10:23:55] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:45] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:55] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:49] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:30:56] 06SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T382477#10415142 (10Bugreporter) 05Open→03Invalid Wikimedia Maps is just an OpenStreetMap tile server, and you can use other ones. 
[10:35:54] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2058.codfw.wmnet with reason: host reimage [10:36:24] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [10:36:56] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge [10:37:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:relforge [10:39:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2058.codfw.wmnet with reason: host reimage [10:40:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2057.codfw.wmnet with OS bookworm [10:42:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [10:42:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1043.eqiad.wmnet with OS bookworm [10:44:00] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:44:22] !log restarting slapd on r/w servers to pick up openssl security updates [10:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:50] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Swift [10:45:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10415158 (10elukey) >>! 
In T378143#10414686, @Jhancock.wm wrote: > @elukey we're having an issue with this last server. es1043 keeps going to th... [10:51:39] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1067.eqiad.wmnet [10:51:46] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:51:49] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1067.eqiad.wmnet [10:52:13] !log installing e2fsprogs security updates [10:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:15] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1067.eqiad.wmnet [10:55:46] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:55:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1067.eqiad.wmnet [10:56:33] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1105326 (owner: 10Muehlenhoff) [10:56:46] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:42] (03PS1) 10Btullis: Revert "Configure the correct role for reimaging installing an-worker nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1105673 (https://phabricator.wikimedia.org/T382410) [10:57:52] (03PS2) 10Btullis: Revert "Configure the correct role for reimaging installing an-worker nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1105673 (https://phabricator.wikimedia.org/T382410) [10:57:52] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Druid roles [puppet] - 10https://gerrit.wikimedia.org/r/1105326 (owner: 10Muehlenhoff) [10:58:26] (03CR) 10Btullis: [C:03+2] Revert "Configure the correct role for reimaging installing an-worker nodes" [puppet] - 10https://gerrit.wikimedia.org/r/1105673 (https://phabricator.wikimedia.org/T382410) (owner: 10Btullis) [10:58:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2058.codfw.wmnet with OS bookworm [10:59:39] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2057-2058].codfw.wmnet [10:59:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2057-2058].codfw.wmnet [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1100) [11:00:13] (03PS1) 10Muehlenhoff: Add library hint for e2fsprogs [puppet] - 10https://gerrit.wikimedia.org/r/1105675 [11:00:34] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2055-2056].codfw.wmnet [11:01:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2055-2056].codfw.wmnet [11:02:26] 
(03CR) 10Muehlenhoff: [C:03+2] Add library hint for e2fsprogs [puppet] - 10https://gerrit.wikimedia.org/r/1105675 (owner: 10Muehlenhoff) [11:02:28] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2056.codfw.wmnet with OS bookworm [11:02:29] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2055.codfw.wmnet with OS bookworm [11:02:47] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2056 [11:02:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2056 [11:02:48] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2055 [11:02:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2055 [11:06:46] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:20] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:08:19] (03CR) 10Elukey: [C:04-1] "Trying manually the config in staging, it doesn't really work afaics, will update you folks when ready" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: 10Elukey) [11:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:46] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:00] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:09] 10ops-eqiad, 06DC-Ops: Update the labels on an-presto100[1-5] to be an-worker106[5-9] - https://phabricator.wikimedia.org/T382482 (10BTullis) 03NEW [11:11:47] 10ops-eqiad, 06DC-Ops: Update the labels on an-presto100[1-5] to be an-worker106[5-9] - https://phabricator.wikimedia.org/T382482#10415231 (10BTullis) [11:13:57] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1065.eqiad.wmnet [11:15:25] RESOLVED: SystemdUnitFailed: user@0.service on elastic2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1065.eqiad.wmnet [11:21:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1066.eqiad.wmnet [11:22:51] (03CR) 10Urbanecm: [Growth] Disable Surfacing Add Link tasks on all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: 10Urbanecm) [11:25:24] (03CR) 10Sergio Gimeno: [C:03+1] [Growth] Disable Surfacing Add Link tasks on all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: 10Urbanecm) [11:28:56] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1302.eqiad.wmnet with OS bookworm [11:29:00] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:29:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1066.eqiad.wmnet [11:29:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1067.eqiad.wmnet [11:36:39] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1297.eqiad.wmnet with OS bookworm [11:36:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1067.eqiad.wmnet [11:36:43] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1068.eqiad.wmnet [11:43:24] !log installing gsl security updates [11:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1068.eqiad.wmnet [11:44:41] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1069.eqiad.wmnet [11:48:47] !log installing distro-info-data updates on bullseye [11:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1069.eqiad.wmnet [11:53:39] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10415303 (10MoritzMuehlenhoff) [11:55:21] !log installing gtk+2.0 security updates [11:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:14] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:05:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:07:18] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:09:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:10:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:44] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105689 [12:21:25] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1105691 (owner: 10L10n-bot) [12:23:21] 10ops-eqiad, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485 (10phaultfinder) 03NEW [12:25:57] (03PS1) 10Muehlenhoff: Add library hint for gtk+2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1105693 [12:28:10] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for gtk+2.0 [puppet] - 10https://gerrit.wikimedia.org/r/1105693 (owner: 10Muehlenhoff) [12:50:02] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database tigwiki (T381378) [12:50:06] T381378: Prepare and check storage layer for tigwiki - https://phabricator.wikimedia.org/T381378 [12:52:50] (03CR) 10Gmodena: mw-content-history-reconcile-enrich: Enable K8 HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [12:54:48] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1105691 (owner: 10L10n-bot) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1300) [13:01:06] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1017-1020].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:02:45] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1017.eqiad.wmnet with OS bookworm [13:05:33] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:59] (03PS1) 10Muehlenhoff: Add library hint for libsepol [puppet] - 10https://gerrit.wikimedia.org/r/1105705 [13:16:15] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tigwiki (T381378) [13:16:19] T381378: Prepare and check storage layer for tigwiki - https://phabricator.wikimedia.org/T381378 [13:18:53] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libsepol [puppet] - 10https://gerrit.wikimedia.org/r/1105705 (owner: 10Muehlenhoff) [13:19:42] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1017.eqiad.wmnet with reason: host reimage [13:23:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1017.eqiad.wmnet with reason: host reimage [13:24:43] (03CR) 10Elukey: [C:03+1] Deprecate remaining uses of system::role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105329 (owner: 10Muehlenhoff) 
[13:25:57] (03PS1) 10Muehlenhoff: Blacklist btrfs [puppet] - 10https://gerrit.wikimedia.org/r/1105707 [13:26:59] (03CR) 10Muehlenhoff: [C:03+2] Deprecate remaining uses of system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105329 (owner: 10Muehlenhoff) [13:33:53] (03PS1) 10Btullis: dse-k8s: Add tokens for dumps-legacy namespace [puppet] - 10https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) [13:36:18] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4720/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) (owner: 10Btullis) [13:36:47] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Ceph roles [puppet] - 10https://gerrit.wikimedia.org/r/1105270 (owner: 10Muehlenhoff) [13:38:48] (03PS1) 10Muehlenhoff: Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105709 [13:38:49] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492 (10RobH) 03NEW [13:39:12] !log installing libsepol security updates [13:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:10] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492#10415566 (10RobH) a:03Andrew @Andrew, Two call outs! The original ordering task had a bad hostname range provided by you for racking "**Hostnames:** cloudvirt1068... 
[13:41:22] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492#10415571 (10RobH) [13:43:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1017.eqiad.wmnet with OS bookworm [13:45:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1018.eqiad.wmnet with OS bookworm [13:47:18] (03CR) 10Elukey: [C:03+1] Blacklist btrfs [puppet] - 10https://gerrit.wikimedia.org/r/1105707 (owner: 10Muehlenhoff) [13:47:58] (03PS3) 10Lucas Werkmeister (WMDE): Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [13:48:14] (03PS5) 10Filippo Giunchedi: prometheus: deploy instances from a single configuration [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) [13:48:30] (03CR) 10Lucas Werkmeister (WMDE): "I updated the commit message so gerritbot will connect it to the task, hope that’s okay." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [13:56:22] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1303.eqiad.wmnet with OS bookworm [13:57:14] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1298.eqiad.wmnet with OS bookworm [13:57:41] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [13:59:54] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1400). [14:00:05] danisztls: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:01:17] I can probably deploy in a few minutes :) [14:02:29] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1018.eqiad.wmnet with reason: host reimage [14:04:32] (03PS1) 10Joal: Revert "[analytics][webrequest] Extend retention for unique devices analysis" [puppet] - 10https://gerrit.wikimedia.org/r/1105713 [14:04:41] alright, I can deploy! 
[14:04:44] RESOLVED: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [14:04:44] RESOLVED: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:04:57] assuming danisztls is around, that is… [14:05:46] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1297.eqiad.wmnet with OS bookworm [14:05:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1018.eqiad.wmnet with reason: host reimage [14:06:05] (03CR) 10Btullis: [C:03+1] Revert "[analytics][webrequest] Extend retention for unique devices analysis" [puppet] - 10https://gerrit.wikimedia.org/r/1105713 (owner: 10Joal) [14:06:40] (03PS1) 10Joal: Revert "Update webrequest raw retention period on HDFS" [puppet] - 10https://gerrit.wikimedia.org/r/1105714 [14:07:52] (03CR) 10Btullis: [C:03+2] Revert "[analytics][webrequest] Extend retention for unique devices analysis" [puppet] - 10https://gerrit.wikimedia.org/r/1105713 (owner: 10Joal) [14:08:11] (03CR) 10Filippo Giunchedi: "To be merged no earlier than Jan" [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [14:08:20] (03CR) 10Btullis: [C:03+2] Revert "Update webrequest raw retention period on HDFS" [puppet] - 10https://gerrit.wikimedia.org/r/1105714 (owner: 10Joal) [14:10:40] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1302.eqiad.wmnet with OS bookworm [14:11:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 
06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10415676 (10Jhancock.wm) weird. when i ran the cookbook it was defaulting to puppet 7 since it was bookworm. not sure why it would do that. but!... [14:11:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10415678 (10Jhancock.wm) 05Open→03Resolved [14:13:07] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:27] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2055.codfw.wmnet with OS bookworm [14:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1297.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1297.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:16:37] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1303.eqiad.wmnet with reason: host reimage [14:17:20] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2055.codfw.wmnet with OS bookworm [14:17:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2055 [14:17:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2055 [14:17:44] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1298.eqiad.wmnet with reason: host reimage [14:18:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [14:18:43] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:20:03] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:20:04] I have no idea where to reach danisztls for that config change 🤷 [14:20:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1303.eqiad.wmnet with reason: host reimage [14:20:19] he’s offline in slack afaict [14:21:33] FIRING: [2x] KubernetesCalicoDown: wikikube-worker1297.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:22:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10415724 (10phaultfinder) [14:23:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1298.eqiad.wmnet with reason: host reimage [14:24:07] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Console/management wiring - https://phabricator.wikimedia.org/T382383#10415733 (10Papaul) p:05Triage→03Medium [14:24:37] 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10415734 (10Papaul) p:05Triage→03Medium [14:24:56] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement: codfw:expansion: Network devices/patch panel wiring 
- https://phabricator.wikimedia.org/T382219#10415736 (10Papaul) [14:25:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1018.eqiad.wmnet with OS bookworm [14:27:00] Lucas_WMDE: if you are finished / won't start, I'll make a PrivateSettings change [14:27:07] tgr|away: go ahead [14:27:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1019.eqiad.wmnet with OS bookworm [14:27:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10415746 (10phaultfinder) [14:27:50] (03CR) 10Elukey: [C:03+1] Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105709 (owner: 10Muehlenhoff) [14:29:59] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1297.eqiad.wmnet with reason: host reimage [14:30:51] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1302.eqiad.wmnet with reason: host reimage [14:33:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1297.eqiad.wmnet with reason: host reimage [14:34:30] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2055.codfw.wmnet with reason: host reimage [14:35:05] (03CR) 10JHathaway: [C:03+1] Blacklist btrfs [puppet] - 10https://gerrit.wikimedia.org/r/1105707 (owner: 10Muehlenhoff) [14:35:25] (03CR) 10Bking: [C:03+1] dse-k8s: Add tokens for dumps-legacy namespace [puppet] - 10https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) (owner: 10Btullis) [14:35:54] (03CR) 10JHathaway: [C:03+1] Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105709 (owner: 10Muehlenhoff) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1302.eqiad.wmnet with reason: host reimage [14:38:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1303.eqiad.wmnet with OS bookworm [14:39:02] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:39:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2055.codfw.wmnet with reason: host reimage [14:40:43] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1304.eqiad.wmnet with OS bookworm [14:42:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1298.eqiad.wmnet with OS bookworm [14:43:20] (03PS1) 10Andrew Bogott: site + preseed entries for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1105724 (https://phabricator.wikimedia.org/T382492) [14:43:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1019.eqiad.wmnet with reason: host reimage [14:44:02] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:12] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:36] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1299.eqiad.wmnet with OS bookworm [14:47:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1019.eqiad.wmnet with reason: host reimage [14:48:17] PROBLEM - BGP status on 
lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:48:17] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1297.eqiad.wmnet with OS bookworm [14:52:17] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:52:33] (03CR) 10Andrew Bogott: [C:03+2] site + preseed entries for new cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/1105724 (https://phabricator.wikimedia.org/T382492) (owner: 10Andrew Bogott) [14:53:17] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:55:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492#10415805 (10Andrew) >>! In T382492#10415566, @RobH wrote: > @Andrew, > > Two call outs! The original ordering task had a bad hostname... 
[14:56:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1302.eqiad.wmnet with OS bookworm [14:56:57] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2056.codfw.wmnet with OS bookworm [14:57:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10415807 (10Andrew) [14:57:42] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2056.codfw.wmnet with OS bookworm [14:57:45] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2056 [14:57:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2056 [14:58:42] (03CR) 10Urbanecm: [C:03+2] [Growth] Disable Surfacing Add Link tasks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: 10Urbanecm) [14:59:24] (03Merged) 10jenkins-bot: [Growth] Disable Surfacing Add Link tasks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: 10Urbanecm) [14:59:43] 06SRE, 10Observability-Metrics: node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768#10415826 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Has been done at some point in host overview dashboard, sample query: `node_cpu_frequency_hertz{instance=~... 
[15:00:39] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1105302|[Growth] Disable Surfacing Add Link tasks on all wikis (T382037)]] [15:00:44] T382037: Disable Alpha Test: Surfacing "Add a link" Structured Tasks (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T382037 [15:01:10] 14SRE-Sprint-Week-Sustainability-March2023, 06Infrastructure-Foundations, 10Mail, 10Observability-Metrics, 10Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867#10415850 (10fgiunchedi) 05Open→03Invalid No longer valid I think, also... [15:01:24] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1304.eqiad.wmnet with reason: host reimage [15:01:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2055.codfw.wmnet with OS bookworm [15:01:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10415854 (10Andrew) [15:02:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10415859 (10Andrew) a:05Andrew→03None [15:03:14] (03CR) 10Muehlenhoff: [C:03+2] Blacklist btrfs [puppet] - 10https://gerrit.wikimedia.org/r/1105707 (owner: 10Muehlenhoff) [15:03:35] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Metrics: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155#10415860 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Done in {T370506} [15:03:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:03:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:04:05] (03CR) 10Urbanecm: [C:03+2] [Growth] Make the typage campaign not specific to 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102350 (https://phabricator.wikimedia.org/T380405) (owner: 10Urbanecm) [15:04:22] (03CR) 10Muehlenhoff: [C:03+2] Remove system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105709 (owner: 10Muehlenhoff) [15:04:48] (03Merged) 10jenkins-bot: [Growth] Make the typage campaign not specific to 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1102350 (https://phabricator.wikimedia.org/T380405) (owner: 10Urbanecm) [15:05:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1304.eqiad.wmnet with reason: host reimage [15:05:22] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1299.eqiad.wmnet with reason: host reimage [15:06:33] FIRING: KubernetesCalicoDown: wikikube-worker2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2056.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:16] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1105302|[Growth] Disable Surfacing Add Link 
tasks on all wikis (T382037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:08:21] T382037: Disable Alpha Test: Surfacing "Add a link" Structured Tasks (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T382037 [15:08:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1019.eqiad.wmnet with OS bookworm [15:08:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:08:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:11] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1020.eqiad.wmnet with OS bookworm [15:10:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1299.eqiad.wmnet with reason: host reimage [15:10:37] (I'll deploy the PrivateSettings change later, I realized it's better to do it together with some backports) [15:11:01] !log urbanecm@deploy2002 urbanecm: Continuing with sync [15:12:28] (03PS1) 10Gergő Tisza: Make AuthManagerAutoConfig configuration key more distinctive [extensions/IPReputation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105735 
(https://phabricator.wikimedia.org/T369180) [15:14:57] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2056.codfw.wmnet with reason: host reimage [15:15:29] urbanecm: looks like the PrivateSettings change got into your sync [15:15:46] tgr|away: if you committed it, then that seems likely [15:15:53] so far it doesn't seem to break anything significant [15:15:56] probably a no-op, but if you see "Authentication failed because of inconsistent provider array" errors in the next few minutes, that's why [15:16:06] good to know [15:17:10] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:12] I'll re-add it to the private repo then (I committed it and then thought I'd rather deploy later and did reset it) [15:17:40] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105302|[Growth] Disable Surfacing Add Link tasks on all wikis (T382037)]] (duration: 17m 00s) [15:17:45] T382037: Disable Alpha Test: Surfacing "Add a link" Structured Tasks (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T382037 [15:18:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2056.codfw.wmnet with reason: host reimage [15:19:04] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1102350|[Growth] Make the typage campaign not specific to 2023 (T380405)]] [15:19:08] T380405: Generic Campaign parameter: New Editor Recruitment as part of the Donor Thank You page - https://phabricator.wikimedia.org/T380405 [15:20:45] ugh. it's on mwdebug but not mwmaint. I guess it was already reverted by the time scap backport got to the full sync?
[15:21:54] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1102350|[Growth] Make the typage campaign not specific to 2023 (T380405)]] [15:22:09] (03PS2) 10Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) [15:22:14] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:15] I guess I should have used scap lock [15:22:40] whatever, the next scap will clean it up [15:23:14] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:42] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10415928 (10Andrew) This server seems to have a raid controller, which is different from all the other standard ceph OSD nodes. Not sure how that happened b... 
[15:26:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1304.eqiad.wmnet with OS bookworm [15:26:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1301-1304].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [15:26:42] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:26:42] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1020.eqiad.wmnet with reason: host reimage [15:27:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10415937 (10phaultfinder) [15:28:55] (03PS1) 10Gergő Tisza: SUL3: Disable more auth providers in the local leg of SUL3 login [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) [15:29:24] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:30:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1299.eqiad.wmnet with OS bookworm [15:30:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/IPReputation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [15:30:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [15:31:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:31:50] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1300.eqiad.wmnet with OS bookworm [15:31:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1020.eqiad.wmnet with reason: host reimage [15:34:28] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102350|[Growth] Make the typage campaign not specific to 2023 (T380405)]] (duration: 12m 33s) [15:34:33] T380405: Generic Campaign parameter: New Editor Recruitment as part of the Donor Thank You page - https://phabricator.wikimedia.org/T380405 [15:35:00] tgr|away: i'm now done with syncing [15:35:08] feel free to clean up if needed [15:36:17] (03PS1) 10Gergő Tisza: [noop] Update private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) [15:36:24] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:37:30] thanks urbanecm! as far as I can tell, the last round of syncing did clean it up already [15:37:38] cool! 
[15:38:22] you might want to log the patches on wikitech/Deployments though [15:39:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2056.codfw.wmnet with OS bookworm [15:39:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [15:39:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [15:39:42] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:22] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@6ed5237]: SEAL conda env hotfix [15:48:32] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@6ed5237]: SEAL conda env hotfix (duration: 01m 28s) [15:48:43] 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10416007 (10colewhite) 05Resolved→03Open [15:49:20] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2055-2056].codfw.wmnet [15:49:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2055-2056].codfw.wmnet [15:50:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1020.eqiad.wmnet with OS bookworm [15:50:27] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1017-1020].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [15:53:17] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1300.eqiad.wmnet with reason: host reimage [15:55:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1300.eqiad.wmnet with reason: host reimage [15:57:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 (10MoritzMuehlenhoff) 03NEW [15:58:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508 (10MoritzMuehlenhoff) 03NEW [16:00:05] dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1600). 
[16:00:16] (03CR) 10DCausse: team-search-platform: Add alert for wdqs-categories lag (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [16:00:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509 (10MoritzMuehlenhoff) 03NEW [16:00:39] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:01:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511 (10MoritzMuehlenhoff) 03NEW [16:02:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512 (10MoritzMuehlenhoff) 03NEW [16:03:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513 (10MoritzMuehlenhoff) 03NEW [16:04:00] (03PS17) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [16:04:11] (03CR) 10Kamila Součková: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [16:05:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515 (10MoritzMuehlenhoff) 03NEW [16:07:24] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:04] (03PS2) 10Btullis: dse-k8s: Add tokens for mediawiki-data-dumps-legacy namespace [puppet] - 10https://gerrit.wikimedia.org/r/1105708 
(https://phabricator.wikimedia.org/T382489) [16:09:22] (03CR) 10Gmodena: mw-content-history-reconcile-enrich: Enable K8 HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [16:11:23] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:11:24] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:11:42] (03CR) 10Jelto: "I'll deploy this in January." [puppet] - 10https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [16:12:24] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:14:44] (03PS3) 10Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) [16:14:55] (03CR) 10Bking: team-search-platform: Add alert for wdqs-categories lag (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [16:15:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1300.eqiad.wmnet with OS bookworm [16:15:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1296-1300].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:15:32] (03PS1) 10Kamila Součková: sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - 10https://gerrit.wikimedia.org/r/1105752 [16:16:00] 06SRE, 06Infrastructure-Foundations, 
10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10416237 (10cmooney) >>! In T382396#10415035, @fgiunchedi wrote: > Yes and that's almost always the case, my understanding though is that the samples may not alw... [16:17:56] (03PS3) 10Btullis: dse-k8s: Add tokens for mediawiki-dumps-legacy namespace [puppet] - 10https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) [16:18:04] (03CR) 10Volans: [C:03+1] "LGTM if CI agrees :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1105752 (owner: 10Kamila Součková) [16:21:48] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - 10https://gerrit.wikimedia.org/r/1105752 (owner: 10Kamila Součková) [16:21:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518 (10cmooney) 03NEW p:05Triage→03Low [16:22:08] (03PS2) 10Kamila Součková: sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - 10https://gerrit.wikimedia.org/r/1105752 [16:22:40] (03PS1) 10Btullis: dse-k8s: Add a mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105754 (https://phabricator.wikimedia.org/T382489) [16:23:33] (03CR) 10Bartosz Dziewoński: [C:03+1] [noop] Update private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [16:25:57] 10ops-eqsin, 06SRE, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519 (10cmooney) 03NEW p:05Triage→03Low [16:25:59] ACKNOWLEDGEMENT - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Device is offline, see T382519 [16:26:46] ACKNOWLEDGEMENT - Host ripe-atlas-eqsin IPv6 is DOWN: CRITICAL - Host Unreachable (2001:df2:e500:201:103:102:166:20) 
Cathal Mooney See T382519 [16:27:04] ACKNOWLEDGEMENT - Host ripe-atlas-eqiad IPv6 is DOWN: CRITICAL - Host Unreachable (2620:0:861:202:208:80:155:69) Cathal Mooney See T382518 [16:27:14] ACKNOWLEDGEMENT - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney See T382518 [16:27:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416338 (10phaultfinder) [16:28:38] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on ripe-atlas-eqiad,ripe-atlas-eqiad IPv6 with reason: Atlas device offline, scheduling reboot [16:28:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on ripe-atlas-eqiad,ripe-atlas-eqiad IPv6 with reason: Atlas device offline, scheduling reboot [16:28:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10416342 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7fe2fd80-b4a4-43f7-ba5a-5238c44bbd7a) set by cmooney@cumin1002 for 30 days,... 
[16:29:14] (03CR) 10Kamila Součková: [C:03+2] sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - 10https://gerrit.wikimedia.org/r/1105752 (owner: 10Kamila Součková) [16:33:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [16:34:23] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [16:34:30] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105756 [16:35:23] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on ripe-atlas-eqsin,ripe-atlas-eqsin IPv6 with reason: Atlas device offline, scheduling reboot [16:35:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on ripe-atlas-eqsin,ripe-atlas-eqsin IPv6 with reason: Atlas device offline, scheduling reboot [16:35:48] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - 10https://gerrit.wikimedia.org/r/1105752 (owner: 10Kamila Součková) [16:35:49] 10ops-eqsin, 06SRE, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#10416397 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=68d77968-a0dd-4bd1-94ad-66be8ab508c5) set by cmooney@cumin1002 for 30 days, 0:00:00 on 2... 
[16:36:04] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105758 [16:37:50] jouncebot: nowandnext [16:37:50] For the next 0 hour(s) and 22 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1600) [16:37:50] In 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1700) [16:39:09] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105758 (owner: 10Clare Ming) [16:39:11] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105756 (owner: 10Clare Ming) [16:39:13] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, 07Wikimedia-production-error: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded - https://phabricator.wikimedia.org/T381109#10416420 (10dancy) [16:40:09] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105758 (owner: 10Clare Ming) [16:40:09] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105756 (owner: 10Clare Ming) [16:50:04] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10416509 (10thcipriani) 05Stalled→03Open a:05thcipriani→03Arnoldokoth >>! In T381851#10401485, @Scott_French wrote: > @Ammarpad - FYI, @thcipriani is out this week, so the next update... [16:50:07] (03CR) 10DDesouza: "You're welcome. Thanks for the fix. 
Unfortunately I wasn't able to attend the deployment but I scheduled the change for next window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [16:51:42] (03PS1) 10BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105760 [16:52:03] (03PS2) 10BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) [16:52:41] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:53:03] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:53:49] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [16:54:03] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [16:55:49] (03PS3) 10BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) [16:56:02] (03CR) 10Dzahn: [C:03+1] "seems good and thanks for fixing that. just cant merge it right now since I am "out of office". Please let Jelto or Arnold merge. 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar) [16:56:17] (03CR) 10Btullis: [C:03+2] dse-k8s: Add tokens for mediawiki-dumps-legacy namespace [puppet] - 10https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) (owner: 10Btullis) [16:56:43] (03CR) 10Btullis: [C:03+2] dse-k8s: Add a mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105754 (https://phabricator.wikimedia.org/T382489) (owner: 10Btullis) [16:57:08] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4722/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: 10BCornwall) [16:58:36] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:00:06] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1700) [17:00:06] No Gerrit patches in the queue for this window AFAICS. 
[17:00:29] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:00:40] hmm, jouncebot should be smarter about not needlessly getting our attention :P [17:00:51] (03Merged) 10jenkins-bot: dse-k8s: Add a mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105754 (https://phabricator.wikimedia.org/T382489) (owner: 10Btullis) [17:06:27] jhathaway: reminds me to delete the next few ones [17:06:33] (03PS4) 10Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) [17:06:42] rzl: thanks [17:07:09] (03CR) 10Elukey: "This one seems to work, I tested it locally with some metrics generated from maps2005. I think it is a reasonable baseline, then we can im" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: 10Elukey) [17:09:11] oh never mind, they're gone! t.hcipriani++ [17:10:32] (03CR) 10DCausse: team-search-platform: Add alert for wdqs-categories lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [17:10:50] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [17:17:51] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:18:42] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:19:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:21:24] PROBLEM - BGP status on cr2-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:21:36] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:23:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:23:55] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1022.eqiad.wmnet with OS bookworm [17:27:22] (03PS1) 10AOkoth: admin: Add ammarpad to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1105773 (https://phabricator.wikimedia.org/T381851) [17:27:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416678 (10phaultfinder) [17:28:25] (03CR) 10Elukey: "elukey@kubestage1006:~$ sudo nsenter -t 2495625 -n curl -s localhost:9102/metrics | grep karto" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: 10Elukey) [17:38:33] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [17:39:00] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [17:39:11] (03PS1) 10Scott French: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1105776 (https://phabricator.wikimedia.org/T382517) [17:39:42] (03PS1) 10Scott French: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.8) - 
10https://gerrit.wikimedia.org/r/1105778 (https://phabricator.wikimedia.org/T382517) [17:42:43] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1022.eqiad.wmnet with reason: host reimage [17:46:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1022.eqiad.wmnet with reason: host reimage [17:46:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416778 (10phaultfinder) [17:48:25] jouncebot: nowandnext [17:48:26] For the next 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1700) [17:48:26] In 0 hour(s) and 11 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800) [17:48:26] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800) [17:48:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:49:12] unless there are any objections, I'll be backporting a fix for a log-spam issue shortly [17:53:17] (03PS1) 10JHathaway: WIP: postfix logging [puppet] - 10https://gerrit.wikimedia.org/r/1105780 [17:53:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:55:41] (03CR) 10JHathaway: "Thanks for the patch Brett" [puppet] - 10https://gerrit.wikimedia.org/r/1105760 
(https://phabricator.wikimedia.org/T381927) (owner: 10BCornwall) [18:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800) [18:04:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [extensions/EventBus] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105778 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [18:04:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [extensions/EventBus] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1105776 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [18:06:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1022.eqiad.wmnet with OS bookworm [18:06:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [18:06:46] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [18:07:12] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-12-19-122113-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105787 [18:07:12] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [18:08:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:03] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-12-19-122113-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105787 (owner: 10BryanDavis) [18:15:13] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-12-19-122113-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105787 (owner: 10BryanDavis) [18:21:59] (03PS1) 10DLynch: Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 [18:22:18] (03PS2) 10DLynch: Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) [18:22:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) (owner: 10DLynch) [18:26:15] (03PS1) 10Joal: Revert "Fix security checksum for web_request's refinery-drop-older-than" [puppet] - 10https://gerrit.wikimedia.org/r/1105790 [18:26:44] (03CR) 10Cwhite: [C:03+2] webperf: Enable --dogstatsd on statsv.py [puppet] - 10https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [18:27:43] (03Merged) 10jenkins-bot: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105778 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [18:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416878 (10phaultfinder) [18:31:07] (03Merged) 10jenkins-bot: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1105776 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) 
[18:31:14] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:31:33] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:31:39] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1105778|maintenance: fix typo in job status logging (T382517)]], [[gerrit:1105776|maintenance: fix typo in job status logging (T382517)]] [18:31:44] T382517: PHP Warning seen by logspam-watch but not by mediawiki-errors logstash page - https://phabricator.wikimedia.org/T382517 [18:31:45] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:32:30] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:32:37] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:32:53] (03CR) 10Btullis: [C:03+2] Revert "Fix security checksum for web_request's refinery-drop-older-than" [puppet] - 10https://gerrit.wikimedia.org/r/1105790 (owner: 10Joal) [18:32:57] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:38:41] (03PS1) 10Herron: pyrra: wdqs match site label with = instead of =~ [puppet] - 10https://gerrit.wikimedia.org/r/1105791 (https://phabricator.wikimedia.org/T302995) [18:41:23] (03CR) 10Herron: [C:03+2] pyrra: wdqs match site label with = instead of =~ [puppet] - 10https://gerrit.wikimedia.org/r/1105791 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:41:35] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1105778|maintenance: fix typo in job status logging (T382517)]], [[gerrit:1105776|maintenance: fix typo in job status logging (T382517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:41:39] T382517: PHP Warning seen by logspam-watch but not by mediawiki-errors logstash page - https://phabricator.wikimedia.org/T382517 [18:42:19] !log swfrench@deploy2002 swfrench: Continuing with 
sync [18:44:17] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10416914 (10Andrew) I designated every drive a non-raid drive in the bios and now the install is completing. I can't make it stop installing though, it just... [18:47:41] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [18:47:53] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [18:48:50] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105778|maintenance: fix typo in job status logging (T382517)]], [[gerrit:1105776|maintenance: fix typo in job status logging (T382517)]] (duration: 17m 11s) [18:48:55] T382517: PHP Warning seen by logspam-watch but not by mediawiki-errors logstash page - https://phabricator.wikimedia.org/T382517 [18:49:08] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1011 [18:49:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1011 [18:51:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcontrol1011 - jclark@cumin1002" [18:51:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcontrol1011 - jclark@cumin1002" [18:51:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcontrol1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:53:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - 
https://phabricator.wikimedia.org/T380499#10416940 (10Jclark-ctr) [18:53:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:57:11] (03CR) 10BCornwall: [V:03+1] postfix: Enable summary messages on TLS handshakes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: 10BCornwall) [18:57:29] (03Abandoned) 10BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: 10BCornwall) [18:57:56] (03CR) 10BCornwall: WIP: postfix logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (owner: 10JHathaway) [18:58:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:59:37] (03CR) 10BCornwall: [C:04-1] WIP: postfix logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (owner: 10JHathaway) [18:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416947 (10phaultfinder) [19:00:05] dancy: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1900). 
[19:00:43] o/ [19:01:12] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105793 (https://phabricator.wikimedia.org/T375667) [19:01:14] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105793 (https://phabricator.wikimedia.org/T375667) (owner: 10TrainBranchBot) [19:02:00] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105793 (https://phabricator.wikimedia.org/T375667) (owner: 10TrainBranchBot) [19:03:46] o/ [19:04:02] (03PS2) 10BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway) [19:05:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416963 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr rebalanced pdu for B4. L1 A [19:08:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [19:08:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [19:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:35] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 
to 1.44.0-wmf.8 refs T375667 [19:12:40] T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667 [19:16:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:18:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:18:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol1011.eqiad.wmnet with OS bookworm [19:18:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10417006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet w... [19:19:31] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:53] hmm, that httpbb failure is a 503 for https://species.wikimedia.org/wiki/Sitta_europaea_caesia [19:26:00] cc dancy, brennen [19:26:28] not immediately sure it's a train thing, still looking, just fyi [19:26:38] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535 (10phaultfinder) 03NEW [19:27:44] hmmm [19:31:23] The https://species.wikimedia.org/wiki/Sitta_europaea_caesia page looks ok. 
[19:31:27] yeah [19:31:51] I'm going to take a break and see how things look in about 20 minutes [19:31:53] I'm not seeing anything correlated / suspicious in metrics or logs [19:31:55] in _general_ things look pretty much like they did pre-deploy [19:32:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [19:32:46] i'm going for a slice of pizza but i'll take the laptop. [19:32:54] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10417061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2004-dev.codfw.wmnet with OS bul... [19:32:59] yeah, I can't get it to repro either [19:33:12] probably just a hiccup and the alert will clear on the next hourly run, sorry for the false alarm [19:35:11] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:35:31] thanks for spotting and investigating, rzl! 
[19:38:55] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002" [19:38:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002" [19:38:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:41:02] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:46:29] (03PS3) 10JHathaway: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) [19:46:55] (03CR) 10JHathaway: postfix: Enable summary messages on TLS handshakes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway) [19:50:32] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002" [19:50:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002" [19:50:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:51:13] !log jforrester@deploy2002 Started deploy [integration/docroot@4701376]: I1ea9f34dc6176da4cca5da50c293bd5ff62661b8 for T233089 [19:51:17] T233089: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 [19:51:24] !log jforrester@deploy2002 Finished deploy [integration/docroot@4701376]: I1ea9f34dc6176da4cca5da50c293bd5ff62661b8 for T233089 (duration: 00m 10s) [20:06:15] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [20:09:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417106 (10phaultfinder) [20:16:14] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:19:31] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:26] 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10417126 (10colewhite) 05Open→03Resolved Thanks @Volans for pointing those out! With the latest de... [20:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417150 (10phaultfinder) [20:38:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1011.eqiad.wmnet with OS bookworm [20:38:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10417151 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with... 
[20:41:56] (03PS1) 10Scott French: mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) [20:53:52] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T2100). [21:00:04] tgr, danisztls, kemayo, and Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:06] o/ [21:00:20] i can deploy today [21:00:26] o/ [21:00:31] o/ [21:00:42] (03PS3) 10DLynch: Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) [21:00:59] (03CR) 10Urbanecm: [C:03+2] Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) (owner: 10DLynch) [21:01:05] I'll deploy my patches, it involves the private repo [21:01:35] (03PS4) 10Lucas Werkmeister (WMDE): Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:01:38] (03CR) 10Urbanecm: [C:03+2] Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:01:44] tgr|away: ack [21:02:12] (03Merged) 10jenkins-bot: Set Flow to read-only on phase 1 wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) (owner: 10DLynch) [21:02:25] (03Merged) 10jenkins-bot: Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:03:07] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1105788|Set Flow to read-only on phase 1 wikis (T378833)]], [[gerrit:1105027|Reader Survey: Undeploy (T378660)]] [21:03:12] T378833: [Config] Set Flow to read-only at all *Phase 1* wikis - https://phabricator.wikimedia.org/T378833 [21:03:13] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:03:37] * Krinkle is here [21:04:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417219 (10phaultfinder) [21:07:45] !log urbanecm@deploy2002 urbanecm, kemayo, dani: Backport for [[gerrit:1105788|Set Flow to read-only on phase 1 wikis (T378833)]], [[gerrit:1105027|Reader Survey: Undeploy (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:58] Kemayo: danisztls: please test :) [21:08:56] urbanecm: Looks good. 
[21:09:03] ty [21:09:22] urbanecm: looks good [21:09:26] ty [21:09:28] !log urbanecm@deploy2002 urbanecm, kemayo, dani: Continuing with sync [21:09:52] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [21:14:21] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105788|Set Flow to read-only on phase 1 wikis (T378833)]], [[gerrit:1105027|Reader Survey: Undeploy (T378660)]] (duration: 11m 14s) [21:14:26] T378833: [Config] Set Flow to read-only at all *Phase 1* wikis - https://phabricator.wikimedia.org/T378833 [21:14:27] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:14:31] and done [21:14:49] tgr|away: Krinkle: leaving your patches up to you :) [21:15:14] tgr|away: go ahead if you like. I'm writing some docs meanwhile. 
[21:15:22] ack [21:16:03] (03CR) 10Gergő Tisza: [C:03+2] SUL3: Disable more auth providers in the local leg of SUL3 login [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [21:16:18] (03CR) 10Gergő Tisza: [C:03+2] Make AuthManagerAutoConfig configuration key more distinctive [extensions/IPReputation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [21:17:15] Krinkle: or I can deploy your change while I am waiting for the merges [21:17:20] Sure [21:17:44] (03PS4) 10Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) [21:18:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [21:18:45] (03Merged) 10jenkins-bot: Make AuthManagerAutoConfig configuration key more distinctive [extensions/IPReputation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [21:18:58] (03CR) 10CI reject: [V:04-1] team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [21:19:11] oh wow that was unexpectedly fast [21:19:20] (03Merged) 10jenkins-bot: Enable $wgWMEStatsBeaconUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle) [21:20:03] I guess I'll just wait for the CentralAuth merge and deploy everything together then [21:20:06] How dare we have CI jobs that complete under 5min. 
[21:20:18] This extension probably isn't in the wmf gate [21:21:08] (03PS5) 10Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) [21:21:56] (03CR) 10Gergő Tisza: [C:03+2] [noop] Update private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [21:22:42] (03Merged) 10jenkins-bot: [noop] Update private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [21:25:34] I just filed a ticket about errors showing up in logspam-watch, starting around 21:14:00 [21:25:34] https://phabricator.wikimedia.org/T382546 [21:26:52] (03Merged) 10jenkins-bot: SUL3: Disable more auth providers in the local leg of SUL3 login [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza) [21:28:21] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [21:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417332 (10phaultfinder) [21:29:47] !log deploying PrivateSettings change 95517e85 [21:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:30] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] [21:31:35] T369180: Ensure no AuthenticationRequests are added to the local login flow in SUL3 mode -
https://phabricator.wikimedia.org/T369180 [21:31:36] T355837: Add Prometheus support to statsd.js via mw.track() - https://phabricator.wikimedia.org/T355837 [21:32:10] standing by for test/staging [21:37:35] mw.loader.moduleRegistry['ext.wikimediaEvents'].packageExports['config.json'].WMEStatsBeaconUri [21:37:35] "/beacon/statsv" [21:37:35] !log tgr@deploy2002 krinkle, tgr: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:37:41] T369180: Ensure no AuthenticationRequests are added to the local login flow in SUL3 mode - https://phabricator.wikimedia.org/T369180 [21:37:41] T355837: Add Prometheus support to statsd.js via mw.track() - https://phabricator.wikimedia.org/T355837 [21:37:50] LGTM on mwdebug-next [21:39:22] Also confirmed `mw.track('stats.mediawiki_gadget_track_example_total', 12)` works as expected [21:46:21] (03CR) 10Scott French: "Thanks, Joe!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French) [21:48:03] !log tgr@deploy2002 krinkle, tgr: Continuing with sync [21:48:54] login errors in the next few minutes are expected [21:53:04] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] (duration: 21m 34s) [21:53:10] T369180: Ensure no AuthenticationRequests are added to the local login flow in SUL3 mode - https://phabricator.wikimedia.org/T369180 [21:53:11] T355837: Add Prometheus support to statsd.js via mw.track() - https://phabricator.wikimedia.org/T355837 [21:54:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417383 (10phaultfinder) [21:57:25] !log UTC late deploys done [21:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:23] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4724/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway) [22:07:39] (03CR) 10BCornwall: [C:03+1] postfix: Enable summary messages on TLS handshakes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway) [22:20:36] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [22:24:39] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417495 (10phaultfinder) [22:26:20] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, I just left a minor question. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [22:39:36] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [23:03:26] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:03:28] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:03:28] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:05:17] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:05:18] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Wed 15 Jan 2025 08:55:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:05:18] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:08:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [23:08:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [23:23:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:24:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417688 (10phaultfinder) [23:55:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2061:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2061 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown