[00:06:02] <wikibugs>	 (PS2) Brennen Bearnes: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285)
[00:08:03] <wikibugs>	 (CR) CI reject: [V:-1] dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: Brennen Bearnes)
[00:21:22] <wikibugs>	 (PS3) Brennen Bearnes: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285)
[00:28:28] <wikibugs>	 SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10414598 (Bugreporter)
[00:38:29] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105455
[00:38:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105455 (owner: TrainBranchBot)
[00:41:54] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10414621 (colewhite) In progress→Resolved Zuul is effectively migrated at this point and the...
[01:04:03] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10414678 (Jhancock.wm) Andrew, getting this error now in the installer.     [!!] Partition disks                      Failed to run preseeded command   Ex...
[01:04:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105455 (owner: TrainBranchBot)
[01:05:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm
[01:05:10] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414681 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm
[01:08:26] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105457
[01:08:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105457 (owner: TrainBranchBot)
[01:17:22] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414687 (Jhancock.wm) @elukey we're having an issue with this last server. es1043 keeps going to the puppetmaster server for its certificate...
[01:38:21] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105457 (owner: TrainBranchBot)
[01:48:46] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[01:48:55] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm...
[02:17:53] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:18:43] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[02:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[02:28:41] <logmsgbot>	 !log krinkle@deploy2002 Started deploy [statsv/statsv@2ee86ea]: Add dogstatsd support
[02:28:50] <logmsgbot>	 !log krinkle@deploy2002 Finished deploy [statsv/statsv@2ee86ea]: Add dogstatsd support (duration: 00m 18s)
[02:34:41] <wikibugs>	 (PS1) Krinkle: webperf: Enable --dogstatsd on statsv.py [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837)
[02:35:08] <wikibugs>	 (CR) Krinkle: "support for `--dogstatsd` has been deployed. https://sal.toolforge.org/log/rli-3JMBKFqumxvt8GBJ" [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:14:51] <icinga-wm>	 PROBLEM - SSH on bast3007 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:16:07] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:21:51] <icinga-wm>	 RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:32:53] <wikibugs>	 SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10414778 (aliu) Despite the slightly-invalid SVG code, the SVG still renders in browser if I render its code.
[03:40:07] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:07] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[05:39:07] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[06:05:15] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[06:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[06:31:31] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.001e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[06:38:07] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[06:40:07] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T0700)
[07:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2190.codfw.wmnet with OS bookworm
[07:18:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2190
[07:18:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2190
[07:37:15] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2190.codfw.wmnet with reason: host reimage
[07:40:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2190.codfw.wmnet with reason: host reimage
[07:54:40] <kart_>	 Doing early +2 for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1105341 for wangombe_g's patch.
[07:54:46] <kart_>	 wangombe_g: ^^
[07:55:01] <wangombe_g>	 noted.
[07:55:07] <wikibugs>	 (CR) KartikMistry: [C:+2] Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[07:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:40] <icinga-wm>	 RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:00:07] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T0800). nyaa~
[08:00:07] <jouncebot>	 wangombe_g: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@0.service on elastic2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2190.codfw.wmnet with OS bookworm
[08:01:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2190.codfw.wmnet
[08:01:27] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2190.codfw.wmnet
[08:02:18] <kart_>	 I'm deploying wangombe_g's patches..
[08:02:45] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops, and 3 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420#10414901 (Jelto) The host responds normally and a reimage worked. Thanks @Jhancock.wm for the quick help!
[08:03:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:03:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2061-2062].codfw.wmnet
[08:05:16] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:06:20] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10414905 (Volans) FYI the links at the bottom of https://integration.wikimedia.org/zuul/ ( Job Stats...
[08:07:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2061-2062].codfw.wmnet
[08:08:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2061.codfw.wmnet with OS bookworm
[08:08:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2062.codfw.wmnet with OS bookworm
[08:09:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2061
[08:09:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2061
[08:09:17] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2062
[08:09:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2062
[08:10:13] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10414909 (SLyngshede-WMF) ` cas.theme.default-theme-name=wikimedia  # WebAuthN cas.authn.mfa.web-authn.core.application-id=https://idp-test.wikimedia.org cas.authn.mfa.web-authn...
[08:12:48] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:14:10] <wikibugs>	 (Merged) jenkins-bot: Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:15:22] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105341|Event logging: pass empty object to translation property (T364460)]]
[08:15:27] <stashbot>	 T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:27:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2061.codfw.wmnet with reason: host reimage
[08:27:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2062.codfw.wmnet with reason: host reimage
[08:28:01] <logmsgbot>	 !log kartik@deploy2002 wangombe, kartik: Backport for [[gerrit:1105341|Event logging: pass empty object to translation property (T364460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:28:05] <stashbot>	 T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:28:19] <kart_>	 wangombe_g: Please test
[08:28:26] <wangombe_g>	 on it
[08:31:18] <wangombe_g>	 testing on Special:translate is done kart_
[08:31:30] <kart_>	 Nice.
[08:31:42] <kart_>	 Deploying and +2ing 2nd patch as well.
[08:31:45] <logmsgbot>	 !log kartik@deploy2002 wangombe, kartik: Continuing with sync
[08:32:01] <wikibugs>	 (CR) KartikMistry: [C:+2] Event logging: update schemaId [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:32:50] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2061.codfw.wmnet with reason: host reimage
[08:36:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2062.codfw.wmnet with reason: host reimage
[08:37:14] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105341|Event logging: pass empty object to translation property (T364460)]] (duration: 21m 52s)
[08:37:18] <stashbot>	 T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:38:03] <kart_>	 wangombe_g: first patch is done.
[08:38:14] <kart_>	 wangombe_g: will wait for CI for 2nd patch now..
[08:38:29] <wangombe_g>	 yes. Thanks. Awaiting the second
[08:38:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:43:07] <wikibugs>	 (CR) Gmodena: [C:+2] dse-k8s: content-history: add kafka cluster domain [deployment-charts] - https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: Gmodena)
[08:44:14] <wikibugs>	 (Merged) jenkins-bot: dse-k8s: content-history: add kafka cluster domain [deployment-charts] - https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: Gmodena)
[08:49:30] <wikibugs>	 SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10414948 (MatthewVernon) Similar issues have been reported before (e.g. T375324); in the case in point, this object is stored in swift as `text/plain`, w...
[08:51:24] <wikibugs>	 (Merged) jenkins-bot: Event logging: update schemaId [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[08:51:57] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105283|Event logging: update schemaId (T364460)]]
[08:52:01] <stashbot>	 T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:52:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2061.codfw.wmnet with OS bookworm
[08:53:42] <kart_>	 wangombe_g: patch is merged..
[08:54:12] <wangombe_g>	 testing
[08:54:29] <kart_>	 wangombe_g: no no, it has yet to reach the testservers :D
[08:54:52] <wangombe_g>	 :D alright!
[08:54:53] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:55:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw
[08:55:40] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2062.codfw.wmnet with OS bookworm
[08:56:11] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[08:56:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw
[08:57:50] <logmsgbot>	 !log kartik@deploy2002 kartik, wangombe: Backport for [[gerrit:1105283|Event logging: update schemaId (T364460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:57:55] <stashbot>	 T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[08:58:07] <kart_>	 wangombe_g: you can test now :)
[08:59:56] <wangombe_g>	 on it
[09:01:31] <wangombe_g>	 Works as intended. Thanks.
[09:01:48] <wangombe_g>	 I've finished testing
[09:02:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2061-2062].codfw.wmnet
[09:03:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2061-2062].codfw.wmnet
[09:06:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2059-2060].codfw.wmnet
[09:07:41] <kart_>	 cool. Deploying wangombe_g
[09:07:49] <logmsgbot>	 !log kartik@deploy2002 kartik, wangombe: Continuing with sync
[09:07:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2059-2060].codfw.wmnet
[09:08:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad
[09:09:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2060.codfw.wmnet with OS bookworm
[09:09:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2059.codfw.wmnet with OS bookworm
[09:09:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2060
[09:09:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2060
[09:09:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2059
[09:09:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2059
[09:09:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad
[09:10:11] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[09:12:15] <moritzm>	 !log upgrading mwdebug* to PHP 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u4 T382077
[09:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:55] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:17:17] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105283|Event logging: update schemaId (T364460)]] (duration: 25m 20s)
[09:17:22] <stashbot>	 T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460
[09:17:43] <kart_>	 Done!
[09:20:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[09:23:51] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[09:23:52] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1069.eqiad.wmnet with OS bullseye
[09:24:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[09:26:41] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1069.eqiad.wmnet
[09:26:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2060.codfw.wmnet with reason: host reimage
[09:26:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2059.codfw.wmnet with reason: host reimage
[09:28:26] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1069.eqiad.wmnet
[09:28:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1296-1300].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[09:28:44] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1301-1304].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[09:30:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bookworm
[09:30:26] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1301.eqiad.wmnet with OS bookworm
[09:31:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2060.codfw.wmnet with reason: host reimage
[09:33:11] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[09:33:30] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1067.eqiad.wmnet with OS bullseye
[09:33:45] <icinga-wm>	 PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:33:45] <icinga-wm>	 PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:20] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[09:35:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2059.codfw.wmnet with reason: host reimage
[09:35:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[09:39:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[09:39:31] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1067.eqiad.wmnet with OS bullseye
[09:39:57] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[09:40:13] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[09:44:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10415035 (fgiunchedi) Thank you all for looking into this -- let's indeed see how `3m` (or larger) goes and if that is satisfactory!  >>! In T382396#10413490,...
[09:45:43] <wikibugs>	 SRE-swift-storage, Commons, SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10415040 (TheDJ) I vaguely remember that this happened for invalid svgs when MediaWiki did not yet supply the content type to swift, and instead we relie...
[09:46:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[09:48:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[09:49:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all
[09:50:36] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:50:50] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2060.codfw.wmnet with OS bookworm
[09:50:54] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage
[09:50:58] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1301.eqiad.wmnet with reason: host reimage
[09:51:46] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1067.eqiad.wmnet with reason: host reimage
[09:52:15] <wikibugs>	 (PS1) Volans: api: allow to skip the START log to SAL [software/spicerack] - https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655)
[09:53:19] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454#10415055 (Volans) a:Volans
[09:53:44] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[09:53:47] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9041 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[09:53:49] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:54:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage
[09:54:30] <wikibugs>	 (PS1) TChin: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176)
[09:54:42] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2059.codfw.wmnet with OS bookworm
[09:55:46] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm
[09:57:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2059-2060].codfw.wmnet
[09:57:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2059-2060].codfw.wmnet
[09:58:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1301.eqiad.wmnet with reason: host reimage
[09:58:23] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T382477 (Etienne20) NEW
[09:58:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2057-2058].codfw.wmnet
[09:59:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2057-2058].codfw.wmnet
[10:00:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2057.codfw.wmnet with OS bookworm
[10:00:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2058.codfw.wmnet with OS bookworm
[10:00:57] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1067.eqiad.wmnet with reason: host reimage
[10:01:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2057
[10:01:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2057
[10:01:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2058
[10:01:07] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2058
[10:03:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all
[10:04:54] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:07:36] <wikibugs>	 (CR) Sergio Gimeno: [C:+1] "Just a question, lgtm." [mediawiki-config] - https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: Urbanecm)
[10:08:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[10:12:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[10:13:42] <wikibugs>	 SRE-Access-Requests, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10415112 (BTullis) I have created the principal for Jimmy. ` btullis@krb1001:~$ sudo manage_principals.py get jly  get_principal: Principal does not exist wh...
[10:13:52] <icinga-wm>	 RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:14:16] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1296.eqiad.wmnet with OS bookworm
[10:14:48] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1043.eqiad.wmnet with reason: host reimage
[10:15:37] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[10:16:02] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1297.eqiad.wmnet with OS bookworm
[10:16:41] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2058.codfw.wmnet with OS bookworm
[10:17:52] <icinga-wm>	 RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:18:20] <wikibugs>	 SRE-Access-Requests, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create Kerberos identity for Jimmy Ly - https://phabricator.wikimedia.org/T381986#10415120 (BTullis) Open→Resolved The `data.yaml` file already reflects the fact that a kerberos principal should be available for this account, s...
[10:18:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2057.codfw.wmnet with reason: host reimage
[10:18:32] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[10:18:32] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1067.eqiad.wmnet with OS bullseye
[10:18:37] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1301.eqiad.wmnet with OS bookworm
[10:18:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2058.codfw.wmnet with OS bookworm
[10:18:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2058
[10:18:45] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2058
[10:18:50] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1043.eqiad.wmnet with reason: host reimage
[10:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[10:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[10:19:52] <icinga-wm>	 PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:19:56] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1067.eqiad.wmnet
[10:20:25] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1302.eqiad.wmnet with OS bookworm
[10:21:21] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2057.codfw.wmnet with reason: host reimage
[10:22:42] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1067.eqiad.wmnet
[10:23:55] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:26:45] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:26:55] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:27:49] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:30:56] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T382477#10415142 (Bugreporter) Open→Invalid Wikimedia Maps is just an OpenStreetMap tile server, and you can use other ones.
[10:35:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2058.codfw.wmnet with reason: host reimage
[10:36:24] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[10:36:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:relforge
[10:37:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:relforge
[10:39:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2058.codfw.wmnet with reason: host reimage
[10:40:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2057.codfw.wmnet with OS bookworm
[10:42:03] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002"
[10:42:03] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1043.eqiad.wmnet with OS bookworm
[10:44:00] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:44:22] <moritzm>	 !log restarting slapd on r/w servers to pick up openssl security updates
[10:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:50] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:45:10] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10415158 (elukey) >>! In T378143#10414686, @Jhancock.wm wrote: > @elukey we're having an issue with this last server. es1043 keeps going to th...
[10:51:39] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1067.eqiad.wmnet
[10:51:46] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:51:49] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1067.eqiad.wmnet
[10:52:13] <moritzm>	 !log installing e2fsprogs security updates
[10:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:15] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1067.eqiad.wmnet
[10:55:46] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:55:49] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1067.eqiad.wmnet
[10:56:33] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/1105326 (owner: Muehlenhoff)
[10:56:46] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:42] <wikibugs>	 (PS1) Btullis: Revert "Configure the correct role for reimaging installing an-worker nodes" [puppet] - https://gerrit.wikimedia.org/r/1105673 (https://phabricator.wikimedia.org/T382410)
[10:57:52] <wikibugs>	 (PS2) Btullis: Revert "Configure the correct role for reimaging installing an-worker nodes" [puppet] - https://gerrit.wikimedia.org/r/1105673 (https://phabricator.wikimedia.org/T382410)
[10:57:52] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for Druid roles [puppet] - https://gerrit.wikimedia.org/r/1105326 (owner: Muehlenhoff)
[10:58:26] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Configure the correct role for reimaging installing an-worker nodes" [puppet] - https://gerrit.wikimedia.org/r/1105673 (https://phabricator.wikimedia.org/T382410) (owner: Btullis)
[10:58:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2058.codfw.wmnet with OS bookworm
[10:59:39] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2057-2058].codfw.wmnet
[10:59:42] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2057-2058].codfw.wmnet
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1100)
[11:00:13] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for e2fsprogs [puppet] - https://gerrit.wikimedia.org/r/1105675
[11:00:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2055-2056].codfw.wmnet
[11:01:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2055-2056].codfw.wmnet
[11:02:26] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hint for e2fsprogs [puppet] - https://gerrit.wikimedia.org/r/1105675 (owner: Muehlenhoff)
[11:02:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2056.codfw.wmnet with OS bookworm
[11:02:29] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2055.codfw.wmnet with OS bookworm
[11:02:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2056
[11:02:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2056
[11:02:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2055
[11:02:48] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2055
[11:06:46] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:20] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:08:19] <wikibugs>	 (CR) Elukey: [C:-1] "Trying manually the config in staging, it doesn't really work afaics, will update you folks when ready" [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: Elukey)
[11:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:46] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:11:00] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:11:09] <wikibugs>	 ops-eqiad, DC-Ops: Update the labels on an-presto100[1-5] to be an-worker106[5-9] - https://phabricator.wikimedia.org/T382482 (BTullis) NEW
[11:11:47] <wikibugs>	 ops-eqiad, DC-Ops: Update the labels on an-presto100[1-5] to be an-worker106[5-9] - https://phabricator.wikimedia.org/T382482#10415231 (BTullis)
[11:13:57] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1065.eqiad.wmnet
[11:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: user@0.service on elastic2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:34] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1065.eqiad.wmnet
[11:21:37] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1066.eqiad.wmnet
[11:22:51] <wikibugs>	 (CR) Urbanecm: [Growth] Disable Surfacing Add Link tasks on all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: Urbanecm)
[11:25:24] <wikibugs>	 (CR) Sergio Gimeno: [C:+1] [Growth] Disable Surfacing Add Link tasks on all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: Urbanecm)
[11:28:56] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1302.eqiad.wmnet with OS bookworm
[11:29:00] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:29:12] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1066.eqiad.wmnet
[11:29:15] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1067.eqiad.wmnet
[11:36:39] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1297.eqiad.wmnet with OS bookworm
[11:36:40] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1067.eqiad.wmnet
[11:36:43] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1068.eqiad.wmnet
[11:43:24] <moritzm>	 !log installing gsl security updates
[11:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:39] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1068.eqiad.wmnet
[11:44:41] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1069.eqiad.wmnet
[11:48:47] <moritzm>	 !log installing distro-info-data updates on bullseye
[11:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:26] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1069.eqiad.wmnet
[11:53:39] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10415303 (MoritzMuehlenhoff)
[11:55:21] <moritzm>	 !log installing gtk+2.0 security updates
[11:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:14] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:05:58] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:07:18] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:09:50] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:10:06] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:44] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1105689
[12:21:25] <wikibugs>	 (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1105691 (owner: L10n-bot)
[12:23:21] <wikibugs>	 ops-eqiad, DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485 (phaultfinder) NEW
[12:25:57] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for gtk+2.0 [puppet] - https://gerrit.wikimedia.org/r/1105693
[12:28:10] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hint for gtk+2.0 [puppet] - https://gerrit.wikimedia.org/r/1105693 (owner: Muehlenhoff)
[12:50:02] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database tigwiki (T381378)
[12:50:06] <stashbot>	 T381378: Prepare and check storage layer for tigwiki - https://phabricator.wikimedia.org/T381378
[12:52:50] <wikibugs>	 (CR) Gmodena: mw-content-history-reconcile-enrich: Enable K8 HA (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin)
[12:54:48] <wikibugs>	 (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1105691 (owner: L10n-bot)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1300)
[13:01:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1017-1020].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[13:02:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1017.eqiad.wmnet with OS bookworm
[13:05:33] <wikibugs>	 (CR) JMeybohm: create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[13:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:59] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for libsepol [puppet] - https://gerrit.wikimedia.org/r/1105705
[13:16:15] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tigwiki (T381378)
[13:16:19] <stashbot>	 T381378: Prepare and check storage layer for tigwiki - https://phabricator.wikimedia.org/T381378
[13:18:53] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hint for libsepol [puppet] - https://gerrit.wikimedia.org/r/1105705 (owner: Muehlenhoff)
[13:19:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1017.eqiad.wmnet with reason: host reimage
[13:23:09] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1017.eqiad.wmnet with reason: host reimage
[13:24:43] <wikibugs>	 (CR) Elukey: [C:+1] Deprecate remaining uses of system::role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1105329 (owner: Muehlenhoff)
[13:25:57] <wikibugs>	 (PS1) Muehlenhoff: Blacklist btrfs [puppet] - https://gerrit.wikimedia.org/r/1105707
[13:26:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate remaining uses of system::role [puppet] - https://gerrit.wikimedia.org/r/1105329 (owner: Muehlenhoff)
[13:33:53] <wikibugs>	 (PS1) Btullis: dse-k8s: Add tokens for dumps-legacy namespace [puppet] - https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489)
[13:36:18] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4720/co" [puppet] - https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) (owner: Btullis)
[13:36:47] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for Ceph roles [puppet] - https://gerrit.wikimedia.org/r/1105270 (owner: Muehlenhoff)
[13:38:48] <wikibugs>	 (PS1) Muehlenhoff: Remove system::role [puppet] - https://gerrit.wikimedia.org/r/1105709
[13:38:49] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492 (RobH) NEW
[13:39:12] <moritzm>	 !log installing libsepol security updates
[13:39:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:10] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492#10415566 (RobH) a:Andrew @Andrew,  Two call outs!  The original ordering task had a bad hostname range provided by you for racking "**Hostnames:** cloudvirt1068...
[13:41:22] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492#10415571 (RobH)
[13:43:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1017.eqiad.wmnet with OS bookworm
[13:45:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1018.eqiad.wmnet with OS bookworm
[13:47:18] <wikibugs>	 (CR) Elukey: [C:+1] Blacklist btrfs [puppet] - https://gerrit.wikimedia.org/r/1105707 (owner: Muehlenhoff)
[13:47:58] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Reader Survey: Undeploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[13:48:14] <wikibugs>	 (PS5) Filippo Giunchedi: prometheus: deploy instances from a single configuration [puppet] - https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087)
[13:48:30] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "I updated the commit message so gerritbot will connect it to the task, hope that’s okay." [mediawiki-config] - https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[13:56:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1303.eqiad.wmnet with OS bookworm
[13:57:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1298.eqiad.wmnet with OS bookworm
[13:57:41] <wikibugs>	 (CR) JMeybohm: create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[13:59:54] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1400).
[14:00:05] <jouncebot>	 danisztls: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:01:17] <Lucas_WMDE>	 I can probably deploy in a few minutes :)
[14:02:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1018.eqiad.wmnet with reason: host reimage
[14:04:32] <wikibugs>	 (PS1) Joal: Revert "[analytics][webrequest] Extend retention for unique devices analysis" [puppet] - https://gerrit.wikimedia.org/r/1105713
[14:04:41] <Lucas_WMDE>	 alright, I can deploy!
[14:04:44] <jinxer-wm>	 RESOLVED: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[14:04:44] <jinxer-wm>	 RESOLVED: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[14:04:57] <Lucas_WMDE>	 assuming danisztls is around, that is…
[14:05:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1297.eqiad.wmnet with OS bookworm
[14:05:58] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1018.eqiad.wmnet with reason: host reimage
[14:06:05] <wikibugs>	 (CR) Btullis: [C:+1] Revert "[analytics][webrequest] Extend retention for unique devices analysis" [puppet] - https://gerrit.wikimedia.org/r/1105713 (owner: Joal)
[14:06:40] <wikibugs>	 (PS1) Joal: Revert "Update webrequest raw retention period on HDFS" [puppet] - https://gerrit.wikimedia.org/r/1105714
[14:07:52] <wikibugs>	 (CR) Btullis: [C:+2] Revert "[analytics][webrequest] Extend retention for unique devices analysis" [puppet] - https://gerrit.wikimedia.org/r/1105713 (owner: Joal)
[14:08:11] <wikibugs>	 (CR) Filippo Giunchedi: "To be merged no earlier than Jan" [puppet] - https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[14:08:20] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Update webrequest raw retention period on HDFS" [puppet] - https://gerrit.wikimedia.org/r/1105714 (owner: Joal)
[14:10:40] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1302.eqiad.wmnet with OS bookworm
[14:11:28] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10415676 (Jhancock.wm) weird. when i ran the cookbook it was defaulting to puppet 7 since it was bookworm. not sure why it would do that. but!...
[14:11:52] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10415678 (Jhancock.wm) Open→Resolved
[14:13:07] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:27] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2055.codfw.wmnet with OS bookworm
[14:16:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1297.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1297.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:16:37] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1303.eqiad.wmnet with reason: host reimage
[14:17:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2055.codfw.wmnet with OS bookworm
[14:17:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2055
[14:17:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2055
[14:17:44] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1298.eqiad.wmnet with reason: host reimage
[14:18:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[14:18:43] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[14:20:03] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:20:04] <Lucas_WMDE>	 I have no idea where to reach danisztls for that config change 🤷
[14:20:17] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1303.eqiad.wmnet with reason: host reimage
[14:20:19] <Lucas_WMDE>	 he’s offline in slack afaict
[14:21:33] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: wikikube-worker1297.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:22:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10415724 (phaultfinder)
[14:23:23] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1298.eqiad.wmnet with reason: host reimage
[14:24:07] <wikibugs>	 SRE, DC-Ops, procurement: codfw:expansion: Console/management  wiring - https://phabricator.wikimedia.org/T382383#10415733 (Papaul) p:Triage→Medium
[14:24:37] <wikibugs>	 SRE, DC-Ops, procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10415734 (Papaul) p:Triage→Medium
[14:24:56] <wikibugs>	 ops-codfw, SRE, DC-Ops, procurement: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10415736 (Papaul)
[14:25:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1018.eqiad.wmnet with OS bookworm
[14:27:00] <tgr|away>	 Lucas_WMDE: if you are finished / won't start, I'll make a PrivateSettings change
[14:27:07] <Lucas_WMDE>	 tgr|away: go ahead
[14:27:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1019.eqiad.wmnet with OS bookworm
[14:27:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10415746 (phaultfinder)
[14:27:50] <wikibugs>	 (CR) Elukey: [C:+1] Remove system::role [puppet] - https://gerrit.wikimedia.org/r/1105709 (owner: Muehlenhoff)
[14:29:59] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1297.eqiad.wmnet with reason: host reimage
[14:30:51] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1302.eqiad.wmnet with reason: host reimage
[14:33:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1297.eqiad.wmnet with reason: host reimage
[14:34:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2055.codfw.wmnet with reason: host reimage
[14:35:05] <wikibugs>	 (CR) JHathaway: [C:+1] Blacklist btrfs [puppet] - https://gerrit.wikimedia.org/r/1105707 (owner: Muehlenhoff)
[14:35:25] <wikibugs>	 (CR) Bking: [C:+1] dse-k8s: Add tokens for dumps-legacy namespace [puppet] - https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) (owner: Btullis)
[14:35:54] <wikibugs>	 (CR) JHathaway: [C:+1] Remove system::role [puppet] - https://gerrit.wikimedia.org/r/1105709 (owner: Muehlenhoff)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1302.eqiad.wmnet with reason: host reimage
[14:38:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1303.eqiad.wmnet with OS bookworm
[14:39:02] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:39:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2055.codfw.wmnet with reason: host reimage
[14:40:43] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1304.eqiad.wmnet with OS bookworm
[14:42:49] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1298.eqiad.wmnet with OS bookworm
[14:43:20] <wikibugs>	 (PS1) Andrew Bogott: site + preseed entries for new cloudvirts [puppet] - https://gerrit.wikimedia.org/r/1105724 (https://phabricator.wikimedia.org/T382492)
[14:43:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1019.eqiad.wmnet with reason: host reimage
[14:44:02] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:12] <icinga-wm>	 RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1299.eqiad.wmnet with OS bookworm
[14:47:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1019.eqiad.wmnet with reason: host reimage
[14:48:17] <icinga-wm>	 PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:48:17] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:51:52] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1297.eqiad.wmnet with OS bookworm
[14:52:17] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:52:33] <wikibugs>	 (CR) Andrew Bogott: [C:+2] site + preseed entries for new cloudvirts [puppet] - https://gerrit.wikimedia.org/r/1105724 (https://phabricator.wikimedia.org/T382492) (owner: Andrew Bogott)
[14:53:17] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:55:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-74] - https://phabricator.wikimedia.org/T382492#10415805 (Andrew) >>! In T382492#10415566, @RobH wrote: > @Andrew, >  > Two call outs!  The original ordering task had a bad hostname...
[14:56:25] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1302.eqiad.wmnet with OS bookworm
[14:56:57] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2056.codfw.wmnet with OS bookworm
[14:57:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10415807 (Andrew)
[14:57:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2056.codfw.wmnet with OS bookworm
[14:57:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2056
[14:57:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2056
[14:58:42] <wikibugs>	 (CR) Urbanecm: [C:+2] [Growth] Disable Surfacing Add Link tasks on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: Urbanecm)
[14:59:24] <wikibugs>	 (Merged) jenkins-bot: [Growth] Disable Surfacing Add Link tasks on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) (owner: Urbanecm)
[14:59:43] <wikibugs>	 SRE, Observability-Metrics: node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768#10415826 (fgiunchedi) Open→Resolved a:fgiunchedi Has been done at some point in host overview dashboard, sample query: `node_cpu_frequency_hertz{instance=~...
[15:00:39] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1105302|[Growth] Disable Surfacing Add Link tasks on all wikis (T382037)]]
[15:00:44] <stashbot>	 T382037: Disable Alpha Test: Surfacing "Add a link" Structured Tasks (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T382037
[15:01:10] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Infrastructure-Foundations, Mail, Observability-Metrics, Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867#10415850 (fgiunchedi) Open→Invalid No longer valid I think, also...
[15:01:24] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1304.eqiad.wmnet with reason: host reimage
[15:01:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2055.codfw.wmnet with OS bookworm
[15:01:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10415854 (Andrew)
[15:02:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10415859 (Andrew) a:Andrew→None
[15:03:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Blacklist btrfs [puppet] - https://gerrit.wikimedia.org/r/1105707 (owner: Muehlenhoff)
[15:03:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Observability-Metrics: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155#10415860 (fgiunchedi) Open→Resolved a:fgiunchedi Done in {T370506}
[15:03:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[15:03:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:04:05] <wikibugs>	 (CR) Urbanecm: [C:+2] [Growth] Make the typage campaign not specific to 2023 [mediawiki-config] - https://gerrit.wikimedia.org/r/1102350 (https://phabricator.wikimedia.org/T380405) (owner: Urbanecm)
[15:04:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove system::role [puppet] - https://gerrit.wikimedia.org/r/1105709 (owner: Muehlenhoff)
[15:04:48] <wikibugs>	 (Merged) jenkins-bot: [Growth] Make the typage campaign not specific to 2023 [mediawiki-config] - https://gerrit.wikimedia.org/r/1102350 (https://phabricator.wikimedia.org/T380405) (owner: Urbanecm)
[15:05:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1304.eqiad.wmnet with reason: host reimage
[15:05:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1299.eqiad.wmnet with reason: host reimage
[15:06:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2056.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2056.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:16] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1105302|[Growth] Disable Surfacing Add Link tasks on all wikis (T382037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:08:21] <stashbot>	 T382037: Disable Alpha Test: Surfacing "Add a link" Structured Tasks (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T382037
[15:08:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1019.eqiad.wmnet with OS bookworm
[15:08:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[15:08:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1020.eqiad.wmnet with OS bookworm
[15:10:34] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1299.eqiad.wmnet with reason: host reimage
[15:10:37] <tgr|away>	 (I'll deploy the PrivateSettings change later, I realized it's better to do it together with some backports)
[15:11:01] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[15:12:28] <wikibugs>	 (PS1) Gergő Tisza: Make AuthManagerAutoConfig configuration key more distinctive [extensions/IPReputation] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180)
[15:14:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2056.codfw.wmnet with reason: host reimage
[15:15:29] <tgr|away>	 urbanecm: looks like the PrivateSettings change got into your sync
[15:15:46] <urbanecm>	 tgr|away: if you committed it, then that seems likely
[15:15:53] <urbanecm>	 so far it doesn't seem to break anything significant
[15:15:56] <tgr|away>	 probably a no-op, but if you see "Authentication failed because of inconsistent provider array" errors in the next few minutes, that's why
[15:16:06] <urbanecm>	 good to know
[15:17:10] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:17:12] <tgr|away>	 I'll re-add it to the private repo then (I committed it, then thought I'd rather deploy later and did reset it)
[15:17:40] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105302|[Growth] Disable Surfacing Add Link tasks on all wikis (T382037)]] (duration: 17m 00s)
[15:17:45] <stashbot>	 T382037: Disable Alpha Test: Surfacing "Add a link" Structured Tasks (FY24/25 WE1.2.6) - https://phabricator.wikimedia.org/T382037
[15:18:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2056.codfw.wmnet with reason: host reimage
[15:19:04] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1102350|[Growth] Make the typage campaign not specific to 2023 (T380405)]]
[15:19:08] <stashbot>	 T380405: Generic Campaign parameter: New Editor Recruitment as part of the Donor Thank You page - https://phabricator.wikimedia.org/T380405
[15:20:45] <tgr|away>	 ugh. it's on mwdebug but not mwmaint. I guess it was already reverted by the time scap backport got to the full sync?
[15:21:54] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1102350|[Growth] Make the typage campaign not specific to 2023 (T380405)]]
[15:22:09] <wikibugs>	 (PS2) Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916)
[15:22:14] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:15] <tgr|away>	 I guess I should have used scap lock
[15:22:40] <tgr|away>	 whatever, the next scap will clean it up
[15:23:14] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:23:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10415928 (Andrew) This server seems to have a raid controller, which is different from all the other standard ceph OSD nodes. Not sure how that happened b...
[15:26:11] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1304.eqiad.wmnet with OS bookworm
[15:26:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1301-1304].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[15:26:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:26:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1020.eqiad.wmnet with reason: host reimage
[15:27:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10415937 (phaultfinder)
[15:28:55] <wikibugs>	 (PS1) Gergő Tisza: SUL3: Disable more auth providers in the local leg of SUL3 login [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180)
[15:29:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:30:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1299.eqiad.wmnet with OS bookworm
[15:30:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/IPReputation] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180) (owner: Gergő Tisza)
[15:30:45] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) (owner: Gergő Tisza)
[15:31:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:31:50] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1300.eqiad.wmnet with OS bookworm
[15:31:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1020.eqiad.wmnet with reason: host reimage
[15:34:28] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1102350|[Growth] Make the typage campaign not specific to 2023 (T380405)]] (duration: 12m 33s)
[15:34:33] <stashbot>	 T380405: Generic Campaign parameter: New Editor Recruitment as part of the Donor Thank You page - https://phabricator.wikimedia.org/T380405
[15:35:00] <urbanecm>	 tgr|away: i'm now done with syncing
[15:35:08] <urbanecm>	 feel free to clean up if needed
[15:36:17] <wikibugs>	 (PS1) Gergő Tisza: [noop] Update private/readme.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180)
[15:36:24] <icinga-wm>	 PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:37:30] <tgr|away>	 thanks urbanecm! as far as I can tell, the last round of syncing did clean it up already
[15:37:38] <urbanecm>	 cool!
[15:38:22] <tgr|away>	 you might want to log the patches on wikitech/Deployments though
[15:39:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2056.codfw.wmnet with OS bookworm
[15:39:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: Gergő Tisza)
[15:39:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: Gergő Tisza)
[15:39:42] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:47:22] <logmsgbot>	 !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@6ed5237]: SEAL conda env hotfix
[15:48:32] <logmsgbot>	 !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@6ed5237]: SEAL conda env hotfix (duration: 01m 28s)
[15:48:43] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, observability, Patch-For-Review, Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10416007 (colewhite) Resolved→Open
[15:49:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2055-2056].codfw.wmnet
[15:49:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2055-2056].codfw.wmnet
[15:50:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1020.eqiad.wmnet with OS bookworm
[15:50:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1017-1020].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[15:53:17] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1300.eqiad.wmnet with reason: host reimage
[15:55:50] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1300.eqiad.wmnet with reason: host reimage
[15:57:48] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 (MoritzMuehlenhoff) NEW
[15:58:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508 (MoritzMuehlenhoff) NEW
[16:00:05] <jouncebot>	 dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1600).
[16:00:16] <wikibugs>	 (CR) DCausse: team-search-platform: Add alert for wdqs-categories lag (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[16:00:23] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509 (MoritzMuehlenhoff) NEW
[16:00:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[16:01:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511 (MoritzMuehlenhoff) NEW
[16:02:46] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512 (MoritzMuehlenhoff) NEW
[16:03:56] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513 (MoritzMuehlenhoff) NEW
[16:04:00] <wikibugs>	 (PS17) Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[16:04:11] <wikibugs>	 (CR) Kamila Součková: create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[16:05:11] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515 (MoritzMuehlenhoff) NEW
[16:07:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:09:04] <wikibugs>	 (PS2) Btullis: dse-k8s: Add tokens for mediawiki-data-dumps-legacy namespace [puppet] - https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489)
[16:09:22] <wikibugs>	 (CR) Gmodena: mw-content-history-reconcile-enrich: Enable K8 HA (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin)
[16:11:23] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[16:11:24] <icinga-wm>	 PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:11:42] <wikibugs>	 (CR) Jelto: "I'll deploy this in January." [puppet] - https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[16:12:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:14:44] <wikibugs>	 (PS3) Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916)
[16:14:55] <wikibugs>	 (CR) Bking: team-search-platform: Add alert for wdqs-categories lag (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[16:15:07] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1300.eqiad.wmnet with OS bookworm
[16:15:10] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1296-1300].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[16:15:32] <wikibugs>	 (PS1) Kamila Součková: sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - https://gerrit.wikimedia.org/r/1105752
[16:16:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10416237 (cmooney) >>! In T382396#10415035, @fgiunchedi wrote: > Yes and that's almost always the case, my understanding though is that the samples may not alw...
[16:17:56] <wikibugs>	 (PS3) Btullis: dse-k8s: Add tokens for mediawiki-dumps-legacy namespace [puppet] - https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489)
[16:18:04] <wikibugs>	 (CR) Volans: [C:+1] "LGTM if CI agrees :)" [cookbooks] - https://gerrit.wikimedia.org/r/1105752 (owner: Kamila Součková)
[16:21:48] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - https://gerrit.wikimedia.org/r/1105752 (owner: Kamila Součková)
[16:21:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518 (cmooney) NEW p:Triage→Low
[16:22:08] <wikibugs>	 (PS2) Kamila Součková: sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - https://gerrit.wikimedia.org/r/1105752
[16:22:40] <wikibugs>	 (PS1) Btullis: dse-k8s: Add a mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1105754 (https://phabricator.wikimedia.org/T382489)
[16:23:33] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] [noop] Update private/readme.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: Gergő Tisza)
[16:25:57] <wikibugs>	 ops-eqsin, SRE, Infrastructure-Foundations, netops: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519 (cmooney) NEW p:Triage→Low
[16:25:59] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Device is offline, see T382519
[16:26:46] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-eqsin IPv6 is DOWN: CRITICAL - Host Unreachable (2001:df2:e500:201:103:102:166:20) Cathal Mooney See T382519
[16:27:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-eqiad IPv6 is DOWN: CRITICAL - Host Unreachable (2620:0:861:202:208:80:155:69) Cathal Mooney See T382518
[16:27:14] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney See T382518
[16:27:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416338 (phaultfinder)
[16:28:38] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on ripe-atlas-eqiad,ripe-atlas-eqiad IPv6 with reason: Atlas device offline, scheduling reboot
[16:28:53] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on ripe-atlas-eqiad,ripe-atlas-eqiad IPv6 with reason: Atlas device offline, scheduling reboot
[16:28:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10416342 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7fe2fd80-b4a4-43f7-ba5a-5238c44bbd7a) set by cmooney@cumin1002 for 30 days,...
[16:29:14] <wikibugs>	 (CR) Kamila Součková: [C:+2] sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - https://gerrit.wikimedia.org/r/1105752 (owner: Kamila Součková)
[16:33:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[16:34:23] <wikibugs>	 (CR) JMeybohm: create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[16:34:30] <wikibugs>	 (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105756
[16:35:23] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on ripe-atlas-eqsin,ripe-atlas-eqsin IPv6 with reason: Atlas device offline, scheduling reboot
[16:35:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on ripe-atlas-eqsin,ripe-atlas-eqsin IPv6 with reason: Atlas device offline, scheduling reboot
[16:35:48] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.reimage: fix asking for confirmation when --force set [cookbooks] - https://gerrit.wikimedia.org/r/1105752 (owner: Kamila Součková)
[16:35:49] <wikibugs>	 ops-eqsin, SRE, Infrastructure-Foundations, netops: WMF RIPE Atlas probe in Eqsin offline - https://phabricator.wikimedia.org/T382519#10416397 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=68d77968-a0dd-4bd1-94ad-66be8ab508c5) set by cmooney@cumin1002 for 30 days, 0:00:00 on 2...
[16:36:04] <wikibugs>	 (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105758
[16:37:50] <swfrench-wmf>	 jouncebot: nowandnext
[16:37:50] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1600)
[16:37:50] <jouncebot>	 In 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1700)
[16:39:09] <wikibugs>	 (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105758 (owner: Clare Ming)
[16:39:11] <wikibugs>	 (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105756 (owner: Clare Ming)
[16:39:13] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, Wikimedia-production-error: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded - https://phabricator.wikimedia.org/T381109#10416420 (dancy)
[16:40:09] <wikibugs>	 (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105758 (owner: Clare Ming)
[16:40:09] <wikibugs>	 (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105756 (owner: Clare Ming)
[16:50:04] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10416509 (thcipriani) Stalled→Open a:thcipriani→Arnoldokoth >>! In T381851#10401485, @Scott_French wrote: > @Ammarpad - FYI, @thcipriani is out this week, so the next update...
[16:50:07] <wikibugs>	 (CR) DDesouza: "You're welcome. Thanks for the fix. Unfortunately I wasn't able to attend the deployment but I scheduled the change for next window." [mediawiki-config] - https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[16:51:42] <wikibugs>	 (PS1) BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - https://gerrit.wikimedia.org/r/1105760
[16:52:03] <wikibugs>	 (PS2) BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927)
[16:52:41] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[16:53:03] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[16:53:49] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[16:54:03] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[16:55:49] <wikibugs>	 (PS3) BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927)
[16:56:02] <wikibugs>	 (CR) Dzahn: [C:+1] "seems good and thanks for fixing that. just cant merge it right now since I am "out of office". Please let Jelto or Arnold merge. thanks" [puppet] - https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: Hashar)
[16:56:17] <wikibugs>	 (CR) Btullis: [C:+2] dse-k8s: Add tokens for mediawiki-dumps-legacy namespace [puppet] - https://gerrit.wikimedia.org/r/1105708 (https://phabricator.wikimedia.org/T382489) (owner: Btullis)
[16:56:43] <wikibugs>	 (CR) Btullis: [C:+2] dse-k8s: Add a mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1105754 (https://phabricator.wikimedia.org/T382489) (owner: Btullis)
[16:57:08] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4722/co" [puppet] - https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: BCornwall)
[16:58:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:00:06] <jouncebot>	 jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1700)
[17:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:29] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:00:40] <jhathaway>	 hmm, jouncebot should be smarter about not needlessly getting our attention :P
[17:00:51] <wikibugs>	 (Merged) jenkins-bot: dse-k8s: Add a mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1105754 (https://phabricator.wikimedia.org/T382489) (owner: Btullis)
[17:06:27] <rzl>	 jhathaway: reminds me to delete the next few ones
[17:06:33] <wikibugs>	 (PS4) Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408)
[17:06:42] <jhathaway>	 rzl: thanks
[17:07:09] <wikibugs>	 (CR) Elukey: "This one seems to work, I tested it locally with some metrics generated from maps2005. I think it is a reasonable baseline, then we can im" [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: Elukey)
[17:09:11] <rzl>	 oh never mind, they're gone! t.hcipriani++
[17:10:32] <wikibugs>	 (CR) DCausse: team-search-platform: Add alert for wdqs-categories lag (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[17:10:50] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm
[17:17:51] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:18:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:19:12] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:21:24] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:21:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:23:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:23:55] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1022.eqiad.wmnet with OS bookworm
[17:27:22] <wikibugs>	 (PS1) AOkoth: admin: Add ammarpad to deployment group [puppet] - https://gerrit.wikimedia.org/r/1105773 (https://phabricator.wikimedia.org/T381851)
[17:27:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416678 (phaultfinder)
[17:28:25] <wikibugs>	 (CR) Elukey: "elukey@kubestage1006:~$ sudo nsenter -t 2495625 -n curl -s localhost:9102/metrics | grep karto" [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: Elukey)
[17:38:33] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm
[17:39:00] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm
[17:39:11] <wikibugs>	 (PS1) Scott French: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1105776 (https://phabricator.wikimedia.org/T382517)
[17:39:42] <wikibugs>	 (PS1) Scott French: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105778 (https://phabricator.wikimedia.org/T382517)
[17:42:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1022.eqiad.wmnet with reason: host reimage
[17:46:01] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1022.eqiad.wmnet with reason: host reimage
[17:46:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416778 (phaultfinder)
[17:48:25] <swfrench-wmf>	 jouncebot: nowandnext
[17:48:26] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1700)
[17:48:26] <jouncebot>	 In 0 hour(s) and 11 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800)
[17:48:26] <jouncebot>	 In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800)
[17:48:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:49:12] <swfrench-wmf>	 unless there are any objections, I'll be backporting a fix for a log-spam issue shortly
[17:53:17] <wikibugs>	 (PS1) JHathaway: WIP: postfix logging [puppet] - https://gerrit.wikimedia.org/r/1105780
[17:53:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:55:41] <wikibugs>	 (CR) JHathaway: "Thanks for the patch Brett" [puppet] - https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: BCornwall)
[18:00:05] <jouncebot>	 bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1800)
[18:04:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [extensions/EventBus] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105778 (https://phabricator.wikimedia.org/T382517) (owner: Scott French)
[18:04:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [extensions/EventBus] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1105776 (https://phabricator.wikimedia.org/T382517) (owner: Scott French)
[18:06:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1022.eqiad.wmnet with OS bookworm
[18:06:22] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1022].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[18:06:46] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm
[18:07:12] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container to 2024-12-19-122113-production [deployment-charts] - https://gerrit.wikimedia.org/r/1105787
[18:07:12] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm
[18:08:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:14:03] <wikibugs>	 (CR) BryanDavis: [C:+2] developer-portal: Bump container to 2024-12-19-122113-production [deployment-charts] - https://gerrit.wikimedia.org/r/1105787 (owner: BryanDavis)
[18:15:13] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container to 2024-12-19-122113-production [deployment-charts] - https://gerrit.wikimedia.org/r/1105787 (owner: BryanDavis)
[18:21:59] <wikibugs>	 (PS1) DLynch: Set Flow to read-only on phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1105788
[18:22:18] <wikibugs>	 (PS2) DLynch: Set Flow to read-only on phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833)
[18:22:37] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) (owner: DLynch)
[18:26:15] <wikibugs>	 (PS1) Joal: Revert "Fix security checksum for web_request's refinery-drop-older-than" [puppet] - https://gerrit.wikimedia.org/r/1105790
[18:26:44] <wikibugs>	 (CR) Cwhite: [C:+2] webperf: Enable --dogstatsd on statsv.py [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[18:27:43] <wikibugs>	 (Merged) jenkins-bot: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105778 (https://phabricator.wikimedia.org/T382517) (owner: Scott French)
[18:29:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416878 (phaultfinder)
[18:31:07] <wikibugs>	 (Merged) jenkins-bot: maintenance: fix typo in job status logging [extensions/EventBus] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1105776 (https://phabricator.wikimedia.org/T382517) (owner: Scott French)
[18:31:14] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:31:33] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:31:39] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1105778|maintenance: fix typo in job status logging (T382517)]], [[gerrit:1105776|maintenance: fix typo in job status logging (T382517)]]
[18:31:44] <stashbot>	 T382517: PHP Warning seen by logspam-watch but not by mediawiki-errors logstash page - https://phabricator.wikimedia.org/T382517
[18:31:45] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:32:30] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:32:37] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:32:53] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Fix security checksum for web_request's refinery-drop-older-than" [puppet] - https://gerrit.wikimedia.org/r/1105790 (owner: Joal)
[18:32:57] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:38:41] <wikibugs>	 (PS1) Herron: pyrra: wdqs match site label with = instead of =~ [puppet] - https://gerrit.wikimedia.org/r/1105791 (https://phabricator.wikimedia.org/T302995)
[18:41:23] <wikibugs>	 (CR) Herron: [C:+2] pyrra: wdqs match site label with = instead of =~ [puppet] - https://gerrit.wikimedia.org/r/1105791 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[18:41:35] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1105778|maintenance: fix typo in job status logging (T382517)]], [[gerrit:1105776|maintenance: fix typo in job status logging (T382517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:41:39] <stashbot>	 T382517: PHP Warning seen by logspam-watch but not by mediawiki-errors logstash page - https://phabricator.wikimedia.org/T382517
[18:42:19] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[18:44:17] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10416914 (Andrew) I designated every drive a non-raid drive in the bios and now the install is completing. I can't make it stop installing though, it just...
[18:47:41] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm
[18:47:53] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[18:48:50] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105778|maintenance: fix typo in job status logging (T382517)]], [[gerrit:1105776|maintenance: fix typo in job status logging (T382517)]] (duration: 17m 11s)
[18:48:55] <stashbot>	 T382517: PHP Warning seen by logspam-watch but not by mediawiki-errors logstash page - https://phabricator.wikimedia.org/T382517
[18:49:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1011
[18:49:26] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1011
[18:51:28] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcontrol1011 - jclark@cumin1002"
[18:51:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcontrol1011 - jclark@cumin1002"
[18:51:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:52:13] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcontrol1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:53:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10416940 (Jclark-ctr)
[18:53:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:11] <wikibugs>	 (CR) BCornwall: [V:+1] postfix: Enable summary messages on TLS handshakes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: BCornwall)
[18:57:29] <wikibugs>	 (Abandoned) BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - https://gerrit.wikimedia.org/r/1105760 (https://phabricator.wikimedia.org/T381927) (owner: BCornwall)
[18:57:56] <wikibugs>	 (CR) BCornwall: WIP: postfix logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1105780 (owner: JHathaway)
[18:58:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:59:37] <wikibugs>	 (CR) BCornwall: [C:-1] WIP: postfix logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1105780 (owner: JHathaway)
[18:59:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416947 (phaultfinder)
[19:00:05] <jouncebot>	 dancy: Time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T1900).
[19:00:43] <dancy>	 o/
[19:01:12] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105793 (https://phabricator.wikimedia.org/T375667)
[19:01:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105793 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot)
[19:02:00] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105793 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot)
[19:03:46] <brennen>	 o/
[19:04:02] <wikibugs>	 (PS2) BCornwall: postfix: Enable summary messages on TLS handshakes [puppet] - https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: JHathaway)
[19:05:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10416963 (Jclark-ctr) Open→Resolved a:Jclark-ctr rebalanced pdu for B4. L1 A
[19:08:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[19:08:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[19:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:35] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.8  refs T375667
[19:12:40] <stashbot>	 T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667
[19:16:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:18:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:18:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcontrol1011.eqiad.wmnet with OS bookworm
[19:18:41] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10417006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet w...
[19:19:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:25:53] <rzl>	 hmm, that httpbb failure is a 503 for https://species.wikimedia.org/wiki/Sitta_europaea_caesia
[19:26:00] <rzl>	 cc dancy, brennen
[19:26:28] <rzl>	 not immediately sure it's a train thing, still looking, just fyi
[19:26:38] <wikibugs>	 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535 (10phaultfinder) 03NEW
[19:27:44] <dancy>	 hmmm
[19:31:23] <dancy>	 The https://species.wikimedia.org/wiki/Sitta_europaea_caesia page looks ok.
[19:31:27] <brennen>	 yeah
[19:31:51] <dancy>	 I'm going to take a break and see how things look in about 20 minutes
[19:31:53] <swfrench-wmf>	 I'm not seeing anything correlated / suspicious in metrics or logs
[19:31:55] <brennen>	 in _general_ things look pretty much like they did pre-deploy
[19:32:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye
[19:32:46] <brennen>	 i'm going for a slice of pizza but i'll take the laptop.
[19:32:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10417061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2004-dev.codfw.wmnet with OS bul...
[19:32:59] <rzl>	 yeah, I can't get it to repro either
[19:33:12] <rzl>	 probably just a hiccup and the alert will clear on the next hourly run, sorry for the false alarm
[19:35:11] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[19:35:31] <swfrench-wmf>	 thanks for spotting and investigating, rzl!
[19:38:55] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002"
[19:38:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002"
[19:38:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:41:02] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[19:46:29] <wikibugs>	 (03PS3) 10JHathaway: postfix: Enable summary messages on TLS handshakes [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927)
[19:46:55] <wikibugs>	 (03CR) 10JHathaway: postfix: Enable summary messages on TLS handshakes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway)
[19:50:32] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002"
[19:50:37] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: reset dns names for cloudcontrol1011 to newly-assigned ones - cmooney@cumin1002"
[19:50:37] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:51:13] <logmsgbot>	 !log jforrester@deploy2002 Started deploy [integration/docroot@4701376]: I1ea9f34dc6176da4cca5da50c293bd5ff62661b8 for T233089
[19:51:17] <stashbot>	 T233089: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089
[19:51:24] <logmsgbot>	 !log jforrester@deploy2002 Finished deploy [integration/docroot@4701376]: I1ea9f34dc6176da4cca5da50c293bd5ff62661b8 for T233089 (duration: 00m 10s)
[20:06:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle)
[20:09:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417106 (10phaultfinder)
[20:16:14] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:19:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:26] <wikibugs>	 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10417126 (10colewhite) 05Open→03Resolved Thanks @Volans for pointing those out!  With the latest de...
[20:34:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417150 (10phaultfinder)
[20:38:43] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1011.eqiad.wmnet with OS bookworm
[20:38:49] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10417151 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with...
[20:41:56] <wikibugs>	 (03PS1) 10Scott French: mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517)
[20:53:52] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241219T2100).
[21:00:04] <jouncebot>	 tgr, danisztls, kemayo, and Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:06] <Kemayo>	 o/
[21:00:20] <urbanecm>	 i can deploy today
[21:00:26] <danisztls>	 o/
[21:00:31] <tgr|away>	 o/
[21:00:42] <wikibugs>	 (03PS3) 10DLynch: Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833)
[21:00:59] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) (owner: 10DLynch)
[21:01:05] <tgr|away>	 I'll deploy my patches, it involves the private repo
[21:01:35] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:01:38] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:01:44] <urbanecm>	 tgr|away: ack
[21:02:12] <wikibugs>	 (03Merged) 10jenkins-bot: Set Flow to read-only on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105788 (https://phabricator.wikimedia.org/T378833) (owner: 10DLynch)
[21:02:25] <wikibugs>	 (03Merged) 10jenkins-bot: Reader Survey: Undeploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105027 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:03:07] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1105788|Set Flow to read-only on phase 1 wikis (T378833)]], [[gerrit:1105027|Reader Survey: Undeploy (T378660)]]
[21:03:12] <stashbot>	 T378833: [Config] Set Flow to read-only at all *Phase 1* wikis - https://phabricator.wikimedia.org/T378833
[21:03:13] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:03:37] * Krinkle is here
[21:04:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417219 (10phaultfinder)
[21:07:45] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, kemayo, dani: Backport for [[gerrit:1105788|Set Flow to read-only on phase 1 wikis (T378833)]], [[gerrit:1105027|Reader Survey: Undeploy (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:07:58] <urbanecm>	 Kemayo: danisztls: please test :)
[21:08:56] <Kemayo>	 urbanecm: Looks good.
[21:09:03] <urbanecm>	 ty
[21:09:22] <danisztls>	 urbanecm: looks good
[21:09:26] <urbanecm>	 ty
[21:09:28] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, kemayo, dani: Continuing with sync
[21:09:52] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[21:14:21] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105788|Set Flow to read-only on phase 1 wikis (T378833)]], [[gerrit:1105027|Reader Survey: Undeploy (T378660)]] (duration: 11m 14s)
[21:14:26] <stashbot>	 T378833: [Config] Set Flow to read-only at all *Phase 1* wikis - https://phabricator.wikimedia.org/T378833
[21:14:27] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:14:31] <urbanecm>	 and done
[21:14:49] <urbanecm>	 tgr|away: Krinkle: leaving your patches up to you :)
[21:15:14] <Krinkle>	 tgr|away: go ahead if you like. I'm writing some docs meanwhile.
[21:15:22] <tgr|away>	 ack
[21:16:03] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] SUL3: Disable more auth providers in the local leg of SUL3 login [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza)
[21:16:18] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Make AuthManagerAutoConfig configuration key more distinctive [extensions/IPReputation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza)
[21:17:15] <tgr|away>	 Krinkle: or I can deploy your change while I am waiting for the merges
[21:17:20] <Krinkle>	 Sure
[21:17:44] <wikibugs>	 (03PS4) 10Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916)
[21:18:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle)
[21:18:45] <wikibugs>	 (03Merged) 10jenkins-bot: Make AuthManagerAutoConfig configuration key more distinctive [extensions/IPReputation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105735 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza)
[21:18:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking)
[21:19:11] <tgr|away>	 oh wow that was unexpectedly fast
[21:19:20] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgWMEStatsBeaconUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle)
[21:20:03] <tgr|away>	 I guess I'll just wait for the CentralAuth merge and deploy everything together then
[21:20:06] <Krinkle>	 How dare we have CI jobs that complete under 5min.
[21:20:18] <Krinkle>	 This extension probably isn't in the wmf gate
[21:21:08] <wikibugs>	 (03PS5) 10Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - 10https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916)
[21:21:56] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] [noop] Update private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza)
[21:22:42] <wikibugs>	 (03Merged) 10jenkins-bot: [noop] Update private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105742 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza)
[21:25:34] <dancy>	 I just filed a ticket about errors showing up in logspam-watch, starting around 21:14:00
[21:25:34] <dancy>	 https://phabricator.wikimedia.org/T382546
[21:26:52] <wikibugs>	 (03Merged) 10jenkins-bot: SUL3: Disable more auth providers in the local leg of SUL3 login [extensions/CentralAuth] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105739 (https://phabricator.wikimedia.org/T369180) (owner: 10Gergő Tisza)
[21:28:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[21:29:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417332 (10phaultfinder)
[21:29:47] <tgr|away>	 !log deploying PrivateSettings change 95517e85
[21:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:30] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]]
[21:31:35] <stashbot>	 T369180: Ensure no AuthenticationRequests are added to the local login flow in SUL3 mode - https://phabricator.wikimedia.org/T369180
[21:31:36] <stashbot>	 T355837: Add Prometheus support to statsd.js via mw.track() - https://phabricator.wikimedia.org/T355837
[21:32:10] <Krinkle>	 standing by for test/staging
[21:37:35] <Krinkle>	 mw.loader.moduleRegistry['ext.wikimediaEvents'].packageExports['config.json'].WMEStatsBeaconUri
[21:37:35] <Krinkle>	 "/beacon/statsv" 
[21:37:35] <logmsgbot>	 !log tgr@deploy2002 krinkle, tgr: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:37:41] <stashbot>	 T369180: Ensure no AuthenticationRequests are added to the local login flow in SUL3 mode - https://phabricator.wikimedia.org/T369180
[21:37:41] <stashbot>	 T355837: Add Prometheus support to statsd.js via mw.track() - https://phabricator.wikimedia.org/T355837
[21:37:50] <Krinkle>	 LGTM on mwdebug-next
[21:39:22] <Krinkle>	 Also confirmed `mw.track('stats.mediawiki_gadget_track_example_total', 12)` works as expected 
[21:46:21] <wikibugs>	 (03CR) 10Scott French: "Thanks, Joe!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[21:48:03] <logmsgbot>	 !log tgr@deploy2002 krinkle, tgr: Continuing with sync
[21:48:54] <tgr|away>	 login errors in the next few minutes are expected
[21:53:04] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] (duration: 21m 34s)
[21:53:10] <stashbot>	 T369180: Ensure no AuthenticationRequests are added to the local login flow in SUL3 mode - https://phabricator.wikimedia.org/T369180
[21:53:11] <stashbot>	 T355837: Add Prometheus support to statsd.js via mw.track() - https://phabricator.wikimedia.org/T355837
[21:54:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417383 (10phaultfinder)
[21:57:25] <tgr|away>	 !log UTC late deploys done
[21:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:23] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4724/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway)
[22:07:39] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] postfix: Enable summary messages on TLS handshakes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105780 (https://phabricator.wikimedia.org/T381927) (owner: 10JHathaway)
[22:20:36] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[22:24:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417495 (10phaultfinder)
[22:26:20] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, I just left a minor question. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[22:39:36] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[23:03:26] <icinga-wm>	 PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:03:28] <icinga-wm>	 PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:03:28] <icinga-wm>	 PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:05:17] <icinga-wm>	 RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:05:18] <icinga-wm>	 RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Wed 15 Jan 2025 08:55:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:05:18] <icinga-wm>	 RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:08:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[23:08:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[23:23:50] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:24:40] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:39:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10417688 (10phaultfinder)
[23:55:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2061:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2061 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown