[00:01:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T370903)', diff saved to https://phabricator.wikimedia.org/P68003 and previous config saved to /var/cache/conftool/dbconfig/20240828-000117-ladsgroup.json
[00:01:36] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:03:53] <icinga-wm>	 PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:06:45] <icinga-wm>	 RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:07:33] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1067450 (owner: 10TrainBranchBot)
[00:12:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T371742)', diff saved to https://phabricator.wikimedia.org/P68004 and previous config saved to /var/cache/conftool/dbconfig/20240828-001214-ladsgroup.json
[00:12:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:12:22] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:12:30] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:16:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P68005 and previous config saved to /var/cache/conftool/dbconfig/20240828-001625-ladsgroup.json
[00:31:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P68006 and previous config saved to /var/cache/conftool/dbconfig/20240828-003132-ladsgroup.json
[00:44:37] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone+apache 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[00:46:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T370903)', diff saved to https://phabricator.wikimedia.org/P68007 and previous config saved to /var/cache/conftool/dbconfig/20240828-004639-ladsgroup.json
[00:46:42] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2218.codfw.wmnet with reason: Maintenance
[00:46:44] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:46:45] <wikibugs>	 (03PS2) 10Andrew Bogott: Keystone+apache 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[00:46:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2218.codfw.wmnet with reason: Maintenance
[00:47:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68008 and previous config saved to /var/cache/conftool/dbconfig/20240828-004702-ladsgroup.json
[00:48:28] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott)
[00:49:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Keystone+apache 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott)
[00:50:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10098491 (10JJMC89) Not yet - waiting on a response from @JbuattiWMF.
[00:51:43] <wikibugs>	 (03PS3) 10Andrew Bogott: Keystone+apache 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[00:53:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68009 and previous config saved to /var/cache/conftool/dbconfig/20240828-005342-ladsgroup.json
[00:53:46] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:57:10] <wikibugs>	 (03PS4) 10Andrew Bogott: Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[01:08:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P68010 and previous config saved to /var/cache/conftool/dbconfig/20240828-010849-ladsgroup.json
[01:23:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P68011 and previous config saved to /var/cache/conftool/dbconfig/20240828-012356-ladsgroup.json
[01:39:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68012 and previous config saved to /var/cache/conftool/dbconfig/20240828-013903-ladsgroup.json
[01:39:08] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[02:01:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[02:01:38] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[02:01:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T371742)', diff saved to https://phabricator.wikimedia.org/P68013 and previous config saved to /var/cache/conftool/dbconfig/20240828-020145-ladsgroup.json
[02:01:49] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:07:19] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:10:33] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 538, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:25] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) (owner: 10Gergő Tisza)
[02:36:27] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T371742)', diff saved to https://phabricator.wikimedia.org/P68014 and previous config saved to /var/cache/conftool/dbconfig/20240828-024627-ladsgroup.json
[02:46:32] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:01:27] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P68015 and previous config saved to /var/cache/conftool/dbconfig/20240828-030135-ladsgroup.json
[03:03:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P68016 and previous config saved to /var/cache/conftool/dbconfig/20240828-031642-ladsgroup.json
[03:23:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T371742)', diff saved to https://phabricator.wikimedia.org/P68017 and previous config saved to /var/cache/conftool/dbconfig/20240828-033149-ladsgroup.json
[03:31:51] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:31:54] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:32:04] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:32:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T371742)', diff saved to https://phabricator.wikimedia.org/P68018 and previous config saved to /var/cache/conftool/dbconfig/20240828-033211-ladsgroup.json
[03:38:18] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:03:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:23] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format db2230 [puppet] - 10https://gerrit.wikimedia.org/r/1067593
[04:53:22] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10098595 (10Marostegui)
[04:54:10] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2230 [puppet] - 10https://gerrit.wikimedia.org/r/1067593 (owner: 10Marostegui)
[05:42:07] <wikibugs>	 (03PS2) 10Chlod Alejandro: kaawiktionary: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064363 (https://phabricator.wikimedia.org/T368868)
[05:42:13] <wikibugs>	 (03PS2) 10Chlod Alejandro: kawikisource: re-add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064356 (https://phabricator.wikimedia.org/T368868)
[05:42:15] <wikibugs>	 (03PS2) 10Chlod Alejandro: bewwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063920 (https://phabricator.wikimedia.org/T368868)
[05:42:18] <wikibugs>	 (03PS2) 10Chlod Alejandro: kuswiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868)
[05:42:20] <wikibugs>	 (03PS2) 10Chlod Alejandro: mywikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063918 (https://phabricator.wikimedia.org/T368868)
[05:42:22] <wikibugs>	 (03PS2) 10Chlod Alejandro: iglwiki: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063916 (https://phabricator.wikimedia.org/T368868)
[05:42:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T371742)', diff saved to https://phabricator.wikimedia.org/P68019 and previous config saved to /var/cache/conftool/dbconfig/20240828-054237-ladsgroup.json
[05:42:42] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[05:56:23] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Network report: remove wdqs from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555) (owner: 10Ayounsi)
[05:57:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P68020 and previous config saved to /var/cache/conftool/dbconfig/20240828-055744-ladsgroup.json
[05:58:24] <wikibugs>	 (03Merged) 10jenkins-bot: Network report: remove wdqs from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555) (owner: 10Ayounsi)
[05:59:37] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[05:59:50] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T0600)
[06:01:33] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[06:02:04] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[06:04:36] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:36] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P68021 and previous config saved to /var/cache/conftool/dbconfig/20240828-061252-ladsgroup.json
[06:13:27] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1067440 (owner: 10Scott French)
[06:28:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T371742)', diff saved to https://phabricator.wikimedia.org/P68022 and previous config saved to /var/cache/conftool/dbconfig/20240828-062759-ladsgroup.json
[06:28:01] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[06:28:04] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[06:28:15] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[06:42:32] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1067397 (https://phabricator.wikimedia.org/T373426) (owner: 10Ssingh)
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T0700).
[07:00:05] <jouncebot>	 srishakatux: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:23:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:28:20] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format db2231 [puppet] - 10https://gerrit.wikimedia.org/r/1067766
[07:30:29] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2007.codfw.wmnet
[07:31:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2007.codfw.wmnet
[07:31:14] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2231 [puppet] - 10https://gerrit.wikimedia.org/r/1067766 (owner: 10Marostegui)
[07:35:19] <wikibugs>	 (03CR) 10Hashar: "I have noticed this was scheduled for this morning backport window, I would have done it unfortunately I have long forgot how to process c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux)
[07:35:48] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: remove backup from replication thread counter alert critical [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991)
[07:35:48] <wikibugs>	 (03CR) 10Arnaudb: "fix this morning's spam" [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: 10Arnaudb)
[07:36:53] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:37:06] <wikibugs>	 (03CR) 10Marostegui: "Please add a comment referencing why this is needed, like I asked on yesterday's patch" [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: 10Arnaudb)
[07:37:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:38:18] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:38:18] <wikibugs>	 (03PS3) 10Arnaudb: mariadb: remove backup from replication thread counter alert critical [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991)
[07:38:54] <wikibugs>	 (03CR) 10Arnaudb: "done, but lets try to also use git blame to avoid duplicating information" [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: 10Arnaudb)
[07:38:55] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] mariadb: remove backup from replication thread counter alert critical [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: 10Arnaudb)
[07:40:18] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: remove backup from replication thread counter alert critical [alerts] - 10https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: 10Arnaudb)
[07:40:53] <wikibugs>	 (03PS1) 10JMeybohm: Rename/Re-IP kubernetes2007 as wikikube-worker2047 [puppet] - 10https://gerrit.wikimedia.org/r/1067873 (https://phabricator.wikimedia.org/T372878)
[07:42:18] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Rename/Re-IP kubernetes2007 as wikikube-worker2047 [puppet] - 10https://gerrit.wikimedia.org/r/1067873 (https://phabricator.wikimedia.org/T372878) (owner: 10JMeybohm)
[07:43:56] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427#10098740 (10kostajh) 05Open→03Resolve...
[07:44:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2007 to wikikube-worker2047
[07:44:51] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[07:50:33] <wikibugs>	 (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[07:51:10] <wikibugs>	 (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[07:51:47] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2007 to wikikube-worker2047 - jayme@cumin1002"
[07:52:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2007 to wikikube-worker2047 - jayme@cumin1002"
[07:52:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:52:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2047
[07:52:58] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2047
[07:53:36] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2007 to wikikube-worker2047
[07:53:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098761 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes200...
[07:54:04] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2047.codfw.wmnet on all recursors
[07:54:07] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2047.codfw.wmnet on all recursors
[07:54:35] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2047.codfw.wmnet with OS bullseye
[07:54:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f4a5bda6340>
[07:54:46] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098762 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[07:58:18] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[07:58:20] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch
[07:58:26] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[07:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[07:59:55] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:00:04] <jouncebot>	 hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T0800)
[08:00:18] <andre>	 eh eh
[08:00:29] <wikibugs>	 (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[08:00:32] <hashar>	 not right now cause I am in the middle of some other things still :/
[08:01:18] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:04:15] <wikibugs>	 (03CR) 10Marostegui: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[08:04:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2047 - jayme@cumin1002"
[08:04:31] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2047 - jayme@cumin1002"
[08:04:31] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:04:31] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2047.codfw.wmnet 196.0.192.10.in-addr.arpa 6.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:04:34] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2047.codfw.wmnet 196.0.192.10.in-addr.arpa 6.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
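(Note on the wipe-cache arguments above: the two `.arpa` names are the reverse-DNS (PTR) names for the host's IPv4 and IPv6 addresses. A minimal sketch of how such names can be derived, using Python's standard `ipaddress` module. The two addresses are an assumption, read back from the reverse names in the log line and not stated anywhere in the log. `reverse_names` is a hypothetical helper, not part of the `sre.dns.wipe-cache` cookbook.)

```python
import ipaddress


def reverse_names(addresses):
    """Return the reverse-DNS (PTR) owner name for each IPv4/IPv6 address."""
    return [ipaddress.ip_address(a).reverse_pointer for a in addresses]


# Assumed addresses, inferred from the reverse names in the log entry above.
names = reverse_names(["10.192.0.196", "2620:0:860:101:10:192:0:196"])
for name in names:
    print(name)
```

(For IPv4 the octets are reversed under `in-addr.arpa`. For IPv6 the fully expanded address is reversed one hex digit at a time under `ip6.arpa`, which is why that name has 32 labels.)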
[08:04:35] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2047
[08:05:10] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:05:53] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2047
[08:05:53] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f4a5bda6340>
[08:06:27] <wikibugs>	 (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[08:07:53] <wikibugs>	 (03CR) 10Marostegui: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[08:08:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:44] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:08:52] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:54] <wikibugs>	 (03PS1) 10KartikMistry: Enable Section Translation in btm/dtp Wikpedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420)
[08:09:58] <hashar>	 andre: good morning :)
[08:10:07] <andre>	 hej hej hashar
[08:10:08] <hashar>	 so yeah I am a bit rusty this week, I am back from vacation!
[08:10:24] <hashar>	 and I ended up filing wayy toooo manyyyy buuuggggs yesterday
[08:10:26] <hashar>	 so I got lost
[08:10:32] <andre>	 hashar, my respect for dealing with yesterday. At some point I was just like "I'm not gonna be of any help" :-/
[08:11:02] <hashar>	 no :erit
[08:11:04] <hashar>	 no merit
[08:11:20] <hashar>	 just a shit ton of years and years of context being hidden somewhere in my brain cells
[08:11:21] <hashar>	 :D
[08:11:26] <andre>	 last week was so smooth, I was sure this week would blow up
[08:11:34] <hashar>	 reminds me I need to file a task to get rid of that chmod 777
[08:11:53] <hashar>	 or how rebuildLocalisationCache should really disappear
[08:11:54] <hashar>	  anyway
[08:11:56] <hashar>	 lets train
[08:13:18] <andre>	 hashar: hehe, go ahead (but if you want me to join in a call or such, just say)
[08:14:23] <wikibugs>	 (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[08:15:18] <hashar>	 ah yeah
[08:15:19] <hashar>	 hmm
[08:16:41] <nemo-yiannis>	 Hi, when it comes to mediawiki-config changes for LabsSettings (beta), do they go in the regular deployment window for mediawiki-config?
[08:17:41] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:17:54] <icinga-wm>	 PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:18:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: 10Eevans)
[08:18:58] <hashar>	 nemo-yiannis: usually yes, or at least sync up here :)
[08:19:11] <hashar>	 that theoretically should NOT affect prod, but one never knows
[08:19:16] <nemo-yiannis>	 ok, thanks, i added the patch for the next window
[08:19:23] <hashar>	 we can do it now :)
[08:19:26] <hashar>	 which change is it?
[08:20:12] <nemo-yiannis>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1065266
[08:21:40] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch
[08:21:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[08:21:48] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch
[08:22:02] <icinga-wm>	 PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:22:04] <hashar>	 nemo-yiannis: that needs rebase according to Gerrit :)
[08:22:29] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2047.codfw.wmnet with reason: host reimage
[08:22:30] <wikibugs>	 (03PS2) 10Eevans: Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460)
[08:23:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:33] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: 10Eevans)
[08:24:37] <hashar>	 thanks :)
[08:24:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:17] <wikibugs>	 (03Merged) 10jenkins-bot: Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: 10Eevans)
[08:26:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2047.codfw.wmnet with reason: host reimage
[08:26:27] <wikibugs>	 (03CR) 10Vgutierrez: prometheus: add script to check TCP MSS clamping value (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[08:27:05] <hashar>	 nemo-yiannis: the beta update job already triggered before the change got merged
[08:27:23] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067909 (https://phabricator.wikimedia.org/T366965)
[08:27:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067909 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[08:27:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:28:04] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067909 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[08:28:09] <nemo-yiannis>	 hashar: so the change should be live now?
[08:28:22] <hashar>	 not yet
[08:28:27] <hashar>	 the update job started before the change merged
[08:28:33] <hashar>	 I'll retrigger it
[08:30:16] <nemo-yiannis>	 ah ok got it
[08:30:42] <hashar>	 nemo-yiannis: https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/510658/console :)
[08:30:47] <hashar>	 that job does the git pulls
[08:31:01] <hashar>	 then it will trigger another job to run the deployment (using scap, like in prod)
[08:34:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:37:12] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.20  refs T366965
[08:37:16] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[08:40:17] <hashar>	 nemo-yiannis: your change should be live on the beta cluster now
[08:40:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:40:38] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:40:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T371742)', diff saved to https://phabricator.wikimedia.org/P68023 and previous config saved to /var/cache/conftool/dbconfig/20240828-084045-ladsgroup.json
[08:40:49] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[08:41:25] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:41:54] <nemo-yiannis>	 hashar: thanks, checking
[08:41:55] <wikibugs>	 (03PS1) 10Brouberol: airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503)
[08:44:00] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "As from yesterdays discussion, maybe change the maintainer to your team." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman)
[08:44:31] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:44:57] <wikibugs>	 (03PS1) 10Marostegui: installserver: Remove db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1067914
[08:45:31] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2047.codfw.wmnet with OS bullseye
[08:45:46] <wikibugs>	 (03PS1) 10Brouberol: Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503)
[08:45:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[08:46:10] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[08:46:46] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch
[08:46:50] <wikibugs>	 (03PS1) 10Brouberol: deployment_server: define postgresql-test read/write usernames [puppet] - 10https://gerrit.wikimedia.org/r/1067916 (https://phabricator.wikimedia.org/T373503)
[08:47:33] <wikibugs>	 (03CR) 10JMeybohm: icinga: remove check_etcd_mw_config_lastindex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi)
[08:47:34] <wikibugs>	 (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1067914 (owner: 10Marostegui)
[08:48:10] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Remove db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1067914 (owner: 10Marostegui)
[08:48:56] <jayme>	 !log running homer commit on lsw1-a6-codfw* - T372878
[08:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:00] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[08:49:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:58] <wikibugs>	 (03PS6) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048)
[08:50:55] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2047.codfw.wmnet
[08:50:56] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2047.codfw.wmnet
[08:52:45] <jayme>	 !log running homer commit on cr*codfw* - T372878
[08:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:23] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:53:24] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373505 (10JMeybohm) 03NEW
[08:56:33] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:59:41] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:59:42] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add function to wmf-netbox plugin to provide QoS config data (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[09:00:31] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:02:42] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[09:04:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:04] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "Nice!" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney)
[09:06:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:18] <wikibugs>	 (03PS7) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048)
[09:08:55] <wikibugs>	 (03PS8) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048)
[09:09:13] <wikibugs>	 (03CR) 10Klausman: "Done" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman)
[09:09:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:00] <wikibugs>	 (03PS1) 10Slyngshede: Management command for importing TOTP tokens from MediaWiki. [software/bitu] - 10https://gerrit.wikimedia.org/r/1067918
[09:10:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the (test) switch
[09:13:45] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:14:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:25] <wikibugs>	 (03PS2) 10Slyngshede: Management command for importing TOTP tokens from MediaWiki. [software/bitu] - 10https://gerrit.wikimedia.org/r/1067918
[09:15:35] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:15:56] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] deployment_server: define postgresql-test read/write usernames [puppet] - 10https://gerrit.wikimedia.org/r/1067916 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol)
[09:16:46] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol)
[09:17:48] <wikibugs>	 (03CR) 10Stevemunene: "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol)
[09:19:40] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:39] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:22:17] <wikibugs>	 (03PS2) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[09:22:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[09:23:14] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3764/co" [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[09:24:19] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373505#10099044 (10Clement_Goubert) →14Duplicate dup:03T373457
[09:24:20] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373491#10099045 (10Clement_Goubert) →14Duplicate dup:03T373457
[09:24:37] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457#10099040 (10Clement_Goubert)
[09:27:02] <wikibugs>	 (03PS3) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[09:28:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:28:41] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457#10099072 (10Clement_Goubert)
[09:31:07] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Upgrade airflow to 2.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[09:33:58] <claime>	 !log homer 'lsw1-a3-codfw*' commit T372878
[09:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:03] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[09:34:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[09:35:16] <claime>	 !log pooling wikikube-worker2043.codfw.wmnet - T372878
[09:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:33] <wikibugs>	 (03PS4) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[09:35:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2043.codfw.wmnet
[09:35:35] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2043.codfw.wmnet
[09:36:46] <claime>	 !log homer 'cr*codfw*' commit 'T372878'
[09:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:59] <wikibugs>	 (03PS5) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[09:37:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[09:37:43] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:37:48] <jinxer-wm>	 RESOLVED: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:38:58] <claime>	 ^expected
[09:39:47] <wikibugs>	 (03CR) 10Marostegui: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[09:40:08] <godog>	 !log start prometheus1005 bookworm upgrade - T326657
[09:40:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:12] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[09:42:58] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] sre.hosts.decommission: ask on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: 10Volans)
[09:43:41] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:43:48] <wikibugs>	 (03PS6) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[09:43:49] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 455, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:44:41] <icinga-wm>	 RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:46:53] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 537, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:47:02] <wikibugs>	 (03PS7) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[09:48:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[09:49:04] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch
[09:49:15] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
[09:49:39] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1067415 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn)
[09:50:53] <icinga-wm>	 RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:53:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:54:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[09:57:06] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch
[09:57:06] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[09:57:27] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch
[09:58:04] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:58:17] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1000)
[10:01:11] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
[10:05:19] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[10:05:21] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the switch from test-s1 to test-s1
[10:06:21] <wikibugs>	 (03PS1) 10Ladsgroup: Set ruwiki to non simple UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067930 (https://phabricator.wikimedia.org/T372694)
[10:07:24] <Amir1>	 jouncebot: nowandnext
[10:07:25] <jouncebot>	 For the next 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1000)
[10:07:25] <jouncebot>	 In 0 hour(s) and 52 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1100)
[10:07:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[10:07:40] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the switch from test-s1 to test-s1
[10:07:44] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[10:07:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[10:08:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T370903)', diff saved to https://phabricator.wikimedia.org/P68024 and previous config saved to /var/cache/conftool/dbconfig/20240828-100803-ladsgroup.json
[10:08:08] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[10:08:30] <wikibugs>	 (03CR) 10Ayounsi: "I left a bunch of comments here and there." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: 10Cathal Mooney)
[10:10:35] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: revert_risk_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067933 (https://phabricator.wikimedia.org/T369344)
[10:11:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney)
[10:11:36] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[10:11:40] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the switch from test-s1 to test-s1
[10:12:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T370903)', diff saved to https://phabricator.wikimedia.org/P68025 and previous config saved to /var/cache/conftool/dbconfig/20240828-101214-ladsgroup.json
[10:12:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[10:12:43] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from test-s1 to test-s1
[10:12:54] <wikibugs>	 (03PS16) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[10:13:37] <wikibugs>	 (03PS1) 10Hnowlan: shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128)
[10:18:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[10:18:58] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[10:22:00] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan)
[10:24:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[10:25:13] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:27:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P68026 and previous config saved to /var/cache/conftool/dbconfig/20240828-102721-ladsgroup.json
[10:27:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[10:27:58] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
[10:28:58] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[10:30:13] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:31:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: icinga: remove check_etcd_mw_config_lastindex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi)
[10:31:57] <wikibugs>	 (03PS2) 10Filippo Giunchedi: icinga: remove check_etcd_mw_config_lastindex [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523)
[10:33:41] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:36:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067930 (https://phabricator.wikimedia.org/T372694) (owner: 10Ladsgroup)
[10:37:33] <wikibugs>	 (03Merged) 10jenkins-bot: Set ruwiki to non simple UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067930 (https://phabricator.wikimedia.org/T372694) (owner: 10Ladsgroup)
[10:38:06] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
[10:38:08] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1067930|Set ruwiki to non simple UI (T372694)]]
[10:38:12] <stashbot>	 T372694: Switch ruwiki to use FlaggedRevs detailed interface mode - https://phabricator.wikimedia.org/T372694
[10:40:00] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.7.0 with updated plugin - cmooney@cumin1002
[10:41:32] <godog>	 !log start prometheus2005 bookworm upgrade - T326657
[10:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:36] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[10:42:09] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1067930|Set ruwiki to non simple UI (T372694)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:42:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P68027 and previous config saved to /var/cache/conftool/dbconfig/20240828-104228-ladsgroup.json
[10:42:36] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Enable Redis and TOTP support. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1064354 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[10:44:22] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[10:48:56] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067930|Set ruwiki to non simple UI (T372694)]] (duration: 10m 48s)
[10:49:00] <stashbot>	 T372694: Switch ruwiki to use FlaggedRevs detailed interface mode - https://phabricator.wikimedia.org/T372694
[10:50:58] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Relase v0.7.0 with updated plugin - cmooney@cumin1002
[10:51:07] <wikibugs>	 (CR) Hnowlan: [C:+2] shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) (owner: Hnowlan)
[10:52:55] <wikibugs>	 (Merged) jenkins-bot: shellbox: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) (owner: Hnowlan)
[10:53:09] <wikibugs>	 (PS1) Dreamy Jazz: Maintain ranked order of candidates in STV vote summary [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499)
[10:54:58] <wikibugs>	 (PS3) Cathal Mooney: Use Netbox data to build tunnel configuration on CRs [homer/public] - https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351)
[10:56:06] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Use Netbox data to build tunnel configuration on CRs [homer/public] - https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) (owner: Cathal Mooney)
[10:56:39] <wikibugs>	 (Merged) jenkins-bot: Use Netbox data to build tunnel configuration on CRs [homer/public] - https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) (owner: Cathal Mooney)
[10:57:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T370903)', diff saved to https://phabricator.wikimedia.org/P68028 and previous config saved to /var/cache/conftool/dbconfig/20240828-105735-ladsgroup.json
[10:57:37] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:57:40] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[10:57:50] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:57:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T370903)', diff saved to https://phabricator.wikimedia.org/P68029 and previous config saved to /var/cache/conftool/dbconfig/20240828-105757-ladsgroup.json
[10:58:39] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) (owner: Dreamy Jazz)
[10:58:46] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:58:46] <jouncebot>	 For the next 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1000)
[10:58:46] <jouncebot>	 In 0 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1100)
[10:59:12] <Dreamy_Jazz>	 Going to deploy now if that's okay.
[11:00:05] <jouncebot>	 mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1100).
[11:00:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) (owner: Dreamy Jazz)
[11:02:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T370903)', diff saved to https://phabricator.wikimedia.org/P68030 and previous config saved to /var/cache/conftool/dbconfig/20240828-110200-ladsgroup.json
[11:02:11] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[11:02:35] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[11:03:31] <wikibugs>	 (Merged) jenkins-bot: Maintain ranked order of candidates in STV vote summary [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) (owner: Dreamy Jazz)
[11:03:49] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1067938|Maintain ranked order of candidates in STV vote summary (T373499)]]
[11:03:53] <stashbot>	 T373499: Vote summaries for STV should display user ranked order instead of alphabetical candidate order - https://phabricator.wikimedia.org/T373499
[11:05:26] <wikibugs>	 (PS1) Ayounsi: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067940
[11:05:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T371742)', diff saved to https://phabricator.wikimedia.org/P68031 and previous config saved to /var/cache/conftool/dbconfig/20240828-110535-ladsgroup.json
[11:05:40] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[11:06:02] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1067938|Maintain ranked order of candidates in STV vote summary (T373499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:06:04] <wikibugs>	 (PS2) Ayounsi: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067940
[11:06:04] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[11:06:50] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:07:25] <wikibugs>	 (CR) Ayounsi: Refactor server provision script to select params based on profile (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: Cathal Mooney)
[11:07:51] <wikibugs>	 (CR) CI reject: [V:-1] Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067940 (owner: Ayounsi)
[11:09:09] <wikibugs>	 (PS3) Ayounsi: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067940
[11:10:33] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067938|Maintain ranked order of candidates in STV vote summary (T373499)]] (duration: 06m 44s)
[11:10:37] <stashbot>	 T373499: Vote summaries for STV should display user ranked order instead of alphabetical candidate order - https://phabricator.wikimedia.org/T373499
[11:12:50] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:12:58] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:14:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:17:03] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:17:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P68032 and previous config saved to /var/cache/conftool/dbconfig/20240828-111708-ladsgroup.json
[11:18:13] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:20:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P68033 and previous config saved to /var/cache/conftool/dbconfig/20240828-112042-ladsgroup.json
[11:22:58] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:22:59] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[11:23:13] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:23:53] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[11:25:48] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Overall LGTM.  I suspect for the bigger mistakes we are probably gonna need to go to the backup, but it's a good approach, code is clean (" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[11:29:05] <wikibugs>	 (PS1) Ayounsi: ProvisionServer: add types [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067960
[11:32:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P68034 and previous config saved to /var/cache/conftool/dbconfig/20240828-113215-ladsgroup.json
[11:34:18] <wikibugs>	 (PS1) Hnowlan: changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241)
[11:35:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P68035 and previous config saved to /var/cache/conftool/dbconfig/20240828-113549-ladsgroup.json
[11:38:04] <wikibugs>	 (CR) Clément Goubert: [C:+1] changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[11:39:09] <wikibugs>	 (CR) Hnowlan: [C:+2] changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[11:40:07] <wikibugs>	 (Merged) jenkins-bot: changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[11:40:50] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[11:41:05] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:41:09] <wikibugs>	 (CR) Cathal Mooney: [C:+1] RPKI: replace rpki2002 with rpki2003 [homer/public] - https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909) (owner: Ayounsi)
[11:41:55] <wikibugs>	 (PS2) Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control [homer/public] - https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850)
[11:42:08] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[11:42:57] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:42:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[11:43:40] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:43:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[11:44:44] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[11:45:55] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Enable Redis and TOTP support. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1064354 (https://phabricator.wikimedia.org/T372892) (owner: Slyngshede)
[11:46:12] <wikibugs>	 (CR) David Caro: [C:+2] toolforge:prometheus: limit ingress-nginx scrapes [puppet] - https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: David Caro)
[11:47:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T370903)', diff saved to https://phabricator.wikimedia.org/P68036 and previous config saved to /var/cache/conftool/dbconfig/20240828-114722-ladsgroup.json
[11:47:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:47:27] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[11:47:38] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:47:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68037 and previous config saved to /var/cache/conftool/dbconfig/20240828-114745-ladsgroup.json
[11:48:05] <wikibugs>	 (CR) Slyngshede: "@jhathaway@wikimedia.org would you mind doing another review on this. I had to add a feature to lookup UIDs in LDAP." [software/bitu] - https://gerrit.wikimedia.org/r/1065166 (owner: Slyngshede)
[11:48:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:50:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T371742)', diff saved to https://phabricator.wikimedia.org/P68038 and previous config saved to /var/cache/conftool/dbconfig/20240828-115057-ladsgroup.json
[11:51:00] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[11:51:03] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[11:51:13] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[11:51:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[11:51:17] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[11:51:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T371742)', diff saved to https://phabricator.wikimedia.org/P68039 and previous config saved to /var/cache/conftool/dbconfig/20240828-115123-ladsgroup.json
[11:55:01] <wikibugs>	 (PS2) KartikMistry: Enable Section Translation in bdr, btm and dtp Wikpedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420)
[11:58:41] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:01:43] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10099466 (JayCano) Approved as well!
[12:10:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:13:54] <wikibugs>	 (CR) David Caro: [C:+2] aptrepo: upgrade k8s components for 1.26 [puppet] - https://gerrit.wikimedia.org/r/1058560 (https://phabricator.wikimedia.org/T370246) (owner: Slavina Stefanova)
[12:15:28] <wikibugs>	 (CR) David Caro: [C:+2] aptrepo: upgrade k8s components for 1.26 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1058560 (https://phabricator.wikimedia.org/T370246) (owner: Slavina Stefanova)
[12:17:27] <wikibugs>	 (PS4) David Caro: maintain_dbusers: add prometheus stats [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955)
[12:17:54] <wikibugs>	 (CR) David Caro: "rebased on top of production branch, will do the refactor some other day" [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[12:19:17] <wikibugs>	 (CR) Hnowlan: [C:+1] mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: Jgiannelos)
[12:19:40] <MichaelG_WMF>	 Hi, we'll be running some CommunityConfiguration/GrowthExperiments maint scripts
[12:19:48] <MichaelG_WMF>	 They are not expected to disrupt anything.
[12:19:50] <MichaelG_WMF>	 !log T371228 running mwscript --wiki testwiki ./extensions/CommunityConfiguration/maintenance/setVersionData.php HelpPanel 1.0.0
[12:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:54] <stashbot>	 T371228: Page title component makes it easy to unintentionally blank page title - https://phabricator.wikimedia.org/T371228
[12:19:55] <urbanecm>	 (we being Michael and myself)
[12:20:09] <wikibugs>	 (CR) CI reject: [V:-1] maintain_dbusers: add prometheus stats [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[12:22:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1
[12:22:01] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.finalize (exit_code=97) for the switch from test-s1 to test-s1
[12:23:50] <MichaelG_WMF>	 !log T371228 running foreachwikiindblist growthexperiments ./extensions/CommunityConfiguration/maintenance/setVersionData.php HelpPanel 1.0.0
[12:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:12] <wikibugs>	 (PS1) David Caro: updates: fix k8s 1.26 url [puppet] - https://gerrit.wikimedia.org/r/1067985 (https://phabricator.wikimedia.org/T370246)
[12:24:36] <wikibugs>	 (CR) David Caro: [C:+2] updates: fix k8s 1.26 url [puppet] - https://gerrit.wikimedia.org/r/1067985 (https://phabricator.wikimedia.org/T370246) (owner: David Caro)
[12:24:37] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: Jgiannelos)
[12:25:56] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: Jgiannelos)
[12:26:19] <wikibugs>	 (PS1) Elukey: profile::docker::reporter: exclude dcl-puppet-pki from base rules [puppet] - https://gerrit.wikimedia.org/r/1067986 (https://phabricator.wikimedia.org/T372472)
[12:27:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1
[12:27:02] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.finalize (exit_code=99) for the switch from test-s1 to test-s1
[12:28:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1
[12:28:12] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.finalize (exit_code=99) for the switch from test-s1 to test-s1
[12:29:13] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1
[12:29:32] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from test-s1 to test-s1
[12:30:19] <icinga-wm>	 PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:32:03] <wikibugs>	 (PS5) Arnaudb: sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:32:29] <MichaelG_WMF>	 All done from our side
[12:32:57] <wikibugs>	 (CR) Arnaudb: "good catch!" [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:33:37] <wikibugs>	 (PS3) KartikMistry: Enable Section Translation in bdr, btm, and dtp Wikpedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420)
[12:37:15] <icinga-wm>	 RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:37:24] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add class-of-service scheduler and classifiers plus var to control [homer/public] - https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[12:38:01] <wikibugs>	 (Merged) jenkins-bot: Add class-of-service scheduler and classifiers plus var to control [homer/public] - https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[12:39:20] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:39:22] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:40:06] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[12:41:27] <icinga-wm>	 PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:41:56] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[12:42:45] <wikibugs>	 (PS1) Slyngshede: Fix syntax error [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1067988
[12:43:18] <wikibugs>	 (CR) Cathal Mooney: [C:+2] RPKI: replace rpki2002 with rpki2003 [homer/public] - https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909) (owner: Ayounsi)
[12:43:52] <wikibugs>	 (Merged) jenkins-bot: RPKI: replace rpki2002 with rpki2003 [homer/public] - https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909) (owner: Ayounsi)
[12:44:55] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:45:20] <wikibugs>	 (CR) CI reject: [V:-1] sre.switchdc.databases: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:45:41] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:48:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68040 and previous config saved to /var/cache/conftool/dbconfig/20240828-124801-ladsgroup.json
[12:48:06] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:49:00] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:51:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[12:53:25] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Enable caching in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1067990
[12:54:25] <wikibugs>	 (CR) CI reject: [V:-1] mobileapps: Enable caching in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1067990 (owner: Jgiannelos)
[12:55:21] <sukhe>	 hehe. puppetserver1002 is not well
[12:55:48] <sukhe>	 https://grafana.wikimedia.org/goto/nTxGvo3IR?orgId=1
[12:55:50] <wikibugs>	 (CR) Clément Goubert: [C:+1] profile::docker::reporter: exclude dcl-puppet-pki from base rules [puppet] - https://gerrit.wikimedia.org/r/1067986 (https://phabricator.wikimedia.org/T372472) (owner: Elukey)
[12:56:00] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[12:56:09] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Enable caching in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1067991
[12:56:23] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: Enable caching in prod [deployment-charts] - https://gerrit.wikimedia.org/r/1067990 (owner: Jgiannelos)
[12:57:17] <sukhe>	 going to power cycle it. 
[12:57:52] <jelto>	 ack thanks, I can not connect over ssh either (only mgmt)
[12:57:59] <sukhe>	 !log sudo ipmitool -I lanplus -H "puppetserver1002.mgmt.eqiad.wmnet" -U root -E chassis power cycle
[12:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:05] <sukhe>	 yeah, it's thrashing, clearly
[12:58:36] <jelto>	 let's see if it comes back properly after the reboot
[12:58:58] <Dreamy_Jazz>	 !log Started MediaModeration scan on enwiki, time limited to 24hrs - https://wikitech.wikimedia.org/wiki/MediaModeration
[12:58:59] <sukhe>	 this is also the reason for the widespread puppet failures https://puppetboard.wikimedia.org/
[12:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:04] <sukhe>	 so that should resolve as well
[12:59:54] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:59:59] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10099544 (ssingh)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1300).
[13:00:05] <jouncebot>	 Gerges, nemo-yiannis, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518 (Southparkfan) NEW
[13:00:11] <wikibugs>	 (CR) Ssingh: [C:+2] admin: add mszabo to deployment and move from ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1067397 (https://phabricator.wikimedia.org/T373426) (owner: Ssingh)
[13:00:25] <Dreamy_Jazz>	 \o My patch is already done, so don't need to use the deployment window
[13:00:30] <nemo-yiannis>	 I think my patch is already deployed too
[13:01:31] <icinga-wm>	 RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:01:51] <sukhe>	 and back
[13:02:03] <jelto>	 ssh works again for me
[13:02:23] <sukhe>	 nice. and we can let the failed agent runs run organically so nothing to do there
[13:02:45] <jelto>	 metrics in prometheus back as well. +1 ^
[13:03:00] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10099570 (Southparkfan)
[13:03:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P68041 and previous config saved to /var/cache/conftool/dbconfig/20240828-130308-ladsgroup.json
[13:03:31] <godog>	 !log delete 2023 5m blocks from thanos - T351927
[13:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:35] <stashbot>	 T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[13:04:34] <topranks>	 !log rolling out config additions of qos schedulers and policers to all network devices T339850
[13:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:37] <stashbot>	 T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850
[13:06:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:50] <sukhe>	 yeah, this most certainly needs a network-online.target
[13:06:59] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10099573 (10ssingh) 05Open→03Resolved a:03ssingh @mszabo: Your request has been merged, also added to Gerrit group wmf-deployment. Please try in ~30 mins. Tha...
[13:07:05] <sukhe>	 which now reminds me that this is the second time puppetserver1002 failed
[13:07:08] <wikibugs>	 (03CR) 10Elukey: [C:03+2] ldap: fix add-ldap-group script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey)
[13:07:10] <sukhe>	 because I think it failed last week as well
[13:07:20] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetmaster::frontend: allow puppetservers via ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:07:25] <sukhe>	 indeed, on 22 Aug as well 
[13:07:28] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] Add safe directory settings to the prod private repo's git config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053272 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[13:07:30] <sukhe>	 ok, I will file a task
[13:07:48] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: update Thumbor Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067382 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey)
[13:08:08] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: healthcheck every 1s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067992 (https://phabricator.wikimedia.org/T373517)
[13:08:20] <jelto>	 thank you!
[13:08:34] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[13:09:00] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:09:37] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-ca on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:43] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:50] <sukhe>	 ^ fixing since we rebooted the host
[13:09:59] <sukhe>	 then a proper fix is to add network-online.target, which I will do later
[13:10:01] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync
[13:10:06] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[13:10:12] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] Remove role::common::core_platform, s/Core Platform/ServiceOps/g [puppet] - 10https://gerrit.wikimedia.org/r/1064725 (owner: 10Hnowlan)
[13:11:40] <elukey>	 sukhe: o/ thanks for the puppetserver1002 fix, I didn't notice it, did it happen before? It is not great :(
[13:11:40] <Gerges>	 Here
[13:12:00] <sukhe>	 elukey: it did happen yep, same issue (thrashing) on Aug 22
[13:12:12] <sukhe>	 I will file a task for that later as well so don't worry
[13:12:39] <elukey>	 okok thanks, I'll try to check as well
[13:13:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:07] <sukhe>	 elukey: I will assign to you :P 
[13:15:11] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:15:15] <wikibugs>	 (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[13:15:16] <elukey>	 sukhe: fair enough :D
[13:18:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P68042 and previous config saved to /var/cache/conftool/dbconfig/20240828-131815-ladsgroup.json
[13:19:37] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-ca on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:19:43] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:20:09] <wikibugs>	 (03PS1) 10AOkoth: Revert "vrts: add yearly ticket count" [puppet] - 10https://gerrit.wikimedia.org/r/1067995
[13:21:24] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10099627 (10ssingh) Hi @Southparkfan! We need two things for this to move forward, otherwise it's a simple addition.  1. Approval from your manager/point of contact. I am going to assume that th...
[13:22:17] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 (owner: 10Jgiannelos)
[13:23:45] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10099632 (10ssingh) 05Resolved→03Open
[13:24:08] <Gerges>	 Who will deploy these backport patches?
[13:25:36] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10099634 (10ssingh) daily_account_consistency_check reports that:  ` seanleong-wmde present in privileged LDAP group (nda),but not present in data.yaml seanleong-wmde present in pri...
[13:27:10] <wikibugs>	 (03PS1) 10Ssingh: admin: add seanleong-wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067998 (https://phabricator.wikimedia.org/T371694)
[13:27:16] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 (owner: 10Jgiannelos)
[13:28:19] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 (owner: 10Jgiannelos)
[13:28:26] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add seanleong-wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067998 (https://phabricator.wikimedia.org/T371694) (owner: 10Ssingh)
[13:28:53] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: exclude dcl-puppet-pki from base rules [puppet] - 10https://gerrit.wikimedia.org/r/1067986 (https://phabricator.wikimedia.org/T372472) (owner: 10Elukey)
[13:31:20] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync
[13:31:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[13:31:55] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from test-s1 to test-s1
[13:32:21] <wikibugs>	 (03CR) 10Vgutierrez: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[13:32:32] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott)
[13:33:05] <Gerges>	 Hi Lucas_WMDE, Urbanecm, awight, TheresNoTime: who will deploy these backport patches?
[13:33:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68043 and previous config saved to /var/cache/conftool/dbconfig/20240828-133323-ladsgroup.json
[13:33:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:33:27] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:33:38] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:33:39] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10099650 (10ssingh) 05Open→03Resolved Added to data.yaml, closing this. Thanks!
[13:33:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T370903)', diff saved to https://phabricator.wikimedia.org/P68044 and previous config saved to /var/cache/conftool/dbconfig/20240828-133346-ladsgroup.json
[13:34:54] <wikibugs>	 (03PS5) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955)
[13:36:34] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[13:36:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[13:36:40] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from test-s1 to test-s1
[13:37:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[13:37:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T370903)', diff saved to https://phabricator.wikimedia.org/P68045 and previous config saved to /var/cache/conftool/dbconfig/20240828-133753-ladsgroup.json
[13:38:20] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1
[13:38:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:15] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from test-s1 to test-s1
[13:39:21] <wikibugs>	 (03CR) 10Ayounsi: P:idp Clean up CAS 6.6 and Tomcat 9 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede)
[13:39:39] <Gerges>	 jouncebot: 
[13:39:41] <wikibugs>	 10SRE-Access-Requests: abi uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T373522 (10ssingh) 03NEW
[13:39:50] <Gerges>	 jouncebot next
[13:39:50] <jouncebot>	 In 0 hour(s) and 20 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1400)
[13:39:58] <wikibugs>	 10SRE-Access-Requests: abi uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T373522#10099694 (10ssingh) p:05Triage→03High
[13:40:26] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[13:41:55] <wikibugs>	 (03PS3) 10Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589)
[13:42:19] <wikibugs>	 (03PS5) 10Andrew Bogott: Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[13:42:19] <wikibugs>	 (03PS1) 10Andrew Bogott: Add apache to codfw1dev cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/1068000 (https://phabricator.wikimedia.org/T359590)
[13:42:30] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068000 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott)
[13:45:32] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[13:45:34] <wikibugs>	 (03PS1) 10Jelto: gerrit: lower thresholds for gerrit, remove gerrit1004 config [puppet] - 10https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259)
[13:46:31] <wikibugs>	 (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[13:48:07] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3765/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[13:48:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Add apache to codfw1dev cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/1068000 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott)
[13:49:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2009.codfw.wmnet
[13:50:08] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman)
[13:50:27] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2009.codfw.wmnet
[13:51:04] <wikibugs>	 (03PS9) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048)
[13:51:43] <wikibugs>	 (03CR) 10Klausman: [C:03+2] kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman)
[13:52:16] <wikibugs>	 (03CR) 10Klausman: [V:03+2 C:03+2] kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman)
[13:52:54] <wikibugs>	 (03PS6) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955)
[13:53:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P68046 and previous config saved to /var/cache/conftool/dbconfig/20240828-135300-ladsgroup.json
[13:53:30] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] Revert "vrts: add yearly ticket count" [puppet] - 10https://gerrit.wikimedia.org/r/1067995 (owner: 10AOkoth)
[13:55:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1010.eqiad.wmnet with OS bookworm
[13:55:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[13:55:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099735 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[13:57:05] <wikibugs>	 (03CR) 10Elukey: kserve: Bump version to 0.13 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman)
[13:57:39] <wikibugs>	 (03PS1) 10Elukey: role::deployment_server::kubernetes: upgrade nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366)
[13:58:56] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: healthcheck every 1s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067992 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[13:59:14] <elukey>	 bd808: o/ - is it ok to deploy toolhub to pick up a new version of mcrouter for https://phabricator.wikimedia.org/T368366 ?
[13:59:20] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[13:59:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[13:59:59] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: healthcheck every 1s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067992 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan)
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1400)
[14:00:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[14:00:56] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[14:02:31] <wikibugs>	 (03PS1) 10Ayounsi: Provision script: Assign the mgmt IP as oob_ip [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068008
[14:03:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:05:52] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.dhcp for host ml-serve1009.eqiad.wmnet
[14:06:59] <bd808>	 Elukey: That should be fine, yes. Thanks for taking care of that. 
[14:08:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P68047 and previous config saved to /var/cache/conftool/dbconfig/20240828-140807-ladsgroup.json
[14:08:35] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host ml-serve1009.eqiad.wmnet
[14:11:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T371742)', diff saved to https://phabricator.wikimedia.org/P68048 and previous config saved to /var/cache/conftool/dbconfig/20240828-141108-ladsgroup.json
[14:12:27] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:12:29] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:13:03] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[14:13:45] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] icinga: remove check_etcd_mw_config_lastindex [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi)
[14:14:49] <wikibugs>	 (03CR) 10JMeybohm: [V:03+2 C:03+2] Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm)
[14:14:53] <wikibugs>	 (03CR) 10JMeybohm: [V:03+2 C:03+2] Update simple-cfssl to use wmf packages [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm)
[14:18:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1010.eqiad.wmnet with reason: host reimage
[14:18:41] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:18:41] <wikibugs>	 06SRE: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527 (10ssingh) 03NEW
[14:18:51] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:18:53] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:19:02] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:19:04] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:19:41] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:20:26] <wikibugs>	 (03PS6) 10Andrew Bogott: Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[14:20:26] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone/apache.conf: fix listen ports [puppet] - 10https://gerrit.wikimedia.org/r/1068014 (https://phabricator.wikimedia.org/T359590)
[14:20:35] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:21:44] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1010.eqiad.wmnet with reason: host reimage
[14:21:46] <wikibugs>	 (03PS1) 10Jgiannelos: Revert "mobileapps: Enable caching in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068015
[14:22:54] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] Revert "mobileapps: Enable caching in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068015 (owner: 10Jgiannelos)
[14:23:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T370903)', diff saved to https://phabricator.wikimedia.org/P68049 and previous config saved to /var/cache/conftool/dbconfig/20240828-142315-ladsgroup.json
[14:23:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[14:23:20] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:23:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[14:23:32] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:23:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:23:48] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:23:52] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mobileapps: Enable caching in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068015 (owner: 10Jgiannelos)
[14:23:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68050 and previous config saved to /var/cache/conftool/dbconfig/20240828-142355-ladsgroup.json
[14:24:11] <sukhe>	 XioNoX: topranks: are these routinator errors known? I have seen them fire up more recently this week than before
[14:24:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] keystone/apache.conf: fix listen ports [puppet] - 10https://gerrit.wikimedia.org/r/1068014 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott)
[14:24:59] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:25:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm
[14:25:04] <wikibugs>	 (03PS7) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955)
[14:25:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099847 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[14:25:21] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:26:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P68051 and previous config saved to /var/cache/conftool/dbconfig/20240828-142615-ladsgroup.json
[14:26:19] <XioNoX>	 sukhe: yeah... it's a bit of a pain, it's something we don't have control over, but we want to have alerts if there is a massive issue
[14:26:36] <XioNoX>	 I think the more people deploy RPKI, the more external fetches are going to fail
[14:26:38] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:26:40] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:26:50] <sukhe>	 ah so that is what it is saying
[14:27:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[14:28:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68052 and previous config saved to /var/cache/conftool/dbconfig/20240828-142821-ladsgroup.json
[14:28:26] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:28:35] <XioNoX>	 sukhe: I'm going to bump the threshold significantly. Or I should figure out how to have it fail after a certain percentage
[14:28:41] <XioNoX>	 and not an absolute value
[14:28:57] <sukhe>	 no worries on the alerts I guess (non-paging) but I was mostly curious what's up
[14:29:24] <XioNoX>	 sukhe: I hate alerting noise, so I should clean up "mine" first :)
[14:29:32] <sukhe>	 haha
[14:29:48] <sukhe>	 well if we want to go down that path of cleaning up alerting noise... :P
[14:31:09] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol)
[14:35:18] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon)
[14:35:56] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:36:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:36:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1010.eqiad.wmnet with OS bookworm
[14:36:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099860 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[14:36:27] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:00] <XioNoX>	 sukhe: actually they are moving away from rsync, so that's why only the failed ones are staying around, so we're always above 50% failure rate
[14:38:08] <XioNoX>	 anyway, I'll remove the alerting for that
[14:38:18] <sukhe>	 thanks <3
[14:39:06] <sukhe>	 for your contributions for reducing alert fatigue as well. only a 100 more to go across all SRE :P
[14:41:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P68053 and previous config saved to /var/cache/conftool/dbconfig/20240828-144122-ladsgroup.json
[14:43:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P68054 and previous config saved to /var/cache/conftool/dbconfig/20240828-144328-ladsgroup.json
[14:43:41] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:48:52] <wikibugs>	 (03CR) 10David Caro: [C:03+1] Put cloudcephosd1036 into service [puppet] - 10https://gerrit.wikimedia.org/r/1063861 (https://phabricator.wikimedia.org/T363344) (owner: 10Andrew Bogott)
[14:50:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1011.eqiad.wmnet with OS bookworm
[14:50:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[14:50:54] <wikibugs>	 (CR) David Caro: Make cloudcephosd1039-1041 into ceph osd nodes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) (owner: Andrew Bogott)
[14:53:13] <wikibugs>	 (PS1) Ayounsi: Remove RPKI rsync alerting [alerts] - https://gerrit.wikimedia.org/r/1068019
[14:54:33] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2009.codfw.wmnet
[14:54:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2009.codfw.wmnet
[14:54:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2009.codfw.wmnet
[14:54:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099910 (Jclark-ctr)
[14:55:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2009.codfw.wmnet with OS bullseye
[14:55:17] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10099911 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:55:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f9ac7a901f0>
[14:55:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[14:56:12] <wikibugs>	 (CR) Ssingh: [C:+1] "whatever that's worth 😊" [alerts] - https://gerrit.wikimedia.org/r/1068019 (owner: Ayounsi)
[14:56:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T371742)', diff saved to https://phabricator.wikimedia.org/P68056 and previous config saved to /var/cache/conftool/dbconfig/20240828-145629-ladsgroup.json
[14:56:32] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:56:34] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[14:56:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:56:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T371742)', diff saved to https://phabricator.wikimedia.org/P68057 and previous config saved to /var/cache/conftool/dbconfig/20240828-145651-ladsgroup.json
[14:58:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P68058 and previous config saved to /var/cache/conftool/dbconfig/20240828-145835-ladsgroup.json
[14:59:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2009 - cgoubert@cumin1002"
[14:59:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2009 - cgoubert@cumin1002"
[14:59:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:59:18] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2009.codfw.wmnet 197.16.192.10.in-addr.arpa 7.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:59:21] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2009.codfw.wmnet 197.16.192.10.in-addr.arpa 7.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:59:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2009
[14:59:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2009
[14:59:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f9ac7a901f0>
[15:00:58] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Re-enabling caching in prod after adding missing credentials [deployment-charts] - https://gerrit.wikimedia.org/r/1068024
[15:01:27] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:59] <wikibugs>	 (PS2) Jgiannelos: mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365)
[15:02:03] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:02:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1011.eqiad.wmnet with reason: host reimage
[15:03:01] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:05:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:05:42] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1011.eqiad.wmnet with reason: host reimage
[15:07:00] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[15:07:50] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[15:08:49] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[15:09:08] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 453, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:10:00] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:10:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:10:47] <wikibugs>	 (PS8) David Caro: maintain_dbusers: add prometheus stats [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955)
[15:11:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1002
[15:11:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1002
[15:11:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[15:13:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68059 and previous config saved to /var/cache/conftool/dbconfig/20240828-151342-ladsgroup.json
[15:13:45] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[15:13:47] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:13:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:13:50] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:13:52] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:13:58] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[15:14:00] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:14:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T370903)', diff saved to https://phabricator.wikimedia.org/P68060 and previous config saved to /var/cache/conftool/dbconfig/20240828-151404-ladsgroup.json
[15:14:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:16:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage
[15:17:50] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:18:31] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:18:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T370903)', diff saved to https://phabricator.wikimedia.org/P68061 and previous config saved to /var/cache/conftool/dbconfig/20240828-151831-ladsgroup.json
[15:18:52] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage
[15:20:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:22:04] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: sync
[15:22:17] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 535, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:43] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: sync
[15:23:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:23:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1011.eqiad.wmnet with OS bookworm
[15:23:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100078 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[15:23:41] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: sync
[15:23:53] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: sync
[15:27:27] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:30:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:32:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[15:33:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P68062 and previous config saved to /var/cache/conftool/dbconfig/20240828-153338-ladsgroup.json
[15:33:55] <wikibugs>	 (PS1) Hnowlan: timedmediahandler: revert using shellbox for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517)
[15:34:02] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:34:04] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:34:16] <wikibugs>	 (PS1) JMeybohm: Update cfssl-issuer to v0.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1068026 (https://phabricator.wikimedia.org/T337928)
[15:37:56] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] timedmediahandler: revert using shellbox for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[15:38:43] <wikibugs>	 (PS1) JMeybohm: Pin cfssl-issuer and CRDs chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/1068027 (https://phabricator.wikimedia.org/T337928)
[15:38:44] <wikibugs>	 (PS1) JMeybohm: Update cfss-issuer charts to v0.4.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928)
[15:40:33] <wikibugs>	 (CR) David Caro: "just needed rebasing, essentially, the click change it was depending on, just rebased on top of production to get the stats in before the " [puppet] - https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[15:40:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2009.codfw.wmnet with OS bullseye
[15:40:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100199 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:40:52] <claime>	 !log homer cr*codfw* commit 'T372878'
[15:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:56] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[15:41:11] <wikibugs>	 (PS1) Hnowlan: videoscaler: use ffmpeg from component [puppet] - https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128)
[15:41:35] <wikibugs>	 (CR) CI reject: [V:-1] videoscaler: use ffmpeg from component [puppet] - https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: Hnowlan)
[15:42:47] <icinga-wm>	 PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[15:43:11] <wikibugs>	 (CR) JMeybohm: Update cfss-issuer charts to v0.4.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[15:43:35] <icinga-wm>	 PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:07] <icinga-wm>	 PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:13] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:44:15] <icinga-wm>	 PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:21] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:44:41] <wikibugs>	 (PS2) Hnowlan: videoscaler: use ffmpeg from component [puppet] - https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128)
[15:45:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[15:45:35] <wikibugs>	 (CR) Hnowlan: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3767/co" [puppet] - https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: Hnowlan)
[15:45:36] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm
[15:46:51] <icinga-wm>	 PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:34] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@0b23c91]: Test Refine through Airflow
[15:47:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1009.eqiad.wmnet with OS bookworm
[15:47:45] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@0b23c91]: Test Refine through Airflow (duration: 00m 11s)
[15:47:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100223 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[15:47:57] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:48:28] <wikibugs>	 (CR) Clément Goubert: [C:+1] videoscaler: use ffmpeg from component [puppet] - https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: Hnowlan)
[15:48:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P68063 and previous config saved to /var/cache/conftool/dbconfig/20240828-154846-ladsgroup.json
[15:49:11] <claime>	 !log homer lsw1-b6-codfw* commit 'T372878'
[15:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:15] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[15:49:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm
[15:49:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100228 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[15:49:59] <icinga-wm>	 RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.13 ms
[15:50:03] <icinga-wm>	 RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.41 ms
[15:50:09] <icinga-wm>	 RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.72 ms
[15:50:13] <sukhe>	 hmm
[15:50:19] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:50:23] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:51:27] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:51:53] <icinga-wm>	 RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.57 ms
[15:52:41] <urandom>	 !log TRUNCATE-ing RESTBase tables (`{commons,enwiki,others,wikipedia}_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY`) — T342148
[15:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:45] <stashbot>	 T342148: restbase: high storage utilization - https://phabricator.wikimedia.org/T342148
[15:53:17] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: Cephadm doesn't find the correct image to run a shell - https://phabricator.wikimedia.org/T373185#10100246 (MatthewVernon) For reference - [[ https://github.com/ceph/ceph/pull/59485 | upstream MR to make cephadm more helpful ]]
[15:53:49] <icinga-wm>	 RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 73.47 ms
[15:54:27] <wikibugs>	 (CR) Hnowlan: [V:+1 C:+2] videoscaler: use ffmpeg from component [puppet] - https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: Hnowlan)
[15:57:21] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:57:22] <wikibugs>	 (PS1) Jgiannelos: Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - https://gerrit.wikimedia.org/r/1068032
[15:57:43] <wikibugs>	 (PS2) Jgiannelos: Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - https://gerrit.wikimedia.org/r/1068032
[15:59:05] <wikibugs>	 (CR) Jgiannelos: [C:+2] Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - https://gerrit.wikimedia.org/r/1068032 (owner: Jgiannelos)
[15:59:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1009.eqiad.wmnet with reason: host reimage
[16:00:04] <wikibugs>	 (Merged) jenkins-bot: Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - https://gerrit.wikimedia.org/r/1068032 (owner: Jgiannelos)
[16:00:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm
[16:00:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[16:01:01] <wikibugs>	 (PS1) Elukey: jaeger: add securityContext configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491)
[16:01:09] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[16:01:10] <wikibugs>	 (PS28) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[16:01:39] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[16:02:39] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[16:02:52] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1009.eqiad.wmnet with reason: host reimage
[16:03:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T370903)', diff saved to https://phabricator.wikimedia.org/P68065 and previous config saved to /var/cache/conftool/dbconfig/20240828-160354-ladsgroup.json
[16:03:57] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[16:04:10] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[16:04:12] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:05:52] <hnowlan>	 jouncebot: nowandnext
[16:05:52] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 54 minute(s)
[16:05:52] <jouncebot>	 In 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1700)
[16:06:24] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[16:07:54] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2009.codfw.wmnet
[16:08:05] <wikibugs>	 (PS1) Hashar: archiva: allow trailing slash for top directories [puppet] - https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031)
[16:09:41] <wikibugs>	 (CR) Hashar: "https://archiva.wikimedia.org/repository/mirrored yields a 404 not found since it lacks a trailing slash and that confused me :]" [puppet] - https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031) (owner: Hashar)
[16:13:56] <hnowlan>	 I need to do an out of step deployment to address some error rate issues in videoscaling 
[16:14:46] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:14:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:16:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by hnowlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[16:17:12] <wikibugs>	 (Merged) jenkins-bot: timedmediahandler: revert using shellbox for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) (owner: Hnowlan)
[16:17:18] <hashar>	 hnowlan: please do !
[16:17:28] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:17:31] <hashar>	 hnowlan: I ran the MediaWiki train earlier today (roughly 8 hours ago)
[16:17:36] <logmsgbot>	 !log hnowlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]]
[16:17:41] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[16:17:47] <hashar>	 ah it is happening already \o/
[16:18:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:19:16] <wikibugs>	 (PS1) Bartosz Dziewoński: CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507)
[16:19:43] <wikibugs>	 (CR) Bartosz Dziewoński: "I can backport in the evening if I get a +1." [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[16:19:50] <hnowlan>	 hashar: fortunately/unfortunately the errors are definitely unrelated to the train :) 
[16:20:01] <logmsgbot>	 !log hnowlan@deploy1003 hnowlan: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:20:08] <logmsgbot>	 !log hnowlan@deploy1003 hnowlan: Continuing with sync
[16:20:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:20:12] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1009.eqiad.wmnet with OS bookworm
[16:20:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100330 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[16:20:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100333 (Jclark-ctr)
[16:22:19] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:22:32] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:22:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T370903)', diff saved to https://phabricator.wikimedia.org/P68066 and previous config saved to /var/cache/conftool/dbconfig/20240828-162239-ladsgroup.json
[16:22:45] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:24:50] <logmsgbot>	 !log hnowlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] (duration: 07m 13s)
[16:24:54] <wikibugs>	 ops-magru: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#10100341 (RobH) Open→Resolved a:RobH All that remains off this #ops-magru tracking task is the traffic ramp up via T359054 and the geo maps update via T363722.  Since those are only #traf...
[16:24:54] <stashbot>	 T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517
[16:25:56] <wikibugs>	 (PS2) Bartosz Dziewoński: CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507)
[16:26:08] <hnowlan>	 all done
[16:26:47] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:26:49] <logmsgbot>	 !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:27:19] <wikibugs>	 (CR) Scott French: [C:+1] role::deployment_server::kubernetes: upgrade nutcracker [puppet] - https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[16:29:39] <wikibugs>	 (CR) Scott French: [C:+1] Update cfssl-issuer to v0.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1068026 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[16:30:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2009.codfw.wmnet
[16:30:11] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2009.codfw.wmnet
[16:32:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2009.codfw.wmnet
[16:32:35] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2009.codfw.wmnet
[16:32:49] <wikibugs>	 (CR) Elukey: "Tried to come up with a configuration for Jaeger, with the following assumptions:" [deployment-charts] - https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: Elukey)
[16:32:54] <wikibugs>	 (PS1) Ssingh: admin: fix typo in data.yaml [puppet] - https://gerrit.wikimedia.org/r/1068042
[16:33:55] <wikibugs>	 (CR) Elukey: jaeger: add securityContext configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: Elukey)
[16:34:16] <wikibugs>	 (CR) Ssingh: [C:+2] admin: fix typo in data.yaml [puppet] - https://gerrit.wikimedia.org/r/1068042 (owner: Ssingh)
[16:35:08] <wikibugs>	 (CR) Dzahn: [C:+2] prometheus/gerrit: also add size of tracking list to exporter [puppet] - https://gerrit.wikimedia.org/r/1067415 (https://phabricator.wikimedia.org/T373136) (owner: Dzahn)
[16:35:13] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm
[16:35:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100388 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[16:35:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2009.codfw.wmnet
[16:36:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2009.codfw.wmnet
[16:36:11] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100393 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool fo...
[16:36:13] <wikibugs>	 (CR) Scott French: [C:+1] Pin cfssl-issuer and CRDs chart versions [deployment-charts] - https://gerrit.wikimedia.org/r/1068027 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[16:36:19] <wikibugs>	 (CR) CI reject: [V:-1] CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[16:38:25] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm
[16:38:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[16:41:27] <wikibugs>	 ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457#10100422 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[16:41:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T370903)', diff saved to https://phabricator.wikimedia.org/P68067 and previous config saved to /var/cache/conftool/dbconfig/20240828-164131-ladsgroup.json
[16:41:36] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:44:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm
[16:44:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm
[16:44:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100481 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[16:44:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum...
[16:44:51] <wikibugs>	 (CR) Dzahn: [C:+2] "yes to lowering the values, also tested "2000 without burst" and it still had like 2 IPs affected. the values for gerrit1004 were only her" [puppet] - https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[16:46:52] <wikibugs>	 (CR) Scott French: [C:+1] Update cfss-issuer charts to v0.4.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[16:47:53] <wikibugs>	 (CR) Dzahn: "just want to clarify my comments aren't a -1 or anything. I'd say just address comments by Eoghan and merge it and try it out. Then follow" [puppet] - https://gerrit.wikimedia.org/r/1063733 (owner: AOkoth)
[16:48:36] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:49:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:51:03] <topranks>	 !log add qos config to management firewalls T339850
[16:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:07] <stashbot>	 T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850
[16:52:06] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [cookbooks] - https://gerrit.wikimedia.org/r/1067440 (owner: Scott French)
[16:52:07] <wikibugs>	 (CR) Scott French: [C:+2] sre.hosts.move-vlan: use name property in runtime_description [cookbooks] - https://gerrit.wikimedia.org/r/1067440 (owner: Scott French)
[16:56:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P68068 and previous config saved to /var/cache/conftool/dbconfig/20240828-165638-ladsgroup.json
[16:59:47] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:59:49] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[16:59:54] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:00:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:00:43] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:01:38] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:02:22] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:02:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T371742)', diff saved to https://phabricator.wikimedia.org/P68069 and previous config saved to /var/cache/conftool/dbconfig/20240828-170228-ladsgroup.json
[17:02:34] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[17:02:41] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:03:13] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:04:28] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.move-vlan: use name property in runtime_description [cookbooks] - https://gerrit.wikimedia.org/r/1067440 (owner: Scott French)
[17:04:30] <wikibugs>	 (PS7) Andrew Bogott: Keystone and Apache, 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[17:05:01] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:09:00] <wikibugs>	 (CR) Bking: [C:+2] airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol)
[17:09:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100575 (Jclark-ctr)
[17:09:59] <wikibugs>	 (Merged) jenkins-bot: airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol)
[17:10:13] <wikibugs>	 (CR) Bking: [C:+2] Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol)
[17:11:07] <wikibugs>	 (Merged) jenkins-bot: Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) (owner: Brouberol)
[17:11:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P68070 and previous config saved to /var/cache/conftool/dbconfig/20240828-171146-ladsgroup.json
[17:14:19] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:15:24] <wikibugs>	 ops-magru: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#10100596 (ssingh) Yes that's fair, the tasks left are on Traffic. Thanks!
[17:15:41] <jinxer-wm>	 RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:16:34] <wikibugs>	 (PS1) Cathal Mooney: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850)
[17:17:22] <wikibugs>	 (CR) CI reject: [V:-1] Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[17:17:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P68071 and previous config saved to /var/cache/conftool/dbconfig/20240828-171735-ladsgroup.json
[17:17:52] <wikibugs>	 (PS2) Cathal Mooney: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850)
[17:17:57] <wikibugs>	 (CR) Bartosz Dziewoński: "Failure unrelated, T282893" [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[17:18:18] <wikibugs>	 (PS1) Bartosz Dziewoński: auth: Relax AuthManager session state check while cde00b55 is deployed [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504)
[17:18:45] <wikibugs>	 (CR) Bartosz Dziewoński: "recheck" [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[17:19:16] <wikibugs>	 (PS1) Bartosz Dziewoński: Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288)
[17:19:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100609 (Jclark-ctr)
[17:22:16] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:22:20] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:22:25] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:22:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100613 (Jclark-ctr) a:klausman @klausman. If you can update preseed.yaml file for thes...
[17:23:16] <wikibugs>	 (CR) DLynch: [C:+1] Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: Bartosz Dziewoński)
[17:24:41] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2045.codfw.wmnet with OS bullseye
[17:24:57] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100621 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host...
[17:26:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T370903)', diff saved to https://phabricator.wikimedia.org/P68072 and previous config saved to /var/cache/conftool/dbconfig/20240828-172653-ladsgroup.json
[17:26:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:26:57] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:26:58] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:27:23] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[17:29:51] <logmsgbot>	 !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@cb0bc4d]: (no justification provided)
[17:30:10] <logmsgbot>	 !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@cb0bc4d]: (no justification provided) (duration: 00m 18s)
[17:31:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066902 (owner: Bartosz Dziewoński)
[17:32:05] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: Bartosz Dziewoński)
[17:32:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: Bartosz Dziewoński)
[17:32:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[17:32:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P68073 and previous config saved to /var/cache/conftool/dbconfig/20240828-173242-ladsgroup.json
[17:34:25] <wikibugs>	 (PS1) Bvibber: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546)
[17:35:04] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:35:10] <wikibugs>	 (PS2) Bvibber: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546)
[17:35:38] <sukhe>	 inflatador: is the above known?
[17:35:43] <sukhe>	 k8s-dse alert. known/expected
[17:36:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) (owner: Bvibber)
[17:37:15] <inflatador>	 sukhe it's known...not sure why there's a monitor on a non-prod service but I will suppress. Thanks for reaching out
[17:37:22] <sukhe>	 thanks <3
[17:38:36] <wikibugs>	 (CR) Gergő Tisza: [C:+1] CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[17:39:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage
[17:42:32] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm
[17:43:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100691 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[17:43:26] <wikibugs>	 (CR) Gergő Tisza: "Thanks!" [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: Bartosz Dziewoński)
[17:44:44] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance
[17:45:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance
[17:45:12] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage
[17:45:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T370903)', diff saved to https://phabricator.wikimedia.org/P68074 and previous config saved to /var/cache/conftool/dbconfig/20240828-174514-ladsgroup.json
[17:45:19] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:47:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T371742)', diff saved to https://phabricator.wikimedia.org/P68075 and previous config saved to /var/cache/conftool/dbconfig/20240828-174749-ladsgroup.json
[17:47:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[17:47:54] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[17:48:05] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[17:48:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T371742)', diff saved to https://phabricator.wikimedia.org/P68076 and previous config saved to /var/cache/conftool/dbconfig/20240828-174811-ladsgroup.json
[17:52:41] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:57:04] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2294.codfw.wmnet
[17:57:34] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:57:40] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2294.codfw.wmnet
[17:57:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:59:51] <wikibugs>	 (PS1) Alexandros Kosiaris: Rename mw2295 to wikikube-worker2048 [puppet] - https://gerrit.wikimedia.org/r/1068059 (https://phabricator.wikimedia.org/T372878)
[18:00:44] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm
[18:00:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100775 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10...
[18:01:25] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:03:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Rename mw2295 to wikikube-worker2048 [puppet] - https://gerrit.wikimedia.org/r/1068059 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris)
[18:04:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T370903)', diff saved to https://phabricator.wikimedia.org/P68077 and previous config saved to /var/cache/conftool/dbconfig/20240828-180401-ladsgroup.json
[18:04:06] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:04:26] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2045.codfw.wmnet with OS bullseye
[18:04:38] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10100786 (KFrancis) Hello @Southparkfan, please send your full name, mailing address, and email address to kfrancis@wikimedia.org and I will send the NDA agreement to you.  Thanks!
[18:04:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100785 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki...
[18:04:39] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2294 to wikikube-worker2048
[18:04:56] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[18:06:40] <wikibugs>	 (PS1) RLazarus: deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - https://gerrit.wikimedia.org/r/1068061
[18:08:06] <wikibugs>	 (PS1) Ssingh: admin: update keys for abi [puppet] - https://gerrit.wikimedia.org/r/1068062 (https://phabricator.wikimedia.org/T373522)
[18:08:13] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2294 to wikikube-worker2048 - akosiaris@cumin1002"
[18:08:47] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2294 to wikikube-worker2048 - akosiaris@cumin1002"
[18:08:47] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:08:48] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2048
[18:09:13] <wikibugs>	 (CR) CI reject: [V:-1] deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - https://gerrit.wikimedia.org/r/1068061 (owner: RLazarus)
[18:09:48] <wikibugs>	 (CR) Ssingh: [C:+2] admin: update keys for abi [puppet] - https://gerrit.wikimedia.org/r/1068062 (https://phabricator.wikimedia.org/T373522) (owner: Ssingh)
[18:10:06] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2048
[18:10:45] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2294 to wikikube-worker2048
[18:10:55] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100821 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2294 to...
[18:11:34] <wikibugs>	 (CR) Bartosz Dziewoński: "I like having all the patches in master, even if they're intended temporary. The only reason I didn't do that with the other patch is beca" [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: Bartosz Dziewoński)
[18:13:32] <wikibugs>	 SRE-Access-Requests, Patch-For-Review: abi uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T373522#10100826 (ssingh) Open→Resolved New key updated for shell access. Thanks @abi_ for the quick response!
[18:14:35] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2048.codfw.wmnet with OS bullseye
[18:14:45] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2048
[18:15:08] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[18:15:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host...
[18:16:36] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2045.codfw.wmnet
[18:16:37] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2045.codfw.wmnet
[18:16:50] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10100848 (Southparkfan) >>! In T373518#10100786, @KFrancis wrote: > Hello @Southparkfan, please send your full name, mailing address, and email address to kfrancis@wikimedia.org and I will sen...
[18:18:18] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2048 - akosiaris@cumin1002"
[18:18:22] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2048 - akosiaris@cumin1002"
[18:18:23] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:18:23] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2048.codfw.wmnet 164.0.192.10.in-addr.arpa 4.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[18:18:26] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2048.codfw.wmnet 164.0.192.10.in-addr.arpa 4.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[18:18:27] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2048
[18:19:02] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Keystone and Apache, 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[18:19:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P68078 and previous config saved to /var/cache/conftool/dbconfig/20240828-181908-ladsgroup.json
[18:19:32] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2048
[18:19:32] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2048
[18:22:49] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:53] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:23:11] <sukhe>	 I am assuming this is related to wikikube-worker2048
[18:24:02] <wikibugs>	 (PS2) RLazarus: deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - https://gerrit.wikimedia.org/r/1068061
[18:27:52] <wikibugs>	 (PS1) RLazarus: mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - https://gerrit.wikimedia.org/r/1068065
[18:28:49] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - https://gerrit.wikimedia.org/r/1068065 (owner: RLazarus)
[18:28:55] <topranks>	 sukhe: I’m afk but that ASN is an internal one so that’s likely yeah
[18:29:03] <icinga-wm>	 RECOVERY - Disk space on restbase2022 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2022&var-datasource=codfw+prometheus/ops
[18:29:13] <sukhe>	 topranks: go offline, we are here :P
[18:29:31] <topranks>	 haha
[18:30:14] <wikibugs>	 (PS1) Andrew Bogott: keystone service module: replace https-socket with uwsgi-socket [puppet] - https://gerrit.wikimedia.org/r/1068066 (https://phabricator.wikimedia.org/T359590)
[18:30:19] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10100888 (KFrancis) Thank you!  The NDA has been sent via DocuSign.  I'll confirm when it's complete.
[18:31:06] <wikibugs>	 (CR) Andrew Bogott: [C:+2] keystone service module: replace https-socket with uwsgi-socket [puppet] - https://gerrit.wikimedia.org/r/1068066 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[18:33:03] <wikibugs>	 (PS2) RLazarus: mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - https://gerrit.wikimedia.org/r/1068065
[18:34:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P68079 and previous config saved to /var/cache/conftool/dbconfig/20240828-183416-ladsgroup.json
[18:36:21] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage
[18:39:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage
[18:48:07] <wikibugs>	 (PS1) Ssingh: admin: add southparkfan to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1068073 (https://phabricator.wikimedia.org/T373518)
[18:48:36] <wikibugs>	 (CR) Ssingh: "Pending manager/sponsor approval." [puppet] - https://gerrit.wikimedia.org/r/1068073 (https://phabricator.wikimedia.org/T373518) (owner: Ssingh)
[18:49:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T370903)', diff saved to https://phabricator.wikimedia.org/P68080 and previous config saved to /var/cache/conftool/dbconfig/20240828-184923-ladsgroup.json
[18:49:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance
[18:49:28] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:49:40] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance
[18:49:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:49:43] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:49:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68081 and previous config saved to /var/cache/conftool/dbconfig/20240828-184950-ladsgroup.json
[18:53:03] <wikibugs>	 (CR) Ottomata: "+1 generally but:" [deployment-charts] - https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192) (owner: JMeybohm)
[18:53:48] <wikibugs>	 (CR) Ottomata: [C:+1] eventgate-main: Disable end-to-end readinessProbe (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: JMeybohm)
[18:54:07] <wikibugs>	 (CR) Ottomata: [C:+1] "If we do this for eventgate-main, we should do it for all the other eventgate service too." [deployment-charts] - https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: JMeybohm)
[18:54:20] <wikibugs>	 (CR) Ottomata: [C:+1] "(unresolving)" [deployment-charts] - https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: JMeybohm)
[18:59:40] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2048.codfw.wmnet with OS bullseye
[18:59:57] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100972 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki...
[19:02:17] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[19:08:02] <wikibugs>	 (CR) CDobbins: prometheus: add script to check TCP MSS clamping value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[19:08:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68082 and previous config saved to /var/cache/conftool/dbconfig/20240828-190817-ladsgroup.json
[19:08:22] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:08:32] <wikibugs>	 (PS29) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[19:09:11] <wikibugs>	 (CR) CI reject: [V:-1] prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[19:09:54] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:23:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P68083 and previous config saved to /var/cache/conftool/dbconfig/20240828-192325-ladsgroup.json
[19:24:10] <icinga-wm>	 RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[19:24:21] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[19:28:12] <wikibugs>	 (CR) Scott French: [C:+1] deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - https://gerrit.wikimedia.org/r/1068061 (owner: RLazarus)
[19:29:21] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[19:32:50] <icinga-wm>	 RECOVERY - Disk space on thanos-be1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[19:34:19] <wikibugs>	 (CR) RLazarus: [C:+2] deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - https://gerrit.wikimedia.org/r/1068061 (owner: RLazarus)
[19:36:30] <icinga-wm>	 RECOVERY - Disk space on thanos-be1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops
[19:38:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P68084 and previous config saved to /var/cache/conftool/dbconfig/20240828-193832-ladsgroup.json
[19:39:20] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:42:37] <wikibugs>	 (PS1) Scott French: kubernetes: re-name/IP kubernetes2029 as wikikube-worker2049 [puppet] - https://gerrit.wikimedia.org/r/1068081 (https://phabricator.wikimedia.org/T372878)
[19:43:30] <icinga-wm>	 RECOVERY - Disk space on thanos-be1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops
[19:43:42] <icinga-wm>	 RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[19:44:44] <wikibugs>	 (PS6) Srishakatux: Add site entry for mnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271)
[19:45:32] <wikibugs>	 (CR) Scott French: [C:+1] mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - https://gerrit.wikimedia.org/r/1068065 (owner: RLazarus)
[19:45:32] <icinga-wm>	 RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[19:49:20] <icinga-wm>	 RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[19:49:27] <wikibugs>	 (CR) Dzahn: [C:+2] "I added an annotation in grafana for the merge time of this. In the following 3 hours we still had 1 IP pop up a few times." [puppet] - https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[19:51:14] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[19:51:23] <Gerges>	 jouncebot: next
[19:51:23] <jouncebot>	 In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T2000)
[19:52:04] <icinga-wm>	 RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[19:53:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68085 and previous config saved to /var/cache/conftool/dbconfig/20240828-195339-ladsgroup.json
[19:53:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:53:44] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:53:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:54:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T370903)', diff saved to https://phabricator.wikimedia.org/P68086 and previous config saved to /var/cache/conftool/dbconfig/20240828-195401-ladsgroup.json
[19:54:12] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[19:56:33] <wikibugs>	 (CR) Srishakatux: "@hashar@free.fr As per @dziewonski@fastmail.fm the only extra step needed is to run the `namespaceDupes.php` maintenance script. Instructi" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux)
[19:57:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux)
[19:58:10] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[19:59:27] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T2000).
[20:00:04] <jouncebot>	 Gerges, MatmaRex, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <bvibber>	 o/
[20:00:29] <MatmaRex>	 hi. i have a couple of patches, they're all independent from each other
[20:00:54] <bvibber>	 "you have my bug." "and my task." "and my patch!"
[20:01:42] <cjming>	 hi i can deploy
[20:01:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T371742)', diff saved to https://phabricator.wikimedia.org/P68087 and previous config saved to /var/cache/conftool/dbconfig/20240828-200154-ladsgroup.json
[20:01:59] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[20:02:06] <bvibber>	 whee
[20:02:37] <cjming>	 lol
[20:02:40] <wikibugs>	 (PS30) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[20:02:54] <cjming>	 i'll go in order - is Gerges around?
[20:03:17] <cjming>	 otherwise i'll start with yours MatmaRex
[20:03:22] <Gerges>	 Here
[20:03:29] <cjming>	 good timing!
[20:03:38] <cjming>	 ok i'll start with yours Gerges
[20:03:57] <wikibugs>	 (PS3) GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468)
[20:04:17] <cjming>	 MatmaRex: can your backports go out together?
[20:04:44] <MatmaRex>	 cjming: yep
[20:05:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: GergesShamon)
[20:05:55] <wikibugs>	 (Merged) jenkins-bot: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: GergesShamon)
[20:06:05] <wikibugs>	 (CR) Clare Ming: [C:+2] auth: Relax AuthManager session state check while cde00b55 is deployed [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: Bartosz Dziewoński)
[20:06:11] <wikibugs>	 (CR) Clare Ming: [C:+2] Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: Bartosz Dziewoński)
[20:06:15] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1067433|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]]
[20:06:20] <stashbot>	 T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468
[20:06:20] <wikibugs>	 (CR) Clare Ming: [C:+2] CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[20:08:33] <cjming>	 MatmaRex: since your backports are averaging 28 minutes to merge, i'll do your config patch next, then bvibber's config patch, then come back to your backports
[20:08:45] <bvibber>	 ok
[20:09:41] <MatmaRex>	 thanks
[20:09:54] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:09:58] <logmsgbot>	 !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1067433|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:10:21] <cjming>	 Gerges: your patch is ready to test - lmk if/when to sync
[20:10:31] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - https://gerrit.wikimedia.org/r/1068065 (owner: RLazarus)
[20:11:14] <Gerges>	 cjming: How can I test this patch?
[20:11:18] <Reedy>	 You can't :)
[20:11:49] <cjming>	 lol - i guess we sync and hope for the best?
[20:12:21] <Reedy>	 If it's to this point, it's syntactically valid etc
[20:12:23] <Gerges>	 Yup :)
[20:12:24] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - https://gerrit.wikimedia.org/r/1068065 (owner: RLazarus)
[20:12:29] <cjming>	 alrighty
[20:12:37] <logmsgbot>	 !log cjming@deploy1003 cjming, gergesshamon: Continuing with sync
[20:12:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T370903)', diff saved to https://phabricator.wikimedia.org/P68088 and previous config saved to /var/cache/conftool/dbconfig/20240828-201250-ladsgroup.json
[20:12:55] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:13:27] <wikibugs>	 (PS2) Bartosz Dziewoński: logging: Use '??=' operator to reduce repetition [mediawiki-config] - https://gerrit.wikimedia.org/r/1066902
[20:17:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P68089 and previous config saved to /var/cache/conftool/dbconfig/20240828-201701-ladsgroup.json
[20:17:18] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067433|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] (duration: 11m 02s)
[20:17:21] <stashbot>	 T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468
[20:17:40] <cjming>	 Gerges: your patch should be live!
[20:18:00] <cjming>	 MatmaRex: doing your config patch now - assuming it's not really testable either?
[20:18:18] <cjming>	 other than maybe not breaking things
[20:18:20] <Gerges>	 Thanks :)
[20:18:26] <MatmaRex>	 cjming: yeah. it should work exactly the same as before
[20:18:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066902 (owner: Bartosz Dziewoński)
[20:19:26] <wikibugs>	 (Merged) jenkins-bot: logging: Use '??=' operator to reduce repetition [mediawiki-config] - https://gerrit.wikimedia.org/r/1066902 (owner: Bartosz Dziewoński)
[20:19:39] <cjming>	 MatmaRex: do you want to check on mwdebug when it's ready or should i just go ahead and sync?
[20:19:44] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1066902|logging: Use '??=' operator to reduce repetition]]
[20:20:11] <MatmaRex>	 cjming: i think it can be synced directly. CI checks for syntax errors, right? ;)
[20:20:18] <cjming>	 presumably
[20:21:51] <logmsgbot>	 !log cjming@deploy1003 cjming, matmarex: Backport for [[gerrit:1066902|logging: Use '??=' operator to reduce repetition]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:21:52] <logmsgbot>	 !log cjming@deploy1003 cjming, matmarex: Continuing with sync
[20:24:31] <wikibugs>	 (CR) Amire80: "Actually, anoop is probably right: we need to add the current namespaces as aliases for backwards compatibility, so as not to break the li" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux)
[20:25:02] <wikibugs>	 (PS3) Bvibber: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546)
[20:25:10] <bvibber>	 \o/
[20:26:23] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1066902|logging: Use '??=' operator to reduce repetition]] (duration: 06m 39s)
[20:26:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) (owner: Bvibber)
[20:27:02] <cjming>	 MatmaRex: config patch should be live - moving onto bvibber's patch while we wait for your backports to merge
[20:27:17] <MatmaRex>	 👍
[20:27:21] <wikibugs>	 (Merged) jenkins-bot: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) (owner: Bvibber)
[20:27:38] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1068054|Disable HLS VP9 video tracks in TimedMediaHandler (T373546)]]
[20:27:42] <stashbot>	 T373546: Migrate off HLS mov/mp4 experiment to a flat mov back-compat with WebM and MPEG-DASH - https://phabricator.wikimedia.org/T373546
[20:27:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P68090 and previous config saved to /var/cache/conftool/dbconfig/20240828-202757-ladsgroup.json
[20:29:48] <logmsgbot>	 !log cjming@deploy1003 bvibber, cjming: Backport for [[gerrit:1068054|Disable HLS VP9 video tracks in TimedMediaHandler (T373546)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:29:53] <cjming>	 bvibber: is your patch testable? up on mwdebug if so - lmk if/when to sync
[20:30:00] <bvibber>	 yeah lemme check it
[20:31:17] <bvibber>	 cjming: confirmed updated correctly :D
[20:31:22] <bvibber>	 go ahead and sync
[20:31:23] <cjming>	 nice - syncing!
[20:31:25] <logmsgbot>	 !log cjming@deploy1003 bvibber, cjming: Continuing with sync
[20:32:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P68091 and previous config saved to /var/cache/conftool/dbconfig/20240828-203208-ladsgroup.json
[20:35:49] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068054|Disable HLS VP9 video tracks in TimedMediaHandler (T373546)]] (duration: 08m 10s)
[20:35:55] <stashbot>	 T373546: Migrate off HLS mov/mp4 experiment to a flat mov back-compat with WebM and MPEG-DASH - https://phabricator.wikimedia.org/T373546
[20:35:57] <cjming>	 bvibber: should be live!
[20:36:05] <bvibber>	 \o/
[20:36:26] <wikibugs>	 (Merged) jenkins-bot: auth: Relax AuthManager session state check while cde00b55 is deployed [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: Bartosz Dziewoński)
[20:36:29] <wikibugs>	 (Merged) jenkins-bot: Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: Bartosz Dziewoński)
[20:36:30] <wikibugs>	 (Merged) jenkins-bot: CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: Bartosz Dziewoński)
[20:36:37] <bvibber>	 cjming: looks good, thanks!
[20:36:42] <cjming>	 yw!
[20:37:41] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1068051|auth: Relax AuthManager session state check while cde00b55 is deployed (T373504)]], [[gerrit:1068052|Fix missing definition of setSaveErrorMessage too (T373288)]], [[gerrit:1068041|CentralAuthApiSessionProvider: Avoid error in internal API requests (T373507)]]
[20:37:47] <stashbot>	 T373504: Wikimedia\NormalizedException\NormalizedException: Authentication failed because of inconsistent provider array - https://phabricator.wikimedia.org/T373504
[20:37:48] <stashbot>	 T373288: Show error message when a shortened URL prevents user from adding a topic or comment - https://phabricator.wikimedia.org/T373288
[20:37:48] <stashbot>	 T373507: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralAuthTokenManager::consume() must be of the type string, null given - https://phabricator.wikimedia.org/T373507
[20:39:45] <logmsgbot>	 !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1068051|auth: Relax AuthManager session state check while cde00b55 is deployed (T373504)]], [[gerrit:1068052|Fix missing definition of setSaveErrorMessage too (T373288)]], [[gerrit:1068041|CentralAuthApiSessionProvider: Avoid error in internal API requests (T373507)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:39:49] <cjming>	 MatmaRex: if they're testable, all your backports are up on test servers - lmk when to sync
[20:40:18] <MatmaRex>	 looking
[20:43:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P68092 and previous config saved to /var/cache/conftool/dbconfig/20240828-204305-ladsgroup.json
[20:44:37] <MatmaRex>	 cjming: looks good. i verified the DiscussionTools fix. the other two are not easily testable, but we have logging which will show whether they're fixed.
[20:44:46] <cjming>	 awesome - syncing!
[20:44:50] <logmsgbot>	 !log cjming@deploy1003 matmarex, cjming: Continuing with sync
[20:47:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T371742)', diff saved to https://phabricator.wikimedia.org/P68093 and previous config saved to /var/cache/conftool/dbconfig/20240828-204715-ladsgroup.json
[20:47:18] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance
[20:47:20] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[20:47:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance
[20:49:13] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068051|auth: Relax AuthManager session state check while cde00b55 is deployed (T373504)]], [[gerrit:1068052|Fix missing definition of setSaveErrorMessage too (T373288)]], [[gerrit:1068041|CentralAuthApiSessionProvider: Avoid error in internal API requests (T373507)]] (duration: 11m 31s)
[20:49:19] <stashbot>	 T373504: Wikimedia\NormalizedException\NormalizedException: Authentication failed because of inconsistent provider array - https://phabricator.wikimedia.org/T373504
[20:49:20] <stashbot>	 T373288: Show error message when a shortened URL prevents user from adding a topic or comment - https://phabricator.wikimedia.org/T373288
[20:49:20] <stashbot>	 T373507: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralAuthTokenManager::consume() must be of the type string, null given - https://phabricator.wikimedia.org/T373507
[20:49:36] <cjming>	 MatmaRex: everything should be live
[20:49:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[20:50:33] <MatmaRex>	 thanks cjming. very smooth deployment today :)
[20:51:09] <cjming>	 nice!
[20:51:15] <cjming>	 !log end of UTC late backport window
[20:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:53:36] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[20:54:02] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[20:57:45] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:58:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T370903)', diff saved to https://phabricator.wikimedia.org/P68094 and previous config saved to /var/cache/conftool/dbconfig/20240828-205812-ladsgroup.json
[20:58:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance
[20:58:16] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:58:27] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance
[20:58:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T370903)', diff saved to https://phabricator.wikimedia.org/P68095 and previous config saved to /var/cache/conftool/dbconfig/20240828-205834-ladsgroup.json
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T2100)
[21:07:34] <wikibugs>	 (CR) RLazarus: [C:+1] kubernetes: re-name/IP kubernetes2029 as wikikube-worker2049 [puppet] - https://gerrit.wikimedia.org/r/1068081 (https://phabricator.wikimedia.org/T372878) (owner: Scott French)
[21:10:19] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2029.codfw.wmnet
[21:10:56] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2029.codfw.wmnet
[21:12:34] <wikibugs>	 (PS1) Reedy: Use more use statements rather than inline FQN [mediawiki-config] - https://gerrit.wikimedia.org/r/1068107
[21:13:13] <wikibugs>	 (CR) Scott French: [C:+2] kubernetes: re-name/IP kubernetes2029 as wikikube-worker2049 [puppet] - https://gerrit.wikimedia.org/r/1068081 (https://phabricator.wikimedia.org/T372878) (owner: Scott French)
[21:13:17] <wikibugs>	 (CR) CI reject: [V:-1] Use more use statements rather than inline FQN [mediawiki-config] - https://gerrit.wikimedia.org/r/1068107 (owner: Reedy)
[21:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:15:52] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2029 to wikikube-worker2049
[21:16:11] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.netbox
[21:17:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T370903)', diff saved to https://phabricator.wikimedia.org/P68096 and previous config saved to /var/cache/conftool/dbconfig/20240828-211734-ladsgroup.json
[21:17:39] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[21:20:02] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2029 to wikikube-worker2049 - swfrench@cumin2002"
[21:20:31] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2029 to wikikube-worker2049 - swfrench@cumin2002"
[21:20:31] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:20:33] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2049
[21:20:52] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2049
[21:21:33] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2029 to wikikube-worker2049
[21:21:44] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10101297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes...
[21:22:37] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2049.codfw.wmnet on all recursors
[21:22:41] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2049.codfw.wmnet on all recursors
[21:23:29] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2049.codfw.wmnet with OS bullseye
[21:23:41] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2049
[21:23:44] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10101298 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w...
[21:24:19] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.netbox
[21:25:57] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:25:59] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:26:13] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:26:29] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:27:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:28:29] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2049 - swfrench@cumin2002"
[21:28:34] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2049 - swfrench@cumin2002"
[21:28:34] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:28:35] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2049.codfw.wmnet 59.16.192.10.in-addr.arpa 9.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:28:38] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2049.codfw.wmnet 59.16.192.10.in-addr.arpa 9.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:28:39] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2049
[21:29:00] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2049
[21:29:00] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2049
[21:30:59] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:31:18] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:32:14] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:32:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P68097 and previous config saved to /var/cache/conftool/dbconfig/20240828-213242-ladsgroup.json
[21:32:43] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:33:54] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:33:57] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:39:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:39:12] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:39:38] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:43:04] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:43:35] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[21:46:59] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage
[21:47:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P68098 and previous config saved to /var/cache/conftool/dbconfig/20240828-214749-ladsgroup.json
[21:49:22] <wikibugs>	 (PS1) Ladsgroup: Remove the "powered by mediawiki" override [mediawiki-config] - https://gerrit.wikimedia.org/r/1068120
[21:50:43] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage
[21:51:40] <icinga-wm>	 PROBLEM - Host kubernetes2029 is DOWN: PING CRITICAL - Packet loss = 100%
[22:02:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T370903)', diff saved to https://phabricator.wikimedia.org/P68099 and previous config saved to /var/cache/conftool/dbconfig/20240828-220256-ladsgroup.json
[22:02:58] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2194.codfw.wmnet with reason: Maintenance
[22:03:01] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[22:03:11] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2194.codfw.wmnet with reason: Maintenance
[22:03:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T370903)', diff saved to https://phabricator.wikimedia.org/P68100 and previous config saved to /var/cache/conftool/dbconfig/20240828-220318-ladsgroup.json
[22:05:48] <jinxer-wm>	 FIRING: KubernetesCalicoDown: mw2294.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2294.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:09:16] <wikibugs>	 (PS8) Jdlrobson: Enable the dark mode in Portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[22:11:42] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2049.codfw.wmnet with OS bullseye
[22:11:50] <wikibugs>	 (CR) Jdlrobson: [C:-1] "I think it's okay to do this for Commons, but we got feedback from English Wikipedia specifically that since portals are not maintained th" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: Ebrahim)
[22:11:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10101358 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik...
[22:13:58] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[22:14:19] <swfrench-wmf>	 !log running homer 'lsw1-b3-codfw*' commit 'T372878'
[22:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:23] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[22:17:17] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2049.codfw.wmnet
[22:17:18] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2049.codfw.wmnet
[22:17:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[22:18:14] <wikibugs>	 (CR) Bartosz Dziewoński: "There is already an alias for the 'Wikipedia' namespace on every Wikipedia: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux)
[22:19:51] <wikibugs>	 (PS7) Srishakatux: Add project talk aliases for mnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271)
[22:20:30] <Amir1>	 inflatador: ryankemper: I don't know if you're aware, but wdqs is lagging so much that the maxlag in wikidata is at 10, basically stopping all bots
[22:20:36] <Amir1>	 https://www.wikidata.org/w/api.php?action=query&format=json&titles=Main%20Page&maxlag=-1
[22:20:41] <ryankemper>	 yeah just saw it 30s ago actually
[22:20:44] <ryankemper>	 looking at graphs rn
[22:20:49] <Amir1>	 it's wdqs1015
[22:22:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T370903)', diff saved to https://phabricator.wikimedia.org/P68101 and previous config saved to /var/cache/conftool/dbconfig/20240828-222204-ladsgroup.json
[22:22:09] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[22:22:25] <ryankemper>	 !log [WDQS] `ryankemper@wdqs1015:~$ sudo systemctl restart wdqs-blazegraph`
[22:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:20] <swfrench-wmf>	 !log running homer 'cr*codfw*' commit 'T372878'
[22:23:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:24] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[22:23:58] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[22:30:20] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 449, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:33:05] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance
[22:33:18] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance
[22:33:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68102 and previous config saved to /var/cache/conftool/dbconfig/20240828-223325-ladsgroup.json
[22:33:29] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[22:37:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P68103 and previous config saved to /var/cache/conftool/dbconfig/20240828-223711-ladsgroup.json
[22:37:30] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 531, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:52:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P68104 and previous config saved to /var/cache/conftool/dbconfig/20240828-225218-ladsgroup.json
[23:04:54] <jinxer-wm>	 FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:07:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T370903)', diff saved to https://phabricator.wikimedia.org/P68105 and previous config saved to /var/cache/conftool/dbconfig/20240828-230726-ladsgroup.json
[23:07:28] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2205.codfw.wmnet with reason: Maintenance
[23:07:31] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:07:41] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2205.codfw.wmnet with reason: Maintenance
[23:07:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T370903)', diff saved to https://phabricator.wikimedia.org/P68106 and previous config saved to /var/cache/conftool/dbconfig/20240828-230748-ladsgroup.json
[23:26:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T370903)', diff saved to https://phabricator.wikimedia.org/P68107 and previous config saved to /var/cache/conftool/dbconfig/20240828-232653-ladsgroup.json
[23:26:58] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:38:51] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1068175
[23:38:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1068175 (owner: TrainBranchBot)
[23:42:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P68108 and previous config saved to /var/cache/conftool/dbconfig/20240828-234201-ladsgroup.json
[23:57:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P68109 and previous config saved to /var/cache/conftool/dbconfig/20240828-235708-ladsgroup.json
[23:57:24] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10101487 (KFrancis) Hi all, I'm confirming the NDA is signed. Please proceed with next steps.