[00:04:02] <icinga-wm>	 PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:04:52] <icinga-wm>	 RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:07:00] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:10:14] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[00:10:30] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:11:59] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1066946 (owner: TrainBranchBot)
[00:12:06] <wikibugs>	 (Abandoned) Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) (owner: Jdlrobson)
[00:20:25] <wikibugs>	 (PS3) Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020)
[00:20:34] <wikibugs>	 (PS3) Jdlrobson: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633)
[00:20:46] <wikibugs>	 (PS3) Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041
[00:21:53] <wikibugs>	 (PS4) Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - https://gerrit.wikimedia.org/r/1057041
[00:25:42] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[00:26:13] <wikibugs>	 (PS1) Jasmine_: admin: adding jasmine to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1066951
[00:29:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T371742)', diff saved to https://phabricator.wikimedia.org/P67852 and previous config saved to /var/cache/conftool/dbconfig/20240827-002944-ladsgroup.json
[00:29:48] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:29:51] <wikibugs>	 (CR) RLazarus: [C:+2] admin: adding jasmine to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1066951 (owner: Jasmine_)
[00:39:27] <logmsgbot>	 !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@663c843] (releasing): (no justification provided)
[00:40:08] <logmsgbot>	 !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@663c843] (releasing): (no justification provided) (duration: 00m 40s)
[00:42:30] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[00:44:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P67853 and previous config saved to /var/cache/conftool/dbconfig/20240827-004451-ladsgroup.json
[00:59:16] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[00:59:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P67854 and previous config saved to /var/cache/conftool/dbconfig/20240827-005958-ladsgroup.json
[01:14:44] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[01:15:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T371742)', diff saved to https://phabricator.wikimedia.org/P67855 and previous config saved to /var/cache/conftool/dbconfig/20240827-011505-ladsgroup.json
[01:15:08] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[01:15:10] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[01:15:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[01:15:26] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:15:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T371742)', diff saved to https://phabricator.wikimedia.org/P67856 and previous config saved to /var/cache/conftool/dbconfig/20240827-011527-ladsgroup.json
[01:15:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:30:14] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[01:45:44] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[01:49:58] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:50:30] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Idle - NTT, AS2914/IPv6: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0200)
[02:01:10] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[02:17:58] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[02:18:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:18:40] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:23:08] <brett>	 !log Import corto 0.3-1 into bookworm-wikimedia apt archive
[02:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:36] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Connect - NTT, AS2914/IPv6: Connect - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:33:26] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[02:36:27] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:38:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:56] <icinga-wm>	 RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[02:49:04] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox
[02:57:57] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0300)
[03:03:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:50] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:29:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T371742)', diff saved to https://phabricator.wikimedia.org/P67857 and previous config saved to /var/cache/conftool/dbconfig/20240827-032902-ladsgroup.json
[03:29:07] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:30:38] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:30:52] <icinga-wm>	 PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv6: Idle - NTT, AS2914/IPv4: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:37:02] <icinga-wm>	 RECOVERY - Disk space on restbase2021 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2021&var-datasource=codfw+prometheus/ops
[03:44:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P67858 and previous config saved to /var/cache/conftool/dbconfig/20240827-034409-ladsgroup.json
[03:52:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:53:04] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:59:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P67859 and previous config saved to /var/cache/conftool/dbconfig/20240827-035916-ladsgroup.json
[03:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0400)
[04:01:38] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.17 (duration: 01m 28s)
[04:14:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T371742)', diff saved to https://phabricator.wikimedia.org/P67860 and previous config saved to /var/cache/conftool/dbconfig/20240827-041424-ladsgroup.json
[04:14:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[04:14:28] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[04:14:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[04:14:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T371742)', diff saved to https://phabricator.wikimedia.org/P67861 and previous config saved to /var/cache/conftool/dbconfig/20240827-041446-ladsgroup.json
[05:18:11] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Add db2232 to test-s4" [puppet] - https://gerrit.wikimedia.org/r/1067158
[05:33:55] <logmsgbot>	 !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@0b23c91]: (no justification provided)
[05:34:14] <logmsgbot>	 !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@0b23c91]: (no justification provided) (duration: 00m 18s)
[05:39:44] <wikibugs>	 (PS2) KartikMistry: Section Translation: Fix some language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/1064696
[05:40:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064696 (owner: KartikMistry)
[05:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0600)
[06:04:36] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:36] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:30] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:30] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T371742)', diff saved to https://phabricator.wikimedia.org/P67862 and previous config saved to /var/cache/conftool/dbconfig/20240827-062302-ladsgroup.json
[06:23:07] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[06:31:01] <wikibugs>	 (PS1) Ammarpad: Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414)
[06:36:24] <wikibugs>	 (CR) Slyngshede: [C:+2] Fix incomplete table.vertical styles causing broken layout [software/bitu] - https://gerrit.wikimedia.org/r/1056002 (owner: Bartosz Dziewoński)
[06:36:47] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1056002 (owner: Bartosz Dziewoński)
[06:38:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:38:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P67863 and previous config saved to /var/cache/conftool/dbconfig/20240827-063809-ladsgroup.json
[06:39:13] <wikibugs>	 (CR) Slyngshede: [C:+2] Fix incomplete table.vertical styles causing broken layout [software/bitu] - https://gerrit.wikimedia.org/r/1056002 (owner: Bartosz Dziewoński)
[06:41:19] <wikibugs>	 (Merged) jenkins-bot: Fix incomplete table.vertical styles causing broken layout [software/bitu] - https://gerrit.wikimedia.org/r/1056002 (owner: Bartosz Dziewoński)
[06:53:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P67864 and previous config saved to /var/cache/conftool/dbconfig/20240827-065316-ladsgroup.json
[06:58:53] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, observability, Patch-For-Review: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10094807 (ayounsi) I manually added `--collector.drbd` to /etc/default/prometheus-node-exporter on one of the Routed Ganeti exporter  Thi...
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0700).
[07:00:05] <jouncebot>	 kart_ and Ammar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:24] <kart_>	 here
[07:02:30] <kart_>	 I'll start with my patch.
[07:03:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064696 (owner: KartikMistry)
[07:04:01] <wikibugs>	 (Merged) jenkins-bot: Section Translation: Fix some language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/1064696 (owner: KartikMistry)
[07:04:14] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1064696|Section Translation: Fix some language codes]]
[07:06:15] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1064696|Section Translation: Fix some language codes]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:07:29] <wikibugs>	 (PS1) David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220
[07:07:55] <wikibugs>	 (CR) CI reject: [V:-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (owner: David Caro)
[07:07:56] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[07:08:22] <wikibugs>	 (PS2) David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143)
[07:08:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T371742)', diff saved to https://phabricator.wikimedia.org/P67865 and previous config saved to /var/cache/conftool/dbconfig/20240827-070823-ladsgroup.json
[07:08:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[07:08:28] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[07:08:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[07:08:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T371742)', diff saved to https://phabricator.wikimedia.org/P67866 and previous config saved to /var/cache/conftool/dbconfig/20240827-070845-ladsgroup.json
[07:08:48] <wikibugs>	 (CR) CI reject: [V:-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: David Caro)
[07:08:55] <wikibugs>	 (PS3) David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143)
[07:11:24] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2024-08-27-045705-production [deployment-charts] - https://gerrit.wikimedia.org/r/1067221 (https://phabricator.wikimedia.org/T369815)
[07:11:36] <wikibugs>	 (CR) Ayounsi: [C:+2] "To close the loop from our IRC chat:" [puppet] - https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: Ayounsi)
[07:11:39] <wikibugs>	 (CR) CI reject: [V:-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: David Caro)
[07:12:24] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064696|Section Translation: Fix some language codes]] (duration: 08m 09s)
[07:13:22] <wikibugs>	 (CR) Jelto: prometheus: create text file export for nft throttling denylist length (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: Dzahn)
[07:15:35] <wikibugs>	 (PS1) Jelto: prometheus: fix nftables_throttling exporter variable [puppet] - https://gerrit.wikimedia.org/r/1067222 (https://phabricator.wikimedia.org/T373136)
[07:18:37] <kart_>	 Ammar: I'm done with my patch.
[07:20:39] <wikibugs>	 (PS4) David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143)
[07:20:50] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "mariadb: Add db2232 to test-s4" [puppet] - https://gerrit.wikimedia.org/r/1067158 (owner: Marostegui)
[07:21:08] <wikibugs>	 (CR) CI reject: [V:-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: David Caro)
[07:22:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm
[07:23:13] <wikibugs>	 (PS1) Marostegui: Revert "test-s4: Add two new hosts" [puppet] - https://gerrit.wikimedia.org/r/1067226
[07:23:14] <Ammar>	 kart_: OK
[07:24:41] <wikibugs>	 (PS2) Marostegui: Revert "test-s4: Add two new hosts" [puppet] - https://gerrit.wikimedia.org/r/1067226
[07:25:35] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "test-s4: Add two new hosts" [puppet] - https://gerrit.wikimedia.org/r/1067226 (owner: Marostegui)
[07:26:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS bookworm
[07:26:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS bookworm
[07:28:58] <wikibugs>	 (PS5) David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143)
[07:30:15] <wikibugs>	 (CR) David Caro: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3748/co" [puppet] - https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: David Caro)
[07:39:25] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:45:20] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417 (Marostegui) NEW
[07:45:33] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10094868 (Marostegui) @Papaul can this be related to the 10G?
[07:45:57] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[07:47:08] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10094869 (ABran-WMF) Open→In progress p:Triage→Medium
[07:49:08] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[07:50:22] <godog>	 !log ack probedown for puppetmaster:8181 - T373369
[07:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:26] <stashbot>	 T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4) - https://phabricator.wikimedia.org/T373369
[07:50:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:49] <Ammar>	 urbanecm: are you available for the morning backport?
[07:51:18] <urbanecm>	 Ammar: hey, no one is still deploying? :/
[07:51:20] <urbanecm>	 i can take a look
[07:51:53] <wikibugs>	 (PS2) Ammarpad: Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414)
[07:51:58] <wikibugs>	 (CR) Urbanecm: [C:+2] Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414) (owner: Ammarpad)
[07:52:48] <wikibugs>	 (Merged) jenkins-bot: Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414) (owner: Ammarpad)
[07:52:50] <urbanecm>	 Ammar: that should've been requested earlier. implementing throttle rule less than 72 hours in advance is a bit more complicated on my end (I have to keep in mind additional aspects). i'll deploy it, but i'd appreciate if future requests could be scheduled a little bit earlier. thanks!
[07:53:03] <wikibugs>	 (CR) Jelto: [C:+2] prometheus: fix nftables_throttling exporter variable [puppet] - https://gerrit.wikimedia.org/r/1067222 (https://phabricator.wikimedia.org/T373136) (owner: Jelto)
[07:53:20] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1067191|Add throttle rule for Wikimedia Hausa edit-a-thon (T373414)]]
[07:53:24] <stashbot>	 T373414: Requesting temporary lift of IP cap for Wikimedia Hausa edit-a-thon - https://phabricator.wikimedia.org/T373414
[07:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:00:03] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067191|Add throttle rule for Wikimedia Hausa edit-a-thon (T373414)]] (duration: 06m 42s)
[08:00:04] <jouncebot>	 hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0800)
[08:00:13] <stashbot>	 T373414: Requesting temporary lift of IP cap for Wikimedia Hausa edit-a-thon - https://phabricator.wikimedia.org/T373414
[08:01:27] <urbanecm>	 !log Clear throttle for 105.113.127.170 via resetAuthenticationThrottle.php (T373414)
[08:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:11] <Ammar>	 urbanecm: Thank you
[08:03:12] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:03:58] <godog>	 jouncebot: now and next
[08:03:58] <jouncebot>	 For the next 1 hour(s) and 56 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0800)
[08:04:14] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:47] <hashar>	 I am going to run the MediaWiki train
[08:12:37] <hashar>	 well actually no cause there is a blocker
[08:14:44] <wikibugs>	 (03PS1) 10Joely Rooke WMDE: Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315)
[08:14:49] <hashar>	 ah it got fixed
[08:14:51] <hashar>	 good cscott :)
[08:15:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE)
[08:15:32] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067283 (https://phabricator.wikimedia.org/T366965)
[08:15:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067283 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[08:15:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:16] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067283 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[08:17:38] <hashar>	  /srv/mediawiki-staging/php-1.43.0-wmf.20/.gitmodules does not exist. Did the train branch commit get merged?
[08:17:39] <hashar>	 ...
[08:18:18] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[08:18:19] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2232.codfw.wmnet with OS bookworm
[08:18:26] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2231.codfw.wmnet with OS bookworm
[08:18:30] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2230.codfw.wmnet with OS bookworm
[08:18:56] <hashar>	 I guess I need a double espresso
[08:18:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm
[08:19:25] <TheresNoTime>	 every day is a double espresso day here
[08:19:44] <hashar>	 yeah that is stressful
[08:20:02] <icinga-wm>	 PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:20:42] <hashar>	 the branch did not get cut this morning due to some failure
[08:20:48] <jnuche>	 hashar: the train branch cut job failed, it looks like a transient gerrit error: https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/243/console
[08:21:03] <jnuche>	 I think just rerunning that job should do the trick
[08:21:31] <hashar>	 but why would Gerrit fail? :D
[08:21:44] <hashar>	 Output: fatal: could not read Username for 'https://gerrit.wikimedia.org': No such device or address
[08:21:45] <hashar>	 fun
[08:22:03] <wikibugs>	 (03PS2) 10Slyngshede: 2FA: Use username as foreign key to security token table. [software/bitu] - 10https://gerrit.wikimedia.org/r/1065166
[08:22:04] <jnuche>	 that I dunno :)
[08:22:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:17] <hashar>	 that is when doing the git push
[08:24:14] <icinga-wm>	 PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:25:41] <hashar>	 10:25:27 Warning: Branch wmf/1.43.0-wmf.20 already exists in repository mediawiki/core
[08:26:01] <hashar>	 I guess the job being reentrant and reusing the existing repo is good
[08:26:54] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10094942 (10Marostegui) So I can confirm I've seen db2232 booting up... and seems to get an IP from PXE: ` CLIENT MAC ADDR: 04 32 01 DB D0 C0  GUID: 4C4C4544-004E-3010-8048-B9C04F4B3434 CLIENT...
[08:27:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:10] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:24] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:34] <wikibugs>	 (03CR) 10Btullis: [C:03+2] analytics.wikimedia.org: improve caching and redirects [puppet] - 10https://gerrit.wikimedia.org/r/1057223 (owner: 10Milimetric)
[08:29:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db1161 - T373328', diff saved to https://phabricator.wikimedia.org/P67867 and previous config saved to /var/cache/conftool/dbconfig/20240827-082923-arnaudb.json
[08:29:28] <stashbot>	 T373328: upgrade db1161 to MariaDB 10.6.19 - https://phabricator.wikimedia.org/T373328
[08:31:07] <hashar>	 hmm
[08:32:06] <hashar>	 the release jenkins got restarted overnight
[08:32:07] <hashar>	 at 00:40
[08:32:10] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:18] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:18] <hashar>	 my guess is something got upgraded / changed and that broke the job
[08:35:10] <wikibugs>	 (03PS2) 10Joely Rooke WMDE: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315)
[08:35:42] <wikibugs>	 (03PS3) 10Joely Rooke WMDE: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315)
[08:36:05] <hashar>	 how do I search a commit in gitlab? :/
[08:36:28] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:36:53] <hashar>	 jnuche: I am pretty sure the issue is https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/commit/663c84371a60e1232a501e627c266899d4f5298f :)
[08:36:53] <jnuche>	 hum, yeah, the Jenkins service was restarted, what on earth?
[08:37:04] <hashar>	 that is the sole change I could find
[08:37:06] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1161.eqiad.wmnet with reason: db1161 upgrade
[08:37:12] <hashar>	 that got redeployed (which restarted the jenkins service)
[08:37:19] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1161.eqiad.wmnet with reason: db1161 upgrade
[08:37:26] <hashar>	 and my guess is that whatever version of git / curl we have on releases1003 does not support that NETRC
[08:39:14] <jnuche>	 yeah, it seems that change is the problem
[08:39:24] <jnuche>	 let's roll it back
[08:39:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1161.eqiad.wmnet
[08:39:56] <jnuche>	 hum, wait, they created a different job too
[08:40:53] <hashar>	 in which repo is that?
[08:41:06] <jnuche>	 https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/68
[08:41:12] <jnuche>	 it's a different MR though
[08:41:14] <jnuche>	 should be fine
[08:41:43] <jnuche>	 give me a min and I'll create the revert
[08:41:47] <hashar>	 and that did not get merged
[08:42:10] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:42:15] <hashar>	 what I don't get is that the code seems to create a file named `netrc_file`
[08:42:20] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1161.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1161.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:42:24] <hashar>	 and the commit message refers to an environment variable `NETRC`
[08:42:32] <hashar>	 so I guess that got mixed up
[08:42:45] <hashar>	 or `file()` should have been changed to something like `env()`
[08:43:31] <hashar>	 and I have no clue what "netrc_file" would be :)
[08:43:45] <arnaudb>	  Replica IO: s5 on db1154 → I'm the noise source, downtiming
[08:44:20] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1154.eqiad.wmnet with reason: upgrading db1161
[08:44:33] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1154.eqiad.wmnet with reason: upgrading db1161
[08:44:53] <hashar>	 ah no the first parameter to `file()` is indeed the name of the environment variable
[08:45:15] <jnuche>	 the `netrc_file` is injected via a secret, so it already exists when the job tries to access it
[08:45:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on an-redacteddb1001.eqiad.wmnet with reason: upgrading db1161
[08:45:40] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-redacteddb1001.eqiad.wmnet with reason: upgrading db1161
[08:45:46] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1161.eqiad.wmnet
[08:46:43] <hashar>	 then if I look at https://gitlab.wikimedia.org/repos/releng/release.git it has:
[08:46:43] <hashar>	    netrc_file = os.getenv("netrc_file")
[08:46:43] <hashar>	    if netrc_file:
[08:46:43] <hashar>	       os.symlink(netrc_file, os.path.join(netrc_dir, ".netrc"))
[08:47:10] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:20] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s5 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:47:23] <hashar>	 03:00:12 netrc_file environment variable not set.  Will not be able to push the branch cut commit
[08:47:23] <hashar>	 03:00:12 Branching mediawiki version 1.43.0-wmf.20 (T366965)
[08:47:24] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[08:47:29] <hashar>	 which really should be a fatal / standout
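A minimal sketch of the "should be a fatal" idea above, assuming the same `netrc_file` lookup and `.netrc` symlink quoted from the release repo. The function name `link_netrc`, the `environ` parameter, and the `MissingCredentialsError` class are hypothetical and not taken from that repo.

```python
import os


class MissingCredentialsError(RuntimeError):
    """Hypothetical error type: raised when no netrc file was injected."""


def link_netrc(netrc_dir, environ=None):
    """Symlink the injected netrc file to <netrc_dir>/.netrc, or fail loudly.

    `environ` defaults to os.environ; it is a parameter only so the
    behaviour can be exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    netrc_file = environ.get("netrc_file")
    if not netrc_file:
        # Abort here instead of logging a warning and failing later at git push.
        raise MissingCredentialsError(
            "netrc_file environment variable not set; "
            "cannot push the branch cut commit"
        )
    target = os.path.join(netrc_dir, ".netrc")
    os.symlink(netrc_file, target)
    return target
```

With this shape the job would stop at the first step rather than reaching the push and failing with `could not read Username`.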
[08:48:02] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:48:11] <hashar>	 so my guess is https://gitlab.wikimedia.org/repos/releng/release  has a pending merge request for that
[08:49:06] <hashar>	 and it does not 
[08:49:07] <hashar>	 fun
[08:49:09] <jnuche>	 hashar: MR ready https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/69
[08:49:38] <hashar>	 that commit message refers to https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/244/console
[08:49:42] <hashar>	 which will disappear eventually
[08:50:09] <jnuche>	 I've also added the relevant job output to the MR in a comment
[08:52:03] <hashar>	 I will rephrase it ;)
[08:52:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db1161.eqiad.wmnet with reason: db1161 upgrade
[08:52:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db1161.eqiad.wmnet with reason: db1161 upgrade
[08:53:26] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:54:14] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on db2114 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 55796.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:54:38] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on db2114 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_actiontext: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on db2124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 55885.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:46] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on db2124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_actiontext: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:55:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2124', diff saved to https://phabricator.wikimedia.org/P67868 and previous config saved to /var/cache/conftool/dbconfig/20240827-085551-arnaudb.json
[08:56:11] <hashar>	 jnuche: https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/69/diffs?commit_id=8d2d8fec08223ec68f052ad06078e366ddc1a28e  :)
[08:56:25] <jnuche>	 hashar: approve please? :)
[08:56:36] <icinga-wm>	 PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:56:49] <hashar>	 yeah I am looking for the +2 message :D
[08:57:01] <hashar>	 oh I have hidden it 
[08:57:26] <hashar>	 I like to inline the full context in the commit messages
[08:57:40] <hashar>	 since in a few years for sure the resource pointed to by the URL will have vanished
[08:58:14] <jnuche>	 👍
[08:58:18] <jnuche>	 I'm going to deploy the change now
[09:00:15] <hashar>	 thank you!
[09:00:24] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db2114 [puppet] - 10https://gerrit.wikimedia.org/r/1067296 (https://phabricator.wikimedia.org/T362948)
[09:00:30] <icinga-wm>	 RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:00:46] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on P{O:logging::opensearch::collector and logstash*.codfw.wmnet} and (A:logstash-collector)
[09:00:50] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@8d2d8fe] (releasing): (no justification provided)
[09:01:16] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2232.codfw.wmnet with OS bookworm
[09:01:39] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@8d2d8fe] (releasing): (no justification provided) (duration: 00m 48s)
[09:02:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2114.codfw.wmnet
[09:02:46] <jnuche>	 config looks good again, gonna rerun the job
[09:04:38] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::collector and logstash*.codfw.wmnet} and (A:logstash-collector)
[09:07:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:41] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability, 13Patch-For-Review: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10095021 (10ayounsi) Draft dashboard: https://grafana.wikimedia.org/d/f_tZtVlMz/drbd  I think we should be good to deploy it to all of the...
[09:07:58] <hashar>	 I have left a note on dduvall's original commit at https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/commit/663c84371a60e1232a501e627c266899d4f5298f 
[09:08:03] * hashar grabs a coffee
[09:08:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[09:08:36] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:10:48] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s6 on db2124 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:10:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE)
[09:11:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2114.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:12:07] <hashar>	 hmm
[09:12:12] <icinga-wm>	 RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:12:22] <hashar>	  ! [remote rejected]         HEAD -> refs/for/wmf/1.43.0-wmf.20 (implicit merges detected)
[09:12:22] <hashar>	 lol
[09:13:22] <hashar>	 that is actually good
[09:13:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2114.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[09:13:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:13:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2114.codfw.wmnet
[09:13:48] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on db2124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_only_for_read_old: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:13:54] <jnuche>	 I was actually about to ask about that
[09:14:16] <jnuche>	 never seen that thing about implicit merges
[09:14:18] <hashar>	 it lists 6 commits
[09:14:23] <hashar>	 which are changes that got merged during the night
[09:14:34] <hashar>	 AFTER the branch cut ran the first time
[09:14:58] <hashar>	 those are merged in master
[09:15:07] <jnuche>	 right, some of the branches were successfully cut last night...
[09:15:22] <icinga-wm>	 PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:15:43] <hashar>	 on the releases hosts it does:
[09:15:44] <hashar>	 11:03:21 Branching mediawiki/core to wmf/1.43.0-wmf.20 from HEAD
[09:15:44] <hashar>	 11:03:21 Warning: Branch wmf/1.43.0-wmf.20 already exists in repository mediawiki/core
[09:16:07] <hashar>	 so my guess is on the host the branch has been updated to whatever master is at and that includes those six commits
[09:16:23] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2114.codfw.wmnet - https://phabricator.wikimedia.org/T362948#10095055 (10Marostegui) a:05Marostegui→03None Ready for DC-Ops
[09:16:29] <hashar>	 when pushing that back to Gerrit it complains cause it hasn't seen those commits being proposed as changes to the wmf/1.43.0-wmf.20 branch
[09:16:33] <hashar>	 that bypasses review
[09:16:36] <hashar>	 and it complains
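A toy model of the rejection described above, assuming that an "implicit merge" means a commit reachable from the pushed head that is neither the pushed change itself nor already reachable from the target branch tip. This is a simplification for illustration, not Gerrit's actual implementation; the commit names and both function names are made up.

```python
def ancestors(parents, tip):
    """Return `tip` plus every commit reachable from it via parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def implicit_merges(parents, pushed_head, target_tip):
    """Commits the push would drag into the target branch without review."""
    extra = ancestors(parents, pushed_head) - ancestors(parents, target_tip)
    extra.discard(pushed_head)  # the pushed change itself is what gets reviewed
    return extra


# Hypothetical history: the branch was cut at "cut", then m1 and m2 landed on
# master, and the local branch commit was created on top of the refreshed master.
parents = {
    "base": [],
    "cut": ["base"],
    "m1": ["cut"],
    "m2": ["m1"],
    "branch_commit": ["m2"],
}

# Remote wmf branch still at "cut": m1 and m2 would slip in unreviewed.
stale = implicit_merges(parents, "branch_commit", "cut")

# After fast-forwarding the remote wmf branch to "m2", nothing is left over.
fixed = implicit_merges(parents, "branch_commit", "m2")
```

In this model, fast-forwarding the remote branch to current master empties the set, so the same push is no longer rejected.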
[09:16:49] <hashar>	 so
[09:17:17] <hashar>	 1) the branch cut job should not attempt to refresh / update the branch when it is already existing
[09:17:18] <hashar>	 OR
[09:17:26] <hashar>	 2) we backport all six patches (that sounds overkill)
[09:17:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:05] <hashar>	 2bis) I manually push the update
[09:18:33] <jnuche>	 mmmh, the thing is the job also has a different mode of execution that reuses the same branch all the time (precut_branch or something)
[09:18:42] <jnuche>	 changing the update behavior probably would affect that
[09:18:56] <jnuche>	 for the time being, could we fix manually?
[09:20:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T371742)', diff saved to https://phabricator.wikimedia.org/P67870 and previous config saved to /var/cache/conftool/dbconfig/20240827-092005-ladsgroup.json
[09:20:10] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[09:20:14] <icinga-wm>	 RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:22:21] <hashar>	 oh
[09:22:24] <hashar>	 that is that python code
[09:22:28] <hashar>	 NOoo
[09:23:48] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10095063 (10MatthewVernon) Please go ahead! [sorry, I missed this on Friday, and then yesterday was a public holiday]
[09:24:48] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s6 on db2124 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:24:50] <hashar>	 jnuche: I have updated the wmf branch to current master
[09:25:32] <hashar>	 !log train: fast forwarded mediawiki/core wmf/1.43.0-wmf.20 from 1faf18d6570 to ef87455d7c3 # T366965
[09:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:36] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[09:25:41] <jnuche>	 thx, let's try again then :)
[09:25:50] <hashar>	 rebuilding
[09:25:51] <hashar>	 :)
[09:26:04] <jnuche>	 ah you beat me to it
[09:26:21] <hashar>	 yeah sorry
[09:26:22] <hashar>	 !
[09:27:14] <jnuche>	 now let's hope none of the extensions/skins repos got updates overnight...
[09:27:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:27:48] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on db2124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_private: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:29:14] <hashar>	 oh
[09:29:15] <hashar>	 true
[09:29:21] <hashar>	 well they did for sure :/
[09:31:47] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s6 on db2124 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:32:27] <wikibugs>	 (03Abandoned) 10Gmodena: EventStreamConfig: Add webrequest.frontend.v1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena)
[09:32:31] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2124.codfw.wmnet with reason: db2124 fix
[09:32:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2124.codfw.wmnet with reason: db2124 fix
[09:33:29] <wikibugs>	 (03CR) 10Gmodena: [C:03+2] EventStreamConfig: remove webrequest_frontend. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena)
[09:33:57] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.20 [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067300 (https://phabricator.wikimedia.org/T366965)
[09:33:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.20 [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067300 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[09:34:10] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Cephadm doesn't find the correct image to run a shell - https://phabricator.wikimedia.org/T373185#10095084 (10MatthewVernon) 05Open→03Resolved Clusters upgraded to new image, and lo: ` mvernon@moss-be2001:~$ sudo cephadm shell Inferring fsid 59ea825c-2a...
[09:34:12] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: remove webrequest_frontend. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena)
[09:35:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P67871 and previous config saved to /var/cache/conftool/dbconfig/20240827-093512-ladsgroup.json
[09:36:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2017.codfw.wmnet
[09:37:24] <jnuche>	 it managed to create the change request :)
[09:37:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2017.codfw.wmnet
[09:37:42] <wikibugs>	 (03PS1) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301
[09:37:43] <jnuche>	 I need to step away from the desk to prepare lunch, I'll still be checking my messages though
[09:38:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro)
[09:38:38] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2017.codfw.wmnet with OS bullseye
[09:38:42] <hashar>	 jnuche: Waiting up to 3600 seconds for https://gerrit.wikimedia.org/r/c/1067300 to merge
[09:38:47] <hashar>	 that is a good sign I guess :)
[09:38:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[09:39:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2028.codfw.wmnet
[09:39:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2028.codfw.wmnet
[09:40:03] <hashar>	 fun thing dancy indented the python script with  THREE SPACES :)
[09:40:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f057a31dd90>
[09:40:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[09:41:46] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Decommission db2114 [puppet] - 10https://gerrit.wikimedia.org/r/1067296 (https://phabricator.wikimedia.org/T362948) (owner: 10Marostegui)
[09:42:10] <wikibugs>	 (03PS1) 10Ayounsi: Ganeti prod: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560)
[09:42:24] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi)
[09:43:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2017 - cgoubert@cumin1002"
[09:43:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2017 - cgoubert@cumin1002"
[09:43:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:43:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2017.codfw.wmnet 76.0.192.10.in-addr.arpa 6.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:43:27] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2017.codfw.wmnet 76.0.192.10.in-addr.arpa 6.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:43:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2017
[09:44:45] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi)
[09:44:57] <icinga-wm>	 RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:44:57] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2028.codfw.wmnet with OS bullseye
[09:45:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2017
[09:45:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f057a31dd90>
[09:45:09] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095118 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[09:45:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f6e24a10d30>
[09:45:26] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Ganeti prod: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi)
[09:45:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[09:47:37] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:47:39] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:49:35] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2028 - cgoubert@cumin1002"
[09:49:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2028 - cgoubert@cumin1002"
[09:49:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:49:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2028.codfw.wmnet 178.0.192.10.in-addr.arpa 8.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:49:43] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2028.codfw.wmnet 178.0.192.10.in-addr.arpa 8.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:49:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2028
[09:50:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2028
[09:50:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f6e24a10d30>
[09:50:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P67872 and previous config saved to /var/cache/conftool/dbconfig/20240827-095019-ladsgroup.json
[09:50:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2018.codfw.wmnet
[09:51:11] <wikibugs>	 (03PS2) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301
[09:51:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2018.codfw.wmnet
[09:51:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro)
[09:52:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2018.codfw.wmnet with OS bullseye
[09:52:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095141 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[09:53:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f65f7b4bd90>
[09:53:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[09:55:39] <wikibugs>	 (03PS3) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301
[09:56:01] <hashar>	 wmf-quibble-core-vendor-mysql-php74 | ███████▒▒▒ 77% | ETA: 379s 
[09:56:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67873 and previous config saved to /var/cache/conftool/dbconfig/20240827-095627-arnaudb.json
[09:56:50] <wikibugs>	 (03PS3) 10Tiziano Fogli: curator: free up space to safely restart daemons [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961)
[09:56:57] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3751/console" [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro)
[09:58:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro)
[10:00:04] <wikibugs>	 (03PS4) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1000)
[10:00:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2018 - cgoubert@cumin1002"
[10:00:47] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2018 - cgoubert@cumin1002"
[10:00:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:00:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2018.codfw.wmnet 95.0.192.10.in-addr.arpa 5.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:00:52] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2018.codfw.wmnet 95.0.192.10.in-addr.arpa 5.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:00:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2018
[10:01:16] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro)
[10:01:26] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage
[10:01:43] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2018
[10:01:43] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f65f7b4bd90>
[10:02:20] <wikibugs>	 (03PS2) 10Dbrant: Turn account vanishing contact form into a redirect. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828)
[10:02:27] <wikibugs>	 (03PS1) 10Zabe: Revert apparent fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067305 (https://phabricator.wikimedia.org/T368712)
[10:02:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro)
[10:03:05] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.20 [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067300 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[10:04:20] <wikibugs>	 (03CR) 10Btullis: [C:03+2] ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol)
[10:04:55] <hashar>	 jnuche: for the implicit merge being rejected, the Gerrit doc is at https://gerrit.wikimedia.org/r/Documentation/config-project-config.html#receive.rejectImplicitMerges ):
[10:04:55] <hashar>	 :)
[10:05:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage
[10:05:24] <hashar>	 I am resuming the train
[10:05:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T371742)', diff saved to https://phabricator.wikimedia.org/P67874 and previous config saved to /var/cache/conftool/dbconfig/20240827-100527-ladsgroup.json
[10:05:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[10:05:31] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[10:05:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[10:05:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T371742)', diff saved to https://phabricator.wikimedia.org/P67875 and previous config saved to /var/cache/conftool/dbconfig/20240827-100548-ladsgroup.json
[10:06:52] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20  refs T366965
[10:06:54] <logmsgbot>	 !log hashar@deploy1003 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.43.0-wmf.20" --no-progress --store-class=LCStoreCDB --threads=22 --lang en  --quiet ' returned non-zero exit status 1. (duration: 00m 02s)
[10:06:55] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[10:07:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[10:07:26] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol)
[10:07:59] <hashar>	 RuntimeException from line 88 of /srv/mediawiki-staging/php-1.43.0-wmf.20/includes/language/LCStoreCDB.php: Unable to create the localisation store directory "/srv/mediawiki-staging/php-1.43.0-wmf.20/cache/l10n"
[10:08:00] <hashar>	 fun
[10:09:12] <hashar>	 the parent `cache` belongs to mwpresync:deployment
[10:09:25] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:09:43] <hashar>	 but the cache is rebuilt as www-data
[10:09:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[10:10:19] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:10:49] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:11:12] <wikibugs>	 (03PS1) 10David Caro: toolforge:prometheus: only kyverno controllers expose stats [puppet] - 10https://gerrit.wikimedia.org/r/1067307 (https://phabricator.wikimedia.org/T370143)
[10:11:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2030.codfw.wmnet
[10:11:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 2%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67876 and previous config saved to /var/cache/conftool/dbconfig/20240827-101132-arnaudb.json
[10:11:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2030.codfw.wmnet
[10:12:41] <wikibugs>	 (03PS1) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419)
[10:13:00] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20  refs T366965
[10:13:00] <logmsgbot>	 !log hashar@deploy1003 scap failed: PermissionError [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.43.0-wmf.20/cache/gitinfo' (duration: 00m 00s)
[10:13:03] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[10:13:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[10:13:23] <hashar>	 pff
[10:13:30] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2124.codfw.wmnet with reason: replag
[10:13:42] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2124.codfw.wmnet with reason: replag
[10:14:18] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2030.codfw.wmnet with OS bullseye
[10:14:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:14:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7fa8baa9bd90>
[10:14:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[10:15:52] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on db2124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:16:12] <wikibugs>	 (03PS2) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419)
[10:16:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[10:16:54] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20  refs T366965
[10:16:57] <logmsgbot>	 !log hashar@deploy1003 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.43.0-wmf.20" --no-progress --store-class=LCStoreCDB --threads=22 --lang en  --quiet ' returned non-zero exit status 1. (duration: 00m 02s)
[10:17:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] curator: free up space to safely restart daemons [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: 10Tiziano Fogli)
[10:17:15] * hashar files a bug
[10:17:21] <hashar>	 train is blocked
[10:17:37] <wikibugs>	 (03PS1) 10David Caro: toolforge:prometheus: drop metrics as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1067309 (https://phabricator.wikimedia.org/T370143)
[10:17:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage
[10:18:43] <wikibugs>	 (03PS3) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419)
[10:19:00] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2030 - cgoubert@cumin1002"
[10:19:05] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2030 - cgoubert@cumin1002"
[10:19:05] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:19:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2030.codfw.wmnet 177.0.192.10.in-addr.arpa 7.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:19:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[10:19:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2030.codfw.wmnet 177.0.192.10.in-addr.arpa 7.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:19:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2030
[10:19:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2030
[10:19:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7fa8baa9bd90>
[10:20:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage
[10:21:05] <wikibugs>	 (03PS4) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419)
[10:23:34] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[10:24:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2017.codfw.wmnet with OS bullseye
[10:24:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095281 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:26:37] <claime>	 !log homer 'cr*codfw*' commit 'T372878'
[10:26:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 4%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67877 and previous config saved to /var/cache/conftool/dbconfig/20240827-102638-arnaudb.json
[10:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:41] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[10:27:29] <wikibugs>	 (03PS5) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419)
[10:28:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67878 and previous config saved to /var/cache/conftool/dbconfig/20240827-102827-arnaudb.json
[10:29:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2028.codfw.wmnet with OS bullseye
[10:29:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:32:53] <wikibugs>	 (03PS1) 10JMeybohm: Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311
[10:33:03] <wikibugs>	 (03PS2) 10JMeybohm: Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311
[10:33:04] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 465, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:33:12] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067311 (owner: 10JMeybohm)
[10:33:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311 (owner: 10JMeybohm)
[10:33:37] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1067308/3755/" [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[10:33:37] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20  refs T366965
[10:33:42] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[10:33:44] <hashar>	 I went with `sudo -u mwpresync chmod o+w /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/`
[10:34:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszabo - https://phabricator.wikimedia.org/T373426 (10mszabo) 03NEW
[10:36:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage
[10:37:30] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095341 (10Dreamy_Jazz)
[10:37:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszabo - https://phabricator.wikimedia.org/T373426#10095342 (10mszabo)
[10:37:49] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095357 (10Dreamy_Jazz) The logs for the `daily-updates` container h...
[10:38:31] <claime>	 hashar: why are half the files in /srv/mediawiki-staging/php-1.43.0-wmf.20/ owned by your user?
[10:39:01] <Jhs>	 Hey folks! Would it be possible to add & deploy an IP exception to wmf-config/throttle.php rather quickly?
[10:39:20] <Jhs>	 I just got an email from someone holding an event with 100 people, and many of them are being throttled
[10:39:27] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095359 (10kostajh) The container is still running, though:  ` [khar...
[10:39:35] <claime>	 that's not the case for the previous version where everything belongs to mwpresync:deployment
[10:40:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage
[10:40:38] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095361 (10kostajh) I guess we need to stop the container so that a...
[10:40:51] <kostajh>	 Jhs: is there a task? 
[10:40:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2018.codfw.wmnet with OS bullseye
[10:41:17] <Jhs>	 kostajh, not yet. I'm trying to find out the IP they need unthrottled, and the projects that needs to happen in
[10:41:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:41:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 6%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67879 and previous config saved to /var/cache/conftool/dbconfig/20240827-104143-arnaudb.json
[10:42:12] <Jhs>	 kostajh, but if the answer to the question about doing it quickly would be "we can't do it until the next deployment window", i think their event might be over by then 😅 which is why i asked that question first
[10:42:13] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 547, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:42:23] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095365 (10Dreamy_Jazz)
[10:43:12] <claime>	 !log Running homer 'lsw1-a5-codfw*' commit 'T372878'
[10:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:15] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[10:43:28] <Jhs>	 I'll of course tell the organizer about the possibility of scheduling such an exemption ahead of time, i'm sure they're just not aware of the possibility and/or how to do it
[10:43:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 2%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67880 and previous config saved to /var/cache/conftool/dbconfig/20240827-104332-arnaudb.json
[10:44:40] <claime>	 Jhs: you can point them to https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap
[10:44:41] <kostajh>	 Jhs: we could probably do something out of the deployment window (cc hashar ) but we'd need a phab task with the details, for an audit trail 
[10:46:01] <claime>	 !log Running homer 'lsw1-a6-codfw*' commit 'T372878'
[10:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:48:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2017.codfw.wmnet
[10:48:33] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2017.codfw.wmnet
[10:48:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2028.codfw.wmnet
[10:48:51] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2028.codfw.wmnet
[10:49:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2018.codfw.wmnet
[10:49:04] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2018.codfw.wmnet
[10:49:27] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427#10095378 (10kostajh)
[10:49:31] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:57] <wikibugs>	 (03Abandoned) 10JMeybohm: Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311 (owner: 10JMeybohm)
[10:50:14] <wikibugs>	 (03PS1) 10JMeybohm: Decom kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/1067313 (https://phabricator.wikimedia.org/T373428)
[10:50:27] <wikibugs>	 (03PS1) 10JMeybohm: Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428)
[10:50:59] <hashar>	 kostajh: Jhs: I am fine with having a throttle config change deployed at any time
[10:51:04] <hashar>	 they are rather straightforward :)
[10:52:33] <hashar>	 I think there is a process about it somewhere
[10:52:49] <hashar>	 probably in a "how to run an edit-a-thon" or something
[10:52:56] <_joe_>	 hashar: yes the process says ask two weeks in advance :)
[10:53:04] <hashar>	 yeah
[10:53:09] <hashar>	 then it is often missed
[10:53:11] <_joe_>	 which is not there as a condition just for bureaucratic/organizational reasons
[10:53:20] <_joe_>	 although those are also valid
[10:53:21] <hashar>	 cause organizers are not necessarily aware of that limit
[10:53:40] <_joe_>	 I'm pretty sure there's ample documentation of the potential issue
[10:53:49] <_joe_>	 however, let's proceed
[10:54:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2001.codfw.wmnet
[10:54:19] <_joe_>	 Jhs: It's not a possibility, it's a requirement :) see https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap
[10:54:45] <_joe_>	 and yes at a bare minimum we need a phab task following this procedure ^^
[10:55:11] <hashar>	 ah we have ample documentation!
[10:55:18] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database cswikivoyage (T370912)
[10:55:23] <stashbot>	 T370912: Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912
[10:56:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 8%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67881 and previous config saved to /var/cache/conftool/dbconfig/20240827-105649-arnaudb.json
[10:57:33] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:57:52] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:57:53] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:58:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:58:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67882 and previous config saved to /var/cache/conftool/dbconfig/20240827-105815-ladsgroup.json
[10:58:19] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[10:58:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 3%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67883 and previous config saved to /var/cache/conftool/dbconfig/20240827-105837-arnaudb.json
[11:00:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2030.codfw.wmnet with OS bullseye
[11:00:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67884 and previous config saved to /var/cache/conftool/dbconfig/20240827-110024-ladsgroup.json
[11:00:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[11:00:57] <Dreamy_Jazz>	 !log Starting MediaModeration time limited scan on group0 to make up monthly request limit - https://wikitech.wikimedia.org/wiki/MediaModeration
[11:00:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:13] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10095413 (10mszabo)
[11:01:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095415 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[11:02:40] <Jhs>	 claime, kostajh, _joe_ : Thanks! I still haven't heard back about my question about the IP address and affected projects, so I doubt it will happen today. But i'll give them a tip about what the proper procedure is for next time, so it'll be a better situation for everyone :)
[11:05:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2030.codfw.wmnet
[11:05:23] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2030.codfw.wmnet
[11:05:49] <wikibugs>	 (03PS2) 10JMeybohm: Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428)
[11:09:14] <godog>	 jouncebot: now and next
[11:09:14] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 50 minute(s)
[11:11:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 16%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67885 and previous config saved to /var/cache/conftool/dbconfig/20240827-111154-arnaudb.json
[11:12:06] <godog>	 !log start prometheus6002 bookworm upgrade - T326657
[11:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:09] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[11:13:24] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[11:13:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67886 and previous config saved to /var/cache/conftool/dbconfig/20240827-111343-arnaudb.json
[11:14:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[11:14:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:14:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2001.codfw.wmnet
[11:15:32] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428) (owner: 10JMeybohm)
[11:15:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P67887 and previous config saved to /var/cache/conftool/dbconfig/20240827-111532-ladsgroup.json
[11:15:40] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Decom kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/1067313 (https://phabricator.wikimedia.org/T373428) (owner: 10JMeybohm)
[11:18:55] <wikibugs>	 (03Merged) 10jenkins-bot: Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428) (owner: 10JMeybohm)
[11:19:11] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2001.codfw.wmnet - https://phabricator.wikimedia.org/T373428#10095452 (10JMeybohm)
[11:19:45] <kart_>	 I would like to deploy cxserver if no deployments going on (nothing as per calendar)
[11:19:55] <claime>	 !log Deleting misbehaving pod ipoid-production-daily-updates-28742340-h5ckx - T373427
[11:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:59] <stashbot>	 T373427: IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427
[11:20:34] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format db2240 [puppet] - 10https://gerrit.wikimedia.org/r/1067319
[11:20:37] <godog>	 !log start prometheus7001 bookworm upgrade - T326657
[11:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:41] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[11:20:53] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.20  refs T366965 (duration: 47m 15s)
[11:20:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database cswikivoyage (T370912)
[11:20:56] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[11:21:00] <stashbot>	 T370912: Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912
[11:22:02] <wikibugs>	 (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1067319 (owner: 10Marostegui)
[11:23:26] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2240 [puppet] - 10https://gerrit.wikimedia.org/r/1067319 (owner: 10Marostegui)
[11:24:47] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet
[11:27:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67889 and previous config saved to /var/cache/conftool/dbconfig/20240827-112700-arnaudb.json
[11:27:05] <wikibugs>	 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427#10095494 (10kostajh) >>! In T373427#10095462...
[11:28:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 15%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67890 and previous config saved to /var/cache/conftool/dbconfig/20240827-112848-arnaudb.json
[11:30:02] <hashar>	 I'll do group0 after lunch
[11:30:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P67891 and previous config saved to /var/cache/conftool/dbconfig/20240827-113039-ladsgroup.json
[11:30:49] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet
[11:31:30] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-08-27-045705-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067221 (https://phabricator.wikimedia.org/T369815) (owner: 10KartikMistry)
[11:31:43] <wikibugs>	 (03CR) 10Jaime Nuche: releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[11:31:44] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2292.codfw.wmnet
[11:32:19] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2292.codfw.wmnet
[11:32:32] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2024-08-27-045705-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067221 (https://phabricator.wikimedia.org/T369815) (owner: 10KartikMistry)
[11:33:40] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet
[11:38:34] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[11:38:59] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[11:39:43] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet
[11:42:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67892 and previous config saved to /var/cache/conftool/dbconfig/20240827-114205-arnaudb.json
[11:43:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:45] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878)
[11:43:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67893 and previous config saved to /var/cache/conftool/dbconfig/20240827-114354-arnaudb.json
[11:45:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67894 and previous config saved to /var/cache/conftool/dbconfig/20240827-114546-ladsgroup.json
[11:45:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:45:51] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[11:46:01] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:46:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T370903)', diff saved to https://phabricator.wikimedia.org/P67895 and previous config saved to /var/cache/conftool/dbconfig/20240827-114608-ladsgroup.json
[11:46:19] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[11:46:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris)
[11:46:52] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[11:47:54] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878)
[11:49:26] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[11:50:02] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[11:50:52] <wikibugs>	 (03CR) 10Jaime Nuche: releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[11:51:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris)
[11:51:36] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 (owner: 10Bartosz Dziewoński)
[11:51:53] <kart_>	 !log Updated cxserver to 2024-08-27-045705-production (T369815)
[11:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:57] <stashbot>	 T369815: Enable in content Translation the new languages Google Translate supports in June 2024 - https://phabricator.wikimedia.org/T369815
[11:53:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T370903)', diff saved to https://phabricator.wikimedia.org/P67896 and previous config saved to /var/cache/conftool/dbconfig/20240827-115318-ladsgroup.json
[11:53:22] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[11:53:57] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2292 to wikikube-worker2043
[11:54:13] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[11:57:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67897 and previous config saved to /var/cache/conftool/dbconfig/20240827-115711-arnaudb.json
[11:58:38] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2292 to wikikube-worker2043 - akosiaris@cumin1002"
[11:59:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67898 and previous config saved to /var/cache/conftool/dbconfig/20240827-115859-arnaudb.json
[11:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[11:59:41] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[11:59:51] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2292 to wikikube-worker2043 - akosiaris@cumin1002"
[11:59:52] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:59:53] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2043
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1200)
[12:00:07] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2043
[12:00:36] <hashar>	 I am doing the group0 promotion since this morning did not work
[12:00:43] <hashar>	 we only reached testwikis
[12:00:46] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2292 to wikikube-worker2043
[12:01:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2292 to...
[12:01:40] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2043.codfw.wmnet with OS bullseye
[12:01:50] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7fa3b11fc520>
[12:01:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host...
[12:02:27] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[12:04:29] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[12:05:33] <icinga-wm>	 PROBLEM - Disk space on restbase2022 is CRITICAL: DISK CRITICAL - free space: /srv/sda4 113109 MB (6% inode=99%): /srv/sdc4 69044 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2022&var-datasource=codfw+prometheus/ops
[12:07:02] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067330 (https://phabricator.wikimedia.org/T366965)
[12:07:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067330 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[12:08:01] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067330 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot)
[12:08:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P67899 and previous config saved to /var/cache/conftool/dbconfig/20240827-120825-ladsgroup.json
[12:10:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc2015.codfw.wmnet with reason: Network maintenance
[12:10:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2015.codfw.wmnet with reason: Network maintenance
[12:11:33] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] profile::firewall::nftables_throttling: fix issue of global metering [puppet] - 10https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[12:11:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2019.codfw.wmnet
[12:12:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67900 and previous config saved to /var/cache/conftool/dbconfig/20240827-121216-arnaudb.json
[12:14:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67901 and previous config saved to /var/cache/conftool/dbconfig/20240827-121405-arnaudb.json
[12:14:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2043 - akosiaris@cumin1002"
[12:14:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2043 - akosiaris@cumin1002"
[12:14:13] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:14:13] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2043.codfw.wmnet 162.0.192.10.in-addr.arpa 2.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:14:16] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2043.codfw.wmnet 162.0.192.10.in-addr.arpa 2.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:14:17] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2043
[12:14:20] <wikibugs>	 (03PS1) 10Kamila Součková: Rename kubernetes2019 to wikikube-worker2044 [puppet] - 10https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878)
[12:14:47] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[12:15:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2019.codfw.wmnet
[12:15:40] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2043
[12:15:40] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7fa3b11fc520>
[12:16:05] <wikibugs>	 (03PS1) 10David Caro: toolforge:prometheus: remove cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143)
[12:18:16] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.20  refs T366965
[12:18:20] <stashbot>	 T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965
[12:18:47] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:18:47] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:23:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P67902 and previous config saved to /var/cache/conftool/dbconfig/20240827-122332-ladsgroup.json
[12:24:01] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2001.codfw.wmnet - https://phabricator.wikimedia.org/T373428#10095626 (10Jhancock.wm) a:03Jhancock.wm
[12:24:14] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability, 13Patch-For-Review: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10095619 (10ayounsi) 05Open→03Resolved All done!
[12:25:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T371742)', diff saved to https://phabricator.wikimedia.org/P67903 and previous config saved to /var/cache/conftool/dbconfig/20240827-122509-ladsgroup.json
[12:25:15] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[12:28:34] <zabe>	 jouncebot: nowandnext
[12:28:34] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1200)
[12:28:34] <jouncebot>	 In 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1300)
[12:29:05] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Revert apparent fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067305 (https://phabricator.wikimedia.org/T368712) (owner: 10Zabe)
[12:29:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67904 and previous config saved to /var/cache/conftool/dbconfig/20240827-122910-arnaudb.json
[12:29:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert apparent fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067305 (https://phabricator.wikimedia.org/T368712) (owner: 10Zabe)
[12:30:12] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067305|Revert apparent fix (T368712)]]
[12:30:18] <stashbot>	 T368712: Change sysop_plwiki logo and favicon - https://phabricator.wikimedia.org/T368712
[12:30:31] <wikibugs>	 (03PS1) 10AOkoth: vrts: add ticket count metrics for different queues [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419)
[12:32:18] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage
[12:32:41] <wikibugs>	 (03PS1) 10Jelto: gitlab: add profile::prometheus::nft_throttling_denylist [puppet] - 10https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882)
[12:33:54] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1067305|Revert apparent fix (T368712)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:33:55] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1067336/3756/" [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[12:34:08] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[12:34:51] <wikibugs>	 (03PS1) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[12:34:52] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3757/co" [puppet] - 10https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[12:35:23] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage
[12:36:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[12:38:33] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067305|Revert apparent fix (T368712)]] (duration: 08m 20s)
[12:38:37] <stashbot>	 T368712: Change sysop_plwiki logo and favicon - https://phabricator.wikimedia.org/T368712
[12:38:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T370903)', diff saved to https://phabricator.wikimedia.org/P67905 and previous config saved to /var/cache/conftool/dbconfig/20240827-123839-ladsgroup.json
[12:38:42] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:38:43] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:38:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:40:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P67906 and previous config saved to /var/cache/conftool/dbconfig/20240827-124016-ladsgroup.json
[12:46:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:46:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:46:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T370903)', diff saved to https://phabricator.wikimedia.org/P67907 and previous config saved to /var/cache/conftool/dbconfig/20240827-124629-ladsgroup.json
[12:46:33] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:46:46] <zabe>	 !log zabe@mwmaint1002:~$ foreachwikiindblist private wrapOldPasswords.php --type BEP --update # T91917
[12:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:55] <wikibugs>	 (03PS1) 10Brouberol: cloudnative-pg: enable ingress traffic to the prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067340 (https://phabricator.wikimedia.org/T372284)
[12:48:09] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067340 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[12:48:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] cloudnative-pg: enable ingress traffic to the prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067340 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[12:49:59] <zabe>	 !log zabe@mwmaint1002:~$ foreachwikiindblist fishbowl wrapOldPasswords.php --type BEP --update # T91917
[12:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:50] <wikibugs>	 (03PS2) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[12:51:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T370903)', diff saved to https://phabricator.wikimedia.org/P67908 and previous config saved to /var/cache/conftool/dbconfig/20240827-125139-ladsgroup.json
[12:51:44] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:52:11] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:52:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:52:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[12:52:56] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge:prometheus: only kyverno controllers expose stats [puppet] - 10https://gerrit.wikimedia.org/r/1067307 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[12:53:00] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge:prometheus: drop metrics as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1067309 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[12:53:18] <wikibugs>	 (03CR) 10David Caro: "Turns out that it might not be cadvisor the culprit, looking" [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro)
[12:53:47] <wikibugs>	 (03PS1) 10Ssingh: P:ntp: set time for CRITICAL alert to 2 hours (from 4) for service check [puppet] - 10https://gerrit.wikimedia.org/r/1067341
[12:54:30] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3758/co" [puppet] - 10https://gerrit.wikimedia.org/r/1067341 (owner: 10Ssingh)
[12:55:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2043.codfw.wmnet with OS bullseye
[12:55:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P67909 and previous config saved to /var/cache/conftool/dbconfig/20240827-125523-ladsgroup.json
[12:56:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki...
[12:56:49] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: set time for CRITICAL alert to 2 hours (from 4) for service check [puppet] - 10https://gerrit.wikimedia.org/r/1067341 (owner: 10Ssingh)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1300).
[13:00:05] <jouncebot>	 Daimona and joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:28] <HouseOfM>	 o/
[13:00:49] <joelyrookewmde>	 hello team !
[13:00:56] <Daimona>	 o/
[13:01:12] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] vrts: add ticket count metrics for different queues [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[13:01:35] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10095785 (10ssingh) Thanks @kzimmerman! This just leaves us with @thcipriani's approval and I will merge the patch once that is in.
[13:01:42] <zabe>	 I can deploy
[13:01:56] <zabe>	 HouseOfM: I can't see a patch for you in the window?
[13:02:05] <wikibugs>	 (03PS4) 10Joely Rooke WMDE: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315)
[13:02:12] <HouseOfM>	 I'm here for @Daimona patch
[13:02:18] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE)
[13:02:35] <zabe>	 ah alright:)
[13:03:19] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10095796 (10ssingh) >>! In T373142#10094180, @KFrancis wrote: > Hello all, I am confirming as @NCreasy is a contractor with the WMF, there is already an NDA in place.  Thanks!  Sorry for the confus...
[13:03:30] <wikibugs>	 (03PS3) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[13:03:42] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] vrts: add ticket count metrics for different queues [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[13:04:51] <wikibugs>	 (03PS2) 10Daimona Eaytoy: Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041)
[13:04:53] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy)
[13:05:13] <wikibugs>	 (03Merged) 10jenkins-bot: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE)
[13:05:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy)
[13:05:56] <wikibugs>	 (03Merged) 10jenkins-bot: Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy)
[13:06:15] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067282|Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]], [[gerrit:1066777|Enable CampaignEvents Invitation Lists in production testing environments (T373041)]]
[13:06:22] <stashbot>	 T373041: Release Invitation lists to all wikis with CampaignEvents extension + enable on test wikis - https://phabricator.wikimedia.org/T373041
[13:06:30] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 463, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:06:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P67910 and previous config saved to /var/cache/conftool/dbconfig/20240827-130647-ladsgroup.json
[13:08:38] <logmsgbot>	 !log zabe@deploy1003 joelyrookewmde, daimona, zabe: Backport for [[gerrit:1067282|Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]], [[gerrit:1066777|Enable CampaignEvents Invitation Lists in production testing environments (T373041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:57] <zabe>	 Daimona: HouseOfM: joelyrookewmde: can you test?
[13:09:03] <joelyrookewmde>	 yes will do now
[13:09:27] <HouseOfM>	 testing
[13:09:32] <Daimona>	 yup, thx
[13:10:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T371742)', diff saved to https://phabricator.wikimedia.org/P67911 and previous config saved to /var/cache/conftool/dbconfig/20240827-131031-ladsgroup.json
[13:10:33] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[13:10:35] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[13:10:46] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[13:11:45] <wikibugs>	 (03PS1) 10Brouberol: Remove the pgcluster-test in dse-k8s, no longer useful [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067346
[13:12:25] <Daimona>	 OK I'm officially dumb, the special pages are working, but the feature is broken because I forgot to create the DB tables
[13:13:03] <joelyrookewmde>	 the config change I'm doing currently only affects the beta cluster, but mediawikiDebug doesn't show any options for test servers right now
[13:13:19] <joelyrookewmde>	 Is there something I need to do or can we not test the beta cluster in this way?
[13:13:41] <Daimona>	 zabe can I go ahead and create these tables in production now?
[13:13:47] <zabe>	 joelyrookewmde: you can't test beta cluster with mwdebug; so if it only affects beta, I would just sync it through
[13:13:52] <joelyrookewmde>	 ok
[13:13:53] <zabe>	 Daimona: sure, go ahead:)
[13:13:55] <joelyrookewmde>	 works for me
[13:14:15] <zabe>	 it will reach beta cluster ~10-15 min after the merge
[13:14:24] <joelyrookewmde>	 perfect, thanks!
[13:14:32] <_joe_>	 Daimona: uh wait, what do you mean create the tables in production?
[13:16:57] <Daimona>	 We have a few DB tables that need to be created in production for a new feature. These have already been approved by DBA. We would generally create them in a dedicated window, hence my question.
[13:17:28] <_joe_>	 yeah, just let the DBAs know/confirm now is ok :)
[13:17:58] <marostegui>	 Daimona: Go for it yes
[13:18:00] <Daimona>	 (Also, y'all please bear with me, I'm trying to find the relevant tasks, and now is a perfect time to find out that they're not in the parent-child hierarchy for this project)
[13:19:13] <Daimona>	 Also a perfect time to discover that phab has a limit of 20 subtasks, apparently
[13:19:45] <_joe_>	 Daimona: you can also find out how good phab search is :D
[13:19:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 1%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67912 and previous config saved to /var/cache/conftool/dbconfig/20240827-131947-arnaudb.json
[13:21:33] <HouseOfM>	 @Daimona https://phabricator.wikimedia.org/T369303?
[13:21:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P67913 and previous config saved to /var/cache/conftool/dbconfig/20240827-132154-ladsgroup.json
[13:22:22] <Daimona>	 Yes
[13:22:41] <Daimona>	 Wait wtf, this limit is a new thing isn't it, since the task in question already has 23 subtasks
[13:23:00] <wikibugs>	 (03PS1) 10Ssingh: admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142)
[13:23:39] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::collector and log*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[13:23:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142) (owner: 10Ssingh)
[13:29:54] <Daimona>	 !log Creating new DB tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T369303
[13:29:57] <wikibugs>	 (03PS2) 10Ssingh: admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142)
[13:29:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:58] <stashbot>	 T369303: Create the DB schema for invitation lists in prod - https://phabricator.wikimedia.org/T369303
[13:30:38] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10095911 (10elukey) Today I cleaned up some db nodes reported as debmonitor client failures while I was on holiday:  ` >>> spicerack.debmonitor().host_delete('d...
[13:32:54] <wikibugs>	 (03PS1) 10Brouberol: Upgrade airflow to 2.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284)
[13:33:01] <Daimona>	 zabe: tables created and it's looking good now
[13:33:08] <zabe>	 alright:)
[13:33:16] <logmsgbot>	 !log zabe@deploy1003 joelyrookewmde, daimona, zabe: Continuing with sync
[13:34:02] <Daimona>	 Sorry for the inconvenience, I just completely forgot about the database :O
[13:34:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 2%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67914 and previous config saved to /var/cache/conftool/dbconfig/20240827-133452-arnaudb.json
[13:35:35] <zabe>	 no worries; i investigated some old type password hashes in fishbowl and private wikis in the meantime
[13:36:31] <HouseOfM>	 Thanks Daimona
[13:37:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T370903)', diff saved to https://phabricator.wikimedia.org/P67915 and previous config saved to /var/cache/conftool/dbconfig/20240827-133701-ladsgroup.json
[13:37:03] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[13:37:09] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:37:16] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[13:37:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T370903)', diff saved to https://phabricator.wikimedia.org/P67917 and previous config saved to /var/cache/conftool/dbconfig/20240827-133723-ladsgroup.json
[13:37:34] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::collector and log*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[13:37:43] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067282|Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]], [[gerrit:1066777|Enable CampaignEvents Invitation Lists in production testing environments (T373041)]] (duration: 31m 27s)
[13:37:47] <stashbot>	 T373041: Release Invitation lists to all wikis with CampaignEvents extension + enable on test wikis - https://phabricator.wikimedia.org/T373041
[13:39:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T370903)', diff saved to https://phabricator.wikimedia.org/P67918 and previous config saved to /var/cache/conftool/dbconfig/20240827-133933-ladsgroup.json
[13:42:01] <wikibugs>	 (03PS1) 10Klausman: ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067353
[13:43:44] <wikibugs>	 (03PS1) 10Elukey: hosts/views.py: add logging when upgrading the host's OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1067354 (https://phabricator.wikimedia.org/T368744)
[13:43:46] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067353 (owner: 10Klausman)
[13:44:11] <wikibugs>	 (03CR) 10Klausman: [C:03+2] ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067353 (owner: 10Klausman)
[13:44:45] <zabe>	 !log zabe@mwmaint1002:~$ foreachwikiindblist fishbowl sql.php --query "UPDATE user SET user_password = CONCAT(':B:', user_id, ':', user_password) WHERE user_password RLIKE '^[0-9a-f]{32}$';" # T91917
[13:44:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10095958 (10thcipriani) >>! In T373379#10095785, @ssingh wrote: > Thanks @kzimmerman! This just leaves us with @thcipriani's approval and I will merge the patch once...
[13:45:12] <zabe>	 !log zabe@mwmaint1002:~$ foreachwikiindblist private sql.php --query "UPDATE user SET user_password = CONCAT(':B:', user_id, ':', user_password) WHERE user_password RLIKE '^[0-9a-f]{32}$';" # T91917
[13:45:12] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067353 (owner: 10Klausman)
[13:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:36] <zabe>	 !log zabe@mwmaint1002:~$ foreachwikiindblist fishbowl wrapOldPasswords.php --type BEP --update # T91917
[13:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:40] <zabe>	 !log zabe@mwmaint1002:~$ foreachwikiindblist private wrapOldPasswords.php --type BEP --update # T91917
[13:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:48] <XioNoX>	 !log add routinator to bookworm-wikipedia apt repo - T372909
[13:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:52] <stashbot>	 T372909: Create prod VMs on routed ganeti cluster - https://phabricator.wikimedia.org/T372909
[13:48:25] <XioNoX>	 !log add bgpalerter to bookworm-wikipedia apt repo - T372909
[13:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 3%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67919 and previous config saved to /var/cache/conftool/dbconfig/20240827-134958-arnaudb.json
[13:52:52] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Rename kubernetes2019 to wikikube-worker2044 [puppet] - 10https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková)
[13:53:33] <wikibugs>	 (03PS1) 10Ayounsi: RPKI: replace rpki2002 with rpki2003 [homer/public] - 10https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909)
[13:54:11] <wikibugs>	 (03CR) 10Ayounsi: [V:03+1 C:03+2] Netbox: enable devicetype validator [puppet] - 10https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036) (owner: 10Ayounsi)
[13:54:21] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10096002 (10VRiley-WMF) 05Open→03Resolved Drive has been replaced. Please let us know if there are any other issues with this drive. Thanks!
[13:54:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P67920 and previous config saved to /var/cache/conftool/dbconfig/20240827-135440-ladsgroup.json
[13:57:44] <wikibugs>	 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10096042 (10ayounsi) 05Open→03Resolved Deployed! let me know if any issue.
[14:02:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2015-2016].codfw.wmnet,pc[1015-1016].eqiad.wmnet with reason: Switchover
[14:02:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2015-2016].codfw.wmnet,pc[1015-1016].eqiad.wmnet with reason: Switchover
[14:02:51] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373401#10096117 (10Jhancock.wm) 05Open→03Resolved
[14:03:12] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142) (owner: 10Ssingh)
[14:04:25] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote pc2015 to pc4 master [puppet] - 10https://gerrit.wikimedia.org/r/1067357 (https://phabricator.wikimedia.org/T373340)
[14:05:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 5%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67921 and previous config saved to /var/cache/conftool/dbconfig/20240827-140503-arnaudb.json
[14:05:29] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10096149 (10ssingh) 05Open→03Resolved a:03ssingh Added to `nda` group. Please try logging in to Superset after ~30 mins. Thanks!
[14:05:58] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096155 (10Jhancock.wm) @Marostegui hey Papaul's on vacation this week. From what I remember that is a 10G issue. We started using this tag in the reimage script to keep this one from coming u...
[14:06:37] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] prometheus: create text file export for nft throttling denylist length (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn)
[14:06:58] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gitlab: add profile::prometheus::nft_throttling_denylist [puppet] - 10https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[14:07:57] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[14:08:09] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: add profile::prometheus::nft_throttling_denylist [puppet] - 10https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[14:09:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P67922 and previous config saved to /var/cache/conftool/dbconfig/20240827-140947-ladsgroup.json
[14:10:09] <wikibugs>	 (03CR) 10Arnaudb: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans)
[14:11:03] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] profile::firewall::nftables_throttling: fix issue of global metering [puppet] - 10https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto)
[14:11:05] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Promote pc2015 to pc4 master [puppet] - 10https://gerrit.wikimedia.org/r/1067357 (https://phabricator.wikimedia.org/T373340) (owner: 10Marostegui)
[14:11:57] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1067222 (https://phabricator.wikimedia.org/T373136) (owner: 10Jelto)
[14:12:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10096191 (10ssingh)
[14:12:21] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096192 (10elukey) >>! In T373417#10096155, @Jhancock.wm wrote: > @Marostegui hey Papaul's on vacation this week. From what I remember that is a 10G issue. We started using this tag in the rei...
[14:13:15] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10096198 (10ssingh)
[14:13:19] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add jiawang to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1066833 (https://phabricator.wikimedia.org/T373379) (owner: 10Ssingh)
[14:13:29] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2001.codfw.wmnet - https://phabricator.wikimedia.org/T373428#10096188 (10Jhancock.wm) 05Open→03Resolved
[14:13:35] <wikibugs>	 (03PS1) 10AOkoth: vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419)
[14:13:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10096195 (10ssingh) @thcipriani: this requires your approval, thank you.
[14:14:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[14:16:54] <wikibugs>	 (03PS2) 10AOkoth: vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419)
[14:17:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS bookworm
[14:18:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS bookworm
[14:18:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm
[14:18:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Switch pc4 master to pc2015 T373340', diff saved to https://phabricator.wikimedia.org/P67923 and previous config saved to /var/cache/conftool/dbconfig/20240827-141845-marostegui.json
[14:18:48] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 545, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:18:49] <stashbot>	 T373340: pc2016 switchover - https://phabricator.wikimedia.org/T373340
[14:18:50] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096223 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2230.codfw.wmnet with OS bookworm
[14:18:54] <logmsgbot>	 !log tappof@cumin2002 END (FAIL) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=99) rolling restart_daemons on P{O:logging::opensearch::data and logs*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[14:19:46] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096228 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2231.codfw.wmnet with OS bookworm
[14:19:49] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096229 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2232.codfw.wmnet with OS bookworm
[14:20:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 15%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67924 and previous config saved to /var/cache/conftool/dbconfig/20240827-142009-arnaudb.json
[14:20:10] <akosiaris>	 !log T327878 uncordon wikikube-worker2043
[14:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:16] <stashbot>	 T327878: Tweak Autocomplete search results on the Mongolian Wikipedia - https://phabricator.wikimedia.org/T327878
[14:20:27] <akosiaris>	 sigh, wrong task
[14:20:36] <akosiaris>	 !log T372878 uncordon wikikube-worker2043
[14:20:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:40] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[14:21:25] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński)
[14:21:33] <elukey>	 jouncebot: next
[14:21:33] <jouncebot>	 In 0 hour(s) and 38 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1500)
[14:22:08] <elukey>	 akosiaris: I'd need to reboot wikikube-ctrl2003 for https://phabricator.wikimedia.org/T371132, am I going to interfere with some work that you are doing?
[14:22:12] <elukey>	 I can wait in case
[14:23:17] <wikibugs>	 (03PS3) 10AOkoth: vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419)
[14:23:56] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2114.codfw.wmnet - https://phabricator.wikimedia.org/T362948#10096268 (10Jhancock.wm) 05Open→03Resolved
[14:24:14] <marostegui>	 !log Update zarcillo db for pc4 master T373340
[14:24:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:21] <stashbot>	 T373340: pc2016 switchover - https://phabricator.wikimedia.org/T373340
[14:24:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T370903)', diff saved to https://phabricator.wikimedia.org/P67925 and previous config saved to /var/cache/conftool/dbconfig/20240827-142454-ladsgroup.json
[14:24:57] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[14:24:59] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:25:10] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[14:25:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T370903)', diff saved to https://phabricator.wikimedia.org/P67926 and previous config saved to /var/cache/conftool/dbconfig/20240827-142516-ladsgroup.json
[14:25:45] <wikibugs>	 (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/1067360/3761/" [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: AOkoth)
[14:26:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:26:07] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.dns.netbox
[14:26:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:26:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:26:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:29:28] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding AAAA field to wdqs101[1-3] and wdqs200[7-8] - brouberol@cumin1002"
[14:29:33] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding AAAA field to wdqs101[1-3] and wdqs200[7-8] - brouberol@cumin1002"
[14:29:33] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:30:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T370903)', diff saved to https://phabricator.wikimedia.org/P67927 and previous config saved to /var/cache/conftool/dbconfig/20240827-143027-ladsgroup.json
[14:30:44] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:31:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2230.codfw.wmnet with reason: host reimage
[14:32:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2232.codfw.wmnet with reason: host reimage
[14:32:55] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[14:34:11] <wikibugs>	 (CR) Bartosz Dziewoński: "@Bryan It seems that you authored this, can you also have a look?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[14:34:38] <wikibugs>	 (PS1) Btullis: Add the matomo-plugin-customreports package to Matomo [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203)
[14:35:03] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2230.codfw.wmnet with reason: host reimage
[14:35:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67928 and previous config saved to /var/cache/conftool/dbconfig/20240827-143514-arnaudb.json
[14:35:34] <wikibugs>	 (PS2) Btullis: Add the matomo-plugin-customreports package to Matomo [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203)
[14:36:18] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3762/co" [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203) (owner: Btullis)
[14:36:27] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:44] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[14:37:50] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10096358 (ssingh) Open→Resolved a:ssingh @jwang: request merged, thanks! Please re-open if there are any issues.
[14:39:36] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2239 [puppet] - https://gerrit.wikimedia.org/r/1067363
[14:40:54] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=wikikube-ctrl2003.codfw.wmnet
[14:41:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2232.codfw.wmnet with reason: host reimage
[14:41:32] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on wikikube-ctrl2003.codfw.wmnet with reason: running provision again
[14:41:45] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on wikikube-ctrl2003.codfw.wmnet with reason: running provision again
[14:41:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2293.codfw.wmnet
[14:42:14] <wikibugs>	 (PS1) Bartosz Dziewoński: logging: Remove WhatFailureGroupHandler wrapper from handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1067364 (https://phabricator.wikimedia.org/T373444)
[14:42:23] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2293.codfw.wmnet
[14:43:14] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db2239 [puppet] - https://gerrit.wikimedia.org/r/1067363 (owner: Marostegui)
[14:44:31] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:45:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P67929 and previous config saved to /var/cache/conftool/dbconfig/20240827-144534-ladsgroup.json
[14:45:35] <wikibugs>	 (CR) Brouberol: "Looking at the [dashboard](https://grafana-rw.wikimedia.org/d/cloudnative-pg/cloudnativepg?forceLogin=&from=now-15m&orgId=1&refresh=30s&to" [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: Brouberol)
[14:45:53] <wikibugs>	 (PS1) Ayounsi: Network report: remove wdqs from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555)
[14:46:12] <wikibugs>	 (CR) Brouberol: "We should also monitor whether the cloudnative-pg operator pod is running and healthy" [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: Brouberol)
[14:47:57] <wikibugs>	 (CR) Dzahn: [C:+1] vrts: add yearly ticket count [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: AOkoth)
[14:48:51] <icinga-wm>	 PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:49:28] <wikibugs>	 (CR) Elukey: [C:+1] Add the matomo-plugin-customreports package to Matomo [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203) (owner: Btullis)
[14:50:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67930 and previous config saved to /var/cache/conftool/dbconfig/20240827-145020-arnaudb.json
[14:51:19] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2230.codfw.wmnet with OS bookworm
[14:51:24] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2230.codfw.wmnet with OS bookworm completed: - db2230 (**PASS**)   - Removed from Pu...
[14:52:12] <wikibugs>	 (PS4) Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[14:54:01] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2231.codfw.wmnet with OS bookworm
[14:54:04] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096442 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2231.codfw.wmnet with OS bookworm completed: - db2231 (**PASS**)   - Removed from Pu...
[14:55:03] <wikibugs>	 (PS24) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[14:55:44] <wikibugs>	 (CR) CI reject: [V:-1] prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[14:55:46] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096444 (Marostegui) In progress→Resolved Thanks @Jhancock.wm - that worked!
[14:56:51] <icinga-wm>	 RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:55] <wikibugs>	 (CR) Brouberol: [C:+1] "LGTM!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555) (owner: Ayounsi)
[14:57:08] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2232.codfw.wmnet with OS bookworm
[14:58:22] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096452 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2232.codfw.wmnet with OS bookworm completed: - db2232 (**PASS**)   - Removed fro...
[14:59:16] <wikibugs>	 (PS25) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[15:00:04] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1500).
[15:00:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P67931 and previous config saved to /var/cache/conftool/dbconfig/20240827-150041-ladsgroup.json
[15:01:27] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:48] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:02:27] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2003.codfw.wmnet
[15:05:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67932 and previous config saved to /var/cache/conftool/dbconfig/20240827-150525-arnaudb.json
[15:09:32] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[15:09:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[15:09:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T371742)', diff saved to https://phabricator.wikimedia.org/P67933 and previous config saved to /var/cache/conftool/dbconfig/20240827-150952-ladsgroup.json
[15:09:59] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[15:11:11] <elukey>	 !log restart httpd on crm2001 for libaom upgrades
[15:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:31] <elukey>	 !log restart httpd and librenms-syslog.service on netmon1003 for libaom upgrades
[15:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:00] <wikibugs>	 (CR) Kamila Součková: [C:+2] Rename kubernetes2019 to wikikube-worker2044 [puppet] - https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878) (owner: Kamila Součková)
[15:15:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2019 to wikikube-worker2044
[15:15:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:15:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T370903)', diff saved to https://phabricator.wikimedia.org/P67934 and previous config saved to /var/cache/conftool/dbconfig/20240827-151548-ladsgroup.json
[15:15:50] <wikibugs>	 (CR) Ebernhardson: Pull some flink config down into the chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: Ebernhardson)
[15:15:50] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[15:15:52] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:16:03] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[15:16:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T370903)', diff saved to https://phabricator.wikimedia.org/P67935 and previous config saved to /var/cache/conftool/dbconfig/20240827-151610-ladsgroup.json
[15:18:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T370903)', diff saved to https://phabricator.wikimedia.org/P67936 and previous config saved to /var/cache/conftool/dbconfig/20240827-151819-ladsgroup.json
[15:19:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2019 to wikikube-worker2044 - kamila@cumin1002"
[15:19:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2019 to wikikube-worker2044 - kamila@cumin1002"
[15:19:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:19:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2044
[15:19:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:19:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2044
[15:20:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2019 to wikikube-worker2044
[15:20:17] <wikibugs>	 (PS1) Alexandros Kosiaris: Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878)
[15:20:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096550 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20...
[15:20:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67937 and previous config saved to /var/cache/conftool/dbconfig/20240827-152031-arnaudb.json
[15:22:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2044.codfw.wmnet with OS bullseye
[15:22:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f958d5462b0>
[15:22:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096557 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[15:23:04] <wikibugs>	 (CR) CI reject: [V:-1] Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris)
[15:23:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:24:50] <wikibugs>	 (CR) Krinkle: [C:+1] logging: Remove WhatFailureGroupHandler wrapper from handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1067364 (https://phabricator.wikimedia.org/T373444) (owner: Bartosz Dziewoński)
[15:25:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:26:18] <wikibugs>	 (PS2) Alexandros Kosiaris: Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878)
[15:26:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2044 - kamila@cumin1002"
[15:26:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2044 - kamila@cumin1002"
[15:26:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:26:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2044.codfw.wmnet 207.0.192.10.in-addr.arpa 7.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:26:48] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2044.codfw.wmnet 207.0.192.10.in-addr.arpa 7.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:26:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2044
[15:27:05] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2044
[15:27:05] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f958d5462b0>
[15:29:18] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2027.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:29:35] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:30:11] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:31:52] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2027.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:33:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P67939 and previous config saved to /var/cache/conftool/dbconfig/20240827-153327-ladsgroup.json
[15:33:34] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2028.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:34:34] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris)
[15:35:04] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2028.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:35:06] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2293 to wikikube-worker2045
[15:35:22] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[15:36:15] <wikibugs>	 (PS3) Dbrant: Turn account vanishing contact form into a redirect. [mediawiki-config] - https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828)
[15:36:19] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2029.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:37:27] <wikibugs>	 (Abandoned) Brouberol: Remove the pgcluster-test in dse-k8s, no longer useful [deployment-charts] - https://gerrit.wikimedia.org/r/1067346 (owner: Brouberol)
[15:39:01] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2029.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:39:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2293 to wikikube-worker2045 - akosiaris@cumin1002"
[15:39:57] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2033.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:42:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2293 to wikikube-worker2045 - akosiaris@cumin1002"
[15:42:20] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:42:21] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2045
[15:42:31] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2045
[15:42:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage
[15:43:11] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2293 to wikikube-worker2045
[15:43:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096619 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2293 to...
[15:43:49] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2045.codfw.wmnet with OS bullseye
[15:43:59] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f7528213c70>
[15:44:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096620 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host...
[15:44:09] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[15:45:36] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage
[15:45:46] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2033.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:46:15] <wikibugs>	 (PS1) Elukey: blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363)
[15:46:36] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2034.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:48:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T367856)', diff saved to https://phabricator.wikimedia.org/P67940 and previous config saved to /var/cache/conftool/dbconfig/20240827-154823-marostegui.json
[15:48:31] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[15:48:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P67941 and previous config saved to /var/cache/conftool/dbconfig/20240827-154834-ladsgroup.json
[15:48:46] <wikibugs>	 (PS3) Cathal Mooney: Expose Netbox tunnel data to config templates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351)
[15:49:02] <wikibugs>	 (CR) Hnowlan: [C:+1] blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) (owner: Elukey)
[15:49:54] <wikibugs>	 (CR) CI reject: [V:-1] Expose Netbox tunnel data to config templates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: Cathal Mooney)
[15:50:24] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2045 - akosiaris@cumin1002"
[15:50:29] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2045 - akosiaris@cumin1002"
[15:50:29] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:50:29] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2045.codfw.wmnet 163.0.192.10.in-addr.arpa 3.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:50:32] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2045.codfw.wmnet 163.0.192.10.in-addr.arpa 3.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:50:33] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2045
[15:50:58] <wikibugs>	 (PS26) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[15:51:05] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2034.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:51:44] <wikibugs>	 (PS4) Cathal Mooney: Expose Netbox tunnel data to config templates [software/homer/deploy] - https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351)
[15:52:01] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2045
[15:52:01] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f7528213c70>
[15:52:10] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2035.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:54:44] <denisse>	 !log Start prometheus4002 Bookworm upgrade - T326657
[15:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:47] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[15:57:10] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10096680 (BTullis) Checking the `megacli` output shows that the RAID BBU reports OK. ` btullis@an-worker1127:~$ sudo megacli -AdpBbuCmd -aA...
[15:57:14] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2035.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:57:35] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2036.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:57:48] <jinxer-wm>	 FIRING: KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2292.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:57:48] <wikibugs>	 (CR) Elukey: [C:+2] blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) (owner: Elukey)
[15:57:58] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10096682 (BTullis) a:BTullis
[15:58:57] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2036.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:59:27] <logmsgbot>	 !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2037.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[16:00:00] <wikibugs>	 (03CR) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data (032 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney)
[16:00:05] <jouncebot>	 jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:23] <wikibugs>	 (03CR) 10Cathal Mooney: Expose Netbox tunnel data to config templates (038 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney)
[16:00:36] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10096690 (10BTullis) a:03BTullis
[16:01:22] <wikibugs>	 (03Merged) 10jenkins-bot: blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey)
[16:03:16] <logmsgbot>	 !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2037.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[16:03:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P67942 and previous config saved to /var/cache/conftool/dbconfig/20240827-160330-marostegui.json
[16:03:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T370903)', diff saved to https://phabricator.wikimedia.org/P67943 and previous config saved to /var/cache/conftool/dbconfig/20240827-160341-ladsgroup.json
[16:03:43] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[16:03:55] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:03:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[16:04:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T370903)', diff saved to https://phabricator.wikimedia.org/P67944 and previous config saved to /var/cache/conftool/dbconfig/20240827-160403-ladsgroup.json
[16:05:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2044.codfw.wmnet with OS bullseye
[16:05:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub...
[16:08:58] <wikibugs>	 (03CR) 10BryanDavis: "The code is there for exactly what it says on the tin: debugging LDAP problems on wikitech. There have been several times in the past when" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński)
[16:12:10] <kamila_>	 !log ran homer to add wikikube-worker2044 T372878
[16:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:13] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[16:13:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2044.codfw.wmnet
[16:13:05] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2044.codfw.wmnet
[16:14:29] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457 (10kamila) 03NEW
[16:17:42] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet
[16:18:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P67945 and previous config saved to /var/cache/conftool/dbconfig/20240827-161837-marostegui.json
[16:19:52] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: store graph type in data_loaded file [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[16:21:41] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet
[16:25:23] <denisse>	 !log Start prometheus5002 Bookworm upgrade - T326657
[16:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:27] <stashbot>	 T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[16:31:17] <wikibugs>	 (03PS1) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454)
[16:31:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[16:33:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T367856)', diff saved to https://phabricator.wikimedia.org/P67946 and previous config saved to /var/cache/conftool/dbconfig/20240827-163345-marostegui.json
[16:33:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2166.codfw.wmnet with reason: Maintenance
[16:33:49] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[16:33:55] <wikibugs>	 (03PS2) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454)
[16:34:00] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2166.codfw.wmnet with reason: Maintenance
[16:34:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T367856)', diff saved to https://phabricator.wikimedia.org/P67947 and previous config saved to /var/cache/conftool/dbconfig/20240827-163407-marostegui.json
[16:35:48] <wikibugs>	 (03PS1) 10Elukey: services: update Thumbor Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067382 (https://phabricator.wikimedia.org/T373363)
[16:36:49] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] services: update Thumbor Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067382 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey)
[16:38:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T370903)', diff saved to https://phabricator.wikimedia.org/P67948 and previous config saved to /var/cache/conftool/dbconfig/20240827-163817-ladsgroup.json
[16:38:22] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:39:49] <wikibugs>	 (03PS1) 10Bking: wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368)
[16:40:04] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[16:40:26] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack keystone: add a new auth plugin to validate totp tokens against idm [puppet] - 10https://gerrit.wikimedia.org/r/1064480 (https://phabricator.wikimedia.org/T373462)
[16:40:28] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack keystone: switch to idmtotp for 2fa [puppet] - 10https://gerrit.wikimedia.org/r/1064481 (https://phabricator.wikimedia.org/T373462)
[16:42:23] <wikibugs>	 (03PS2) 10Bking: wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368)
[16:42:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[16:44:37] <wikibugs>	 (03PS1) 10Hnowlan: aptrepo: add ffmpeg buster component [puppet] - 10https://gerrit.wikimedia.org/r/1067384 (https://phabricator.wikimedia.org/T373128)
[16:45:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002"
[16:45:49] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002"
[16:45:49] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:48:37] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "The backend setup has been verified, correct?" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[16:49:05] <wikibugs>	 (03PS3) 10Bking: wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368)
[16:50:21] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet
[16:51:50] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[16:52:13] <wikibugs>	 (03PS2) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664)
[16:52:29] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[16:53:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P67949 and previous config saved to /var/cache/conftool/dbconfig/20240827-165325-ladsgroup.json
[16:54:22] <wikibugs>	 (03CR) 10Bking: "ACK, looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/traffics" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[16:56:35] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet
[16:57:59] <wikibugs>	 (03CR) 10Ryan Kemper: "PCC looks good. We should discuss with sukhe how this is best deployed; ie can we just do an lvs rolling restart directly or do we need to" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[16:58:04] <icinga-wm>	 PROBLEM - Host an-worker1165 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:01:09] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[17:01:11] <icinga-wm>	 RECOVERY - Host an-worker1165 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[17:02:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery)
[17:06:04] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] "Spoke to sukhe, simple lvs restart should be sufficient for this" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[17:06:19] <wikibugs>	 (03CR) 10Bking: [C:03+2] wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[17:08:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P67950 and previous config saved to /var/cache/conftool/dbconfig/20240827-170832-ladsgroup.json
[17:08:42] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Add the matomo-plugin-customreports package to Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis)
[17:08:46] <ryankemper>	 !log T364368 Disabled puppet on all lvs hosts in preparation for rolling restart
[17:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:50] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[17:08:52] <logmsgbot>	 !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2045.codfw.wmnet with OS bullseye
[17:09:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10097092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki...
[17:09:22] <wikibugs>	 (03PS4) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247)
[17:10:08] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[17:10:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10097095 (10VRiley-WMF) logging-sd1001 Rack E 5 U 32 CableID 20220092 Port 18  logging-sd1002 Rack E 6 U 31 CableID 20220057 Port 18  logging-sd1003 F 5 U 31 CableID 20220091...
[17:10:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10097097 (10VRiley-WMF)
[17:11:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] php8.1-fpm: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064815 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[17:12:29] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[17:13:41] <wikibugs>	 (03CR) 10JHathaway: "Though the PCC diff shows the file as being base64 encoded, I have confirmed that this is only how it is displayed in the catalog. The con" [puppet] - 10https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway)
[17:13:50] <ryankemper>	 !log T364368 Ran puppet on `A:lvs-secondary-eqiad` and restarted pybal.service
[17:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:54] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[17:16:00] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] php8.1-fpm-multiversion-base: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064816 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French)
[17:16:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs1023.eqiad.wmnet are marked down but pooled: wdqs-main_443: Servers wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:16:21] <sukhe>	 ^ looking into it
[17:16:21] <ryankemper>	 ^ known
[17:17:12] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs-main:443 has failed probes (http_wdqs-main_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:17:23] <sukhe>	 all good ^
[17:17:25] <sukhe>	 ACKing
[17:17:27] <sukhe>	 !incidents
[17:17:28] <sirenbot>	 5120 (UNACKED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[17:17:31] <sukhe>	 !ack 5120
[17:17:31] <sirenbot>	 5120 (ACKED)  [2x] ProbeDown sre (ip4 probes/service codfw)
[17:17:44] <wikibugs>	 (03PS1) 10Bking: wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368)
[17:17:50] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389
[17:18:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (owner: 10Jdlrobson)
[17:18:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (owner: 10Jdlrobson)
[17:18:11] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:18:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) (owner: 10Jdlrobson)
[17:18:25] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[17:20:07] <icinga-wm>	 PROBLEM - Host an-worker1165 is DOWN: PING CRITICAL - Packet loss = 100%
[17:22:49] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[17:23:23] <wikibugs>	 (03CR) 10Bking: [C:03+2] wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[17:23:26] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking)
[17:23:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T370903)', diff saved to https://phabricator.wikimedia.org/P67951 and previous config saved to /var/cache/conftool/dbconfig/20240827-172339-ladsgroup.json
[17:23:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[17:23:43] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:23:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[17:24:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T370903)', diff saved to https://phabricator.wikimedia.org/P67952 and previous config saved to /var/cache/conftool/dbconfig/20240827-172401-ladsgroup.json
[17:24:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T371742)', diff saved to https://phabricator.wikimedia.org/P67953 and previous config saved to /var/cache/conftool/dbconfig/20240827-172436-ladsgroup.json
[17:24:41] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[17:24:54] <ryankemper>	 !log T364368 `ryankemper@cumin2002:~$ sudo cumin 'A:lvs-secondary-eqiad' 'systemctl status pybal.service'`
[17:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:58] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[17:25:19] <wikibugs>	 (03PS3) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664)
[17:25:26] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[17:25:42] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Upgrade airflow to 2.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[17:26:29] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957)
[17:26:46] <wikibugs>	 (03PS2) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957)
[17:27:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:27:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Should we add a global silence to alertmanager while we are still testing, or will we just all remember that these are pre-productio" [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol)
[17:27:57] <wikibugs>	 (03PS3) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957)
[17:29:41] <sukhe>	 !log sukhe@lvs1020:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.36:80
[17:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:24] <sukhe>	 !log sukhe@lvs1020:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.33:80
[17:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:43] <sukhe>	 !log force recheck on Icinga for lvs1020
[17:30:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T370903)', diff saved to https://phabricator.wikimedia.org/P67954 and previous config saved to /var/cache/conftool/dbconfig/20240827-173132-ladsgroup.json
[17:31:36] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:33:03] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:35:46] <wikibugs>	 (03PS5) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247)
[17:37:18] <ryankemper>	 !log T364368 Ran puppet on `A:lvs-low-traffic-eqiad` and restarted `pybal.service`
[17:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:22] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[17:38:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[17:38:51] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:39:01] <sukhe>	 ^ expected
[17:39:02] <ryankemper>	 ^ known, cleaning these up
[17:39:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67956 and previous config saved to /var/cache/conftool/dbconfig/20240827-173944-ladsgroup.json
[17:40:26] <ryankemper>	 !log T364368 Cleared away old ipvs entries for `10.2.2.33:80` and `10.2.2.36:80`
[17:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:45] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:41:46] <ryankemper>	 !log Forced recheck on lvs2019 to clear alert
[17:41:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:05] <ryankemper>	 !log Typo, meant to say forced recheck on `lvs1019` to clear alert
[17:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:49] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002"
[17:43:53] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002"
[17:43:53] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:44:33] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth)
[17:46:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P67957 and previous config saved to /var/cache/conftool/dbconfig/20240827-174639-ladsgroup.json
[17:47:45] <ryankemper>	 !log T364368 Ran puppet on `A:lvs-secondary-codfw`, restarted `pybal.service`, and cleared away old ipvs entries for `10.2.1.33:80` and `10.2.1.36:80`
[17:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:49] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[17:48:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:50:55] <ryankemper>	 !log T364368 Ran puppet on `A:lvs-low-traffic-codfw`, restarted `pybal.service`, and cleared away old ipvs entries for `10.2.1.33:80` and `10.2.1.36:80`
[17:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:10] <ryankemper>	 !log T364368 Our LVS operation is done; I've enabled/ran puppet on the remaining lvs hosts
[17:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:15] <stashbot>	 T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[17:54:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67959 and previous config saved to /var/cache/conftool/dbconfig/20240827-175451-ladsgroup.json
[17:58:42] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs-main:443 has failed probes (http_wdqs-main_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:53] <sukhe>	 inflatador: ryankemper: ^
[18:00:28] <sukhe>	 (resolved)
[18:01:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10097355 (10thcipriani) Approved!
[18:01:41] <inflatador>	 sukhe ACK, thanks again!
[18:01:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P67960 and previous config saved to /var/cache/conftool/dbconfig/20240827-180146-ladsgroup.json
[18:01:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10097356 (10ssingh)
[18:02:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve1009
[18:02:52] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10097358 (10ssingh) @JayCano: This requires your approval since we already have Tyler's. Thanks!
[18:03:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve1009
[18:04:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve1010
[18:04:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve1010
[18:04:38] <wikibugs>	 (03PS1) 10C. Scott Ananian: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454)
[18:04:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve1011
[18:04:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[18:05:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve1011
[18:05:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1001
[18:05:10] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ml-lab1001
[18:05:15] <wikibugs>	 (03PS1) 10C. Scott Ananian: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454)
[18:05:16] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1002
[18:05:27] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1002
[18:05:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[18:05:33] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1009
[18:06:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:06:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[18:06:54] <wikibugs>	 (03PS1) 10Ssingh: admin: add mszabo to deployment and move from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067397 (https://phabricator.wikimedia.org/T373426)
[18:06:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1009
[18:08:46] <wikibugs>	 (03PS1) 10C. Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789)
[18:09:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[18:09:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T371742)', diff saved to https://phabricator.wikimedia.org/P67961 and previous config saved to /var/cache/conftool/dbconfig/20240827-180958-ladsgroup.json
[18:09:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1001
[18:10:00] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[18:10:04] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[18:10:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1001
[18:10:13] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[18:10:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T371742)', diff saved to https://phabricator.wikimedia.org/P67962 and previous config saved to /var/cache/conftool/dbconfig/20240827-181020-ladsgroup.json
[18:11:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T370903)', diff saved to https://phabricator.wikimedia.org/P67963 and previous config saved to /var/cache/conftool/dbconfig/20240827-181653-ladsgroup.json
[18:16:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[18:16:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10097398 (10Dwisehaupt) a:05Dwisehaupt→03Papaul Assigning to @Papaul for payments2006 setup. Assign back to me when it's ready for OS install an...
[18:16:59] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:17:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[18:17:12] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[18:17:13] <wikibugs>	 (03CR) 10Scott French: [C:03+1] aptrepo: add ffmpeg buster component [puppet] - 10https://gerrit.wikimedia.org/r/1067384 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan)
[18:17:26] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[18:17:31] <wikibugs>	 (03PS1) 10Zabe: Update uzwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067400 (https://phabricator.wikimedia.org/T370165)
[18:17:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T370903)', diff saved to https://phabricator.wikimedia.org/P67964 and previous config saved to /var/cache/conftool/dbconfig/20240827-181732-ladsgroup.json
[18:19:07] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] aptrepo: add ffmpeg buster component [puppet] - 10https://gerrit.wikimedia.org/r/1067384 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan)
[18:19:30] <wikibugs>	 (03PS1) 10Ebernhardson: search update pipeline: correctly handle redirect updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067401
[18:20:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10097416 (10Dwisehaupt) a:05Dwisehaupt→03Papaul Assigning to @Papaul for frdb2005 setup. Assign back to me when it's ready for OS install and setup.
[18:25:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T370903)', diff saved to https://phabricator.wikimedia.org/P67965 and previous config saved to /var/cache/conftool/dbconfig/20240827-182531-ladsgroup.json
[18:25:36] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:28:20] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] search update pipeline: correctly handle redirect updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067401 (owner: 10Ebernhardson)
[18:29:23] <wikibugs>	 (03Merged) 10jenkins-bot: search update pipeline: correctly handle redirect updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067401 (owner: 10Ebernhardson)
[18:33:32] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[18:33:37] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:38:31] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:38:37] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:40:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P67966 and previous config saved to /var/cache/conftool/dbconfig/20240827-184039-ladsgroup.json
[18:47:00] <jinxer-wm>	 FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[18:49:12] <wikibugs>	 (03PS4) 10Jdlrobson: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633)
[18:49:45] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:49:59] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[18:55:00] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[18:55:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P67967 and previous config saved to /var/cache/conftool/dbconfig/20240827-185546-ladsgroup.json
[18:58:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[19:01:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002"
[19:01:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002"
[19:01:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:01:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1001
[19:01:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1001
[19:04:45] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1003 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:05:05] <wikibugs>	 (03CR) 10BCornwall: "Seems nobody piped up." [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall)
[19:05:15] <wikibugs>	 (03PS1) 10Mstyles: security-landing-page: bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829)
[19:07:33] <wikibugs>	 (03CR) 10SBassett: [C:03+1] "Verified image id." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) (owner: 10Mstyles)
[19:10:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T370903)', diff saved to https://phabricator.wikimedia.org/P67968 and previous config saved to /var/cache/conftool/dbconfig/20240827-191053-ladsgroup.json
[19:10:56] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[19:10:58] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:11:09] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[19:11:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T370903)', diff saved to https://phabricator.wikimedia.org/P67969 and previous config saved to /var/cache/conftool/dbconfig/20240827-191116-ladsgroup.json
[19:19:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T370903)', diff saved to https://phabricator.wikimedia.org/P67970 and previous config saved to /var/cache/conftool/dbconfig/20240827-191915-ladsgroup.json
[19:19:21] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:25:28] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Update uzwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067400 (https://phabricator.wikimedia.org/T370165) (owner: 10Zabe)
[19:26:11] <wikibugs>	 (03Merged) 10jenkins-bot: Update uzwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067400 (https://phabricator.wikimedia.org/T370165) (owner: 10Zabe)
[19:26:56] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067400|Update uzwiki logo (T370165)]]
[19:27:03] <stashbot>	 T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165
[19:30:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1025.eqiad.wmnet, kubernetes1023.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1408.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, kubernetes1017.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, mw1394.eqiad.wmnet, mw1360.eqiad.wmnet, parse1012.eqiad.wmnet, kubernetes1015.eqiad.wmnet, mw1352.eqiad.wmnet, parse1006.eqiad
[19:30:17] <icinga-wm>	 mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, kubernetes1026.eqiad.wmnet, mw1409.eqiad.wmnet, mw1383.eqiad.wmnet, wikikube-worker1032.eqiad.wmnet, mw1416.eqiad.wmnet, kubernetes1054.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, parse1014.eqiad.wmnet, mw1478.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1040.eqiad.wmnet, wikikube-worker1012.eqiad.wmnet, kubernetes1016.eqiad.wmnet, mw1461.eqiad.wmnet, 
[19:30:17] <icinga-wm>	 qiad.wmnet, wikikube-worker1017.eqiad.wmnet, mw1423.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, mw1399.eqiad.wmnet, https://wikitech.wikimedia.org/wiki/PyBal
[19:30:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers wikikube-worker1012.eqiad.wmnet, wikikube-worker1028.eqiad.wmnet, mw1409.eqiad.wmnet, kubernetes1036.eqiad.wmnet, parse1007.eqiad.wmnet, mw1457.eqiad.wmnet, mw1455.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1475.eqiad.wmnet, mw1374.eqiad.wmnet, kubernetes1062.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubernetes1037.eqiad.wmne
[19:30:23] <icinga-wm>	 4.eqiad.wmnet, wikikube-worker1032.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1482.eqiad.wmnet, kubernetes1040.eqiad.wmnet, mw1495.eqiad.wmnet, parse1024.eqiad.wmnet, wikikube-worker1017.eqiad.wmnet, mw1477.eqiad.wmnet, mw1423.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, mw1394.eqiad.wmnet, mw1385.eqiad.wmnet, mw1452.eqiad.wmnet, mw1422.eqiad.wmnet, mw1361.eqiad.wmnet, parse1008.eqiad.wmnet
[19:30:23] <icinga-wm>	 be-worker1027.eqiad.wmnet, kubernetes1009.eqiad.wmnet, mw1448.eqiad.wmnet, wikikube-worker1030.eqiad.wmnet, mw1421.eqiad.wmnet, mw1377.eqiad.wmnet, kubernetes1029.eqiad.wmnet, parse1004 https://wikitech.wikimedia.org/wiki/PyBal
[19:31:05] <sukhe>	 woah
[19:31:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:31:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:31:26] <sukhe>	 what is this about?
[19:31:33] <sukhe>	 ah
[19:32:28] <wikibugs>	 (03PS2) 10Hashar: Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson)
[19:33:03] <wikibugs>	 (03CR) 10Hashar: "I have attached it to T373463 with:" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson)
[19:34:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P67971 and previous config saved to /var/cache/conftool/dbconfig/20240827-193424-ladsgroup.json
[19:37:47] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1067400|Update uzwiki logo (T370165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:37:48] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:37:51] <stashbot>	 T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165
[19:38:46] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[19:44:04] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067400|Update uzwiki logo (T370165)]] (duration: 17m 07s)
[19:44:08] <stashbot>	 T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165
[19:49:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P67972 and previous config saved to /var/cache/conftool/dbconfig/20240827-194930-ladsgroup.json
[19:49:36] <wikibugs>	 (03PS1) 10Scott French: kubernetes: re-name/IP kubernetes2026 as wikikube-worker2046 [puppet] - 10https://gerrit.wikimedia.org/r/1067414 (https://phabricator.wikimedia.org/T372878)
[19:51:39] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson)
[19:57:30] <wikibugs>	 (03PS6) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247)
[19:58:15] <wikibugs>	 (03PS1) 10Dzahn: prometheus/gerrit: also add size of tracking list to exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067415 (https://phabricator.wikimedia.org/T373136)
[19:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[19:59:44] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Turn account vanishing contact form into a redirect. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant)
[19:59:46] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T2000).
[20:00:05] <jouncebot>	 dbrant, Pppery, Jdlrobson, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:06] <zabe>	 I can deploy
[20:00:08] <Pppery>	 here
[20:00:12] <dbrant>	 o/
[20:00:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant)
[20:00:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery)
[20:00:25] <wikibugs>	 (03Merged) 10jenkins-bot: Turn account vanishing contact form into a redirect. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant)
[20:00:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery)
[20:00:56] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1065189|Turn account vanishing contact form into a redirect. (T372828)]], [[gerrit:1066903|Revert "[svwikt] Add a temporary logo for the 100.000 pages" (T364247)]]
[20:01:05] <stashbot>	 T372828: Redirect old vanishing form to new one - https://phabricator.wikimedia.org/T372828
[20:01:06] <stashbot>	 T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247
[20:01:41] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[20:01:48] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:02:40] <Jdlrobson>	 o/
[20:03:43] <wikibugs>	 (03PS2) 10Dzahn: codesearch: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677)
[20:04:08] <mutante>	 jouncebot: now
[20:04:08] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T2000)
[20:04:13] <logmsgbot>	 !log zabe@deploy1003 dbrant, zabe, pppery: Backport for [[gerrit:1065189|Turn account vanishing contact form into a redirect. (T372828)]], [[gerrit:1066903|Revert "[svwikt] Add a temporary logo for the 100.000 pages" (T364247)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:04:15] <zabe>	 Pppery: dbrant: can you test?
[20:04:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T370903)', diff saved to https://phabricator.wikimedia.org/P67973 and previous config saved to /var/cache/conftool/dbconfig/20240827-200437-ladsgroup.json
[20:04:40] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[20:04:41] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:04:53] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[20:05:00] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[20:05:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T370903)', diff saved to https://phabricator.wikimedia.org/P67974 and previous config saved to /var/cache/conftool/dbconfig/20240827-200459-ladsgroup.json
[20:05:06] <Pppery>	 I had to bypass my browser cache to get the new logo to show, but it seems to work
[20:05:28] <dbrant>	 mine looks good
[20:05:41] <zabe>	 alright
[20:05:42] <logmsgbot>	 !log zabe@deploy1003 dbrant, zabe, pppery: Continuing with sync
[20:06:16] <cscott>	 zabe: oi
[20:06:19] <cscott>	 zabe: i'm here
[20:06:30] <zabe>	 hello hello
[20:06:38] <mutante>	 if you want to you can run a command to purge the logo from caches
[20:06:44] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] kubernetes: re-name/IP kubernetes2026 as wikikube-worker2046 [puppet] - 10https://gerrit.wikimedia.org/r/1067414 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French)
[20:06:48] <mutante>	 but just waiting will also work
[20:07:00] <jinxer-wm>	 RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[20:07:08] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) (owner: 10Jdlrobson)
[20:07:48] <zabe>	 since the old logo was located at a different url, it should work without purging, I guess?
[20:07:54] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] codesearch: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[20:08:07] <zabe>	 cscott: do your changes depend on each other?
[20:08:15] <mutante>	 zabe: yes, true. in that case
[20:08:24] <wikibugs>	 (03Merged) 10jenkins-bot: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) (owner: 10Jdlrobson)
[20:08:50] <cscott>	 zabe: the last two do: the ParserMigration extension patch needs to be backported before the config change is made
[20:09:07] <cscott>	 zabe: the first two just prevent logspam and can be done in any order
[20:09:19] <zabe>	 alright
[20:09:25] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Tweak styling of compact Parsoid indicator [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:09:26] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[20:09:27] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[20:11:12] <zabe>	 Jdlrobson: your changes do not depend on each other, do they?
[20:11:23] <Jdlrobson>	 zabe: nope
[20:11:30] <Jdlrobson>	 can go out separately or together.. whatever is easiest
[20:12:25] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1065189|Turn account vanishing contact form into a redirect. (T372828)]], [[gerrit:1066903|Revert "[svwikt] Add a temporary logo for the 100.000 pages" (T364247)]] (duration: 11m 28s)
[20:12:30] <stashbot>	 T372828: Redirect old vanishing form to new one - https://phabricator.wikimedia.org/T372828
[20:12:30] <stashbot>	 T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247
[20:12:40] <zabe>	 ok, let's start with your config patch then - the other one is still running CI
[20:12:47] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1057026|Disable mobile Watchlist on wikidata since its broken (T263633)]]
[20:12:51] <stashbot>	 T263633: Mobile Special:EditWatchlist displays item IDs instead of labels - https://phabricator.wikimedia.org/T263633
[20:12:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T370903)', diff saved to https://phabricator.wikimedia.org/P67975 and previous config saved to /var/cache/conftool/dbconfig/20240827-201256-ladsgroup.json
[20:13:00] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:13:04] <zabe>	 dbrant: Pppery: your changes should be live
[20:14:04] <Pppery>	 thanks
[20:14:29] <Pppery>	 Although it's not really my change - I just shepherded it through the process after seeing it languish in Phabricator for weeks
[20:14:36] <zabe>	 yeah fair
[20:15:00] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[20:15:04] <logmsgbot>	 !log zabe@deploy1003 jdlrobson, zabe: Backport for [[gerrit:1057026|Disable mobile Watchlist on wikidata since its broken (T263633)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:15:07] <zabe>	 Jdlrobson: is your config patch testable?
[20:15:34] <Jdlrobson>	 zabe: yep
[20:15:39] <Jdlrobson>	 let me know when it's on debug
[20:15:53] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[20:16:42] <Jdlrobson>	 i see it is - zabe looks good - please sync!
[20:17:53] <zabe>	 alright
[20:17:55] <logmsgbot>	 !log zabe@deploy1003 jdlrobson, zabe: Continuing with sync
[20:18:40] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn)
[20:22:27] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1057026|Disable mobile Watchlist on wikidata since its broken (T263633)]] (duration: 09m 39s)
[20:22:31] <stashbot>	 T263633: Mobile Special:EditWatchlist displays item IDs instead of labels - https://phabricator.wikimedia.org/T263633
[20:22:51] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:22:55] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:22:56] <wikibugs>	 (03PS3) 10JHathaway: puppet8: mtail, check if notify is defined [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664)
[20:23:03] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[20:24:22] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson)
[20:24:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson)
[20:24:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:24:27] <wikibugs>	 (03Merged) 10jenkins-bot: Tweak styling of compact Parsoid indicator [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:24:47] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067389|Revert "Allow gadget/browser extension extensibility of empty search state" (T373463)]], [[gerrit:1066882|Tweak styling of compact Parsoid indicator (T372789)]]
[20:24:56] <stashbot>	 T373463: Text "empty" appears after search input when first clicking into it - https://phabricator.wikimedia.org/T373463
[20:24:56] <stashbot>	 T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789
[20:27:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2043.codfw.wmnet
[20:27:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2043.codfw.wmnet
[20:27:43] <logmsgbot>	 !log zabe@deploy1003 cscott, zabe, jdlrobson: Backport for [[gerrit:1067389|Revert "Allow gadget/browser extension extensibility of empty search state" (T373463)]], [[gerrit:1066882|Tweak styling of compact Parsoid indicator (T372789)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:28:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P67976 and previous config saved to /var/cache/conftool/dbconfig/20240827-202803-ladsgroup.json
[20:28:05] <zabe>	 Jdlrobson: your backport is at mwdebug
[20:28:57] <zabe>	 cscott: is the parsoid indicator backport testable?
[20:29:07] <cscott>	 only testable after the config deploy, alas.
[20:29:12] <zabe>	 alright
[20:29:17] <zabe>	 then I will just sync it
[20:29:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T371742)', diff saved to https://phabricator.wikimedia.org/P67977 and previous config saved to /var/cache/conftool/dbconfig/20240827-202954-ladsgroup.json
[20:29:58] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[20:30:05] <cscott>	 i verified that it doesn't horribly crash anything at least :)
[20:30:15] <zabe>	 good
[20:30:49] <cscott>	 (by loading en.wikivoyage.org on the debug servers, which has parsoid read views on by default and the old indicator style, which is not/should not be affected by the backported patch)
[20:30:54] <zabe>	 Jdlrobson: I quickly tried testing your patch myself, but unless I am doing something wrong on testwiki, it does not seem to fix the issue
[20:32:05] <cscott>	 zabe: if the Kartographer patches are live I can try to verify the absence of logspam
[20:32:30] <Jdlrobson>	 zabe: (looking)
[20:33:09] <Jdlrobson>	 zabe: lgtm - perhaps you are getting cached JS or CSS?
[20:33:14] <Jdlrobson>	 this looks good to sync to me!
[20:33:43] <zabe>	 oh yeah - clearing browser cache fixed it
[20:33:48] <zabe>	 cool
[20:33:49] <logmsgbot>	 !log zabe@deploy1003 cscott, zabe, jdlrobson: Continuing with sync
[20:37:21] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] security-landing-page: bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) (owner: 10Mstyles)
[20:38:10] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067389|Revert "Allow gadget/browser extension extensibility of empty search state" (T373463)]], [[gerrit:1066882|Tweak styling of compact Parsoid indicator (T372789)]] (duration: 13m 23s)
[20:38:15] <stashbot>	 T373463: Text "empty" appears after search input when first clicking into it - https://phabricator.wikimedia.org/T373463
[20:38:16] <stashbot>	 T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789
[20:38:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[20:38:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[20:38:29] <wikibugs>	 (03CR) 10Srishakatux: "I checked with @amir.aharoni@mail.huji.ac.il and as per his feedback this is not needed as the `core-Namespaces.php` is for aliases and ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux)
[20:38:33] <wikibugs>	 (03PS4) 10Srishakatux: Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271)
[20:38:36] <wikibugs>	 (03Merged) 10jenkins-bot: security-landing-page: bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) (owner: 10Mstyles)
[20:39:01] <wikibugs>	 (03PS27) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[20:40:40] <wikibugs>	 (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[20:40:58] <wikibugs>	 (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins)
[20:43:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P67978 and previous config saved to /var/cache/conftool/dbconfig/20240827-204310-ladsgroup.json
[20:44:13] <wikibugs>	 (03PS3) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454)
[20:44:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[20:45:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67979 and previous config saved to /var/cache/conftool/dbconfig/20240827-204501-ladsgroup.json
[20:45:13] <wikibugs>	 (03Merged) 10jenkins-bot: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[20:45:14] <wikibugs>	 (03Merged) 10jenkins-bot: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian)
[20:45:34] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067396|Remove warning on non-existing category (T373454)]], [[gerrit:1067395|Remove warning on non-existing category (T373454)]]
[20:45:39] <stashbot>	 T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454
[20:48:29] <logmsgbot>	 !log zabe@deploy1003 cscott, zabe: Backport for [[gerrit:1067396|Remove warning on non-existing category (T373454)]], [[gerrit:1067395|Remove warning on non-existing category (T373454)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:49:15] <cscott>	 zabe: i'm watching the logs and i don't see any of the canaries, but i don't know how long i'd have to watch to be sure of that.
[20:49:22] <logmsgbot>	 !log zabe@deploy1003 cscott, zabe: Continuing with sync
[20:49:31] <cscott>	 zabe: yeah, great.
[20:49:41] <logmsgbot>	 !log mstyles@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[20:49:50] <zabe>	 I would just sync through and keep an eye on the logs while doing that
[20:50:02] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:50:22] <Jdlrobson>	 thanks zabe for the help today!
[20:50:28] <zabe>	 yw
[20:50:37] <cscott>	 https://logstash.wikimedia.org/goto/fe96b774b9ec8273a41333a492b8dcb2 is what i'm looking at
[20:51:01] <logmsgbot>	 !log mstyles@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[20:51:43] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[20:52:14] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[20:52:18] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[20:52:47] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[20:52:55] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[20:52:57] <logmsgbot>	 !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[20:53:04] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[20:53:06] <logmsgbot>	 !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[20:53:17] <cscott>	 zabe i added one more config patch which i'd missed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067381 sorry about that. it's belt and suspenders for the commonswiki logspam, but also avoids some crashes on hewiki.
[20:53:38] <wikibugs>	 (03PS2) 10C. Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789)
[20:53:45] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067396|Remove warning on non-existing category (T373454)]], [[gerrit:1067395|Remove warning on non-existing category (T373454)]] (duration: 08m 11s)
[20:53:47] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:53:49] <stashbot>	 T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454
[20:53:57] <wikibugs>	 (03PS4) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454)
[20:53:58] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[20:54:20] <cscott>	 zabe: the commonswiki logspam seems to have stopped, yay
[20:54:34] <wikibugs>	 (03Merged) 10jenkins-bot: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian)
[20:54:40] <zabe>	 cool
[20:54:42] <wikibugs>	 (03Merged) 10jenkins-bot: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[20:55:08] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067398|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]], [[gerrit:1067381|Rollback Parsoid+Kartographer rollout on hewiki and commons (T373454 T373460)]]
[20:55:14] <stashbot>	 T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789
[20:55:15] <stashbot>	 T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460
[20:57:13] <logmsgbot>	 !log zabe@deploy1003 ihurbain, zabe, cscott: Backport for [[gerrit:1067398|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]], [[gerrit:1067381|Rollback Parsoid+Kartographer rollout on hewiki and commons (T373454 T373460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:57:28] <zabe>	 cscott: both config patches are at mwdebug
[20:57:46] <wikibugs>	 (03PS4) 10JHathaway: puppet8: mtail, check if notify is defined [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664)
[20:58:10] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[20:58:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T370903)', diff saved to https://phabricator.wikimedia.org/P67980 and previous config saved to /var/cache/conftool/dbconfig/20240827-205817-ladsgroup.json
[20:58:20] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:58:22] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:58:33] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:58:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[20:58:48] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[20:58:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T370903)', diff saved to https://phabricator.wikimedia.org/P67981 and previous config saved to /var/cache/conftool/dbconfig/20240827-205855-ladsgroup.json
[20:58:56] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2026.codfw.wmnet
[20:59:29] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2026.codfw.wmnet
[20:59:36] <cscott>	 zabe: ok testing.
[21:00:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67982 and previous config saved to /var/cache/conftool/dbconfig/20240827-210008-ladsgroup.json
[21:00:27] <wikibugs>	 (03CR) 10Subramanya Sastry: Rollback Parsoid+Kartographer rollout on hewiki and commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[21:01:03] <cscott>	 zabe ok, verified the kartographer/hewiki one.  checking the other.
[21:01:30] <cscott>	 zabe: yep, that looks good to.  good to sync
[21:01:31] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.461s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:01:34] <cscott>	 *too
[21:01:36] <zabe>	 alright
[21:01:39] <logmsgbot>	 !log zabe@deploy1003 ihurbain, zabe, cscott: Continuing with sync
[21:02:13] <wikibugs>	 (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP kubernetes2026 as wikikube-worker2046 [puppet] - 10https://gerrit.wikimedia.org/r/1067414 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French)
[21:02:40] <cscott>	 subbu: w/ x-wikimedia-debug on, https://en.wikivoyage.org/wiki/Windsor_(Ontario) should have a compact parsoid indicator and https://he.wikipedia.org/wiki/%D7%9E%D7%92%D7%93%D7%9C_%D7%93%D7%9E%D7%A8%D7%99 should render in parsoid read views w/o crashing.
[21:03:04] <cscott>	 subbu: not fully synced yet, just on canaries so far
[21:03:12] <subbu>	 is than an fyi or do you want me to verify?
[21:03:17] <subbu>	 *that
[21:03:44] <cscott>	 subbu: yes?  i was giving you an fyi so that if you wanted to verify you could, or you could test some urls other than the one I did :)
[21:03:50] <cscott>	 but it looks good to me
[21:04:36] <subbu>	 should be good if you tested it.
[21:06:03] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067398|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]], [[gerrit:1067381|Rollback Parsoid+Kartographer rollout on hewiki and commons (T373454 T373460)]] (duration: 10m 55s)
[21:06:05] <subbu>	 cscott, but fyi reg https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067381 .. i left a comment there. you could not disable it on commons.
[21:06:06] <zabe>	 should be live
[21:06:09] <stashbot>	 T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789
[21:06:09] <stashbot>	 T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454
[21:06:09] <stashbot>	 T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460
[21:06:22] <cscott>	 subbu: yeah, but i figured belt-and-suspenders
[21:06:28] <subbu>	 okay. :)
[21:06:31] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.461s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:06:34] <cscott>	 subbu: i verified that the logspam stopped on commons before we deployed that
[21:06:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T370903)', diff saved to https://phabricator.wikimedia.org/P67983 and previous config saved to /var/cache/conftool/dbconfig/20240827-210646-ladsgroup.json
[21:06:48] <subbu>	 k
[21:06:51] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[21:07:01] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2026 to wikikube-worker2046
[21:07:21] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.netbox
[21:08:15] <wikibugs>	 (03CR) 10C. Scott Ananian: Rollback Parsoid+Kartographer rollout on hewiki and commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[21:11:11] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2026 to wikikube-worker2046 - swfrench@cumin2002"
[21:11:50] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2026 to wikikube-worker2046 - swfrench@cumin2002"
[21:11:50] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:11:51] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2046
[21:12:18] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2046
[21:12:59] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2026 to wikikube-worker2046
[21:13:14] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes...
[21:13:51] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2046.codfw.wmnet on all recursors
[21:13:54] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2046.codfw.wmnet on all recursors
[21:14:58] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2046.codfw.wmnet with OS bullseye
[21:15:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w...
[21:15:10] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host <spicerack.netbox.NetboxServer object at 0x7f46e8b0b1c0>
[21:15:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T371742)', diff saved to https://phabricator.wikimedia.org/P67984 and previous config saved to /var/cache/conftool/dbconfig/20240827-211516-ladsgroup.json
[21:15:18] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[21:15:20] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[21:15:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[21:15:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T371742)', diff saved to https://phabricator.wikimedia.org/P67985 and previous config saved to /var/cache/conftool/dbconfig/20240827-211538-ladsgroup.json
[21:15:54] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.netbox
[21:18:58] <wikibugs>	 (03PS1) 10JHathaway: puppet8: add phd_pass [labs/private] - 10https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664)
[21:19:37] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] puppet8: add phd_pass [labs/private] - 10https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[21:19:40] <wikibugs>	 (03CR) 10JHathaway: [V:03+2 C:03+2] puppet8: add phd_pass [labs/private] - 10https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[21:20:03] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2046 - swfrench@cumin2002"
[21:20:09] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2046 - swfrench@cumin2002"
[21:20:09] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:20:10] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2046.codfw.wmnet 69.0.192.10.in-addr.arpa 9.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:20:12] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2046.codfw.wmnet 69.0.192.10.in-addr.arpa 9.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:20:14] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2046
[21:20:20] <cscott>	 zabe: thanks! i forgot to say thank you!
[21:20:48] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2046
[21:20:49] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host <spicerack.netbox.NetboxServer object at 0x7f46e8b0b1c0>
[21:21:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10098141 (10jhathaway)
[21:21:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P67986 and previous config saved to /var/cache/conftool/dbconfig/20240827-212153-ladsgroup.json
[21:23:19] <wikibugs>	 (03CR) 10JHathaway: "The code is a bit ugly, the other option is changing all the mtail define types to add a new parameter, rather than a metaparameter." [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway)
[21:29:32] <wikibugs>	 (03CR) 10Cwhite: "Worth keeping the pattern around for possible use in the future, but probably not needed now since we finished the restarts today?" [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: 10Tiziano Fogli)
[21:35:16] <wikibugs>	 (03PS5) 10Srishakatux: Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271)
[21:36:15] <wikibugs>	 (03CR) 10Amire80: [C:03+1] Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux)
[21:37:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P67987 and previous config saved to /var/cache/conftool/dbconfig/20240827-213700-ladsgroup.json
[21:38:38] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage
[21:41:48] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage
[21:52:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T370903)', diff saved to https://phabricator.wikimedia.org/P67988 and previous config saved to /var/cache/conftool/dbconfig/20240827-215208-ladsgroup.json
[21:52:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[21:52:14] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[21:52:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[21:52:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67989 and previous config saved to /var/cache/conftool/dbconfig/20240827-215230-ladsgroup.json
[21:57:03] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:33] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:47] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:59:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67990 and previous config saved to /var/cache/conftool/dbconfig/20240827-215958-ladsgroup.json
[22:00:06] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[22:01:27] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:01:33] <wikibugs>	 (PS1) GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468)
[22:01:37] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:01:55] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2046.codfw.wmnet with OS bullseye
[22:01:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:02:06] <wikibugs>	 SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098199 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik...
[22:02:28] <wikibugs>	 (CR) CI reject: [V:-1] Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: GergesShamon)
[22:02:36] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: GergesShamon)
[22:04:48] <swfrench-wmf>	 !log Running homer 'lsw1-a8-codfw*' commit 'T372878'
[22:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:52] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[22:06:35] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2046.codfw.wmnet
[22:06:35] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2046.codfw.wmnet
[22:07:09] <swfrench-wmf>	 !log pooled / uncordoned wikikube-worker2046.codfw.wmnet - T372878
[22:07:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:58] <wikibugs>	 ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373491 (Scott_French) NEW
[22:09:41] <wikibugs>	 ops-magru, SRE: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371618#10098219 (RobH) Open→Declined Dupe of T371554, issue being tracked there
[22:11:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P67991 and previous config saved to /var/cache/conftool/dbconfig/20240827-221506-ladsgroup.json
[22:15:19] <swfrench-wmf>	 !log running homer 'cr*codfw*' commit 'T372878' 
[22:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:23] <stashbot>	 T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[22:15:48] <wikibugs>	 (PS2) GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468)
[22:20:33] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 457, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:25:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux)
[22:27:21] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 539, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:30:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P67992 and previous config saved to /var/cache/conftool/dbconfig/20240827-223013-ladsgroup.json
[22:41:51] <wikibugs>	 (PS1) Scott French: sre.hosts.move-vlan: use name property in runtime_description [cookbooks] - https://gerrit.wikimedia.org/r/1067440
[22:45:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67993 and previous config saved to /var/cache/conftool/dbconfig/20240827-224520-ladsgroup.json
[22:45:23] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:45:25] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[22:45:36] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:45:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T370903)', diff saved to https://phabricator.wikimedia.org/P67994 and previous config saved to /var/cache/conftool/dbconfig/20240827-224542-ladsgroup.json
[22:53:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T370903)', diff saved to https://phabricator.wikimedia.org/P67995 and previous config saved to /var/cache/conftool/dbconfig/20240827-225332-ladsgroup.json
[22:53:37] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:08:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P67996 and previous config saved to /var/cache/conftool/dbconfig/20240827-230839-ladsgroup.json
[23:23:31] <wikibugs>	 (CR) Bartosz Dziewoński: ""Audit" is a big word, I was just trying to comprehend it and I tried to simplify some parts that defied comprehension. I didn't like this" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[23:23:47] <wikibugs>	 (Abandoned) Bartosz Dziewoński: wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[23:23:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P67997 and previous config saved to /var/cache/conftool/dbconfig/20240827-232346-ladsgroup.json
[23:26:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T371742)', diff saved to https://phabricator.wikimedia.org/P67998 and previous config saved to /var/cache/conftool/dbconfig/20240827-232653-ladsgroup.json
[23:26:57] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[23:27:34] <wikibugs>	 (CR) Andrea Denisse: [C:+2] alert: Update alertmanager tests hostnames [puppet] - https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[23:38:18] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:38:44] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1067450
[23:38:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1067450 (owner: TrainBranchBot)
[23:38:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T370903)', diff saved to https://phabricator.wikimedia.org/P67999 and previous config saved to /var/cache/conftool/dbconfig/20240827-233854-ladsgroup.json
[23:38:56] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance
[23:38:58] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:39:09] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance
[23:41:35] <icinga-wm>	 PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[23:42:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P68000 and previous config saved to /var/cache/conftool/dbconfig/20240827-234200-ladsgroup.json
[23:42:01] <icinga-wm>	 RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.31 ms
[23:46:34] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance
[23:46:47] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance
[23:54:07] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance
[23:54:20] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance
[23:54:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T370903)', diff saved to https://phabricator.wikimedia.org/P68001 and previous config saved to /var/cache/conftool/dbconfig/20240827-235426-ladsgroup.json
[23:54:31] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:57:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P68002 and previous config saved to /var/cache/conftool/dbconfig/20240827-235707-ladsgroup.json
[23:59:27] <swfrench-wmf>	 FYI, I'm looking into those KubernetesCalicoDown alerts. these are a little surprising, as they correspond to the old names of two nodes in various stages of rename/reimage (one of which ostensibly finished).
[23:59:41] <jinxer-wm>	 FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections