[00:04:02] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:04:52] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:07:00] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:10:14] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [00:10:30] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:11:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1066946 (owner: 10TrainBranchBot) [00:12:06] (03Abandoned) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) (owner: 10Jdlrobson) [00:20:25] (03PS3) 10Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) [00:20:34] (03PS3) 10Jdlrobson: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) [00:20:46] (03PS3) 10Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 [00:21:53] (03PS4) 10Jdlrobson: Preserve existing responsive skin behaviour for 
community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 [00:25:42] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [00:26:13] (03PS1) 10Jasmine_: admin: adding jasmine to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1066951 [00:29:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T371742)', diff saved to https://phabricator.wikimedia.org/P67852 and previous config saved to /var/cache/conftool/dbconfig/20240827-002944-ladsgroup.json [00:29:48] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:29:51] (03CR) 10RLazarus: [C:03+2] admin: adding jasmine to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1066951 (owner: 10Jasmine_) [00:39:27] !log dduvall@deploy1003 Started deploy [releng/jenkins-deploy@663c843] (releasing): (no justification provided) [00:40:08] !log dduvall@deploy1003 Finished deploy [releng/jenkins-deploy@663c843] (releasing): (no justification provided) (duration: 00m 40s) [00:42:30] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns3004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [00:44:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P67853 and previous config saved to /var/cache/conftool/dbconfig/20240827-004451-ladsgroup.json [00:59:16] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [00:59:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P67854 and previous config saved to /var/cache/conftool/dbconfig/20240827-005958-ladsgroup.json [01:14:44] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns4004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [01:15:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T371742)', diff saved to https://phabricator.wikimedia.org/P67855 and previous config saved to /var/cache/conftool/dbconfig/20240827-011505-ladsgroup.json [01:15:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance [01:15:10] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:15:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1241.eqiad.wmnet with reason: Maintenance [01:15:26] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:15:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T371742)', diff saved to https://phabricator.wikimedia.org/P67856 and previous config saved to /var/cache/conftool/dbconfig/20240827-011527-ladsgroup.json [01:15:48] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:30:14] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [01:45:44] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [01:49:58] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:50:30] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Idle - NTT, AS2914/IPv6: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0200) [02:01:10] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [02:17:58] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [02:18:10] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:18:40] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:23:08] !log Import corto 0.3-1 into bookworm-wikimedia apt archive [02:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:36] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Connect - NTT, AS2914/IPv6: Connect - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:33:26] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns7001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [02:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:38:06] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:38:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:48:56] RECOVERY - Check if ntp.service has been restarted 
after /etc/ntp.conf was changed on dns7002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [02:49:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox [02:57:57] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0300) [03:03:40] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:08:50] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:29:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T371742)', diff saved to https://phabricator.wikimedia.org/P67857 and previous config saved to /var/cache/conftool/dbconfig/20240827-032902-ladsgroup.json [03:29:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:30:38] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:30:52] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv6: Idle - NTT, AS2914/IPv4: 
Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:37:02] RECOVERY - Disk space on restbase2021 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2021&var-datasource=codfw+prometheus/ops [03:44:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P67858 and previous config saved to /var/cache/conftool/dbconfig/20240827-034409-ladsgroup.json [03:52:48] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:53:04] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 112, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:59:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P67859 and previous config saved to /var/cache/conftool/dbconfig/20240827-035916-ladsgroup.json [03:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0400) [04:01:38] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.17 (duration: 01m 28s) [04:14:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T371742)', diff saved to https://phabricator.wikimedia.org/P67860 and previous config saved to 
/var/cache/conftool/dbconfig/20240827-041424-ladsgroup.json [04:14:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [04:14:28] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:14:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1242.eqiad.wmnet with reason: Maintenance [04:14:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T371742)', diff saved to https://phabricator.wikimedia.org/P67861 and previous config saved to /var/cache/conftool/dbconfig/20240827-041446-ladsgroup.json [05:18:11] (03PS1) 10Marostegui: Revert "mariadb: Add db2232 to test-s4" [puppet] - 10https://gerrit.wikimedia.org/r/1067158 [05:33:55] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@0b23c91]: (no justification provided) [05:34:14] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@0b23c91]: (no justification provided) (duration: 00m 18s) [05:39:44] (03PS2) 10KartikMistry: Section Translation: Fix some language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064696 [05:40:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064696 (owner: 10KartikMistry) [05:48:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0600) [06:00:05] marostegui, Amir1, and arnaudb: May I have your 
attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0600) [06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:12:30] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:30] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:52] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:23:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T371742)', diff saved to https://phabricator.wikimedia.org/P67862 and previous config saved to /var/cache/conftool/dbconfig/20240827-062302-ladsgroup.json [06:23:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:31:01] (03PS1) 10Ammarpad: Add throttle rule 
for Wikimedia Hausa edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414) [06:36:24] (03CR) 10Slyngshede: [C:03+2] Fix incomplete table.vertical styles causing broken layout [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński) [06:36:47] (03CR) 10Slyngshede: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński) [06:38:06] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P67863 and previous config saved to /var/cache/conftool/dbconfig/20240827-063809-ladsgroup.json [06:39:13] (03CR) 10Slyngshede: [C:03+2] Fix incomplete table.vertical styles causing broken layout [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński) [06:41:19] (03Merged) 10jenkins-bot: Fix incomplete table.vertical styles causing broken layout [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 (owner: 10Bartosz Dziewoński) [06:53:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P67864 and previous config saved to /var/cache/conftool/dbconfig/20240827-065316-ladsgroup.json [06:58:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability, 13Patch-For-Review: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10094807 (10ayounsi) I manually added `--collector.drbd` to /etc/default/prometheus-node-exporter on one of the Routed Ganeti exporter Thi... 
[07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0700). [07:00:05] kart_ and Ammar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:02:24] here [07:02:30] I'll start with my patch. [07:03:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064696 (owner: 10KartikMistry) [07:04:01] (03Merged) 10jenkins-bot: Section Translation: Fix some language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064696 (owner: 10KartikMistry) [07:04:14] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1064696|Section Translation: Fix some language codes]] [07:06:15] !log kartik@deploy1003 kartik: Backport for [[gerrit:1064696|Section Translation: Fix some language codes]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:07:29] (03PS1) 10David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 [07:07:55] (03CR) 10CI reject: [V:04-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (owner: 10David Caro) [07:07:56] !log kartik@deploy1003 kartik: Continuing with sync [07:08:22] (03PS2) 10David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) [07:08:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T371742)', diff saved to https://phabricator.wikimedia.org/P67865 and previous config saved to /var/cache/conftool/dbconfig/20240827-070823-ladsgroup.json [07:08:26] !log 
ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [07:08:28] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:08:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1243.eqiad.wmnet with reason: Maintenance [07:08:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T371742)', diff saved to https://phabricator.wikimedia.org/P67866 and previous config saved to /var/cache/conftool/dbconfig/20240827-070845-ladsgroup.json [07:08:48] (03CR) 10CI reject: [V:04-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [07:08:55] (03PS3) 10David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) [07:11:24] (03PS1) 10KartikMistry: Update cxserver to 2024-08-27-045705-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067221 (https://phabricator.wikimedia.org/T369815) [07:11:36] (03CR) 10Ayounsi: [C:03+2] "To close the loop from our IRC chat:" [puppet] - 10https://gerrit.wikimedia.org/r/1066799 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [07:11:39] (03CR) 10CI reject: [V:04-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [07:12:24] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064696|Section Translation: Fix some language codes]] (duration: 08m 09s) [07:13:22] (03CR) 10Jelto: prometheus: create text file export for nft throttling denylist length (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064823 
(https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [07:15:35] (03PS1) 10Jelto: prometheus: fix nftables_throttling exporter variable [puppet] - 10https://gerrit.wikimedia.org/r/1067222 (https://phabricator.wikimedia.org/T373136) [07:18:37] Ammar: I'm done with my patch. [07:20:39] (03PS4) 10David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) [07:20:50] (03CR) 10Marostegui: [C:03+2] Revert "mariadb: Add db2232 to test-s4" [puppet] - 10https://gerrit.wikimedia.org/r/1067158 (owner: 10Marostegui) [07:21:08] (03CR) 10CI reject: [V:04-1] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [07:22:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm [07:23:13] (03PS1) 10Marostegui: Revert "test-s4: Add two new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1067226 [07:23:14] kart_: OK [07:24:41] (03PS2) 10Marostegui: Revert "test-s4: Add two new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1067226 [07:25:35] (03CR) 10Marostegui: [C:03+2] Revert "test-s4: Add two new hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1067226 (owner: 10Marostegui) [07:26:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS bookworm [07:26:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS bookworm [07:28:58] (03PS5) 10David Caro: p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) [07:30:15] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3748/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [07:39:25] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:20] 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417 (10Marostegui) 03NEW [07:45:33] 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10094868 (10Marostegui) @Papaul can this be related to the 10G? [07:45:57] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [07:47:08] 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10094869 (10ABran-WMF) 05Open→03In progress p:05Triage→03Medium [07:49:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [07:50:22] !log ack probedown for puppetmaster:8181 - T373369 [07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:26] T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4) - https://phabricator.wikimedia.org/T373369 [07:50:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:49] urbanecm: are you available for the morning backport? [07:51:18] Ammar: hey, no one is still deploying? 
:/ [07:51:20] i can take a look [07:51:53] (03PS2) 10Ammarpad: Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414) [07:51:58] (03CR) 10Urbanecm: [C:03+2] Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414) (owner: 10Ammarpad) [07:52:48] (03Merged) 10jenkins-bot: Add throttle rule for Wikimedia Hausa edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067191 (https://phabricator.wikimedia.org/T373414) (owner: 10Ammarpad) [07:52:50] Ammar: that should've been requested earlier. implementing throttle rule less than 72 hours in advance is a bit more complicated on my end (I have to keep in mind additional aspects). i'll deploy it, but i'd appreciate if future requests could be scheduled a little bit earlier. thanks! [07:53:03] (03CR) 10Jelto: [C:03+2] prometheus: fix nftables_throttling exporter variable [puppet] - 10https://gerrit.wikimedia.org/r/1067222 (https://phabricator.wikimedia.org/T373136) (owner: 10Jelto) [07:53:20] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1067191|Add throttle rule for Wikimedia Hausa edit-a-thon (T373414)]] [07:53:24] T373414: Requesting temporary lift of IP cap for Wikimedia Hausa edit-a-thon - https://phabricator.wikimedia.org/T373414 [07:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:00:03] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067191|Add throttle rule for Wikimedia Hausa edit-a-thon (T373414)]] (duration: 06m 42s) [08:00:04] hashar and andre: Deploy window MediaWiki train - Utc-0 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0800) [08:00:13] T373414: Requesting temporary lift of IP cap for Wikimedia Hausa edit-a-thon - https://phabricator.wikimedia.org/T373414 [08:01:27] !log Clear throttle for 105.113.127.170 via resetAuthenticationThrottle.php (T373414) [08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:11] urbanecm: Thank you [08:03:12] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:03:58] jouncebot: now and next [08:03:58] For the next 1 hour(s) and 56 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T0800) [08:04:14] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:10:47] I am going to run the MediaWiki train [08:12:37] well actually no cause there is a blocker [08:14:44] (03PS1) 10Joely Rooke WMDE: Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) [08:14:49] ah it got fixed [08:14:51] good cscott :) [08:15:26] (03CR) 10CI reject: [V:04-1] Activate feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [08:15:32] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067283 (https://phabricator.wikimedia.org/T366965) [08:15:33] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067283 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [08:15:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:16] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067283 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [08:17:38] /srv/mediawiki-staging/php-1.43.0-wmf.20/.gitmodules does not exist. Did the train branch commit get merged? [08:17:39] ... 
[08:18:18] (03CR) 10David Caro: [V:03+1 C:03+2] p:m:toolforge::prometheus: drop the heaviest unused series [puppet] - 10https://gerrit.wikimedia.org/r/1067220 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [08:18:19] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2232.codfw.wmnet with OS bookworm [08:18:26] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2231.codfw.wmnet with OS bookworm [08:18:30] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2230.codfw.wmnet with OS bookworm [08:18:56] I guess I need a double espresso [08:18:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm [08:19:25] every day is a double espresso day here [08:19:44] yeah that is stressful [08:20:02] PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:20:42] the branch did not get cut this morning due to some failure [08:20:48] hashar: the train branch cut job failed, it looks like a transient gerrit error: https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/243/console [08:21:03] I think just rerunning that job should do the trick [08:21:31] but why would Gerrit fail? :D [08:21:44] Output: fatal: could not read Username for 'https://gerrit.wikimedia.org': No such device or address [08:21:45] fun [08:22:03] (03PS2) 10Slyngshede: 2FA: Use username as foreign key to security token table. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1065166 [08:22:04] that I dunno :) [08:22:10] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:17] that is when doing the git push [08:24:14] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:25:41] 10:25:27 Warning: Branch wmf/1.43.0-wmf.20 already exists in repository mediawiki/core [08:26:01] I guess the job being reentrant and reusing the existing repo is good [08:26:54] 10ops-codfw, 06DBA, 06DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10094942 (10Marostegui) So I can confirm I've seen db2232 booting up... and seems to get an IP from PXE: ` CLIENT MAC ADDR: 04 32 01 DB D0 C0 GUID: 4C4C4544-004E-3010-8048-B9C04F4B3434 CLIENT... 
[08:27:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:10] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:24] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:34] (03CR) 10Btullis: [C:03+2] analytics.wikimedia.org: improve caching and redirects [puppet] - 10https://gerrit.wikimedia.org/r/1057223 (owner: 10Milimetric) [08:29:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db1161 - T373328', diff saved to https://phabricator.wikimedia.org/P67867 and previous config saved to /var/cache/conftool/dbconfig/20240827-082923-arnaudb.json [08:29:28] T373328: upgrade db1161 to MariaDB 10.6.19 - https://phabricator.wikimedia.org/T373328 [08:31:07] hmm [08:32:06] the release jenkins got restarted over night [08:32:07] at 00:40 [08:32:10] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:18] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:34:18] my guess is 
something got upgraded / changed and that broke the job [08:35:10] (03PS2) 10Joely Rooke WMDE: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) [08:35:42] (03PS3) 10Joely Rooke WMDE: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) [08:36:05] how do I search a commit in gitlab? :/ [08:36:28] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:36:53] jnuche: I am pretty sure the issue is https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/commit/663c84371a60e1232a501e627c266899d4f5298f :) [08:36:53] hum, yeah, the Jenkins service was restarted, what on earth? [08:37:04] that is the sole change I could find [08:37:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1161.eqiad.wmnet with reason: db1161 upgrade [08:37:12] that got redeployed (which restarted the jenkins service) [08:37:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1161.eqiad.wmnet with reason: db1161 upgrade [08:37:26] and my guess is that whatever version of git / curl we have on releases1003 does not support that NETRC [08:39:14] yeah, it seems that change is the problem [08:39:24] let's roll it back [08:39:35] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1161.eqiad.wmnet [08:39:56] hum, wait, they created a different job too [08:40:53] in which repo is that? 
[08:41:06] https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/68 [08:41:12] it's a different MR though [08:41:14] should be fine [08:41:43] give me a min and I'll create the revert [08:41:47] and that did not get merged [08:42:10] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:42:15] what I don't get is that the code seems to create a file named `netrc_file` [08:42:20] PROBLEM - MariaDB Replica IO: s5 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1161.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1161.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:42:24] and the commit message refers to an environment variable `NETRC` [08:42:32] so I guess that got mixed up [08:42:45] or `file()` should have been changed to something like `env()` [08:43:31] and I have no clue what "netrc_file" would be :) [08:43:45] Replica IO: s5 on db1154 → I'm the noise source, downtiming [08:44:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1154.eqiad.wmnet with reason: upgrading db1161 [08:44:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1154.eqiad.wmnet with reason: upgrading db1161 [08:44:53] ah no the first parameter to `file()` is indeed the name of the environment variable [08:45:15] the `netrc_file` is injected via a secret, so it already exists when the job tries to access it [08:45:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on an-redacteddb1001.eqiad.wmnet with reason: upgrading db1161 [08:45:40] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-redacteddb1001.eqiad.wmnet with reason: upgrading db1161 [08:45:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1161.eqiad.wmnet [08:46:43] then if I look at https://gitlab.wikimedia.org/repos/releng/release.git it has: [08:46:43] netrc_file = os.getenv("netrc_file") [08:46:43] if netrc_file: [08:46:43] os.symlink(netrc_file, os.path.join(netrc_dir, ".netrc")) [08:47:10] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:20] RECOVERY - MariaDB Replica IO: s5 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:47:23] 03:00:12 netrc_file environment variable not set. Will not be able to push the branch cut commit [08:47:23] 03:00:12 Branching mediawiki version 1.43.0-wmf.20 (T366965) [08:47:24] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [08:47:29] which really should be a fatal / standout [08:48:02] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:48:11] so my guess is https://gitlab.wikimedia.org/repos/releng/release has a pending merge requests for that [08:49:06] and it does not [08:49:07] fun [08:49:09] hashar: MR ready https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/69 [08:49:38] that commit message refers to https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/244/console [08:49:42] which will disappear eventually [08:50:09] I've also added the relevant job output to the MR in a comment [08:52:03] I will rephrase it ;) [08:52:10] !log arnaudb@cumin1002 START - Cookbook 
sre.hosts.downtime for 0:30:00 on db1161.eqiad.wmnet with reason: db1161 upgrade [08:52:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db1161.eqiad.wmnet with reason: db1161 upgrade [08:53:26] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:54:14] PROBLEM - MariaDB Replica Lag: s6 on db2114 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 55796.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:54:38] PROBLEM - MariaDB Replica SQL: s6 on db2114 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_actiontext: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:44] PROBLEM - MariaDB Replica Lag: s6 on db2124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 55885.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:46] PROBLEM - MariaDB Replica SQL: s6 on db2124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_actiontext: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:55:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2124', diff saved to https://phabricator.wikimedia.org/P67868 and previous config saved to /var/cache/conftool/dbconfig/20240827-085551-arnaudb.json [08:56:11] jnuche: https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/69/diffs?commit_id=8d2d8fec08223ec68f052ad06078e366ddc1a28e :) [08:56:25] hashar: approve please? 
:) [08:56:36] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:56:49] yeah I am looking for the +2 message :D [08:57:01] oh I have hidden it [08:57:26] I like to inline the full context in the commit messages [08:57:40] since in a few years for sure the resource pointed to by the URL will have vanished [08:58:14] 👍 [08:58:18] I'm going to deploy the change now [09:00:15] thank you! [09:00:24] (03PS1) 10Marostegui: mariadb: Decommission db2114 [puppet] - 10https://gerrit.wikimedia.org/r/1067296 (https://phabricator.wikimedia.org/T362948) [09:00:30] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:00:46] !log tappof@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on P{O:logging::opensearch::collector and logstash*.codfw.wmnet} and (A:logstash-collector) [09:00:50] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@8d2d8fe] (releasing): (no justification provided) [09:01:16] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2232.codfw.wmnet with OS bookworm [09:01:39] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@8d2d8fe] (releasing): (no justification provided) (duration: 00m 48s) [09:02:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2114.codfw.wmnet [09:02:46] config looks good again, gonna rerun the job [09:04:38] !log tappof@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::collector and logstash*.codfw.wmnet} and (A:logstash-collector) [09:07:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:07:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability, 13Patch-For-Review: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10095021 (10ayounsi) Draft dashboard: https://grafana.wikimedia.org/d/f_tZtVlMz/drbd I think we should be good to deploy it to all of the... [09:07:58] I have left a note on dduvall original commit at https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/commit/663c84371a60e1232a501e627c266899d4f5298f [09:08:03] * hashar grabs a coffee [09:08:09] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [09:08:36] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:10:48] RECOVERY - MariaDB Replica SQL: s6 on db2124 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:10:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [09:11:16] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2114.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:12:07] hmm [09:12:12] RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:12:22] ! 
[remote rejected] HEAD -> refs/for/wmf/1.43.0-wmf.20 (implicit merges detected) [09:12:22] lol [09:13:22] that is actually good [09:13:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2114.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [09:13:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:13:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2114.codfw.wmnet [09:13:48] PROBLEM - MariaDB Replica SQL: s6 on db2124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_only_for_read_old: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:13:54] I was actually about to ask about that [09:14:16] never seen that thing about implicit merges [09:14:18] it lists 6 commits [09:14:23] which are changes that got merged during the night [09:14:34] AFTER the branch cut ran the first time [09:14:58] those are merged in master [09:15:07] right, some of the branches were successfully cut last night... 
[09:15:22] PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:15:43] on the releases hosts it does: [09:15:44] 11:03:21 Branching mediawiki/core to wmf/1.43.0-wmf.20 from HEAD [09:15:44] 11:03:21 Warning: Branch wmf/1.43.0-wmf.20 already exists in repository mediawiki/core [09:16:07] so my guess is on the host the branch has been updated to whatever master is at and that includes those six commits [09:16:23] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2114.codfw.wmnet - https://phabricator.wikimedia.org/T362948#10095055 (10Marostegui) a:05Marostegui→03None Ready for DC-Ops [09:16:29] when pushing that back to Gerrit it complains cause it hasn't seen those commits being proposed as changes to the wmf/1.43.0-wmf.20 branch [09:16:33] that bypasses review [09:16:36] and it complains [09:16:49] so [09:17:17] 1) the branch cut job should not attempt to refresh / update the branch when it is already existing [09:17:18] OR [09:17:26] 2) we backport all six patches (that sounds overkill) [09:17:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:05] 2bis) I manually push the update [09:18:33] mmmh, the thing is the job also has a different mode of execution that reuses the same branch all the time (precut_branch or something) [09:18:42] changing the update behavior probably would affect that [09:18:56] for the time being, could we fix manually? 
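[editor's note] Option 1 above (do not refresh a branch that already exists) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual branch cut code: the function names, the `refresh_existing` flag (standing in for the separate mode that reuses the same branch every time) and the plain lists of commit ids are all hypothetical.

```python
from typing import List, Optional


def choose_start_point(
    remote_branch_tip: Optional[str],
    master_head: str,
    refresh_existing: bool = False,
) -> str:
    """Pick the commit a branch cut commit should be based on.

    remote_branch_tip: tip of the wmf branch on the remote, or None if
        the branch has not been created yet.
    master_head: current tip of master.
    refresh_existing: hypothetical opt-in for a mode that deliberately
        re-points an already existing branch at master.
    """
    if remote_branch_tip is None or refresh_existing:
        # Fresh cut, or an explicit refresh: start from master.
        return master_head
    # The branch already exists: build on what the remote has, so the
    # push carries no commits the remote branch has never seen.
    return remote_branch_tip


def unseen_commits(local_history: List[str], remote_history: List[str]) -> List[str]:
    """Return commits in the local branch that the remote branch lacks, in order."""
    known = set(remote_history)
    return [commit for commit in local_history if commit not in known]
```

A job could call `unseen_commits()` before pushing and stop with a clear error if it returns more than the single new branch commit, rather than leaving Gerrit to reject the push.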
[09:20:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T371742)', diff saved to https://phabricator.wikimedia.org/P67870 and previous config saved to /var/cache/conftool/dbconfig/20240827-092005-ladsgroup.json [09:20:10] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:20:14] RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:22:21] oh [09:22:24] that is that python code [09:22:28] NOoo [09:23:48] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10095063 (10MatthewVernon) Please go ahead! [sorry, I missed this on Friday, and then yesterday was a public holiday] [09:24:48] RECOVERY - MariaDB Replica SQL: s6 on db2124 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:24:50] jnuche: I have updated the wmf branch to current master [09:25:32] !log train: fast forwarded mediawiki/core wmf/1.43.0-wmf.20 from 1faf18d6570 to ef87455d7c3 # T366965 [09:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:36] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [09:25:41] thx, let's try again then :) [09:25:50] rebuilding [09:25:51] :) [09:26:04] ah you beat me to it [09:26:21] yeah sorry [09:26:22] ! [09:27:14] now let's hope none of the extensions/skins repos got updates overnight... 
[09:27:40] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:48] PROBLEM - MariaDB Replica SQL: s6 on db2124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Cant DROP COLUMN cuc_private: check that it exists on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:29:14] oh [09:29:15] true [09:29:21] well they did for sure :/ [09:31:47] RECOVERY - MariaDB Replica SQL: s6 on db2124 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:32:27] (03Abandoned) 10Gmodena: EventStreamConfig: Add webrequest.frontend.v1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [09:32:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2124.codfw.wmnet with reason: db2124 fix [09:32:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2124.codfw.wmnet with reason: db2124 fix [09:33:29] (03CR) 10Gmodena: [C:03+2] EventStreamConfig: remove webrequest_frontend. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena) [09:33:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.20 [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067300 (https://phabricator.wikimedia.org/T366965) [09:33:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.20 [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067300 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [09:34:10] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Cephadm doesn't find the correct image to run a shell - https://phabricator.wikimedia.org/T373185#10095084 (10MatthewVernon) 05Open→03Resolved Clusters upgraded to new image, and lo: ` mvernon@moss-be2001:~$ sudo cephadm shell Inferring fsid 59ea825c-2a... [09:34:12] (03Merged) 10jenkins-bot: EventStreamConfig: remove webrequest_frontend. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062679 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena) [09:35:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P67871 and previous config saved to /var/cache/conftool/dbconfig/20240827-093512-ladsgroup.json [09:36:51] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2017.codfw.wmnet [09:37:24] it managed to create the change request :) [09:37:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2017.codfw.wmnet [09:37:42] (03PS1) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 [09:37:43] I need to step away from the desk to prepare lunch, I'll be still checking my messages though [09:38:06] (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) 
[09:38:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2017.codfw.wmnet with OS bullseye [09:38:42] jnuche: Waiting up to 3600 seconds for https://gerrit.wikimedia.org/r/c/1067300 to merge [09:38:47] that is a good sign I guess :) [09:38:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [09:39:10] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:19] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2028.codfw.wmnet [09:39:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2028.codfw.wmnet [09:40:03] fun thing dancy indented the python script with THREE SPACES :) [09:40:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [09:40:10] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:41:46] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission db2114 [puppet] - 10https://gerrit.wikimedia.org/r/1067296 (https://phabricator.wikimedia.org/T362948) (owner: 10Marostegui) [09:42:10] (03PS1) 10Ayounsi: Ganeti prod: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) [09:42:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [09:43:20] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2017 - cgoubert@cumin1002" [09:43:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2017 - cgoubert@cumin1002" [09:43:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:43:24] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2017.codfw.wmnet 76.0.192.10.in-addr.arpa 6.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:43:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2017.codfw.wmnet 76.0.192.10.in-addr.arpa 6.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:43:28] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2017 [09:44:45] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [09:44:57] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:44:57] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2028.codfw.wmnet with OS bullseye [09:45:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2017 [09:45:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [09:45:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095118 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host 
w... [09:45:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [09:45:26] (03CR) 10Ayounsi: [C:03+2] Ganeti prod: enable drbd prometheus collector [puppet] - 10https://gerrit.wikimedia.org/r/1067302 (https://phabricator.wikimedia.org/T299560) (owner: 10Ayounsi) [09:45:31] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:47:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:47:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:48:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:10] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:35] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2028 - cgoubert@cumin1002" [09:49:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2028 - cgoubert@cumin1002" [09:49:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:40] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2028.codfw.wmnet 
178.0.192.10.in-addr.arpa 8.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:49:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2028.codfw.wmnet 178.0.192.10.in-addr.arpa 8.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:49:43] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2028 [09:50:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2028 [09:50:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [09:50:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P67872 and previous config saved to /var/cache/conftool/dbconfig/20240827-095019-ladsgroup.json [09:50:51] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2018.codfw.wmnet [09:51:11] (03PS2) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 [09:51:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2018.codfw.wmnet [09:51:34] (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [09:52:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2018.codfw.wmnet with OS bullseye [09:52:27] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095141 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[09:53:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [09:53:47] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [09:55:39] (03PS3) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 [09:56:01] wmf-quibble-core-vendor-mysql-php74 | ███████▒▒▒ 77% | ETA: 379s [09:56:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 1%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67873 and previous config saved to /var/cache/conftool/dbconfig/20240827-095627-arnaudb.json [09:56:50] (03PS3) 10Tiziano Fogli: curator: free up space to safely restart daemons [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) [09:56:57] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3751/console" [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [09:58:33] (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [10:00:04] (03PS4) 10David Caro: spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1000) [10:00:43] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2018 - cgoubert@cumin1002" [10:00:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2018 - cgoubert@cumin1002" [10:00:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:00:48] !log 
cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2018.codfw.wmnet 95.0.192.10.in-addr.arpa 5.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:00:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2018.codfw.wmnet 95.0.192.10.in-addr.arpa 5.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:00:52] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2018 [10:01:16] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [10:01:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage [10:01:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2018 [10:01:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [10:02:20] (03PS2) 10Dbrant: Turn account vanishing contact form into a redirect. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) [10:02:27] (03PS1) 10Zabe: Revert apparent fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067305 (https://phabricator.wikimedia.org/T368712) [10:02:59] (03CR) 10CI reject: [V:04-1] spicerack: allow running by non-ops [puppet] - 10https://gerrit.wikimedia.org/r/1067301 (owner: 10David Caro) [10:03:05] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.20 [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067300 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [10:04:20] (03CR) 10Btullis: [C:03+2] ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [10:04:55] jnuche: for the implicit merge being rejected, the Gerrit doc is at https://gerrit.wikimedia.org/r/Documentation/config-project-config.html#receive.rejectImplicitMerges ): [10:04:55] :) [10:05:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage [10:05:24] I am resuming the train [10:05:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T371742)', diff saved to https://phabricator.wikimedia.org/P67874 and previous config saved to /var/cache/conftool/dbconfig/20240827-100527-ladsgroup.json [10:05:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [10:05:31] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:05:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1244.eqiad.wmnet with reason: Maintenance [10:05:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1244 
(T371742)', diff saved to https://phabricator.wikimedia.org/P67875 and previous config saved to /var/cache/conftool/dbconfig/20240827-100548-ladsgroup.json [10:06:52] !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20 refs T366965 [10:06:54] !log hashar@deploy1003 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.43.0-wmf.20" --no-progress --store-class=LCStoreCDB --threads=22 --lang en --quiet ' returned non-zero exit status 1. (duration: 00m 02s) [10:06:55] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [10:07:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage [10:07:26] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:07:57] (03Merged) 10jenkins-bot: ceph-csi-rbd: add digest to image tag, ensuring the image immutability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064761 (https://phabricator.wikimedia.org/T373000) (owner: 10Brouberol) [10:07:59] RuntimeException from line 88 of /srv/mediawiki-staging/php-1.43.0-wmf.20/includes/language/LCStoreCDB.php: Unable to create the localisation store directory "/srv/mediawiki-staging/php-1.43.0-wmf.20/cache/l10n" [10:08:00] fun [10:09:12] the parent `cache` belongs to mwpresync:deployment [10:09:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:09:43] but the cache is rebuilt as www-data [10:09:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage [10:10:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:10:49] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:11:12] (03PS1) 10David Caro: toolforge:prometheus: only kyverno controllers expose stats [puppet] - 10https://gerrit.wikimedia.org/r/1067307 (https://phabricator.wikimedia.org/T370143) [10:11:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2030.codfw.wmnet [10:11:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 2%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67876 and previous config saved to /var/cache/conftool/dbconfig/20240827-101132-arnaudb.json [10:11:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2030.codfw.wmnet [10:12:41] (03PS1) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) [10:13:00] !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20 refs T366965 [10:13:00] !log hashar@deploy1003 scap failed: PermissionError [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.43.0-wmf.20/cache/gitinfo' (duration: 00m 00s) [10:13:03] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [10:13:05] (03CR) 10CI reject: [V:04-1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [10:13:23] pff [10:13:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2124.codfw.wmnet with reason: replag [10:13:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2124.codfw.wmnet with reason: replag [10:14:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2030.codfw.wmnet with OS bullseye [10:14:29] 
06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095259 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [10:14:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [10:14:51] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:15:52] RECOVERY - MariaDB Replica Lag: s6 on db2124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:16:12] (03PS2) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) [10:16:36] (03CR) 10CI reject: [V:04-1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [10:16:54] !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20 refs T366965 [10:16:57] !log hashar@deploy1003 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.43.0-wmf.20" --no-progress --store-class=LCStoreCDB --threads=22 --lang en --quiet ' returned non-zero exit status 1. 
(duration: 00m 02s) [10:17:14] (03CR) 10Filippo Giunchedi: [C:03+1] curator: free up space to safely restart daemons [puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: 10Tiziano Fogli) [10:17:15] * hashar files a bug [10:17:21] train is blocked [10:17:37] (03PS1) 10David Caro: toolforge:prometheus: drop metrics as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1067309 (https://phabricator.wikimedia.org/T370143) [10:17:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage [10:18:43] (03PS3) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) [10:19:00] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2030 - cgoubert@cumin1002" [10:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2030 - cgoubert@cumin1002" [10:19:05] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:19:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2030.codfw.wmnet 177.0.192.10.in-addr.arpa 7.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:19:07] (03CR) 10CI reject: [V:04-1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [10:19:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2030.codfw.wmnet 177.0.192.10.in-addr.arpa 7.7.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:19:09] !log 
cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2030 [10:19:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2030 [10:19:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [10:20:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2018.codfw.wmnet with reason: host reimage [10:21:05] (03PS4) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) [10:23:34] (03CR) 10Btullis: [V:03+1 C:03+2] Add a matomo_plugins component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [10:24:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2017.codfw.wmnet with OS bullseye [10:24:58] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095281 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[10:26:37] !log homer 'cr*codfw*' commit 'T372878' [10:26:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 4%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67877 and previous config saved to /var/cache/conftool/dbconfig/20240827-102638-arnaudb.json [10:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:41] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [10:27:29] (03PS5) 10AOkoth: vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) [10:28:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67878 and previous config saved to /var/cache/conftool/dbconfig/20240827-102827-arnaudb.json [10:29:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2028.codfw.wmnet with OS bullseye [10:29:56] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[10:32:53] (03PS1) 10JMeybohm: Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311 [10:33:03] (03PS2) 10JMeybohm: Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311 [10:33:04] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 465, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:33:12] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067311 (owner: 10JMeybohm) [10:33:29] (03CR) 10CI reject: [V:04-1] Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311 (owner: 10JMeybohm) [10:33:37] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1067308/3755/" [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [10:33:37] !log hashar@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.20 refs T366965 [10:33:42] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [10:33:44] I went with `sudo -u mwpresync chmod o+w /srv/mediawiki-staging/php-1.43.0-wmf.20/cache/` [10:34:38] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszabo - https://phabricator.wikimedia.org/T373426 (10mszabo) 03NEW [10:36:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage [10:37:30] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 
6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095341 (10Dreamy_Jazz) [10:37:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for mszabo - https://phabricator.wikimedia.org/T373426#10095342 (10mszabo) [10:37:49] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095357 (10Dreamy_Jazz) The logs for the `daily-updates` container h... [10:38:31] hashar: why are half the files in /srv/mediawiki-staging/php-1.43.0-wmf.20/ owned by your user ? [10:39:01] Hey folks! Would it be possible to add & deploy an IP exception to wmf-config/throttle.php rather quickly? [10:39:20] I just got an email from someone holding an event with 100 people, and many of them are being throttled [10:39:27] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095359 (10kostajh) The container is still running, though: ` [khar... [10:39:35] that's not the case for the previous version where everything belongs to mwpresync:deployment [10:40:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage [10:40:38] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095361 (10kostajh) I guess we need to stop the container so that a... [10:40:51] Jhs: is there a task? 
[10:40:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2018.codfw.wmnet with OS bullseye [10:41:17] kostajh, not yet. I'm trying to find out the IP they need unthrottled, and the projects that needs to happen in [10:41:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [10:41:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 6%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67879 and previous config saved to /var/cache/conftool/dbconfig/20240827-104143-arnaudb.json [10:42:12] kostajh, but if the answer to the question about doing it quickly would be "we can't do it until the next deployment window", i think their event might be over by then 😅 which is why i asked that question first [10:42:13] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 547, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:42:23] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 
6)): IPoid imports are failing after the container apparently crashed - https://phabricator.wikimedia.org/T373427#10095365 (10Dreamy_Jazz) [10:43:12] !log Running homer 'lsw1-a5-codfw*' commit 'T372878' [10:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:15] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [10:43:28] I'll of course tell the organizer about the possibility of scheduling such an exemption ahead of time, i'm sure they're just not aware of the possibility and/or how to do it [10:43:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 2%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67880 and previous config saved to /var/cache/conftool/dbconfig/20240827-104332-arnaudb.json [10:44:40] Jhs: you can point them to https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap [10:44:41] Jhs: we could probably do something out of the deployment window (cc hashar ) but we'd need a phab task with the details, for an audit trail [10:46:01] !log Running homer 'lsw1-a6-codfw*' commit 'T372878' [10:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:10] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:48:32] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2017.codfw.wmnet [10:48:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2017.codfw.wmnet [10:48:51] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2028.codfw.wmnet [10:48:51] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2028.codfw.wmnet [10:49:04] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2018.codfw.wmnet [10:49:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2018.codfw.wmnet [10:49:27] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427#10095378 (10kostajh) [10:49:31] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:49:57] (03Abandoned) 10JMeybohm: Don't merge: Test PCC run for brokers without ID [puppet] - 10https://gerrit.wikimedia.org/r/1067311 (owner: 10JMeybohm) [10:50:14] (03PS1) 10JMeybohm: Decom kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/1067313 (https://phabricator.wikimedia.org/T373428) [10:50:27] (03PS1) 10JMeybohm: Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428) [10:50:59] kostajh: Jhs: I am fine having a throttle config change to be deployed at anytime [10:51:04] they are rather straight forward :) [10:52:33] I think there is a process about it somewhere [10:52:49] probably in a "how to run an edit a thon" or something [10:52:56] <_joe_> hashar: yes the process says ask two weeks in advance :) [10:53:04] yeah [10:53:09] then it is often missed [10:53:11] <_joe_> which is not there as a condition just for bureaucratic/organizational reasons [10:53:20] <_joe_> although those are also valid [10:53:21] cause organizers are not necessarily aware of that limit [10:53:40] <_joe_> I'm pretty sure 
there's ample documentation of the potential issue [10:53:49] <_joe_> however, let's proceed [10:54:15] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kafka-main2001.codfw.wmnet [10:54:19] <_joe_> Jhs: It's not a possibility, it's a requirement :) see https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap [10:54:45] <_joe_> and yes at a bare minimum we need a phab task following this procedure ^^ [10:55:11] ah we have ample documentation! [10:55:18] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database cswikivoyage (T370912) [10:55:23] T370912: Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912 [10:56:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 8%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67881 and previous config saved to /var/cache/conftool/dbconfig/20240827-105649-arnaudb.json [10:57:33] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:57:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:57:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:58:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:58:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67882 and previous config saved
to /var/cache/conftool/dbconfig/20240827-105815-ladsgroup.json [10:58:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:58:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 3%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67883 and previous config saved to /var/cache/conftool/dbconfig/20240827-105837-arnaudb.json [11:00:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2030.codfw.wmnet with OS bullseye [11:00:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67884 and previous config saved to /var/cache/conftool/dbconfig/20240827-110024-ladsgroup.json [11:00:27] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [11:00:57] !log Starting MediaModeration time limited scan on group0 to make up monthly request limit - https://wikitech.wikimedia.org/wiki/MediaModeration [11:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:13] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10095413 (10mszabo) [11:01:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095415 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [11:02:40] claime, kostajh, _joe_ : Thanks! I still haven't heard back about my question about the IP address and affected projects, so I doubt it will happen today. 
But i'll give them a tip about what the proper procedure is for next time, so it'll be a better situation for everyone :) [11:05:23] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2030.codfw.wmnet [11:05:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2030.codfw.wmnet [11:05:49] (03PS2) 10JMeybohm: Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428) [11:09:14] jouncebot: now and next [11:09:14] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [11:11:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 16%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67885 and previous config saved to /var/cache/conftool/dbconfig/20240827-111154-arnaudb.json [11:12:06] !log start prometheus6002 bookworm upgrade - T326657 [11:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:09] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [11:13:24] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [11:13:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67886 and previous config saved to /var/cache/conftool/dbconfig/20240827-111343-arnaudb.json [11:14:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-main2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [11:14:02] !log jayme@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [11:14:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-main2001.codfw.wmnet [11:15:32] (03CR) 10JMeybohm: [C:03+2] Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428) (owner: 10JMeybohm) [11:15:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P67887 and previous config saved to /var/cache/conftool/dbconfig/20240827-111532-ladsgroup.json [11:15:40] (03CR) 10JMeybohm: [C:03+2] Decom kafka-main2001 [puppet] - 10https://gerrit.wikimedia.org/r/1067313 (https://phabricator.wikimedia.org/T373428) (owner: 10JMeybohm) [11:18:55] (03Merged) 10jenkins-bot: Remove to be decommissioned kafka brokers from fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067315 (https://phabricator.wikimedia.org/T373428) (owner: 10JMeybohm) [11:19:11] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2001.codfw.wmnet - https://phabricator.wikimedia.org/T373428#10095452 (10JMeybohm) [11:19:45] I would like to deploy cxserver if no deployments going on (nothing as per calendar) [11:19:55] !log Deleting misbehaving pod ipoid-production-daily-updates-28742340-h5ckx - T373427 [11:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:59] T373427: IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427 [11:20:34] (03PS1) 10Marostegui: installserver: Do not format db2240 [puppet] - 10https://gerrit.wikimedia.org/r/1067319 [11:20:37] !log start prometheus7001 bookworm upgrade - T326657 [11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:41] T326657: Add prometheus-https load balancer - 
https://phabricator.wikimedia.org/T326657 [11:20:53] !log hashar@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.20 refs T366965 (duration: 47m 15s) [11:20:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database cswikivoyage (T370912) [11:20:56] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [11:21:00] T370912: Prepare and check storage layer for cswikivoyage - https://phabricator.wikimedia.org/T370912 [11:22:02] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1067319 (owner: 10Marostegui) [11:23:26] (03CR) 10Marostegui: [C:03+2] installserver: Do not format db2240 [puppet] - 10https://gerrit.wikimedia.org/r/1067319 (owner: 10Marostegui) [11:24:47] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [11:27:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67889 and previous config saved to /var/cache/conftool/dbconfig/20240827-112700-arnaudb.json [11:27:05] 06SRE, 10iPoid-Service, 06Trust and Safety Product Team, 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427#10095494 (10kostajh) >>! In T373427#10095462... 
[11:28:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 15%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67890 and previous config saved to /var/cache/conftool/dbconfig/20240827-112848-arnaudb.json [11:30:02] I ll do group0 after lunch [11:30:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P67891 and previous config saved to /var/cache/conftool/dbconfig/20240827-113039-ladsgroup.json [11:30:49] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet [11:31:30] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-08-27-045705-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067221 (https://phabricator.wikimedia.org/T369815) (owner: 10KartikMistry) [11:31:43] (03CR) 10Jaime Nuche: releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:31:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2292.codfw.wmnet [11:32:19] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2292.codfw.wmnet [11:32:32] (03Merged) 10jenkins-bot: Update cxserver to 2024-08-27-045705-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067221 (https://phabricator.wikimedia.org/T369815) (owner: 10KartikMistry) [11:33:40] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet [11:38:34] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [11:38:59] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:39:43] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet [11:42:06] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67892 and previous config saved to /var/cache/conftool/dbconfig/20240827-114205-arnaudb.json [11:43:40] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:45] (03PS1) 10Alexandros Kosiaris: Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878) [11:43:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67893 and previous config saved to /var/cache/conftool/dbconfig/20240827-114354-arnaudb.json [11:45:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T370903)', diff saved to https://phabricator.wikimedia.org/P67894 and previous config saved to /var/cache/conftool/dbconfig/20240827-114546-ladsgroup.json [11:45:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:45:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:46:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:46:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T370903)', diff saved to https://phabricator.wikimedia.org/P67895 and previous config saved to /var/cache/conftool/dbconfig/20240827-114608-ladsgroup.json [11:46:19] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:46:22] (03CR) 10CI reject: [V:04-1] Rename mw2292 
to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [11:46:52] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:47:54] (03PS2) 10Alexandros Kosiaris: Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878) [11:49:26] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:50:02] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:50:52] (03CR) 10Jaime Nuche: releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:51:16] (03CR) 10Alexandros Kosiaris: [C:03+2] Rename mw2292 to wikikube-worker2043 [puppet] - 10https://gerrit.wikimedia.org/r/1067325 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [11:51:36] (03CR) 10Jforrester: [C:03+1] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 (owner: 10Bartosz Dziewoński) [11:51:53] !log Updated cxserver to 2024-08-27-045705-production (T369815) [11:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:57] T369815: Enable in content Translation the new languages Google Translate supports in June 2024 - https://phabricator.wikimedia.org/T369815 [11:53:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T370903)', diff saved to https://phabricator.wikimedia.org/P67896 and previous config saved to /var/cache/conftool/dbconfig/20240827-115318-ladsgroup.json [11:53:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:53:57] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2292 to wikikube-worker2043 [11:54:13] 
!log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [11:57:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67897 and previous config saved to /var/cache/conftool/dbconfig/20240827-115711-arnaudb.json [11:58:38] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2292 to wikikube-worker2043 - akosiaris@cumin1002" [11:59:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67898 and previous config saved to /var/cache/conftool/dbconfig/20240827-115859-arnaudb.json [11:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [11:59:41] (03CR) 10EoghanGaffney: [C:03+1] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [11:59:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2292 to wikikube-worker2043 - akosiaris@cumin1002" [11:59:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:59:53] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2043 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1200) [12:00:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2043 [12:00:36] I am doing the group0 
promotion since this morning did not work [12:00:43] we only reached testwikis [12:00:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2292 to wikikube-worker2043 [12:01:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2292 to... [12:01:40] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2043.codfw.wmnet with OS bullseye [12:01:50] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host [12:01:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... 
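Annotation: the cookbook entries above come in START/END pairs (for example, sre.hosts.rename starts at 11:53:57 and ends at 12:00:46). The following is a minimal sketch, not a Wikimedia tool, of how such pairs could be matched to get run durations. The regular expression and the function name `cookbook_durations` are assumptions made for illustration, based only on the entry shape visible in this log.

```python
import re
from datetime import datetime, timedelta

# Assumed entry shape, copied from the log entries above:
#   [HH:MM:SS] !log user@host START - Cookbook <name> ...
#   [HH:MM:SS] !log user@host END (PASS) - Cookbook <name> (exit_code=0) ...
ENTRY = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] !log (?P<who>\S+) "
    r"(?P<phase>START|END) (?:\((?P<result>\w+)\) )?- Cookbook (?P<name>\S+)"
)


def cookbook_durations(log_text):
    """Return (who, cookbook, duration, result) for each START/END pair.

    Hypothetical helper: runs are keyed by (user@host, cookbook name), so two
    overlapping runs of the same cookbook by the same user are not told apart,
    and a run that crosses midnight is not handled.
    """
    open_runs = {}  # (who, cookbook) -> start time
    finished = []
    for m in ENTRY.finditer(log_text):
        key = (m["who"], m["name"])
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        if m["phase"] == "START":
            open_runs[key] = ts
        elif key in open_runs:
            start = open_runs.pop(key)
            finished.append((key[0], key[1], ts - start, m["result"]))
    return finished


sample = (
    "[11:53:57] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename "
    "from mw2292 to wikikube-worker2043 "
    "[12:00:46] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename "
    "(exit_code=0) from mw2292 to wikikube-worker2043"
)

for who, name, took, result in cookbook_durations(sample):
    print(who, name, took, result)
```

Keying on user and cookbook name keeps the sketch short; a stricter version would also compare the host argument that follows the cookbook name.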
[12:02:27] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [12:04:29] (03CR) 10AOkoth: [C:03+2] vrts: create queries to test exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067308 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [12:05:33] PROBLEM - Disk space on restbase2022 is CRITICAL: DISK CRITICAL - free space: /srv/sda4 113109 MB (6% inode=99%): /srv/sdc4 69044 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2022&var-datasource=codfw+prometheus/ops [12:07:02] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067330 (https://phabricator.wikimedia.org/T366965) [12:07:04] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067330 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [12:08:01] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067330 (https://phabricator.wikimedia.org/T366965) (owner: 10TrainBranchBot) [12:08:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P67899 and previous config saved to /var/cache/conftool/dbconfig/20240827-120825-ladsgroup.json [12:10:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc2015.codfw.wmnet with reason: Network maintenance [12:10:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2015.codfw.wmnet with reason: Network maintenance [12:11:33] (03CR) 10Jelto: [V:03+1 C:03+2] profile::firewall::nftables_throttling: fix issue of global metering [puppet] - 10https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [12:11:52] !log kamila@cumin1002 START - Cookbook 
sre.k8s.pool-depool-node depool for host kubernetes2019.codfw.wmnet [12:12:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: post upgrade repooling', diff saved to https://phabricator.wikimedia.org/P67900 and previous config saved to /var/cache/conftool/dbconfig/20240827-121216-arnaudb.json [12:14:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67901 and previous config saved to /var/cache/conftool/dbconfig/20240827-121405-arnaudb.json [12:14:09] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2043 - akosiaris@cumin1002" [12:14:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2043 - akosiaris@cumin1002" [12:14:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:14:13] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2043.codfw.wmnet 162.0.192.10.in-addr.arpa 2.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:14:16] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2043.codfw.wmnet 162.0.192.10.in-addr.arpa 2.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:14:17] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2043 [12:14:20] (03PS1) 10Kamila Součková: Rename kubernetes2019 to wikikube-worker2044 [puppet] - 10https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878) [12:14:47] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067331 
(https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [12:15:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2019.codfw.wmnet [12:15:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2043 [12:15:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [12:16:05] (03PS1) 10David Caro: toolforge:prometheus: remove cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) [12:18:16] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.20 refs T366965 [12:18:20] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [12:18:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:18:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:23:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P67902 and previous config saved to /var/cache/conftool/dbconfig/20240827-122332-ladsgroup.json [12:24:01] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission kafka-main2001.codfw.wmnet - https://phabricator.wikimedia.org/T373428#10095626 (10Jhancock.wm) a:03Jhancock.wm [12:24:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 10observability, 13Patch-For-Review: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560#10095619 (10ayounsi) 05Open→03Resolved All done! 
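Annotation: the dbctl entries above show hosts being repooled in stages (db1161 at 25%, 50%, 75%, then 100%). As a rough sketch under stated assumptions, the staged percentages can be pulled out of the commit messages like this. The pattern and the name `repool_steps` are hypothetical, and the sample strings are abbreviated copies of the quoted commit messages, not full log entries.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the quoted dbctl commit messages above:
#   'db1161 (re)pooling @ 50%: post upgrade repooling'
STEP = re.compile(r"'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%")


def repool_steps(log_text):
    """Map each database host to the pooling percentages seen, in log order."""
    steps = defaultdict(list)
    for m in STEP.finditer(log_text):
        steps[m["host"]].append(int(m["pct"]))
    return dict(steps)


# Abbreviated from the entries above; only the quoted commit message is kept.
sample = " ".join([
    "'db1161 (re)pooling @ 25%: post upgrade repooling'",
    "'db1161 (re)pooling @ 50%: post upgrade repooling'",
    "'db1161 (re)pooling @ 75%: post upgrade repooling'",
    "'db1161 (re)pooling @ 100%: post upgrade repooling'",
])

print(repool_steps(sample))
```

Entries that only say 'Repooling after maintenance ...' carry no percentage and are deliberately not matched by this pattern.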
[12:25:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T371742)', diff saved to https://phabricator.wikimedia.org/P67903 and previous config saved to /var/cache/conftool/dbconfig/20240827-122509-ladsgroup.json [12:25:15] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:28:34] jouncebot: nowandnext [12:28:34] For the next 0 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1200) [12:28:34] In 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1300) [12:29:05] (03CR) 10Zabe: [C:03+2] Revert apparent fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067305 (https://phabricator.wikimedia.org/T368712) (owner: 10Zabe) [12:29:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67904 and previous config saved to /var/cache/conftool/dbconfig/20240827-122910-arnaudb.json [12:29:49] (03Merged) 10jenkins-bot: Revert apparent fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067305 (https://phabricator.wikimedia.org/T368712) (owner: 10Zabe) [12:30:12] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067305|Revert apparent fix (T368712)]] [12:30:18] T368712: Change sysop_plwiki logo and favicon - https://phabricator.wikimedia.org/T368712 [12:30:31] (03PS1) 10AOkoth: vrts: add ticket count metrics for different queues [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) [12:32:18] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage [12:32:41] (03PS1) 10Jelto: gitlab: add profile::prometheus::nft_throttling_denylist [puppet] - 
10https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) [12:33:54] !log zabe@deploy1003 zabe: Backport for [[gerrit:1067305|Revert apparent fix (T368712)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:33:55] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1067336/3756/" [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [12:34:08] !log zabe@deploy1003 zabe: Continuing with sync [12:34:51] (03PS1) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) [12:34:52] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3757/co" [puppet] - 10https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [12:35:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage [12:36:28] (03CR) 10CI reject: [V:04-1] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [12:38:33] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067305|Revert apparent fix (T368712)]] (duration: 08m 20s) [12:38:37] T368712: Change sysop_plwiki logo and favicon - https://phabricator.wikimedia.org/T368712 [12:38:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T370903)', diff saved to https://phabricator.wikimedia.org/P67905 and previous config saved to /var/cache/conftool/dbconfig/20240827-123839-ladsgroup.json [12:38:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:38:43] T370903: Remove 
cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:38:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:40:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P67906 and previous config saved to /var/cache/conftool/dbconfig/20240827-124016-ladsgroup.json [12:46:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:46:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:46:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T370903)', diff saved to https://phabricator.wikimedia.org/P67907 and previous config saved to /var/cache/conftool/dbconfig/20240827-124629-ladsgroup.json [12:46:33] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:46:46] !log zabe@mwmaint1002:~$ foreachwikiindblist private wrapOldPasswords.php --type BEP --update # T91917 [12:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:55] (03PS1) 10Brouberol: cloudnative-pg: enable ingress traffic to the prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067340 (https://phabricator.wikimedia.org/T372284) [12:48:09] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1067340 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [12:48:34] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: enable ingress traffic to the prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067340 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [12:49:59] !log zabe@mwmaint1002:~$ foreachwikiindblist fishbowl wrapOldPasswords.php --type BEP --update # T91917 [12:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:50] (03PS2) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) [12:51:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T370903)', diff saved to https://phabricator.wikimedia.org/P67908 and previous config saved to /var/cache/conftool/dbconfig/20240827-125139-ladsgroup.json [12:51:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:52:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:52:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:52:15] (03CR) 10CI reject: [V:04-1] cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [12:52:56] (03CR) 10David Caro: [C:03+2] toolforge:prometheus: only kyverno controllers expose stats [puppet] - 10https://gerrit.wikimedia.org/r/1067307 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [12:53:00] (03CR) 10David Caro: [C:03+2] toolforge:prometheus: drop metrics as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1067309 (https://phabricator.wikimedia.org/T370143) (owner: 
10David Caro) [12:53:18] (03CR) 10David Caro: "Turns out that it might not be cadvisor the culprit, looking" [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [12:53:47] (03PS1) 10Ssingh: P:ntp: set time for CRITICAL alert to 2 hours (from 4) for service check [puppet] - 10https://gerrit.wikimedia.org/r/1067341 [12:54:30] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3758/co" [puppet] - 10https://gerrit.wikimedia.org/r/1067341 (owner: 10Ssingh) [12:55:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2043.codfw.wmnet with OS bullseye [12:55:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P67909 and previous config saved to /var/cache/conftool/dbconfig/20240827-125523-ladsgroup.json [12:56:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [12:56:49] (03CR) 10Ssingh: [V:03+1 C:03+2] P:ntp: set time for CRITICAL alert to 2 hours (from 4) for service check [puppet] - 10https://gerrit.wikimedia.org/r/1067341 (owner: 10Ssingh) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1300). [13:00:05] Daimona and joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:28] o/ [13:00:49] hello team! [13:00:56] o/ [13:01:12] (03CR) 10EoghanGaffney: [C:03+1] vrts: add ticket count metrics for different queues [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [13:01:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10095785 (10ssingh) Thanks @kzimmerman! This just leaves us with @thcipriani's approval and I will merge the patch once that is in. [13:01:42] I can deploy [13:01:56] HouseOfM: I can't see a patch for you in the window? [13:02:05] (03PS4) 10Joely Rooke WMDE: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) [13:02:12] I'm here for @Daimona's patch [13:02:18] (03CR) 10Zabe: [C:03+2] Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [13:02:35] ah alright:) [13:03:19] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10095796 (10ssingh) >>! In T373142#10094180, @KFrancis wrote: > Hello all, I am confirming as @NCreasy is a contractor with the WMF, there is already an NDA in place. Thanks! Sorry for the confus... 
[13:03:30] (03PS3) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) [13:03:42] (03CR) 10AOkoth: [C:03+2] vrts: add ticket count metrics for different queues [puppet] - 10https://gerrit.wikimedia.org/r/1067336 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [13:04:51] (03PS2) 10Daimona Eaytoy: Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) [13:04:53] (03CR) 10Zabe: [C:03+2] Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy) [13:05:13] (03Merged) 10jenkins-bot: Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067282 (https://phabricator.wikimedia.org/T66315) (owner: 10Joely Rooke WMDE) [13:05:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy) [13:05:56] (03Merged) 10jenkins-bot: Enable CampaignEvents Invitation Lists in production testing environments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066777 (https://phabricator.wikimedia.org/T373041) (owner: 10Daimona Eaytoy) [13:06:15] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067282|Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]], [[gerrit:1066777|Enable CampaignEvents Invitation Lists in production testing environments (T373041)]] [13:06:22] T373041: Release Invitation lists to all wikis with CampaignEvents extension + enable on test wikis - https://phabricator.wikimedia.org/T373041 
[13:06:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 463, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P67910 and previous config saved to /var/cache/conftool/dbconfig/20240827-130647-ladsgroup.json [13:08:38] !log zabe@deploy1003 joelyrookewmde, daimona, zabe: Backport for [[gerrit:1067282|Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]], [[gerrit:1066777|Enable CampaignEvents Invitation Lists in production testing environments (T373041)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:57] Daimona: HouseOfM: joelyrookewmde: can you test? [13:09:03] yes will do now [13:09:27] testing [13:09:32] yup, thx [13:10:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T371742)', diff saved to https://phabricator.wikimedia.org/P67911 and previous config saved to /var/cache/conftool/dbconfig/20240827-131031-ladsgroup.json [13:10:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:10:35] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:10:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:11:45] (03PS1) 10Brouberol: Remove the pgcluster-test in dse-k8s, no longer useful [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067346 [13:12:25] OK I'm officially dumb, the special pages are working, but the feature is broken because I forgot to create the DB tables [13:13:03] the config change I'm doing currently only affects the beta cluster, but mediawikiDebug doesn't show any options for test servers right now [13:13:19] 
Is there something I need to do or can we not test the beta cluster in this way? [13:13:41] zabe can I go ahead and create these tables in production now? [13:13:47] joelyrookewmde: you can't test beta cluster with mwdebug; so if it only affects beta, I would just sync it through [13:13:52] ok [13:13:53] Daimona: sure, go ahead:) [13:13:55] works for me [13:14:15] it will reach beta cluster ~10-15 min after the merge [13:14:24] perfect, thanks! [13:14:32] <_joe_> Daimona: uh wait, what do you mean create the tables in production? [13:16:57] We have a few DB tables that need to be created in production for a new feature. These have already been approved by DBA. We would generally create them in a dedicated window, hence my question. [13:17:28] <_joe_> yeah, just let the DBAs know/confirm now is ok :) [13:17:58] Daimona: Go for it yes [13:18:00] (Also, y'all please bear with me, I'm trying to find the relevant tasks, and now is a perfect time to find out that they're not in the parent-child hierarchy for this project) [13:19:13] Also a perfect time to discover that phab has a limit of 20 subtasks, apparently [13:19:45] <_joe_> Daimona: you can also find out how good phab search is :D [13:19:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 1%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67912 and previous config saved to /var/cache/conftool/dbconfig/20240827-131947-arnaudb.json [13:21:33] @Daimona https://phabricator.wikimedia.org/T369303? 
[13:21:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P67913 and previous config saved to /var/cache/conftool/dbconfig/20240827-132154-ladsgroup.json [13:22:22] Yes [13:22:41] Wait wtf, this limit is a new thing isn't it, since the task in question already has 23 subtasks [13:23:00] (03PS1) 10Ssingh: admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142) [13:23:39] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::collector and log*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [13:23:55] (03CR) 10CI reject: [V:04-1] admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142) (owner: 10Ssingh) [13:29:54] !log Creating new DB tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T369303 [13:29:57] (03PS2) 10Ssingh: admin: add ncreasy to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142) [13:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:58] T369303: Create the DB schema for invitation lists in prod - https://phabricator.wikimedia.org/T369303 [13:30:38] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10095911 (10elukey) Today I cleaned up some db nodes reported as debmonitor client failures while I was on holiday: ` >>> spicerack.debmonitor().host_delete('d... 
[13:32:54] (PS1) Brouberol: Upgrade airflow to 2.10.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284)
[13:33:01] zabe: tables created and it's looking good now
[13:33:08] alright:)
[13:33:16] !log zabe@deploy1003 joelyrookewmde, daimona, zabe: Continuing with sync
[13:34:02] Sorry for the inconvenience, I just completely forgot about the database :O
[13:34:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 2%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67914 and previous config saved to /var/cache/conftool/dbconfig/20240827-133452-arnaudb.json
[13:35:35] no worries; i investigated some old type password hashes in fishbowl and private wikis in the meantime
[13:36:31] Thanks Daimona
[13:37:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T370903)', diff saved to https://phabricator.wikimedia.org/P67915 and previous config saved to /var/cache/conftool/dbconfig/20240827-133701-ladsgroup.json
[13:37:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[13:37:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[13:37:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[13:37:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T370903)', diff saved to https://phabricator.wikimedia.org/P67917 and previous config saved to /var/cache/conftool/dbconfig/20240827-133723-ladsgroup.json
[13:37:34] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::collector and log*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[13:37:43] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067282|Register feature flag for moving wikibase item to Other Projects sidebar in pilot wikis.]], [[gerrit:1066777|Enable CampaignEvents Invitation Lists in production testing environments (T373041)]] (duration: 31m 27s)
[13:37:47] T373041: Release Invitation lists to all wikis with CampaignEvents extension + enable on test wikis - https://phabricator.wikimedia.org/T373041
[13:39:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T370903)', diff saved to https://phabricator.wikimedia.org/P67918 and previous config saved to /var/cache/conftool/dbconfig/20240827-133933-ladsgroup.json
[13:42:01] (PS1) Klausman: ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - https://gerrit.wikimedia.org/r/1067353
[13:43:44] (PS1) Elukey: hosts/views.py: add logging when upgrading the host's OS [software/debmonitor] - https://gerrit.wikimedia.org/r/1067354 (https://phabricator.wikimedia.org/T368744)
[13:43:46] (CR) AikoChou: [C:+1] ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - https://gerrit.wikimedia.org/r/1067353 (owner: Klausman)
[13:44:11] (CR) Klausman: [C:+2] ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - https://gerrit.wikimedia.org/r/1067353 (owner: Klausman)
[13:44:45] !log zabe@mwmaint1002:~$ foreachwikiindblist fishbowl sql.php --query "UPDATE user SET user_password = CONCAT(':B:', user_id, ':', user_password) WHERE user_password RLIKE '^[0-9a-f]{32}$';" # T91917
[13:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:09] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10095958 (thcipriani) >>! In T373379#10095785, @ssingh wrote: > Thanks @kzimmerman! This just leaves us with @thcipriani's approval and I will merge the patch once...
[13:45:12] !log zabe@mwmaint1002:~$ foreachwikiindblist private sql.php --query "UPDATE user SET user_password = CONCAT(':B:', user_id, ':', user_password) WHERE user_password RLIKE '^[0-9a-f]{32}$';" # T91917
[13:45:12] (Merged) jenkins-bot: ml-services: switch nlwiki-damaging to multiprocessing [deployment-charts] - https://gerrit.wikimedia.org/r/1067353 (owner: Klausman)
[13:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:36] !log zabe@mwmaint1002:~$ foreachwikiindblist fishbowl wrapOldPasswords.php --type BEP --update # T91917
[13:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:40] !log zabe@mwmaint1002:~$ foreachwikiindblist private wrapOldPasswords.php --type BEP --update # T91917
[13:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:48] !log add routinator to bookworm-wikipedia apt repo - T372909
[13:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:52] T372909: Create prod VMs on routed ganeti cluster - https://phabricator.wikimedia.org/T372909
[13:48:25] !log add bgpalerter to bookworm-wikipedia apt repo - T372909
[13:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 3%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67919 and previous config saved to /var/cache/conftool/dbconfig/20240827-134958-arnaudb.json
[13:52:52] (CR) Clément Goubert: [C:+1] Rename kubernetes2019 to wikikube-worker2044 [puppet] - https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878) (owner: Kamila Součková)
[13:53:33] (PS1) Ayounsi: RPKI: replace rpki2002 with rpki2003 [homer/public] - https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909)
[13:54:11] (CR) Ayounsi: [V:+1 C:+2] Netbox: enable devicetype validator [puppet] - https://gerrit.wikimedia.org/r/1066722 (https://phabricator.wikimedia.org/T348036) (owner: Ayounsi)
[13:54:21] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10096002 (VRiley-WMF) Open→Resolved Drive has been replaced. Please let us know if there are any other issues with this drive. Thanks!
[13:54:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P67920 and previous config saved to /var/cache/conftool/dbconfig/20240827-135440-ladsgroup.json
[13:57:44] SRE, DC-Ops, Infrastructure-Foundations, netbox, Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10096042 (ayounsi) Open→Resolved Deployed! let me know if any issue.
[14:02:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2015-2016].codfw.wmnet,pc[1015-1016].eqiad.wmnet with reason: Switchover
[14:02:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2015-2016].codfw.wmnet,pc[1015-1016].eqiad.wmnet with reason: Switchover
[14:02:51] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373401#10096117 (Jhancock.wm) Open→Resolved
[14:03:12] (CR) Ssingh: [C:+2] admin: add ncreasy to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1067348 (https://phabricator.wikimedia.org/T373142) (owner: Ssingh)
[14:04:25] (PS1) Marostegui: mariadb: Promote pc2015 to pc4 master [puppet] - https://gerrit.wikimedia.org/r/1067357 (https://phabricator.wikimedia.org/T373340)
[14:05:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 5%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67921 and previous config saved to /var/cache/conftool/dbconfig/20240827-140503-arnaudb.json
[14:05:29] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T373142#10096149 (ssingh) Open→Resolved a:ssingh Added to `nda` group. Please try logging in to Superset after ~30 mins. Thanks!
[14:05:58] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096155 (Jhancock.wm) @Marostegui hey Papaul's on vacation this week. From what I remember that is a 10G issue. We started using this tag in the reimage script to keep this one from coming u...
[14:06:37] (CR) Dzahn: [C:+2] prometheus: create text file export for nft throttling denylist length (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1064823 (https://phabricator.wikimedia.org/T373136) (owner: Dzahn)
[14:06:58] (CR) Dzahn: [C:+1] gitlab: add profile::prometheus::nft_throttling_denylist [puppet] - https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[14:07:57] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[14:08:09] (CR) Jelto: [V:+1 C:+2] gitlab: add profile::prometheus::nft_throttling_denylist [puppet] - https://gerrit.wikimedia.org/r/1067337 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[14:09:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P67922 and previous config saved to /var/cache/conftool/dbconfig/20240827-140947-ladsgroup.json
[14:10:09] (CR) Arnaudb: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[14:11:03] (CR) Dzahn: [C:+1] profile::firewall::nftables_throttling: fix issue of global metering [puppet] - https://gerrit.wikimedia.org/r/1066782 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[14:11:05] (CR) Marostegui: [C:+2] mariadb: Promote pc2015 to pc4 master [puppet] - https://gerrit.wikimedia.org/r/1067357 (https://phabricator.wikimedia.org/T373340) (owner: Marostegui)
[14:11:57] (CR) Dzahn: [C:+1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/1067222 (https://phabricator.wikimedia.org/T373136) (owner: Jelto)
[14:12:20] SRE, SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10096191 (ssingh)
[14:12:21] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096192 (elukey) >>! In T373417#10096155, @Jhancock.wm wrote: > @Marostegui hey Papaul's on vacation this week. From what I remember that is a 10G issue. We started using this tag in the rei...
[14:13:15] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10096198 (ssingh)
[14:13:19] (CR) Ssingh: [C:+2] admin: add jiawang to deployment group [puppet] - https://gerrit.wikimedia.org/r/1066833 (https://phabricator.wikimedia.org/T373379) (owner: Ssingh)
[14:13:29] ops-codfw, DC-Ops, decommission-hardware, serviceops, Patch-For-Review: decommission kafka-main2001.codfw.wmnet - https://phabricator.wikimedia.org/T373428#10096188 (Jhancock.wm) Open→Resolved
[14:13:35] (PS1) AOkoth: vrts: add yearly ticket count [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419)
[14:13:51] SRE, SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10096195 (ssingh) @thcipriani: this requires your approval, thank you.
[14:14:07] (CR) CI reject: [V:-1] vrts: add yearly ticket count [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: AOkoth)
[14:16:54] (PS2) AOkoth: vrts: add yearly ticket count [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419)
[14:17:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS bookworm
[14:18:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS bookworm
[14:18:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm
[14:18:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Switch pc4 master to pc2015 T373340', diff saved to https://phabricator.wikimedia.org/P67923 and previous config saved to /var/cache/conftool/dbconfig/20240827-141845-marostegui.json
[14:18:48] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 545, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:18:49] T373340: pc2016 switchover - https://phabricator.wikimedia.org/T373340
[14:18:50] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096223 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2230.codfw.wmnet with OS bookworm
[14:18:54] !log tappof@cumin2002 END (FAIL) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=99) rolling restart_daemons on P{O:logging::opensearch::data and logs*.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[14:19:46] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096228 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2231.codfw.wmnet with OS bookworm
[14:19:49] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096229 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2232.codfw.wmnet with OS bookworm
[14:20:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 15%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67924 and previous config saved to /var/cache/conftool/dbconfig/20240827-142009-arnaudb.json
[14:20:10] !log T327878 uncordon wikikube-worker2043
[14:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:16] T327878: Tweak Autocomplete search results on the Mongolian Wikipedia - https://phabricator.wikimedia.org/T327878
[14:20:27] sigh, wrong task
[14:20:36] !log T372878 uncordon wikikube-worker2043
[14:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:40] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[14:21:25] (CR) Krinkle: [C:+1] wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[14:21:33] jouncebot: next
[14:21:33] In 0 hour(s) and 38 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1500)
[14:22:08] akosiaris: I'd need to reboot wikikube-ctrl2003 for https://phabricator.wikimedia.org/T371132, am I going to interfere with some work that you are doing?
[14:22:12] I can wait in case
[14:23:17] (PS3) AOkoth: vrts: add yearly ticket count [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419)
[14:23:56] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2114.codfw.wmnet - https://phabricator.wikimedia.org/T362948#10096268 (Jhancock.wm) Open→Resolved
[14:24:14] !log Update zarcillo db for pc4 master T373340
[14:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:21] T373340: pc2016 switchover - https://phabricator.wikimedia.org/T373340
[14:24:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T370903)', diff saved to https://phabricator.wikimedia.org/P67925 and previous config saved to /var/cache/conftool/dbconfig/20240827-142454-ladsgroup.json
[14:24:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[14:24:59] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:25:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[14:25:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T370903)', diff saved to https://phabricator.wikimedia.org/P67926 and previous config saved to /var/cache/conftool/dbconfig/20240827-142516-ladsgroup.json
[14:25:45] (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/1067360/3761/" [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: AOkoth)
[14:26:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:26:07] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox
[14:26:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:26:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:26:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on db2186.codfw.wmnet with reason: Schema change
[14:29:28] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding AAAA field to wdqs101[1-3] and wdqs200[7-8] - brouberol@cumin1002"
[14:29:33] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding AAAA field to wdqs101[1-3] and wdqs200[7-8] - brouberol@cumin1002"
[14:29:33] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:30:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T370903)', diff saved to https://phabricator.wikimedia.org/P67927 and previous config saved to /var/cache/conftool/dbconfig/20240827-143027-ladsgroup.json
[14:30:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[14:31:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2230.codfw.wmnet with reason: host reimage
[14:32:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2232.codfw.wmnet with reason: host reimage
[14:32:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[14:34:11] (CR) Bartosz Dziewoński: "@Bryan It seems that you authored this, can you also have a look?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1066899 (owner: Bartosz Dziewoński)
[14:34:38] (PS1) Btullis: Add the matomo-plugin-customreports package to Matomo [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203)
[14:35:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2230.codfw.wmnet with reason: host reimage
[14:35:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67928 and previous config saved to /var/cache/conftool/dbconfig/20240827-143514-arnaudb.json
[14:35:34] (PS2) Btullis: Add the matomo-plugin-customreports package to Matomo [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203)
[14:36:18] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3762/co" [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203) (owner: Btullis)
[14:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[14:37:50] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment group for jiawang - https://phabricator.wikimedia.org/T373379#10096358 (ssingh) Open→Resolved a:ssingh @jwang: request merged, thanks! Please re-open if there are any issues.
[14:39:36] (PS1) Marostegui: installserver: Do not reimage db2239 [puppet] - https://gerrit.wikimedia.org/r/1067363
[14:40:54] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=wikikube-ctrl2003.codfw.wmnet
[14:41:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2232.codfw.wmnet with reason: host reimage
[14:41:32] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on wikikube-ctrl2003.codfw.wmnet with reason: running provision again
[14:41:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on wikikube-ctrl2003.codfw.wmnet with reason: running provision again
[14:41:49] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2293.codfw.wmnet
[14:42:14] (PS1) Bartosz Dziewoński: logging: Remove WhatFailureGroupHandler wrapper from handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1067364 (https://phabricator.wikimedia.org/T373444)
[14:42:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2293.codfw.wmnet
[14:43:14] (CR) Marostegui: [C:+2] installserver: Do not reimage db2239 [puppet] - https://gerrit.wikimedia.org/r/1067363 (owner: Marostegui)
[14:44:31] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy GRACEFUL
[14:45:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P67929 and previous config saved to /var/cache/conftool/dbconfig/20240827-144534-ladsgroup.json
[14:45:35] (CR) Brouberol: "Looking at the [dashboard](https://grafana-rw.wikimedia.org/d/cloudnative-pg/cloudnativepg?forceLogin=&from=now-15m&orgId=1&refresh=30s&to" [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: Brouberol)
[14:45:53] (PS1) Ayounsi: Network report: remove wdqs from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555)
[14:46:12] (CR) Brouberol: "We should also monitor whether the cloudnative-pg operator pod is running and healthy" [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: Brouberol)
[14:47:57] (CR) Dzahn: [C:+1] vrts: add yearly ticket count [puppet] - https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: AOkoth)
[14:48:51] PROBLEM - BGP status on lsw1-a2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:49:28] (CR) Elukey: [C:+1] Add the matomo-plugin-customreports package to Matomo [puppet] - https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203) (owner: Btullis)
[14:50:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67930 and previous config saved to /var/cache/conftool/dbconfig/20240827-145020-arnaudb.json
[14:51:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2230.codfw.wmnet with OS bookworm
[14:51:24] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2230.codfw.wmnet with OS bookworm completed: - db2230 (**PASS**) - Removed from Pu...
[14:52:12] (PS4) Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284)
[14:54:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2231.codfw.wmnet with OS bookworm
[14:54:04] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096442 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2231.codfw.wmnet with OS bookworm completed: - db2231 (**PASS**) - Removed from Pu...
[14:55:03] (PS24) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[14:55:44] (CR) CI reject: [V:-1] prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[14:55:46] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096444 (Marostegui) In progress→Resolved Thanks @Jhancock.wm - that worked!
[14:56:51] RECOVERY - BGP status on lsw1-a2-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:55] (CR) Brouberol: [C:+1] "LGTM!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555) (owner: Ayounsi)
[14:57:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2232.codfw.wmnet with OS bookworm
[14:58:22] ops-codfw, DBA, DC-Ops: db2230, db2231 and db2232 reimage failure - https://phabricator.wikimedia.org/T373417#10096452 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2232.codfw.wmnet with OS bookworm completed: - db2232 (**PASS**) - Removed fro...
[14:59:16] (PS25) CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204)
[15:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1500).
[15:00:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P67931 and previous config saved to /var/cache/conftool/dbconfig/20240827-150041-ladsgroup.json
[15:01:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:02:27] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2003.codfw.wmnet
[15:05:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67932 and previous config saved to /var/cache/conftool/dbconfig/20240827-150525-arnaudb.json
[15:09:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[15:09:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[15:09:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T371742)', diff saved to https://phabricator.wikimedia.org/P67933 and previous config saved to /var/cache/conftool/dbconfig/20240827-150952-ladsgroup.json
[15:09:59] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[15:11:11] !log restart httpd on crm2001 for libaom upgrades
[15:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:31] !log restart httpd and librenms-syslog.service on netmon1003 for libaom upgrades
[15:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:00] (CR) Kamila Součková: [C:+2] Rename kubernetes2019 to wikikube-worker2044 [puppet] - https://gerrit.wikimedia.org/r/1067331 (https://phabricator.wikimedia.org/T372878) (owner: Kamila Součková)
[15:15:13] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2019 to wikikube-worker2044
[15:15:30] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:15:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T370903)', diff saved to https://phabricator.wikimedia.org/P67934 and previous config saved to /var/cache/conftool/dbconfig/20240827-151548-ladsgroup.json
[15:15:50] (CR) Ebernhardson: Pull some flink config down into the chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: Ebernhardson)
[15:15:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[15:15:52] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:16:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[15:16:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T370903)', diff saved to https://phabricator.wikimedia.org/P67935 and previous config saved to /var/cache/conftool/dbconfig/20240827-151610-ladsgroup.json
[15:18:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T370903)', diff saved to https://phabricator.wikimedia.org/P67936 and previous config saved to /var/cache/conftool/dbconfig/20240827-151819-ladsgroup.json
[15:19:02] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2019 to wikikube-worker2044 - kamila@cumin1002"
[15:19:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2019 to wikikube-worker2044 - kamila@cumin1002"
[15:19:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:19:20] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2044
[15:19:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:19:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2044
[15:20:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2019 to wikikube-worker2044
[15:20:17] (PS1) Alexandros Kosiaris: Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878)
[15:20:25] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096550 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20...
[15:20:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: post maintenance', diff saved to https://phabricator.wikimedia.org/P67937 and previous config saved to /var/cache/conftool/dbconfig/20240827-152031-arnaudb.json
[15:22:39] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2044.codfw.wmnet with OS bullseye
[15:22:54] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host
[15:22:59] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096557 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[15:23:04] (CR) CI reject: [V:-1] Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris)
[15:23:12] !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:24:50] (CR) Krinkle: [C:+1] logging: Remove WhatFailureGroupHandler wrapper from handlers [mediawiki-config] - https://gerrit.wikimedia.org/r/1067364 (https://phabricator.wikimedia.org/T373444) (owner: Bartosz Dziewoński)
[15:25:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:26:18] (PS2) Alexandros Kosiaris: Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878)
[15:26:40] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2044 - kamila@cumin1002"
[15:26:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2044 - kamila@cumin1002"
[15:26:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:26:45] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2044.codfw.wmnet 207.0.192.10.in-addr.arpa 7.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:26:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2044.codfw.wmnet 207.0.192.10.in-addr.arpa 7.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:26:49] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2044
[15:27:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2044
[15:27:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host
[15:29:18] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2027.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:29:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:30:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:31:52] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2027.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:33:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P67939 and previous config saved to /var/cache/conftool/dbconfig/20240827-153327-ladsgroup.json
[15:33:34] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2028.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:34:34] (CR) Alexandros Kosiaris: [C:+2] Rename mw2293 to wikikube-worker2045 [puppet] - https://gerrit.wikimedia.org/r/1067373 (https://phabricator.wikimedia.org/T372878) (owner: Alexandros Kosiaris)
[15:35:04] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2028.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw)
[15:35:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2293 to wikikube-worker2045
[15:35:22] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[15:36:15] (PS3) Dbrant: Turn account vanishing contact form into a redirect.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) [15:36:19] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2029.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:37:27] (03Abandoned) 10Brouberol: Remove the pgcluster-test in dse-k8s, no longer useful [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067346 (owner: 10Brouberol) [15:39:01] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2029.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:39:43] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2293 to wikikube-worker2045 - akosiaris@cumin1002" [15:39:57] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2033.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:42:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2293 to wikikube-worker2045 - akosiaris@cumin1002" [15:42:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:21] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2045 [15:42:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2045 [15:42:45] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage [15:43:11] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.hosts.rename (exit_code=0) from mw2293 to wikikube-worker2045 [15:43:23] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2293 to... [15:43:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2045.codfw.wmnet with OS bullseye [15:43:59] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host [15:44:01] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096620 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [15:44:09] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [15:45:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage [15:45:46] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2033.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:46:15] (03PS1) 10Elukey: blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) [15:46:36] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2034.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:48:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T367856)', diff saved to https://phabricator.wikimedia.org/P67940 
and previous config saved to /var/cache/conftool/dbconfig/20240827-154823-marostegui.json [15:48:31] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:48:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P67941 and previous config saved to /var/cache/conftool/dbconfig/20240827-154834-ladsgroup.json [15:48:46] (03PS3) 10Cathal Mooney: Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) [15:49:02] (03CR) 10Hnowlan: [C:03+1] blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey) [15:49:54] (03CR) 10CI reject: [V:04-1] Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney) [15:50:24] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2045 - akosiaris@cumin1002" [15:50:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2045 - akosiaris@cumin1002" [15:50:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:50:29] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2045.codfw.wmnet 163.0.192.10.in-addr.arpa 3.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:50:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2045.codfw.wmnet 163.0.192.10.in-addr.arpa 
3.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:50:33] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2045 [15:50:58] (03PS26) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [15:51:05] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2034.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:51:44] (03PS4) 10Cathal Mooney: Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) [15:52:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2045 [15:52:01] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [15:52:10] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2035.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:54:44] !log Start prometheus4002 Bookworm upgrade - T326657 [15:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:47] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [15:57:10] 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10096680 (10BTullis) Checking the `megacli` output shows that the RAID BBU reports OK. ` btullis@an-worker1127:~$ sudo megacli -AdpBbuCmd -aA...
[15:57:14] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2035.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:57:35] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2036.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:57:48] FIRING: KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2292.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:57:48] (03CR) 10Elukey: [C:03+2] blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey) [15:57:58] 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10096682 (10BTullis) a:03BTullis [15:58:57] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2036.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:59:27] !log tappof@cumin2002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on P{O:logging::opensearch::data and logs*2037.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [15:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:00:00] (03CR) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data (032 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [16:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:23] (03CR) 10Cathal Mooney: Expose Netbox tunnel data to config templates (038 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney) [16:00:36] 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10096690 (10BTullis) a:03BTullis [16:01:22] (03Merged) 10jenkins-bot: blubber: no-op change to trigger a rebuild and get security updates [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1067379 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey) [16:03:16] !log tappof@cumin2002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on P{O:logging::opensearch::data and logs*2037.codfw.wmnet} and (A:datahubsearch or A:logstash-eqiad or A:logstash-codfw) [16:03:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P67942 and previous config saved to /var/cache/conftool/dbconfig/20240827-160330-marostegui.json [16:03:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T370903)', diff saved to https://phabricator.wikimedia.org/P67943 and previous config saved to /var/cache/conftool/dbconfig/20240827-160341-ladsgroup.json [16:03:43] 
!log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [16:03:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:03:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [16:04:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T370903)', diff saved to https://phabricator.wikimedia.org/P67944 and previous config saved to /var/cache/conftool/dbconfig/20240827-160403-ladsgroup.json [16:05:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2044.codfw.wmnet with OS bullseye [16:05:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096726 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... [16:08:58] (03CR) 10BryanDavis: "The code is there for exactly what it says on the tin: debugging LDAP problems on wikitech. 
There have been several times in the past when" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński) [16:12:10] !log ran homer to add wikikube-worker2044 T372878 [16:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:13] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [16:13:04] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2044.codfw.wmnet [16:13:05] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2044.codfw.wmnet [16:14:29] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457 (10kamila) 03NEW [16:17:42] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet [16:18:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P67945 and previous config saved to /var/cache/conftool/dbconfig/20240827-161837-marostegui.json [16:19:52] (03PS4) 10Ryan Kemper: wdqs: store graph type in data_loaded file [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking) [16:21:41] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet [16:25:23] !log Start prometheus5002 Bookworm upgrade - T326657 [16:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:27] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [16:31:17] (03PS1) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) 
[16:31:59] (03CR) 10CI reject: [V:04-1] Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [16:33:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T367856)', diff saved to https://phabricator.wikimedia.org/P67946 and previous config saved to /var/cache/conftool/dbconfig/20240827-163345-marostegui.json [16:33:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2166.codfw.wmnet with reason: Maintenance [16:33:49] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [16:33:55] (03PS2) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) [16:34:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2166.codfw.wmnet with reason: Maintenance [16:34:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T367856)', diff saved to https://phabricator.wikimedia.org/P67947 and previous config saved to /var/cache/conftool/dbconfig/20240827-163407-marostegui.json [16:35:48] (03PS1) 10Elukey: services: update Thumbor Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067382 (https://phabricator.wikimedia.org/T373363) [16:36:49] (03CR) 10Hnowlan: [C:03+1] services: update Thumbor Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067382 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey) [16:38:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T370903)', diff saved to https://phabricator.wikimedia.org/P67948 and previous config saved to /var/cache/conftool/dbconfig/20240827-163817-ladsgroup.json [16:38:22] T370903: Remove 
cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:39:49] (03PS1) 10Bking: wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) [16:40:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [16:40:26] (03PS2) 10Andrew Bogott: openstack keystone: add a new auth plugin to validate totp tokens against idm [puppet] - 10https://gerrit.wikimedia.org/r/1064480 (https://phabricator.wikimedia.org/T373462) [16:40:28] (03PS2) 10Andrew Bogott: openstack keystone: switch to idmtotp for 2fa [puppet] - 10https://gerrit.wikimedia.org/r/1064481 (https://phabricator.wikimedia.org/T373462) [16:42:23] (03PS2) 10Bking: wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) [16:42:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [16:44:37] (03PS1) 10Hnowlan: aptrepo: add ffmpeg buster component [puppet] - 10https://gerrit.wikimedia.org/r/1067384 (https://phabricator.wikimedia.org/T373128) [16:45:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002" [16:45:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002" [16:45:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:48:37] (03CR) 10Ssingh: [C:03+1] "The backend setup has been verified, correct?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [16:49:05] (03PS3) 10Bking: wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) [16:50:21] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet [16:51:50] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [16:52:13] (03PS2) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) [16:52:29] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [16:53:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P67949 and previous config saved to /var/cache/conftool/dbconfig/20240827-165325-ladsgroup.json [16:54:22] (03CR) 10Bking: "ACK, looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/traffics" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [16:56:35] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet [16:57:59] (03CR) 10Ryan Kemper: "PCC looks good. 
We should discuss with sukhe how this is best deployed; ie can we just do an lvs rolling restart directly or do we need to" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [16:58:04] PROBLEM - Host an-worker1165 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:09] RECOVERY - Hadoop NodeManager on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:01:09] RECOVERY - Hadoop DataNode on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:01:11] RECOVERY - Host an-worker1165 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [17:02:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery) [17:06:04] (03CR) 10Ryan Kemper: [C:03+1] "Spoke to sukhe, simple lvs restart should be sufficient for this" [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [17:06:19] (03CR) 10Bking: [C:03+2] wdqs-main, wdqs-scholarly: use TLS for pybal pools [puppet] - 10https://gerrit.wikimedia.org/r/1067383 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [17:08:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P67950 and previous config saved to /var/cache/conftool/dbconfig/20240827-170832-ladsgroup.json [17:08:42] (03CR) 10Btullis: [V:03+1 C:03+2] Add the matomo-plugin-customreports 
package to Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1067362 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [17:08:46] !log T364368 Disabled puppet on all lvs hosts in preparation for rolling restart [17:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:50] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [17:08:52] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2045.codfw.wmnet with OS bullseye [17:09:15] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10097092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [17:09:22] (03PS4) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) [17:10:08] (03CR) 10Hnowlan: [C:03+1] php8.1-cli: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064814 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [17:10:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10097095 (10VRiley-WMF) logging-sd1001 Rack E 5 U 32 CableID 20220092 Port 18 logging-sd1002 Rack E 6 U 31 CableID 20220057 Port 18 logging-sd1003 F 5 U 31 CableID 20220091... 
[17:10:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10097097 (10VRiley-WMF) [17:11:34] (03CR) 10Hnowlan: [C:03+1] php8.1-fpm: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064815 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [17:12:29] (03CR) 10Subramanya Sastry: [C:03+1] Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [17:13:41] (03CR) 10JHathaway: "Though the PCC diff shows the file as being base64 encoded, I have confirmed that this is only how it is displayed in the catalog. The con" [puppet] - 10https://gerrit.wikimedia.org/r/1065284 (https://phabricator.wikimedia.org/T372667) (owner: 10JHathaway) [17:13:50] !log T364368 Ran puppet on `A:lvs-secondary-eqiad` and restarted pybal.service [17:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:54] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [17:16:00] (03CR) 10Hnowlan: [C:03+1] php8.1-fpm-multiversion-base: initial release of 8.1-based image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1064816 (https://phabricator.wikimedia.org/T372602) (owner: 10Scott French) [17:16:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs1023.eqiad.wmnet are marked down but pooled: wdqs-main_443: Servers wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:16:21] ^ looking into it [17:16:21] ^ known [17:17:12] FIRING: [2x] ProbeDown: Service wdqs-main:443 has failed probes (http_wdqs-main_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:17:23] all good ^ [17:17:25] ACKing [17:17:27] !incidents [17:17:28] 5120 (UNACKED) [2x] ProbeDown sre (ip4 probes/service codfw) [17:17:31] !ack 5120 [17:17:31] 5120 (ACKED) [2x] ProbeDown sre (ip4 probes/service codfw) [17:17:44] (03PS1) 10Bking: wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) [17:17:50] (03PS1) 10Jdlrobson: Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 [17:18:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (owner: 10Jdlrobson) [17:18:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (owner: 10Jdlrobson) [17:18:11] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:18:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) (owner: 10Jdlrobson) [17:18:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [17:20:07] PROBLEM - Host an-worker1165 is DOWN: PING 
CRITICAL - Packet loss = 100% [17:22:49] (03CR) 10Ssingh: [C:03+1] wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [17:23:23] (03CR) 10Bking: [C:03+2] wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [17:23:26] (03CR) 10Ryan Kemper: [C:03+2] wdqs-main, wdqs-scholarly: use HTTPS for health check [puppet] - 10https://gerrit.wikimedia.org/r/1067388 (https://phabricator.wikimedia.org/T364368) (owner: 10Bking) [17:23:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T370903)', diff saved to https://phabricator.wikimedia.org/P67951 and previous config saved to /var/cache/conftool/dbconfig/20240827-172339-ladsgroup.json [17:23:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [17:23:43] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:23:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [17:24:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T370903)', diff saved to https://phabricator.wikimedia.org/P67952 and previous config saved to /var/cache/conftool/dbconfig/20240827-172401-ladsgroup.json [17:24:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T371742)', diff saved to https://phabricator.wikimedia.org/P67953 and previous config saved to /var/cache/conftool/dbconfig/20240827-172436-ladsgroup.json [17:24:41] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:24:54] !log T364368 `ryankemper@cumin2002:~$ 
sudo cumin 'A:lvs-secondary-eqiad' 'systemctl status pybal.service'` [17:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:58] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [17:25:19] (03PS3) 10JHathaway: puppet8: remove ssl_keystore_location, always set ssl_key_password [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) [17:25:26] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1065283 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [17:25:42] (03CR) 10Btullis: [C:03+1] Upgrade airflow to 2.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [17:26:29] (03PS1) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) [17:26:46] (03PS2) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) [17:27:13] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:27:31] (03CR) 10Btullis: [C:03+1] "Nice. 
Should we add a global silence to alertmanager while we are still testing, or will we just all remember that these are pre-productio" [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [17:27:57] (03PS3) 10Gergő Tisza: Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) [17:29:41] !log sukhe@lvs1020:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.36:80 [17:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:24] !log sukhe@lvs1020:~$ sudo ipvsadm --delete-service --tcp-service 10.2.2.33:80 [17:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:43] !log force recheck on Icinga for lvs1020 [17:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T370903)', diff saved to https://phabricator.wikimedia.org/P67954 and previous config saved to /var/cache/conftool/dbconfig/20240827-173132-ladsgroup.json [17:31:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:33:03] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:35:46] (03PS5) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) [17:37:18] !log T364368 Ran puppet on `A:lvs-low-traffic-eqiad` and restarted `pybal.service` [17:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:22] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - 
https://phabricator.wikimedia.org/T364368 [17:38:15] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:38:51] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:39:01] ^ expected [17:39:02] ^ known, cleaning these up [17:39:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67956 and previous config saved to /var/cache/conftool/dbconfig/20240827-173944-ladsgroup.json [17:40:26] !log T364368 Cleared away old ipvs entries for `10.2.2.33:80` and `10.2.2.36:80` [17:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:45] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:41:46] !log Forced recheck on lvs2019 to clear alert [17:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:05] !log Typo, meant to say forced recheck on `lvs1019` to clear alert [17:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:49] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002" [17:43:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002" [17:43:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:44:33] (03CR) 10EoghanGaffney: [C:03+1] vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [17:46:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff 
saved to https://phabricator.wikimedia.org/P67957 and previous config saved to /var/cache/conftool/dbconfig/20240827-174639-ladsgroup.json [17:47:45] !log T364368 Ran puppet on `A:lvs-secondary-codfw`, restarted `pybal.service`, and cleared away old ipvs entries for `10.2.1.33:80` and `10.2.1.36:80` [17:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:49] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [17:48:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:55] !log T364368 Ran puppet on `A:lvs-low-traffic-codfw`, restarted `pybal.service`, and cleared away old ipvs entries for `10.2.1.33:80` and `10.2.1.36:80` [17:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:10] !log T364368 Our LVS operation is done; I've enabled/ran puppet on the remaining lvs hosts [17:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:15] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [17:54:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P67959 and previous config saved to /var/cache/conftool/dbconfig/20240827-175451-ladsgroup.json [17:58:42] RESOLVED: [2x] ProbeDown: Service wdqs-main:443 has failed probes (http_wdqs-main_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:53] inflatador: ryankemper: ^ [18:00:28] (resolved) [18:01:12] 06SRE, 10SRE-Access-Requests: Requesting access 
to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10097355 (10thcipriani) Approved! [18:01:41] sukhe ACK, thanks again! [18:01:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P67960 and previous config saved to /var/cache/conftool/dbconfig/20240827-180146-ladsgroup.json [18:01:50] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10097356 (10ssingh) [18:02:45] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve1009 [18:02:52] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10097358 (10ssingh) @JayCano: This requires your approval since we already have Tyler's. Thanks! [18:03:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve1009 [18:04:30] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve1010 [18:04:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve1010 [18:04:38] (03PS1) 10C. Scott Ananian: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) [18:04:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve1011 [18:04:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. 
Scott Ananian) [18:05:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve1011 [18:05:10] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1001 [18:05:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ml-lab1001 [18:05:15] (03PS1) 10C. Scott Ananian: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) [18:05:16] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1002 [18:05:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1002 [18:05:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian) [18:05:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1009 [18:06:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:06:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. 
Scott Ananian) [18:06:54] (03PS1) 10Ssingh: admin: add mszabo to deployment and move from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067397 (https://phabricator.wikimedia.org/T373426) [18:06:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1009 [18:08:46] (03PS1) 10C. Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) [18:09:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [18:09:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T371742)', diff saved to https://phabricator.wikimedia.org/P67961 and previous config saved to /var/cache/conftool/dbconfig/20240827-180958-ladsgroup.json [18:09:59] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1001 [18:10:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [18:10:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:10:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1001 [18:10:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [18:10:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T371742)', diff saved to https://phabricator.wikimedia.org/P67962 and previous config saved to 
/var/cache/conftool/dbconfig/20240827-181020-ladsgroup.json [18:11:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T370903)', diff saved to https://phabricator.wikimedia.org/P67963 and previous config saved to /var/cache/conftool/dbconfig/20240827-181653-ladsgroup.json [18:16:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:16:56] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10097398 (10Dwisehaupt) a:05Dwisehaupt→03Papaul Assigning to @Papaul for payments2006 setup. Assign back to me when it's ready for OS install an... 
[18:16:59] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:17:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:17:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:17:13] (03CR) 10Scott French: [C:03+1] aptrepo: add ffmpeg buster component [puppet] - 10https://gerrit.wikimedia.org/r/1067384 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [18:17:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:17:31] (03PS1) 10Zabe: Update uzwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067400 (https://phabricator.wikimedia.org/T370165) [18:17:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T370903)', diff saved to https://phabricator.wikimedia.org/P67964 and previous config saved to /var/cache/conftool/dbconfig/20240827-181732-ladsgroup.json [18:19:07] (03CR) 10Hnowlan: [C:03+2] aptrepo: add ffmpeg buster component [puppet] - 10https://gerrit.wikimedia.org/r/1067384 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [18:19:30] (03PS1) 10Ebernhardson: search update pipeline: correctly handle redirect updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067401 [18:20:20] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10097416 (10Dwisehaupt) a:05Dwisehaupt→03Papaul Assigning to @Papaul for frdb2005 setup. Assign back to me when it's ready for OS install and setup. 
[18:25:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T370903)', diff saved to https://phabricator.wikimedia.org/P67965 and previous config saved to /var/cache/conftool/dbconfig/20240827-182531-ladsgroup.json [18:25:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:28:20] (03CR) 10Ebernhardson: [C:03+2] search update pipeline: correctly handle redirect updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067401 (owner: 10Ebernhardson) [18:29:23] (03Merged) 10jenkins-bot: search update pipeline: correctly handle redirect updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067401 (owner: 10Ebernhardson) [18:33:32] !log ebernhardson@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:33:37] !log ebernhardson@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:38:31] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:38:37] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:40:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P67966 and previous config saved to /var/cache/conftool/dbconfig/20240827-184039-ladsgroup.json [18:47:00] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [18:49:12] (03PS4) 10Jdlrobson: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) [18:49:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:49:59] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [18:55:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [18:55:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P67967 and previous config saved to /var/cache/conftool/dbconfig/20240827-185546-ladsgroup.json [18:58:37] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:01:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002" [19:01:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ml-lab servers - jclark@cumin1002" [19:01:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[19:01:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1001 [19:01:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1001 [19:04:45] RECOVERY - Uncommitted DNS changes in Netbox on netbox1003 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:05:05] (03CR) 10BCornwall: "Seems nobody piped up." [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [19:05:15] (03PS1) 10Mstyles: security-landing-page: bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) [19:07:33] (03CR) 10SBassett: [C:03+1] "Verified image id." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) (owner: 10Mstyles) [19:10:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T370903)', diff saved to https://phabricator.wikimedia.org/P67968 and previous config saved to /var/cache/conftool/dbconfig/20240827-191053-ladsgroup.json [19:10:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance [19:10:58] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:11:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance [19:11:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T370903)', diff saved to https://phabricator.wikimedia.org/P67969 and previous config saved to /var/cache/conftool/dbconfig/20240827-191116-ladsgroup.json [19:19:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 
(T370903)', diff saved to https://phabricator.wikimedia.org/P67970 and previous config saved to /var/cache/conftool/dbconfig/20240827-191915-ladsgroup.json [19:19:21] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:25:28] (03CR) 10Zabe: [C:03+2] Update uzwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067400 (https://phabricator.wikimedia.org/T370165) (owner: 10Zabe) [19:26:11] (03Merged) 10jenkins-bot: Update uzwiki logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067400 (https://phabricator.wikimedia.org/T370165) (owner: 10Zabe) [19:26:56] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067400|Update uzwiki logo (T370165)]] [19:27:03] T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165 [19:30:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1025.eqiad.wmnet, kubernetes1023.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1408.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, kubernetes1017.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, mw1394.eqiad.wmnet, mw1360.eqiad.wmnet, parse1012.eqiad.wmnet, kubernetes1015.eqiad.wmnet, mw1352.eqiad.wmnet, parse1006.eqiad [19:30:17] mw1355.eqiad.wmnet, mw1472.eqiad.wmnet, kubernetes1026.eqiad.wmnet, mw1409.eqiad.wmnet, mw1383.eqiad.wmnet, wikikube-worker1032.eqiad.wmnet, mw1416.eqiad.wmnet, kubernetes1054.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, parse1014.eqiad.wmnet, mw1478.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1040.eqiad.wmnet, wikikube-worker1012.eqiad.wmnet, kubernetes1016.eqiad.wmnet, mw1461.eqiad.wmnet, [19:30:17] qiad.wmnet, wikikube-worker1017.eqiad.wmnet, mw1423.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, wikikube-worker1021.eqiad.wmnet, mw1399.eqiad.wmnet, 
https://wikitech.wikimedia.org/wiki/PyBal [19:30:23] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers wikikube-worker1012.eqiad.wmnet, wikikube-worker1028.eqiad.wmnet, mw1409.eqiad.wmnet, kubernetes1036.eqiad.wmnet, parse1007.eqiad.wmnet, mw1457.eqiad.wmnet, mw1455.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1475.eqiad.wmnet, mw1374.eqiad.wmnet, kubernetes1062.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubernetes1037.eqiad.wmne [19:30:23] 4.eqiad.wmnet, wikikube-worker1032.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1482.eqiad.wmnet, kubernetes1040.eqiad.wmnet, mw1495.eqiad.wmnet, parse1024.eqiad.wmnet, wikikube-worker1017.eqiad.wmnet, mw1477.eqiad.wmnet, mw1423.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, mw1394.eqiad.wmnet, mw1385.eqiad.wmnet, mw1452.eqiad.wmnet, mw1422.eqiad.wmnet, mw1361.eqiad.wmnet, parse1008.eqiad.wmnet [19:30:23] be-worker1027.eqiad.wmnet, kubernetes1009.eqiad.wmnet, mw1448.eqiad.wmnet, wikikube-worker1030.eqiad.wmnet, mw1421.eqiad.wmnet, mw1377.eqiad.wmnet, kubernetes1029.eqiad.wmnet, parse1004 https://wikitech.wikimedia.org/wiki/PyBal [19:31:05] woah [19:31:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:31:23] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:31:26] what is this about [19:31:33] ah [19:32:28] (03PS2) 10Hashar: Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson) [19:33:03] (03CR) 10Hashar: "I have attached it to T373463 with:" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson) 
[19:34:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P67971 and previous config saved to /var/cache/conftool/dbconfig/20240827-193424-ladsgroup.json [19:37:47] !log zabe@deploy1003 zabe: Backport for [[gerrit:1067400|Update uzwiki logo (T370165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:37:48] FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:37:51] T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165 [19:38:46] !log zabe@deploy1003 zabe: Continuing with sync [19:44:04] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067400|Update uzwiki logo (T370165)]] (duration: 17m 07s) [19:44:08] T370165: Proposed Revisions to the Uzbek Wikipedia Logo - https://phabricator.wikimedia.org/T370165 [19:49:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P67972 and previous config saved to /var/cache/conftool/dbconfig/20240827-194930-ladsgroup.json [19:49:36] (03PS1) 10Scott French: kubernetes: re-name/IP kubernetes2026 as wikikube-worker2046 [puppet] - 10https://gerrit.wikimedia.org/r/1067414 (https://phabricator.wikimedia.org/T372878) [19:51:39] (03CR) 10Zabe: [C:03+2] Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson) [19:57:30] (03PS6) 10Pppery: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) [19:58:15] (03PS1) 10Dzahn: prometheus/gerrit: also add size of 
tracking list to exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067415 (https://phabricator.wikimedia.org/T373136) [19:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [19:59:44] (03CR) 10Zabe: [C:03+2] Turn account vanishing contact form into a redirect. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant) [19:59:46] (03CR) 10Zabe: [C:03+2] Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T2000). [20:00:05] dbrant, Pppery, Jdlrobson, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] I can deploy [20:00:08] here [20:00:12] o/ [20:00:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant) [20:00:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery) [20:00:25] (03Merged) 10jenkins-bot: Turn account vanishing contact form into a redirect. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1065189 (https://phabricator.wikimedia.org/T372828) (owner: 10Dbrant) [20:00:37] (03Merged) 10jenkins-bot: Revert "[svwikt] Add a temporary logo for the 100.000 pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066903 (https://phabricator.wikimedia.org/T364247) (owner: 10Pppery) [20:00:56] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1065189|Turn account vanishing contact form into a redirect. (T372828)]], [[gerrit:1066903|Revert "[svwikt] Add a temporary logo for the 100.000 pages" (T364247)]] [20:01:05] T372828: Redirect old vanishing form to new one - https://phabricator.wikimedia.org/T372828 [20:01:06] T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247 [20:01:41] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:01:48] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:02:40] o/ [20:03:43] (03PS2) 10Dzahn: codesearch: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677) [20:04:08] jouncebot: now [20:04:08] For the next 0 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240827T2000) [20:04:13] !log zabe@deploy1003 dbrant, zabe, pppery: Backport for [[gerrit:1065189|Turn account vanishing contact form into a redirect. (T372828)]], [[gerrit:1066903|Revert "[svwikt] Add a temporary logo for the 100.000 pages" (T364247)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:04:15] Pppery: dbrant: can you test? 
[20:04:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T370903)', diff saved to https://phabricator.wikimedia.org/P67973 and previous config saved to /var/cache/conftool/dbconfig/20240827-200437-ladsgroup.json [20:04:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [20:04:41] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:04:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [20:05:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [20:05:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T370903)', diff saved to https://phabricator.wikimedia.org/P67974 and previous config saved to /var/cache/conftool/dbconfig/20240827-200459-ladsgroup.json [20:05:06] I had to bypass my browser cache to get the new logo to show, but it seems to work [20:05:28] mine looks good [20:05:41] alright [20:05:42] !log zabe@deploy1003 dbrant, zabe, pppery: Continuing with sync [20:06:16] zabe: oi [20:06:19] zabe: i'm here [20:06:30] hello hello [20:06:38] if you want to you can run a command to purge the logo from caches [20:06:44] (03CR) 10RLazarus: [C:03+1] kubernetes: re-name/IP kubernetes2026 as wikikube-worker2046 [puppet] - 10https://gerrit.wikimedia.org/r/1067414 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [20:06:48] but just waiting will also work [20:07:00] RESOLVED: 
CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [20:07:08] (03CR) 10Zabe: [C:03+2] Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) (owner: 10Jdlrobson) [20:07:48] since the old logo was located at a different url, it should work without purging, I guess? [20:07:54] (03CR) 10Dzahn: [C:03+2] codesearch: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:08:07] cscott: do your changes depend on each other? [20:08:15] zabe: yes, true. in that case [20:08:24] (03Merged) 10jenkins-bot: Disable mobile Watchlist on wikidata since its broken [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057026 (https://phabricator.wikimedia.org/T263633) (owner: 10Jdlrobson) [20:08:50] zabe the last two do: the ParserMigration extension patch needs to be backported before the config change is made [20:09:07] zabe: the first two just prevent logspam and can be done in any order [20:09:19] alright [20:09:25] (03CR) 10Zabe: [C:03+2] Tweak styling of compact Parsoid indicator [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:09:26] (03CR) 10Zabe: [C:03+2] Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. 
Scott Ananian) [20:09:27] (03CR) 10Zabe: [C:03+2] Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian) [20:11:12] Jdlrobson: your changes do not depend on each other, do they? [20:11:23] zabe: nope [20:11:30] can go out separately or together.. whatever is easiest [20:12:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1065189|Turn account vanishing contact form into a redirect. (T372828)]], [[gerrit:1066903|Revert "[svwikt] Add a temporary logo for the 100.000 pages" (T364247)]] (duration: 11m 28s) [20:12:30] T372828: Redirect old vanishing form to new one - https://phabricator.wikimedia.org/T372828 [20:12:30] T364247: Requesting temporary logo change for sv.wiktionary.org - https://phabricator.wikimedia.org/T364247 [20:12:40] ok, lets start with your config patch then - the other one is still running ci [20:12:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1057026|Disable mobile Watchlist on wikidata since its broken (T263633)]] [20:12:51] T263633: Mobile Special:EditWatchlist displays item IDs instead of labels - https://phabricator.wikimedia.org/T263633 [20:12:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T370903)', diff saved to https://phabricator.wikimedia.org/P67975 and previous config saved to /var/cache/conftool/dbconfig/20240827-201256-ladsgroup.json [20:13:00] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:13:04] dbrant: Pppery: your changes should be live [20:14:04] thanks [20:14:29] Although it's not really my change - I just shepherd it through the process after seeing it languish in Phabricator for weeks [20:14:36] yeah fair [20:15:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [20:15:04] !log zabe@deploy1003 jdlrobson, zabe: Backport for [[gerrit:1057026|Disable mobile Watchlist on wikidata since its broken (T263633)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:07] Jdlrobson: is your config patch testable? [20:15:34] zabe: yep [20:15:39] let me know when its on debug [20:15:53] (03CR) 10Dzahn: [C:03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:16:42] i see it is - zabe looks good - please sync! [20:17:53] alright [20:17:55] !log zabe@deploy1003 jdlrobson, zabe: Continuing with sync [20:18:40] (03CR) 10Dzahn: [V:03+1] releases: upgrade Java JDK version from 11 to 17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064437 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [20:22:27] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1057026|Disable mobile Watchlist on wikidata since its broken (T263633)]] (duration: 09m 39s) [20:22:31] T263633: Mobile Special:EditWatchlist displays item IDs instead of labels - https://phabricator.wikimedia.org/T263633 [20:22:51] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:22:55] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:22:56] (03PS3) 10JHathaway: puppet8: mtail, check if notify is defined [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) [20:23:03] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:24:22] (03Merged) 10jenkins-bot: Revert "Allow gadget/browser extension extensibility of empty search state" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson) [20:24:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067389 (https://phabricator.wikimedia.org/T373463) (owner: 10Jdlrobson) [20:24:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:24:27] (03Merged) 10jenkins-bot: Tweak styling of compact Parsoid indicator [extensions/ParserMigration] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1066882 (https://phabricator.wikimedia.org/T372789) (owner: 10C. 
Scott Ananian) [20:24:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067389|Revert "Allow gadget/browser extension extensibility of empty search state" (T373463)]], [[gerrit:1066882|Tweak styling of compact Parsoid indicator (T372789)]] [20:24:56] T373463: Text "empty" appears after search input when first clicking into it - https://phabricator.wikimedia.org/T373463 [20:24:56] T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789 [20:27:04] !log bking@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2043.codfw.wmnet [20:27:37] !log bking@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2043.codfw.wmnet [20:27:43] !log zabe@deploy1003 cscott, zabe, jdlrobson: Backport for [[gerrit:1067389|Revert "Allow gadget/browser extension extensibility of empty search state" (T373463)]], [[gerrit:1066882|Tweak styling of compact Parsoid indicator (T372789)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P67976 and previous config saved to /var/cache/conftool/dbconfig/20240827-202803-ladsgroup.json [20:28:05] Jdlrobson: your backport is at mwdebug [20:28:57] cscott: is the parsoid indicator backport testable? [20:29:07] only testable after the config deploy, alas. 
[20:29:12] alright [20:29:17] then I will just sync it [20:29:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T371742)', diff saved to https://phabricator.wikimedia.org/P67977 and previous config saved to /var/cache/conftool/dbconfig/20240827-202954-ladsgroup.json [20:29:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:30:05] i verified that it doesn't horribly crash anything at least :) [20:30:15] good [20:30:49] (by loading en.wikivoyage.org on the debug servers, which has parsoid read views on by default and the old indicator style, which is not/should not be affected by the backported patch) [20:30:54] Jdlrobson: I quickly tried testing your patch myself, but unless I am doing something wrong on testwiki, it does not seem to fix the issue [20:32:05] zabe: if the Kartographer patches are live I can try to verify the absence of logspam [20:32:30] zabe: (looking) [20:33:09] zabe: lgtm - perhaps you are getting cached JS or CSS? [20:33:14] this looks good to sync to me! 
[20:33:43] oh yeah - clearing browser cache fixed it [20:33:48] cool [20:33:49] !log zabe@deploy1003 cscott, zabe, jdlrobson: Continuing with sync [20:37:21] (03CR) 10AOkoth: [C:03+2] security-landing-page: bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) (owner: 10Mstyles) [20:38:10] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067389|Revert "Allow gadget/browser extension extensibility of empty search state" (T373463)]], [[gerrit:1066882|Tweak styling of compact Parsoid indicator (T372789)]] (duration: 13m 23s) [20:38:15] T373463: Text "empty" appears after search input when first clicking into it - https://phabricator.wikimedia.org/T373463 [20:38:16] T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789 [20:38:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian) [20:38:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. 
Scott Ananian) [20:38:29] (03CR) 10Srishakatux: "I checked with @amir.aharoni@mail.huji.ac.il and as per his feedback this is not needed as the `core-Namespaces.php` is for aliases and ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [20:38:33] (03PS4) 10Srishakatux: Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [20:38:36] (03Merged) 10jenkins-bot: security-landing-page: bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067406 (https://phabricator.wikimedia.org/T372829) (owner: 10Mstyles) [20:39:01] (03PS27) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [20:40:40] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [20:40:58] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [20:43:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P67978 and previous config saved to /var/cache/conftool/dbconfig/20240827-204310-ladsgroup.json [20:44:13] (03PS3) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) [20:44:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [20:45:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67979 and previous config saved to /var/cache/conftool/dbconfig/20240827-204501-ladsgroup.json [20:45:13] (03Merged) 10jenkins-bot: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067396 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian) [20:45:14] (03Merged) 10jenkins-bot: Remove warning on non-existing category [extensions/Kartographer] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1067395 (https://phabricator.wikimedia.org/T373454) (owner: 10C. Scott Ananian) [20:45:34] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067396|Remove warning on non-existing category (T373454)]], [[gerrit:1067395|Remove warning on non-existing category (T373454)]] [20:45:39] T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454 [20:48:29] !log zabe@deploy1003 cscott, zabe: Backport for [[gerrit:1067396|Remove warning on non-existing category (T373454)]], [[gerrit:1067395|Remove warning on non-existing category (T373454)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:49:15] zabe: i'm watching the logs and i don't see any of the canaries, but i don't know how long i'd have to watch to be sure of that. [20:49:22] !log zabe@deploy1003 cscott, zabe: Continuing with sync [20:49:31] zabe: yeah, great. 
[20:49:41] !log mstyles@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [20:49:50] I would just sync through and keep a look at the logs while doing that [20:50:02] (03CR) 10Zabe: [C:03+2] Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:50:22] thanks zabe for the help today! [20:50:28] yw [20:50:37] https://logstash.wikimedia.org/goto/fe96b774b9ec8273a41333a492b8dcb2 is what i'm looking at [20:51:01] !log mstyles@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [20:51:43] !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:52:14] !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:52:18] !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:52:47] !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:52:55] !log mstyles@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:52:57] !log mstyles@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:53:04] !log mstyles@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:53:06] !log mstyles@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:53:17] zabe i added one more config patch which i'd missed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067381 sorry about that. it's belt and suspenders for the commonswiki logspam, but also avoids some crashes on hewiki. [20:53:38] (03PS2) 10C. 
Scott Ananian: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) [20:53:45] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067396|Remove warning on non-existing category (T373454)]], [[gerrit:1067395|Remove warning on non-existing category (T373454)]] (duration: 08m 11s) [20:53:47] (03CR) 10Zabe: [C:03+2] Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. Scott Ananian) [20:53:49] T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454 [20:53:57] (03PS4) 10Isabelle Hurbain-Palatin: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) [20:53:58] (03CR) 10Zabe: [C:03+2] Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [20:54:20] zabe: the commonswiki logspam seems to have stopped, yay [20:54:34] (03Merged) 10jenkins-bot: Activates the "compact" Parsoid indicator on all wikivoyage wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067398 (https://phabricator.wikimedia.org/T372789) (owner: 10C. 
Scott Ananian) [20:54:40] cool [20:54:42] (03Merged) 10jenkins-bot: Rollback Parsoid+Kartographer rollout on hewiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [20:55:08] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1067398|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]], [[gerrit:1067381|Rollback Parsoid+Kartographer rollout on hewiki and commons (T373454 T373460)]] [20:55:14] T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789 [20:55:15] T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460 [20:57:13] !log zabe@deploy1003 ihurbain, zabe, cscott: Backport for [[gerrit:1067398|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]], [[gerrit:1067381|Rollback Parsoid+Kartographer rollout on hewiki and commons (T373454 T373460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:57:28] cscott: both config patches are at mwdebug [20:57:46] (03PS4) 10JHathaway: puppet8: mtail, check if notify is defined [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) [20:58:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [20:58:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T370903)', diff saved to https://phabricator.wikimedia.org/P67980 and previous config saved to /var/cache/conftool/dbconfig/20240827-205817-ladsgroup.json [20:58:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:58:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, 
and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:58:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:58:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:58:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:58:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T370903)', diff saved to https://phabricator.wikimedia.org/P67981 and previous config saved to /var/cache/conftool/dbconfig/20240827-205855-ladsgroup.json [20:58:56] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2026.codfw.wmnet [20:59:29] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2026.codfw.wmnet [20:59:36] zabe: ok testing. [21:00:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67982 and previous config saved to /var/cache/conftool/dbconfig/20240827-210008-ladsgroup.json [21:00:27] (03CR) 10Subramanya Sastry: Rollback Parsoid+Kartographer rollout on hewiki and commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [21:01:03] zabe ok, verified the kartographer/hewiki one. checking the other. [21:01:30] zabe: yep, that looks good to. 
good to sync [21:01:31] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.461s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:01:34] *too [21:01:36] alright [21:01:39] !log zabe@deploy1003 ihurbain, zabe, cscott: Continuing with sync [21:02:13] (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP kubernetes2026 as wikikube-worker2046 [puppet] - 10https://gerrit.wikimedia.org/r/1067414 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [21:02:40] subbu: w/ x-wikimedia-debug on, https://en.wikivoyage.org/wiki/Windsor_(Ontario) should have a compact parsoid indicator and https://he.wikipedia.org/wiki/%D7%9E%D7%92%D7%93%D7%9C_%D7%93%D7%9E%D7%A8%D7%99 should render in parsoid read views w/o crashing. [21:03:04] subbu: not fully synced yet, just on canaries so far [21:03:12] is than an fyi or do you want me to verify? [21:03:17] *that [21:03:44] subbu: yes? i was giving you an fyi so that if you wanted to verify you could, or you could test some urls other than the one I did :) [21:03:50] but it looks good to me [21:04:36] should be good if you tested it. [21:06:03] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067398|Activates the "compact" Parsoid indicator on all wikivoyage wikis (T372789)]], [[gerrit:1067381|Rollback Parsoid+Kartographer rollout on hewiki and commons (T373454 T373460)]] (duration: 10m 55s) [21:06:05] cscott, but fyi reg https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1067381 .. i left a comment there. you could not disable it on commons. 
[21:06:06] should be live [21:06:09] T372789: Compact Parsoid indicator for ParserMigration for wikivoyage - https://phabricator.wikimedia.org/T372789 [21:06:09] T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454 [21:06:09] T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460 [21:06:22] subbu: yeah, but i figured belt-and-suspenders [21:06:28] okay. :) [21:06:31] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.461s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:06:34] subbu: i verified that the logspam stopped on commons before we deployed that [21:06:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T370903)', diff saved to https://phabricator.wikimedia.org/P67983 and previous config saved to /var/cache/conftool/dbconfig/20240827-210646-ladsgroup.json [21:06:48] k [21:06:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:07:01] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2026 to wikikube-worker2046 [21:07:21] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:08:15] (03CR) 10C. 
Scott Ananian: Rollback Parsoid+Kartographer rollout on hewiki and commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067381 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [21:11:11] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2026 to wikikube-worker2046 - swfrench@cumin2002" [21:11:50] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2026 to wikikube-worker2046 - swfrench@cumin2002" [21:11:50] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:11:51] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2046 [21:12:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2046 [21:12:59] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2026 to wikikube-worker2046 [21:13:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes... 
[21:13:51] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2046.codfw.wmnet on all recursors [21:13:54] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2046.codfw.wmnet on all recursors [21:14:58] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2046.codfw.wmnet with OS bullseye [21:15:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098132 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w... [21:15:10] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host [21:15:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T371742)', diff saved to https://phabricator.wikimedia.org/P67984 and previous config saved to /var/cache/conftool/dbconfig/20240827-211516-ladsgroup.json [21:15:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance [21:15:20] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:15:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Maintenance [21:15:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T371742)', diff saved to https://phabricator.wikimedia.org/P67985 and previous config saved to /var/cache/conftool/dbconfig/20240827-211538-ladsgroup.json [21:15:54] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:18:58] (03PS1) 10JHathaway: puppet8: add phd_pass [labs/private] - 10https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) [21:19:37] (03CR) 10JHathaway: [C:03+2] puppet8: add phd_pass [labs/private] - 
10https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [21:19:40] (03CR) 10JHathaway: [V:03+2 C:03+2] puppet8: add phd_pass [labs/private] - 10https://gerrit.wikimedia.org/r/1067430 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [21:20:03] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2046 - swfrench@cumin2002" [21:20:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2046 - swfrench@cumin2002" [21:20:09] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:10] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2046.codfw.wmnet 69.0.192.10.in-addr.arpa 9.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:20:12] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2046.codfw.wmnet 69.0.192.10.in-addr.arpa 9.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:20:14] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2046 [21:20:20] zabe: thanks! i forgot to say thank you! 
[21:20:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2046 [21:20:49] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [21:21:01] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10098141 (10jhathaway) [21:21:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P67986 and previous config saved to /var/cache/conftool/dbconfig/20240827-212153-ladsgroup.json [21:23:19] (03CR) 10JHathaway: "The code is a bit ugly, the other option is changing all the mtail define types to add a new parameter, rather than a metaparameter." [puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [21:29:32] (03CR) 10Cwhite: "Worth keeping the pattern around for possible use in the future, but probably not needed now since we finished the restarts today?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: 10Tiziano Fogli) [21:35:16] (03PS5) 10Srishakatux: Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [21:36:15] (03CR) 10Amire80: [C:03+1] Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [21:37:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P67987 and previous config saved to /var/cache/conftool/dbconfig/20240827-213700-ladsgroup.json [21:38:38] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage [21:41:48] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage [21:52:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T370903)', diff saved to https://phabricator.wikimedia.org/P67988 and previous config saved to /var/cache/conftool/dbconfig/20240827-215208-ladsgroup.json [21:52:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [21:52:14] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:52:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [21:52:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67989 and previous config saved to /var/cache/conftool/dbconfig/20240827-215230-ladsgroup.json [21:57:03] PROBLEM - mailman archives on lists1004 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:47] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:59:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67990 and previous config saved to /var/cache/conftool/dbconfig/20240827-215958-ladsgroup.json [22:00:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:01:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.283 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:01:33] (03PS1) 10GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) [22:01:37] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 12:50:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:01:55] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2046.codfw.wmnet with OS bullseye [22:01:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:02:06] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik... [22:02:28] (03CR) 10CI reject: [V:04-1] Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [22:02:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [22:04:48] !log Running homer 'lsw1-a8-codfw*' commit 'T372878' [22:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [22:06:35] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2046.codfw.wmnet [22:06:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2046.codfw.wmnet [22:07:09] !log pooled / uncordoned wikikube-worker2046.codfw.wmnet - T372878 [22:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:08:58] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373491 (10Scott_French) 03NEW [22:09:41] 10ops-magru, 06SRE: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371618#10098219 (10RobH) 05Open→03Declined Dupe of T371554, issue being tracked there [22:11:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P67991 and previous config saved to /var/cache/conftool/dbconfig/20240827-221506-ladsgroup.json [22:15:19] !log running homer 'cr*codfw*' commit 'T372878' [22:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:23] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [22:15:48] (03PS2) 10GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) [22:20:33] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 457, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:25:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [22:27:21] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 539, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:30:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P67992 and previous config saved to /var/cache/conftool/dbconfig/20240827-223013-ladsgroup.json [22:41:51] (03PS1) 10Scott French: sre.hosts.move-vlan: use name property in runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1067440 [22:45:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T370903)', diff saved to https://phabricator.wikimedia.org/P67993 and previous config saved to /var/cache/conftool/dbconfig/20240827-224520-ladsgroup.json [22:45:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:45:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:45:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:45:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T370903)', diff saved to https://phabricator.wikimedia.org/P67994 and previous config saved to /var/cache/conftool/dbconfig/20240827-224542-ladsgroup.json [22:53:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T370903)', diff saved to https://phabricator.wikimedia.org/P67995 and previous config saved to /var/cache/conftool/dbconfig/20240827-225332-ladsgroup.json [22:53:37] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:08:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P67996 and previous config saved to 
/var/cache/conftool/dbconfig/20240827-230839-ladsgroup.json [23:23:31] (03CR) 10Bartosz Dziewoński: ""Audit" is a big word, I was just trying to comprehend it and I tried to simplify some parts that defied comprehension. I didn't like this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński) [23:23:47] (03Abandoned) 10Bartosz Dziewoński: wikitech: Remove LDAP debug logging disabled since 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066899 (owner: 10Bartosz Dziewoński) [23:23:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P67997 and previous config saved to /var/cache/conftool/dbconfig/20240827-232346-ladsgroup.json [23:26:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T371742)', diff saved to https://phabricator.wikimedia.org/P67998 and previous config saved to /var/cache/conftool/dbconfig/20240827-232653-ladsgroup.json [23:26:57] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:27:34] (03CR) 10Andrea Denisse: [C:03+2] alert: Update alertmanager tests hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [23:38:18] FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:38:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1067450 [23:38:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1067450 (owner: 10TrainBranchBot) [23:38:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2182 (T370903)', diff saved to https://phabricator.wikimedia.org/P67999 and previous config saved to /var/cache/conftool/dbconfig/20240827-233854-ladsgroup.json [23:38:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [23:38:58] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:39:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [23:41:35] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:42:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P68000 and previous config saved to /var/cache/conftool/dbconfig/20240827-234200-ladsgroup.json [23:42:01] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.31 ms [23:46:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [23:46:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [23:54:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:54:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:54:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T370903)', diff saved to https://phabricator.wikimedia.org/P68001 and previous config saved to /var/cache/conftool/dbconfig/20240827-235426-ladsgroup.json [23:54:31] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - 
https://phabricator.wikimedia.org/T370903 [23:57:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P68002 and previous config saved to /var/cache/conftool/dbconfig/20240827-235707-ladsgroup.json [23:59:27] FYI, I'm looking into those KubernetesCalicoDown alerts. these are a little surprising, as they correspond to the old names of two nodes in various stages of rename/reimage (one of which ostensibly finished). [23:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections