[00:02:13] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1057416 (owner: 10TrainBranchBot) [00:06:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:22:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66980 and previous config saved to /var/cache/conftool/dbconfig/20240729-022221-marostegui.json [02:22:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [02:37:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P66981 and previous config saved to /var/cache/conftool/dbconfig/20240729-023728-marostegui.json [02:39:21] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:52:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P66982 and previous config saved to /var/cache/conftool/dbconfig/20240729-025235-marostegui.json [02:59:21] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:25] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:07:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66983 and previous config saved to /var/cache/conftool/dbconfig/20240729-030742-marostegui.json [03:07:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2216.codfw.wmnet with reason: Maintenance [03:07:53] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [03:07:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2216.codfw.wmnet with reason: Maintenance [03:08:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T367856)', diff saved to https://phabricator.wikimedia.org/P66984 and previous config saved to /var/cache/conftool/dbconfig/20240729-030804-marostegui.json [03:54:49] (03PS1) 10KartikMistry: Update MinT to 2024-07-24-145137-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057421 (https://phabricator.wikimedia.org/T355304) [04:05:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T371190#10021355 (10Marostegui) [04:33:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:28] 10ops-eqiad, 
06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556#10021373 (10Marostegui) [05:03:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#10021377 (10Marostegui) Thank you Papaul! [05:03:48] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556#10021374 (10Marostegui) 05Open→03Resolved a:03VRiley-WMF Thanks @VRiley-WMF - the host is now looking good [05:09:31] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#10021378 (10Marostegui) Did this server get the data checksummed or cloned before repooling it back? [05:27:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [05:32:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [05:39:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [05:44:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:18:02] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1057436 (https://phabricator.wikimedia.org/T371205) [06:19:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T371205 [06:19:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2140 with weight 0 T371205', diff saved to https://phabricator.wikimedia.org/P66987 and previous config saved to /var/cache/conftool/dbconfig/20240729-061940-root.json [06:19:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T371205 [06:21:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2140 from API/vslow/dump T371205', diff saved to https://phabricator.wikimedia.org/P66988 and previous config saved to /var/cache/conftool/dbconfig/20240729-062123-root.json [06:22:22] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1057436 
(https://phabricator.wikimedia.org/T371205) (owner: 10Gerrit maintenance bot) [06:26:05] (03PS1) 10Marostegui: db2179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1057603 [06:26:40] (03CR) 10Marostegui: [C:03+2] db2179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1057603 (owner: 10Marostegui) [06:39:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: 10DCausse) [06:42:27] !log Starting s4 codfw failover from db2179 to db2140 - T371205 [06:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:31] T371205: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T371205 [06:42:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2140 to s4 primary T371205', diff saved to https://phabricator.wikimedia.org/P66989 and previous config saved to /var/cache/conftool/dbconfig/20240729-064250-marostegui.json [06:44:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2179 T371205', diff saved to https://phabricator.wikimedia.org/P66990 and previous config saved to /var/cache/conftool/dbconfig/20240729-064405-marostegui.json [06:46:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Long schema change [06:47:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Long schema change [06:48:16] !log Deploy schema change on s4 codfw db2179 dbmaint T367856 [06:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:21] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:00:05] Amir1 and Urbanecm: May I have your 
attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T0700) [07:00:05] kart_ and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:17] o/ [07:01:17] * kart_ is here [07:02:07] I'll start with my patch. [07:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057432 (owner: 10KartikMistry) [07:02:54] (03Merged) 10jenkins-bot: Temporary disable MinT for Wikireaders for bn, fa, hi, and ko [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057432 (owner: 10KartikMistry) [07:03:27] !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] [07:13:51] (03CR) 10Giuseppe Lavagetto: [C:03+2] profile::haproxy: move tls_terminator.pp to profile module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056466 (owner: 10Giuseppe Lavagetto) [07:17:16] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: add confd_file define [puppet] - 10https://gerrit.wikimedia.org/r/1056875 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [07:18:15] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: add ability to inject requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [07:18:22] (03PS4) 10CDanis: haproxy: add ability to inject requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [07:18:25] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: add ability to inject requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: 
10Giuseppe Lavagetto) [07:19:09] !log kartik@deploy1002 kartik: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:19:09] !log kartik@deploy1002 Sync cancelled. [07:19:31] eh. Seems accidental key pressed. Retrying. [07:19:52] !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] [07:19:52] Sorry dcausse :/ [07:20:11] kart_: no worries! :) [07:25:13] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts karapace1002.eqiad.wmnet [07:25:45] !log kartik@deploy1002 kartik: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:25:49] !log kartik@deploy1002 kartik: Continuing with sync [07:29:57] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox [07:32:30] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [07:34:00] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [07:34:00] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:34:00] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts karapace1002.eqiad.wmnet [07:34:21] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts karapace1001.eqiad.wmnet [07:34:34] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] 
(duration: 14m 42s) [07:34:49] dcausse: done! [07:35:03] kart_: thanks! will deploy mine [07:37:17] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 24482 [07:38:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: 10DCausse) [07:39:02] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 24482 [07:39:17] (03Merged) 10jenkins-bot: GeoData: add pool counter settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: 10DCausse) [07:39:33] !log dcausse@deploy1002 Started scap sync-world: Backport for [[gerrit:1055890|GeoData: add pool counter settings (T370621)]] [07:39:39] T370621: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621 [07:39:55] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 24482 [07:40:40] (03PS1) 10Filippo Giunchedi: benthos: smaller batches for mw_accesslog_metrics [puppet] - 10https://gerrit.wikimedia.org/r/1057798 (https://phabricator.wikimedia.org/T369256) [07:41:34] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox [07:42:25] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10021585 (10Fabfur) 05Open→03In progress a:03Fabfur [07:42:46] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1055890|GeoData: add pool counter settings (T370621)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:42:54] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10021587 (10Fabfur) 05Open→03In progress a:03Fabfur [07:44:38] (03CR) 10Filippo Giunchedi: [C:03+1] "Good to 
go once requisites are in place" [puppet] - 10https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [07:45:30] (03Abandoned) 10Filippo Giunchedi: ignore, test [alerts] - 10https://gerrit.wikimedia.org/r/1056897 (owner: 10Filippo Giunchedi) [07:45:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:45:45] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] logstash: consume k8s logs topics [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [07:46:12] !log dcausse@deploy1002 dcausse: Continuing with sync [07:46:21] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [07:46:53] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3433/co" [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [07:47:31] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [07:47:32] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:47:32] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts karapace1001.eqiad.wmnet [07:48:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1175 - 
https://phabricator.wikimedia.org/T371190#10021604 (10Marostegui) p:05Triage→03Medium This host is probably out of warranty, but can we check if there're disks we can use somewhere? Thanks [07:49:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:48] (03PS1) 10Stevemunene: idp-test: Register airflow-analytics-test IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) [07:51:10] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1055890|GeoData: add pool counter settings (T370621)]] (duration: 11m 36s) [07:51:14] T370621: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621 [07:51:27] (03PS1) 10Brouberol: karapace: cleanup after karapace100[12] were decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1057800 (https://phabricator.wikimedia.org/T363461) [07:53:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 24482 [07:54:09] !log closing the backport window [07:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:46] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1057800 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [08:01:20] (03PS10) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [08:03:17] (03CR) 10Brouberol: [C:03+2] karapace: cleanup after karapace100[12] were decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1057800 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [08:09:44] (03CR) 10Brouberol: [C:03+1] "LGTM! 
Will the Node have an associated Kubernetes label allowing Pods to target it specifically?" [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [08:11:12] (03CR) 10Brouberol: [C:03+1] "LGTM except for a small typo" [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [08:11:23] (03CR) 10Klausman: "That will be done by pods requiring the GPU resource (which is added by the AMDGPU role). If we find that we need stricter control, we can" [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: 10Klausman) [08:12:34] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10021674 (10ayounsi) [08:13:25] RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:53] (03CR) 10Elukey: "Hey folks, I added Simon from I/F, please always involve somebody from I/F before merging changes to IDP :)" [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [08:31:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T367856)', diff saved to https://phabricator.wikimedia.org/P66991 and previous config saved to /var/cache/conftool/dbconfig/20240729-083115-marostegui.json 
[08:31:21] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:32:15] (03PS1) 10Stevemunene: dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) [08:35:33] (03PS1) 10Fabfur: geo-maps: make esams default DC for France [dns] - 10https://gerrit.wikimedia.org/r/1057812 (https://phabricator.wikimedia.org/T371216) [08:38:36] (03CR) 10Ayounsi: [C:03+1] geo-maps: make esams default DC for France [dns] - 10https://gerrit.wikimedia.org/r/1057812 (https://phabricator.wikimedia.org/T371216) (owner: 10Fabfur) [08:41:56] (03CR) 10Fabfur: [C:03+2] geo-maps: make esams default DC for France [dns] - 10https://gerrit.wikimedia.org/r/1057812 (https://phabricator.wikimedia.org/T371216) (owner: 10Fabfur) [08:45:31] (03CR) 10Brouberol: dns: provision airflow-analytics-test domain (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [08:46:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P66992 and previous config saved to /var/cache/conftool/dbconfig/20240729-084622-marostegui.json [08:48:30] (03PS11) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [08:48:30] (03PS1) 10Elukey: ldap: fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) [08:50:02] (03PS2) 10Stevemunene: idp-test: Register airflow-analytics-test IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) [08:51:25] (03CR) 10Stevemunene: "Ack, thanks Luca 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [08:52:02] (03CR) 10CI reject: [V:04-1] ldap: 
fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [08:52:42] 07Puppet: Single member group breaks cross validation script - https://phabricator.wikimedia.org/T371221 (10SLyngshede-WMF) 03NEW [08:54:53] (03PS1) 10Slyngshede: data.yaml Unbreak cross-validate-accounts script. [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) [08:55:35] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10021879 (10Fabfur) Looping in @thcipriani just for a quick confirmation that this is both for a new shell account and for adding the user to the `restricted` group [08:55:45] (03CR) 10CI reject: [V:04-1] data.yaml Unbreak cross-validate-accounts script. [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [08:58:04] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10021886 (10hnowlan) >>! In T370789#10015615, @BCornwall wrote: > That's right! Thanks for reminding. Anyone have any qualms with going that route? Makes sense to me. 
[09:01:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P66994 and previous config saved to /var/cache/conftool/dbconfig/20240729-090129-marostegui.json [09:02:30] (03CR) 10Kamila Součková: [C:03+1] "I actually increased them in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050367 , but looking back I'm not sure what (if any) ef" [puppet] - 10https://gerrit.wikimedia.org/r/1057798 (https://phabricator.wikimedia.org/T369256) (owner: 10Filippo Giunchedi) [09:04:43] (03PS1) 10Filippo Giunchedi: rsyslog: send all k8s logs to dedicated kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1057819 (https://phabricator.wikimedia.org/T366710) [09:05:18] (03PS1) 10Stevemunene: Add airflow-analytics-test secret [labs/private] - 10https://gerrit.wikimedia.org/r/1057820 (https://phabricator.wikimedia.org/T371209) [09:05:30] (03CR) 10Filippo Giunchedi: "I've verified with k8s staging that logging happens as expected" [puppet] - 10https://gerrit.wikimedia.org/r/1057819 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [09:07:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032 investigate access denied errors', diff saved to https://phabricator.wikimedia.org/P66995 and previous config saved to /var/cache/conftool/dbconfig/20240729-090730-root.json [09:07:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change [09:07:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change [09:08:03] (03CR) 10Filippo Giunchedi: [C:03+2] benthos: smaller batches for mw_accesslog_metrics [puppet] - 10https://gerrit.wikimedia.org/r/1057798 (https://phabricator.wikimedia.org/T369256) (owner: 10Filippo Giunchedi) [09:09:17] (03PS2) 10Elukey: ldap: fix add-ldap-group script [puppet] - 
10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) [09:09:17] (03PS12) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [09:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool 25% of es1032', diff saved to https://phabricator.wikimedia.org/P66996 and previous config saved to /var/cache/conftool/dbconfig/20240729-090953-marostegui.json [09:11:02] (03PS3) 10Elukey: ldap: fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) [09:11:02] (03PS13) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [09:12:43] (03PS2) 10Stevemunene: dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) [09:13:09] (03PS3) 10Stevemunene: dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) [09:14:16] (03PS2) 10Slyngshede: data.yaml Unbreak cross-validate-accounts script. [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) [09:14:50] (03CR) 10CI reject: [V:04-1] data.yaml Unbreak cross-validate-accounts script. 
[puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [09:14:54] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3434/co" [puppet] - 10https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:16:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T367856)', diff saved to https://phabricator.wikimedia.org/P66997 and previous config saved to /var/cache/conftool/dbconfig/20240729-091637-marostegui.json [09:16:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [09:16:42] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [09:16:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [09:16:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T367856)', diff saved to https://phabricator.wikimedia.org/P66998 and previous config saved to /var/cache/conftool/dbconfig/20240729-091658-marostegui.json [09:19:15] (03PS3) 10Slyngshede: data.yaml Unbreak cross-validate-accounts script. [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) [09:22:23] (03PS4) 10Slyngshede: P:openldap::management Unbreak cross-validate-accounts script. 
[puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) [09:22:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032 investigate access denied errors', diff saved to https://phabricator.wikimedia.org/P66999 and previous config saved to /var/cache/conftool/dbconfig/20240729-092239-root.json [09:24:50] (03PS1) 10Fabfur: hiera:benthos: remove benthos from ulsfo cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) [09:25:35] (03Abandoned) 10Fabfur: hiera: enable benthos on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [09:25:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [09:27:23] !log dcausse@deploy1002 Started deploy [airflow-dags/search@7da1ef0]: search: process_sparql_query workaround oom issues [09:27:44] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@7da1ef0]: search: process_sparql_query workaround oom issues (duration: 00m 20s) [09:28:40] (03PS2) 10Hnowlan: mesh.configuration: copypasta commit in advance of changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241) [09:28:40] (03PS5) 10Hnowlan: mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [09:28:40] (03PS3) 10Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241) [09:28:51] (03CR) 10Elukey: "Looks good, I just have a question about how pyyaml renders empty lists :)" [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [09:29:44] (03CR) 10Slyngshede: ldap: fix add-ldap-group script (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:31:14] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [09:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:53] (03CR) 10Giuseppe Lavagetto: [C:04-1] "LGTM but:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [09:35:19] (03CR) 10Fabfur: "Don't know if there's any usefulness in keeping benthos references in haproxy/cache base profiles (that defaults to false anyway)..." [puppet] - 10https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [09:36:12] (03CR) 10Hashar: [C:03+1] gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas [puppet] - 10https://gerrit.wikimedia.org/r/1056996 (https://phabricator.wikimedia.org/T243027) (owner: 10Dzahn) [09:38:45] (03CR) 10Stevemunene: dns: provision airflow-analytics-test domain (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [09:39:14] (03PS4) 10Elukey: ldap: fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) [09:39:14] (03PS14) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [09:39:25] (03CR) 10Elukey: ldap: fix add-ldap-group script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:42:34] (03CR) 10CI reject: [V:04-1] ldap: fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:43:18] (03PS1) 10Elukey: WIP provision_server.py: add mac address to network 
provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1057826 [09:43:40] (03PS5) 10Slyngshede: P:openldap::management Unbreak cross-validate-accounts script. [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) [09:44:16] (03CR) 10Slyngshede: P:openldap::management Unbreak cross-validate-accounts script. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [09:44:52] (03CR) 10Hnowlan: "Done - used the Envoy default of 5m, which is a little steep but means no surprises should we encounter it elsewhere." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [09:44:57] (03PS6) 10Hnowlan: mesh.configuration: add idle_upstream_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) [09:45:15] (03CR) 10Elukey: [C:03+1] "Thanks for the follow up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [09:46:07] (03CR) 10Slyngshede: "This bug was noticed due to: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056452/3/modules/admin/data/data.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [09:47:25] (03CR) 10Slyngshede: ldap: fix add-ldap-group script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:48:38] (03PS5) 10Elukey: ldap: fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) [09:48:38] (03PS15) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [09:48:49] (03CR) 10Elukey: ldap: fix add-ldap-group script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:49:58] (03CR) 10Brouberol: [C:03+1] dns: provision airflow-analytics-test domain [dns] - 10https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [09:51:39] (03PS1) 10Slyngshede: IDP: Switch to CAS 7.0 hosts. [dns] - 10https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487) [09:51:54] (03CR) 10Slyngshede: [C:03+2] P:openldap::management Unbreak cross-validate-accounts script. 
[puppet] - 10https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: 10Slyngshede) [09:55:34] 07Puppet, 13Patch-For-Review: Single member group breaks cross validation script - https://phabricator.wikimedia.org/T371221#10022153 (10SLyngshede-WMF) 05Open→03Resolved [09:56:18] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [09:56:47] (03PS1) 10Jelto: gitlab: reduce max_storage_concurrency for test instance [puppet] - 10https://gerrit.wikimedia.org/r/1057828 (https://phabricator.wikimedia.org/T371222) [09:58:18] (03CR) 10Slyngshede: [C:03+2] "Nicely spotted, thank you" [software/bitu] - 10https://gerrit.wikimedia.org/r/1055998 (owner: 10Bartosz Dziewoński) [09:58:22] (03PS1) 10Clément Goubert: kubernetes: reimage 1 appserver to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1057829 (https://phabricator.wikimedia.org/T351074) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1000) [10:07:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2441.mgmt.codfw.wmnet with reboot policy GRACEFUL [10:11:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change [10:11:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change [10:12:09] (03PS1) 10Stevemunene: trafficserver: add airflow-analytics-test discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) [10:12:25] !log bounce benthos@mw_accesslog_sampler on logstash collectors [10:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P67000 and previous config saved to /var/cache/conftool/dbconfig/20240729-101348-root.json [10:14:23] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2441.mgmt.codfw.wmnet with reboot policy GRACEFUL [10:18:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234 (10MatthewVernon) 03NEW [10:19:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10022286 (10MatthewVernon) p:05Triage→03Medium [10:20:13] !log Deploy schema change on s7 eqiad master with replication dbmaint T370394 [10:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:18] T370394: Drop gb_by from globalblocks table - https://phabricator.wikimedia.org/T370394 [10:26:42] (03PS2) 10Arturo Borrero Gonzalez: wikireplicas::backend: convert to using haproxy::confd_site [puppet] - 10https://gerrit.wikimedia.org/r/1056937 (owner: 10Giuseppe Lavagetto) [10:26:45] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056937 (owner: 10Giuseppe Lavagetto) [10:27:03] (03CR) 10CI reject: [V:04-1] wikireplicas::backend: convert to using haproxy::confd_site [puppet] - 10https://gerrit.wikimedia.org/r/1056937 (owner: 10Giuseppe Lavagetto) [10:27:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192#10022321 (10MatthewVernon) p:05Triage→03High [10:27:41] (03PS3) 10Arturo Borrero Gonzalez: wikireplicas::backend: convert to using haproxy::confd_site [puppet] - 10https://gerrit.wikimedia.org/r/1056937 (owner: 10Giuseppe Lavagetto) [10:27:47] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056937 (owner: 10Giuseppe Lavagetto) [10:28:54] !log marostegui@cumin1002 
dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67001 and previous config saved to /var/cache/conftool/dbconfig/20240729-102853-root.json [10:30:01] (03PS1) 10Alexandros Kosiaris: Add eqsin, drmrs wrongly numbered hosts to typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057832 [10:30:15] (03CR) 10Alexandros Kosiaris: [C:03+2] deployment: Switch master deployment host to deploy1003 [puppet] - 10https://gerrit.wikimedia.org/r/1056878 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [10:31:50] (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "PCC fails when this change is applied with:" [puppet] - 10https://gerrit.wikimedia.org/r/1056937 (owner: 10Giuseppe Lavagetto) [10:33:02] (03PS1) 10Alexandros Kosiaris: Switch deployment.eqiad.wmnet to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1057833 (https://phabricator.wikimedia.org/T364417) [10:34:28] (03CR) 10Alexandros Kosiaris: [C:03+2] Switch deployment.eqiad.wmnet to deploy1003 [dns] - 10https://gerrit.wikimedia.org/r/1057833 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [10:35:37] (03PS5) 10Clément Goubert: mwdebug: Add logstash and otelcol config [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) [10:36:56] (03CR) 10Elukey: [C:03+2] ldap: fix add-ldap-group script [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [10:37:52] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:43:08] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#10022431 (10Ladsgroup) No but it had ten days of replication replayed (with RBR) and if it had issues, it would have broken replication really quickly. 
Also logs also said aria recovery was... [10:43:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67002 and previous config saved to /var/cache/conftool/dbconfig/20240729-104358-root.json [10:44:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#10022434 (10Marostegui) Sure, that's fine (remember we don't use Aria, so in this case that can be misleading). [10:46:36] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#10022450 (10elukey) ` elukey@ldap-maint1001:~$ sudo add-ldap-group --gid 724 ops-limited successfully created group ops-limited, with gidNumber 724 and 0 members ` [10:46:38] (03CR) 10EoghanGaffney: [C:03+1] gitlab: reduce max_storage_concurrency for test instance [puppet] - 10https://gerrit.wikimedia.org/r/1057828 (https://phabricator.wikimedia.org/T371222) (owner: 10Jelto) [10:47:28] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10022456 (10Volans) [10:49:50] (03PS16) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [10:49:50] (03PS1) 10Elukey: ldap: fix log for add-ldap-group.py [puppet] - 10https://gerrit.wikimedia.org/r/1057835 [10:49:57] (03CR) 10Alexandros Kosiaris: [C:03+2] Add eqsin, drmrs wrongly numbered hosts to typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057832 (owner: 10Alexandros Kosiaris) [10:50:48] (03Merged) 10jenkins-bot: Add eqsin, drmrs wrongly numbered hosts to typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057832 (owner: 10Alexandros Kosiaris) [10:51:51] (03CR) 10Jelto: [C:03+2] gitlab: reduce max_storage_concurrency for test instance [puppet] - 
10https://gerrit.wikimedia.org/r/1057828 (https://phabricator.wikimedia.org/T371222) (owner: 10Jelto) [10:53:26] (03CR) 10CI reject: [V:04-1] ldap: fix log for add-ldap-group.py [puppet] - 10https://gerrit.wikimedia.org/r/1057835 (owner: 10Elukey) [10:54:07] !log akosiaris@deploy1003 Started scap sync-world: check the deployment server after switchover [10:56:00] (03PS1) 10Abijeet Patro: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) [10:56:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:56:26] (03PS2) 10Elukey: ldap: fix log for add-ldap-group.py [puppet] - 10https://gerrit.wikimedia.org/r/1057835 [10:56:26] (03PS17) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [10:58:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Let’s try it and keep an eye on Grafana: https://grafana.wikimedia.org/d/000000316/memcache" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:59:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67003 and previous config saved to /var/cache/conftool/dbconfig/20240729-105904-root.json [10:59:54] (03CR) 10CI reject: [V:04-1] ldap: fix log for add-ldap-group.py [puppet] - 10https://gerrit.wikimedia.org/r/1057835 (owner: 10Elukey) [11:00:51] (03PS3) 10Elukey: ldap: fix log for add-ldap-group.py [puppet] - 
10https://gerrit.wikimedia.org/r/1057835 [11:00:51] (03PS18) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [11:03:44] (03PS1) 10Clément Goubert: cumin: Remove mw-api aliases [puppet] - 10https://gerrit.wikimedia.org/r/1057841 (https://phabricator.wikimedia.org/T367949) [11:04:41] (03CR) 10Ladsgroup: "Does this work for you Manuel?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [11:04:52] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10022490 (10Clement_Goubert) [11:05:56] (03CR) 10Volans: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1057841 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [11:06:27] (03CR) 10Clément Goubert: [C:03+2] cumin: Remove mw-api aliases [puppet] - 10https://gerrit.wikimedia.org/r/1057841 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [11:14:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67004 and previous config saved to /var/cache/conftool/dbconfig/20240729-111410-root.json [11:19:53] (03CR) 10Marostegui: "This works for me, we rarely touch any of this. We only interact now with db-production to set external store as RO sometimes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [11:22:52] (03CR) 10Ladsgroup: [C:03+1] "Then let's go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [11:26:36] !log akosiaris@deploy1003 Finished scap: check the deployment server after switchover (duration: 32m 28s) [11:32:58] (03CR) 10Klausman: [C:03+2] knative-serving: Switch activator to use Calico NP/k8s services (1/9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [11:34:21] FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:36:20] (03Merged) 10jenkins-bot: knative-serving: Switch activator to use Calico NP/k8s services (1/9) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: 10Klausman) [11:37:21] (03PS1) 10Jelto: gitlab: add missing max_concurrency value in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1057851 (https://phabricator.wikimedia.org/T371222) [11:39:21] RESOLVED: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:36] (03CR) 10Hnowlan: [C:03+1] kubernetes: reimage 1 appserver to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1057829 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:40:39] FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:17] (03CR) 10Clément Goubert: [C:03+2] kubernetes: reimage 1 appserver to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1057829 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:41:57] FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:43] * volans got paged [11:42:49] same [11:42:51] Looking [11:44:03] pods are up, but they're failing their readiness probes, service is throwing 503s [11:44:09] service logs are empty [11:44:11] * kamila_ looking [11:44:21] RESOLVED: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:39] FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:49] * Emperor got a page, are more hands needed? [11:46:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2441 to wikikube-worker2039 [11:46:10] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:46:16] Why did it page everyone first? 
I would have expected me and kamila_ would get paged first before everyone. [11:46:43] I'll go look at VO [11:46:53] thanks Emperor <3 [11:46:57] RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:45] VO says "User escalator_sysuser routed incident #4929 from SRE:SRE Business Hours (Escalation) to SRE:SRE Batphone (Escalation)" at basically the same time as the alert fired [11:48:25] man the error rate on recommendation-api suuuucks, 25-30% is normal [11:48:57] FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:16] !incidents [11:49:17] 4930 (UNACKED) ProbeDown sre (10.2.1.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 codfw) [11:49:17] 4929 (RESOLVED) ProbeDown sre (10.2.2.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 eqiad) [11:49:24] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2441 to wikikube-worker2039 - cgoubert@cumin1002" [11:49:28] !ack 4930 [11:49:28] 4930 (ACKED) ProbeDown sre (10.2.1.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 codfw) [11:49:42] I think VO may just have messed up - AFAICT the escalation policy is correctly configured (Business hours first, then batphone after 5m) [11:50:06] (03PS1) 10Klausman: charts: Version bump for knative-serving [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1057852 [11:50:06] (03PS1) 10KartikMistry: AX: Unregister "axArticleFooterEntrypointRegistrar" hook handler [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) [11:50:10] hnowlan: what's weird is I can curl the readiness probe and get a 200 [11:51:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2441 to wikikube-worker2039 - cgoubert@cumin1002" [11:51:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:51:21] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2039 [11:51:22] kamila_: if it's OK with you, I'll open a ticket against sre-observability about the escalation failure for this incident? [11:51:41] Emperor: thanks, sgtm [11:51:53] claime: I just get "fault filter abort" when curling it [11:52:00] and a 503 [11:52:21] should we roll restart? 
there is very little by way of docs or logging for this [11:52:36] (03CR) 10Jelto: [C:03+2] gitlab: add missing max_concurrency value in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1057851 (https://phabricator.wikimedia.org/T371222) (owner: 10Jelto) [11:52:38] only thing I can think of is the service having issues connecting to mysql [11:52:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) (owner: 10KartikMistry) [11:53:00] * kamila_ was going to suggest roll restarting, +1 hnowlan [11:53:22] hnowlan: ok i get that going through recommendation-api.discovery.wmnet:4632, but not http://10.67.148.182:9632/robots.txt [11:53:24] (03CR) 10Klausman: [C:03+2] charts: Version bump for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057852 (owner: 10Klausman) [11:53:49] do we want to keep one of the bad pods around for debugging with the relabeling trick? [11:53:57] RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:06] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10022582 (10JayCano) As Máté's manager, I approve this request. 
[11:56:42] (03Merged) 10jenkins-bot: charts: Version bump for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057852 (owner: 10Klausman) [11:56:55] (03PS2) 10Anzx: dtpwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076) [11:57:00] hnowlan: are you roll restarting? [11:57:00] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244 (10MatthewVernon) 03NEW [11:57:16] ^-- ticket re the mis-directed page [11:57:35] (03PS2) 10Anzx: mywikisource: add portal, author and translation namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057824 (https://phabricator.wikimedia.org/T371060) [11:57:40] kamila_: haven't yet - what's the relabelling trick? [11:58:42] hnowlan: https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Administration#Isolate_a_pod_from_traffic_and_deployments [11:58:46] cc eoghan ^ [11:58:55] (thanks a.lex <3) [11:59:21] RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:55] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:00:39] FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:54] kamila_: ah, cool - will do now [12:01:47] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:01:56] thanks hnowlan <3 [12:01:57] FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:02:30] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [12:02:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [12:04:07] !incidents [12:04:08] 4931 (ACKED) [2x] ProbeDown sre (ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4) [12:04:08] 4930 (RESOLVED) ProbeDown sre (10.2.1.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 codfw) [12:04:08] 4929 (RESOLVED) ProbeDown sre (10.2.2.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 eqiad) [12:04:14] still seeing 503s [12:04:46] sigh, time to look at the codebase [12:05:02] can we raise someone from research? [12:05:21] * kamila_ doesn't see anything in SAL [12:05:46] hnowlan: Did you do/are you doing a rolling restart? [12:05:48] hnowlan, eoghan: let's move to -sre, for noise reduction [12:06:04] eoghan: the scap sync above was a roll restart I assume [12:06:17] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:06:20] Oh yes, sorry. Missed that! 
[12:06:33] np, it's not obvious from the message [12:06:41] (maybe should be fixed someday) [12:06:42] ack [12:06:57] RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2039 [12:07:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2441 to wikikube-worker2039 [12:08:31] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2039.codfw.wmnet with OS bullseye [12:08:57] FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:46] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022648 (10Jhancock.wm) [12:13:57] RESOLVED: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2221.codfw.wmnet with OS bookworm [12:14:16] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022664 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2221.codfw.wmnet with OS bookworm [12:16:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:16:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2222.codfw.wmnet with OS bookworm [12:16:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm [12:17:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2223.codfw.wmnet with OS bookworm [12:17:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022683 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2223.codfw.wmnet with OS bookworm [12:17:30] spike of RU NEL [12:17:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2224.codfw.wmnet with OS bookworm [12:17:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2224.codfw.wmnet with OS bookworm [12:17:57] FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2225.codfw.wmnet with OS bookworm [12:18:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm [12:18:28] oh come on, is my silence bad? [12:18:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2226.codfw.wmnet with OS bookworm [12:18:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [12:18:58] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2226.codfw.wmnet with OS bookworm [12:19:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm [12:19:12] RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [12:22:44] (03PS1) 10Klausman: 
charts/knative-serving: fix selector for activator netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057859 [12:27:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2039.codfw.wmnet with reason: host reimage [12:27:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:28:12] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:29:34] (03PS1) 10Slyngshede: Initial 2FA support [software/bitu] - 10https://gerrit.wikimedia.org/r/1057862 [12:30:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2221.codfw.wmnet with reason: host reimage [12:32:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2039.codfw.wmnet with reason: host reimage [12:32:57] I'll do early +2 for my wmf.15 backport patch (Also for probably abijeet's patch) as CI will take 20-25 minutes. [12:33:28] (03PS3) 10Slyngshede: Styling: Allow the use of normal Codex tables. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1052923 [12:33:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2226.codfw.wmnet with reason: host reimage [12:33:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2223.codfw.wmnet with reason: host reimage [12:34:41] (03CR) 10Klausman: [C:03+2] charts/knative-serving: fix selector for activator netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057859 (owner: 10Klausman) [12:34:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2224.codfw.wmnet with reason: host reimage [12:34:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2225.codfw.wmnet with reason: host reimage [12:35:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2221.codfw.wmnet with reason: host reimage [12:35:29] kart_: but you’re at the end of the deployment order [12:35:40] (03CR) 10Slyngshede: [C:03+2] Styling: Allow the use of normal Codex tables. [software/bitu] - 10https://gerrit.wikimedia.org/r/1052923 (owner: 10Slyngshede) [12:35:46] !log test benthos 4.27 on logstash1023 [12:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:54] (03PS7) 10Slyngshede: Permissions: Allow users to request new permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 [12:37:40] (03CR) 10Slyngshede: [C:03+2] Permissions: Allow users to request new permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 (owner: 10Slyngshede) [12:38:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: host reimage [12:39:02] Lucas_WMDE: OK, in that case, I can +2 at the start of the window? 
[12:39:16] yeah, IMHO that should be enough time [12:39:16] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1057866 (https://phabricator.wikimedia.org/T371251) [12:39:21] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1057867 (https://phabricator.wikimedia.org/T371251) [12:40:45] kart_: maybe abijeet’s change can be +2ed a bit before the window starts, not sure [12:41:00] but that one will need a bit of time to verify that everything is okay after the full deployment (can’t be tested as well on mwdebug) [12:41:07] so I’d like to leave some time there before your backport [12:41:08] (03Merged) 10jenkins-bot: charts/knative-serving: fix selector for activator netpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057859 (owner: 10Klausman) [12:41:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: host reimage [12:41:37] Sure [12:43:25] (03CR) 10Elukey: [C:03+2] ldap: fix log for add-ldap-group.py [puppet] - 10https://gerrit.wikimedia.org/r/1057835 (owner: 10Elukey) [12:43:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2223.codfw.wmnet with reason: host reimage [12:45:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2222.codfw.wmnet with OS bookworm [12:45:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022764 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm executed with errors: - db... 
[12:46:54] !log upgrade and roll-restart benthos@mw_accesslog_sampler on logstash hosts [12:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:07] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:47:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2226.codfw.wmnet with reason: host reimage [12:48:45] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:51:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [12:53:12] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:53:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2039.codfw.wmnet with OS bullseye [12:55:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [12:55:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2221.codfw.wmnet with OS bookworm [12:55:30] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022773 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2221.codfw.wmnet with OS bookworm completed: - db2221 (**PAS... 
[12:55:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [12:56:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252 (10elukey) 03NEW [12:56:13] Seems abijeet is not around. [12:56:48] we can wait a bit, but I think I also feel relatively confident to deploy that backport myself [12:57:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2222.codfw.wmnet with OS bookworm [12:57:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [12:57:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2224.codfw.wmnet with OS bookworm [12:57:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [12:57:31] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm [12:57:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2224.codfw.wmnet with OS bookworm completed: - db2224 (**PAS... 
[12:57:54] kart_: but I guess we can +2 your backport first, then [12:58:01] (and let gate-and-submit run while deploying the config changes) [12:58:09] sure [12:58:31] (03CR) 10Lucas Werkmeister (WMDE): "Okay to deploy now (backport window is in a few minutes)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: 10Lucas Werkmeister (WMDE)) [12:58:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [12:59:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db2225.codfw.wmnet with OS bookworm [12:59:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm completed: - db2225 (**PAS... [12:59:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm executed with errors: - db... [12:59:59] heh, deploy1003 actually seems to be a “weaker” machine than deploy1002? (at least it has fewer nproc and RAM; haven’t looked into the exact CPU specs or anything ^^) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1300). nyaa~ [13:00:05] Lucas_WMDE, Gerges, abijeet, and kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] o/ [13:00:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:00:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) (owner: 10KartikMistry) [13:00:29] (03PS2) 10Lucas Werkmeister (WMDE): Enable mul language code on Wikidata (limited mode) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) [13:00:50] * Lucas_WMDE waits for diffConfig build [13:01:12] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:01:24] Lucas_WMDE: we can also +2 abijeet's change. 
[13:01:26] no diff in -labs- or in testwikidatawiki, as expected [13:01:56] kart_: I would wait a bit more with that [13:01:59] hmm, scap backport fails [13:01:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:02:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2223.codfw.wmnet with OS bookworm [13:02:10] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2223.codfw.wmnet with OS bookworm completed: - db2223 (**PAS... [13:02:22] akosiaris: I might be having issues on deploy1003… I’ll look a bit closer at it but I assume you’d be interested [13:02:33] Lucas_WMDE: what do you experience? [13:02:33] … git remote get-url origin --recursive' failed with exit code 128 [13:02:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2225.codfw.wmnet with OS bookworm [13:02:38] error: unknown option `recursive' [13:02:42] are we on an older git? [13:02:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022823 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm [13:02:50] newer for sure, it's bullseye [13:02:59] and the older hosts are buster [13:03:06] yup, 2.20.1 to 2.30.2 [13:03:10] so did git remove the option? o_O [13:03:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:03:58] Oops. 
[13:04:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:04:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [13:04:25] hm, I see no evidence of it ever having been in Documentation/git-remote.txt [13:04:29] hello, patch for review: 1057840: TranslatablePage: Split translatable page id cache into multiple shards | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1057840 -- I don't have rights to +2 [13:04:31] (in git.git) [13:04:36] patch for backport** [13:04:44] Lucas_WMDE: I was about to point out, I don't see it in git-remote man page either [13:04:45] abijeet: we’ll get to it, but currently it looks like we might not be able to deploy at all [13:04:51] what's that --recursive thing? [13:04:57] Lucas_WMDE, too many patches already? [13:04:58] (03PS1) 10Physikerwelt: Make native MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057870 (https://phabricator.wikimedia.org/T371254) [13:05:09] akosiaris: yeah, even on deploy1002 it’s not in the docs [13:05:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2225.codfw.wmnet with reason: host reimage [13:05:18] it’s also not really clear to me what it would do [13:05:22] but scap tries to run it… [13:05:30] * Lucas_WMDE looks at scap code [13:05:42] I did a scap sync-world today and didn't notice such a thing [13:05:46] is it scap backport ? 
[13:06:29] yes [13:06:31] scap/plugins/backport.py has [13:06:34] paths_urls = git.list_submodules_paths_urls(location, "--recursive") [13:06:41] and that just pastes the --recursive to the end of the git command [13:06:55] I think git might just have silently ignored it before? [13:07:07] Lucas_WMDE, no rush, we can deploy it during the UTC late backport window. I'll be around. [13:07:09] (03CR) 10Jelto: [C:03+1] "sounds reasonable to not log changes on the staging host" [puppet] - 10https://gerrit.wikimedia.org/r/1056941 (owner: 10EoghanGaffney) [13:07:14] I guess I get to practice deploying without scap backport today [13:07:23] abijeet: I definitely have a change of my own I want to deploy though :D [13:07:28] we announced a date to the community and all [13:07:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: host reimage [13:08:10] ah so this is probably meant for git submodule then [13:08:20] the --recursive appears in https://gitlab.wikimedia.org/repos/releng/scap/-/commit/f1477e7856 [13:08:21] which does have multiple commands supporting --recursive [13:08:26] is jeena around by any chance? [13:08:36] also I guess I should definitely file a phab task [13:08:40] easier to paste the error output there [13:09:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:09:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [13:09:18] Lucas_WMDE: will the older scap way work? 
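For readers following the `--recursive` discussion, here is a minimal sketch of where the option would have to go. It is not scap's actual code: `submodule_url_command` is a hypothetical helper, and the only things taken from the log are the inner command `git remote get-url origin` and the `--recursive` argument. The point is that `--recursive` is an option of `git submodule foreach`, so it has to come before the per-submodule command rather than being appended to the end of it.

```python
def submodule_url_command(*foreach_args):
    """Build an argv list that prints each submodule's origin URL.

    Hypothetical helper for illustration only (not part of scap).
    Extra options such as "--recursive" are placed on
    `git submodule foreach`, before the inner command, because
    `git remote get-url` itself has no --recursive option.
    """
    inner = "git remote get-url origin"
    return ["git", "submodule", "--quiet", "foreach", *foreach_args, inner]


if __name__ == "__main__":
    # Show the command line without running git.
    print(" ".join(submodule_url_command("--recursive")))
```

The helper only builds the argument list; passing it to something like `subprocess.run(argv, cwd=repo_path)` is left to the caller, and `repo_path` would be whatever checkout is being inspected.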
[13:09:27] kart_: I assume so [13:09:30] I’ll try once the phab task is filed [13:10:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:10:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2226.codfw.wmnet with OS bookworm [13:10:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022834 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2226.codfw.wmnet with OS bookworm completed: - db2226 (**PAS... [13:10:36] OK. We've to deploy CX patch. It is quite important one :/ [13:10:46] akosiaris: T371255 [13:10:47] T371255: scap backport broken on deploy1003 (bullseye, Git 2.30) - https://phabricator.wikimedia.org/T371255 [13:11:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2222.codfw.wmnet with reason: host reimage [13:11:04] I’ll try to deploy with the old-style commands now [13:11:12] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:11:37] (03CR) 10Lucas Werkmeister (WMDE): "Deploying (manual +2 because `scap backport` is broken, T371255)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: 10Lucas Werkmeister (WMDE)) [13:11:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "*actual* +2 vote lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: 
10Lucas Werkmeister (WMDE)) [13:12:05] akosiaris: would `scap backport` from deploy1002 work? or is that a terrible idea? ^^ [13:12:10] Lucas_WMDE: yeah, it's passing --recursive to the wrong git subcommand, it should be passing --recursive to git submodule foreach [13:12:27] ah, and then it would just echo a bit more, okay [13:12:34] (03Merged) 10jenkins-bot: Enable mul language code on Wikidata (limited mode) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: 10Lucas Werkmeister (WMDE)) [13:13:07] oh jeez how do you even sync to mwdebug hosts [13:13:16] I guess I’ll just scap pull on one bare-metal mwdebug [13:13:19] and it’ll only be testable there [13:13:23] no idea how to do k8s-mwdebug ^^ [13:13:46] ok pulled on mwdebug1002 [13:13:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2222.codfw.wmnet with reason: host reimage [13:13:54] testing… [13:14:17] Lucas_WMDE: it will probably work but take quite a bit of time to deploy from deploy1002. [13:14:25] alright, then let’s not do that probably [13:14:28] thanks! [13:15:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:15:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [13:16:19] okay I think my config change is working, so let’s do a sync-world [13:16:28] or sync-file (does sync-file still exist? 
^^) [13:17:33] looks like it does [13:17:47] yes it does [13:17:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2225.codfw.wmnet with OS bookworm [13:18:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm completed: - db2225 (**PAS... [13:18:05] though I guess all it does is make the rsync on ~5 remaining hosts a tiny bit faster [13:18:30] as I assume the image building doesn’t take the path into account [13:20:12] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:20:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:20:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [13:24:14] !log lucaswerkmeister-wmde@deploy1003 Synchronized wmf-config/: Backport for [[gerrit:1055434|Enable mul language code on Wikidata (limited mode) (T330281)]] (duration: 06m 47s) [13:24:19] T330281: MUL - Phased rollout on Wikidata.org (Stage 2 of 3: Initial limited release) - https://phabricator.wikimedia.org/T330281 
[13:25:30] (03CR) 10Jelto: gerrit: use list of replicas from hiera again, don't do puppet DB lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [13:26:07] (03PS3) 10Jelto: gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [13:26:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:26:34] alright, I think my change was deployed successfully AFAICT [13:26:41] so kart_ is up next once CI finishes [13:26:53] (03PS1) 10Klausman: charts/knative-serving: Drop selector for activator networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057872 [13:27:05] and I think we can already +2 abijeet’s backport [13:27:10] unless you want to wait for tonight? [13:27:18] but I wouldn’t mind deploying it now [13:28:25] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3435/console" [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [13:29:10] Lucas_WMDE, fine with me [13:29:15] let's deploy it now [13:29:18] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:29:32] Lucas_WMDE: Nice! 
[13:29:33] ok, then let’s +2 it and it should be merged by the time we’re done with kart_’s backport [13:29:41] (03Merged) 10jenkins-bot: AX: Unregister "axArticleFooterEntrypointRegistrar" hook handler [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) (owner: 10KartikMistry) [13:29:47] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "+2ing ahead of deployment" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:30:28] kart_: your backport should be on mwdebug1002, can you test? [13:30:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:30:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2222.codfw.wmnet with OS bookworm [13:31:40] Tricky, but let me see. [13:32:31] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022911 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm completed: - db2222 (**PAS... 
[13:32:33] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022912 (10Jhancock.wm) [13:33:04] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet [13:33:05] (03CR) 10Klausman: [C:03+2] charts/knative-serving: Drop selector for activator networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057872 (owner: 10Klausman) [13:33:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2228.codfw.wmnet with OS bookworm [13:33:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2229.codfw.wmnet with OS bookworm [13:33:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS bookworm [13:33:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS bookworm [13:33:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm [13:33:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2228.codfw.wmnet with OS bookworm [13:33:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2233.codfw.wmnet with OS bookworm [13:33:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2229.codfw.wmnet with OS bookworm [13:33:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for 
host db2230.codfw.wmnet with OS bookworm [13:33:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022920 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2231.codfw.wmnet with OS bookworm [13:33:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2232.codfw.wmnet with OS bookworm [13:34:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm [13:35:25] Lucas_WMDE: still testing with Nik in parallel. Give me one more minute. [13:35:57] sure [13:36:30] (03Merged) 10jenkins-bot: charts/knative-serving: Drop selector for activator networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057872 (owner: 10Klausman) [13:36:47] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet [13:37:14] Lucas_WMDE: looks good. Please go ahead [13:39:16] (03PS1) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 [13:39:25] kart_: syncing, thanks for testing! 
[13:39:41] (03CR) 10CI reject: [V:04-1] service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (owner: 10Alexandros Kosiaris) [13:40:38] (03PS1) 10Fabfur: Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) [13:41:11] !log push new pfw policies - T371137 [13:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:24] (03PS2) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 [13:41:33] (03CR) 10CI reject: [V:04-1] Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur) [13:42:25] (03PS2) 10Fabfur: Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) [13:42:34] (03CR) 10Kamila Součková: [C:03+1] "Copied votes on follow-up patch sets have been updated:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (owner: 10Alexandros Kosiaris) [13:42:47] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:43:21] (03CR) 10Alexandros Kosiaris: service: Remove probes from recommendation-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (owner: 10Alexandros Kosiaris) [13:43:31] (03CR) 10Ssingh: [C:03+1] Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur) [13:43:39] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur) [13:43:54] (03PS1) 10Slyngshede: 
data.yaml: Extend andyrussg until the end of August. [puppet] - 10https://gerrit.wikimedia.org/r/1057877 [13:43:54] (03CR) 10Fabfur: [C:03+2] Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur) [13:44:34] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10022944 (10cmooney) [13:44:50] (03PS1) 10DCausse: wdqs: configure internal federation between main and scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1057878 [13:44:56] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:45:27] (03PS3) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 [13:45:53] !log lucaswerkmeister-wmde@deploy1003 Synchronized php-1.43.0-wmf.15/extensions/ContentTranslation/extension.json: Backport for [[gerrit:1057853|AX: Unregister "axArticleFooterEntrypointRegistrar" hook handler (T363338)]] (duration: 06m 36s) [13:45:58] T363338: MinT for Wiki Readers MVP: Access from the footer of an article - https://phabricator.wikimedia.org/T363338 [13:46:02] kart_: should be deployed everywhere now [13:46:11] cool. Thanks a lot Lucas_WMDE [13:46:15] np [13:46:27] up next, abijeet, once CI finishes [13:46:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm [13:46:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db... 
[13:46:39] and I haven’t seen Gerges yet (but I don’t mind if there’s less to deploy while scap backport is broken ^^) [13:46:59] Here [13:47:09] 06SRE, 10Continuous-Integration-Infrastructure, 06Infrastructure-Foundations, 06Release-Engineering-Team: package_builder python-all conflicts with base::standard_packages python2.7 removal - https://phabricator.wikimedia.org/T370337#10022948 (10hashar) 05Open→03Resolved a:03hashar I have solved... [13:47:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2228.codfw.wmnet with reason: host reimage [13:47:19] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1240 - jclark@cumin1002" [13:47:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2230.codfw.wmnet with reason: host reimage [13:47:37] Gerges: alright, we’ll see if we still have time at the end of the window [13:47:55] Ok [13:47:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2232.codfw.wmnet with reason: host reimage [13:48:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1240 - jclark@cumin1002" [13:48:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2229.codfw.wmnet with reason: host reimage [13:48:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2231.codfw.wmnet with reason: host reimage [13:48:32] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10022967 (10Fabfur) The user should be now part of the required group(s), please test it and let me know if anything doesn't work as 
expected! [13:48:57] (03CR) 10Herron: [C:03+1] prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:49:15] (03CR) 10Herron: [C:03+1] "🧹🧼" [puppet] - 10https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:49:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [13:49:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2228.codfw.wmnet with reason: host reimage [13:50:40] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10022968 (10Fabfur) p:05Triage→03Low [13:52:00] 06SRE, 10conftool, 06Infrastructure-Foundations, 10Puppet-Infrastructure: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252#10022989 (10Volans) [13:52:06] (03PS2) 10Klausman: charts/knative-serving: Re-add app selector in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057879 [13:52:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2232.codfw.wmnet with reason: host reimage [13:52:11] (03CR) 10Klausman: [C:03+2] charts/knative-serving: Re-add app selector in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057879 (owner: 10Klausman) [13:53:00] (03PS4) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (https://phabricator.wikimedia.org/T338471) [13:53:02] * Lucas_WMDE has now installed P8845 on a laptop that previously never needed it thanks to scap backport ^^ [13:53:27] (ok, no stashbot – that’s https://phabricator.wikimedia.org/P8845, `backport-summary` script to generate the message for the scap sync-file) [13:53:42] (03CR) 10Alexandros Kosiaris: 
[C:03+2] service: Remove probes from recommendation-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [13:53:59] (03CR) 10Alexandros Kosiaris: [C:03+2] "Comments addressed, got a +1 already, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris) [13:54:10] (03CR) 10Cathal Mooney: "Thanks Daniel! Overall LGTM thanks for taking a look... the only worry I would have is are we in danger of removing confd for some roles " [puppet] - 10https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: 10Dzahn) [13:54:16] Lucas_WMDE, 3-4 minutes remaining hopefully. [13:54:44] * Lucas_WMDE nods [13:54:46] jouncebot: next [13:54:46] In 1 hour(s) and 35 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1530) [13:54:57] no other window we’re about to run into at least [13:55:21] (03Merged) 10jenkins-bot: charts/knative-serving: Re-add app selector in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057879 (owner: 10Klausman) [13:55:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2230.codfw.wmnet with reason: host reimage [13:55:46] 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10022995 (10hashar) I must have declined this as part of a task triage since I usually leave a comment when... 
[13:56:33] !log homer 'cr*codfw*' commit 'T351074' [13:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:37] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [13:57:02] (03Merged) 10jenkins-bot: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:57:38] abijeet: I pulled the change to mwdebug1002, anything to test there? [13:57:45] or should we just sync it everywhere and be ready to revert? [13:57:56] I am around as well [13:58:06] hi effie :) [13:58:14] :) [13:58:22] Lucas_WMDE: I've created a scap release with the backport fix, let me know when I can deploy it [13:58:44] jnuche: I’m not scap’ing right now, I think you could do it now [13:58:50] I’m assuming it doesn’t take ages ^^ [13:58:59] nope, should be fast [13:59:04] gonna do it then [13:59:08] alright, thanks! 
[13:59:11] and then I can try it out right afterwards [13:59:16] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:59:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: host reimage [13:59:22] !log jnuche@deploy1003 Installing scap version "4.94.0" for 211 hosts [13:59:31] (03PS1) 10Herron: grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1057882 (https://phabricator.wikimedia.org/T269333) [14:00:00] (03PS1) 10Klausman: charts/knative-serving: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057883 [14:00:12] (03CR) 10Klausman: [C:03+2] charts/knative-serving: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057883 (owner: 10Klausman) [14:00:15] Lucas_WMDE, checking [14:00:38] !log jnuche@deploy1003 Installing scap version "4.94.0" for 210 hosts [14:00:47] ping [14:01:12] !log jnuche@deploy1003 Installation of scap version "4.94.0" completed for 210 hosts [14:01:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED [14:01:28] Gerges: I don’t think we’ll have time for your config changes in this window, sorry [14:01:31] Lucas_WMDE, we can monitor this: https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&var-kClass=pagetranslation&from=now-1h&to=now [14:01:34] we’ve had some problem with the deployment system [14:01:43] Lucas_WMDE: done, hopefully the problem is fixed now! 
[14:01:48] \o/ [14:01:49] let’s try it [14:01:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1240.eqiad.wmnet with OS bullseye [14:01:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10023061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bull... [14:02:10] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1057840|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] [14:02:14] looking good so far :) [14:02:18] Well no problem [14:02:33] (03PS1) 10Filippo Giunchedi: burrow: restart on failure [puppet] - 10https://gerrit.wikimedia.org/r/1057886 (https://phabricator.wikimedia.org/T366573) [14:02:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2231.codfw.wmnet with reason: host reimage [14:03:19] (03Merged) 10jenkins-bot: charts/knative-serving: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057883 (owner: 10Klausman) [14:04:05] Lucas_WMDE, looks good. [14:04:21] alright [14:04:32] (scap backport is running now btw) [14:06:01] (03PS1) 10Klausman: charts/knative-serving: move selector to Egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057887 [14:06:06] (03CR) 10Klausman: [C:03+2] charts/knative-serving: move selector to Egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057887 (owner: 10Klausman) [14:06:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:07:45] Does scap backport work now, or I wait for the late backport window? 
[14:07:48] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2039.codfw.wmnet),cluster=kubernetes,service=kubesvc [reason: Pooling and uncordoning - T351074] [14:07:53] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [14:08:02] it works now, but I’m not going to start another deployment after this one, as the window is already over [14:08:17] but there’s no known blocker for deploying this in the evening window later (assuming someone else is around to do it) [14:08:26] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1057882 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron) [14:09:10] ): [14:09:16] (03Merged) 10jenkins-bot: charts/knative-serving: move selector to Egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057887 (owner: 10Klausman) [14:09:17] !log rerunning airflow mediawiki_history_check_denormalize dag as down stream task after rerunning mediawiki_history_denormalize dag [14:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:30] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T371260 (10Clement_Goubert) 03NEW [14:09:33] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [14:09:35] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T371260#10023126 (10Clement_Goubert) p:05Triage→03Low [14:09:36] feels like docker_pull_k8s is taking unusually long [14:10:31] (on the previous deployments they took 16/20 seconds) [14:11:03] Hmm I hope that's not because I just pooled a node [14:11:35] > ImportError: cannot import 
name 'cli' from 'scap' (unknown location) [14:11:36] o_O [14:11:40] (03CR) 10Herron: [C:03+2] grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1057882 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron) [14:11:46] 2 masters had sync errors [14:11:52] Huh yeah that's not me [14:11:53] (deploy1002 and deploy2002, I think?) [14:12:13] claime: the build-and-push-container-images also took 4m13s, idk if that makes it more or less likely to be related to the pooled node? [14:12:26] feels like “bigger image diff” to me but idk why that would be [14:12:27] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:12:32] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10023129 (10Clement_Goubert) p:05Triage→03Low [14:12:43] Lucas_WMDE: nodes are not involved in build-and-push [14:12:53] ok [14:13:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:13:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2228.codfw.wmnet with OS bookworm [14:13:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2228.codfw.wmnet with OS bookworm completed: - db2228 (**PAS... 
[14:13:34] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Backport for [[gerrit:1057840|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:13:42] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Continuing with sync [14:14:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [14:14:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm [14:15:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:15:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2232.codfw.wmnet with OS bookworm [14:15:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023153 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2232.codfw.wmnet with OS bookworm completed: - db2232 (**PAS... 
[14:15:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:19:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:19:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2230.codfw.wmnet with OS bookworm [14:19:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:19:13] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023182 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2230.codfw.wmnet with OS bookworm completed: - db2230 (**PAS... [14:20:30] noticing a similar spike in traffic again... will monitor for some more time. [14:20:42] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:20:51] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:20:52] oh dear [14:21:03] ohhhh yeah TX bandwidth is going up [14:21:23] (k8s deployment is done btw, scap is just finishing up) [14:21:32] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[14:21:33] that sync-masters error is now tracked at T371261 btw [14:21:34] !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1057840|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] (duration: 19m 24s) [14:21:37] T371261: scap broken on deploy1002 / deploy2002 (buster) - https://phabricator.wikimedia.org/T371261 [14:21:58] scap returned non-zero exit status… I assume that’s because of the sync-masters [14:22:07] * Lucas_WMDE scrolls up [14:22:18] yeah I don’t see any other errors in the output [14:22:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:22:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2231.codfw.wmnet with OS bookworm [14:22:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023205 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2231.codfw.wmnet with OS bookworm completed: - db2231 (**PAS... [14:23:07] Lucas_WMDE: lets give it ~10' and revert [14:23:11] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:23:14] ok [14:23:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:23:43] or, maybe: let’s upload and +2 the revert now, and leave ourselves the option to abort the merge if we decide not to revert after all? 
[14:23:55] though that would make it more than 10 minutes before the revert merges normally [14:24:04] !log the grafana default datasource has been changed from graphite to thanos T269333 [14:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:11] T269333: Switch default Grafana datasource to Thanos - https://phabricator.wikimedia.org/T269333 [14:24:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:24:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2229.codfw.wmnet with OS bookworm [14:24:44] (03PS1) 10Klausman: charts/knative-serving: remove selector again [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057892 [14:24:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2229.codfw.wmnet with OS bookworm completed: - db2229 (**PAS... 
[14:25:09] Lucas_WMDE: based on last week, it is unlikely we will not revert :) [14:25:12] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:25:24] yeah [14:25:59] (03PS1) 10Lucas Werkmeister (WMDE): Revert "TranslatablePage: Split translatable page id cache into multiple shards" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) [14:26:07] effie, abijeet: ^ [14:26:26] ack [14:26:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023220 (10Jhancock.wm) [14:26:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "Let’s start gate-and-submit while we continue to look at Grafana for a bit; if we decide to deploy the revert, we might force-merge this b" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE)) [14:26:54] (03CR) 10Effie Mouzeli: [C:03+1] Revert "TranslatablePage: Split translatable page id cache into multiple shards" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE)) [14:27:24] 06SRE, 10conftool, 06Infrastructure-Foundations, 10Puppet-Infrastructure: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252#10023233 (10elukey) p:05Triage→03Medium [14:28:43] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023235 (10Jhancock.wm) 27 
and 33 having some issues. will check again shortly. 27 is not connecting to the right puppet hosts. > Generated Puppet certificate > [1/10... [14:28:45] Lets give it another 2 minutes, and then we can revert it. [14:28:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10023237 (10cmooney) p:05Triage→03Medium [14:29:07] ack [14:30:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10023247 (10cmooney) p:05Triage→03Medium [14:30:29] (03CR) 10Klausman: [C:03+2] charts/knative-serving: remove selector again [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057892 (owner: 10Klausman) [14:31:31] the lines are going up and down a bit but to me they don’t look like they’re settling down to a reasonable level [14:31:34] let’s revert? [14:33:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE)) [14:33:07] Lucas_WMDE, yea I was hoping they'd keep going down, but that doesn't appear to be happening [14:33:15] Lets revert [14:33:21] (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "force-merging the revert" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE)) [14:33:24] !log A:wikidough: debdeploy upgrade anycast-hc to 0.9.8 [14:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:39] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" 
(T366455)]] [14:33:39] alright, merged, now scap is running again [14:33:43] (03Merged) 10jenkins-bot: charts/knative-serving: remove selector again [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057892 (owner: 10Klausman) [14:33:52] !log A:wikidough: debdeploy upgrade anycast-hc to 0.9.8: T370068 [14:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:34:52] !log sudo cumin -b1 -s120 'O:wikidough' 'run-puppet-agent' [14:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:34] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:35:37] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [14:36:27] (building and pulling the image was much faster again this time, btw) [14:37:32] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:37:40] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10023283 (10hnowlan) Are we classifying "incident issue closed" as resolved? [14:38:02] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#10023279 (10ayounsi) 05Open→03Stalled p:05High→03Low [14:39:11] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:39:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023287 (10Marostegui) @Jhancock.wm I've fixed db2227's certificate issues. Puppet finished correctly. 
I am going to reimage it again and see if it works fine this time.... [14:39:21] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:31] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#10023289 (10Volans) a:05Volans→03None [14:40:25] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#10023290 (10elukey) [14:41:38] !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" (T366455)]] (duration: 07m 58s) [14:41:49] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#10023295 (10joanna_borun) p:05High→03Medium [14:42:41] memcached looks fine again to me [14:43:24] cheers thank you [14:45:06] !log UTC afternoon backport+config window done [14:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:22] 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10023319 (10hnowlan) One of the big challenges I can see here is the use of compound words - currently we use lazy names like incident-create and incident-list because adding a verb and subverbs w... [14:46:41] and IMHO someone™ should look at T371261 – we probably either need to install the older scap version there(?) 
or remove them from some masters list so that scap@deploy1003 won’t try to deploy to them anymore [14:46:42] T371261: scap broken on deploy1002 / deploy2002 (buster) - https://phabricator.wikimedia.org/T371261 [14:47:06] CC akosiaris and jnuche for ^ [14:48:10] Lucas_WMDE: ehm, what? that's new [14:48:43] 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10023337 (10SLyngshede-WMF) a:03SLyngshede-WMF [14:49:02] I am not sure how to rollback scap tbh. And while I 'll remove deploy1002 within the week, and upgrade deploy2002 to bullseye [14:49:15] I am not sure what is going on there [14:49:35] ah dammit python versions [14:49:36] sigh [14:49:37] to me it looks like an issue with some other commit that was included in the new release [14:49:38] yeah [14:49:58] or is it just using a different python version of where the package was built, maybe [14:50:03] I’m not seeing anything obvious in the git log at least [14:52:12] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:52:16] (03PS3) 10Ottomata: mediawiki.org - Apache rewrite /beacon/event -> /w/beacon/event.php [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) [14:52:25] RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:13] (03CR) 10Ottomata: 
"Okay, ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [14:54:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2233.codfw.wmnet with OS bookworm [14:54:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm executed with errors: - db... [14:56:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10023429 (10cmooney) 05Open→03Resolved Gonna close this one, I see hosts have been assigned to the new range and are reachable ` cmo... [14:56:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2233.codfw.wmnet with OS bookworm [14:56:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm [14:56:37] (03PS4) 10Ottomata: mediawiki.org - Rewrite /beacon/event -> EventLogging rest handler [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) [14:57:08] (03CR) 10Scott French: [C:03+1] "LGTM to align with the configuration of mwdebug1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:57:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10023444 (10Jhancock.wm) a:03Jhancock.wm [14:57:42] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:58:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm [14:58:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db... [14:58:35] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10023447 (10mszabo) 05In progress→03Resolved [14:58:42] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10023448 (10mszabo) Thanks, looks good! [14:58:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2003.codfw.wmnet with OS bookworm [14:58:52] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023449 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm [14:59:21] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:59:21] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:53] (03CR) 10Clément Goubert: [C:03+2] mwdebug: Add logstash and otelcol config [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:02:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [15:02:12] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:02:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023455 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm [15:03:22] (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki.org - Rewrite /beacon/event -> EventLogging rest handler [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:04:56] Lucas_WMDE: I think that scap issue comes from the scap installer/self-updater [15:05:09] it's not critically urgent but I'll take a look soon [15:05:14] (ish) [15:06:07] ok, thanks! [15:08:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023467 (10Papaul) @Marostegui 2233 was a switch port issue so it should be fix now. 
@Jhancock.wm started the re -image already on it Cookbook cookbooks.sre.hosts.reima... [15:09:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [reason: upgrading anycast-hc: T370068] [15:09:44] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [15:10:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2233.codfw.wmnet with reason: host reimage [15:10:56] !log [dns2006] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068 [15:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:51] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#10023498 (10SLyngshede-WMF) 05In progress→03Resolved Closing this task, I've created https://phabricator.wikimedia.org/T371270 for the issues r... [15:12:12] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:13:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [reason: finished upgrading anycast-hc: T370068] [15:13:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2233.codfw.wmnet with reason: host reimage [15:14:05] !log running authdns-update after dns2006 depool [15:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023508 
(10Marostegui) @Papaul can you also check db2227? It is not rebooting after I issued the reimage cookbook. The idrac screen is also blank so I cannot see where i... [15:16:23] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:16:49] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:17:58] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:18:05] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:18:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1240.eqiad.wmnet with OS bullseye [15:18:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10023511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bullseye... [15:22:12] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:23:38] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:23:46] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:27:08] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10023539 (10Clement_Goubert) 05In progress→03Resolved [15:29:55] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T371100#10023591 (10phaultfinder) [15:29:57] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:30:04] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1530). Please do the needful. [15:32:34] (03CR) 10BCornwall: [C:03+1] Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [15:33:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:33:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2233.codfw.wmnet with OS bookworm [15:33:36] (03PS1) 10Alexandros Kosiaris: [DNM] Showcase atomic: false for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057907 [15:33:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm completed: - db2233 (**PAS... 
[15:34:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023610 (10Jhancock.wm) [15:40:26] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gerrit2003.codfw.wmnet with OS bookworm [15:40:32] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm executed with errors: - gerr... [15:41:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2003.codfw.wmnet with OS bookworm [15:41:26] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023643 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm [15:42:06] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10023645 (10Clement_Goubert) [15:47:33] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gerrit2003.codfw.wmnet with OS bookworm [15:47:37] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023654 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm executed with errors: - gerr... [15:48:40] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:49:09] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:51:30] (03PS2) 10Clément Goubert: Cleanup old config [puppet] - 10https://gerrit.wikimedia.org/r/1056895 (https://phabricator.wikimedia.org/T367949) [15:51:35] (03CR) 10Clément Goubert: [C:03+2] Cleanup old config [puppet] - 10https://gerrit.wikimedia.org/r/1056895 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:53:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:54:46] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:55:14] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:55:41] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add public vlan for gerrit2003 - pt1979@cumin2002" [15:56:04] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.5-1wm1_amd64.changes T339134 [15:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:09] T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134 [15:56:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add public vlan for gerrit2003 - pt1979@cumin2002" [15:56:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:57:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2003.wikimedia.org with OS bookworm [15:57:14] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.wikimedia.org with OS bookworm [16:01:39] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or 
A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_ [16:01:39] drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [16:02:36] jouncebot: nowandnext [16:02:36] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [16:02:36] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700) [16:02:36] In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700) [16:02:46] (03CR) 10Urbanecm: [C:03+2] Ignore help-links with no title configured [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: 10Michael Große) [16:04:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp- [16:04:33] drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [16:04:39] scary [16:05:05] urbanecm: Once you are done deploying could you ping me and then I can deploy? [16:05:18] Dreamy_Jazz: sure, will do [16:05:32] Dreamy_Jazz: depending on what you have, i can also sequeeze it into my scap if you want to. up2you. [16:05:47] Yet to write my patch, so don't want to hold you up. 
[16:05:51] urbanecm: keep https://phabricator.wikimedia.org/T371261 in case it's shows up [16:05:58] in mind, in case* [16:06:09] Be advised of of a.kosiaris was faster [16:06:19] akosiaris: good to know, thanks. [16:06:40] scap help works at least now [16:06:53] on deploy1003, yes it does [16:07:01] it's the other 2 hosts that are borked [16:07:12] yep. i first sshed to 1002, and it yelled at me "do not use, use 1003 instead", so i switched. [16:07:22] chances are you will be fine btw. But keep it in mind [16:07:28] yup, thanks for the headsup [16:08:11] Dreamy_Jazz: i'm literally waiting on CI, so...no problem if it'll come until CI is finished (it says 20 mins eta) [16:08:33] I'm just writing it now, so I should have it ready in time :) [16:12:23] hey, is it okay if I quickly merge a labs-only change in between? ;) [16:12:55] zabe: go for it [16:14:23] (03CR) 10Zabe: [C:03+2] Make native MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057870 (https://phabricator.wikimedia.org/T371254) (owner: 10Physikerwelt) [16:14:31] thx [16:15:13] (03Merged) 10jenkins-bot: Make native MathML rendering default in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057870 (https://phabricator.wikimedia.org/T371254) (owner: 10Physikerwelt) [16:15:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2003.wikimedia.org with reason: host reimage [16:16:01] done [16:16:30] zabe: did you pull to deploy too? [16:16:33] or should i? 
[16:16:39] pulled [16:16:43] ack, thanks [16:17:36] (03PS1) 10Dreamy Jazz: Display a GlobalBlock link to stewards in Special:CheckUser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057917 (https://phabricator.wikimedia.org/T370463) [16:17:57] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet [reason: testing ATS 9.2.5 upgrade] [16:18:46] (03CR) 10Dreamy Jazz: [C:03+2] Display a GlobalBlock link to stewards in Special:CheckUser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057917 (https://phabricator.wikimedia.org/T370463) (owner: 10Dreamy Jazz) [16:18:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2003.wikimedia.org with reason: host reimage [16:19:32] (03Merged) 10jenkins-bot: Display a GlobalBlock link to stewards in Special:CheckUser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057917 (https://phabricator.wikimedia.org/T370463) (owner: 10Dreamy Jazz) [16:19:33] urbanecm: I've created the config patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1057917 and given it a +2 [16:20:06] I don't have steward rights, so won't be able to test beyond ensuring that Special:CheckUser didn't break. [16:20:43] The relevant code is tested so I feel confident that it'll work. 
[16:21:48] FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [16:23:23] Dreamy_Jazz: ack, sounds good [16:23:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057918 [16:23:27] i can help with testing [16:23:32] :D [16:23:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: 10Michael Große) [16:29:29] :D [16:30:11] !log restart swift-proxy on ms-fe2011 T360913 [16:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:16] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [16:31:02] (03CR) 10Volans: [C:03+1] "very late post merge issue found after T371132" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:31:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023867 (10Papaul) @Marostegui checking [16:33:32] (03PS2) 10Ssingh: Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) [16:33:37] (03CR) 10CI reject: [V:04-1] Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [16:36:35] (03Merged) 10jenkins-bot: Ignore help-links with no title configured [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: 10Michael Große) [16:36:47] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1057917|Display a GlobalBlock link to stewards in Special:CheckUser (T370463 T178571)]], [[gerrit:1057001|Ignore help-links with no title configured (T370941)]] [16:36:52] progress! 
[16:36:53] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:36:54] T370463: Update CheckUser to handle global account blocks - https://phabricator.wikimedia.org/T370463 [16:36:55] T178571: Add CentralAuth and GlobalBlock links to Special:CheckUser - https://phabricator.wikimedia.org/T178571 [16:36:55] T370941: PHP Notice: Undefined index: title - https://phabricator.wikimedia.org/T370941 [16:37:01] :) [16:38:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:38:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2003.wikimedia.org with OS bookworm [16:38:47] !log urbanecm@deploy1003 dreamyjazz, migr, urbanecm: Backport for [[gerrit:1057917|Display a GlobalBlock link to stewards in Special:CheckUser (T370463 T178571)]], [[gerrit:1057001|Ignore help-links with no title configured (T370941)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:38:47] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.wikimedia.org with OS bookworm completed: - gerrit2003 (*... [16:39:11] Dreamy_Jazz: what do i need to do? [16:39:41] Load Special:CheckUser 'Get users' on any wiki and test that the result lines have a "GlobalBlock" link next to them. 
[16:39:50] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023951 (10Papaul) [16:40:49] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023952 (10Papaul) 05Open→03Resolved @Dzahn all your's [16:41:02] (03PS3) 10Ssingh: Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) [16:41:03] that works [16:41:08] (03CR) 10CI reject: [V:04-1] Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [16:41:10] i also see this https://usercontent.irccloud-cdn.com/file/Xavv47yw/image.png [16:41:17] which might be unrelated [16:41:22] It is unrelated [16:41:44] That's locally blocking (as opposed to globally) [16:41:49] gotcha [16:41:52] anyway, link was there [16:42:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023956 (10Papaul) @Marostegui serial was off on the server I set it up. We had an issue with the provision cookbook not setting the serial co we did all the servers man... [16:42:07] M​artin Urbanec globally blocked ~2024-2553 (expires: 2024-07-29 16:40:57) with the following comment: Testing block <=== and block works too :) [16:42:08] !log urbanecm@deploy1003 dreamyjazz, migr, urbanecm: Continuing with sync [16:42:10] proceeding [16:44:19] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023969 (10Papaul) @Marostegui since all those servers are on 10G when you put them in productions can you please let me know if you noticed any improvement. 
[16:44:24] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm [16:44:25] (03PS1) 10Ssingh: Release 9.2.5-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) [16:44:30] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors: -... [16:44:32] (03CR) 10CI reject: [V:04-1] Release 9.2.5-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [16:45:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [16:45:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm [16:47:44] !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1057917|Display a GlobalBlock link to stewards in Special:CheckUser (T370463 T178571)]], [[gerrit:1057001|Ignore help-links with no title configured (T370941)]] (duration: 10m 56s) [16:47:48] Dreamy_Jazz: done [16:47:56] T370463: Update CheckUser to handle global account blocks - https://phabricator.wikimedia.org/T370463 [16:47:57] :D [16:47:57] T178571: Add CentralAuth and GlobalBlock links to Special:CheckUser - https://phabricator.wikimedia.org/T178571 [16:47:57] T370941: PHP Notice: Undefined index: title - https://phabricator.wikimedia.org/T370941 [16:48:30] Dreamy_Jazz: and also i'm done with my own stuff, in case you have anything else :D [16:49:31] That was the only one I wanted to deploy [16:49:32] 
Thanks [16:50:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024032 (10Marostegui) @Papaul I am trying to reimage db2227 but it is not doing PXE boot [16:57:38] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024082 (10Marostegui) >>! In T369654#10023969, @Papaul wrote: > @Marostegui since all those servers are on 10G when you put them in productions can you please let me kn... [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700) [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700). [17:02:09] (03CR) 10RLazarus: [C:03+1] switchdc: mediawiki cache warmup now targets k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [17:02:40] (03CR) 10Dzahn: "you are right. Jelto also raised the same concern. 
not sure yet what the best fix it but at least this gets closer to what the core of the" [puppet] - 10https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: 10Dzahn) [17:04:01] (03CR) 10Dzahn: [C:03+1] "while it should not be handled via email to individual people, I'd still say +1 to this one" [puppet] - 10https://gerrit.wikimedia.org/r/1057877 (owner: 10Slyngshede) [17:05:02] (03CR) 10Dzahn: [C:03+2] gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas [puppet] - 10https://gerrit.wikimedia.org/r/1056996 (https://phabricator.wikimedia.org/T243027) (owner: 10Dzahn) [17:08:23] (03PS1) 10Elukey: sre.hosts.provision: fix dell_config_changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1057927 (https://phabricator.wikimedia.org/T365372) [17:13:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024156 (10Papaul) @Marostegui please run the cookbook this way: ` sudo cookbook sre.hosts.reimage -t T369654 --os bookworm --force-dhcp-tftp db2227 --new ` add the ---... 
[17:13:54] (03CR) 10Elukey: [C:03+2] sre.host.provision: no-op refactor to highlight DELL-specific confs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [17:14:05] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.5-1wm2_amd64.changes T339134 [17:14:06] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10024160 (10jijiki) [17:14:07] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#10024161 (10jijiki) [17:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:11] T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134 [17:14:55] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_ [17:14:55] drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [17:17:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp- [17:17:44] drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru) [17:24:28] !log 
marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm [17:24:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors: -... [17:25:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [17:25:44] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm [17:26:22] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [reason: testing ATS 9.2.5 upgrade] [17:30:42] (03Abandoned) 10Dzahn: wikistats: drop min_gb parameter from cinder volume mount [puppet] - 10https://gerrit.wikimedia.org/r/1056605 (owner: 10Dzahn) [17:33:52] (03CR) 10Dzahn: [C:03+2] site: simplify regex for doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/1056586 (owner: 10Dzahn) [17:36:18] (03CR) 10Dzahn: [C:04-1] firewall: if provider is nft and not pulling requestctl, remove confd [puppet] - 10https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: 10Dzahn) [17:37:47] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10024274 (10lmata) >>! In T370783#10023283, @hnowlan wrote: > Are we classifying "incident issue closed" as resolved? Alternatively, we'd need some intermediate state like "Stalled" or a new one, m... 
[17:39:42] (03PS1) 10Dzahn: ci: replace ferm::service with firewall::service in data_rsync [puppet] - 10https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) [17:43:31] (03PS1) 10Dzahn: zuul: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) [17:46:07] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1056603/3437/aphlict1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:46:38] (03CR) 10Dzahn: [V:03+1 C:03+2] aphlict: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:50:58] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm [17:51:10] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors: -... 
[17:51:26] !log mwmaint1002: kill extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php for enwiki (T370802) [17:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:31] T370802: Add a link (Structured task): Release as "turned off" to English Wikipedia - https://phabricator.wikimedia.org/T370802 [17:52:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [17:52:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024447 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm [17:59:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:37] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10024486 (10NBaca-WMF) Created three related tickets to track work for this: * https://phabricator.wikimedia.org/T371295 for running synthetic performanc... [18:05:40] 06SRE, 10Charts, 06serviceops, 10Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024506 (10akosiaris) >>! In T370739#10019839, @Catrope wrote: > @akosiaris I'm trying to figure out how we should proceed based on your comment. Y... 
[18:08:14] 06SRE, 06serviceops, 10Shellbox, 10Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024527 (10LGoto) [18:09:47] 06SRE, 06serviceops, 10Shellbox, 10Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024525 (10LGoto) p:05Triage→03High [18:10:03] (03PS3) 10Dzahn: aphlict: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) [18:10:23] (03CR) 10Dzahn: [V:03+1] "works now after the previous fix: https://puppet-compiler.wmflabs.org/output/1055489/3438/aphlict1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:15:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:48:40] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10024743 (10Jhancock.wm) a:03Jhancock.wm [18:49:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:39] RESOLVED: SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:58:30] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2227.codfw.wmnet with OS bookworm [18:58:37] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db22... [18:59:15] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10024815 (10jhathaway) >>! In T370789#10015615, @BCornwall wrote: > That's right! Thanks for reminding. Anyone have any qualms with going that route? seems simple and easy to change later, so +1 [18:59:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [18:59:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024816 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm [19:00:17] (03PS1) 10Jdlrobson: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) [19:00:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: 10Jdlrobson) [19:00:58] (03CR) 10CI reject: [V:04-1] Clean up 
night mode exclude namespaces and allow font size on submit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: 10Jdlrobson) [19:01:20] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10024842 (10Jhancock.wm) a:03Jhancock.wm [19:04:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:07:33] (03CR) 10JHathaway: [C:03+1] data.yaml: Extend andyrussg until the end of August. [puppet] - 10https://gerrit.wikimedia.org/r/1057877 (owner: 10Slyngshede) [19:07:52] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10024885 (10Jhancock.wm) a:03Jhancock.wm [19:09:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:11:34] 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10024896 (10jhathaway) >>! In T370783#10023283, @hnowlan wrote: > Are we classifying "incident issue closed" as resolved? Resolved maps well to our docs on resolving an incident, https://wikitech.w... 
[19:19:09] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10024917 (10Jhancock.wm) a:03Jhancock.wm [19:28:39] 06SRE, 06serviceops, 10Shellbox, 10Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024998 (10CDanis) >>! In T370739#10024506, @akosiaris wrote: > Rate limiting is broken in service-runner for a long time now. See T200374... [19:43:56] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10025093 (10thcipriani) Reason for access makes sense. Approved from my side. [19:44:28] 06SRE, 10SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10025096 (10thcipriani) [19:49:53] (03PS4) 10NMW03: Increase edit count requirement for autoconfirmed on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057379 (https://phabricator.wikimedia.org/T371186) [19:50:08] o/ [19:54:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:48] (03CR) 10DannyS712: [C:04-1] admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [19:57:50] jouncebot: next [19:57:50] In 0 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T2000) [19:59:00] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10025127 (10Dzahn) Check the VictorOps 
web UI -> rotations -> and see what time (and timezone!) is configured for the 2 rotations. (There are only 2 so not sure how... [19:59:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T2000). [20:00:04] Nemoralis, Superzerocool, ebernhardson, Gerges, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] o/ [20:00:32] i can deploy [20:01:13] \o [20:01:32] cjming: i have to restart some services between my two patches, so they could have a number of others between them [20:01:48] ebernhardson: sounds good [20:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057379 (https://phabricator.wikimedia.org/T371186) (owner: 10NMW03) [20:02:25] o/ im here if there is space for me :) [20:03:20] wop... I was late, but I'm here for a tiny deploy (IP cap lift) :) [20:03:34] no worries, it is started now [20:03:41] Superzerocool: i'll do yours next [20:03:55] yay!, thanks =) [20:04:01] cjming: do you know how can I test my patch? [20:04:20] Nemoralis: once it's ready - do you have the mwdebug extension installed? 
[20:04:30] no, no I know that [20:04:51] I am talking about testing wgAutoConfirmCount [20:05:02] oh - that idk [20:06:25] (03Merged) 10jenkins-bot: Increase edit count requirement for autoconfirmed on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057379 (https://phabricator.wikimedia.org/T371186) (owner: 10NMW03) [20:06:36] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]] [20:06:45] T371186: Change autoconfirmed requirements on English Wikivoyage - https://phabricator.wikimedia.org/T371186 [20:08:35] !log cjming@deploy1003 nmw03, cjming: Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:39] Nemoralis: not sure what to tell you about testing - your patch is up on test servers tho - shall i sync? [20:09:03] I think yes, it is not a big patch [20:09:19] cool - syncing [20:09:34] for any SRE around -- i saw this message: 2 masters had sync errors [20:09:56] https://www.irccloud.com/pastebin/Ixq1TyA8/ [20:10:02] !log cjming@deploy1003 nmw03, cjming: Continuing with sync [20:12:17] (03PS2) 10Superzerocool: enwiki, commonswiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) [20:13:57] (03CR) 10Ebernhardson: [C:03+1] Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [20:14:22] (03PS8) 10Ebernhardson: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [20:14:31] (03PS2) 10Ebernhardson: Produce a limited set of event streams on private wikis (pt 2) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) [20:15:29] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]] (duration: 08m 52s) [20:15:34] T371186: Change autoconfirmed requirements on English Wikivoyage - https://phabricator.wikimedia.org/T371186 [20:15:41] ty cjming [20:16:05] Nemoralis: i think it's live but i just saw an error [20:16:31] if any SREs are available: did the last scap backport actually work? [20:16:37] 20:15:29 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=nmw03', 'Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]]']' returned non-zero exit status 1. [20:16:57] ^^ saw this msg after saying it finished [20:17:03] weird [20:17:30] maybe you should comment this on phab task too [20:17:44] i kinda want confirmation before proceeding with the next patch [20:18:05] Nemoralis: is there a way for you to check on prod? [20:18:20] I am not sure [20:18:28] let me check if I have autoconfirmed [20:18:52] oh I don't [20:18:55] and I have 2 edits [20:18:56] wait [20:19:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm [20:19:16] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10025297 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db22... [20:19:27] brennan or thcipriani -- not sure who else to ping -- is it ok to go ahead with the backport window in spite of weird error messages i'm seeing? 
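Note on the "backport failed" text quoted at 20:16:37: it has the shape of the message that Python's `subprocess.CalledProcessError` produces when a child command exits non-zero. The sketch below only illustrates how a wrapper can print such a line after the child has already reported its own completion. It is not scap's code, and `run_step` and the stand-in child command are hypothetical.

```python
import subprocess
import sys


def run_step(argv):
    """Run a child command and return a one-line status (hypothetical helper)."""
    try:
        # check=True makes a non-zero exit status raise CalledProcessError.
        done = subprocess.run(argv, check=True, capture_output=True, text=True)
        return "ok: " + done.stdout.strip()
    except subprocess.CalledProcessError as err:
        # str(err) reads "Command '[...]' returned non-zero exit status N."
        return f"backport failed: {err}"


# Stand-in child process: reports that it finished, then exits with status 1.
child = [sys.executable, "-c", "print('Finished'); raise SystemExit(1)"]
message = run_step(child)
print(message)
```

Under this assumption, a "Finished" line followed by a failure line means the child did its work and then returned a non-zero status, which is why checking the result on the wiki was a reasonable next step.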
[20:19:54] sorry *brennen ^^ [20:20:09] i wonder if it's something to do with the switchover in deploy hosts [20:20:22] that's what i'm wondering - i'm on deploy1003 [20:20:48] after seeing a giant message on deploy1002 not to use it [20:20:58] ok I have received autoconfirmed now [20:21:03] oh good! [20:21:08] so maybe things are working [20:21:31] https://en.wikivoyage.org/wiki/Special:UserRights/Nemoralis [20:21:42] still - error messages are a bit disconcerting - not sure if it's ok to plow ahead in spite of them [20:21:46] cjming: there's a task but it's probably ok [20:21:50] Give me a minute [20:22:03] FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [20:22:41] RhinosF1: thanks - i'm going to err on the side of plowing ahead then [20:22:46] cjming: https://phabricator.wikimedia.org/T371261 [20:22:50] Is it that? [20:23:07] it looks like the same error [20:23:13] similar - i pasted above what i'm seeing [20:23:14] Go ahead then [20:23:17] cool [20:23:19] It's fine for today [20:23:31] great - thanks [20:24:22] Nemoralis: I see the API and it shows right the deploy... [20:24:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) (owner: 10Superzerocool) [20:24:47] Superzerocool: thanks! What is the api url for that? 
I couldn't find that [20:24:49] Nemoralis: https://en.wikivoyage.org/wiki/Special:ApiSandbox#action=query&format=json&meta=siteinfo&formatversion=2&siprop=autopromote [20:24:57] Superzerocool: deploying yours now [20:25:06] (03Merged) 10jenkins-bot: enwiki, commonswiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) (owner: 10Superzerocool) [20:25:17] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057033|enwiki, commonswiki: lift IP cap for edit-a-thon (T371026)]] [20:25:22] T371026: Requesting temporary lift of IP cap for 31 July 2024 - https://phabricator.wikimedia.org/T371026 [20:25:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:45] (03CR) 10Dzahn: [V:03+1 C:03+2] aphlict: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:27:13] !log cjming@deploy1003 superzerocool, cjming: Backport for [[gerrit:1057033|enwiki, commonswiki: lift IP cap for edit-a-thon (T371026)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:27:16] eberhardson: i guess yours can't go out together - so i'll do part 1, you can restart what needs restarting, and maybe do 1-2 between and resume with your part 2 when you tell me you're ready? [20:27:23] cjming: sure [20:27:29] Superzerocool: ok to sync? 
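Note on the ApiSandbox link posted at 20:24:49: the same query can be issued from a script. This is a minimal sketch, assuming the wiki serves its Action API at the usual `/w/api.php` path; the helper names and the User-Agent string are hypothetical, and the shape of the JSON response is not assumed here.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_autopromote_url(host):
    """Build the api.php equivalent of the ApiSandbox query from the log."""
    params = {
        "action": "query",
        "format": "json",
        "meta": "siteinfo",
        "formatversion": "2",
        "siprop": "autopromote",
    }
    # Assumption: the Action API endpoint is /w/api.php on this host.
    return f"https://{host}/w/api.php?" + urllib.parse.urlencode(params)


def fetch_autopromote(host):
    """Fetch and decode the siteinfo response (performs a network request)."""
    request = urllib.request.Request(
        siteinfo_autopromote_url(host),
        headers={"User-Agent": "autopromote-check-example/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


url = siteinfo_autopromote_url("en.wikivoyage.org")
print(url)
```

Calling `fetch_autopromote("en.wikivoyage.org")` would return the decoded response, which can then be inspected for the autoconfirmed conditions after a sync.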
[20:27:42] sure cjming :) [20:27:46] !log cjming@deploy1003 superzerocool, cjming: Continuing with sync [20:28:15] (03PS9) 10Ebernhardson: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [20:29:14] Gerges: are you around? [20:33:17] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057033|enwiki, commonswiki: lift IP cap for edit-a-thon (T371026)]] (duration: 07m 59s) [20:33:22] (03PS1) 10Dzahn: miscweb: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057948 (https://phabricator.wikimedia.org/T370677) [20:33:25] T371026: Requesting temporary lift of IP cap for 31 July 2024 - https://phabricator.wikimedia.org/T371026 [20:33:27] Superzerocool: guessing it's live [20:33:38] yay!, thanks cjming :) [20:33:45] yw! [20:33:51] See you wiki-people :wave: [20:33:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [20:34:08] ebernhardson: starting with your part 1 [20:34:29] cjming: I am here if I can jump the queue? 
[20:34:37] (03Merged) 10jenkins-bot: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [20:34:48] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] [20:34:53] T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 [20:35:00] (03CR) 10Dzahn: [V:03+1 C:03+2] "still getting desktop notification after this" [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:35:03] Jdlrobson: i was just about to say - i'll do yours in between Erik's since the person before you appears to be N/A [20:35:06] (03PS1) 10Dzahn: codesearch: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677) [20:36:36] (03PS1) 10Dzahn: releases: switch ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677) [20:36:37] !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:36:53] ebernhardson: should i sync? 
[20:37:13] cjming: yea [20:37:17] !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync [20:37:17] (03PS2) 10Dzahn: releases: switch ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677) [20:37:38] (03PS2) 10Jdlrobson: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) [20:38:18] (03CR) 10CI reject: [V:04-1] Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: 10Jdlrobson) [20:38:52] (03PS1) 10Dzahn: durum: switch ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057951 [20:38:57] Jdlrobson: your patch isn't passing CI - can you take a look? [20:39:21] (03PS3) 10Jdlrobson: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) [20:39:22] fixed [20:39:29] that was fast [20:39:45] 🫡 [20:41:32] (03CR) 10Dzahn: [V:03+1] "as you can see in compiler output all that happens is the ferm config file gets slightly renamed but the rules stay the same and this just" [puppet] - 10https://gerrit.wikimedia.org/r/1057951 (owner: 10Dzahn) [20:42:19] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] (duration: 07m 30s) [20:42:24] T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 [20:42:24] ebernhardson: part 1 should be live - i'll do Jon's next, then resume with your part 2 [20:42:29] kk, thanks! 
[20:42:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: 10Jdlrobson) [20:43:32] (03Merged) 10jenkins-bot: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: 10Jdlrobson) [20:43:34] Gerges: if you're around, happy to do your patches here in a bit [20:43:43] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057936|Clean up night mode exclude namespaces and allow font size on submit (T370092 T370505)]] [20:43:49] T370092: Switching editing mode from VisualEditor to source mode locks text size if it contains changed content - https://phabricator.wikimedia.org/T370092 [20:43:49] T370505: Enable dark-mode in mediawiki.org Manual namespace - https://phabricator.wikimedia.org/T370505 [20:44:08] (03PS9) 10TheDJ: Adjust CSP header for pdfs & videos & set enforce on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [20:44:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:31] (03CR) 10TheDJ: "Scheduled this again, now for july 30th." 
[puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [20:45:04] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [20:45:09] (03PS1) 10Dzahn: prometheus::ops: switch ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1057952 [20:45:28] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [20:45:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:45:54] !log cjming@deploy1003 cjming, jdlrobson: Backport for [[gerrit:1057936|Clean up night mode exclude namespaces and allow font size on submit (T370092 T370505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:45:59] Jdlrobson: on mwdebug if you'd like to test [20:46:09] wahoo! [20:46:23] LGTM please sync [20:46:28] !log cjming@deploy1003 cjming, jdlrobson: Continuing with sync [20:48:04] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [20:48:32] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1057950/3441/releases1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:48:37] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [20:52:02] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057936|Clean up night mode exclude namespaces and allow font size on submit (T370092 T370505)]] (duration: 08m 18s) [20:52:06] Jdlrobson: should be live! 
[20:52:08] T370092: Switching editing mode from VisualEditor to source mode locks text size if it contains changed content - https://phabricator.wikimedia.org/T370092 [20:52:08] T370505: Enable dark-mode in mediawiki.org Manual namespace - https://phabricator.wikimedia.org/T370505 [20:52:10] cjming: alright mine looks to be ready [20:52:18] perfect timing [20:52:38] (03CR) 10Dzahn: [V:03+1 C:03+2] "complete noop here" [puppet] - 10https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:53:12] ebernhardson: should i rebase your part 2 on parent or master? [20:53:28] cjming: master is fine [20:53:38] (03PS3) 10Ebernhardson: Produce a limited set of event streams on private wikis (pt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) [20:54:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [20:55:24] (03Merged) 10jenkins-bot: Produce a limited set of event streams on private wikis (pt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [20:55:34] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] [20:55:38] T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 [21:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T2100). 
[21:00:11] !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:00:27] ebernhardson: shall i sync part 2? [21:00:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:45] cjming: yes please [21:00:49] !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync [21:04:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:14] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] (duration: 10m 40s) [21:06:19] T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046 [21:06:20] ebernhardson: part 2 should be live! (hopefully) [21:07:07] cjming: awesome! will poke and see if it generates the new kafka topics [21:07:23] \o/ [21:07:35] cjming: sorry i missed that earlier ping. had a flight delayed and have been in transit most of today. [21:08:07] brennen: no worries! 
sorry to trouble while you're traveling - all good here [21:08:13] Gerges: last call [21:08:19] i will be closing backport window here shortly [21:09:03] !log end of UTC late backport window [21:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10025480 (10Papaul) We reiamge 5 times db2227 for some reason the server is still sending the certificate request to puppetmaster1001 ` pt1979@puppetmaster1001:~$ sudo pu... [21:25:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:38:44] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [21:39:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:40] !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1002" [21:42:40] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1002" [21:42:41] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [21:45:31] (CR) Dzahn: gerrit: use list of replicas from hiera again, don't do puppet DB lookup (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn) [21:47:29] (CR) Dzahn: "compiler fails because of the bug this is trying to fix - still" [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn) [21:48:25] (CR) Dzahn: "we need one succesful puppet run to make the hosts appear in the new puppetdb ...I'll try it on the replica" [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn) [21:50:39] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:52] (CR) Dzahn: [C:+2] gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn) [21:51:28] (CR) JHathaway: [V:+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3443/console" [puppet] - https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff) [21:54:22] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:24] (PS1) JHathaway: WIP: test pcc do not merge [puppet] - https://gerrit.wikimedia.org/r/1057967 [22:01:32] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway) [22:15:07] (PS2) JHathaway: WIP: test pcc do not merge [puppet] - https://gerrit.wikimedia.org/r/1057967 [22:16:08] (CR)
JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway) [22:23:02] cjming: also sorry to miss the ping, that is https://phabricator.wikimedia.org/T371261 and it should be OK [22:24:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:53:24] (PS21) Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [22:53:44] (CR) Clare Ming: Deploy MetricsPlatform to beta cluster (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming) [23:14:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:39] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:21] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:34:21] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057975 [23:38:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057975 (owner: TrainBranchBot) [23:39:47] (PS1) C. Scott Ananian: Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) [23:54:20] (CR) Arlolra: [C:+1] Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) (owner: C. Scott Ananian)