[00:02:13] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057416 (owner: TrainBranchBot)
[00:06:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:06:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66980 and previous config saved to /var/cache/conftool/dbconfig/20240729-022221-marostegui.json
[02:22:26] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[02:37:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P66981 and previous config saved to /var/cache/conftool/dbconfig/20240729-023728-marostegui.json
[02:39:21] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P66982 and previous config saved to /var/cache/conftool/dbconfig/20240729-025235-marostegui.json
[02:59:21] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:07:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T367856)', diff saved to https://phabricator.wikimedia.org/P66983 and previous config saved to /var/cache/conftool/dbconfig/20240729-030742-marostegui.json
[03:07:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:07:53] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[03:07:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:08:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T367856)', diff saved to https://phabricator.wikimedia.org/P66984 and previous config saved to /var/cache/conftool/dbconfig/20240729-030804-marostegui.json
[03:54:49] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2024-07-24-145137-production [deployment-charts] - https://gerrit.wikimedia.org/r/1057421 (https://phabricator.wikimedia.org/T355304)
[04:05:38] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T371190#10021355 (Marostegui)
[04:33:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:01:28] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556#10021373 (Marostegui)
[05:03:19] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#10021377 (Marostegui) Thank you Papaul!
[05:03:48] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T370556#10021374 (Marostegui) Open→Resolved a:VRiley-WMF Thanks @VRiley-WMF - the host is now looking good
[05:09:31] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#10021378 (Marostegui) Did this server get the data checksummed or cloned before repooling it back?
[05:27:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[05:32:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[05:39:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[05:44:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:18:02] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1057436 (https://phabricator.wikimedia.org/T371205)
[06:19:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T371205
[06:19:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2140 with weight 0 T371205', diff saved to https://phabricator.wikimedia.org/P66987 and previous config saved to /var/cache/conftool/dbconfig/20240729-061940-root.json
[06:19:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T371205
[06:21:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2140 from API/vslow/dump T371205', diff saved to https://phabricator.wikimedia.org/P66988 and previous config saved to /var/cache/conftool/dbconfig/20240729-062123-root.json
[06:22:22] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2140 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1057436 (https://phabricator.wikimedia.org/T371205) (owner: Gerrit maintenance bot)
[06:26:05] <wikibugs>	 (PS1) Marostegui: db2179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1057603
[06:26:40] <wikibugs>	 (CR) Marostegui: [C:+2] db2179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1057603 (owner: Marostegui)
[06:39:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: DCausse)
[06:42:27] <marostegui>	 !log Starting s4 codfw failover from db2179 to db2140 - T371205
[06:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:31] <stashbot>	 T371205: Switchover s4 master (db2179 -> db2140) - https://phabricator.wikimedia.org/T371205
[06:42:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2140 to s4 primary T371205', diff saved to https://phabricator.wikimedia.org/P66989 and previous config saved to /var/cache/conftool/dbconfig/20240729-064250-marostegui.json
[06:44:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2179 T371205', diff saved to https://phabricator.wikimedia.org/P66990 and previous config saved to /var/cache/conftool/dbconfig/20240729-064405-marostegui.json
[06:46:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Long schema change
[06:47:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Long schema change
[06:48:16] <marostegui>	 !log Deploy schema change on s4 codfw db2179 dbmaint T367856
[06:48:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:21] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T0700)
[07:00:05] <jouncebot>	 kart_ and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] <dcausse>	 o/
[07:01:17] * kart_ is here
[07:02:07] <kart_>	 I'll start with my patch.
[07:02:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057432 (owner: KartikMistry)
[07:02:54] <wikibugs>	 (Merged) jenkins-bot: Temporary disable MinT for Wikireaders for bn, fa, hi, and ko [mediawiki-config] - https://gerrit.wikimedia.org/r/1057432 (owner: KartikMistry)
[07:03:27] <logmsgbot>	 !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]]
[07:13:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] profile::haproxy: move tls_terminator.pp to profile module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1056466 (owner: Giuseppe Lavagetto)
[07:17:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] haproxy: add confd_file define [puppet] - https://gerrit.wikimedia.org/r/1056875 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[07:18:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+2] haproxy: add ability to inject requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[07:18:22] <wikibugs>	 (PS4) CDanis: haproxy: add ability to inject requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[07:18:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] haproxy: add ability to inject requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1056876 (https://phabricator.wikimedia.org/T370745) (owner: Giuseppe Lavagetto)
[07:19:09] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:19:09] <logmsgbot>	 !log kartik@deploy1002 Sync cancelled.
[07:19:31] <kart_>	 eh. Seems accidental key pressed. Retrying.
[07:19:52] <logmsgbot>	 !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]]
[07:19:52] <kart_>	 Sorry dcausse :/
[07:20:11] <dcausse>	 kart_: no worries! :)
[07:25:13] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts karapace1002.eqiad.wmnet
[07:25:45] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:25:49] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:29:57] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.dns.netbox
[07:32:30] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002"
[07:34:00] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002"
[07:34:00] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:34:00] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts karapace1002.eqiad.wmnet
[07:34:21] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts karapace1001.eqiad.wmnet
[07:34:34] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1057432|Temporary disable MinT for Wikireaders for bn, fa, hi, and ko]] (duration: 14m 42s)
[07:34:49] <kart_>	 dcausse: done!
[07:35:03] <dcausse>	 kart_: thanks! will deploy mine
[07:37:17] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 24482
[07:38:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: DCausse)
[07:39:02] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 24482
[07:39:17] <wikibugs>	 (Merged) jenkins-bot: GeoData: add pool counter settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) (owner: DCausse)
[07:39:33] <logmsgbot>	 !log dcausse@deploy1002 Started scap sync-world: Backport for [[gerrit:1055890|GeoData: add pool counter settings (T370621)]]
[07:39:39] <stashbot>	 T370621: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621
[07:39:55] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 24482
[07:40:40] <wikibugs>	 (PS1) Filippo Giunchedi: benthos: smaller batches for mw_accesslog_metrics [puppet] - https://gerrit.wikimedia.org/r/1057798 (https://phabricator.wikimedia.org/T369256)
[07:41:34] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.dns.netbox
[07:42:25] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10021585 (Fabfur) Open→In progress a:Fabfur
[07:42:46] <logmsgbot>	 !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1055890|GeoData: add pool counter settings (T370621)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:42:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10021587 (Fabfur) Open→In progress a:Fabfur
[07:44:38] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Good to go once requisites are in place" [puppet] - https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: Ayounsi)
[07:45:30] <wikibugs>	 (Abandoned) Filippo Giunchedi: ignore, test [alerts] - https://gerrit.wikimedia.org/r/1056897 (owner: Filippo Giunchedi)
[07:45:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:45:45] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] logstash: consume k8s logs topics [puppet] - https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[07:46:12] <logmsgbot>	 !log dcausse@deploy1002 dcausse: Continuing with sync
[07:46:21] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002"
[07:46:53] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3433/co" [puppet] - https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[07:47:31] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: karapace1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002"
[07:47:32] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:47:32] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts karapace1001.eqiad.wmnet
[07:48:00] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T371190#10021604 (Marostegui) p:Triage→Medium This host is probably out of warranty, but can we check if there're disks we can use somewhere? Thanks
[07:49:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:48] <wikibugs>	 (PS1) Stevemunene: idp-test: Register airflow-analytics-test IDP services [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209)
[07:51:10] <logmsgbot>	 !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1055890|GeoData: add pool counter settings (T370621)]] (duration: 11m 36s)
[07:51:14] <stashbot>	 T370621: Latency issues in search elastic clusters 2024-07-22 since 05:00 - https://phabricator.wikimedia.org/T370621
[07:51:27] <wikibugs>	 (PS1) Brouberol: karapace: cleanup after karapace100[12] were decommissioned [puppet] - https://gerrit.wikimedia.org/r/1057800 (https://phabricator.wikimedia.org/T363461)
[07:53:06] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 24482
[07:54:09] <dcausse>	 !log closing the backport window
[07:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:46] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1057800 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol)
[08:01:20] <wikibugs>	 (PS10) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[08:03:17] <wikibugs>	 (CR) Brouberol: [C:+2] karapace: cleanup after karapace100[12] were decommissioned [puppet] - https://gerrit.wikimedia.org/r/1057800 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol)
[08:09:44] <wikibugs>	 (CR) Brouberol: [C:+1] "LGTM! Will the Node have an associated Kubernetes label allowing Pods to target it specifically?" [puppet] - https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: Klausman)
[08:11:12] <wikibugs>	 (CR) Brouberol: [C:+1] "LGTM except for a small typo" [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[08:11:23] <wikibugs>	 (CR) Klausman: "That will be done by pods requiring the GPU resource (which is added by the AMDGPU role). If we find that we need stricter control, we can" [puppet] - https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) (owner: Klausman)
[08:12:34] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#10021674 (ayounsi)
[08:13:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:53] <wikibugs>	 (CR) Elukey: "Hey folks, I added Simon from I/F, please always involve somebody from I/F before merging changes to IDP :)" [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[08:31:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T367856)', diff saved to https://phabricator.wikimedia.org/P66991 and previous config saved to /var/cache/conftool/dbconfig/20240729-083115-marostegui.json
[08:31:21] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:32:15] <wikibugs>	 (PS1) Stevemunene: dns: provision airflow-analytics-test domain [dns] - https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209)
[08:35:33] <wikibugs>	 (PS1) Fabfur: geo-maps: make esams default DC for France [dns] - https://gerrit.wikimedia.org/r/1057812 (https://phabricator.wikimedia.org/T371216)
[08:38:36] <wikibugs>	 (CR) Ayounsi: [C:+1] geo-maps: make esams default DC for France [dns] - https://gerrit.wikimedia.org/r/1057812 (https://phabricator.wikimedia.org/T371216) (owner: Fabfur)
[08:41:56] <wikibugs>	 (CR) Fabfur: [C:+2] geo-maps: make esams default DC for France [dns] - https://gerrit.wikimedia.org/r/1057812 (https://phabricator.wikimedia.org/T371216) (owner: Fabfur)
[08:45:31] <wikibugs>	 (CR) Brouberol: dns: provision airflow-analytics-test domain (1 comment) [dns] - https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[08:46:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P66992 and previous config saved to /var/cache/conftool/dbconfig/20240729-084622-marostegui.json
[08:48:30] <wikibugs>	 (PS11) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[08:48:30] <wikibugs>	 (PS1) Elukey: ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356)
[08:50:02] <wikibugs>	 (PS2) Stevemunene: idp-test: Register airflow-analytics-test IDP services [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209)
[08:51:25] <wikibugs>	 (CR) Stevemunene: "Ack, thanks Luca 😊" [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[08:52:02] <wikibugs>	 (CR) CI reject: [V:-1] ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[08:52:42] <wikibugs>	 Puppet: Single member group breaks cross validation script - https://phabricator.wikimedia.org/T371221 (SLyngshede-WMF) NEW
[08:54:53] <wikibugs>	 (PS1) Slyngshede: data.yaml Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221)
[08:55:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10021879 (Fabfur) Looping in @thcipriani just for a quick confirmation that this is both for a new shell account and for adding the user to the `restricted` group
[08:55:45] <wikibugs>	 (CR) CI reject: [V:-1] data.yaml Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[08:58:04] <wikibugs>	 SRE-OnFire, Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10021886 (hnowlan) >>! In T370789#10015615, @BCornwall wrote: > That's right! Thanks for reminding. Anyone have any qualms with going that route?  Makes sense to me.
[09:01:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P66994 and previous config saved to /var/cache/conftool/dbconfig/20240729-090129-marostegui.json
[09:02:30] <wikibugs>	 (CR) Kamila Součková: [C:+1] "I actually increased them in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050367 , but looking back I'm not sure what (if any) ef" [puppet] - https://gerrit.wikimedia.org/r/1057798 (https://phabricator.wikimedia.org/T369256) (owner: Filippo Giunchedi)
[09:04:43] <wikibugs>	 (PS1) Filippo Giunchedi: rsyslog: send all k8s logs to dedicated kafka topics [puppet] - https://gerrit.wikimedia.org/r/1057819 (https://phabricator.wikimedia.org/T366710)
[09:05:18] <wikibugs>	 (PS1) Stevemunene: Add airflow-analytics-test secret [labs/private] - https://gerrit.wikimedia.org/r/1057820 (https://phabricator.wikimedia.org/T371209)
[09:05:30] <wikibugs>	 (CR) Filippo Giunchedi: "I've verified with k8s staging that logging happens as expected" [puppet] - https://gerrit.wikimedia.org/r/1057819 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[09:07:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032 investigate access denied errors', diff saved to https://phabricator.wikimedia.org/P66995 and previous config saved to /var/cache/conftool/dbconfig/20240729-090730-root.json
[09:07:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change
[09:07:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change
[09:08:03] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] benthos: smaller batches for mw_accesslog_metrics [puppet] - https://gerrit.wikimedia.org/r/1057798 (https://phabricator.wikimedia.org/T369256) (owner: Filippo Giunchedi)
[09:09:17] <wikibugs>	 (PS2) Elukey: ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356)
[09:09:17] <wikibugs>	 (PS12) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[09:09:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool 25% of es1032', diff saved to https://phabricator.wikimedia.org/P66996 and previous config saved to /var/cache/conftool/dbconfig/20240729-090953-marostegui.json
[09:11:02] <wikibugs>	 (PS3) Elukey: ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356)
[09:11:02] <wikibugs>	 (PS13) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[09:12:43] <wikibugs>	 (PS2) Stevemunene: dns: provision airflow-analytics-test domain [dns] - https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209)
[09:13:09] <wikibugs>	 (PS3) Stevemunene: dns: provision airflow-analytics-test domain [dns] - https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209)
[09:14:16] <wikibugs>	 (PS2) Slyngshede: data.yaml Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221)
[09:14:50] <wikibugs>	 (CR) CI reject: [V:-1] data.yaml Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[09:14:54] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3434/co" [puppet] - https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[09:16:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T367856)', diff saved to https://phabricator.wikimedia.org/P66997 and previous config saved to /var/cache/conftool/dbconfig/20240729-091637-marostegui.json
[09:16:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[09:16:42] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[09:16:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[09:16:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T367856)', diff saved to https://phabricator.wikimedia.org/P66998 and previous config saved to /var/cache/conftool/dbconfig/20240729-091658-marostegui.json
[09:19:15] <wikibugs>	 (PS3) Slyngshede: data.yaml Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221)
[09:22:23] <wikibugs>	 (PS4) Slyngshede: P:openldap::management Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221)
[09:22:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032 investigate access denied errors', diff saved to https://phabricator.wikimedia.org/P66999 and previous config saved to /var/cache/conftool/dbconfig/20240729-092239-root.json
[09:24:50] <wikibugs>	 (PS1) Fabfur: hiera:benthos: remove benthos from ulsfo cache hosts [puppet] - https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741)
[09:25:35] <wikibugs>	 (Abandoned) Fabfur: hiera: enable benthos on cp3066 [puppet] - https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[09:25:46] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: Fabfur)
[09:27:23] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [airflow-dags/search@7da1ef0]: search: process_sparql_query workaround oom issues
[09:27:44] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [airflow-dags/search@7da1ef0]: search: process_sparql_query workaround oom issues (duration: 00m 20s)
[09:28:40] <wikibugs>	 (PS2) Hnowlan: mesh.configuration: copypasta commit in advance of changes [deployment-charts] - https://gerrit.wikimedia.org/r/1056891 (https://phabricator.wikimedia.org/T356241)
[09:28:40] <wikibugs>	 (PS5) Hnowlan: mesh.configuration: add idle_upstream_timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241)
[09:28:40] <wikibugs>	 (PS3) Hnowlan: shellbox: use latest mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1056562 (https://phabricator.wikimedia.org/T356241)
[09:28:51] <wikibugs>	 (CR) Elukey: "Looks good, I just have a question about how pyyaml renders empty lists :)" [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[09:29:44] <wikibugs>	 (CR) Slyngshede: ldap: fix add-ldap-group script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:31:14] <Dreamy_Jazz>	 !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[09:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:-1] "LGTM but:" [deployment-charts] - https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[09:35:19] <wikibugs>	 (CR) Fabfur: "Don't know if there's any usefulness in keeping benthos references in haproxy/cache base profiles (that defaults to false anyway)..." [puppet] - https://gerrit.wikimedia.org/r/1057823 (https://phabricator.wikimedia.org/T370741) (owner: Fabfur)
[09:36:12] <wikibugs>	 (CR) Hashar: [C:+1] gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas [puppet] - https://gerrit.wikimedia.org/r/1056996 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[09:38:45] <wikibugs>	 (CR) Stevemunene: dns: provision airflow-analytics-test domain (1 comment) [dns] - https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[09:39:14] <wikibugs>	 (PS4) Elukey: ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356)
[09:39:14] <wikibugs>	 (PS14) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[09:39:25] <wikibugs>	 (CR) Elukey: ldap: fix add-ldap-group script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:42:34] <wikibugs>	 (CR) CI reject: [V:-1] ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:43:18] <wikibugs>	 (PS1) Elukey: WIP provision_server.py: add mac address to network provision script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1057826
[09:43:40] <wikibugs>	 (PS5) Slyngshede: P:openldap::management Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221)
[09:44:16] <wikibugs>	 (CR) Slyngshede: P:openldap::management Unbreak cross-validate-accounts script. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[09:44:52] <wikibugs>	 (CR) Hnowlan: "Done - used the Envoy default of 5m, which is a little steep but means no surprises should we encounter it elsewhere." [deployment-charts] - https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241) (owner: Hnowlan)
[09:44:57] <wikibugs>	 (PS6) Hnowlan: mesh.configuration: add idle_upstream_timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1056560 (https://phabricator.wikimedia.org/T356241)
[09:45:15] <wikibugs>	 (CR) Elukey: [C:+1] "Thanks for the follow up!" [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[09:46:07] <wikibugs>	 (CR) Slyngshede: "This bug was noticed due to: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056452/3/modules/admin/data/data.yaml" [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[09:47:25] <wikibugs>	 (CR) Slyngshede: ldap: fix add-ldap-group script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:48:38] <wikibugs>	 (PS5) Elukey: ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356)
[09:48:38] <wikibugs>	 (PS15) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[09:48:49] <wikibugs>	 (CR) Elukey: ldap: fix add-ldap-group script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:49:58] <wikibugs>	 (CR) Brouberol: [C:+1] dns: provision airflow-analytics-test domain [dns] - https://gerrit.wikimedia.org/r/1057805 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[09:51:39] <wikibugs>	 (PS1) Slyngshede: IDP: Switch to CAS 7.0 hosts. [dns] - https://gerrit.wikimedia.org/r/1057827 (https://phabricator.wikimedia.org/T367487)
[09:51:54] <wikibugs>	 (CR) Slyngshede: [C:+2] P:openldap::management Unbreak cross-validate-accounts script. [puppet] - https://gerrit.wikimedia.org/r/1057815 (https://phabricator.wikimedia.org/T371221) (owner: Slyngshede)
[09:55:34] <wikibugs>	 Puppet, Patch-For-Review: Single member group breaks cross validation script - https://phabricator.wikimedia.org/T371221#10022153 (SLyngshede-WMF) Open→Resolved
[09:56:18] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[09:56:47] <wikibugs>	 (PS1) Jelto: gitlab: reduce max_storage_concurrency for test instance [puppet] - https://gerrit.wikimedia.org/r/1057828 (https://phabricator.wikimedia.org/T371222)
[09:58:18] <wikibugs>	 (CR) Slyngshede: [C:+2] "Nicely spotted, thank you" [software/bitu] - https://gerrit.wikimedia.org/r/1055998 (owner: Bartosz Dziewoński)
[09:58:22] <wikibugs>	 (PS1) Clément Goubert: kubernetes: reimage 1 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1057829 (https://phabricator.wikimedia.org/T351074)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1000)
[10:07:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2441.mgmt.codfw.wmnet with reboot policy GRACEFUL
[10:11:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change
[10:11:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1032.eqiad.wmnet with reason: Long schema change
[10:12:09] <wikibugs>	 (PS1) Stevemunene: trafficserver: add airflow-analytics-test discovery record [puppet] - https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210)
[10:12:25] <godog>	 !log bounce benthos@mw_accesslog_sampler on logstash collectors
[10:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67000 and previous config saved to /var/cache/conftool/dbconfig/20240729-101348-root.json
[10:14:23] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2441.mgmt.codfw.wmnet with reboot policy GRACEFUL
[10:18:57] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234 (MatthewVernon) NEW
[10:19:18] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10022286 (MatthewVernon) p:Triage→Medium
[10:20:13] <marostegui>	 !log Deploy schema change on s7 eqiad master with replication dbmaint T370394
[10:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:18] <stashbot>	 T370394: Drop gb_by from globalblocks table - https://phabricator.wikimedia.org/T370394
[10:26:42] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wikireplicas::backend: convert to using haproxy::confd_site [puppet] - https://gerrit.wikimedia.org/r/1056937 (owner: Giuseppe Lavagetto)
[10:26:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056937 (owner: Giuseppe Lavagetto)
[10:27:03] <wikibugs>	 (CR) CI reject: [V:-1] wikireplicas::backend: convert to using haproxy::confd_site [puppet] - https://gerrit.wikimedia.org/r/1056937 (owner: Giuseppe Lavagetto)
[10:27:14] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192#10022321 (MatthewVernon) p:Triage→High
[10:27:41] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: wikireplicas::backend: convert to using haproxy::confd_site [puppet] - https://gerrit.wikimedia.org/r/1056937 (owner: Giuseppe Lavagetto)
[10:27:47] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056937 (owner: Giuseppe Lavagetto)
[10:28:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67001 and previous config saved to /var/cache/conftool/dbconfig/20240729-102853-root.json
[10:30:01] <wikibugs>	 (PS1) Alexandros Kosiaris: Add eqsin, drmrs wrongly numbered hosts to typos [mediawiki-config] - https://gerrit.wikimedia.org/r/1057832
[10:30:15] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] deployment: Switch master deployment host to deploy1003 [puppet] - https://gerrit.wikimedia.org/r/1056878 (https://phabricator.wikimedia.org/T364417) (owner: Alexandros Kosiaris)
[10:31:50] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:-1] "PCC fails when this change is applied with:" [puppet] - https://gerrit.wikimedia.org/r/1056937 (owner: Giuseppe Lavagetto)
[10:33:02] <wikibugs>	 (PS1) Alexandros Kosiaris: Switch deployment.eqiad.wmnet to deploy1003 [dns] - https://gerrit.wikimedia.org/r/1057833 (https://phabricator.wikimedia.org/T364417)
[10:34:28] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Switch deployment.eqiad.wmnet to deploy1003 [dns] - https://gerrit.wikimedia.org/r/1057833 (https://phabricator.wikimedia.org/T364417) (owner: Alexandros Kosiaris)
[10:35:37] <wikibugs>	 (PS5) Clément Goubert: mwdebug: Add logstash and otelcol config [puppet] - https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949)
[10:36:56] <wikibugs>	 (CR) Elukey: [C:+2] ldap: fix add-ldap-group script [puppet] - https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[10:37:52] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:43:08] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#10022431 (Ladsgroup) No but it had ten days of replication replayed (with RBR) and if it had issues, it would have broken replication really quickly. Also logs also said aria recovery was...
[10:43:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67002 and previous config saved to /var/cache/conftool/dbconfig/20240729-104358-root.json
[10:44:18] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#10022434 (Marostegui) Sure, that's fine (remember we don't use Aria, so in this case that can be misleading).
[10:46:36] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#10022450 (elukey) ` elukey@ldap-maint1001:~$ sudo add-ldap-group --gid 724 ops-limited successfully created group ops-limited, with gidNumber 724 and 0 members `
[10:46:38] <wikibugs>	 (CR) EoghanGaffney: [C:+1] gitlab: reduce max_storage_concurrency for test instance [puppet] - https://gerrit.wikimedia.org/r/1057828 (https://phabricator.wikimedia.org/T371222) (owner: Jelto)
[10:47:28] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10022456 (Volans)
[10:49:50] <wikibugs>	 (PS16) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[10:49:50] <wikibugs>	 (PS1) Elukey: ldap: fix log for add-ldap-group.py [puppet] - https://gerrit.wikimedia.org/r/1057835
[10:49:57] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Add eqsin, drmrs wrongly numbered hosts to typos [mediawiki-config] - https://gerrit.wikimedia.org/r/1057832 (owner: Alexandros Kosiaris)
[10:50:48] <wikibugs>	 (Merged) jenkins-bot: Add eqsin, drmrs wrongly numbered hosts to typos [mediawiki-config] - https://gerrit.wikimedia.org/r/1057832 (owner: Alexandros Kosiaris)
[10:51:51] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: reduce max_storage_concurrency for test instance [puppet] - https://gerrit.wikimedia.org/r/1057828 (https://phabricator.wikimedia.org/T371222) (owner: Jelto)
[10:53:26] <wikibugs>	 (CR) CI reject: [V:-1] ldap: fix log for add-ldap-group.py [puppet] - https://gerrit.wikimedia.org/r/1057835 (owner: Elukey)
[10:54:07] <logmsgbot>	 !log akosiaris@deploy1003 Started scap sync-world: check the deployment server after switchover
[10:56:00] <wikibugs>	 (PS1) Abijeet Patro: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455)
[10:56:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Translate] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[10:56:26] <wikibugs>	 (PS2) Elukey: ldap: fix log for add-ldap-group.py [puppet] - https://gerrit.wikimedia.org/r/1057835
[10:56:26] <wikibugs>	 (PS17) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[10:58:29] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "Let’s try it and keep an eye on Grafana: https://grafana.wikimedia.org/d/000000316/memcache" [extensions/Translate] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: Abijeet Patro)
[10:59:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67003 and previous config saved to /var/cache/conftool/dbconfig/20240729-105904-root.json
[10:59:54] <wikibugs>	 (CR) CI reject: [V:-1] ldap: fix log for add-ldap-group.py [puppet] - https://gerrit.wikimedia.org/r/1057835 (owner: Elukey)
[11:00:51] <wikibugs>	 (PS3) Elukey: ldap: fix log for add-ldap-group.py [puppet] - https://gerrit.wikimedia.org/r/1057835
[11:00:51] <wikibugs>	 (PS18) Elukey: admin: add dcops to the system adm POSIX group [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356)
[11:03:44] <wikibugs>	 (PS1) Clément Goubert: cumin: Remove mw-api aliases [puppet] - https://gerrit.wikimedia.org/r/1057841 (https://phabricator.wikimedia.org/T367949)
[11:04:41] <wikibugs>	 (CR) Ladsgroup: "Does this work for you Manuel?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057034 (owner: Zabe)
[11:04:52] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10022490 (Clement_Goubert)
[11:05:56] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1057841 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[11:06:27] <wikibugs>	 (CR) Clément Goubert: [C:+2] cumin: Remove mw-api aliases [puppet] - https://gerrit.wikimedia.org/r/1057841 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[11:14:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67004 and previous config saved to /var/cache/conftool/dbconfig/20240729-111410-root.json
[11:19:53] <wikibugs>	 (CR) Marostegui: "This works for me, we rarely touch any of this. We only interact now with db-production to set external store as RO sometimes." [mediawiki-config] - https://gerrit.wikimedia.org/r/1057034 (owner: Zabe)
[11:22:52] <wikibugs>	 (CR) Ladsgroup: [C:+1] "Then let's go" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057034 (owner: Zabe)
[11:26:36] <logmsgbot>	 !log akosiaris@deploy1003 Finished scap: check the deployment server after switchover (duration: 32m 28s)
[11:32:58] <wikibugs>	 (CR) Klausman: [C:+2] knative-serving: Switch activator to use Calico NP/k8s services (1/9) [deployment-charts] - https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: Klausman)
[11:34:21] <jinxer-wm>	 FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:36:20] <wikibugs>	 (Merged) jenkins-bot: knative-serving: Switch activator to use Calico NP/k8s services (1/9) [deployment-charts] - https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: Klausman)
[11:37:21] <wikibugs>	 (PS1) Jelto: gitlab: add missing max_concurrency value in WMCS [puppet] - https://gerrit.wikimedia.org/r/1057851 (https://phabricator.wikimedia.org/T371222)
[11:39:21] <jinxer-wm>	 RESOLVED: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:36] <wikibugs>	 (CR) Hnowlan: [C:+1] kubernetes: reimage 1 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1057829 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[11:40:39] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:41:17] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes: reimage 1 appserver to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1057829 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[11:41:57] <jinxer-wm>	 FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:42:43] * volans got paged
[11:42:49] <marostegui>	 same
[11:42:51] <eoghan>	 Looking
[11:44:03] <hnowlan>	 pods are up, but they're failing their readiness probes, service is throwing 503s 
[11:44:09] <hnowlan>	 service logs are empty
[11:44:11] * kamila_ looking 
[11:44:21] <jinxer-wm>	 RESOLVED: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:45:39] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:45:49] * Emperor got a page, are more hands needed?
[11:46:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2441 to wikikube-worker2039
[11:46:10] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:46:16] <eoghan>	 Why did it page everyone first? I would have expected me and kamila_ would get paged first before everyone. 
[11:46:43] <Emperor>	 I'll go look at VO
[11:46:53] <kamila_>	 thanks Emperor <3 
[11:46:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:45] <Emperor>	 VO says "User escalator_sysuser routed incident #4929 from SRE:SRE Business Hours (Escalation) to SRE:SRE Batphone (Escalation)" at basically the same time as the alert fired
[11:48:25] <hnowlan>	 man the error rate on recommendation-api suuuucks, 25-30% is normal 
[11:48:57] <jinxer-wm>	 FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:49:16] <claime>	 !incidents
[11:49:17] <sirenbot>	 4930 (UNACKED)  ProbeDown sre (10.2.1.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 codfw)
[11:49:17] <sirenbot>	 4929 (RESOLVED)  ProbeDown sre (10.2.2.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 eqiad)
[11:49:24] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2441 to wikikube-worker2039 - cgoubert@cumin1002"
[11:49:28] <claime>	 !ack 4930
[11:49:28] <sirenbot>	 4930 (ACKED)  ProbeDown sre (10.2.1.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 codfw)
[11:49:42] <Emperor>	 I think VO may just have messed up - AFAICT the escalation policy is correctly configured (Business hours first, then batphone after 5m)
[11:50:06] <wikibugs>	 (PS1) Klausman: charts: Version bump for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/1057852
[11:50:06] <wikibugs>	 (PS1) KartikMistry: AX: Unregister "axArticleFooterEntrypointRegistrar" hook handler [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338)
[11:50:10] <claime>	 hnowlan: what's weird is I can curl the readiness probe and get a 200
[11:51:21] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2441 to wikikube-worker2039 - cgoubert@cumin1002"
[11:51:21] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:51:21] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2039
[11:51:22] <Emperor>	 kamila_: if it's OK with you, I'll open a ticket against sre-observability about the escalation failure for this incident?
[11:51:41] <kamila_>	 Emperor: thanks, sgtm
[11:51:53] <hnowlan>	 claime: I just get "fault filter abort" when curling it 
[11:52:00] <hnowlan>	 and a 503
[11:52:21] <hnowlan>	 should we roll restart? there is very little by way of docs or logging for this 
[11:52:36] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: add missing max_concurrency value in WMCS [puppet] - https://gerrit.wikimedia.org/r/1057851 (https://phabricator.wikimedia.org/T371222) (owner: Jelto)
[11:52:38] <hnowlan>	 only thing I can think of is the service having issues connecting to mysql 
[11:52:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) (owner: KartikMistry)
[11:53:00] * kamila_ was going to suggest roll restarting, +1 hnowlan 
[11:53:22] <claime>	 hnowlan: ok i get that going through recommendation-api.discovery.wmnet:4632, but not http://10.67.148.182:9632/robots.txt
[11:53:24] <wikibugs>	 (CR) Klausman: [C:+2] charts: Version bump for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/1057852 (owner: Klausman)
[11:53:49] <kamila_>	 do we want to keep one of the bad pods around for debugging with the relabeling trick? 
[11:53:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:56:06] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10022582 (JayCano) As Máté's manager, I approve this request.
[11:56:42] <wikibugs>	 (Merged) jenkins-bot: charts: Version bump for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/1057852 (owner: Klausman)
[11:56:55] <wikibugs>	 (PS2) Anzx: dtpwiki: add timezone [mediawiki-config] - https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076)
[11:57:00] <kamila_>	 hnowlan: are you roll restarting?
[11:57:00] <wikibugs>	 SRE, SRE-OnFire, SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244 (MatthewVernon) NEW
[11:57:16] <Emperor>	 ^-- ticket re the mis-directed page
[11:57:35] <wikibugs>	 (PS2) Anzx: mywikisource: add portal, author and translation namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1057824 (https://phabricator.wikimedia.org/T371060)
[11:57:40] <hnowlan>	 kamila_: haven't yet - what's the relabelling trick? 
[11:58:42] <kamila_>	 hnowlan: https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Administration#Isolate_a_pod_from_traffic_and_deployments
[11:58:46] <kamila_>	 cc eoghan ^ 
[11:58:55] <kamila_>	 (thanks a.lex <3) 
[11:59:21] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:59:55] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:00:39] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:00:54] <hnowlan>	 kamila_: ah, cool - will do now
[12:01:47] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:01:56] <kamila_>	 thanks hnowlan <3 
[12:01:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:02:30] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync
[12:02:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync
[12:04:07] <claime>	 !incidents
[12:04:08] <sirenbot>	 4931 (ACKED)  [2x] ProbeDown sre (ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4)
[12:04:08] <sirenbot>	 4930 (RESOLVED)  ProbeDown sre (10.2.1.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 codfw)
[12:04:08] <sirenbot>	 4929 (RESOLVED)  ProbeDown sre (10.2.2.37 ip4 recommendation-api:4632 probes/service http_recommendation-api_ip4 eqiad)
[12:04:14] <hnowlan>	 still seeing 503s 
[12:04:46] <hnowlan>	 sigh, time to look at the codebase 
[12:05:02] <hnowlan>	 can we raise someone from research? 
[12:05:21] * kamila_ doesn't see anything in SAL
[12:05:46] <eoghan>	 hnowlan: Did you do/are you doing a rolling restart? 
[12:05:48] <kamila_>	 hnowlan, eoghan: let's move to -sre, for noise reduction
[12:06:04] <kamila_>	 eoghan: the scap sync above was a roll restart I assume
[12:06:17] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:06:20] <eoghan>	 Oh yes, sorry. Missed that!
[12:06:33] <kamila_>	 np, it's not obvious from the message
[12:06:41] <kamila_>	 (maybe should be fixed someday)
[12:06:42] <hnowlan>	 ack
[12:06:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:07:48] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2039
[12:07:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2441 to wikikube-worker2039
[12:08:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2039.codfw.wmnet with OS bullseye
[12:08:57] <jinxer-wm>	 FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:09:46] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022648 (Jhancock.wm)
[12:13:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2221.codfw.wmnet with OS bookworm
[12:14:16] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022664 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2221.codfw.wmnet with OS bookworm
[12:16:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[12:16:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2222.codfw.wmnet with OS bookworm
[12:16:49] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022682 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm
[12:17:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2223.codfw.wmnet with OS bookworm
[12:17:27] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022683 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2223.codfw.wmnet with OS bookworm
[12:17:30] <claime>	 spike of RU NEL
[12:17:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2224.codfw.wmnet with OS bookworm
[12:17:51] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2224.codfw.wmnet with OS bookworm
[12:17:57] <jinxer-wm>	 FIRING: ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:18:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2225.codfw.wmnet with OS bookworm
[12:18:21] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm
[12:18:28] <kamila_>	 oh come on, is my silence bad?
[12:18:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2226.codfw.wmnet with OS bookworm
[12:18:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[12:18:58] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2226.codfw.wmnet with OS bookworm
[12:19:00] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm
[12:19:12] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service recommendation-api:4632 has failed probes (http_recommendation-api_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#recommendation-api:4632 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[12:22:44] <wikibugs>	 (PS1) Klausman: charts/knative-serving: fix selector for activator netpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1057859
[12:27:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2039.codfw.wmnet with reason: host reimage
[12:27:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:29:34] <wikibugs>	 (PS1) Slyngshede: Initial 2FA support [software/bitu] - https://gerrit.wikimedia.org/r/1057862
[12:30:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2221.codfw.wmnet with reason: host reimage
[12:32:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2039.codfw.wmnet with reason: host reimage
[12:32:57] <kart_>	 I'll do early +2 for my wmf.15 backport patch (Also for probably abijeet's patch) as CI will take 20-25 minutes.
[12:33:28] <wikibugs>	 (PS3) Slyngshede: Styling: Allow the use of normal Codex tables. [software/bitu] - https://gerrit.wikimedia.org/r/1052923
[12:33:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2226.codfw.wmnet with reason: host reimage
[12:33:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2223.codfw.wmnet with reason: host reimage
[12:34:41] <wikibugs>	 (CR) Klausman: [C:+2] charts/knative-serving: fix selector for activator netpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1057859 (owner: Klausman)
[12:34:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2224.codfw.wmnet with reason: host reimage
[12:34:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[12:35:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2221.codfw.wmnet with reason: host reimage
[12:35:29] <Lucas_WMDE>	 kart_: but you’re at the end of the deployment order
[12:35:40] <wikibugs>	 (CR) Slyngshede: [C:+2] Styling: Allow the use of normal Codex tables. [software/bitu] - https://gerrit.wikimedia.org/r/1052923 (owner: Slyngshede)
[12:35:46] <godog>	 !log test benthos 4.27 on logstash1023
[12:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:54] <wikibugs>	 (PS7) Slyngshede: Permissions: Allow users to request new permissions [software/bitu] - https://gerrit.wikimedia.org/r/1052924
[12:37:40] <wikibugs>	 (CR) Slyngshede: [C:+2] Permissions: Allow users to request new permissions [software/bitu] - https://gerrit.wikimedia.org/r/1052924 (owner: Slyngshede)
[12:38:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: host reimage
[12:39:02] <kart_>	 Lucas_WMDE: OK, in that case, I can +2 at the start of the window?
[12:39:16] <Lucas_WMDE>	 yeah, IMHO that should be enough time
[12:39:16] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1057866 (https://phabricator.wikimedia.org/T371251)
[12:39:21] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1057867 (https://phabricator.wikimedia.org/T371251)
[12:40:45] <Lucas_WMDE>	 kart_: maybe abijeet’s change can be +2ed a bit before the window starts, not sure
[12:41:00] <Lucas_WMDE>	 but that one will need a bit of time to verify that everything is okay after the full deployment (can’t be tested as well on mwdebug)
[12:41:07] <Lucas_WMDE>	 so I’d like to leave some time there before your backport
[12:41:08] <wikibugs>	 (Merged) jenkins-bot: charts/knative-serving: fix selector for activator netpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1057859 (owner: Klausman)
[12:41:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[12:41:37] <kart_>	 Sure
[12:43:25] <wikibugs>	 (CR) Elukey: [C:+2] ldap: fix log for add-ldap-group.py [puppet] - https://gerrit.wikimedia.org/r/1057835 (owner: Elukey)
[12:43:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2223.codfw.wmnet with reason: host reimage
[12:45:42] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2222.codfw.wmnet with OS bookworm
[12:45:49] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022764 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm executed with errors: - db...
[12:46:54] <godog>	 !log upgrade and roll-restart benthos@mw_accesslog_sampler on logstash hosts
[12:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:07] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:47:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2226.codfw.wmnet with reason: host reimage
[12:48:45] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:51:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[12:53:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:53:44] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2039.codfw.wmnet with OS bullseye
[12:55:22] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[12:55:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2221.codfw.wmnet with OS bookworm
[12:55:30] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022773 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2221.codfw.wmnet with OS bookworm completed: - db2221 (**PAS...
[12:55:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[12:56:00] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252 (elukey) NEW
[12:56:13] <kart_>	 Seems abijeet is not around.
[12:56:48] <Lucas_WMDE>	 we can wait a bit, but I think I also feel relatively confident to deploy that backport myself
[12:57:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2222.codfw.wmnet with OS bookworm
[12:57:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[12:57:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2224.codfw.wmnet with OS bookworm
[12:57:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[12:57:31] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022798 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm
[12:57:32] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022799 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2224.codfw.wmnet with OS bookworm completed: - db2224 (**PAS...
[12:57:54] <Lucas_WMDE>	 kart_: but I guess we can +2 your backport first, then
[12:58:01] <Lucas_WMDE>	 (and let gate-and-submit run while deploying the config changes)
[12:58:09] <kart_>	 sure
[12:58:31] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Okay to deploy now (backport window is in a few minutes)." [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: Lucas Werkmeister (WMDE))
[12:58:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[12:59:06] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db2225.codfw.wmnet with OS bookworm
[12:59:32] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022800 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm completed: - db2225 (**PAS...
[12:59:35] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022801 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm executed with errors: - db...
[12:59:59] <Lucas_WMDE>	 heh, deploy1003 actually seems to be a “weaker” machine than deploy1002? (at least it has fewer nproc and RAM; haven’t looked into the exact CPU specs or anything ^^)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1300). nyaa~
[13:00:05] <jouncebot>	 Lucas_WMDE, Gerges, abijeet, and kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:08] <Lucas_WMDE>	 o/
[13:00:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:00:25] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) (owner: KartikMistry)
[13:00:29] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Enable mul language code on Wikidata (limited mode) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281)
[13:00:50] * Lucas_WMDE waits for diffConfig build
[13:01:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:01:24] <kart_>	 Lucas_WMDE: we can also +2 abijeet's change.
[13:01:26] <Lucas_WMDE>	 no diff in -labs- or in testwikidatawiki, as expected
[13:01:56] <Lucas_WMDE>	 kart_: I would wait a bit more with that
[13:01:59] <Lucas_WMDE>	 hmm, scap backport fails
[13:01:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:02:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2223.codfw.wmnet with OS bookworm
[13:02:10] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022820 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2223.codfw.wmnet with OS bookworm completed: - db2223 (**PAS...
[13:02:22] <Lucas_WMDE>	 akosiaris: I might be having issues on deploy1003… I’ll look a bit closer at it but I assume you’d be interested
[13:02:33] <akosiaris>	 Lucas_WMDE: what do you experience?
[13:02:33] <Lucas_WMDE>	 … git remote get-url origin --recursive' failed with exit code 128
[13:02:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2225.codfw.wmnet with OS bookworm
[13:02:38] <Lucas_WMDE>	 error: unknown option `recursive'
[13:02:42] <Lucas_WMDE>	 are we on an older git?
[13:02:49] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022823 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm
[13:02:50] <akosiaris>	 newer for sure, it's bullseye 
[13:02:59] <akosiaris>	 and the older hosts are buster
[13:03:06] <Lucas_WMDE>	 yup, 2.20.1 to 2.30.2
[13:03:10] <Lucas_WMDE>	 so did git remove the option? o_O
[13:03:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:03:58] <kart_>	 Oops.
[13:04:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:04:18] <jinxer-wm>	 FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[13:04:25] <Lucas_WMDE>	 hm, I see no evidence of it ever having been in Documentation/git-remote.txt
[13:04:29] <abijeet>	 hello, patch for review: 1057840: TranslatablePage: Split translatable page id cache into multiple shards | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1057840 -- I don't have rights to +2
[13:04:31] <Lucas_WMDE>	 (in git.git)
[13:04:36] <abijeet>	 patch for backport**
[13:04:44] <akosiaris>	 Lucas_WMDE: I was about to point out, I don't see it in git-remote man page either
[13:04:45] <Lucas_WMDE>	 abijeet: we’ll get to it, but currently it looks like we might not be able to deploy at all
[13:04:51] <akosiaris>	 what's that --recursive thing?
[13:04:57] <abijeet>	 Lucas_WMDE, too many patches already?
[13:04:58] <wikibugs>	 (PS1) Physikerwelt: Make native MathML rendering default in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1057870 (https://phabricator.wikimedia.org/T371254)
[13:05:09] <Lucas_WMDE>	 akosiaris: yeah, even on deploy1002 it’s not in the docs
[13:05:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[13:05:18] <Lucas_WMDE>	 it’s also not really clear to me what it would do
[13:05:22] <Lucas_WMDE>	 but scap tries to run it…
[13:05:30] * Lucas_WMDE looks at scap code
[13:05:42] <akosiaris>	 I did a scap sync-world today and didn't notice such a thing
[13:05:46] <akosiaris>	 is it scap backport ?
[13:06:29] <Lucas_WMDE>	 yes
[13:06:31] <Lucas_WMDE>	 scap/plugins/backport.py has
[13:06:34] <Lucas_WMDE>	 paths_urls = git.list_submodules_paths_urls(location, "--recursive")
[13:06:41] <Lucas_WMDE>	 and that just pastes the --recursive to the end of the git command
[13:06:55] <Lucas_WMDE>	 I think git might just have silently ignored it before?
[13:07:07] <abijeet>	 Lucas_WMDE, no rush, we can deploy it during the UTC late backport window. I'll be around.
[13:07:09] <wikibugs>	 (CR) Jelto: [C:+1] "sounds reasonable to not log changes on the staging host" [puppet] - https://gerrit.wikimedia.org/r/1056941 (owner: EoghanGaffney)
[13:07:14] <Lucas_WMDE>	 I guess I get to practice deploying without scap backport today
[13:07:23] <Lucas_WMDE>	 abijeet: I definitely have a change of my own I want to deploy though :D
[13:07:28] <Lucas_WMDE>	 we announced a date to the community and all
[13:07:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: host reimage
[13:08:10] <akosiaris>	 ah so this is probably meant for git submodule then
[13:08:20] <Lucas_WMDE>	 the --recursive appears in https://gitlab.wikimedia.org/repos/releng/scap/-/commit/f1477e7856
[13:08:21] <akosiaris>	 which does have multiple commands supporting --recursive
[13:08:26] <Lucas_WMDE>	 is jeena around by any chance?
[13:08:36] <Lucas_WMDE>	 also I guess I should definitely file a phab task
[13:08:40] <Lucas_WMDE>	 easier to paste the error output there
[13:09:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:09:18] <jinxer-wm>	 RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[13:09:18] <kart_>	 Lucas_WMDE: will the older scap way work?
[13:09:27] <Lucas_WMDE>	 kart_: I assume so
[13:09:30] <Lucas_WMDE>	 I’ll try once the phab task is filed
[13:10:11] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:10:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2226.codfw.wmnet with OS bookworm
[13:10:20] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022834 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2226.codfw.wmnet with OS bookworm completed: - db2226 (**PAS...
[13:10:36] <kart_>	 OK. We have to deploy the CX patch. It is quite an important one :/
[13:10:46] <Lucas_WMDE>	 akosiaris: T371255
[13:10:47] <stashbot>	 T371255: scap backport broken on deploy1003 (bullseye, Git 2.30) - https://phabricator.wikimedia.org/T371255
[13:11:03] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2222.codfw.wmnet with reason: host reimage
[13:11:04] <Lucas_WMDE>	 I’ll try to deploy with the old-style commands now
[13:11:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:11:37] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Deploying (manual +2 because `scap backport` is broken, T371255)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: Lucas Werkmeister (WMDE))
[13:11:50] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+2] "*actual* +2 vote lol" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: Lucas Werkmeister (WMDE))
[13:12:05] <Lucas_WMDE>	 akosiaris: would `scap backport` from deploy1002 work? or is that a terrible idea? ^^
[13:12:10] <akosiaris>	 Lucas_WMDE: yeah, it's passing --recursive to the wrong git subcommand, it should be passing --recursive to git submodule foreach
[13:12:27] <Lucas_WMDE>	 ah, and then it would just echo a bit more, okay
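A minimal sketch of the fix akosiaris describes above, assuming the goal is to list each submodule's path and origin URL. The names `build_list_command`, `parse_paths_urls` and `SNIPPET` are hypothetical and are not scap's actual code; the tab-separated output format is also just an illustrative choice. The only point is that `--recursive` is an option of `git submodule foreach`, while `git remote get-url origin` runs without it inside each submodule.

```python
from typing import Dict, List

# Shell snippet that `git submodule foreach` evaluates inside every submodule.
# $displaypath is set by git; the tab separator is an assumption of this sketch.
SNIPPET = 'printf "%s\\t%s\\n" "$displaypath" "$(git remote get-url origin)"'


def build_list_command(recursive: bool = False) -> List[str]:
    """Build the argv for listing submodule paths and origin URLs.

    --recursive is attached to `git submodule foreach`, not appended to the
    end of the command line where it would reach `git remote get-url`.
    """
    cmd = ["git", "submodule", "--quiet", "foreach"]
    if recursive:
        cmd.append("--recursive")
    cmd.append(SNIPPET)
    return cmd


def parse_paths_urls(output: str) -> Dict[str, str]:
    """Turn 'path<TAB>url' lines into a {path: url} mapping, skipping blanks."""
    result: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        path, _, url = line.partition("\t")
        result[path] = url
    return result
```

The command could then be run with something like `subprocess.run(build_list_command(recursive=True), cwd=location, capture_output=True, text=True, check=True)` and its stdout passed to `parse_paths_urls`.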
[13:12:34] <wikibugs>	 (Merged) jenkins-bot: Enable mul language code on Wikidata (limited mode) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055434 (https://phabricator.wikimedia.org/T330281) (owner: Lucas Werkmeister (WMDE))
[13:13:07] <Lucas_WMDE>	 oh jeez how do you even sync to mwdebug hosts
[13:13:16] <Lucas_WMDE>	 I guess I’ll just scap pull on one bare-metal mwdebug
[13:13:19] <Lucas_WMDE>	 and it’ll only be testable there
[13:13:23] <Lucas_WMDE>	 no idea how to do k8s-mwdebug ^^
[13:13:46] <Lucas_WMDE>	 ok pulled on mwdebug1002
[13:13:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2222.codfw.wmnet with reason: host reimage
[13:13:54] <Lucas_WMDE>	 testing…
[13:14:17] <akosiaris>	 Lucas_WMDE: it will probably work but take quite a bit of time to deploy from deploy1002. 
[13:14:25] <Lucas_WMDE>	 alright, then let’s not do that probably
[13:14:28] <Lucas_WMDE>	 thanks!
[13:15:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:15:18] <jinxer-wm>	 FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[13:16:19] <Lucas_WMDE>	 okay I think my config change is working, so let’s do a sync-world
[13:16:28] <Lucas_WMDE>	 or sync-file (does sync-file still exist? ^^)
[13:17:33] <Lucas_WMDE>	 looks like it does
[13:17:47] <akosiaris>	 yes it does
[13:17:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2225.codfw.wmnet with OS bookworm
[13:18:04] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022858 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm completed: - db2225 (**PAS...
[13:18:05] <Lucas_WMDE>	 though I guess all it does is make the rsync on ~5 remaining hosts a tiny bit faster
[13:18:30] <Lucas_WMDE>	 as I assume the image building doesn’t take the path into account
[13:20:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:20:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:20:18] <jinxer-wm>	 RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from RU) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[13:24:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Synchronized wmf-config/: Backport for [[gerrit:1055434|Enable mul language code on Wikidata (limited mode) (T330281)]] (duration: 06m 47s)
[13:24:19] <stashbot>	 T330281: MUL - Phased rollout on Wikidata.org (Stage 2 of 3: Initial limited release) - https://phabricator.wikimedia.org/T330281
[13:25:30] <wikibugs>	 (03CR) 10Jelto: gerrit: use list of replicas from hiera again, don't do puppet DB lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn)
[13:26:07] <wikibugs>	 (03PS3) 10Jelto: gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn)
[13:26:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:26:34] <Lucas_WMDE>	 alright, I think my change was deployed successfully AFAICT
[13:26:41] <Lucas_WMDE>	 so kart_ is up next once CI finishes
[13:26:53] <wikibugs>	 (03PS1) 10Klausman: charts/knative-serving: Drop selector for activator networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057872
[13:27:05] <Lucas_WMDE>	 and I think we can already +2 abijeet’s backport
[13:27:10] <Lucas_WMDE>	 unless you want to wait for tonight?
[13:27:18] <Lucas_WMDE>	 but I wouldn’t mind deploying it now
[13:28:25] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3435/console" [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn)
[13:29:10] <abijeet>	 Lucas_WMDE, fine with me
[13:29:15] <abijeet>	 let's deploy it now
[13:29:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:29:32] <kart_>	 Lucas_WMDE: Nice!
[13:29:33] <Lucas_WMDE>	 ok, then let’s +2 it and it should be merged by the time we’re done with kart_’s backport
[13:29:41] <wikibugs>	 (03Merged) 10jenkins-bot: AX: Unregister "axArticleFooterEntrypointRegistrar" hook handler [extensions/ContentTranslation] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057853 (https://phabricator.wikimedia.org/T363338) (owner: 10KartikMistry)
[13:29:47] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "+2ing ahead of deployment" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro)
[13:30:28] <Lucas_WMDE>	 kart_: your backport should be on mwdebug1002, can you test?
[13:30:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:30:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2222.codfw.wmnet with OS bookworm
[13:31:40] <kart_>	 Tricky, but let me see.
[13:32:31] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022911 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm completed: - db2222 (**PAS...
[13:32:33] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022912 (10Jhancock.wm)
[13:33:04] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet
[13:33:05] <wikibugs>	 (03CR) 10Klausman: [C:03+2] charts/knative-serving: Drop selector for activator networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057872 (owner: 10Klausman)
[13:33:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2228.codfw.wmnet with OS bookworm
[13:33:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2229.codfw.wmnet with OS bookworm
[13:33:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2230.codfw.wmnet with OS bookworm
[13:33:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2231.codfw.wmnet with OS bookworm
[13:33:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2232.codfw.wmnet with OS bookworm
[13:33:42] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2228.codfw.wmnet with OS bookworm
[13:33:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2233.codfw.wmnet with OS bookworm
[13:33:44] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2229.codfw.wmnet with OS bookworm
[13:33:49] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2230.codfw.wmnet with OS bookworm
[13:33:54] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022920 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2231.codfw.wmnet with OS bookworm
[13:33:57] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2232.codfw.wmnet with OS bookworm
[13:34:02] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm
[13:35:25] <kart_>	 Lucas_WMDE: still testing with Nik in parallel. Give me one more minute.
[13:35:57] <Lucas_WMDE>	 sure
[13:36:30] <wikibugs>	 (03Merged) 10jenkins-bot: charts/knative-serving: Drop selector for activator networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057872 (owner: 10Klausman)
[13:36:47] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet
[13:37:14] <kart_>	 Lucas_WMDE: looks good. Please go ahead
[13:39:16] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874
[13:39:25] <Lucas_WMDE>	 kart_: syncing, thanks for testing!
[13:39:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (owner: 10Alexandros Kosiaris)
[13:40:38] <wikibugs>	 (03PS1) 10Fabfur: Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904)
[13:41:11] <XioNoX>	 !log push new pfw policies - T371137
[13:41:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:24] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874
[13:41:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur)
[13:42:25] <wikibugs>	 (03PS2) 10Fabfur: Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904)
[13:42:34] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "Copied votes on follow-up patch sets have been updated:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (owner: 10Alexandros Kosiaris)
[13:42:47] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:43:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: service: Remove probes from recommendation-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (owner: 10Alexandros Kosiaris)
[13:43:31] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur)
[13:43:39] <wikibugs>	 (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur)
[13:43:54] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Extend andyrussg until the end of August. [puppet] - 10https://gerrit.wikimedia.org/r/1057877
[13:43:54] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Added mszabo to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1057876 (https://phabricator.wikimedia.org/T370904) (owner: 10Fabfur)
[13:44:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Netbox automation to move selected hosts from ASW to LSW - https://phabricator.wikimedia.org/T370846#10022944 (10cmooney)
[13:44:50] <wikibugs>	 (03PS1) 10DCausse: wdqs: configure internal federation between main and scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1057878
[13:44:56] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[13:45:27] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874
[13:45:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Synchronized php-1.43.0-wmf.15/extensions/ContentTranslation/extension.json: Backport for [[gerrit:1057853|AX: Unregister "axArticleFooterEntrypointRegistrar" hook handler (T363338)]] (duration: 06m 36s)
[13:45:58] <stashbot>	 T363338:  MinT for Wiki Readers MVP: Access from the footer of an article - https://phabricator.wikimedia.org/T363338
[13:46:02] <Lucas_WMDE>	 kart_: should be deployed everywhere now
[13:46:11] <kart_>	 cool. Thanks a lot Lucas_WMDE
[13:46:15] <Lucas_WMDE>	 np
[13:46:27] <Lucas_WMDE>	 up next, abijeet, once CI finishes
[13:46:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm
[13:46:39] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10022952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db...
[13:46:39] <Lucas_WMDE>	 and I haven’t seen Gerges yet (but I don’t mind if there’s less to deploy while scap backport is broken ^^)
[13:46:59] <Gerges>	 Here
[13:47:09] <wikibugs>	 06SRE, 10Continuous-Integration-Infrastructure, 06Infrastructure-Foundations, 06Release-Engineering-Team: package_builder python-all conflicts with base::standard_packages python2.7 removal - https://phabricator.wikimedia.org/T370337#10022948 (10hashar) 05Open→03Resolved a:03hashar I have solved...
[13:47:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2228.codfw.wmnet with reason: host reimage
[13:47:19] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1240 - jclark@cumin1002"
[13:47:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2230.codfw.wmnet with reason: host reimage
[13:47:37] <Lucas_WMDE>	 Gerges: alright, we’ll see if we still have time at the end of the window
[13:47:55] <Gerges>	 Ok
[13:47:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2232.codfw.wmnet with reason: host reimage
[13:48:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1240 - jclark@cumin1002"
[13:48:13] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:48:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2229.codfw.wmnet with reason: host reimage
[13:48:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[13:48:32] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10022967 (10Fabfur) The user should be now part of the required group(s), please test it and let me know if anything doesn't work as expected!
[13:48:57] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[13:49:15] <wikibugs>	 (03CR) 10Herron: [C:03+1] "🧹🧼" [puppet] - 10https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[13:49:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED
[13:49:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2228.codfw.wmnet with reason: host reimage
[13:50:40] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10022968 (10Fabfur) p:05Triage→03Low
[13:52:00] <wikibugs>	 06SRE, 10conftool, 06Infrastructure-Foundations, 10Puppet-Infrastructure: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252#10022989 (10Volans)
[13:52:06] <wikibugs>	 (03PS2) 10Klausman: charts/knative-serving: Re-add app selector in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057879
[13:52:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2232.codfw.wmnet with reason: host reimage
[13:52:11] <wikibugs>	 (03CR) 10Klausman: [C:03+2] charts/knative-serving: Re-add app selector in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057879 (owner: 10Klausman)
[13:53:00] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: service: Remove probes from recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (https://phabricator.wikimedia.org/T338471)
[13:53:02] * Lucas_WMDE has now installed P8845 on a laptop that previously never needed it thanks to scap backport ^^
[13:53:27] <Lucas_WMDE>	 (ok, no stashbot – that’s https://phabricator.wikimedia.org/P8845, `backport-summary` script to generate the message for the scap sync-file)
[13:53:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] service: Remove probes from recommendation-api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris)
[13:53:59] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] "Comments addressed, got a +1 already, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1057874 (https://phabricator.wikimedia.org/T338471) (owner: 10Alexandros Kosiaris)
[13:54:10] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks Daniel!  Overall LGTM thanks for taking a look... the only worry I would have is are we in danger of removing confd for some roles " [puppet] - 10https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: 10Dzahn)
[13:54:16] <abijeet>	 Lucas_WMDE, 3-4 minutes remaining hopefully.
[13:54:44] * Lucas_WMDE nods
[13:54:46] <Lucas_WMDE>	 jouncebot: next
[13:54:46] <jouncebot>	 In 1 hour(s) and 35 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1530)
[13:54:57] <Lucas_WMDE>	 no other window we’re about to run into at least
[13:55:21] <wikibugs>	 (03Merged) 10jenkins-bot: charts/knative-serving: Re-add app selector in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057879 (owner: 10Klausman)
[13:55:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2230.codfw.wmnet with reason: host reimage
[13:55:46] <wikibugs>	 06SRE, 10Continuous-Integration-Infrastructure, 10observability, 13Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089#10022995 (10hashar) I must have declined this as part of a task triage since I usually leave a comment when...
[13:56:33] <claime>	 !log homer 'cr*codfw*' commit 'T351074'
[13:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:37] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:57:02] <wikibugs>	 (03Merged) 10jenkins-bot: TranslatablePage: Split translatable page id cache into multiple shards [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057840 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro)
[13:57:38] <Lucas_WMDE>	 abijeet: I pulled the change to mwdebug1002, anything to test there?
[13:57:45] <Lucas_WMDE>	 or should we just sync it everywhere and be ready to revert?
[13:57:56] <effie>	 I am around as well
[13:58:06] <Lucas_WMDE>	 hi effie :)
[13:58:14] <effie>	 :)
[13:58:22] <jnuche>	 Lucas_WMDE: I've created a scap release with the backport fix, let me know when I can deploy it
[13:58:44] <Lucas_WMDE>	 jnuche: I’m not scap’ing right now, I think you could do it now
[13:58:50] <Lucas_WMDE>	 I’m assuming it doesn’t take ages ^^
[13:58:59] <jnuche>	 nope, should be fast
[13:59:04] <jnuche>	 gonna do it then
[13:59:08] <Lucas_WMDE>	 alright, thanks!
[13:59:11] <Lucas_WMDE>	 and then I can try it out right afterwards
[13:59:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[13:59:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: host reimage
[13:59:22] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.94.0" for 211 hosts
[13:59:31] <wikibugs>	 (03PS1) 10Herron: grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1057882 (https://phabricator.wikimedia.org/T269333)
[14:00:00] <wikibugs>	 (03PS1) 10Klausman: charts/knative-serving: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057883
[14:00:12] <wikibugs>	 (03CR) 10Klausman: [C:03+2] charts/knative-serving: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057883 (owner: 10Klausman)
[14:00:15] <abijeet>	 Lucas_WMDE, checking
[14:00:38] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.94.0" for 210 hosts
[14:00:47] <Gerges>	 ping
[14:01:12] <logmsgbot>	 !log jnuche@deploy1003 Installation of scap version "4.94.0" completed for 210 hosts
[14:01:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED
[14:01:28] <Lucas_WMDE>	 Gerges: I don’t think we’ll have time for your config changes in this window, sorry
[14:01:31] <abijeet>	 Lucas_WMDE, we can monitor this: https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?orgId=1&var-kClass=pagetranslation&from=now-1h&to=now
[14:01:34] <Lucas_WMDE>	 we’ve had some problem with the deployment system
[14:01:43] <jnuche>	 Lucas_WMDE: done, hopefully the problem is fixed now!
[14:01:48] <Lucas_WMDE>	 \o/
[14:01:49] <Lucas_WMDE>	 let’s try it
[14:01:50] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1240.eqiad.wmnet with OS bullseye
[14:01:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10023061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bull...
[14:02:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1057840|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]]
[14:02:14] <Lucas_WMDE>	 looking good so far :)
[14:02:18] <Gerges>	 Well no problem
[14:02:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: burrow: restart on failure [puppet] - 10https://gerrit.wikimedia.org/r/1057886 (https://phabricator.wikimedia.org/T366573)
[14:02:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2231.codfw.wmnet with reason: host reimage
[14:03:19] <wikibugs>	 (03Merged) 10jenkins-bot: charts/knative-serving: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057883 (owner: 10Klausman)
[14:04:05] <abijeet>	 Lucas_WMDE, looks good.
[14:04:21] <Lucas_WMDE>	 alright
[14:04:32] <Lucas_WMDE>	 (scap backport is running now btw)
[14:06:01] <wikibugs>	 (03PS1) 10Klausman: charts/knative-serving: move selector to Egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057887
[14:06:06] <wikibugs>	 (03CR) 10Klausman: [C:03+2] charts/knative-serving: move selector to Egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057887 (owner: 10Klausman)
[14:06:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:07:45] <Gerges>	 Does scap backport work now, or should I wait for the late backport window?
[14:07:48] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2039.codfw.wmnet),cluster=kubernetes,service=kubesvc [reason: Pooling and uncordoning - T351074]
[14:07:53] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[14:08:02] <Lucas_WMDE>	 it works now, but I’m not going to start another deployment after this one, as the window is already over
[14:08:17] <Lucas_WMDE>	 but there’s no known blocker for deploying this in the evening window later (assuming someone else is around to do it)
[14:08:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1057882 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron)
[14:09:10] <Gerges>	 ):
[14:09:16] <wikibugs>	 (03Merged) 10jenkins-bot: charts/knative-serving: move selector to Egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057887 (owner: 10Klausman)
[14:09:17] <SandraEbele_>	 !log rerunning airflow mediawiki_history_check_denormalize dag as downstream task after rerunning mediawiki_history_denormalize dag
[14:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:30] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T371260 (10Clement_Goubert) 03NEW
[14:09:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[14:09:35] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T371260#10023126 (10Clement_Goubert) p:05Triage→03Low
[14:09:36] <Lucas_WMDE>	 feels like docker_pull_k8s is taking unusually long
[14:10:31] <Lucas_WMDE>	 (on the previous deployments they took 16/20 seconds)
[14:11:03] <claime>	 Hmm I hope that's not because I just pooled a node
[14:11:35] <Lucas_WMDE>	 > ImportError: cannot import name 'cli' from 'scap' (unknown location)
[14:11:36] <Lucas_WMDE>	 o_O
[14:11:40] <wikibugs>	 (03CR) 10Herron: [C:03+2] grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1057882 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron)
[14:11:46] <Lucas_WMDE>	 2 masters had sync errors
[14:11:52] <claime>	 Huh yeah that's not me
[14:11:53] <Lucas_WMDE>	 (deploy1002 and deploy2002, I think?)
[14:12:13] <Lucas_WMDE>	 claime: the build-and-push-container-images also took 4m13s, idk if that makes it more or less likely to be related to the pooled node?
[14:12:26] <Lucas_WMDE>	 feels like “bigger image diff” to me but idk why that would be
[14:12:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:12:32] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10023129 (10Clement_Goubert) p:05Triage→03Low
[14:12:43] <claime>	 Lucas_WMDE: nodes are not involved in build-and-push
[14:12:53] <Lucas_WMDE>	 ok
[14:13:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:13:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2228.codfw.wmnet with OS bookworm
[14:13:32] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2228.codfw.wmnet with OS bookworm completed: - db2228 (**PAS...
[14:13:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Backport for [[gerrit:1057840|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:13:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, abi: Continuing with sync
[14:14:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[14:14:32] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm
[14:15:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:15:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2232.codfw.wmnet with OS bookworm
[14:15:18] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023153 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2232.codfw.wmnet with OS bookworm completed: - db2232 (**PAS...
[14:15:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:19:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:19:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2230.codfw.wmnet with OS bookworm
[14:19:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:19:13] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023182 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2230.codfw.wmnet with OS bookworm completed: - db2230 (**PAS...
[14:20:30] <abijeet>	 noticing a similar spike in traffic again... will monitor for some more time.
[14:20:42] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:20:51] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:20:52] <Lucas_WMDE>	 oh dear
[14:21:03] <Lucas_WMDE>	 ohhhh yeah TX bandwidth is going up
[14:21:23] <Lucas_WMDE>	 (k8s deployment is done btw, scap is just finishing up)
[14:21:32] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:21:33] <Lucas_WMDE>	 that sync-masters error is now tracked at T371261 btw
[14:21:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1057840|TranslatablePage: Split translatable page id cache into multiple shards (T366455)]] (duration: 19m 24s)
[14:21:37] <stashbot>	 T371261: scap broken on deploy1002 / deploy2002 (buster) - https://phabricator.wikimedia.org/T371261
[14:21:58] <Lucas_WMDE>	 scap returned non-zero exit status… I assume that’s because of the sync-masters
[14:22:07] * Lucas_WMDE scrolls up
[14:22:18] <Lucas_WMDE>	 yeah I don’t see any other errors in the output
[14:22:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:22:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2231.codfw.wmnet with OS bookworm
[14:22:54] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023205 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2231.codfw.wmnet with OS bookworm completed: - db2231 (**PAS...
[14:23:07] <effie>	 Lucas_WMDE: lets give it ~10' and revert 
[14:23:11] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:23:14] <Lucas_WMDE>	 ok
[14:23:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:23:43] <Lucas_WMDE>	 or, maybe: let’s upload and +2 the revert now, and leave ourselves the option to abort the merge if we decide not to revert after all?
[14:23:55] <Lucas_WMDE>	 though that would make it more than 10 minutes before the revert merges normally
[14:24:04] <herron>	 !log the grafana default datasource has been changed from graphite to thanos T269333
[14:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:11] <stashbot>	 T269333: Switch default Grafana datasource to Thanos - https://phabricator.wikimedia.org/T269333
[14:24:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:24:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2229.codfw.wmnet with OS bookworm
[14:24:44] <wikibugs>	 (03PS1) 10Klausman: charts/knative-serving: remove selector again [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057892
[14:24:49] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2229.codfw.wmnet with OS bookworm completed: - db2229 (**PAS...
[14:25:09] <effie>	 Lucas_WMDE: based on last week, it is unlikely we will not revert :)
[14:25:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:25:24] <Lucas_WMDE>	 yeah
[14:25:59] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Revert "TranslatablePage: Split translatable page id cache into multiple shards" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455)
[14:26:07] <Lucas_WMDE>	 effie, abijeet: ^
[14:26:26] <abijeet>	 ack
[14:26:38] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023220 (10Jhancock.wm)
[14:26:50] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "Let’s start gate-and-submit while we continue to look at Grafana for a bit; if we decide to deploy the revert, we might force-merge this b" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE))
[14:26:54] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Revert "TranslatablePage: Split translatable page id cache into multiple shards" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE))
[14:27:24] <wikibugs>	 06SRE, 10conftool, 06Infrastructure-Foundations, 10Puppet-Infrastructure: conftool and pyparsing requirements - https://phabricator.wikimedia.org/T371252#10023233 (10elukey) p:05Triage→03Medium
[14:28:43] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023235 (10Jhancock.wm) 27 and 33 having some issues. will check again shortly.  27 is not connecting to the right puppet hosts.   > Generated Puppet certificate > [1/10...
[14:28:45] <abijeet>	 Lets give it another 2 minutes, and then we can revert it.
[14:28:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10023237 (10cmooney) p:05Triage→03Medium
[14:29:07] <Lucas_WMDE>	 ack
[14:30:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10023247 (10cmooney) p:05Triage→03Medium
[14:30:29] <wikibugs>	 (03CR) 10Klausman: [C:03+2] charts/knative-serving: remove selector again [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057892 (owner: 10Klausman)
[14:31:31] <Lucas_WMDE>	 the lines are going up and down a bit but to me they don’t look like they’re settling down to a reasonable level
[14:31:34] <Lucas_WMDE>	 let’s revert?
[14:33:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE))
[14:33:07] <abijeet>	 Lucas_WMDE, yea I was hoping they'd keep going down, but that doesn't appear to be happening
[14:33:15] <abijeet>	 Lets revert
[14:33:21] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "force-merging the revert" [extensions/Translate] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057893 (https://phabricator.wikimedia.org/T366455) (owner: 10Lucas Werkmeister (WMDE))
[14:33:24] <sukhe>	 !log A:wikidough: debdeploy upgrade anycast-hc to 0.9.8
[14:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" (T366455)]]
[14:33:39] <Lucas_WMDE>	 alright, merged, now scap is running again
[14:33:43] <wikibugs>	 (03Merged) 10jenkins-bot: charts/knative-serving: remove selector again [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057892 (owner: 10Klausman)
[14:33:52] <sukhe>	 !log A:wikidough: debdeploy upgrade anycast-hc to 0.9.8: T370068
[14:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:57] <stashbot>	 T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068
[14:34:52] <sukhe>	 !log sudo cumin -b1 -s120 'O:wikidough' 'run-puppet-agent'
[14:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:35:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[14:36:27] <Lucas_WMDE>	 (building and pulling the image was much faster again this time, btw)
[14:37:32] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:37:40] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10023283 (10hnowlan) Are we classifying "incident issue closed" as resolved?
[14:38:02] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#10023279 (10ayounsi) 05Open→03Stalled p:05High→03Low
[14:39:11] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:39:18] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023287 (10Marostegui) @Jhancock.wm I've fixed db2227's certificate issues. Puppet finished correctly. I am going to reimage it again and see if it works fine this time....
[14:39:21] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:31] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#10023289 (10Volans) a:05Volans→03None
[14:40:25] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#10023290 (10elukey)
[14:41:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1057893|Revert "TranslatablePage: Split translatable page id cache into multiple shards" (T366455)]] (duration: 07m 58s)
[14:41:49] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website - https://phabricator.wikimedia.org/T357756#10023295 (10joanna_borun) p:05High→03Medium
[14:42:41] <Lucas_WMDE>	 memcached looks fine again to me
[14:43:24] <effie>	 cheers thank you 
[14:45:06] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:45:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:22] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: review irc grammar ergonomics - https://phabricator.wikimedia.org/T370786#10023319 (10hnowlan) One of the big challenges I can see here is the use of compound words - currently we use lazy names like incident-create and incident-list because adding a verb and subverbs w...
[14:46:41] <Lucas_WMDE>	 and IMHO someone™ should look at T371261 – we probably either need to install the older scap version there(?) or remove them from some masters list so that scap@deploy1003 won’t try to deploy to them anymore
[14:46:42] <stashbot>	 T371261: scap broken on deploy1002 / deploy2002 (buster) - https://phabricator.wikimedia.org/T371261
[14:47:06] <Lucas_WMDE>	 CC akosiaris and jnuche for ^
[14:48:10] <akosiaris>	 Lucas_WMDE: ehm, what? that's new
[14:48:43] <wikibugs>	 06SRE, 10Acme-chief, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799#10023337 (10SLyngshede-WMF) a:03SLyngshede-WMF
[14:49:02] <akosiaris>	 I am not sure how to rollback scap tbh. And while I'll remove deploy1002 within the week, and upgrade deploy2002 to bullseye
[14:49:15] <akosiaris>	 I am not sure what is going on there
[14:49:35] <akosiaris>	 ah dammit python versions
[14:49:36] <akosiaris>	 sigh
[14:49:37] <Lucas_WMDE>	 to me it looks like an issue with some other commit that was included in the new release
[14:49:38] <Lucas_WMDE>	 yeah
[14:49:58] <Lucas_WMDE>	 or is it just using a different python version of where the package was built, maybe
[14:50:03] <Lucas_WMDE>	 I’m not seeing anything obvious in the git log at least
[14:52:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:52:16] <wikibugs>	 (03PS3) 10Ottomata: mediawiki.org - Apache rewrite /beacon/event -> /w/beacon/event.php [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817)
[14:52:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:53:13] <wikibugs>	 (03CR) 10Ottomata: "Okay, ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[14:54:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2233.codfw.wmnet with OS bookworm
[14:54:08] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023428 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm executed with errors: - db...
[14:56:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10023429 (10cmooney) 05Open→03Resolved Gonna close this one, I see hosts have been assigned to the new range and are reachable ` cmo...
[14:56:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2233.codfw.wmnet with OS bookworm
[14:56:21] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023437 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm
[14:56:37] <wikibugs>	 (03PS4) 10Ottomata: mediawiki.org - Rewrite /beacon/event -> EventLogging rest handler [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817)
[14:57:08] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM to align with the configuration of mwdebug1001." [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[14:57:40] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234#10023444 (10Jhancock.wm) a:03Jhancock.wm
[14:57:42] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:58:14] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm
[14:58:25] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db...
[14:58:35] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10023447 (10mszabo) 05In progress→03Resolved
[14:58:42] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Máté Szabó - https://phabricator.wikimedia.org/T370904#10023448 (10mszabo) Thanks, looks good!
[14:58:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2003.codfw.wmnet with OS bookworm
[14:58:52] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023449 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm
[14:59:21] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:59:21] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:53] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mwdebug: Add logstash and otelcol config [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[15:02:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[15:02:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:02:14] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023455 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm
[15:03:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] mediawiki.org - Rewrite /beacon/event -> EventLogging rest handler [puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[15:04:56] <jnuche>	 Lucas_WMDE: I think that scap issue comes from the scap installer/self-updater
[15:05:09] <jnuche>	 it's not critically urgent but I'll take a look soon
[15:05:14] <jnuche>	 (ish)
[15:06:07] <Lucas_WMDE>	 ok, thanks!
[15:08:18] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023467 (10Papaul) @Marostegui 2233 was a switch port issue so it should be fix now. @Jhancock.wm started the re -image already on it  Cookbook cookbooks.sre.hosts.reima...
[15:09:39] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [reason: upgrading anycast-hc: T370068]
[15:09:44] <stashbot>	 T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068
[15:10:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2233.codfw.wmnet with reason: host reimage
[15:10:56] <sukhe>	 !log [dns2006] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068
[15:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:51] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#10023498 (10SLyngshede-WMF) 05In progress→03Resolved Closing this task, I've created https://phabricator.wikimedia.org/T371270 for the issues r...
[15:12:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:13:05] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [reason: finished upgrading anycast-hc: T370068]
[15:13:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2233.codfw.wmnet with reason: host reimage
[15:14:05] <sukhe>	 !log running authdns-update after dns2006 depool
[15:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:12] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023508 (10Marostegui) @Papaul can you also check db2227? It is not rebooting after I issued the reimage cookbook. The idrac screen is also blank so I cannot see where i...
[15:16:23] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:16:49] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:17:58] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:18:05] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:18:29] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1240.eqiad.wmnet with OS bullseye
[15:18:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10023511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1240.eqiad.wmnet with OS bullseye...
[15:22:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:23:38] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:23:46] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:27:08] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#10023539 (10Clement_Goubert) 05In progress→03Resolved
[15:29:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T371100#10023591 (10phaultfinder)
[15:29:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:30:04] <jouncebot>	 jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1530). Please do the needful.
[15:32:34] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh)
[15:33:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:33:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2233.codfw.wmnet with OS bookworm
[15:33:36] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: [DNM] Showcase atomic: false for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057907
[15:33:40] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm completed: - db2233 (**PAS...
[15:34:01] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023610 (10Jhancock.wm)
[15:40:26] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gerrit2003.codfw.wmnet with OS bookworm
[15:40:32] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm executed with errors: - gerr...
[15:41:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2003.codfw.wmnet with OS bookworm
[15:41:26] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023643 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm
[15:42:06] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10023645 (10Clement_Goubert)
[15:47:33] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gerrit2003.codfw.wmnet with OS bookworm
[15:47:37] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023654 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.codfw.wmnet with OS bookworm executed with errors: - gerr...
[15:48:40] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:49:09] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:51:30] <wikibugs>	 (03PS2) 10Clément Goubert: Cleanup old config [puppet] - 10https://gerrit.wikimedia.org/r/1056895 (https://phabricator.wikimedia.org/T367949)
[15:51:35] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Cleanup old config [puppet] - 10https://gerrit.wikimedia.org/r/1056895 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert)
[15:53:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:54:46] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:55:14] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:55:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add public vlan for gerrit2003 - pt1979@cumin2002"
[15:56:04] <sukhe>	 !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.5-1wm1_amd64.changes T339134
[15:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:09] <stashbot>	 T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134
[15:56:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add public vlan for gerrit2003 - pt1979@cumin2002"
[15:56:35] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:57:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2003.wikimedia.org with OS bookworm
[15:57:14] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2003.wikimedia.org with OS bookworm
[16:01:39] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_
[16:01:39] <logmsgbot>	 drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru)
[16:02:36] <urbanecm>	 jouncebot: nowandnext
[16:02:36] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[16:02:36] <jouncebot>	 In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700)
[16:02:36] <jouncebot>	 In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700)
[16:02:46] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Ignore help-links with no title configured [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: 10Michael Große)
[16:04:33] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-
[16:04:33] <logmsgbot>	 drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru)
[16:04:39] <sukhe>	 scary
[16:05:05] <Dreamy_Jazz>	 urbanecm: Once you are done deploying could you ping me and then I can deploy?
[16:05:18] <urbanecm>	 Dreamy_Jazz: sure, will do
[16:05:32] <urbanecm>	 Dreamy_Jazz: depending on what you have, i can also squeeze it into my scap if you want to. up2you.
[16:05:47] <Dreamy_Jazz>	 Yet to write my patch, so don't want to hold you up.
[16:05:51] <akosiaris>	 urbanecm: keep https://phabricator.wikimedia.org/T371261 in case it's shows up
[16:05:58] <akosiaris>	 in mind, in case*
[16:06:09] <claime>	 Be advised of of a.kosiaris was faster
[16:06:19] <urbanecm>	 akosiaris: good to know, thanks.
[16:06:40] <urbanecm>	 scap help works at least now
[16:06:53] <akosiaris>	 on deploy1003, yes it does
[16:07:01] <akosiaris>	 it's the other 2 hosts that are borked
[16:07:12] <urbanecm>	 yep. i first sshed to 1002, and it yelled at me "do not use, use 1003 instead", so i switched.
[16:07:22] <akosiaris>	 chances are you will be fine btw. But keep it in mind
[16:07:28] <urbanecm>	 yup, thanks for the headsup
[16:08:11] <urbanecm>	 Dreamy_Jazz: i'm literally waiting on CI, so...no problem if it'll come until CI is finished (it says 20 mins eta)
[16:08:33] <Dreamy_Jazz>	 I'm just writing it now, so I should have it ready in time :)
[16:12:23] <zabe>	 hey, is it okay if I quickly merge a labs-only change in between? ;)
[16:12:55] <urbanecm>	 zabe: go for it
[16:14:23] <wikibugs>	 (CR) Zabe: [C:+2] Make native MathML rendering default in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1057870 (https://phabricator.wikimedia.org/T371254) (owner: Physikerwelt)
[16:14:31] <zabe>	 thx
[16:15:13] <wikibugs>	 (Merged) jenkins-bot: Make native MathML rendering default in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1057870 (https://phabricator.wikimedia.org/T371254) (owner: Physikerwelt)
[16:15:26] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2003.wikimedia.org with reason: host reimage
[16:16:01] <zabe>	 done
[16:16:30] <urbanecm>	 zabe: did you pull to deploy too?
[16:16:33] <urbanecm>	 or should i?
[16:16:39] <zabe>	 pulled
[16:16:43] <urbanecm>	 ack, thanks
[16:17:36] <wikibugs>	 (PS1) Dreamy Jazz: Display a GlobalBlock link to stewards in Special:CheckUser [mediawiki-config] - https://gerrit.wikimedia.org/r/1057917 (https://phabricator.wikimedia.org/T370463)
[16:17:57] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet [reason: testing ATS 9.2.5 upgrade]
[16:18:46] <wikibugs>	 (CR) Dreamy Jazz: [C:+2] Display a GlobalBlock link to stewards in Special:CheckUser [mediawiki-config] - https://gerrit.wikimedia.org/r/1057917 (https://phabricator.wikimedia.org/T370463) (owner: Dreamy Jazz)
[16:18:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2003.wikimedia.org with reason: host reimage
[16:19:32] <wikibugs>	 (Merged) jenkins-bot: Display a GlobalBlock link to stewards in Special:CheckUser [mediawiki-config] - https://gerrit.wikimedia.org/r/1057917 (https://phabricator.wikimedia.org/T370463) (owner: Dreamy Jazz)
[16:19:33] <Dreamy_Jazz>	 urbanecm: I've created the config patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1057917 and given it a +2
[16:20:06] <Dreamy_Jazz>	 I don't have steward rights, so won't be able to test beyond ensuring that Special:CheckUser didn't break.
[16:20:43] <Dreamy_Jazz>	 The relevant code is tested so I feel confident that it'll work.
[16:21:48] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[16:23:23] <urbanecm>	 Dreamy_Jazz: ack, sounds good
[16:23:26] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1057918
[16:23:27] <urbanecm>	 i can help with testing
[16:23:32] <Dreamy_Jazz>	 :D
[16:23:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: Michael Große)
[16:29:29] <Lucas_WMDE>	 :D
[16:30:11] <Emperor>	 !log restart swift-proxy on ms-fe2011 T360913
[16:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:16] <stashbot>	 T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[16:31:02] <wikibugs>	 (CR) Volans: [C:+1] "very late post merge issue found after T371132" [cookbooks] - https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[16:31:18] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023867 (Papaul) @Marostegui checking
[16:33:32] <wikibugs>	 (PS2) Ssingh: Release 9.2.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134)
[16:33:37] <wikibugs>	 (CR) CI reject: [V:-1] Release 9.2.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[16:36:35] <wikibugs>	 (Merged) jenkins-bot: Ignore help-links with no title configured [extensions/GrowthExperiments] (wmf/1.43.0-wmf.15) - https://gerrit.wikimedia.org/r/1057001 (https://phabricator.wikimedia.org/T370941) (owner: Michael Große)
[16:36:47] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1057917|Display a GlobalBlock link to stewards in Special:CheckUser (T370463 T178571)]], [[gerrit:1057001|Ignore help-links with no title configured (T370941)]]
[16:36:52] <urbanecm>	 progress!
[16:36:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:36:54] <stashbot>	 T370463: Update CheckUser to handle global account blocks - https://phabricator.wikimedia.org/T370463
[16:36:55] <stashbot>	 T178571: Add CentralAuth and GlobalBlock links to Special:CheckUser - https://phabricator.wikimedia.org/T178571
[16:36:55] <stashbot>	 T370941: PHP Notice: Undefined index: title - https://phabricator.wikimedia.org/T370941
[16:37:01] <Dreamy_Jazz>	 :)
[16:38:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:38:30] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2003.wikimedia.org with OS bookworm
[16:38:47] <logmsgbot>	 !log urbanecm@deploy1003 dreamyjazz, migr, urbanecm: Backport for [[gerrit:1057917|Display a GlobalBlock link to stewards in Special:CheckUser (T370463 T178571)]], [[gerrit:1057001|Ignore help-links with no title configured (T370941)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:38:47] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023933 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2003.wikimedia.org with OS bookworm completed: - gerrit2003 (*...
[16:39:11] <urbanecm>	 Dreamy_Jazz: what do i need to do?
[16:39:41] <Dreamy_Jazz>	 Load Special:CheckUser 'Get users' on any wiki and test that the result lines have a "GlobalBlock" link next to them.
[16:39:50] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023951 (Papaul)
[16:40:49] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10023952 (Papaul) Open→Resolved @Dzahn all yours
[16:41:02] <wikibugs>	 (PS3) Ssingh: Release 9.2.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134)
[16:41:03] <urbanecm>	 that works
[16:41:08] <wikibugs>	 (CR) CI reject: [V:-1] Release 9.2.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[16:41:10] <urbanecm>	 i also see this https://usercontent.irccloud-cdn.com/file/Xavv47yw/image.png
[16:41:17] <urbanecm>	 which might be unrelated
[16:41:22] <Dreamy_Jazz>	 It is unrelated
[16:41:44] <Dreamy_Jazz>	 That's locally blocking (as opposed to globally)
[16:41:49] <urbanecm>	 gotcha
[16:41:52] <urbanecm>	 anyway, link was there
[16:42:02] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023956 (Papaul) @Marostegui serial was off on the server I set it up. We had an issue with the provision cookbook not setting the serial co we did all the servers man...
[16:42:07] <urbanecm>	  M​artin Urbanec globally blocked ~2024-2553 (expires: 2024-07-29 16:40:57) with the following comment: Testing block <=== and block works too :)
[16:42:08] <logmsgbot>	 !log urbanecm@deploy1003 dreamyjazz, migr, urbanecm: Continuing with sync
[16:42:10] <urbanecm>	 proceeding
[16:44:19] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023969 (Papaul) @Marostegui since all those servers are on 10G when you put them in productions can you please let me know if you noticed any improvement.
[16:44:24] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm
[16:44:25] <wikibugs>	 (PS1) Ssingh: Release 9.2.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134)
[16:44:30] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023972 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors: -...
[16:44:32] <wikibugs>	 (CR) CI reject: [V:-1] Release 9.2.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[16:45:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[16:45:29] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10023979 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm
[16:47:44] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap: Backport for [[gerrit:1057917|Display a GlobalBlock link to stewards in Special:CheckUser (T370463 T178571)]], [[gerrit:1057001|Ignore help-links with no title configured (T370941)]] (duration: 10m 56s)
[16:47:48] <urbanecm>	 Dreamy_Jazz: done
[16:47:56] <stashbot>	 T370463: Update CheckUser to handle global account blocks - https://phabricator.wikimedia.org/T370463
[16:47:57] <Dreamy_Jazz>	 :D
[16:47:57] <stashbot>	 T178571: Add CentralAuth and GlobalBlock links to Special:CheckUser - https://phabricator.wikimedia.org/T178571
[16:47:57] <stashbot>	 T370941: PHP Notice: Undefined index: title - https://phabricator.wikimedia.org/T370941
[16:48:30] <urbanecm>	 Dreamy_Jazz: and also i'm done with my own stuff, in case you have anything else :D
[16:49:31] <Dreamy_Jazz>	 That was the only one I wanted to deploy
[16:49:32] <Dreamy_Jazz>	 Thanks
[16:50:17] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024032 (Marostegui) @Papaul I am trying to reimage db2227 but it is not doing PXE boot
[16:57:38] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024082 (Marostegui) >>! In T369654#10023969, @Papaul wrote: > @Marostegui since all those servers are on 10G when you put them in productions can you please let me kn...
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700)
[17:00:05] <jouncebot>	 ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T1700).
[17:02:09] <wikibugs>	 (CR) RLazarus: [C:+1] switchdc: mediawiki cache warmup now targets k8s [cookbooks] - https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: Scott French)
[17:02:40] <wikibugs>	 (CR) Dzahn: "you are right. Jelto also raised the same concern. not sure yet what the best fix it but at least this gets closer to what the core of the" [puppet] - https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: Dzahn)
[17:04:01] <wikibugs>	 (CR) Dzahn: [C:+1] "while it should not be handled via email to individual people, I'd still say +1 to this one" [puppet] - https://gerrit.wikimedia.org/r/1057877 (owner: Slyngshede)
[17:05:02] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas [puppet] - https://gerrit.wikimedia.org/r/1056996 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[17:08:23] <wikibugs>	 (PS1) Elukey: sre.hosts.provision: fix dell_config_changes [cookbooks] - https://gerrit.wikimedia.org/r/1057927 (https://phabricator.wikimedia.org/T365372)
[17:13:11] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024156 (Papaul) @Marostegui please run the cookbook this way: ` sudo cookbook sre.hosts.reimage  -t T369654 --os bookworm --force-dhcp-tftp db2227 --new ` add the ---...
[17:13:54] <wikibugs>	 (CR) Elukey: [C:+2] sre.host.provision: no-op refactor to highlight DELL-specific confs (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[17:14:05] <sukhe>	 !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.5-1wm2_amd64.changes T339134
[17:14:06] <wikibugs>	 SRE, MW-on-K8s, serviceops, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10024160 (jijiki)
[17:14:07] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#10024161 (jijiki)
[17:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:11] <stashbot>	 T339134: Package and deploy ATS 9.2.5 - https://phabricator.wikimedia.org/T339134
[17:14:55] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru)
[17:17:44] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4052*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_drmrs or A:cp-upload_drmrs or A:cp-magru or A:cp-text_magru or A:cp-upload_magru)
[17:24:28] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm
[17:24:39] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024202 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors: -...
[17:25:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[17:25:44] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm
[17:26:22] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [reason: testing ATS 9.2.5 upgrade]
[17:30:42] <wikibugs>	 (Abandoned) Dzahn: wikistats: drop min_gb parameter from cinder volume mount [puppet] - https://gerrit.wikimedia.org/r/1056605 (owner: Dzahn)
[17:33:52] <wikibugs>	 (CR) Dzahn: [C:+2] site: simplify regex for doc hosts [puppet] - https://gerrit.wikimedia.org/r/1056586 (owner: Dzahn)
[17:36:18] <wikibugs>	 (CR) Dzahn: [C:-1] firewall: if provider is nft and not pulling requestctl, remove confd [puppet] - https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) (owner: Dzahn)
[17:37:47] <wikibugs>	 SRE-OnFire, Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10024274 (lmata) >>! In T370783#10023283, @hnowlan wrote: > Are we classifying "incident issue closed" as resolved?   Alternatively, we'd need some intermediate state like "Stalled" or a new one, m...
[17:39:42] <wikibugs>	 (PS1) Dzahn: ci: replace ferm::service with firewall::service in data_rsync [puppet] - https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677)
[17:43:31] <wikibugs>	 (PS1) Dzahn: zuul: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677)
[17:46:07] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1056603/3437/aphlict1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[17:46:38] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] aphlict: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[17:50:58] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm
[17:51:10] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024440 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors: -...
[17:51:26] <urbanecm>	 !log mwmaint1002: kill extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php for enwiki (T370802)
[17:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:31] <stashbot>	 T370802: Add a link (Structured task): Release as "turned off" to English Wikipedia - https://phabricator.wikimedia.org/T370802
[17:52:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[17:52:18] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024447 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm
[17:59:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:00:39] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:37] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10024486 (NBaca-WMF) Created three related tickets to track work for this:  * https://phabricator.wikimedia.org/T371295 for running synthetic performanc...
[18:05:40] <wikibugs>	 SRE, Charts, serviceops, Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024506 (akosiaris) >>! In T370739#10019839, @Catrope wrote: > @akosiaris I'm trying to figure out how we should proceed based on your comment.   Y...
[18:08:14] <wikibugs>	 SRE, serviceops, Shellbox, Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024527 (LGoto)
[18:09:47] <wikibugs>	 SRE, serviceops, Shellbox, Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024525 (LGoto) p:Triage→High
[18:10:03] <wikibugs>	 (PS3) Dzahn: aphlict: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677)
[18:10:23] <wikibugs>	 (CR) Dzahn: [V:+1] "works now after the previous fix: https://puppet-compiler.wmflabs.org/output/1055489/3438/aphlict1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[18:15:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:48:40] <wikibugs>	 ops-codfw, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10024743 (Jhancock.wm) a:Jhancock.wm
[18:49:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:39] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:58:30] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2227.codfw.wmnet with OS bookworm
[18:58:37] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024814 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db22...
[18:59:15] <wikibugs>	 SRE-OnFire, Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10024815 (jhathaway) >>! In T370789#10015615, @BCornwall wrote: > That's right! Thanks for reminding. Anyone have any qualms with going that route?  seems simple and easy to change later, so +1
[18:59:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm
[18:59:45] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10024816 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm
[19:00:17] <wikibugs>	 (PS1) Jdlrobson: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092)
[19:00:36] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: Jdlrobson)
[19:00:58] <wikibugs>	 (CR) CI reject: [V:-1] Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: Jdlrobson)
[19:01:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10024842 (Jhancock.wm) a:Jhancock.wm
[19:04:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:07:33] <wikibugs>	 (CR) JHathaway: [C:+1] data.yaml: Extend andyrussg until the end of August. [puppet] - https://gerrit.wikimedia.org/r/1057877 (owner: Slyngshede)
[19:07:52] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10024885 (Jhancock.wm) a:Jhancock.wm
[19:09:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:11:34] <wikibugs>	 SRE-OnFire, Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10024896 (jhathaway) >>! In T370783#10023283, @hnowlan wrote: > Are we classifying "incident issue closed" as resolved?   Resolved maps well to our docs on resolving an incident, https://wikitech.w...
[19:19:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10024917 (Jhancock.wm) a:Jhancock.wm
[19:28:39] <wikibugs>	 SRE, serviceops, Shellbox, Charts (Sprint 3): Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10024998 (CDanis) >>! In T370739#10024506, @akosiaris wrote: > Rate limiting is broken in service-runner for a long time now. See T200374...
[19:43:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10025093 (thcipriani) Reason for access makes sense. Approved from my side.
[19:44:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to `restricted` group for Michael Große/migr - https://phabricator.wikimedia.org/T371010#10025096 (thcipriani)
[19:49:53] <wikibugs>	 (PS4) NMW03: Increase edit count requirement for autoconfirmed on English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1057379 (https://phabricator.wikimedia.org/T371186)
[19:50:08] <Nemoralis>	 o/
[19:54:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:56:48] <wikibugs>	 (CR) DannyS712: [C:-1] admin: add dcops to the system adm POSIX group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: Elukey)
[19:57:50] <Nemoralis>	 jouncebot: next
[19:57:50] <jouncebot>	 In 0 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T2000)
[19:59:00] <wikibugs>	 SRE, SRE-OnFire, SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10025127 (Dzahn) Check the VictorOps web UI -> rotations -> and see what time (and timezone!) is configured for the 2 rotations. (There are only 2 so not sure how...
[19:59:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T2000).
[20:00:04] <jouncebot>	 Nemoralis, Superzerocool, ebernhardson, Gerges, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <Nemoralis>	 o/
[20:00:32] <cjming>	 i can deploy
[20:01:13] <ebernhardson>	 \o
[20:01:32] <ebernhardson>	 cjming: i have to restart some services between my two patches, so they could have a number of others between them
[20:01:48] <cjming>	 ebernhardson: sounds good
[20:02:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057379 (https://phabricator.wikimedia.org/T371186) (owner: NMW03)
[20:02:25] <Jdlrobson>	 o/ im here if there is space for me :)
[20:03:20] <Superzerocool>	 wop... I was late, but I'm here for a tiny deploy (IP cap lift) :)
[20:03:34] <Nemoralis>	 no worries, it is started now
[20:03:41] <cjming>	 Superzerocool: i'll do yours next
[20:03:55] <Superzerocool>	 yay!, thanks =)
[20:04:01] <Nemoralis>	 cjming: do you know how can I test my patch?
[20:04:20] <cjming>	 Nemoralis: once it's ready - do you have the mwdebug extension installed?
[20:04:30] <Nemoralis>	 no, no I know that
[20:04:51] <Nemoralis>	 I am talking about testing wgAutoConfirmCount
[20:05:02] <cjming>	 oh - that idk
[20:06:25] <wikibugs>	 (Merged) jenkins-bot: Increase edit count requirement for autoconfirmed on English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1057379 (https://phabricator.wikimedia.org/T371186) (owner: NMW03)
[20:06:36] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]]
[20:06:45] <stashbot>	 T371186: Change autoconfirmed requirements on English Wikivoyage - https://phabricator.wikimedia.org/T371186
[20:08:35] <logmsgbot>	 !log cjming@deploy1003 nmw03, cjming: Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:08:39] <cjming>	 Nemoralis: not sure what to tell you about testing - your patch is up on test servers tho - shall i sync?
[20:09:03] <Nemoralis>	 I think yes, it is not a big patch
[20:09:19] <cjming>	 cool - syncing
[20:09:34] <cjming>	 for any SRE around -- i saw this message: 2 masters had sync errors
[20:09:56] <cjming>	 https://www.irccloud.com/pastebin/Ixq1TyA8/
[20:10:02] <logmsgbot>	 !log cjming@deploy1003 nmw03, cjming: Continuing with sync
[20:12:17] <wikibugs>	 (PS2) Superzerocool: enwiki, commonswiki: lift IP cap for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026)
[20:13:57] <wikibugs>	 (CR) Ebernhardson: [C:+1] Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[20:14:22] <wikibugs>	 (PS8) Ebernhardson: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046)
[20:14:31] <wikibugs>	 (PS2) Ebernhardson: Produce a limited set of event streams on private wikis (pt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046)
[20:15:29] <logmsgbot>	 !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]] (duration: 08m 52s)
[20:15:34] <stashbot>	 T371186: Change autoconfirmed requirements on English Wikivoyage - https://phabricator.wikimedia.org/T371186
[20:15:41] <Nemoralis>	 ty cjming
[20:16:05] <cjming>	 Nemoralis: i think it's live but i just saw an error
[20:16:31] <cjming>	 if any SREs are available: did the last scap backport actually work?
[20:16:37] <cjming>	 20:15:29 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=nmw03', 'Backport for [[gerrit:1057379|Increase edit count requirement for autoconfirmed on English Wikivoyage (T371186)]]']' returned non-zero exit status 1.
[20:16:57] <cjming>	 ^^ saw this msg after saying it finished
[20:17:03] <Nemoralis>	 weird
[20:17:30] <Nemoralis>	 maybe you should comment this on phab task too
[20:17:44] <cjming>	 i kinda want confirmation before proceeding with the next patch
[20:18:05] <cjming>	 Nemoralis: is there a way for you to check on prod?
[20:18:20] <Nemoralis>	 I am not sure
[20:18:28] <Nemoralis>	 let me check if I have autoconfirmed
[20:18:52] <Nemoralis>	 oh I don't
[20:18:55] <Nemoralis>	 and I have 2 edits
[20:18:56] <Nemoralis>	 wait
[20:19:09] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2227.codfw.wmnet with OS bookworm
[20:19:16] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10025297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors: - db22...
[20:19:27] <cjming>	 brennan or thcipriani -- not sure who else to ping -- is it ok to go ahead with the backport window in spite of weird error messages i'm seeing?
[20:19:54] <cjming>	 sorry *brennen ^^
[20:20:09] <ebernhardson>	 i wonder if it's something to do with the switchover in deploy hosts
[20:20:22] <cjming>	 that's what i'm wondering - i'm on deploy1003
[20:20:48] <cjming>	 after seeing a giant message on deploy1002 not to use it
[20:20:58] <Nemoralis>	 ok I have received autoconfirmed now
[20:21:03] <cjming>	 oh good!
[20:21:08] <cjming>	 so maybe things are working
[20:21:31] <Nemoralis>	 https://en.wikivoyage.org/wiki/Special:UserRights/Nemoralis
[20:21:42] <cjming>	 still - error messages are a bit disconcerting - not sure if it's ok to plow ahead in spite of them
[20:21:46] <RhinosF1>	 cjming: there's a task but it's probably ok
[20:21:50] <RhinosF1>	 Give me a minute
[20:22:03] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on kafka-main2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kafka_main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[20:22:41] <cjming>	 RhinosF1: thanks - i'm going to err on the side of plowing ahead then
[20:22:46] <RhinosF1>	 cjming: https://phabricator.wikimedia.org/T371261
[20:22:50] <RhinosF1>	 Is it that?
[20:23:07] <Nemoralis>	 it looks like the same error
[20:23:13] <cjming>	 similar - i pasted above what i'm seeing
[20:23:14] <RhinosF1>	 Go ahead then
[20:23:17] <cjming>	 cool
[20:23:19] <RhinosF1>	 It's fine for today
[20:23:31] <cjming>	 great - thanks
[20:24:22] <Superzerocool>	 Nemoralis: I checked the API and it correctly shows the deploy...
[20:24:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) (owner: Superzerocool)
[20:24:47] <Nemoralis>	 Superzerocool: thanks! What is the api url for that? I couldn't find that
[20:24:49] <Superzerocool>	 Nemoralis: https://en.wikivoyage.org/wiki/Special:ApiSandbox#action=query&format=json&meta=siteinfo&formatversion=2&siprop=autopromote
[20:24:57] <cjming>	 Superzerocool: deploying yours now
[20:25:06] <wikibugs>	 (Merged) jenkins-bot: enwiki, commonswiki: lift IP cap for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) (owner: Superzerocool)
[20:25:17] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057033|enwiki, commonswiki: lift IP cap for edit-a-thon (T371026)]]
[20:25:22] <stashbot>	 T371026: Requesting temporary lift of IP cap for 31 July 2024 - https://phabricator.wikimedia.org/T371026
[20:25:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:45] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] aphlict: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[20:27:13] <logmsgbot>	 !log cjming@deploy1003 superzerocool, cjming: Backport for [[gerrit:1057033|enwiki, commonswiki: lift IP cap for edit-a-thon (T371026)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:27:16] <cjming>	 ebernhardson: i guess yours can't go out together - so i'll do part 1, you can restart what needs restarting, and maybe do 1-2 between and resume with your part 2 when you tell me you're ready?
[20:27:23] <ebernhardson>	 cjming: sure
[20:27:29] <cjming>	 Superzerocool: ok to sync?
[20:27:42] <Superzerocool>	 sure cjming :)
[20:27:46] <logmsgbot>	 !log cjming@deploy1003 superzerocool, cjming: Continuing with sync
[20:28:15] <wikibugs>	 (PS9) Ebernhardson: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046)
[20:29:14] <cjming>	 Gerges: are you around?
[20:33:17] <logmsgbot>	 !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057033|enwiki, commonswiki: lift IP cap for edit-a-thon (T371026)]] (duration: 07m 59s)
[20:33:22] <wikibugs>	 (PS1) Dzahn: miscweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057948 (https://phabricator.wikimedia.org/T370677)
[20:33:25] <stashbot>	 T371026: Requesting temporary lift of IP cap for 31 July 2024 - https://phabricator.wikimedia.org/T371026
[20:33:27] <cjming>	 Superzerocool: guessing it's live
[20:33:38] <Superzerocool>	 yay!, thanks cjming :)
[20:33:45] <cjming>	 yw!
[20:33:51] <Superzerocool>	 See you wiki-people :wave:
[20:33:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[20:34:08] <cjming>	 ebernhardson: starting with your part 1
[20:34:29] <Jdlrobson>	 cjming: I am here if I can jump the queue?
[20:34:37] <wikibugs>	 (Merged) jenkins-bot: Produce a limited set of event streams on private wikis (pt 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[20:34:48] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]]
[20:34:53] <stashbot>	 T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046
[20:35:00] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "still getting desktop notification after this" [puppet] - https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[20:35:03] <cjming>	 Jdlrobson: i was just about to say - i'll do yours in between Erik's since the person before you appears to be N/A
[20:35:06] <wikibugs>	 (PS1) Dzahn: codesearch: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057949 (https://phabricator.wikimedia.org/T370677)
[20:36:36] <wikibugs>	 (PS1) Dzahn: releases: switch ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677)
[20:36:37] <logmsgbot>	 !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:36:53] <cjming>	 ebernhardson: should i sync?
[20:37:13] <ebernhardson>	 cjming: yea
[20:37:17] <logmsgbot>	 !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync
[20:37:17] <wikibugs>	 (PS2) Dzahn: releases: switch ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677)
[20:37:38] <wikibugs>	 (PS2) Jdlrobson: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092)
[20:38:18] <wikibugs>	 (CR) CI reject: [V:-1] Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: Jdlrobson)
[20:38:52] <wikibugs>	 (PS1) Dzahn: durum: switch ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057951
[20:38:57] <cjming>	 Jdlrobson: your patch isn't passing CI - can you take a look?
[20:39:21] <wikibugs>	 (PS3) Jdlrobson: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092)
[20:39:22] <Jdlrobson>	 fixed
[20:39:29] <cjming>	 that was fast
[20:39:45] <Jdlrobson>	 🫡 
[20:41:32] <wikibugs>	 (CR) Dzahn: [V:+1] "as you can see in compiler output all that happens is the ferm config file gets slightly renamed but the rules stay the same and this just" [puppet] - https://gerrit.wikimedia.org/r/1057951 (owner: Dzahn)
[20:42:19] <logmsgbot>	 !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] (duration: 07m 30s)
[20:42:24] <stashbot>	 T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046
[20:42:24] <cjming>	 ebernhardson: part 1 should be live - i'll do Jon's next, then resume with your part 2
[20:42:29] <ebernhardson>	 kk, thanks!
[20:42:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: Jdlrobson)
[20:43:32] <wikibugs>	 (Merged) jenkins-bot: Clean up night mode exclude namespaces and allow font size on submit [mediawiki-config] - https://gerrit.wikimedia.org/r/1057936 (https://phabricator.wikimedia.org/T370092) (owner: Jdlrobson)
[20:43:34] <cjming>	 Gerges: if you're around, happy to do your patches here in a bit
[20:43:43] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1057936|Clean up night mode exclude namespaces and allow font size on submit (T370092 T370505)]]
[20:43:49] <stashbot>	 T370092: Switching editing mode from VisualEditor to source mode locks text size if it contains changed content - https://phabricator.wikimedia.org/T370092
[20:43:49] <stashbot>	 T370505: Enable dark-mode in mediawiki.org Manual namespace - https://phabricator.wikimedia.org/T370505
[20:44:08] <wikibugs>	 (PS9) TheDJ: Adjust CSP header for pdfs & videos & set enforce on testwiki [puppet] - https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: Brian Wolff)
[20:44:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:44:31] <wikibugs>	 (CR) TheDJ: "Scheduled this again, now for july 30th." [puppet] - https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: Brian Wolff)
[20:45:04] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[20:45:09] <wikibugs>	 (PS1) Dzahn: prometheus::ops: switch ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057952
[20:45:28] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[20:45:39] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:45:54] <logmsgbot>	 !log cjming@deploy1003 cjming, jdlrobson: Backport for [[gerrit:1057936|Clean up night mode exclude namespaces and allow font size on submit (T370092 T370505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:45:59] <cjming>	 Jdlrobson: on mwdebug if you'd like to test
[20:46:09] <Jdlrobson>	 wahoo!
[20:46:23] <Jdlrobson>	 LGTM please sync
[20:46:28] <logmsgbot>	 !log cjming@deploy1003 cjming, jdlrobson: Continuing with sync
[20:48:04] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[20:48:32] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1057950/3441/releases1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[20:48:37] <logmsgbot>	 !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[20:52:02] <logmsgbot>	 !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1057936|Clean up night mode exclude namespaces and allow font size on submit (T370092 T370505)]] (duration: 08m 18s)
[20:52:06] <cjming>	 Jdlrobson: should be live!
[20:52:08] <stashbot>	 T370092: Switching editing mode from VisualEditor to source mode locks text size if it contains changed content - https://phabricator.wikimedia.org/T370092
[20:52:08] <stashbot>	 T370505: Enable dark-mode in mediawiki.org Manual namespace - https://phabricator.wikimedia.org/T370505
[20:52:10] <ebernhardson>	 cjming: alright mine looks to be ready
[20:52:18] <cjming>	 perfect timing
[20:52:38] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "complete noop here" [puppet] - https://gerrit.wikimedia.org/r/1057950 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn)
[20:53:12] <cjming>	 ebernhardson: should i rebase your part 2 on parent or master?
[20:53:28] <ebernhardson>	 cjming: master is fine
[20:53:38] <wikibugs>	 (PS3) Ebernhardson: Produce a limited set of event streams on private wikis (pt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046)
[20:54:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[20:55:24] <wikibugs>	 (Merged) jenkins-bot: Produce a limited set of event streams on private wikis (pt 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1056965 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[20:55:34] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]]
[20:55:38] <stashbot>	 T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240729T2100).
[21:00:11] <logmsgbot>	 !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:00:27] <cjming>	 ebernhardson: shall i sync part 2?
[21:00:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:45] <ebernhardson>	 cjming: yes please
[21:00:49] <logmsgbot>	 !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync
[21:04:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:06:14] <logmsgbot>	 !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] (duration: 10m 40s)
[21:06:19] <stashbot>	 T346046: [Search Update Pipeline] Source streams for private wikis - https://phabricator.wikimedia.org/T346046
[21:06:20] <cjming>	 ebernhardson: part 2 should be live! (hopefully)
[21:07:07] <ebernhardson>	 cjming: awesome! will poke and see if it generates the new kafka topics
[21:07:23] <cjming>	 \o/
[21:07:35] <brennen>	 cjming: sorry i missed that earlier ping.  had a flight delayed and have been in transit most of today.
[21:08:07] <cjming>	 brennen: no worries! sorry to trouble while you're traveling - all good here
[21:08:13] <cjming>	 Gerges: last call
[21:08:19] <cjming>	 i will be closing backport window here shortly
[21:09:03] <cjming>	 !log end of UTC late backport window
[21:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:56] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DBA, DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10025480 (Papaul) We reimaged db2227 5 times; for some reason the server is still sending the certificate request to puppetmaster1001 ` pt1979@puppetmaster1001:~$ sudo pu...
[21:25:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:34:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:38:44] <logmsgbot>	 !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox
[21:39:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:40] <logmsgbot>	 !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1002"
[21:42:40] <logmsgbot>	 !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1002"
[21:42:41] <logmsgbot>	 !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:45:31] <wikibugs>	 (CR) Dzahn: gerrit: use list of replicas from hiera again, don't do puppet DB lookup (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn)
[21:47:29] <wikibugs>	 (CR) Dzahn: "compiler fails because of the bug this is trying to fix - still" [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn)
[21:48:25] <wikibugs>	 (CR) Dzahn: "we need one successful puppet run to make the hosts appear in the new puppetdb ...I'll try it on the replica" [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn)
[21:50:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:52] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - https://gerrit.wikimedia.org/r/1056998 (owner: Dzahn)
[21:51:28] <wikibugs>	 (CR) JHathaway: [V:+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3443/console" [puppet] - https://gerrit.wikimedia.org/r/1042898 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[21:54:22] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:24] <wikibugs>	 (PS1) JHathaway: WIP: test pcc do not merge [puppet] - https://gerrit.wikimedia.org/r/1057967
[22:01:32] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[22:15:07] <wikibugs>	 (PS2) JHathaway: WIP: test pcc do not merge [puppet] - https://gerrit.wikimedia.org/r/1057967
[22:16:08] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1057967 (owner: JHathaway)
[22:23:02] <thcipriani>	 cjming: also sorry to miss the ping, that is https://phabricator.wikimedia.org/T371261 and it should be OK
[22:24:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:24] <wikibugs>	 (PS21) Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[22:53:44] <wikibugs>	 (CR) Clare Ming: Deploy MetricsPlatform to beta cluster (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[23:14:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:39] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:29:21] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:21] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057975
[23:38:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1057975 (owner: TrainBranchBot)
[23:39:47] <wikibugs>	 (PS1) C. Scott Ananian: Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367)
[23:54:20] <wikibugs>	 (CR) Arlolra: [C:+1] Enable Parsoid Read Views on {en,he}wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1057976 (https://phabricator.wikimedia.org/T365367) (owner: C. Scott Ananian)