[00:03:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T376905)', diff saved to https://phabricator.wikimedia.org/P70337 and previous config saved to /var/cache/conftool/dbconfig/20241018-000356-ladsgroup.json
[00:04:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: Maintenance
[00:04:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2225.codfw.wmnet with reason: Maintenance
[00:04:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T376905)', diff saved to https://phabricator.wikimedia.org/P70338 and previous config saved to /var/cache/conftool/dbconfig/20241018-000422-ladsgroup.json
[00:06:08] (CR) Cwhite: [C:+1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1081250 (https://phabricator.wikimedia.org/T377502) (owner: Herron)
[00:09:40] SRE, Content-Transform-Team-WIP, MW-on-K8s, serviceops, and 4 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid.
Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#10240560 (ABreault-WMF) Open→Resolved
[00:12:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1081289 (owner: TrainBranchBot)
[00:12:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T376905)', diff saved to https://phabricator.wikimedia.org/P70339 and previous config saved to /var/cache/conftool/dbconfig/20241018-001231-ladsgroup.json
[00:27:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P70340 and previous config saved to /var/cache/conftool/dbconfig/20241018-002738-ladsgroup.json
[00:31:26] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10240602 (Papaul) All the servers are now running on the new switches We did 2 fail over tests today each did last about 5 minutes and had no issues The fir...
[00:38:53] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:42:14] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove mgmt DNS entries for old frack switches - pt1979@cumin2002"
[00:42:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P70341 and previous config saved to /var/cache/conftool/dbconfig/20241018-004245-ladsgroup.json
[00:43:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove mgmt DNS entries for old frack switches - pt1979@cumin2002"
[00:43:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:47:54] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254#10240609 (Papaul)
[00:55:43] (PS1) Papaul: Remove fasw-c-codfw from monitoring [puppet] - https://gerrit.wikimedia.org/r/1081298 (https://phabricator.wikimedia.org/T377254)
[00:57:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T376905)', diff saved to https://phabricator.wikimedia.org/P70342 and previous config saved to /var/cache/conftool/dbconfig/20241018-005752-ladsgroup.json
[00:57:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: Maintenance
[00:58:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2238.codfw.wmnet with reason: Maintenance
[00:58:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T376905)', diff saved to https://phabricator.wikimedia.org/P70343 and previous config saved to /var/cache/conftool/dbconfig/20241018-005819-ladsgroup.json
[01:06:31] !log ladsgroup@cumin1002
dbctl commit (dc=all): 'Repooling after maintenance db2238 (T376905)', diff saved to https://phabricator.wikimedia.org/P70344 and previous config saved to /var/cache/conftool/dbconfig/20241018-010631-ladsgroup.json
[01:21:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P70345 and previous config saved to /var/cache/conftool/dbconfig/20241018-012138-ladsgroup.json
[01:36:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P70346 and previous config saved to /var/cache/conftool/dbconfig/20241018-013645-ladsgroup.json
[01:51:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T376905)', diff saved to https://phabricator.wikimedia.org/P70347 and previous config saved to /var/cache/conftool/dbconfig/20241018-015152-ladsgroup.json
[02:15:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:20:20] (PS11) Ebomani: Updating Patch Demo plugin to return legacy/new URL as needed [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954)
[02:21:25] (PS12) Ebomani: Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process.
[software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954)
[02:37:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:15:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:23:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[05:52:24] (CR) Andrea Denisse: "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1081250 (https://phabricator.wikimedia.org/T377502) (owner: Herron)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241018T0600)
[06:01:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080857 (owner: KartikMistry)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:40:32] (PS1) Ebrahim: Fix duplicated key in wgVectorNightMode [mediawiki-config] - https://gerrit.wikimedia.org/r/1081310
[06:41:42] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10240798 (ayounsi) Nicely written plan !! > Fmsw connects directly to firewalls We need to do the same in codfw before or af...
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241018T0700)
[07:04:02] (CR) Ayounsi: [C:+1] Remove fasw-c-codfw from monitoring [puppet] - https://gerrit.wikimedia.org/r/1081298 (https://phabricator.wikimedia.org/T377254) (owner: Papaul)
[07:06:14] (CR) Volans: "Yes, but not reserved as they are not reserved for something specific at this time. I've left an inline comment." [dns] - https://gerrit.wikimedia.org/r/1080295 (owner: Clément Goubert)
[07:23:21] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[07:50:51] (PS4) Volans: sre.switchdc.databases.prepare: add binlog check [cookbooks] - https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144)
[07:50:51] (PS6) Volans: sre.switchdc.databases: allow to select a section [cookbooks] - https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144)
[07:51:17] (CR) Volans: "Updated without the fix, ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: Volans)
[07:51:30] (CR) Volans: [C:+2] sre.switchdc.databases.prepare: add check [cookbooks] - https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:51:45] (CR) Volans: [C:+2] sre.switchdc.databases: update Phabricator more [cookbooks] - https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:54:59] o/
[07:56:54] (Merged) jenkins-bot: sre.switchdc.databases.prepare: add check [cookbooks] -
https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:58:33] (Merged) jenkins-bot: sre.switchdc.databases: update Phabricator more [cookbooks] - https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:59:46] (PS4) Vgutierrez: profile: Fix puppetserver spec test [puppet] - https://gerrit.wikimedia.org/r/1081195
[08:02:27] (CR) Vgutierrez: profile: Fix puppetserver spec test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1081195 (owner: Vgutierrez)
[08:03:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db2161.codfw.wmnet with reason: Maintenance
[08:03:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db2161.codfw.wmnet with reason: Maintenance
[08:03:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T367856)', diff saved to https://phabricator.wikimedia.org/P70348 and previous config saved to /var/cache/conftool/dbconfig/20241018-080343-ladsgroup.json
[08:03:48] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:04:37] (CR) Vgutierrez: [C:+2] profile: Fix puppetserver spec test [puppet] - https://gerrit.wikimedia.org/r/1081195 (owner: Vgutierrez)
[08:17:04] (PS2) Alexandros Kosiaris: Remove obsolete api records [dns] - https://gerrit.wikimedia.org/r/1080295 (owner: Clément Goubert)
[08:17:12] (CR) Alexandros Kosiaris: Remove obsolete api records (2 comments) [dns] - https://gerrit.wikimedia.org/r/1080295 (owner: Clément Goubert)
[08:17:15] (PS3) Alexandros Kosiaris: Remove obsolete api records [dns] - https://gerrit.wikimedia.org/r/1080295 (owner: Clément Goubert)
[08:18:26] (CR) CI reject: [V:-1] Remove obsolete api records [dns] - https://gerrit.wikimedia.org/r/1080295 (owner: Clément Goubert)
[08:27:47] (CR) Brouberol:
analytics_test_cluster: add secret (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[08:33:21] (PS1) Máté Szabó: Unify IPInfo access levels [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086)
[08:33:26] (PS1) Alexandros Kosiaris: cleanup old mx1001, mx2001 references [dns] - https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409)
[08:33:47] (PS4) Alexandros Kosiaris: Remove obsolete api records [dns] - https://gerrit.wikimedia.org/r/1080295 (owner: Clément Goubert)
[08:34:13] (CR) Máté Szabó: [C:-2] "DNM, this is blocked on Legal discussions." [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó)
[08:42:28] (PS5) Vgutierrez: liberica: provide a liberica module [puppet] - https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127)
[08:42:28] (PS1) Vgutierrez: profile: Provide a liberica profile [puppet] - https://gerrit.wikimedia.org/r/1081372 (https://phabricator.wikimedia.org/T377127)
[09:10:27] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[09:11:00] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[09:14:01] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[09:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:48] (PS2) JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984)
[09:17:48] (PS3) JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984)
[09:17:48] (PS1)
JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[09:26:36] (CR) Giuseppe Lavagetto: [C:+1] "the logic is a bit convoluted but should work." [puppet] - https://gerrit.wikimedia.org/r/1081224 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[09:33:27] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/proton: sync
[09:33:31] (CR) JMeybohm: [C:+2] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4 [puppet] - https://gerrit.wikimedia.org/r/1081224 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[09:33:36] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: sync
[09:33:46] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: sync
[09:35:07] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: sync
[09:35:17] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/proton: sync
[09:36:20] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: sync
[09:37:57] (CR) Jelto: [C:+2] push miscweb/static-codereview to image 2024-10-17-175203 [deployment-charts] - https://gerrit.wikimedia.org/r/1081241 (https://phabricator.wikimedia.org/T363771) (owner: Dzahn)
[09:39:09] (Merged) jenkins-bot: push miscweb/static-codereview to image 2024-10-17-175203 [deployment-charts] - https://gerrit.wikimedia.org/r/1081241 (https://phabricator.wikimedia.org/T363771) (owner: Dzahn)
[09:41:04] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[09:42:15] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[09:43:37] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[09:45:41] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[09:45:50] !log
jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[09:47:45] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[09:48:49] (PS2) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[09:55:02] (PS3) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[09:56:09] (PS4) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[09:56:34] (CR) Jcrespo: [C:+1] "Check worked as expected, validated otherwise:" [cookbooks] - https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: Volans)
[09:58:49] (CR) Jcrespo: [C:+2] mariadb: Remove test-pc1 as a valid section [puppet] - https://gerrit.wikimedia.org/r/1080717 (https://phabricator.wikimedia.org/T374933) (owner: Jcrespo)
[10:01:59] (CR) CI reject: [V:-1] Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:04:10] (PS5) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[10:04:48] (CR) Jcrespo: [C:+1] "While Amir's suggestion are valid concerns, that is out of scope of this patch- the list hasn't been added here, was beforehand, this is o" [software/spicerack] - https://gerrit.wikimedia.org/r/1074111 (owner: Volans)
[10:07:22] (PS6) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[10:09:25]
(CR) Jelto: [C:+2] "I deployed this to all wikikube clusters" [deployment-charts] - https://gerrit.wikimedia.org/r/1081241 (https://phabricator.wikimedia.org/T363771) (owner: Dzahn)
[10:10:33] (CR) Volans: [C:+2] sre.switchdc.databases.prepare: add binlog check [cookbooks] - https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: Volans)
[10:10:59] (CR) Volans: [C:+2] sre.switchdc.databases: allow to select a section [cookbooks] - https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144) (owner: Volans)
[10:11:03] SRE-OnFire, Data-Persistence-SRE, DBA, Patch-For-Review, Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10241288 (jcrespo) Let's merge carefully https://gerrit.wikimedia.org/r/1081103...
[10:12:53] (CR) CI reject: [V:-1] Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:16:42] (Merged) jenkins-bot: sre.switchdc.databases.prepare: add binlog check [cookbooks] - https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: Volans)
[10:17:12] (Merged) jenkins-bot: sre.switchdc.databases: allow to select a section [cookbooks] - https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144) (owner: Volans)
[10:20:47] (CR) Jelto: [C:+2] wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:21:01] (CR) Jelto: [C:+2] miscweb: add support to mount add confimaps [deployment-charts] - https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:22:25] (Merged) jenkins-bot:
miscweb: add support to mount add confimaps [deployment-charts] - https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:22:27] (Merged) jenkins-bot: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:25:56] (PS7) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[10:26:31] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[10:37:10] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[10:37:23] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubestagemaster2005.codfw.wmnet
[10:37:42] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2005.codfw.wmnet
[10:38:39] !log jayme@cumin1002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-codfw: containerd migration
[10:39:01] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=99) Reimaging k8s control planes of cluster staging-codfw: containerd migration
[10:40:10] (PS1) Jelto: wikidata-query-gui: fix volumeMount with subPath [deployment-charts] - https://gerrit.wikimedia.org/r/1081382 (https://phabricator.wikimedia.org/T350793)
[10:52:15] (CR) JMeybohm: [C:+1] wikidata-query-gui: fix volumeMount with subPath [deployment-charts] - https://gerrit.wikimedia.org/r/1081382 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:52:52] (CR) Ladsgroup: [C:+1] "Yeah, let's go with this for now."
[software/spicerack] - https://gerrit.wikimedia.org/r/1074111 (owner: Volans)
[10:55:32] (PS2) Jelto: wikidata-query-gui: fix volumeMount with subPath [deployment-charts] - https://gerrit.wikimedia.org/r/1081382 (https://phabricator.wikimedia.org/T350793)
[10:56:54] (CR) Jelto: [C:+2] wikidata-query-gui: fix volumeMount with subPath [deployment-charts] - https://gerrit.wikimedia.org/r/1081382 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:57:47] (PS8) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[10:57:56] (Merged) jenkins-bot: wikidata-query-gui: fix volumeMount with subPath [deployment-charts] - https://gerrit.wikimedia.org/r/1081382 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[10:58:02] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=kubestagemaster2005.codfw.wmnet
[10:58:15] (CR) Volans: [C:+2] "Thanks!" [software/spicerack] - https://gerrit.wikimedia.org/r/1074111 (owner: Volans)
[10:59:08] (PS9) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[10:59:12] !log jayme@cumin1002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-codfw: containerd migration
[10:59:41] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241018T0700)
[11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241018T1100).
[11:00:17] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[11:00:20] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS bookworm
[11:08:41] (PS3) JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984)
[11:08:41] (PS4) JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984)
[11:08:42] (PS10) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[11:08:42] (PS1) JMeybohm: reimage: Add a --force parameter to avoid being asked for confirmation [cookbooks] - https://gerrit.wikimedia.org/r/1081389
[11:09:35] (Merged) jenkins-bot: mysql_legacy: reorder CORE_SECTIONS constant [software/spicerack] - https://gerrit.wikimedia.org/r/1074111 (owner: Volans)
[11:11:16] (PS2) JMeybohm: reimage: Add a --force parameter to avoid being asked for confirmation [cookbooks] - https://gerrit.wikimedia.org/r/1081389
[11:11:16] (PS4) JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984)
[11:11:17] (PS5) JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984)
[11:11:17] (PS11) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[11:14:22] (CR) CI reject: [V:-1] reimage: Add a --force parameter to avoid being asked for confirmation [cookbooks] - https://gerrit.wikimedia.org/r/1081389
(owner: JMeybohm)
[11:16:39] (PS3) JMeybohm: reimage: Add a --force parameter to avoid being asked for confirmation [cookbooks] - https://gerrit.wikimedia.org/r/1081389
[11:16:39] (PS5) JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984)
[11:16:39] (PS6) JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984)
[11:16:40] (PS12) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[11:17:50] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage
[11:21:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage
[11:31:34] !log btullis@cumin1002 START - Cookbook sre.hosts.remove-downtime for dbstore1009.eqiad.wmnet
[11:31:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbstore1009.eqiad.wmnet
[11:32:08] sre-alert-triage, Data-Platform-SRE (2024.09.28 - 2024.10.18): Alert in need of triage: PrometheusMysqldExporterFailed (instance dbstore1009:13350) - https://phabricator.wikimedia.org/T376977#10241563 (BTullis) Open→Resolved a:BTullis
[11:43:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2005.codfw.wmnet with OS bookworm
[11:43:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster staging-codfw: containerd migration
[11:47:47] (PS1) Ilias Sarantopoulos: ml-services: update article-descriptions kserve to 0.13.1 [deployment-charts] -
https://gerrit.wikimedia.org/r/1081392 (https://phabricator.wikimedia.org/T367048)
[12:02:54] (Abandoned) Btullis: Set a non-default mapreduce file committer algorithm for spark [puppet] - https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: Btullis)
[12:04:07] (Abandoned) Ladsgroup: dbtools: Add prep-dc-switchover.py [software] - https://gerrit.wikimedia.org/r/1072168 (owner: Ladsgroup)
[12:21:48] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[12:22:14] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[12:22:30] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[12:22:50] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[12:35:23] (CR) Kevin Bazira: [C:+1] ml-services: update article-descriptions kserve to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1081392 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[12:35:38] (CR) Urbanecm: Implement redirects to meta's Special:GlobalContributions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: STran)
[12:44:03] (CR) Alexandros Kosiaris: [C:+1] bitu: Add some stewards to the list of account managers [puppet] - https://gerrit.wikimedia.org/r/1081220 (https://phabricator.wikimedia.org/T359820) (owner: BryanDavis)
[12:48:43] (PS13) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[12:54:13] (CR) CI reject: [V:-1] Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[12:55:03] ops-eqiad, SRE, DC-Ops:
ManagementSSHDown - https://phabricator.wikimedia.org/T376764#10241793 (VRiley-WMF) Open→Resolved reseated cables, closing
[12:55:05] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10241791 (VRiley-WMF) Duplicate→Resolved Reseated cables. They seem to be back on. Closing ticket.
[12:56:08] (PS14) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[12:56:33] pylint: too-many-locals ... that's what I think when entering a bar around here
[12:59:02] (PS1) DCausse: Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1081396 (https://phabricator.wikimedia.org/T376715)
[13:00:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - https://gerrit.wikimedia.org/r/1081396 (https://phabricator.wikimedia.org/T376715) (owner: DCausse)
[13:01:57] (CR) Ilias Sarantopoulos: [C:+2] ml-services: update article-descriptions kserve to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1081392 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[13:02:58] (Merged) jenkins-bot: ml-services: update article-descriptions kserve to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1081392 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos)
[13:03:01] (CR) Brouberol: [C:+1] ceph-rbd: Bump the ceph-csi plugin image [deployment-charts] - https://gerrit.wikimedia.org/r/1081112 (https://phabricator.wikimedia.org/T376401) (owner: Btullis)
[13:03:23] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace
'article-descriptions' for release 'main' . [13:07:51] (03PS6) 10JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984) [13:07:52] (03PS7) 10JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) [13:07:52] (03PS15) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [13:07:52] (03PS1) 10JMeybohm: reimage: Don't fail dry run if host os does not match target os [cookbooks] - 10https://gerrit.wikimedia.org/r/1081398 [13:10:32] (03PS16) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [13:15:58] (03PS17) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [13:16:19] (03CR) 10Papaul: [C:03+2] Remove fasw-c-codfw from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1081298 (https://phabricator.wikimedia.org/T377254) (owner: 10Papaul) [13:19:17] (03PS1) 10Tiziano Fogli: logstash/containerd: fix regexp to match also non-json entries [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) [13:20:11] (03CR) 10CI reject: [V:04-1] Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1081396 (https://phabricator.wikimedia.org/T376715) (owner: 10DCausse) [13:22:17] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@f020959]: Deploying updated dumps reconciliation [13:22:21] (03PS1) 10DCausse: Fix phan issue with getCounter returning NullMetric|CounterMetric 
[extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1081402 [13:22:48] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@f020959]: Deploying updated dumps reconciliation (duration: 00m 31s) [13:24:07] (03PS2) 10DCausse: Do not pass null to DataSender::sendWeightedTagsUpdate $tagWeights [extensions/CirrusSearch] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1081396 (https://phabricator.wikimedia.org/T376715) [13:27:14] (03CR) 10Tiziano Fogli: logstash/containerd: fix regexp to match also non-json entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [13:31:22] (03CR) 10Volans: [C:03+1] "LGTM, thx" [cookbooks] - 10https://gerrit.wikimedia.org/r/1081398 (owner: 10JMeybohm) [13:31:58] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Hardware replacement [13:32:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Hardware replacement [13:32:24] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10241921 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a5251fb2-fa43-4b25-ad41-97765f693742) set by eevans@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with... 
[13:32:54] (03PS3) 10Bking: analytics_test_cluster: add secret [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) [13:33:21] (03CR) 10Bking: analytics_test_cluster: add secret (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:34:29] (03CR) 10Btullis: [C:03+1] analytics_test_cluster: add secret [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:34:32] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254#10241928 (10Papaul) [13:34:52] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack pfw3 and old fasw decommission - https://phabricator.wikimedia.org/T377254#10241929 (10Papaul) 05Open→03Resolved This is complete [13:36:28] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:44] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10241940 (10VRiley-WMF) Removed S4KVNA0MB03305 and put in S4KVNA0MB03300 into slot 4 of the device (where S4KVNA0MB03305 was located) [13:46:28] RESOLVED: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:42] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[13:47:03] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10242009 (10phaultfinder) [13:54:43] (03PS7) 10Bking: airflow: make 'secret_key' configurable [puppet] - 10https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) [13:55:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10242011 (10cmooney) >>! In T377381#10240797, @ayounsi wrote: >> Fmsw connects directly to firewalls > We need to do the same i... [13:55:29] (03CR) 10Bking: [V:03+2 C:03+2] analytics_test_cluster: add secret [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:59:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10242019 (10cmooney) [14:00:15] (03CR) 10Cwhite: [C:04-1] "Please add a test case to ensure the effect of this change works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [14:00:16] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:06:31] (03CR) 10Sergio Gimeno: [Growth] beta: configure the A/B test experiment variants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [14:09:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10242050 (10Papaul) >>! In T377381#10242011, @cmooney wrote: >>>! In T377381#10240797, @ayounsi wrote: >>> Fmsw connects direct... [14:09:18] !log Running `foreachwiki userOptions.php --delete-defaults growthexperiments-homepage-variant` (T374544, T375753) [14:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:23] T374544: Use "control" as the static default for growthexperiments-homepage-variant - https://phabricator.wikimedia.org/T374544 [14:09:24] T375753: Drop unnecessary growthexperiments-homepage-variant entries from user_properties at Wikimedia wikis - https://phabricator.wikimedia.org/T375753 [14:12:40] (03CR) 10Hashar: Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process. (037 comments) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [14:13:06] (03PS13) 10Hashar: Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process. 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [14:16:41] (03CR) 10Brouberol: airflow: make 'secret_key' configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:19:26] (03CR) 10JHathaway: [C:03+1] "Thanks for cleaning this up, looks good, I had the same question about the lbs, but fine with leaving them in for now." [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [14:20:31] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:22:06] (03CR) 10Alexandros Kosiaris: [C:03+2] cleanup old mx1001, mx2001 references [dns] - 10https://gerrit.wikimedia.org/r/1081371 (https://phabricator.wikimedia.org/T325409) (owner: 10Alexandros Kosiaris) [14:22:08] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove obsolete api records [dns] - 10https://gerrit.wikimedia.org/r/1080295 (owner: 10Clément Goubert) [14:24:38] (03PS4) 10JMeybohm: reimage: Add a --force parameter to skip some confirmations [cookbooks] - 10https://gerrit.wikimedia.org/r/1081389 [14:24:39] (03PS2) 10JMeybohm: reimage: Don't fail dry run if host os does not match target os [cookbooks] - 10https://gerrit.wikimedia.org/r/1081398 [14:24:39] (03PS7) 10JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984) [14:24:39] (03PS8) 10JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) [14:24:40] (03PS18) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 
(https://phabricator.wikimedia.org/T362408) [14:25:51] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [14:25:59] (03CR) 10Mxmxchere: "Thanks for merging, glad you found that change after such a long time 😊" [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T362408) (owner: 10Mxmxchere) [14:29:50] (03CR) 10JMeybohm: logstash/containerd: fix regexp to match also non-json entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [14:30:17] (03CR) 10Volans: [C:03+1] "LGTM, non blocking nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1081389 (owner: 10JMeybohm) [14:30:24] (03PS1) 10STran: Disable IP reveal rights for local metawiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) [14:30:40] (03CR) 10Hashar: Updating Patch Demo plugin to return legacy/new URL as needed and modifying tests to reflect current process. (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079624 (https://phabricator.wikimedia.org/T374954) (owner: 10Ebomani) [14:32:13] (03CR) 10Herron: [V:03+1 C:03+2] "thx for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1081250 (https://phabricator.wikimedia.org/T377502) (owner: 10Herron) [14:35:38] (03CR) 10Herron: [V:03+1 C:03+2] grafana-loki: add systemd override and bump max open files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081250 (https://phabricator.wikimedia.org/T377502) (owner: 10Herron) [14:37:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:47] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for aqs1013.eqiad.wmnet [14:37:47] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1013.eqiad.wmnet [14:37:54] (03PS2) 10STran: Disable IP reveal rights for local metawiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) [14:38:41] !log jayme@cumin1002 START - Cookbook sre.k8s.reimage-stacked-control-plane Reimaging k8s control planes of cluster staging-codfw: containerd migration [14:39:17] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bookworm [14:39:20] 06SRE-OnFire, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 10Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10242220 (10akosiaris) [14:39:55] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#10242223 (10Eevans) >>! In T362033#10241940, @VRiley-WMF wrote: > Removed S4KVNA0MB03305 and put in S4KVNA0MB03300 into slot 4 of the device (where S4KVNA0MB03305 was located) The array is rebui... 
[14:42:45] (03CR) 10Ahmon Dancy: [C:03+1] deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [14:42:47] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:57] FIRING: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:09] (03CR) 10Tchanders: [C:03+1] Disable IP reveal rights for local metawiki groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) (owner: 10STran) [14:47:36] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@e44bacc]: Deploying updated dumps reconciliation [14:48:07] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@e44bacc]: Deploying updated dumps reconciliation (duration: 00m 31s) [14:51:28] RESOLVED: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:01] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removal of old mx records and api.svc records - 
akosiaris@cumin1002" [14:53:07] (03CR) 10Scott French: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [14:53:28] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [14:53:52] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Removal of old mx records and api.svc records - akosiaris@cumin1002" [14:53:52] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:57:04] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [14:58:17] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [14:59:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:52] (03PS2) 10Tiziano Fogli: logstash/containerd: fix regexp to match also non-json entries [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) [15:01:27] (03PS3) 10Tiziano Fogli: logstash/containerd: fix regexp to match also non-json entries [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) [15:02:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:46] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [15:07:10] (03PS3) 10STran: Disable IP reveal rights for local metawiki groups [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1081415 (https://phabricator.wikimedia.org/T377584) [15:11:08] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1081285 (https://phabricator.wikimedia.org/T377490) (owner: 10Eevans) [15:15:49] (03CR) 10Brouberol: [C:03+2] airflow: make 'secret_key' configurable [puppet] - 10https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [15:17:13] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.10.19 - 2024.11.08), 13Patch-For-Review: Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10242406 (10BTullis) [15:18:22] (03CR) 10Eevans: [C:03+2] Add jebe to airflow-analytics-product-admins per access request [puppet] - 10https://gerrit.wikimedia.org/r/1081285 (https://phabricator.wikimedia.org/T377490) (owner: 10Eevans) [15:24:46] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.10.19 - 2024.11.08), 13Patch-For-Review: Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10242467 (10Eevans) 05Open→03Resolved a:03Eevans Done! 
[15:26:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS bookworm [15:26:38] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2004.codfw.wmnet with OS bookworm [15:31:28] FIRING: [2x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:27] FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:41:38] (03CR) 10Scott French: "Ah, missed the C:scap in Hosts before ... yeah, that's going to try to run for a lot of hosts, many of which are probably already broken." [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [15:43:26] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2004.codfw.wmnet with reason: host reimage [15:44:46] (03CR) 10Dzahn: "Now I expected to see the changes on https://static-codereview.wikimedia.org/ though" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081241 (https://phabricator.wikimedia.org/T363771) (owner: 10Dzahn) [15:45:42] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10242563 (10Dzahn) This change has been deployed. Just somehow I don't see the changes yet. 
[15:45:53] (03PS2) 10Saint Johann: Add Russian Wikipedia to Wikidata link move [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081421 (https://phabricator.wikimedia.org/T66315) [15:46:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2004.codfw.wmnet with reason: host reimage [15:46:51] (03CR) 10Ahmon Dancy: [C:03+1] deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [16:01:34] (03PS19) 10JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408) [16:02:00] (03CR) 10JMeybohm: [C:03+2] reimage: Don't fail dry run if host os does not match target os [cookbooks] - 10https://gerrit.wikimedia.org/r/1081398 (owner: 10JMeybohm) [16:02:03] (03CR) 10JMeybohm: [C:03+2] reimage: Add a --force parameter to skip some confirmations [cookbooks] - 10https://gerrit.wikimedia.org/r/1081389 (owner: 10JMeybohm) [16:04:13] (03CR) 10JMeybohm: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [16:07:22] (03CR) 10Dduvall: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [16:08:38] (03Merged) 10jenkins-bot: reimage: Add a --force parameter to skip some confirmations [cookbooks] - 10https://gerrit.wikimedia.org/r/1081389 (owner: 10JMeybohm) [16:08:38] (03Merged) 10jenkins-bot: reimage: Don't fail dry run if host os does not match target os [cookbooks] - 10https://gerrit.wikimedia.org/r/1081398 (owner: 10JMeybohm) [16:09:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2004.codfw.wmnet with OS bookworm [16:10:14] (03PS4) 10Cwhite: logstash/containerd: fix regexp to match also non-json entries [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [16:10:29] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS bookworm [16:10:55] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10242681 (10Reedy) It looks... caching-ish ` last-modified: Fri, 30 Jun 2023 15:19:42 GMT ` [16:12:10] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10242698 (10Reedy) Interestingly, that's before the date of the previous image version too `2023-09-13-183746` [16:15:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10242702 (10cmooney) >>! 
In T377381#10242050, @Papaul wrote: > TIP: first i count the number of servers with only 1G NIC's thos... [16:16:05] (03PS5) 10Cwhite: logstash/containerd: fix regexp to match also non-json entries [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [16:16:28] FIRING: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:27] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:21:28] RESOLVED: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:31] (03CR) 10Cwhite: [C:03+2] logstash/containerd: fix regexp to match also non-json entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081401 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [16:28:19] (03CR) 10Dduvall: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [16:28:26] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host 
reimage [16:32:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [16:37:07] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10242782 (10Dzahn) Still seeing the old version with: ` [deploy1003:~] $ curl --resolve static-codereview.wikimedia.or... [16:41:19] (03CR) 10Scott French: [C:03+1] deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: 10Dduvall) [16:45:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10242797 (10cmooney) I spoke to @Jgreen earlier about the setup and I think the above plan should be workable, provided we carr... [16:47:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10242802 (10cmooney) [16:49:11] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607 (10phaultfinder) 03NEW [16:54:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2005.codfw.wmnet with OS bookworm [16:54:12] !log dzahn@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:54:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.reimage-stacked-control-plane (exit_code=0) Reimaging k8s control planes of cluster staging-codfw: containerd migration [16:54:24] !log dzahn@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:57:07] (03CR) 10Jdlrobson: [C:03+1] "Nice catch! 
Luckily not impacting production, as logged_out is false in Vector's skin.json on master." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081310 (owner: 10Ebrahim) [16:57:16] (03PS1) 10JHathaway: replace exim alert with postfix alert [alerts] - 10https://gerrit.wikimedia.org/r/1081427 (https://phabricator.wikimedia.org/T325394) [17:00:01] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10242863 (10Dzahn) I saw a diff ` - chart: miscweb-0.2.18 + chart: miscweb-0.3.0 ` then ran ` helmfile -e... [17:11:08] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#10242886 (10Dzahn) I guess now the issue is we have to edit inside the _compressed_ HTML file: https://gitlab.wikimedi... [18:20:58] (03CR) 10JHathaway: [C:03+2] replace exim alert with postfix alert [alerts] - 10https://gerrit.wikimedia.org/r/1081427 (https://phabricator.wikimedia.org/T325394) (owner: 10JHathaway) [18:24:29] 06SRE-OnFire, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#10243079 (10Eevans) >>! In T370785#10227246, @lmata wrote: >>>! In T370785#10211164, @Eevans wrote: >> Q: Should this be a part of the MVP (i.e. Day 1), or saved for a subs... 
[18:24:30] 06SRE-OnFire, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#10243081 (10Eevans) [18:24:33] 06SRE-OnFire, 10Incident Tooling: Corto internal incident response workflow automation (MVP) - https://phabricator.wikimedia.org/T356790#10243082 (10Eevans) [18:24:34] 06SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467#10243083 (10Eevans) [18:31:28] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:43:11] FIRING: [2x] MXQueueNoMetrics: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [18:53:11] RESOLVED: [2x] MXQueueNoMetrics: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [18:56:45] !log removing 1 file for legal compliance [18:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:46] !log removing 3 files for legal compliance [19:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:57] 06SRE-OnFire, 10Incident Tooling: Corto: remove unused context.Context arguments - https://phabricator.wikimedia.org/T376501#10243314 (10Eevans) Done (see: [[ https://gitlab.wikimedia.org/repos/sre/corto/-/commit/c170285a66a6262ce66a40b5ec5af0b198dc53b0 | c170285 ]]). 
[19:30:23] SRE-OnFire, Incident Tooling: Corto: remove unused context.Context arguments - https://phabricator.wikimedia.org/T376501#10243315 (Eevans) Open→Resolved p:Triage→Medium a:Eevans
[19:45:12] (PS20) JMeybohm: Add a cookbook to roll-reimage stacked k8s control planes [cookbooks] - https://gerrit.wikimedia.org/r/1081377 (https://phabricator.wikimedia.org/T362408)
[20:20:46] !log dduvall@deploy2002 Installing scap version "latest" for 2 hosts
[20:21:33] !log dduvall@deploy2002 install-world aborted: (no justification provided) (duration: 00m 52s)
[20:22:25] !log dduvall@deploy2002 Installing scap version "4.113.0" for 2 hosts
[20:23:47] !log deployed scap release 4.113.0 to releases{1003,2003} hosts
[20:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:00] (PS1) Bartosz Dziewoński: Re-apply "Set special footer licence message for MediaWiki.org re. Help: pages" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483)
[20:28:09] (CR) CI reject: [V:-1] Re-apply "Set special footer licence message for MediaWiki.org re. Help: pages" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) (owner: Bartosz Dziewoński)
[20:28:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081445 (https://phabricator.wikimedia.org/T301483) (owner: Bartosz Dziewoński)
[20:34:08] (PS1) Daimona Eaytoy: Enable CampaignEvents collaboration list in testwiki and test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1081446 (https://phabricator.wikimedia.org/T376055)
[20:34:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081446 (https://phabricator.wikimedia.org/T376055) (owner: Daimona Eaytoy)
[21:44:40] !log dduvall@deploy2002 Started deploy [releng/jenkins-deploy@8c1070f] (releasing): deploying changes to publishMWSingleVersion job
[21:45:22] !log dduvall@deploy2002 Finished deploy [releng/jenkins-deploy@8c1070f] (releasing): deploying changes to publishMWSingleVersion job (duration: 01m 06s)
[21:50:19] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[21:52:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:13:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:16:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[22:24:35] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10243686 (Jhancock.wm) got ms-be2083 back up. reseated all the ram. @papaul you can test the reimage script on this @elukey ran into another issue wi...
[22:26:01] (PS1) Cwhite: Profiler: introduce metrics batching and centralize socket management [mediawiki-config] - https://gerrit.wikimedia.org/r/1081460
[22:26:01] (PS1) Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385)
[22:31:28] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:38:25] (PS1) Zabe: s4: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - https://gerrit.wikimedia.org/r/1081463 (https://phabricator.wikimedia.org/T183490)
[22:41:19] (PS1) Zabe: group0: Increase revision-slots cache expiry back to default [mediawiki-config] - https://gerrit.wikimedia.org/r/1081464 (https://phabricator.wikimedia.org/T183490)
[22:49:17] (CR) Cwhite: "This felt like a good first step towards sending arclamp metrics towards statsd-exporter. Let me know what you think?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081460 (owner: Cwhite)
[23:16:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 811.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:21:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 811.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:31:45] SRE, fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#10243825 (Dwisehaupt) Open→Resolved Hosts are in the process of building. Tracking install steps in T377641
[23:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1081466
[23:38:41] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1081466 (owner: TrainBranchBot)
[23:40:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 801.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:45:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 989.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:49:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 941.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:54:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 822.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded