[00:04:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1058717 (owner: TrainBranchBot)
[00:04:36] (PS1) Andrew Bogott: Fake passwords for trove rabbitmq user [labs/private] - https://gerrit.wikimedia.org/r/1058720 (https://phabricator.wikimedia.org/T320256)
[00:08:17] (CR) Andrew Bogott: [V:+2 C:+2] Fake passwords for trove rabbitmq user [labs/private] - https://gerrit.wikimedia.org/r/1058720 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[00:09:02] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[00:15:59] (PS4) Andrew Bogott: Switch trove to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256)
[00:16:17] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[00:16:58] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[00:26:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:11] !log run authdns-update
[00:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:52] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox
[01:01:23] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:16:48] (CR) Andrew Bogott: [C:+2] Switch trove to the new trove rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058718 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[01:51:38] (PS2) Dzahn: cinderutils: allow floating point numbers for min_gb and max_gb [puppet] - https://gerrit.wikimedia.org/r/1057000 (https://phabricator.wikimedia.org/T371573)
[01:52:01] (PS3) Dzahn: cinderutils: add --allow-unattended-format when preparing volumes [puppet] - https://gerrit.wikimedia.org/r/1056606 (https://phabricator.wikimedia.org/T371573)
[02:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.688s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:49:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.987s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:52:47] (PS1) Andrew Bogott: cinder-volume: use cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058728 (https://phabricator.wikimedia.org/T320256)
[02:56:51] (PS2) Andrew Bogott: cinder-volume: use cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058728 (https://phabricator.wikimedia.org/T320256)
[02:59:22] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:00] (CR) Andrew Bogott: [C:+2] cinder-volume: use cinder rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1058728 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[04:03:31] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T371190#10034371 (Marostegui) Thanks! Everything looks good!
[04:16:58] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[04:53:54] (PS1) Marostegui: installserver: Do not format db2227 [puppet] - https://gerrit.wikimedia.org/r/1058731
[05:02:49] (CR) Marostegui: [C:+2] installserver: Do not format db2227 [puppet] - https://gerrit.wikimedia.org/r/1058731 (owner: Marostegui)
[05:26:52] ops-magru, Traffic: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10034399 (Volans)
[05:27:46] ops-magru, SRE: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371559#10034397 (Volans) →Duplicate dup:T371554
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T0600)
[06:00:04] marostegui, Amir1, and arnaudb: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:31:11] (PS7) Ayounsi: Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275)
[06:35:16] (CR) Ayounsi: [C:+2] Cumin Alias, add temp netbox4 and restore global netbox ones [puppet] - https://gerrit.wikimedia.org/r/1056505 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[06:39:50] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[06:42:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:45:36] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[06:46:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:48:12] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[06:48:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.885s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:51:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:53:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.28s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:58:02] (PS1) Slyngshede: data.yaml: Offboarding sbailey [puppet] - https://gerrit.wikimedia.org/r/1058952
[06:59:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T367856)', diff saved to https://phabricator.wikimedia.org/P67175 and previous config saved to /var/cache/conftool/dbconfig/20240801-065924-marostegui.json
[06:59:27] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:02:27] (PS1) Ayounsi: Netbox: add netbox4 frontends to the frontends list [puppet] - https://gerrit.wikimedia.org/r/1058953
[07:03:05] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1058953 (owner: Ayounsi)
[07:03:59] !log uncordon parse2001, parse1001 T359387
[07:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:04] T359387: Cleanup parsoid-php service - https://phabricator.wikimedia.org/T359387
[07:14:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67176 and previous config saved to /var/cache/conftool/dbconfig/20240801-071431-marostegui.json
[07:21:14] !log akosiaris@cumin1002 START - Cookbook sre.hosts.decommission for hosts deploy1002.eqiad.wmnet
[07:23:26] (PS1) Alexandros Kosiaris: deploy1002: decommission [puppet] - https://gerrit.wikimedia.org/r/1059006 (https://phabricator.wikimedia.org/T371283)
[07:25:57] (PS1) Jelto: phabricator: delay pages my 30 minutes to reduce alerting noise [puppet] - https://gerrit.wikimedia.org/r/1059007 (https://phabricator.wikimedia.org/T371418)
[07:27:26] (CR) Alexandros Kosiaris: [C:+2] deploy1002: decommission [puppet] - https://gerrit.wikimedia.org/r/1059006 (https://phabricator.wikimedia.org/T371283) (owner: Alexandros Kosiaris)
[07:28:18] (CR) Ayounsi: [C:+2] "Self merging to unblock dns, PCC looks good." [puppet] - https://gerrit.wikimedia.org/r/1058953 (owner: Ayounsi)
[07:29:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P67177 and previous config saved to /var/cache/conftool/dbconfig/20240801-072938-marostegui.json
[07:32:09] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:36:19] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[07:36:25] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:37:04] (PS1) Alexandros Kosiaris: site.pp: Remove deploy1002 [puppet] - https://gerrit.wikimedia.org/r/1059008 (https://phabricator.wikimedia.org/T357392)
[07:39:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:39:29] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox 4 sync - ayounsi@cumin1002"
[07:39:34] SRE, collaboration-services, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Seen): Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10034509 (hashar) >>! In T177826#10034118, @Dzahn wrote: > ACK! I see the key in `modules/profile/manifests/ci/...
[07:40:22] (CR) Alexandros Kosiaris: [C:+2] site.pp: Remove deploy1002 [puppet] - https://gerrit.wikimedia.org/r/1059008 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[07:41:26] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox
[07:41:56] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[07:42:27] ops-eqiad, DC-Ops, decommission-hardware, serviceops: decommission deploy1002 - https://phabricator.wikimedia.org/T371283#10034513 (akosiaris)
[07:43:50] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:43:51] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts deploy1002.eqiad.wmnet
[07:44:00] ops-eqiad, DC-Ops, decommission-hardware, serviceops: decommission deploy1002 - https://phabricator.wikimedia.org/T371283#10034520 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1002 for hosts: `deploy1002.eqiad.wmnet` - deploy1002.eqiad.wmnet (**PASS**) - Down...
[07:44:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:44:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T367856)', diff saved to https://phabricator.wikimedia.org/P67178 and previous config saved to /var/cache/conftool/dbconfig/20240801-074445-marostegui.json
[07:44:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[07:44:49] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:45:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[07:45:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T367856)', diff saved to https://phabricator.wikimedia.org/P67179 and previous config saved to /var/cache/conftool/dbconfig/20240801-074507-marostegui.json
[07:46:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[07:47:11] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox 4 sync - ayounsi@cumin1002"
[07:47:11] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[07:47:33] SRE, MediaWiki-Platform-Team, Traffic-Icebox, WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#10034529 (matej_suchanek) This morning (CEST), I visited [[ https://cs.wikipedia.org/wiki/Hlavn%C3%AD_strana | my favorite wik...
[07:49:47] (PS7) Ayounsi: netbox.netbox-extra: trigger syncdatasource [cookbooks] - https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275)
[07:49:47] (PS8) Ayounsi: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513)
[07:49:47] (PS1) Ayounsi: sync-netbox-hiera: set mgmt host status to ignore to lowercase [cookbooks] - https://gerrit.wikimedia.org/r/1059010
[07:55:45] SRE, Infrastructure-Foundations: Netbox dns record generation not working - https://phabricator.wikimedia.org/T371565#10034563 (ayounsi) Open→Resolved a:ayounsi Fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056505 and https://gerrit.wikimedia.org/r/c/operations/puppet/+...
[07:56:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[07:56:29] SRE, collaboration-services, serviceops, Release-Engineering-Team (Radar): replace production buster deployment servers - https://phabricator.wikimedia.org/T364656#10034567 (akosiaris) Open→Resolved a:akosiaris deploy1003 has been tracked in T364417, deploy2002 reimaging as bullse...
[07:56:31] (CR) Ayounsi: [C:+2] sync-netbox-hiera: set mgmt host status to ignore to lowercase [cookbooks] - https://gerrit.wikimedia.org/r/1059010 (owner: Ayounsi)
[07:58:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T367856)', diff saved to https://phabricator.wikimedia.org/P67180 and previous config saved to /var/cache/conftool/dbconfig/20240801-075826-marostegui.json
[07:58:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:00:49] (Merged) jenkins-bot: sync-netbox-hiera: set mgmt host status to ignore to lowercase [cookbooks] - https://gerrit.wikimedia.org/r/1059010 (owner: Ayounsi)
[08:01:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:04:03] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "netbox4 sync - ayounsi@cumin1002"
[08:04:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "netbox4 sync - ayounsi@cumin1002"
[08:06:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:08:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance
[08:08:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2148.codfw.wmnet with reason: Maintenance
[08:08:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[08:09:02] !log marostegui@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 1:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[08:13:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P67181 and previous config saved to /var/cache/conftool/dbconfig/20240801-081333-marostegui.json
[08:14:32] SRE, collaboration-services, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10034583 (hashar)
[08:16:58] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[08:28:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P67182 and previous config saved to /var/cache/conftool/dbconfig/20240801-082840-marostegui.json
[08:29:54] SRE: How do handle old/unneeded Gerrit groups - https://phabricator.wikimedia.org/T371581 (AndrewTavis_WMDE) NEW
[08:35:48] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2229.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:36:18] SRE: How do handle old/unneeded Gerrit groups - https://phabricator.wikimedia.org/T371581#10034607 (AndrewTavis_WMDE)
[08:43:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T367856)', diff saved to https://phabricator.wikimedia.org/P67183 and previous config saved to /var/cache/conftool/dbconfig/20240801-084347-marostegui.json
[08:43:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[08:43:50] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:44:02] !log marostegui@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[08:44:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T367856)', diff saved to https://phabricator.wikimedia.org/P67184 and previous config saved to /var/cache/conftool/dbconfig/20240801-084409-marostegui.json
[08:45:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2229.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:48:17] !log ayounsi@cumin1002 START - Cookbook sre.postgresql.postgres-init
[08:49:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[08:53:05] SRE, MediaWiki-Platform-Team, Traffic-Icebox, WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#10034635 (Tgr) Looking at that page now, what I see is NewPP limit report: ` Parsed by mw‐web.eqiad.main‐5ffbbd4f55‐hlpc7 Redu...
[08:55:22] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2230.mgmt.codfw.wmnet with reboot policy GRACEFUL
[08:55:35] jouncebot: now and next
[08:55:35] No deployments scheduled for the next 1 hour(s) and 4 minute(s)
[08:55:39] (CR) Ilias Sarantopoulos: [C:+1] ml-services: staging config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1058574 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[08:57:25] (CR) Filippo Giunchedi: [C:+2] rsyslog: send all k8s logs to dedicated kafka topics [puppet] - https://gerrit.wikimedia.org/r/1057819 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[08:57:43] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2230.mgmt.codfw.wmnet with reboot policy GRACEFUL
[09:00:38] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephmon1004.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:00:47] SRE,
MediaWiki-Platform-Team, Traffic-Icebox, WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#10034642 (matej_suchanek) > the In the news section still starts with July 31. Not sure what's up with that. That's fine. For...
[09:06:38] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: create charts only containing the CRDs [deployment-charts] - https://gerrit.wikimedia.org/r/1049084 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[09:08:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon1004.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:13:38] (PS1) Ayounsi: Don't import docker interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059018
[09:14:09] (PS2) Ayounsi: Don't import docker interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059018
[09:16:22] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephmon1005.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:16:43] (CR) Hashar: cumin: clone homer public repo with default parameters (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[09:22:40] (PS2) Hashar: cumin: set git::clone umask to match requested file mode [puppet] - https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277)
[09:22:56] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[09:24:02] (CR) Hashar: [C:-1] "Well actually I can squash that in child change Id55dec816ddcda3f3a53434d0eef95d34f4ee7cc" [puppet] - https://gerrit.wikimedia.org/r/1056981 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[09:24:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host
cloudcephmon1005.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:24:46] (CR) Cathal Mooney: [C:+1] "LGTM!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059018 (owner: Ayounsi)
[09:26:39] (PS2) Hashar: cumin: set git::clone umask to match requested file mode [puppet] - https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277)
[09:27:17] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[09:27:46] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephmon1006.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:31:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1233', diff saved to https://phabricator.wikimedia.org/P67185 and previous config saved to /var/cache/conftool/dbconfig/20240801-093123-marostegui.json
[09:31:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[09:31:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[09:32:04] (CR) Hashar: "After talking with Luca about this, the easy path is to fix the `umask` of the two git clones and do not touch their mode/groups."
[puppet] - https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[09:32:56] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584 (JoelyRooke-WMDE) NEW
[09:33:34] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10034711 (JoelyRooke-WMDE)
[09:36:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon1006.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:40:29] (Abandoned) Ilias Sarantopoulos: ml-services: enable multiprocessing for arwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1053835 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[09:44:24] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1035.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:50:13] SRE, Infrastructure-Foundations, netops: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10034735 (ayounsi)
[09:51:50] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10034739 (WMDECyn) Approving this request as approving manager
[09:54:11] (CR) Brouberol: [C:+2] cloudnative-pg: create charts only containing the CRDs [deployment-charts] - https://gerrit.wikimedia.org/r/1049084 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[09:54:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1035.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:55:09] (PS1) Filippo Giunchedi: rsyslog: fix kafka-k8s double logging [puppet] - https://gerrit.wikimedia.org/r/1059025 (https://phabricator.wikimedia.org/T366710)
[09:56:52] (CR) Ayounsi: [C:+2] Don't
import docker interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059018 (owner: Ayounsi)
[09:57:18] (CR) Clément Goubert: [C:+1] rsyslog: fix kafka-k8s double logging [puppet] - https://gerrit.wikimedia.org/r/1059025 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[09:57:49] (Merged) jenkins-bot: Don't import docker interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/1059018 (owner: Ayounsi)
[09:58:07] (CR) Filippo Giunchedi: [C:+2] rsyslog: fix kafka-k8s double logging [puppet] - https://gerrit.wikimedia.org/r/1059025 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[09:59:04] (CR) DCausse: wdqs graph split: routing for wdqs backends (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1053765 (https://phabricator.wikimedia.org/T364367) (owner: Ryan Kemper)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1000)
[10:00:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67186 and previous config saved to /var/cache/conftool/dbconfig/20240801-100035-root.json
[10:05:32] (CR) Alexandros Kosiaris: [C:-1] "Various small comments here and there, and one larger concern, all inline." [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:10:06] (CR) Brouberol: "Thanks for the thorough review!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:11:34] (CR) DCausse: [C:+1] "seems to be taken care of at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053765" [puppet] - https://gerrit.wikimedia.org/r/1054342 (https://phabricator.wikimedia.org/T364366) (owner: Stevemunene)
[10:11:58] (PS2) DCausse: wdqs: drop deprecated hosts hiera configs [puppet] - https://gerrit.wikimedia.org/r/1058649
[10:13:35] (PS1) Clément Goubert: Remove obsolete dummy certs for api and parsoid [labs/private] - https://gerrit.wikimedia.org/r/1059029 (https://phabricator.wikimedia.org/T360636)
[10:14:10] SRE, serviceops, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#10034762 (Clement_Goubert)
[10:15:25] (CR) Alexandros Kosiaris: [C:-1] "/me facepalms. I didn't realize that, sorry about that!" [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:15:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67187 and previous config saved to /var/cache/conftool/dbconfig/20240801-101541-root.json
[10:16:25] (CR) Alexandros Kosiaris: [C:+1] Add upstream version annotation [deployment-charts] - https://gerrit.wikimedia.org/r/1037733 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:16:26] (PS2) Clément Goubert: Remove no longer used parsoid and api certs [puppet] - https://gerrit.wikimedia.org/r/1042936 (https://phabricator.wikimedia.org/T360636) (owner: Alexandros Kosiaris)
[10:16:35] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: add CI fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1049114 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:17:37] (CR) DCausse: "some
parts of this patch seems to be handled in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054342, should they be merged togeth" [puppet] - https://gerrit.wikimedia.org/r/1054520 (https://phabricator.wikimedia.org/T364368) (owner: Stevemunene)
[10:17:41] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:18:23] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:18:44] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:19:57] (CR) Alexandros Kosiaris: [C:+1] Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:21:26] (PS1) Ayounsi: Increase the number of Redis DB for standalone Netbox [puppet] - https://gerrit.wikimedia.org/r/1059030
[10:21:44] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1059030 (owner: Ayounsi)
[10:24:56] (CR) Ayounsi: "The alternative is to remove that configuration option so it falls back to the default of 16 DBs" [puppet] - https://gerrit.wikimedia.org/r/1059030 (owner: Ayounsi)
[10:25:45] (PS2) Ayounsi: Increase the number of Redis DB for standalone Netbox [puppet] - https://gerrit.wikimedia.org/r/1059030
[10:29:28] SRE, Growth-Team, StructuredDiscussions: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586 (Michael) NEW
[10:29:28] (CR) Brouberol: Enable cloudnative-pg-operator on the
dse-k8s-eqiad k8s cluster (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:30:25] (CR) Brouberol: cloudnative-pg: Import the upstream chart for inspection (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[10:30:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67188 and previous config saved to /var/cache/conftool/dbconfig/20240801-103046-root.json
[10:35:58] SRE, Gerrit: How do handle old/unneeded Gerrit groups - https://phabricator.wikimedia.org/T371581#10034794 (Aklapper) +#Gerrit (not sure why #SRE was added?)
[10:38:41] SRE, Growth-Team, observability, StructuredDiscussions, Wikimedia-Logstash: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10034812 (Urbanecm_WMF)
[10:45:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67189 and previous config saved to /var/cache/conftool/dbconfig/20240801-104551-root.json
[10:51:49] (CR) Vgutierrez: [C:+1] Release 9.2.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[11:00:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67190 and previous config saved to /var/cache/conftool/dbconfig/20240801-110057-root.json
[11:02:55] (CR) Lucas Werkmeister: "We really need this page, I’m seeing more and more people be confused by the buster bastion: T371556#10034413" [puppet] - https://gerrit.wikimedia.org/r/1058654 (owner: Lucas Werkmeister)
[11:15:55] (PS1) Aklapper: Make Etherpad frontpage say it's not for personal
use [debs/etherpad-lite] - https://gerrit.wikimedia.org/r/1059036 (https://phabricator.wikimedia.org/T371591)
[11:16:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67191 and previous config saved to /var/cache/conftool/dbconfig/20240801-111602-root.json
[11:16:46] (CR) Aklapper: "Warning: Neither do I know if this is the correct approach nor has this been tested" [debs/etherpad-lite] - https://gerrit.wikimedia.org/r/1059036 (https://phabricator.wikimedia.org/T371591) (owner: Aklapper)
[11:22:37] (CR) Stevemunene: "Indeed, lemme stack them. Thanks David" [puppet] - https://gerrit.wikimedia.org/r/1054520 (https://phabricator.wikimedia.org/T364368) (owner: Stevemunene)
[11:25:18] (PS5) Brouberol: cloudnative-pg: Import the upstream chart for inspection [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797)
[11:25:18] (PS5) Brouberol: Add upstream version annotation [deployment-charts] - https://gerrit.wikimedia.org/r/1037733 (https://phabricator.wikimedia.org/T364797)
[11:25:18] (PS4) Brouberol: cloudnative-pg: add CI fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1049114 (https://phabricator.wikimedia.org/T364797)
[11:25:18] (PS7) Brouberol: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797)
[11:25:19] (PS9) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[11:25:20] (PS9) Brouberol: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[11:25:24] (PS9) Brouberol: cloudnative-pg: set image values
[deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[11:25:28] (PS14) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797)
[11:25:32] (PS1) Brouberol: cloudnative-pg: remove unused podmonitor templates/values/dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1059038 (https://phabricator.wikimedia.org/T364797)
[11:25:36] (PS1) Brouberol: cloudnative-pg: remove the crds values block [deployment-charts] - https://gerrit.wikimedia.org/r/1059039 (https://phabricator.wikimedia.org/T364797)
[11:25:40] (PS1) Brouberol: cloudnative-pg: cleanup chart version and maintainers [deployment-charts] - https://gerrit.wikimedia.org/r/1059040 (https://phabricator.wikimedia.org/T364797)
[11:25:44] (PS1) Brouberol: cloudnative-pg: drop the event.patch permission [deployment-charts] - https://gerrit.wikimedia.org/r/1059041 (https://phabricator.wikimedia.org/T364797)
[11:25:49] (PS3) Ayounsi: Increase the number of Redis DB for standalone Netbox [puppet] - https://gerrit.wikimedia.org/r/1059030
[11:25:52] (PS1) Ayounsi: check_netbox_report.py: reports -> scripts [puppet] - https://gerrit.wikimedia.org/r/1059042
[11:25:57] (PS1) Ilias Sarantopoulos: ml-services: update lang agnostic articlequality model [deployment-charts] - https://gerrit.wikimedia.org/r/1059043 (https://phabricator.wikimedia.org/T360455)
[11:27:55] (CR) Brouberol: cloudnative-pg: Import the upstream chart for inspection (7 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:28:12] (CR) Brouberol: cloudnative-pg: Import the upstream chart for inspection (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner:
Brouberol)
[11:29:06] (PS2) Brouberol: cloudnative-pg: remove unused podmonitor templates/values/dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1059038 (https://phabricator.wikimedia.org/T364797)
[11:29:06] (PS2) Brouberol: cloudnative-pg: remove the crds values block [deployment-charts] - https://gerrit.wikimedia.org/r/1059039 (https://phabricator.wikimedia.org/T364797)
[11:29:06] (PS2) Brouberol: cloudnative-pg: cleanup chart version and maintainers [deployment-charts] - https://gerrit.wikimedia.org/r/1059040 (https://phabricator.wikimedia.org/T364797)
[11:29:06] (PS2) Brouberol: cloudnative-pg: drop the event.patch permission [deployment-charts] - https://gerrit.wikimedia.org/r/1059041 (https://phabricator.wikimedia.org/T364797)
[11:29:07] (PS15) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797)
[11:31:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67192 and previous config saved to /var/cache/conftool/dbconfig/20240801-113108-root.json
[11:35:13] (CR) Alexandros Kosiaris: [C:+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - https://gerrit.wikimedia.org/r/1059041 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:35:30] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: cleanup chart version and maintainers [deployment-charts] - https://gerrit.wikimedia.org/r/1059040 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:41:31] (CR) Kevin Bazira: [C:+2] ml-services: staging config for modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1058574 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[11:42:29] (Merged) jenkins-bot: ml-services: staging config for
modernized rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1058574 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[11:43:59] (CR) Alexandros Kosiaris: [C:-1] "2 inline comments, rest LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:44:37] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: remove the crds values block [deployment-charts] - https://gerrit.wikimedia.org/r/1059039 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:45:01] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: remove unused podmonitor templates/values/dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1059038 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:45:43] (CR) Alexandros Kosiaris: "Removing -1, comments are being addressed to followup CRs. Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[11:46:56] ops-codfw, SRE, DC-Ops, observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10034978 (Jhancock.wm)
[11:47:50] ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10034979 (Jhancock.wm)
[11:48:30] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[11:48:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host alert2002.mgmt.codfw.wmnet with reboot policy FORCED
[11:48:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED
[11:49:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host alert2002.mgmt.codfw.wmnet with reboot policy FORCED
[11:49:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host vrts2002.mgmt.codfw.wmnet with reboot policy FORCED
[11:53:49] (PS1) Jgiannelos: changeprop: Add header to avoid unnecessary summary pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1059048
[11:57:01] (PS2) Jgiannelos: changeprop: Add header to avoid unnecessary summary pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1059048 (https://phabricator.wikimedia.org/T367418)
[11:57:37] (PS3) Jgiannelos: changeprop: Add header to avoid unnecessary summary pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1059048 (https://phabricator.wikimedia.org/T367418)
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1200)
[12:06:00] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node for host kubestage1003.eqiad.wmnet
[12:09:10] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) for host kubestage1003.eqiad.wmnet
[12:09:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node for host kubestage1003.eqiad.wmnet
[12:09:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) for host kubestage1003.eqiad.wmnet
[12:10:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['vrts2002']
[12:10:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts
['alert2002']
[12:10:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['vrts2002']
[12:10:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['alert2002']
[12:12:05] (PS1) Volans: sre.switchdc.databases.preparation: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351)
[12:12:25] * volans CI failure is expected, depends on a spicerack release
[12:13:15] (PS1) Volans: cookbooks: add config for sre.switchdc.databases [puppet] - https://gerrit.wikimedia.org/r/1059053 (https://phabricator.wikimedia.org/T371351)
[12:14:09] (CR) Volans: "CI failure is currently expected, depends on a new spicerack release not yet out." [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:16:02] (CR) CI reject: [V:-1] sre.switchdc.databases.preparation: new cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:16:58] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[12:17:36] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1059053 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:18:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm
[12:19:06] ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10035060 (Jhancock.wm)
[12:19:08] ops-codfw, SRE, DC-Ops, observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035061
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm
[12:19:17] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki=dewiki --olderThan=1721045915 --verbose # T371597
[12:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:19] T371597: Add Link: Release as "turned off" to German Wikipedia - https://phabricator.wikimedia.org/T371597
[12:20:00] (CR) Volans: "This is to add the replication credentials to the cumin host so that it can be read by the cookbook. It can be merged anytime so that the " [puppet] - https://gerrit.wikimedia.org/r/1059053 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[12:22:57] (CR) AOkoth: [C:+1] phabricator: delay pages my 30 minutes to reduce alerting noise [puppet] - https://gerrit.wikimedia.org/r/1059007 (https://phabricator.wikimedia.org/T371418) (owner: Jelto)
[12:23:57] (CR) Vgutierrez: [C:+1] "basic bash to validate that rules are exactly the same:" [puppet] - https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) (owner: BCornwall)
[12:23:58] (CR) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[12:24:13] (CR) Jelto: [C:+2] phabricator: delay pages my 30 minutes to reduce alerting noise [puppet] - https://gerrit.wikimedia.org/r/1059007 (https://phabricator.wikimedia.org/T371418) (owner: Jelto)
[12:25:18] (PS10) Brouberol: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797)
[12:25:18] (PS10) Brouberol: cloudnative-pg: move queries to
configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797)
[12:25:18] (PS10) Brouberol: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797)
[12:25:18] (PS3) Brouberol: cloudnative-pg: remove unused podmonitor templates/values/dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1059038 (https://phabricator.wikimedia.org/T364797)
[12:25:19] (PS3) Brouberol: cloudnative-pg: remove the crds values block [deployment-charts] - https://gerrit.wikimedia.org/r/1059039 (https://phabricator.wikimedia.org/T364797)
[12:25:21] (PS3) Brouberol: cloudnative-pg: cleanup chart version and maintainers [deployment-charts] - https://gerrit.wikimedia.org/r/1059040 (https://phabricator.wikimedia.org/T364797)
[12:25:25] (PS3) Brouberol: cloudnative-pg: drop the event.patch permission [deployment-charts] - https://gerrit.wikimedia.org/r/1059041 (https://phabricator.wikimedia.org/T364797)
[12:25:29] (PS16) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797)
[12:26:04] !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:26:36] (CR) Brouberol: Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[12:28:17] (Abandoned) Andrew Bogott: rabbitmq: create cinder-specific rabbit user [puppet] - https://gerrit.wikimedia.org/r/1058708 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[12:37:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert2002.wikimedia.org with reason: host reimage
[12:39:38] !log Decommission Add Link models for akwiki, nawiki (T371598)
[12:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:41] T371598: Unpublish Add Link models for wikis where it did not work - https://phabricator.wikimedia.org/T371598
[12:40:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on alert2002.wikimedia.org with reason: host reimage
[12:40:43] (PS1) Andrew Bogott: Switch designate to the new designate rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1059064 (https://phabricator.wikimedia.org/T320256)
[12:42:46] (CR) Volans: [C:-1] "Temporary -1 because it depends on the release in production of conftool 3.2.0.
Feel free to remove my -1 and merge once that's completed," [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: Giuseppe Lavagetto)
[12:43:11] (PS1) Andrew Bogott: Fake rabbitmq passwords for designate [labs/private] - https://gerrit.wikimedia.org/r/1059065 (https://phabricator.wikimedia.org/T320256)
[12:44:11] (PS2) Andrew Bogott: Fake rabbitmq passwords for designate [labs/private] - https://gerrit.wikimedia.org/r/1059065 (https://phabricator.wikimedia.org/T320256)
[12:48:13] SRE-tools, conftool, DBA, Infrastructure-Foundations, and 2 others: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#10035208 (Volans) Status update: The conftool improvements ([[ https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/9 | here ]] and [[ https://gi...
[12:51:04] (CR) Andrew Bogott: [V:+2 C:+2] Fake rabbitmq passwords for designate [labs/private] - https://gerrit.wikimedia.org/r/1059065 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[12:51:47] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1059064 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[12:52:10] ops-magru, Traffic: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371554#10035242 (Fabfur) Do we have any evidence that the disk has not been manually removed/tampered?
[12:52:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host gerrit2003.mgmt.codfw.wmnet with reboot policy GRACEFUL
[12:55:01] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[12:55:03] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[12:55:36] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: sync
[12:55:45] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync
[12:56:31] (CR) Ssingh: "Thanks for the review!" [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[12:56:34] (CR) Ssingh: [C:+2] Release 9.2.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057920 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[12:56:36] (PS2) Andrew Bogott: Switch designate to the new designate rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1059064 (https://phabricator.wikimedia.org/T320256)
[12:56:38] (CR) Ssingh: [C:+2] Release 9.2.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: Ssingh)
[12:57:08] (PS1) Ilias Sarantopoulos: ml-services: override staging rec-api entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1059071 (https://phabricator.wikimedia.org/T371465)
[12:58:18] !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/linkrecommendation: sync
[12:58:43] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1059064 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[12:59:03] (CR) Ssingh: [C:+2] sre.dns.roll-upgrade-ats: update cookbook (changes below) [cookbooks] - https://gerrit.wikimedia.org/r/1058652 (owner: Ssingh)
[12:59:29] !log urbanecm@deploy1003 helmfile [codfw] DONE
helmfile.d/services/linkrecommendation: sync
[12:59:35] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: sync
[12:59:52] (CR) Kevin Bazira: [C:+1] "ok. let's give this a shot. my understanding is that the container has this entrypoint by default." [deployment-charts] - https://gerrit.wikimedia.org/r/1059071 (https://phabricator.wikimedia.org/T371465) (owner: Ilias Sarantopoulos)
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1300).
[13:00:05] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync
[13:00:05] No Gerrit patches in the queue for this window AFAICS.
[13:00:24] that’s good, I probably couldn’t deploy today anyway :)
[13:00:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit2003.mgmt.codfw.wmnet with reboot policy GRACEFUL
[13:02:48] (CR) Ilias Sarantopoulos: [C:+2] ml-services: override staging rec-api entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1059071 (https://phabricator.wikimedia.org/T371465) (owner: Ilias Sarantopoulos)
[13:03:58] !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:04:30] (CR) Andrew Bogott: [C:+2] Switch designate to the new designate rabbitmq user [puppet] - https://gerrit.wikimedia.org/r/1059064 (https://phabricator.wikimedia.org/T320256) (owner: Andrew Bogott)
[13:04:37] (PS3) Elukey: redfish: add the add_account function [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372)
[13:05:28] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: Import the upstream chart for inspection (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:06:16] (PS4) Elukey: redfish: add the add_account function [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372)
[13:06:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org,service=recdns [reason: pdns-rec upgrade]
[13:07:11] (CR) Elukey: redfish: add the add_account function (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[13:08:16] SRE, Infrastructure-Foundations, netops: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10035314 (cmooney) I've been playing with this a little on Netbox-Next, you can see the data here covering our existing GRE tunnels: https://netbox-next.wikimedia.org/vpn/tunnels/ Initia...
[13:09:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org,service=recdns [reason: [done] pdns-rec upgrade]
[13:11:18] (PS1) NMW03: Increase IP cap limit for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1059075 (https://phabricator.wikimedia.org/T371439)
[13:12:53] (CR) CI reject: [V:-1] redfish: add the add_account function [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[13:13:13] (CR) Brouberol: [C:+2] cloudnative-pg: drop the event.patch permission [deployment-charts] - https://gerrit.wikimedia.org/r/1059041 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:14:26] (CR) Volans: [C:+1] "LGTM, docstrings nits inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[13:17:12] (CR) Alexandros Kosiaris: [C:+1] cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:18:53] !log depool cp4037 to test remove benthos package / conffiles (T370741)
[13:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:56] hello
[13:19:01] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741
[13:19:02] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[13:19:09] who is deployer
[13:19:31] I have urgent patch (IP cap limit)
[13:20:47] (CR) Brouberol: [C:+2] cloudnative-pg: cleanup chart version and maintainers [deployment-charts] - https://gerrit.wikimedia.org/r/1059040 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:00] (CR) Brouberol: [C:+2] cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces
[deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:26] (CR) Brouberol: [C:+2] cloudnative-pg: remove the crds values block [deployment-charts] - https://gerrit.wikimedia.org/r/1059039 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:27] (CR) Brouberol: [C:+2] cloudnative-pg: remove unused podmonitor templates/values/dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1059038 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:29] (CR) Brouberol: [C:+2] cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:30] (CR) Brouberol: [C:+2] cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:32] (CR) Brouberol: [C:+2] cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:33] (CR) Brouberol: [C:+2] cloudnative-pg: add CI fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1049114 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:35] (CR) Brouberol: [C:+2] Add upstream version annotation [deployment-charts] - https://gerrit.wikimedia.org/r/1037733 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:37] (CR) Brouberol: [C:+2] cloudnative-pg: Import the upstream chart for inspection [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:21:55] (PS5) Elukey: redfish: add the add_account function [software/spicerack] - https://gerrit.wikimedia.org/r/1052311
(https://phabricator.wikimedia.org/T365372)
[13:22:16] (CR) Elukey: redfish: add the add_account function (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[13:22:17] (Merged) jenkins-bot: cloudnative-pg: Import the upstream chart for inspection [deployment-charts] - https://gerrit.wikimedia.org/r/1037731 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:22:19] (Merged) jenkins-bot: Add upstream version annotation [deployment-charts] - https://gerrit.wikimedia.org/r/1037733 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:22:25] (Merged) jenkins-bot: cloudnative-pg: add CI fixtures [deployment-charts] - https://gerrit.wikimedia.org/r/1049114 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:22:26] (Merged) jenkins-bot: cloudnative-pg: allow the specification of watched namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049109 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:22:56] (Merged) jenkins-bot: cloudnative-pg: adjust RBAC management by scoping it to PG cluster namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1049085 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:23:01] (Merged) jenkins-bot: cloudnative-pg: move queries to configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1049086 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:23:02] Nemoralis: I can’t deploy, sorry (about to go afk for a bit)
[13:23:02] (Merged) jenkins-bot: cloudnative-pg: set image values [deployment-charts] - https://gerrit.wikimedia.org/r/1049087 (https://phabricator.wikimedia.org/T364797) (owner: Brouberol)
[13:23:04] (Merged) jenkins-bot: cloudnative-pg: remove unused podmonitor templates/values/dependencies [deployment-charts] - https://gerrit.wikimedia.org/r/1059038
(https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:23:05] (03Merged) 10jenkins-bot: cloudnative-pg: remove the crds values block [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059039 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:23:08] (03Merged) 10jenkins-bot: cloudnative-pg: cleanup chart version and maintainers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059040 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:23:10] (03Merged) 10jenkins-bot: cloudnative-pg: drop the event.patch permission [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059041 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:23:40] Lucas_WMDE: :( [13:24:28] Nemoralis: this is for the next deploy window? [13:24:48] current [13:24:59] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1059075 [13:25:25] We are hosting an edit-a-thon and everybody is getting spam blocked [13:25:27] jouncebot: current [13:25:37] I forgot to deploy it yesterday [13:25:43] jouncebot: !current [13:25:46] was it added after this window started? [13:25:51] jouncebot: next [13:25:52] In 1 hour(s) and 34 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1500) [13:25:59] jouncebot: now [13:26:00] For the next 0 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1300) [13:26:08] bblack: added it now [13:26:15] so yes [13:26:41] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime [13:26:54] any of you able to take this on? 
[13:28:31] (03CR) 10CI reject: [V:04-1] redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:29:34] (03PS6) 10Elukey: redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) [13:30:16] i'm in a meeting unfortunately :( [13:30:32] :( [13:31:03] (03CR) 10CDanis: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059075 (https://phabricator.wikimedia.org/T371439) (owner: 10NMW03) [13:31:06] I can deploy if no one else is available [13:31:23] yes please [13:31:34] jouncebot: nowandnext [13:31:35] For the next 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1300) [13:31:35] In 1 hour(s) and 28 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1500) [13:31:58] cdanis: it needs https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold as well, and has no review or input on the patch or ticket yet [13:32:04] bblack: indeed [13:32:11] I know that file and the patch itself looks ok to me [13:32:22] ok [13:32:47] (03CR) 10CDanis: Increase IP cap limit for azwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059075 (https://phabricator.wikimedia.org/T371439) (owner: 10NMW03) [13:33:01] (03CR) 10Elukey: [C:03+1] "I would add a comment about why "5", so we'll know in the future." [puppet] - 10https://gerrit.wikimedia.org/r/1059030 (owner: 10Ayounsi) [13:33:32] Nemoralis: maybe also add commons? 
[13:33:40] you have to run maintenance script too [13:33:45] cdanis: what for [13:33:55] (03CR) 10Brouberol: [C:03+2] Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037734 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:33:58] because editors typically also add images [13:34:19] do you mean images of editathon [13:34:28] I can upload one now [13:34:31] bblack: but this is an account creation throttle I thought [13:34:39] as long as they make their account on azwiki it should be ok [13:34:43] I will upload all of them once [13:35:06] (03PS2) 10CDanis: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059075 (https://phabricator.wikimedia.org/T371439) (owner: 10NMW03) [13:35:27] (03CR) 10CDanis: [C:03+2] Increase IP cap limit for azwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059075 (https://phabricator.wikimedia.org/T371439) (owner: 10NMW03) [13:36:01] "This will also raise edit rate limits to the value used for autoconfirmed users (by default 90 edits/minute) instead of the limit for non-autoconfirmed users (8 edits/minute for all newbies coming from the same IP address)." [13:36:10] this is why everybody is getting blocked by mediawiki [13:36:11] (03CR) 10Alexandros Kosiaris: [C:03+1] "LGTM, don't forget to bump the chart version number too!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1059048 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [13:36:17] (03Merged) 10jenkins-bot: Increase IP cap limit for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059075 (https://phabricator.wikimedia.org/T371439) (owner: 10NMW03) [13:36:23] (03PS1) 10Brouberol: cloudnative-pg: Bump chart versions to integrate all customizations in a new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) [13:37:21] !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1059075|Increase IP cap limit for azwiki (T371439)]] [13:37:24] T371439: Requesting temporary lift of IP cap for azwiki - https://phabricator.wikimedia.org/T371439 [13:38:01] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:40:06] !log cdanis@deploy1003 cdanis, nmw03: Backport for [[gerrit:1059075|Increase IP cap limit for azwiki (T371439)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:08] !log cdanis@deploy1003 cdanis, nmw03: Continuing with sync [13:40:28] (03CR) 10Stevemunene: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:41:17] (03CR) 10Elukey: [C:03+2] redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:41:33] Nemoralis: almost merged, and I have the command to reset the counter ready. 
In the future please do not forget, we cannot and will not do an emergency deploy every time :) [13:41:38] (03CR) 10Stevemunene: cloudnative-pg: Bump chart versions to integrate all customizations in a new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:42:17] cdanis: thanks! Sure, I will [13:42:27] I think you have to run resetAuthenticationThrottle script [13:42:40] yes, I have the command ready [13:44:27] Nemoralis: please try now [13:44:45] (03PS1) 10Ayounsi: Cable validator: prevent cables with multiple terminations [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059085 [13:44:49] !log cdanis@deploy1003 Finished scap: Backport for [[gerrit:1059075|Increase IP cap limit for azwiki (T371439)]] (duration: 07m 28s) [13:44:51] T371439: Requesting temporary lift of IP cap for azwiki - https://phabricator.wikimedia.org/T371439 [13:45:01] did you run it already? [13:45:39] yes [13:45:47] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:46:13] (03CR) 10CI reject: [V:04-1] Cable validator: prevent cables with multiple terminations [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059085 (owner: 10Ayounsi) [13:46:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1036.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:47:32] (03Merged) 10jenkins-bot: redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:48:29] cdanis: thanks! 
[13:49:02] (03PS4) 10Ayounsi: Increase the number of Redis DB for standalone Netbox [puppet] - 10https://gerrit.wikimedia.org/r/1059030 [13:49:02] (03PS2) 10Ayounsi: check_netbox_report.py: reports -> scripts [puppet] - 10https://gerrit.wikimedia.org/r/1059042 [13:49:51] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node for host kubestage1003.eqiad.wmnet [13:49:51] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) for host kubestage1003.eqiad.wmnet [13:49:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10035491 (10Jclark-ctr) wikikube-worker1260 3183 #. 1 wikikube-worker1261 3182 #. 0 wikikube-worker1262 3184 #. 2 wikikube-worker1263... [13:50:07] (03PS2) 10Ayounsi: Cable validator: prevent cables with multiple terminations [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059085 [13:51:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [13:51:54] (03PS5) 10Ayounsi: Increase the number of Redis DB for standalone Netbox [puppet] - 10https://gerrit.wikimedia.org/r/1059030 [13:51:54] (03PS3) 10Ayounsi: check_netbox_report.py: reports -> scripts [puppet] - 10https://gerrit.wikimedia.org/r/1059042 [13:52:03] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1037.mgmt.eqiad.wmnet with reboot policy GRACEFUL [13:53:01] (03CR) 10Ayounsi: [C:03+2] "Added. Thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1059030 (owner: 10Ayounsi) [13:54:58] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059085 (owner: 10Ayounsi) [13:59:53] (03CR) 10Ayounsi: [C:03+2] Cable validator: prevent cables with multiple terminations [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059085 (owner: 10Ayounsi) [14:00:48] (03Merged) 10jenkins-bot: Cable validator: prevent cables with multiple terminations [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1059085 (owner: 10Ayounsi) [14:01:25] (03PS4) 10Jgiannelos: changeprop: Add header to avoid unnecessary summary pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059048 (https://phabricator.wikimedia.org/T367418) [14:01:28] (03CR) 10Elukey: "For this one do we need a new pynetbox release?" [puppet] - 10https://gerrit.wikimedia.org/r/1059042 (owner: 10Ayounsi) [14:01:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1037.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:02:55] (03CR) 10Jgiannelos: [C:03+2] changeprop: Add header to avoid unnecessary summary pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059048 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [14:03:56] (03Merged) 10jenkins-bot: changeprop: Add header to avoid unnecessary summary pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059048 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [14:05:21] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:05:53] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.10.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1059089 [14:07:09] (03CR) 10Volans: [C:03+1] "\o/" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1059089 (owner: 10Elukey) [14:13:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2002.wikimedia.org with OS bookworm [14:13:29] 
10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035585 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm executed with errors: - alert2002 (... [14:14:05] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.10.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1059089 (owner: 10Elukey) [14:14:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1038.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:15:04] (03CR) 10Andrew Bogott: [C:03+1] "Yep, let's give this a try before I pull everything apart :)" [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) (owner: 10JHathaway) [14:15:17] (03PS1) 10Elukey: Upstream release v8.10.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1059091 [14:16:29] (03PS2) 10Brouberol: cloudnative-pg: Bump chart versions to integrate all customizations in a new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) [14:16:37] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:17:04] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v8.10.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1059091 (owner: 10Elukey) [14:18:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm [14:18:24] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm [14:19:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host alert2002.wikimedia.org with OS bookworm [14:19:14] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm executed with errors: - alert2002 (... [14:20:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm [14:20:14] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035612 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm [14:20:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2002.wikimedia.org with OS bookworm [14:20:36] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035613 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm executed with errors: - alert2002 (... [14:28:08] 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10035633 (10herron) Comparing with an example after hours incident like [[ https://portal.victorops.com/ui/wikimedia/incident/4926/details | incident 4926 ]] VO logs... 
[14:28:27] !log upgrade debmonitor-server on debmonitor2003 to 0.5.0 [14:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:58] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035654 (10Jhancock.wm) [14:30:52] (03PS3) 10Ayounsi: Validators: enforce Trident3 port block consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) [14:31:18] !log repool cp4037 (T370741) [14:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:21] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 [14:31:23] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [14:32:20] (03CR) 10Ayounsi: Validators: enforce Trident3 port block consistency (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [14:33:18] (03CR) 10Andrew Bogott: [C:03+2] p:toolforge::bastion: add deprecated banner [puppet] - 10https://gerrit.wikimedia.org/r/1058654 (owner: 10Lucas Werkmeister) [14:34:26] !log uploaded spicerack_8.10.0 to apt.wikimedia.org bullseye-wikimedia [14:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:37] jouncebot: nowandnext [14:35:37] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [14:35:37] In 0 hour(s) and 24 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1500) [14:35:44] (03CR) 10Zabe: [C:03+2] Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [14:36:41] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10035685 (10Jhancock.wm) ran into a puppet issue on this one. 
ran the first install just fine up to a point. failed here: ` [237/240, retrying in 10.00s] Attempt to run 'spicerack.... [14:36:49] (03PS8) 10Zabe: Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [14:36:59] (03CR) 10Zabe: [C:03+2] Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [14:37:39] (03Merged) 10jenkins-bot: Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [14:38:04] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1057034|Move section mapping to separate file]] [14:38:59] (03PS4) 10Ayounsi: Validators: enforce Trident3 port block consistency [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) [14:39:22] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:17] !log zabe@deploy1003 zabe: Backport for [[gerrit:1057034|Move section mapping to separate file]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:41:06] (03PS1) 10Brouberol: cloudnative-pg: create a test namespace and make the operator watch it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059093 (https://phabricator.wikimedia.org/T364797) [14:41:42] !log zabe@deploy1003 zabe: Continuing with sync [14:43:20] (03CR) 10Ayounsi: "Example output: https://usercontent.irccloud-cdn.com/file/ASpBVsoV/Screenshot%202024-08-01%20at%2016-36-37%20Editing%20interface%20ge-0_0_" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [14:45:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for 
host cloudcephosd1039.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:46:11] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1057034|Move section mapping to separate file]] (duration: 08m 06s) [14:48:30] (03PS3) 10Ayounsi: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [14:49:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:49:24] (03CR) 10CI reject: [V:04-1] Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [14:49:32] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:49:59] (03PS4) 10Ayounsi: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [14:50:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [14:51:13] (03PS14) 10Clément Goubert: sre.k8s: Add pool-depool-node cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1059045 [14:53:06] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node for host kubestage1003.eqiad.wmnet [14:53:06] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) for host kubestage1003.eqiad.wmnet [14:53:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.975s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:54:05] (03PS15) 10Clément Goubert: sre.k8s: Add pool-depool-node cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1059045 [14:54:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:15] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host kubestage1003.eqiad.wmnet [14:54:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host kubestage1003.eqiad.wmnet [14:57:18] (03PS2) 10Brennen Bearnes: logspam-watch: Add version column, group errors [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) [14:58:00] (03PS3) 10Brennen Bearnes: logspam-watch: Add version column, group errors [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) [14:58:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 2.855s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:59:21] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1040.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:00:05] brennen and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1500). [15:00:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding prometheus2007 to codfw - jhancock@cumin2002" [15:00:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding prometheus2007 to codfw - jhancock@cumin2002" [15:00:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:00:40] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host prometheus2007.mgmt.codfw.wmnet with reboot policy FORCED [15:01:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host prometheus2008.mgmt.codfw.wmnet with reboot policy FORCED [15:02:58] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10035741 (10Jhancock.wm) [15:04:02] !log rollback debmonitor-server to 0.4.0-3 on debmonitor2003 [15:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:06] (03PS4) 10Ayounsi: check_netbox_report.py: reports -> scripts [puppet] - 10https://gerrit.wikimedia.org/r/1059042 [15:05:06] (03PS1) 10Ayounsi: Netbox add 
libpq-dev package [puppet] - 10https://gerrit.wikimedia.org/r/1059099 [15:11:12] (03CR) 10JHathaway: "sounds good, do you want to roll it out and test?" [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) (owner: 10JHathaway) [15:11:52] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10035757 (10Jhancock.wm) [15:13:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host prometheus2007.mgmt.codfw.wmnet with reboot policy FORCED [15:13:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['prometheus2007'] [15:14:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['prometheus2007'] [15:16:03] (03CR) 10BCornwall: [V:03+1 C:03+2] ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [15:17:30] !log jgiannelos@deploy1003 Started deploy [restbase/deploy@f696b76]: (no justification provided) [15:17:34] (03PS5) 10Ssingh: P:conftool: add schema for geodns [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) [15:18:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10035781 (10Papaul) @cmooney links removed. You can resolve the task if nothing else needs to be done. 
[15:18:57] (03CR) 10JHathaway: [C:03+1] data.yaml: Offboarding sbailey [puppet] - 10https://gerrit.wikimedia.org/r/1058952 (owner: 10Slyngshede) [15:21:34] 10ops-magru, 06SRE: Degraded RAID on cp7015 - https://phabricator.wikimedia.org/T371618 (10ops-monitoring-bot) 03NEW [15:22:04] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10035803 (10elukey) Tried to test the new debmonitor-server on debmonitor2003: * changed an-worker1080 (random host) /etc/hosts to point debmonitor.discovery.wm... [15:23:13] !log installing spicerack v8.10.0 to cumin2002 [15:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:13] (03CR) 10Elukey: [C:03+1] Netbox add libpq-dev package [puppet] - 10https://gerrit.wikimedia.org/r/1059099 (owner: 10Ayounsi) [15:24:17] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1055231 (owner: 10Ncmonitor) [15:26:26] (03PS2) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1055230 [15:26:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host prometheus2008.mgmt.codfw.wmnet with reboot policy FORCED [15:27:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['prometheus2008'] [15:27:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['prometheus2008'] [15:27:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [15:27:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2008.codfw.wmnet with OS bookworm [15:28:08] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - 
https://phabricator.wikimedia.org/T370429#10035809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host prometheus2007.codfw.wmnet with OS bookworm [15:28:09] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10035810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host prometheus2008.codfw.wmnet with OS bookworm [15:29:45] (03PS1) 10Brouberol: cloudnative-pg: create namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059101 (https://phabricator.wikimedia.org/T364797) [15:30:17] (03PS2) 10Brouberol: cloudnative-pg: create operator namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059101 (https://phabricator.wikimedia.org/T364797) [15:31:49] (03PS3) 10Brouberol: cloudnative-pg: create operator namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059101 (https://phabricator.wikimedia.org/T364797) [15:31:49] (03PS2) 10Brouberol: cloudnative-pg: create a test namespace and make the operator watch it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059093 (https://phabricator.wikimedia.org/T364797) [15:34:14] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1041.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:34:38] !log jgiannelos@deploy1003 Finished deploy [restbase/deploy@f696b76]: (no justification provided) (duration: 17m 07s) [15:39:35] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10035867 (10Papaul) @Dwisehaupt we have 3 payment nodes ready to rack but we have no room in C8, We do also have 3 payment nodes already racked payment200[1-3] . since t... 
[15:41:08] (03PS4) 10Brouberol: cloudnative-pg: create operator namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059101 (https://phabricator.wikimedia.org/T364797) [15:41:08] (03PS3) 10Brouberol: cloudnative-pg: create a test namespace and make the operator watch it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059093 (https://phabricator.wikimedia.org/T364797) [15:43:08] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:43:29] (03CR) 10Scott French: [C:03+2] mediawiki-cache-warmup: support 'clone' for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [15:44:02] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:45:55] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:46:12] (03CR) 10Scott French: [C:03+2] deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [15:46:14] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:46:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [15:46:57] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:47:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1041.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:47:51] !log installing spicerack v8.10.0 to cumin1002 [15:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:30] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:48:52] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:49:12] (03PS1) 10Effie Mouzeli: 
mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) [15:49:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [15:49:34] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10035928 (10Jhancock.wm) [15:49:35] (03CR) 10CI reject: [V:04-1] mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) (owner: 10Effie Mouzeli) [15:50:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [15:50:16] (03PS2) 10Effie Mouzeli: mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) [15:52:09] (03CR) 10Elukey: [C:03+1] cookbooks: add config for sre.switchdc.databases [puppet] - 10https://gerrit.wikimedia.org/r/1059053 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [15:52:55] (03PS1) 10Dzahn: ci: add new ECDSA ssh key for jenkins to connect to itself [puppet] - 10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) [15:53:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [15:53:16] (03CR) 10Volans: [C:03+2] cookbooks: add config for sre.switchdc.databases [puppet] - 10https://gerrit.wikimedia.org/r/1059053 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [15:53:25] (03PS3) 10Effie Mouzeli: mediawiki: add wikitech to virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) [15:53:29] (03CR) 10CI reject: [V:04-1] ci: add new ECDSA ssh key for jenkins to connect to itself [puppet] - 
10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [15:53:30] (03PS4) 10Brennen Bearnes: logspam-watch: Add version column, group errors [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) [15:53:34] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) (owner: 10Effie Mouzeli) [15:54:21] (03CR) 10Dzahn: [C:03+2] "https://phabricator.wikimedia.org/T371575" [puppet] - 10https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [15:56:22] (03CR) 10Scott French: [C:03+2] switchdc: mediawiki cache warmup now targets k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [15:59:42] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg: Bump chart versions to integrate all customizations in a new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [16:00:04] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:10] (03Merged) 10jenkins-bot: switchdc: mediawiki cache warmup now targets k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [16:01:44] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10036001 (10Papaul) @Dwisehaupt we can do the same process I mentioned in https://phabricator.wikimedia.org/T369942 here [16:03:57] 06SRE, 06Infrastructure-Foundations, 10netops: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351#10036015 (10cmooney) After discussing with @ayounsi on irc I've adjusted the approach: https://netbox-next.wikimedia.org/vpn/tunnels/ Principal decisions were: # We will use a group calle... [16:09:23] (03PS5) 10Brennen Bearnes: logspam-watch: Add version column, group errors [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) [16:09:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus2008.codfw.wmnet with OS bookworm [16:11:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus2007.codfw.wmnet with OS bookworm [16:17:23] (03PS2) 10Ilias Sarantopoulos: ml-services: update lang agnostic articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059043 (https://phabricator.wikimedia.org/T360455) [16:23:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [16:23:57] (03PS3) 10Ilias Sarantopoulos: ml-services: update lang agnostic articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059043 (https://phabricator.wikimedia.org/T360455) [16:24:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2008.codfw.wmnet with OS bookworm [16:25:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 
on prometheus2007.codfw.wmnet with reason: host reimage [16:27:14] (03PS2) 10Volans: sre.switchdc.databases.preparation: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [16:27:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [16:28:03] (03PS1) 10Effie Mouzeli: (DNM) trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1059118 (https://phabricator.wikimedia.org/T371358) [16:28:26] (03CR) 10CI reject: [V:04-1] (DNM) trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1059118 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [16:28:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [16:30:19] (03PS2) 10Effie Mouzeli: (DNM) trafficserver: remove wikitech routing [puppet] - 10https://gerrit.wikimedia.org/r/1059118 (https://phabricator.wikimedia.org/T371358) [16:30:29] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059118 (https://phabricator.wikimedia.org/T371358) (owner: 10Effie Mouzeli) [16:31:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [16:39:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus2007.codfw.wmnet with OS bookworm [16:41:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus2008.codfw.wmnet with OS bookworm [16:53:53] (03PS1) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [16:55:30] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: Bump chart versions to 
integrate all customizations in a new release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059082 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [16:55:48] (03PS2) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) [16:58:55] (03PS1) 10BPirkle: Add content.v1 REST module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059124 (https://phabricator.wikimedia.org/T370430) [16:59:19] (03PS1) 10Andrew Bogott: cloud-vps dynamic proxy: prometheus stats from nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) [16:59:42] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1059126 [16:59:43] (03PS1) 10AOkoth: site: add setup roles for vrts hardware [puppet] - 10https://gerrit.wikimedia.org/r/1059127 (https://phabricator.wikimedia.org/T369672) [16:59:45] (03CR) 10CI reject: [V:04-1] cloud-vps dynamic proxy: prometheus stats from nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) (owner: 10Andrew Bogott) [17:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1700) [17:00:13] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1059126 [17:00:19] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (owner: 10CDanis) [17:00:50] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1059127 (https://phabricator.wikimedia.org/T369672) (owner: 10AOkoth) [17:01:22] (03CR) 10Dzahn: [C:03+1] "we can link https://phabricator.wikimedia.org/T369674 as well" [puppet] - 10https://gerrit.wikimedia.org/r/1059127 (https://phabricator.wikimedia.org/T369672) (owner: 10AOkoth) [17:01:31] (03PS2) 10Dzahn: site: add setup roles for vrts hardware [puppet] - 10https://gerrit.wikimedia.org/r/1059127 (https://phabricator.wikimedia.org/T369672) (owner: 10AOkoth) [17:01:37] (03CR) 10Ahmon Dancy: logspam-watch: Add version column, group errors (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) (owner: 10Brennen Bearnes) [17:01:57] o/ nothing for me to do in the WMCS window today. [17:02:28] (03CR) 10Dzahn: "once this is merged, you can check the "add to puppet" check box on both tickets, one eqiad, one codfw. 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1059127 (https://phabricator.wikimedia.org/T369672) (owner: 10AOkoth) [17:03:02] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1059126 [17:03:06] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (owner: 10CDanis) [17:03:56] (03CR) 10AOkoth: [C:03+2] site: add setup roles for vrts hardware [puppet] - 10https://gerrit.wikimedia.org/r/1059127 (https://phabricator.wikimedia.org/T369672) (owner: 10AOkoth) [17:04:23] (03PS2) 10Andrew Bogott: cloud-vps dynamic proxy: prometheus stats from nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) [17:05:54] (03PS2) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [17:06:54] (03CR) 10CI reject: [V:04-1] cloud-vps dynamic proxy: prometheus stats from nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) (owner: 10Andrew Bogott) [17:07:36] (03CR) 10Andrew Bogott: "nginx license is waiting on a response to my question at https://gist.github.com/mattpr/de96f3a9c7b895ce5a9fbbe8812d0890" [puppet] - 10https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) (owner: 10Andrew Bogott) [17:09:16] (03CR) 10Brennen Bearnes: logspam-watch: Add version column, group errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) (owner: 10Brennen Bearnes) [17:14:51] (03CR) 10Ahmon Dancy: [C:03+1] "You'll probably want to rebase this so it's not stacked on my controversial changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) (owner: 10Brennen Bearnes) [17:15:31] (03PS6) 10Brennen Bearnes: logspam-watch: Add version column, group errors [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) [17:19:53] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update lang agnostic articlequality model [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059043 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [17:19:54] !log cdanis@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:21:04] (03CR) 10Brennen Bearnes: "Thanks Ahmon!" [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) (owner: 10Brennen Bearnes) [17:21:31] !log cdanis@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:24:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [17:25:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:27:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [17:31:42] (03PS85) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [17:31:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [17:32:50] (03PS3) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [17:42:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for 
host prometheus2007.codfw.wmnet with OS bookworm [17:42:15] (03PS1) 10Jsn.sherman: revisionCheck: skip null wikiPages [extensions/AutoModerator] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1059130 (https://phabricator.wikimedia.org/T371348) [17:43:32] (03PS1) 10Ssingh: nrpe::monitor_service: clarify interval is in minutes [puppet] - 10https://gerrit.wikimedia.org/r/1059131 [17:43:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/AutoModerator] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1059130 (https://phabricator.wikimedia.org/T371348) (owner: 10Jsn.sherman) [17:44:07] (03CR) 10Brennen Bearnes: [C:03+1] gitlab: enable throttling for all GitLab instances [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [17:46:57] (03PS4) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [17:48:27] (03PS1) 10Ssingh: gdnsd::monitor_conf: set alert level CRITICAL [puppet] - 10https://gerrit.wikimedia.org/r/1059132 [17:48:31] (03CR) 10BBlack: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [17:49:28] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3475/console" [puppet] - 10https://gerrit.wikimedia.org/r/1059132 (owner: 10Ssingh) [17:50:25] (03CR) 10Ssingh: [V:03+1 C:03+2] gdnsd::monitor_conf: set alert level CRITICAL [puppet] - 10https://gerrit.wikimedia.org/r/1059132 (owner: 10Ssingh) [17:52:56] (03PS1) 10CDanis: otelcol: bump RAM reservation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059133 (https://phabricator.wikimedia.org/T370043) [17:58:44] (03PS5) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [17:58:46] !log cdanis@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:58:52] !log cdanis@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:00:00] (03CR) 10Chris_steinchen: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: 10CDobbins) [18:00:03] (03CR) 10CDanis: [C:03+2] otelcol: bump RAM reservation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059133 (https://phabricator.wikimedia.org/T370043) (owner: 10CDanis) [18:00:05] brennen and dduvall: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T1800). Please do the needful. [18:00:19] o/ [18:00:53] !log 1.43.0-wmf.16 train (T366961): no current blockers, logs cluttered but not too scary, rolling to all wikis. 
[18:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:57] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [18:02:29] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059134 (https://phabricator.wikimedia.org/T366961) [18:02:31] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059134 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [18:03:14] (03Merged) 10jenkins-bot: otelcol: bump RAM reservation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059133 (https://phabricator.wikimedia.org/T370043) (owner: 10CDanis) [18:03:25] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059134 (https://phabricator.wikimedia.org/T366961) (owner: 10TrainBranchBot) [18:06:39] (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1059126 [18:08:07] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (owner: 10CDanis) [18:10:12] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.16 refs T366961 [18:10:16] T366961: 1.43.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T366961 [18:12:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm [18:16:01] (03PS1) 10Ssingh: gdnsd::monitor_conf: update notes URL [puppet] - 10https://gerrit.wikimedia.org/r/1059139 [18:16:27] (03PS1) 10BBlack: Remove admin_state handling from ops/dns [dns] - 10https://gerrit.wikimedia.org/r/1059140 [18:16:43] (03PS1) 10Chris_steinchen: Revert "group2 to 1.43.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059141 [18:16:55] (03Abandoned) 10Chris_steinchen: Revert "group2 to 1.43.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059141 (owner: 
10Chris_steinchen) [18:17:53] (03CR) 10Ssingh: [C:03+2] gdnsd::monitor_conf: update notes URL [puppet] - 10https://gerrit.wikimedia.org/r/1059139 (owner: 10Ssingh) [18:24:58] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10036474 (10Dwisehaupt) [18:25:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192#10036478 (10VRiley-WMF) Device out of warranty. Looked to see if there were any replacement drives from previously decommissioned servers. I thought I had located a spare hard drive... [18:25:47] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10036480 (10Dwisehaupt) This host is ready. It has downtime set in icinga and will soon have a diff for removal from the config. [18:26:17] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10036481 (10Dwisehaupt) [18:26:21] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10036485 (10Dwisehaupt) This host is ready. It has downtime set in icinga and will soon have a diff for removal from the config. 
[18:27:30] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10036489 (10Dwisehaupt) [18:29:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert2002.wikimedia.org with reason: host reimage [18:29:21] (03CR) 10Ahmon Dancy: gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [18:29:43] (03PS1) 10Scott French: mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) [18:30:46] (03PS6) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [18:31:51] (03CR) 10Dzahn: [C:03+1] gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [18:32:17] (03CR) 10CI reject: [V:04-1] mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [18:32:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on alert2002.wikimedia.org with reason: host reimage [18:33:21] (03PS8) 10Dzahn: ci: add new ECDSA ssh key for jenkins to connect to itself [puppet] - 10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) [18:33:28] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1059106/3477/contint2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [18:38:20] (03CR) 10Scardenasmolinar: [C:03+1] "Ready for backport!" 
[extensions/AutoModerator] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1059130 (https://phabricator.wikimedia.org/T371348) (owner: 10Jsn.sherman) [18:39:49] (03PS2) 10Scott French: mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) [18:40:28] (03PS7) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [18:40:49] (03PS3) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [18:42:59] (03PS3) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) [18:43:10] (03CR) 10Ahmon Dancy: gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [18:44:00] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [18:44:20] (03PS4) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) [18:44:21] (03PS3) 10Scott French: mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) [18:44:42] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [18:45:19] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [18:45:41] (03CR) 10Kgraessle: [C:03+1] "Thanks for backporting this!" [extensions/AutoModerator] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1059130 (https://phabricator.wikimedia.org/T371348) (owner: 10Jsn.sherman) [18:47:11] (03CR) 10Volans: "See https://phabricator.wikimedia.org/T371351#10036520 for the latest context and test runs" [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [18:49:05] (03PS4) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [18:49:24] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2002.wikimedia.org with OS bookworm [18:49:33] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10036532 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host alert2002.wikimedia.org with OS bookworm executed with errors: - alert2002 (**... [18:49:35] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, and 2 others: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10036531 (10Dzahn) >>! In T177826#10034509, @hashar wrote: > I would append the ECDSA key there. Looks like `ssh::userkey` `$con... 
[18:50:52] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, and 2 others: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10036534 (10Dzahn) 05Open→03In progress [18:51:08] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2031.codfw.wmnet,service=(cdn|ats-be) [18:51:23] (03PS8) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [18:52:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.373s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:53:14] (03PS4) 10Scott French: mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) [18:55:14] (03CR) 10Scott French: "PCC fails when generically targeting O:deployment_server::kubernetes, as it picks up deploy1002 as a matching example host (no longer in s" [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [18:55:25] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [18:57:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.815s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:57:49] (03CR) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (owner: 10Ahmon Dancy) [18:57:53] (03CR) 10Scott French: "Thanks in advance for the review, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:04:55] (03PS9) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:13:50] (03CR) 10RLazarus: [C:03+1] mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:15:27] (03PS5) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1059126 [19:15:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host alert2002.wikimedia.org with OS bookworm [19:15:40] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (owner: 10CDanis) [19:15:44] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10036613 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm [19:16:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2002.wikimedia.org with OS bookworm [19:16:38] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert2002 - https://phabricator.wikimedia.org/T370112#10036616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host alert2002.wikimedia.org with OS bookworm executed with errors: - alert2002 (... 
[19:17:14] (03PS10) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:19:18] (03PS6) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1059126 [19:19:28] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (owner: 10CDanis) [19:23:55] (03PS11) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:27:26] (03CR) 10Scott French: "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:28:14] (03CR) 10Scott French: [C:03+2] mediawiki:: move cache_warmup from maintenance to tools [puppet] - 10https://gerrit.wikimedia.org/r/1059147 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:32:54] (03PS12) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:35:23] (03PS1) 10Dzahn: durum: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059152 [19:37:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission deploy1002 - https://phabricator.wikimedia.org/T371283#10036634 (10VRiley-WMF) 05Open→03Resolved [19:37:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission deploy1002 - https://phabricator.wikimedia.org/T371283#10036637 (10VRiley-WMF) This unit has been decommissioned [19:39:05] (03PS13) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:40:29] (03CR) 10Dzahn: [V:03+1] 
"https://puppet-compiler.wmflabs.org/output/1059152/3487/durum2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1059152 (owner: 10Dzahn) [19:41:13] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - ps1-b4-eqiad - https://phabricator.wikimedia.org/T371100#10036640 (10VRiley-WMF) 05Open→03Resolved [19:41:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - ps1-b4-eqiad - https://phabricator.wikimedia.org/T371100#10036639 (10VRiley-WMF) Rebalanced power. [19:47:07] (03PS14) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:51:55] (03PS1) 10JHathaway: postfix: bump postfix module version [puppet] - 10https://gerrit.wikimedia.org/r/1059154 (https://phabricator.wikimedia.org/T370011) [19:51:57] (03PS1) 10JHathaway: postfix: enable smtpd_forbid_bare_newline [puppet] - 10https://gerrit.wikimedia.org/r/1059155 (https://phabricator.wikimedia.org/T370011) [19:52:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059154 (https://phabricator.wikimedia.org/T370011) (owner: 10JHathaway) [19:52:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059155 (https://phabricator.wikimedia.org/T370011) (owner: 10JHathaway) [19:54:02] (03PS7) 10CDanis: Exclude some requests from concurrency tracking [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) [19:55:01] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) (owner: 10CDanis) [19:55:12] (03PS1) 10Dzahn: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - 10https://gerrit.wikimedia.org/r/1059156 [19:55:13] (03CR) 10CDanis: "As discussed" [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) (owner: 
10CDanis) [19:56:19] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [19:56:59] (03PS15) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [19:59:29] (03CR) 10JHathaway: [C:03+2] postfix: bump postfix module version [puppet] - 10https://gerrit.wikimedia.org/r/1059154 (https://phabricator.wikimedia.org/T370011) (owner: 10JHathaway) [20:00:05] thcipriani, RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T2000). [20:00:05] JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:34] o/ I can deploy [20:01:46] !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decomission of frdb2002, payments2001, and payments2002 - dwisehaupt@cumin1002" [20:01:51] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: decomission of frdb2002, payments2001, and payments2002 - dwisehaupt@cumin1002" [20:01:51] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:02:42] here [20:03:14] (03PS16) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [20:03:52] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3491/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059155 
(https://phabricator.wikimedia.org/T370011) (owner: 10JHathaway) [20:04:36] toyofuku: howdy! toyofuku is helping with backport today! [20:13:54] (03CR) 10JHathaway: [V:03+1 C:03+2] postfix: enable smtpd_forbid_bare_newline [puppet] - 10https://gerrit.wikimedia.org/r/1059155 (https://phabricator.wikimedia.org/T370011) (owner: 10JHathaway) [20:15:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy1003 using scap backport" [extensions/AutoModerator] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1059130 (https://phabricator.wikimedia.org/T371348) (owner: 10Jsn.sherman) [20:16:31] (03PS1) 10CDanis: WIP fixme [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059158 [20:17:25] (03CR) 10CI reject: [V:04-1] WIP fixme [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059158 (owner: 10CDanis) [20:18:32] (03Merged) 10jenkins-bot: revisionCheck: skip null wikiPages [extensions/AutoModerator] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1059130 (https://phabricator.wikimedia.org/T371348) (owner: 10Jsn.sherman) [20:18:43] !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1059130|revisionCheck: skip null wikiPages (T371348)]] [20:18:50] T371348: "Call to a member function getNamespace() on null" when importing - https://phabricator.wikimedia.org/T371348 [20:20:46] !log thcipriani@deploy1003 thcipriani, jsn: Backport for [[gerrit:1059130|revisionCheck: skip null wikiPages (T371348)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:42] (03PS17) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [20:23:30] !log thcipriani@deploy1003 thcipriani, jsn: Continuing with sync [20:28:03] !log thcipriani@deploy1003 Finished scap: Backport for [[gerrit:1059130|revisionCheck: skip null wikiPages (T371348)]] (duration: 09m 19s) [20:28:06] T371348: "Call to a member 
function getNamespace() on null" when importing - https://phabricator.wikimedia.org/T371348 [20:29:14] (03PS3) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [20:33:40] jouncebot: nowandnext [20:33:40] For the next 0 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240801T2000) [20:33:40] In 9 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240802T0600) [20:37:15] (03PS18) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [20:40:04] !log utc late window complete [20:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:30] (03PS19) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [20:45:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T371045#10036734 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This unit has been relabeled as requested. Thanks! 
[20:48:07] (03PS20) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [20:52:52] (03PS21) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [20:56:04] (03PS22) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:01:17] (03PS23) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:03:28] (03PS24) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:04:57] FIRING: Device rebooted: Alert for device ps1-e6-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [21:05:02] (03PS1) 10Dwisehaupt: decommission: frdb2002, payments2001, payments2002 [puppet] - 10https://gerrit.wikimedia.org/r/1059162 (https://phabricator.wikimedia.org/T371629) [21:07:42] (03PS25) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:09:20] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10036797 (10Papaul) @Jhancock.wm this is ready for decom [21:10:17] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10036802 (10Papaul) @Jhancock.wm this is 
ready for decom [21:11:37] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10036808 (10Papaul) @Jhancock.wm this is ready for decom [21:14:57] FIRING: [2x] Device rebooted: Device ps1-e6-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [21:15:02] (03PS1) 10Ahmon Dancy: Bump buildkitd to 0.15.1 [puppet] - 10https://gerrit.wikimedia.org/r/1059164 (https://phabricator.wikimedia.org/T371641) [21:19:57] RESOLVED: Device rebooted: Device ps1-e7-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [21:21:23] (03PS26) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:22:44] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3503/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: 10CDobbins) [21:24:28] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3504/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: 10CDobbins) [21:30:05] (03PS27) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:42:46] (03PS28) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:43:23] (03PS2) 10CDanis: WIP fixme [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1059158 [21:46:57] (03PS29) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:49:18] (03PS3) 10CDanis: jaeger: very basic archive traces support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059158 [21:50:02] (03PS4) 10CDanis: jaeger: very basic archive traces support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059158 (https://phabricator.wikimedia.org/T371390) [21:51:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T367856)', diff saved to https://phabricator.wikimedia.org/P67195 and previous config saved to /var/cache/conftool/dbconfig/20240801-215150-marostegui.json [21:51:53] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [21:52:01] (03PS30) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:55:17] (03PS4) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [21:55:26] (03PS31) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [21:55:54] (03CR) 10CI reject: [V:04-1] Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 (owner: 10Zabe) [21:56:47] (03PS5) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [21:56:58] (03CR) 10Dzahn: [C:03+2] Bump buildkitd to 0.15.1 [puppet] - 10https://gerrit.wikimedia.org/r/1059164 (https://phabricator.wikimedia.org/T371641) (owner: 10Ahmon Dancy) [21:57:24] (03CR) 10CI reject: [V:04-1] Automatically set db section to s5 
for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 (owner: 10Zabe) [21:59:07] (03PS32) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [22:01:39] (03PS6) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [22:02:17] (03CR) 10CI reject: [V:04-1] Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 (owner: 10Zabe) [22:02:34] (03PS7) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [22:02:57] (03PS33) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [22:03:14] (03CR) 10CI reject: [V:04-1] Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 (owner: 10Zabe) [22:05:24] (03PS8) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [22:06:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P67196 and previous config saved to /var/cache/conftool/dbconfig/20240801-220657-marostegui.json [22:09:45] FIRING: Device rebooted: Alert for device ps1-f5-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [22:14:17] (03PS9) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [22:15:46] (03Abandoned) 10Zabe: db-production: Generate sectionsByDB on the fly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe) [22:19:45] RESOLVED: Device rebooted: Device 
ps1-f5-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [22:22:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P67197 and previous config saved to /var/cache/conftool/dbconfig/20240801-222204-marostegui.json [22:22:27] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1057948/3493/miscweb2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1057948 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [22:26:03] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop, file renamed but no functional change" [puppet] - 10https://gerrit.wikimedia.org/r/1057948 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [22:27:11] (03CR) 10Dzahn: [C:03+2] icinga: Add frqueue2003 pay-lb2001 and pay-lb2002 [puppet] - 10https://gerrit.wikimedia.org/r/1058261 (https://phabricator.wikimedia.org/T369566) (owner: 10Dwisehaupt) [22:27:48] (03CR) 10Dzahn: [C:03+2] decommission: frdb2002, payments2001, payments2002 [puppet] - 10https://gerrit.wikimedia.org/r/1059162 (https://phabricator.wikimedia.org/T371629) (owner: 10Dwisehaupt) [22:35:49] (03CR) 10Dzahn: [C:03+2] "gone! checked icinga config for errors. all good" [puppet] - 10https://gerrit.wikimedia.org/r/1059162 (https://phabricator.wikimedia.org/T371629) (owner: 10Dwisehaupt) [22:36:56] (03CR) 10Dzahn: [C:03+2] "new hosts added! checked icinga config for errors. 
all good" [puppet] - 10https://gerrit.wikimedia.org/r/1058261 (https://phabricator.wikimedia.org/T369566) (owner: 10Dwisehaupt) [22:37:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T367856)', diff saved to https://phabricator.wikimedia.org/P67198 and previous config saved to /var/cache/conftool/dbconfig/20240801-223711-marostegui.json [22:37:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:37:15] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:37:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:40:12] (03PS1) 10Cwhite: logstash: split curator jobs into individual actions [puppet] - 10https://gerrit.wikimedia.org/r/1059171 (https://phabricator.wikimedia.org/T364190) [22:44:06] (03CR) 10Dwisehaupt: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1059162 (https://phabricator.wikimedia.org/T371629) (owner: 10Dwisehaupt) [22:44:14] (03CR) 10Dwisehaupt: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1058261 (https://phabricator.wikimedia.org/T369566) (owner: 10Dwisehaupt) [22:46:04] (03CR) 10Dzahn: [C:03+2] logspam-watch: Add version column, group errors [puppet] - 10https://gerrit.wikimedia.org/r/1058707 (https://phabricator.wikimedia.org/T371566) (owner: 10Brennen Bearnes) [22:50:50] (03PS3) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) [22:53:33] (03PS4) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) [23:02:54] (03PS5) 10Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) [23:27:59] (03CR) 10Zabe: [C:03+2] Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 (owner: 10Zabe) [23:28:42] (03Merged) 10jenkins-bot: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 (owner: 10Zabe) [23:28:57] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1057037|Automatically set db section to s5 for new wiki]] [23:31:02] !log zabe@deploy1003 zabe: Backport for [[gerrit:1057037|Automatically set db section to s5 for new wiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:31:47] !log zabe@deploy1003 zabe: Continuing with sync [23:34:38] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [23:36:18] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1057037|Automatically set db section to s5 for new wiki]] (duration: 07m 20s) [23:37:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:38:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 
10https://gerrit.wikimedia.org/r/1059174 [23:38:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1059174 (owner: 10TrainBranchBot) [23:51:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1250.eqiad.wmnet with OS bullseye [23:52:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bull... [23:52:11] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1251.eqiad.wmnet with OS bullseye [23:52:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1251.eqiad.wmnet with OS bull... [23:52:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1252.eqiad.wmnet with OS bullseye [23:52:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037215 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1252.eqiad.wmnet with OS bull... 
[23:52:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1253.eqiad.wmnet with OS bullseye [23:52:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1253.eqiad.wmnet with OS bull... [23:52:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1254.eqiad.wmnet with OS bullseye [23:52:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037217 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1254.eqiad.wmnet with OS bull... [23:53:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1257.eqiad.wmnet with OS bullseye [23:53:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037220 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1257.eqiad.wmnet with OS bull... 
[23:53:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1258.eqiad.wmnet with OS bullseye [23:53:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1259.eqiad.wmnet with OS bullseye [23:54:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1256.eqiad.wmnet with OS bullseye [23:55:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1255.eqiad.wmnet with OS bullseye [23:55:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1258.eqiad.wmnet with OS bull... [23:55:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037226 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1259.eqiad.wmnet with OS bull... [23:56:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bull... [23:56:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10037231 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1255.eqiad.wmnet with OS bull... [23:58:08] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650 (10SToyofuku-WMF) 03NEW