[00:03:36] SRE, Infrastructure-Foundations, netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10236283 (Papaul)
[00:04:25] RESOLVED: [5x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:03] SRE, Infrastructure-Foundations, netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10236284 (Papaul) Open→Resolved This is complete
[00:10:22] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1080835 (owner: TrainBranchBot)
[00:14:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P70203 and previous config saved to /var/cache/conftool/dbconfig/20241017-001457-ladsgroup.json
[00:25:49] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[00:26:30] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[00:30:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P70204 and previous config saved to /var/cache/conftool/dbconfig/20241017-003004-ladsgroup.json
[00:45:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T376905)', diff saved to https://phabricator.wikimedia.org/P70206 and previous config saved to /var/cache/conftool/dbconfig/20241017-004511-ladsgroup.json
[00:45:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[00:45:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[00:45:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T376905)', diff saved to https://phabricator.wikimedia.org/P70207 and previous config saved to /var/cache/conftool/dbconfig/20241017-004537-ladsgroup.json
[00:54:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T376905)', diff saved to https://phabricator.wikimedia.org/P70208 and previous config saved to /var/cache/conftool/dbconfig/20241017-005405-ladsgroup.json
[01:09:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P70209 and previous config saved to /var/cache/conftool/dbconfig/20241017-010912-ladsgroup.json
[01:24:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P70210 and previous config saved to /var/cache/conftool/dbconfig/20241017-012419-ladsgroup.json
[01:39:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T376905)', diff saved to https://phabricator.wikimedia.org/P70211 and previous config saved to /var/cache/conftool/dbconfig/20241017-013926-ladsgroup.json
[01:39:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[01:39:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[01:44:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[01:44:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[01:45:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T376905)', diff saved to https://phabricator.wikimedia.org/P70212 and previous config saved to /var/cache/conftool/dbconfig/20241017-014500-ladsgroup.json
[01:53:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T376905)', diff saved to https://phabricator.wikimedia.org/P70213 and previous config saved to /var/cache/conftool/dbconfig/20241017-015310-ladsgroup.json
[02:04:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10236353 (phaultfinder)
[02:06:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 818.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:08:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P70214 and previous config saved to /var/cache/conftool/dbconfig/20241017-020817-ladsgroup.json
[02:10:23] (CR) Tim Starling: [C:+2] Enable {{USERLANGUAGE}} on Commons and Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) (owner: Tim Starling)
[02:11:06] (Merged) jenkins-bot: Enable {{USERLANGUAGE}} on Commons and Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) (owner: Tim Starling)
[02:11:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 808.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:18:42] !log tstarling@deploy2002 Synchronized wmf-config/InitialiseSettings.php: T4085 Enable {{USERLANGUAGE}} on Commons and Meta (duration: 06m 34s)
[02:18:46] T4085: Add a {{USERLANGUAGE}} magic word - https://phabricator.wikimedia.org/T4085
[02:23:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P70215 and previous config saved to /var/cache/conftool/dbconfig/20241017-022324-ladsgroup.json
[02:37:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T376905)', diff saved to https://phabricator.wikimedia.org/P70216 and previous config saved to /var/cache/conftool/dbconfig/20241017-023831-ladsgroup.json
[02:38:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance
[02:38:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance
[02:38:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T376905)', diff saved to https://phabricator.wikimedia.org/P70217 and previous config saved to /var/cache/conftool/dbconfig/20241017-023857-ladsgroup.json
[02:45:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T376905)', diff saved to https://phabricator.wikimedia.org/P70218 and previous config saved to /var/cache/conftool/dbconfig/20241017-024557-ladsgroup.json
[03:01:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P70219 and previous config saved to /var/cache/conftool/dbconfig/20241017-030104-ladsgroup.json
[03:02:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:37] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10236405 (Papaul) @elukey we are having some issues re-image the first 3 servers on this task. on the first 2, the re-image was not the first time but p...
[03:16:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P70220 and previous config saved to /var/cache/conftool/dbconfig/20241017-031611-ladsgroup.json
[03:31:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T376905)', diff saved to https://phabricator.wikimedia.org/P70221 and previous config saved to /var/cache/conftool/dbconfig/20241017-033118-ladsgroup.json
[03:31:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: Maintenance
[03:31:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: Maintenance
[03:31:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T376905)', diff saved to https://phabricator.wikimedia.org/P70222 and previous config saved to /var/cache/conftool/dbconfig/20241017-033144-ladsgroup.json
[03:38:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T376905)', diff saved to https://phabricator.wikimedia.org/P70223 and previous config saved to /var/cache/conftool/dbconfig/20241017-033852-ladsgroup.json
[03:47:04] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10236448 (Papaul) @VRiley-WMF thank you for following up on this. It looks like the router is back running on re0 and disks are all there. We can close. @ayounsi any...
[03:54:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P70224 and previous config saved to /var/cache/conftool/dbconfig/20241017-035359-ladsgroup.json
[04:09:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P70225 and previous config saved to /var/cache/conftool/dbconfig/20241017-040906-ladsgroup.json
[04:24:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T376905)', diff saved to https://phabricator.wikimedia.org/P70226 and previous config saved to /var/cache/conftool/dbconfig/20241017-042413-ladsgroup.json
[04:24:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2222.codfw.wmnet with reason: Maintenance
[04:24:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2222.codfw.wmnet with reason: Maintenance
[04:24:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T376905)', diff saved to https://phabricator.wikimedia.org/P70227 and previous config saved to /var/cache/conftool/dbconfig/20241017-042440-ladsgroup.json
[04:31:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T376905)', diff saved to https://phabricator.wikimedia.org/P70228 and previous config saved to /var/cache/conftool/dbconfig/20241017-043139-ladsgroup.json
[04:46:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P70229 and previous config saved to /var/cache/conftool/dbconfig/20241017-044646-ladsgroup.json
[05:01:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P70230 and previous config saved to /var/cache/conftool/dbconfig/20241017-050153-ladsgroup.json
[05:04:12] (CR) JMeybohm: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1079572 (https://phabricator.wikimedia.org/T377040) (owner: Scott French)
[05:17:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T376905)', diff saved to https://phabricator.wikimedia.org/P70231 and previous config saved to /var/cache/conftool/dbconfig/20241017-051700-ladsgroup.json
[05:19:56] (PS1) KartikMistry: Enable Special:Contribute on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1080857
[05:47:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: T367781', diff saved to https://phabricator.wikimedia.org/P70233 and previous config saved to /var/cache/conftool/dbconfig/20241017-054722-arnaudb.json
[05:47:26] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T0600)
[06:00:05] marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T0600).
[06:02:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: T367781', diff saved to https://phabricator.wikimedia.org/P70234 and previous config saved to /var/cache/conftool/dbconfig/20241017-060227-arnaudb.json
[06:02:31] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2205.codfw.wmnet with OS bookworm
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:42] (PS1) Santhosh: cxserver: Add logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1080862
[06:17:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: T367781', diff saved to https://phabricator.wikimedia.org/P70235 and previous config saved to /var/cache/conftool/dbconfig/20241017-061732-arnaudb.json
[06:17:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:26:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2205.codfw.wmnet with reason: host reimage
[06:31:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: host reimage
[06:32:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: T367781', diff saved to https://phabricator.wikimedia.org/P70236 and previous config saved to /var/cache/conftool/dbconfig/20241017-063238-arnaudb.json
[06:32:42] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:40:12] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10236574 (ayounsi) ` re1.cr1-eqiad> show system alarms 1 alarms currently active Alarm time Class Description 2024-07-18 16:11:37 UTC Minor Backup...
[06:42:49] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10236576 (ayounsi)
[06:42:50] SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10236577 (ayounsi)
[06:53:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2205.codfw.wmnet with OS bookworm
[07:00:06] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T0700).
[07:00:06] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2149 to reclone on db2205 - T377276', diff saved to https://phabricator.wikimedia.org/P70237 and previous config saved to /var/cache/conftool/dbconfig/20241017-070015-arnaudb.json
[07:00:20] T377276: db2205 is stuck at "Shutdown in progress" - https://phabricator.wikimedia.org/T377276
[07:00:21] o/
[07:00:56] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2205.codfw.wmnet
[07:01:25] dcausse: hi, are you self deploying ?
[07:01:42] hashar: hello, yes
[07:01:46] +1 :)
[07:02:04] * hashar heads back to Java exceptions tracking
[07:04:06] :)
[07:05:42] (CR) KartikMistry: [C:+2] cxserver: Add logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1080862 (owner: Santhosh)
[07:07:06] (Merged) jenkins-bot: cxserver: Add logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1080862 (owner: Santhosh)
[07:07:06] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080332 (https://phabricator.wikimedia.org/T377226) (owner: DCausse)
[07:07:44] (Merged) jenkins-bot: cirrus: cleanup removed label_count field on next re-index [mediawiki-config] - https://gerrit.wikimedia.org/r/1080332 (https://phabricator.wikimedia.org/T377226) (owner: DCausse)
[07:08:47] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1080332|cirrus: cleanup removed label_count field on next re-index (T377226)]]
[07:08:51] T377226: Remove LabelCountField from WikibaseCirrusSearch - https://phabricator.wikimedia.org/T377226
[07:12:56] (PS1) KartikMistry: cxserver: Bump chart to 0.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080980
[07:13:23] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1080332|cirrus: cleanup removed label_count field on next re-index (T377226)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:13:35] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on kubestagemaster2005.codfw.wmnet with reason: reimage
[07:13:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubestagemaster2005.codfw.wmnet with reason: reimage
[07:14:44] !log dcausse@deploy2002 dcausse: Continuing with sync
[07:14:53] (CR) KartikMistry: [C:+2] cxserver: Bump chart to 0.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080980 (owner: KartikMistry)
[07:16:00] dcausse: let me know when deployment is done.
[07:16:06] (Merged) jenkins-bot: cxserver: Bump chart to 0.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080980 (owner: KartikMistry)
[07:16:19] kart_: sure
[07:18:39] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubestagemaster2005.codfw.wmnet
[07:19:28] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080332|cirrus: cleanup removed label_count field on next re-index (T377226)]] (duration: 10m 40s)
[07:19:31] T377226: Remove LabelCountField from WikibaseCirrusSearch - https://phabricator.wikimedia.org/T377226
[07:20:26] kart_: all done
[07:20:31] (PS3) Elukey: tox: align behavior with what we use in Spicerack [cookbooks] - https://gerrit.wikimedia.org/r/1080716
[07:20:59] (CR) Elukey: tox: align behavior with what we use in Spicerack (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1080716 (owner: Elukey)
[07:25:42] (CR) Brouberol: [C:+1] ATS: add mapping for airflow-analytics-test [puppet] - https://gerrit.wikimedia.org/r/1079361 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[07:27:09] (PS1) JMeybohm: reimage: Fix vlan check for VMs [cookbooks] - https://gerrit.wikimedia.org/r/1081060
[07:28:01] (PS1) Ayounsi: vlan migration report: add one example host per group [software/netbox-extras] - https://gerrit.wikimedia.org/r/1081061
[07:28:35] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS bookworm
[07:31:00] (CR) Ayounsi: "Tested on Netbox-next" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1081061 (owner: Ayounsi)
[07:31:16] (PS20) Arnaudb: mariadb: cookbook to safely upgrade and reboot santarium hosts [cookbooks] - https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665)
[07:33:18] (CR) Ayounsi: [C:+1] "Good catch, thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/1081060 (owner: JMeybohm)
[07:33:42] (CR) JMeybohm: [C:+2] reimage: Fix vlan check for VMs [cookbooks] - https://gerrit.wikimedia.org/r/1081060 (owner: JMeybohm)
[07:36:24] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:37:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:37:11] dcausse: thanks
[07:37:15] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:37:29] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:39:28] (Merged) jenkins-bot: reimage: Fix vlan check for VMs [cookbooks] - https://gerrit.wikimedia.org/r/1081060 (owner: JMeybohm)
[07:40:27] (PS4) Elukey: tox: align behavior with what we use in Spicerack [cookbooks] - https://gerrit.wikimedia.org/r/1080716
[07:41:46] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1080716 (owner: Elukey)
[07:44:29] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10236696 (elukey) >>! In T371400#10234173, @Jhancock.wm wrote: > when I tried to login to the BMC this morning, 2081 and 2082 were unreachable. connecte...
[07:47:05] SRE, Infrastructure-Foundations, netops: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#10236702 (ayounsi)
[07:47:13] (CR) Elukey: [C:+2] tox: align behavior with what we use in Spicerack [cookbooks] - https://gerrit.wikimedia.org/r/1080716 (owner: Elukey)
[07:48:08] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage
[07:51:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage
[07:54:18] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10236705 (elukey) For reimage the RemoteExecutionError is related to this: ` elukey@puppetserver1001:~$ puppet lookup --render-as s --compile --node ms...
[07:55:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:55:59] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:00:05] jeena and andre: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T0800)
[08:01:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1065.eqiad.wmnet
[08:01:33] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10236709 (elukey) To summarize: * 208[1,2] have surely a very old firmware for the BMC that can't work with the provision cookbook, so we'll have to wa...
[08:01:49] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdv) failed on ms-be1065 - https://phabricator.wikimedia.org/T376775#10236710 (ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: disks badly ordered
[08:09:16] (CR) Brouberol: [C:+2] Import ceph-csi-cephfs chart [deployment-charts] - https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:09:19] (CR) Brouberol: [C:+2] Make it possible to deploy provisioner without the snahshotter [deployment-charts] - https://gerrit.wikimedia.org/r/1077873 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:09:23] (CR) Brouberol: [C:+2] Run the driver-registrar as root [deployment-charts] - https://gerrit.wikimedia.org/r/1077874 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:09:27] (CR) Brouberol: [C:+2] Disable the priviledged security context of the liveness-prometheus container [deployment-charts] - https://gerrit.wikimedia.org/r/1077875 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:09:29] (CR) Brouberol: [C:+2] Make it possible to create several storage classes [deployment-charts] - https://gerrit.wikimedia.org/r/1078387 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:09:34] (CR) Brouberol: [C:+2] ceph-csi-cephfs: replace the ClusterRole by a list of ns-scoped Roles [deployment-charts] - https://gerrit.wikimedia.org/r/1080032 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:10:06] (Merged) jenkins-bot: Import ceph-csi-cephfs chart [deployment-charts] - https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:10:07] (Merged) jenkins-bot: Make it possible to deploy provisioner without the snahshotter [deployment-charts] - https://gerrit.wikimedia.org/r/1077873 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:10:14] (Merged) jenkins-bot: Run the driver-registrar as root [deployment-charts] - https://gerrit.wikimedia.org/r/1077874 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:10:18] (Merged) jenkins-bot: Disable the priviledged security context of the liveness-prometheus container [deployment-charts] - https://gerrit.wikimedia.org/r/1077875 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:10:23] (Merged) jenkins-bot: Make it possible to create several storage classes [deployment-charts] - https://gerrit.wikimedia.org/r/1078387 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:10:24] (Merged) jenkins-bot: ceph-csi-cephfs: replace the ClusterRole by a list of ns-scoped Roles [deployment-charts] - https://gerrit.wikimedia.org/r/1080032 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:11:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1065.eqiad.wmnet
[08:13:43] (PS1) Arthur taylor: Restore support for Dark Mode on Wikibase pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1081067 (https://phabricator.wikimedia.org/T369385)
[08:15:21] (PS2) Ayounsi: vlan migration report: add one example host per group [software/netbox-extras] - https://gerrit.wikimedia.org/r/1081061
[08:15:25] (PS4) Vgutierrez: liberica: provide a liberica module [puppet] - https://gerrit.wikimedia.org/r/1080708 (https://phabricator.wikimedia.org/T377127)
[08:16:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2149.codfw.wmnet onto db2205.codfw.wmnet
[08:17:48] (PS6) Tiziano Fogli: logstash: parse new containerd log format [puppet] - https://gerrit.wikimedia.org/r/1080603 (https://phabricator.wikimedia.org/T377132)
[08:17:48] (CR) Tiziano Fogli: "@cwhite@wikimedia.org Thank you for all the suggestions!" [puppet] - https://gerrit.wikimedia.org/r/1080603 (https://phabricator.wikimedia.org/T377132) (owner: Tiziano Fogli)
[08:18:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: post clone', diff saved to https://phabricator.wikimedia.org/P70238 and previous config saved to /var/cache/conftool/dbconfig/20241017-081802-arnaudb.json
[08:18:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2149 to reclone on db2205 - T377276', diff saved to https://phabricator.wikimedia.org/P70239 and previous config saved to /var/cache/conftool/dbconfig/20241017-081822-arnaudb.json
[08:18:26] T377276: db2205 is stuck at "Shutdown in progress" - https://phabricator.wikimedia.org/T377276
[08:22:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: post clone', diff saved to https://phabricator.wikimedia.org/P70240 and previous config saved to /var/cache/conftool/dbconfig/20241017-082215-arnaudb.json
[08:36:04] (CR) Btullis: Define the ceph-csi-cephfs admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:37:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: post clone', diff saved to https://phabricator.wikimedia.org/P70241 and previous config saved to /var/cache/conftool/dbconfig/20241017-083721-arnaudb.json
[08:37:31] (PS11) Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406)
[08:44:23] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10236796 (MatthewVernon) @elukey thanks for looking at this, and for bringing my attention to T371416 (of which I was previously blissfully unaware). Th...
[08:44:26] (PS1) Ayounsi: Netbox: run the vlan_migration report every 2 hours [puppet] - https://gerrit.wikimedia.org/r/1081071 (https://phabricator.wikimedia.org/T350152)
[08:44:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10236800 (phaultfinder)
[08:45:20] (CR) Elukey: [C:-1] "Need to work a little bit more on it, I tried to add more tests and sorted() is not as smart as I thought :(" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345 (owner: Elukey)
[08:45:53] (CR) Brouberol: Define the ceph-csi-cephfs admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:47:47] (CR) Btullis: [C:+1] "Looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:48:00] (CR) Brouberol: [C:+2] Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) (owner: Brouberol)
[08:50:05] (CR) Elukey: [C:-1] "For example, see this:" [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345 (owner: Elukey)
[08:52:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: post clone', diff saved to https://phabricator.wikimedia.org/P70242 and previous config saved to /var/cache/conftool/dbconfig/20241017-085226-arnaudb.json
[09:02:49] (PS1) Giuseppe Lavagetto: Release MR 9: allow defining read-only users [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1081075
[09:03:33] SRE, SRE-swift-storage, Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#10236844 (ayounsi) Now this applies to rows C and D as well as the switches got upgraded there as well. It makes sens to ignore all the 2019 hos...
[09:05:56] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Release MR 9: allow defining read-only users [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1081075 (owner: 10Giuseppe Lavagetto) [09:07:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: post clone', diff saved to https://phabricator.wikimedia.org/P70243 and previous config saved to /var/cache/conftool/dbconfig/20241017-090731-arnaudb.json [09:08:47] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add support for read-only users - oblivian@cumin1002" [09:08:49] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add support for read-only users - oblivian@cumin1002 [09:09:08] (03PS1) 10Brouberol: deployment_server: update the puppet role used by mailout servers [puppet] - 10https://gerrit.wikimedia.org/r/1081077 [09:09:24] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add support for read-only users - oblivian@cumin1002 [09:09:25] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add support for read-only users - oblivian@cumin1002" [09:09:34] (03PS2) 10Brouberol: deployment_server: update the puppet role used by mailout servers [puppet] - 10https://gerrit.wikimedia.org/r/1081077 [09:11:53] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4314/co" [puppet] - 10https://gerrit.wikimedia.org/r/1081077 (owner: 10Brouberol) [09:13:05] (03PS1) 10Brouberol: admin_ng: add the ceph-csi-cephfs helmfile to the list of imported helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081079 
(https://phabricator.wikimedia.org/T376406) [09:13:05] (03CR) 10Btullis: [C:03+2] deployment_server: update the puppet role used by mailout servers [puppet] - 10https://gerrit.wikimedia.org/r/1081077 (owner: 10Brouberol) [09:15:45] (03PS1) 10Brouberol: airflow: reflect recent changes in MX server hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081082 (https://phabricator.wikimedia.org/T362788) [09:16:58] (03CR) 10Btullis: [C:03+1] airflow: reflect recent changes in MX server hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081082 (https://phabricator.wikimedia.org/T362788) (owner: 10Brouberol) [09:18:09] (03CR) 10Brouberol: [C:03+2] airflow: reflect recent changes in MX server hostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081082 (https://phabricator.wikimedia.org/T362788) (owner: 10Brouberol) [09:19:25] (03CR) 10Btullis: [C:03+1] admin_ng: add the ceph-csi-cephfs helmfile to the list of imported helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081079 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [09:19:28] (03CR) 10Brouberol: [C:03+2] admin_ng: add the ceph-csi-cephfs helmfile to the list of imported helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081079 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [09:21:13] (03PS1) 10Giuseppe Lavagetto: requestctl: allow access to the web UI to wmf group members [puppet] - 10https://gerrit.wikimedia.org/r/1081084 [09:21:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:22:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:28:24] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: mariadb: systemctl status accessor in mysql_legacy - https://phabricator.wikimedia.org/T377129#10236907 (10ABran-WMF) [09:28:32] 10SRE-tools, 
06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack: mysql_legacy data_directory getter - https://phabricator.wikimedia.org/T376701#10236909 (10ABran-WMF) [09:28:45] (03PS1) 10Brouberol: ceph-csi-cephfs: avoid name collison with ceph-csi-rbd configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081088 (https://phabricator.wikimedia.org/T376406) [09:31:02] (03PS1) 10JMeybohm: etcd::v3: Don't set trusted-ca-file if client-cert-auth is false [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) [09:31:40] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) (owner: 10JMeybohm) [09:34:32] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab2002.codfw.wmnet with OS bullseye [09:34:39] 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#10236937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host phab2002.codfw.wmnet with OS bullseye executed... [09:38:00] (03CR) 10Btullis: [C:03+1] ceph-csi-cephfs: avoid name collison with ceph-csi-rbd configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081088 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [09:38:19] (03CR) 10Brouberol: [C:03+2] ceph-csi-cephfs: avoid name collison with ceph-csi-rbd configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081088 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [09:39:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:45:40] (03PS35) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [09:45:40] (03CR) 10Arnaudb: "This will depend on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1078658 and https://gerrit.wikimedia.org/r/c/operatio" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [09:47:05] (03CR) 10Volans: "you can specify that in the commit message and Gerrit will enforce it. For the spicerack patches use Depends-On: $CHANGE_ID, for the cookb" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [09:47:15] (03CR) 10Arnaudb: "(to be able to be tested, I realized my previous phrase was lacking its second part)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [09:47:47] (03CR) 10Arnaudb: "Ah, will do thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [09:52:49] FIRING: HelmReleaseBadStatus: Helm release kube-system/ceph-csi-cephfs on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:53:48] (03PS18) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [09:54:39] (03CR) 10CI reject: [V:04-1] miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:54:54] 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#10236988 (10cmooney) [09:59:26] (03PS5) 10STran: Apply wmf-specific protected vars rights access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1000) [10:01:42] (03PS19) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [10:02:15] (03CR) 10STran: "Thanks for the explainers everyone 🙇" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [10:02:28] (03CR) 10CI reject: [V:04-1] miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:03:30] 
(03CR) 10Vgutierrez: [C:03+1] P:trafficserver: extend x-wikimedia-debug-routing for mwdebug-next (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072638 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [10:05:53] (03CR) 10Dreamy Jazz: [C:03+1] Apply wmf-specific protected vars rights access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [10:08:09] (03PS2) 10JMeybohm: etcd::v3: Add an etcd_version fact [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) [10:08:09] (03PS3) 10JMeybohm: etcd::v3: Don't set trusted-ca-file if client-cert-auth is false [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T377132) (owner: 10Mxmxchere) [10:09:05] (03CR) 10CI reject: [V:04-1] etcd::v3: Add an etcd_version fact [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) (owner: 10JMeybohm) [10:11:53] (03PS3) 10JMeybohm: etcd::v3: Add an etcd_version fact [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) [10:11:53] (03PS4) 10JMeybohm: etcd::v3: Don't set trusted-ca-file if client-cert-auth is false [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T377132) (owner: 10Mxmxchere) [10:14:54] FIRING: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:16:28] FIRING: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:39] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubestagemaster2005.codfw.wmnet with reason: reimage [10:17:48] FIRING: PuppetFailure: Puppet has failed on kubestagemaster2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:17:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubestagemaster2005.codfw.wmnet with reason: reimage [10:26:53] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T374664) [10:31:48] (03CR) 10Giuseppe Lavagetto: [C:03+1] etcd::v3: Add an etcd_version fact [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) (owner: 10JMeybohm) [10:36:59] (03CR) 10Michael Große: [C:03+1] tests: ensure maintenance base class has always been requierd [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080828 (https://phabricator.wikimedia.org/T377391) (owner: 10C. 
Scott Ananian) [10:41:21] (03CR) 10CDanis: [C:03+1] requestctl: allow access to the web UI to wmf group members [puppet] - 10https://gerrit.wikimedia.org/r/1081084 (owner: 10Giuseppe Lavagetto) [10:51:21] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl: allow access to the web UI to wmf group members [puppet] - 10https://gerrit.wikimedia.org/r/1081084 (owner: 10Giuseppe Lavagetto) [10:55:13] (03PS1) 10Jcrespo: mariadb: Default pt-heartbeat STATEMENT-based replication [puppet] - 10https://gerrit.wikimedia.org/r/1081103 (https://phabricator.wikimedia.org/T375144) [10:57:38] (03PS2) 10Jcrespo: mariadb: Default pt-heartbeat STATEMENT-based replication [puppet] - 10https://gerrit.wikimedia.org/r/1081103 (https://phabricator.wikimedia.org/T375144) [10:57:47] FIRING: ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:58:02] (03PS3) 10Jcrespo: mariadb: Default pt-heartbeat to STATEMENT-based replication [puppet] - 10https://gerrit.wikimedia.org/r/1081103 (https://phabricator.wikimedia.org/T375144) [10:58:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:58:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [10:58:56] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) [10:59:11] (03PS2) 10Sergio Gimeno: [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) 
[10:59:31] (03CR) 10Sergio Gimeno: [C:04-1] "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) (owner: 10Sergio Gimeno) [11:00:03] (03CR) 10Giuseppe Lavagetto: [C:03+1] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T377132) (owner: 10Mxmxchere) [11:01:28] RESOLVED: ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:05:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:05:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T376905)', diff saved to https://phabricator.wikimedia.org/P70245 and previous config saved to /var/cache/conftool/dbconfig/20241017-110527-ladsgroup.json [11:15:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T376905)', diff saved to https://phabricator.wikimedia.org/P70246 and previous config saved to /var/cache/conftool/dbconfig/20241017-111507-ladsgroup.json [11:23:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [11:29:41] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1177.eqiad.wmnet [11:30:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P70247 and previous config saved to /var/cache/conftool/dbconfig/20241017-113014-ladsgroup.json [11:30:42] (03PS1) 10Btullis: cephfs: Run the csi-cephfsplugin as uid 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081110 (https://phabricator.wikimedia.org/T376401) [11:31:32] (03CR) 10Kosta Harlan: [C:04-1] "Per task discussion, maybe not needed after all." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080621 (https://phabricator.wikimedia.org/T326940) (owner: 10Kosta Harlan) [11:34:34] (03PS1) 10Btullis: cephfs: bump the image of the ceph csi plugin image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081111 (https://phabricator.wikimedia.org/T376408) [11:36:36] (03CR) 10JMeybohm: [C:03+2] etcd::v3: Add an etcd_version fact [puppet] - 10https://gerrit.wikimedia.org/r/1081089 (https://phabricator.wikimedia.org/T377132) (owner: 10JMeybohm) [11:36:41] (03PS1) 10Btullis: ceph-rbd: Bump the ceph-csi plugin image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081112 (https://phabricator.wikimedia.org/T376401) [11:39:17] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1177.eqiad.wmnet [11:41:08] (03CR) 10Btullis: [C:03+2] cephfs: Run the csi-cephfsplugin as uid 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081110 (https://phabricator.wikimedia.org/T376401) (owner: 10Btullis) [11:41:27] (03CR) 10Btullis: [C:03+2] cephfs: bump the image of the ceph csi plugin image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081111 (https://phabricator.wikimedia.org/T376408) (owner: 10Btullis) [11:44:10] (03Merged) 10jenkins-bot: cephfs: Run the 
csi-cephfsplugin as uid 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081110 (https://phabricator.wikimedia.org/T376401) (owner: 10Btullis) [11:44:53] (03Merged) 10jenkins-bot: cephfs: bump the image of the ceph csi plugin image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081111 (https://phabricator.wikimedia.org/T376408) (owner: 10Btullis) [11:45:19] (03CR) 10Alexandros Kosiaris: [C:04-1] "Thanks for this! Various inline comments and proposals" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 (owner: 10Ayounsi) [11:45:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P70248 and previous config saved to /var/cache/conftool/dbconfig/20241017-114522-ladsgroup.json [11:50:36] (03PS1) 10Btullis: Bump the cephfs chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081115 (https://phabricator.wikimedia.org/T376408) [11:51:01] (03CR) 10Btullis: [C:03+2] Bump the cephfs chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081115 (https://phabricator.wikimedia.org/T376408) (owner: 10Btullis) [11:55:23] (03Merged) 10jenkins-bot: Bump the cephfs chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081115 (https://phabricator.wikimedia.org/T376408) (owner: 10Btullis) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1200) [12:00:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T376905)', diff saved to https://phabricator.wikimedia.org/P70249 and previous config saved to /var/cache/conftool/dbconfig/20241017-120029-ladsgroup.json [12:00:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [12:00:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2155.codfw.wmnet with reason: Maintenance [12:00:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:00:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:00:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T376905)', diff saved to https://phabricator.wikimedia.org/P70250 and previous config saved to /var/cache/conftool/dbconfig/20241017-120049-ladsgroup.json [12:01:39] (03PS5) 10JMeybohm: etcd::v3: Don't set trusted-ca-file if client-cert-auth is false [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T362408) (owner: 10Mxmxchere) [12:07:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:07:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:10:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:10:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:11:28] (03PS27) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [12:11:28] (03CR) 10Jelto: "I added the suggestions in the latest patch sets except the additional name for the config maps. I think it's fine to continue with the au" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:12:13] (03PS20) 10Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) [12:12:33] (03CR) 10Ayounsi: "Thanks! Much simpler. Just one comment." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 (owner: 10Ayounsi) [12:12:52] (03PS3) 10Ayounsi: vlan migration report: add one example host per group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 [12:15:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:15:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:15:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T376905)', diff saved to https://phabricator.wikimedia.org/P70251 and previous config saved to /var/cache/conftool/dbconfig/20241017-121525-ladsgroup.json [12:16:14] (03PS1) 10Kosta Harlan: QuickSurveys: Update safety survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081124 (https://phabricator.wikimedia.org/T376517) [12:16:21] (03PS21) 10Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) [12:19:12] (03PS4) 10Ayounsi: vlan migration report: add one example host per group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 [12:20:07] (03CR) 10Ayounsi: vlan migration report: add one example host per group (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 (owner: 10Ayounsi) [12:21:02] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081128 [12:21:03] (03CR) 10CI reject: [V:04-1] vlan migration report: add one example host per group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 (owner: 10Ayounsi) [12:22:34] (03PS5) 10Ayounsi: vlan migration report: add one example host per group [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1081061 [12:24:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1157 (T376905)', diff saved to https://phabricator.wikimedia.org/P70252 and previous config saved to /var/cache/conftool/dbconfig/20241017-122425-ladsgroup.json [12:26:06] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T362408) (owner: 10Mxmxchere) [12:30:59] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080250 (https://phabricator.wikimedia.org/T369610) (owner: 10STran) [12:31:30] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:29] (03PS6) 10JMeybohm: etcd::v3: Don't set trusted-ca-file if client-cert-auth is false [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T362408) (owner: 10Mxmxchere) [12:36:30] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:36] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T362408) (owner: 10Mxmxchere) [12:38:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081124 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [12:38:59] (03PS1) 10Urbanecm: cswikivoyage: Set category collation to uca-cs-u-kn 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081134 (https://phabricator.wikimedia.org/T377446) [12:39:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P70253 and previous config saved to /var/cache/conftool/dbconfig/20241017-123932-ladsgroup.json [12:39:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081134 (https://phabricator.wikimedia.org/T377446) (owner: 10Urbanecm) [12:45:34] (03CR) 10JMeybohm: [C:03+2] "Last PS only uptated ordering and whitespace chomping to not produce diffs on hosts without change." [puppet] - 10https://gerrit.wikimedia.org/r/992629 (https://phabricator.wikimedia.org/T362408) (owner: 10Mxmxchere) [12:52:29] 14SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 06DBA, 10MediaWiki-libs-Rdbms, 07Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255#10237601 (10Reedy) [12:52:30] (03PS29) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [12:52:31] (03CR) 10Jelto: "patchset 29 uses a separate key for the config map name now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:52:38] (03PS22) 10Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) [12:54:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P70254 and previous config saved to /var/cache/conftool/dbconfig/20241017-125440-ladsgroup.json [12:54:42] !log arnaudb@cumin1002 
START - Cookbook sre.hosts.downtime for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:54:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:57:33] (03PS30) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [12:58:17] (03CR) 10CI reject: [V:04-1] miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:58:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:58:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:59:05] (03PS31) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [12:59:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:59:48] (03CR) 10CI reject: [V:04-1] miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1300) [13:00:05] MatmaRex, cscott, mszabo, and urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] hello! 
[13:00:20] i can deploy today [13:00:29] (03PS32) 10Jelto: miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [13:00:30] (03CR) 10Urbanecm: [C:03+2] tests: ensure maintenance base class has always been requierd [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080828 (https://phabricator.wikimedia.org/T377391) (owner: 10C. Scott Ananian) [13:00:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2005.codfw.wmnet with OS bookworm [13:01:00] mszabo: cscott: around? [13:01:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T376905)', diff saved to https://phabricator.wikimedia.org/P70255 and previous config saved to /var/cache/conftool/dbconfig/20241017-130115-ladsgroup.json [13:01:20] (03CR) 10Urbanecm: [C:03+2] cswikivoyage: Set category collation to uca-cs-u-kn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081134 (https://phabricator.wikimedia.org/T377446) (owner: 10Urbanecm) [13:02:10] (03Merged) 10jenkins-bot: cswikivoyage: Set category collation to uca-cs-u-kn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081134 (https://phabricator.wikimedia.org/T377446) (owner: 10Urbanecm) [13:02:25] I'm here [13:02:28] (03PS23) 10Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) [13:02:29] welcome! [13:02:39] (03CR) 10Urbanecm: [C:03+2] Bump wikimedia/parsoid to 0.20.0-a26 [vendor] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080770 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:02:41] (03CR) 10Urbanecm: [C:03+2] Bump wikimedia/parsoid to 0.20.0-a26 [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080773 (https://phabricator.wikimedia.org/T377287) (owner: 10C. 
Scott Ananian) [13:02:49] RESOLVED: HelmReleaseBadStatus: Helm release kube-system/ceph-csi-cephfs on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:02:54] urbanecm: yup, I'm around -- the change should be ready to go, it's only increasing the survey percentage so no verification is needed [13:02:54] My three patches can be synced at once (in fact probably have to be) [13:03:30] (03CR) 10JMeybohm: [C:03+1] miscweb: add support to mount add confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:03:41] The community config patch is needed to make ci for the vendor patch pass which is needed before the package dep in mediawiki-core can be bumped [13:04:06] cscott: do you mind clarifying? 
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1080773 looks like a no-op as far as our prod is concern, but i might be missing something [13:04:48] The first and last patches are more about making ci happy for future patches yes [13:04:50] (03PS24) 10Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) [13:05:01] Only the vendor patch actually affects runtime [13:05:17] gotcha [13:06:02] (03PS2) 10Kosta Harlan: QuickSurveys: Update safety survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081124 (https://phabricator.wikimedia.org/T376517) [13:06:04] (03CR) 10Urbanecm: [C:03+2] QuickSurveys: Update safety survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081124 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:06:21] So what I was trying to say at the start was it doesn't make sense to scap the patches separately [13:06:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081124 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:06:46] You can do them all at once [13:06:52] (03Merged) 10jenkins-bot: QuickSurveys: Update safety survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081124 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:07:18] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1081134|cswikivoyage: Set category collation to uca-cs-u-kn (T377446)]], [[gerrit:1081124|QuickSurveys: Update safety survey coverage (T376517)]] [13:07:23] T377446: Czech Wikivoyage: Update category collation setting - https://phabricator.wikimedia.org/T377446 [13:07:23] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517 [13:07:51] cscott: i understand that, i was confused at the 
"probably have to be" [scaped at once] part. as long as only vendor affects runtime, then it shouldn't matter whether they hit the pods at once or not. in either case: planning to sync them at once anyway. [13:07:59] * Lucas_WMDE also around btw [13:08:03] (03CR) 10JMeybohm: [C:03+1] "🚢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:08:12] hi Lucas_WMDE [13:08:25] urbanecm: yeah sorry I said that wrong [13:08:52] hi, sorry i'm late [13:08:57] ack cscott, thanks for the clarification. [13:08:58] hi MatmaRex! [13:09:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:09:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T376905)', diff saved to https://phabricator.wikimedia.org/P70256 and previous config saved to /var/cache/conftool/dbconfig/20241017-130947-ladsgroup.json [13:09:48] !log urbanecm@deploy2002 kharlan, urbanecm: Backport for [[gerrit:1081134|cswikivoyage: Set category collation to uca-cs-u-kn (T377446)]], [[gerrit:1081124|QuickSurveys: Update safety survey coverage (T376517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:10:01] !log urbanecm@deploy2002 kharlan, urbanecm: Continuing with sync [13:10:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:10:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T376905)', diff saved to https://phabricator.wikimedia.org/P70257 and previous config saved to /var/cache/conftool/dbconfig/20241017-131012-ladsgroup.json [13:10:34] (03PS1) 10STran: Implement redirects to meta's Special:GlobalContributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081138 
(https://phabricator.wikimedia.org/T376612) [13:14:20] (03PS1) 10Btullis: cephfs: Add the /etc/ceph volummount to the liveness container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081142 (https://phabricator.wikimedia.org/T376408) [13:14:41] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081134|cswikivoyage: Set category collation to uca-cs-u-kn (T377446)]], [[gerrit:1081124|QuickSurveys: Update safety survey coverage (T376517)]] (duration: 07m 23s) [13:14:46] T377446: Czech Wikivoyage: Update category collation setting - https://phabricator.wikimedia.org/T377446 [13:14:47] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517 [13:15:31] (03PS2) 10Bartosz Dziewoński: Set $wgAllowRawHtmlCopyrightMessages = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080805 (https://phabricator.wikimedia.org/T375789) [13:15:34] (03CR) 10Urbanecm: [C:03+2] Set $wgAllowRawHtmlCopyrightMessages = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080805 (https://phabricator.wikimedia.org/T375789) (owner: 10Bartosz Dziewoński) [13:16:17] (03CR) 10Btullis: [C:03+2] cephfs: Add the /etc/ceph volummount to the liveness container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081142 (https://phabricator.wikimedia.org/T376408) (owner: 10Btullis) [13:16:21] (03Merged) 10jenkins-bot: Set $wgAllowRawHtmlCopyrightMessages = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080805 (https://phabricator.wikimedia.org/T375789) (owner: 10Bartosz Dziewoński) [13:16:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70258 and previous config saved to /var/cache/conftool/dbconfig/20241017-131622-ladsgroup.json [13:18:41] !log bking@wdqs1015 depooling to catch up on lag [13:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:27] (03CR) 10Arnaudb: 
[C:03+1] "the cautionary comment is a very good idea" [puppet] - 10https://gerrit.wikimedia.org/r/1081103 (https://phabricator.wikimedia.org/T375144) (owner: 10Jcrespo) [13:19:28] (03Merged) 10jenkins-bot: cephfs: Add the /etc/ceph volummount to the liveness container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081142 (https://phabricator.wikimedia.org/T376408) (owner: 10Btullis) [13:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10237737 (10phaultfinder) [13:21:35] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:22:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:22:49] (03Merged) 10jenkins-bot: tests: ensure maintenance base class has always been requierd [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080828 (https://phabricator.wikimedia.org/T377391) (owner: 10C. Scott Ananian) [13:22:51] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.20.0-a26 [vendor] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080770 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:22:53] (03CR) 10CI reject: [V:04-1] Bump wikimedia/parsoid to 0.20.0-a26 [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080773 (https://phabricator.wikimedia.org/T377287) (owner: 10C. 
Scott Ananian) [13:23:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:23:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:24:01] those verification failures look to be transient selenium timeout nonsense? [13:24:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2204.codfw.wmnet with reason: Maintenance [13:24:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2204.codfw.wmnet with reason: Maintenance [13:24:18] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10237756 (10jcrespo) The (potential) change that caused it was: https://gerrit.wik... [13:26:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [13:26:06] cscott: yep [13:26:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [13:26:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [13:26:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [13:26:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70260 and previous config saved to /var/cache/conftool/dbconfig/20241017-132617-arnaudb.json [13:26:25] (03CR) 10Urbanecm: Bump wikimedia/parsoid to 0.20.0-a26 [vendor] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080770 (https://phabricator.wikimedia.org/T377287) (owner: 10C. 
Scott Ananian) [13:26:28] (03CR) 10Urbanecm: [C:03+2] Bump wikimedia/parsoid to 0.20.0-a26 [vendor] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080770 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:26:30] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:26:34] (03CR) 10Urbanecm: Bump wikimedia/parsoid to 0.20.0-a26 [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080773 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:26:37] (03CR) 10Urbanecm: [C:03+2] Bump wikimedia/parsoid to 0.20.0-a26 [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080773 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:27:14] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1080805|Set $wgAllowRawHtmlCopyrightMessages = false (T375789)]], [[gerrit:1080828|tests: ensure maintenance base class has always been requierd (T377391 T357535)]] [13:27:21] T375789: Replace on-wiki raw HTML overrides for "MediaWiki:Copyright" etc. 
and set $wgAllowRawHtmlCopyrightMessages = false - https://phabricator.wikimedia.org/T375789 [13:27:21] T377391: CommunityConfiguration extension breaking CI for all patches to mediawiki-vendor train branch - https://phabricator.wikimedia.org/T377391 [13:27:22] T357535: Community configuration: Create a maintenance script for configuration changes - https://phabricator.wikimedia.org/T357535 [13:29:00] !log [urbanecm@mwmaint2002 ~]$ mwscript updateCollation.php --wiki=cswikivoyage --previous-collation=uppercase # T377446 [13:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:05] T377446: Czech Wikivoyage: Update category collation setting - https://phabricator.wikimedia.org/T377446 [13:29:28] !log urbanecm@deploy2002 cscott, urbanecm, matmarex: Backport for [[gerrit:1080805|Set $wgAllowRawHtmlCopyrightMessages = false (T375789)]], [[gerrit:1080828|tests: ensure maintenance base class has always been requierd (T377391 T357535)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:31] my change looks good [13:30:39] thanks! was just about to ask :) [13:30:43] !log urbanecm@deploy2002 cscott, urbanecm, matmarex: Continuing with sync [13:31:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70261 and previous config saved to /var/cache/conftool/dbconfig/20241017-133129-ladsgroup.json [13:34:53] (03PS1) 10Bking: airflow-analytics-test: correct oidc mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081152 (https://phabricator.wikimedia.org/T374948) [13:35:22] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080805|Set $wgAllowRawHtmlCopyrightMessages = false (T375789)]], [[gerrit:1080828|tests: ensure maintenance base class has always been requierd (T377391 T357535)]] (duration: 08m 07s) [13:35:28] T375789: Replace on-wiki raw HTML overrides for "MediaWiki:Copyright" etc. 
and set $wgAllowRawHtmlCopyrightMessages = false - https://phabricator.wikimedia.org/T375789 [13:35:28] T377391: CommunityConfiguration extension breaking CI for all patches to mediawiki-vendor train branch - https://phabricator.wikimedia.org/T377391 [13:35:28] T357535: Community configuration: Create a maintenance script for configuration changes - https://phabricator.wikimedia.org/T357535 [13:35:35] synced [13:36:11] (03CR) 10Stevemunene: [C:03+1] airflow-analytics-test: correct oidc mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081152 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:36:22] (03CR) 10Bking: [C:03+2] airflow-analytics-test: correct oidc mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081152 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:37:20] (03Merged) 10jenkins-bot: airflow-analytics-test: correct oidc mapping [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081152 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:40:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [vendor] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080770 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:40:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080773 (https://phabricator.wikimedia.org/T377287) (owner: 10C. 
Scott Ananian) [13:45:08] (03PS1) 10Cwhite: logstash: force log field to string [puppet] - 10https://gerrit.wikimedia.org/r/1081155 (https://phabricator.wikimedia.org/T377422) [13:45:09] (03PS1) 10Cwhite: logstash: restore partition name back to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1081156 (https://phabricator.wikimedia.org/T377422) [13:45:55] 06SRE, 10Maps, 06Traffic-Icebox: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#10237879 (10akosiaris) 05Open→03Invalid Tilerator exists no more in the WMF environment. I 'll close this av `invalid`, feel free to reopen. [13:45:59] (03PS1) 10Btullis: Remove unused airflow kubernetes user credentials [puppet] - 10https://gerrit.wikimedia.org/r/1081157 (https://phabricator.wikimedia.org/T374948) [13:46:02] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.20.0-a26 [vendor] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080770 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:46:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T376905)', diff saved to https://phabricator.wikimedia.org/P70263 and previous config saved to /var/cache/conftool/dbconfig/20241017-134636-ladsgroup.json [13:46:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [13:46:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [13:46:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70264 and previous config saved to /var/cache/conftool/dbconfig/20241017-134651-ladsgroup.json [13:47:34] (03CR) 10CI reject: [V:04-1] logstash: force log field to string [puppet] - 10https://gerrit.wikimedia.org/r/1081155 (https://phabricator.wikimedia.org/T377422) (owner: 10Cwhite) [13:48:18] (03CR) 
10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4315/co" [puppet] - 10https://gerrit.wikimedia.org/r/1081157 (https://phabricator.wikimedia.org/T374948) (owner: 10Btullis) [13:49:05] (03PS1) 10Bking: admin_ng (dse-k8s): remove unused namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081161 (https://phabricator.wikimedia.org/T374948) [13:49:26] (03CR) 10Bking: [C:03+2] Remove unused airflow kubernetes user credentials [puppet] - 10https://gerrit.wikimedia.org/r/1081157 (https://phabricator.wikimedia.org/T374948) (owner: 10Btullis) [13:49:34] (03CR) 10Stevemunene: [C:03+1] Remove unused airflow kubernetes user credentials [puppet] - 10https://gerrit.wikimedia.org/r/1081157 (https://phabricator.wikimedia.org/T374948) (owner: 10Btullis) [13:50:07] (03CR) 10Btullis: [C:03+2] admin_ng (dse-k8s): remove unused namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081161 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:50:11] (03CR) 10Stevemunene: [C:03+1] admin_ng (dse-k8s): remove unused namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081161 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:53:41] cscott: still waiting on ci [13:53:51] (03Merged) 10jenkins-bot: admin_ng (dse-k8s): remove unused namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081161 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [13:54:01] urbanecm: yeah i'm watching the last job slowly scroll [13:54:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:56:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:56:44] (03PS2) 10Cwhite: logstash: force log field to string [puppet] - 10https://gerrit.wikimedia.org/r/1081155 (https://phabricator.wikimedia.org/T377422) [13:56:45] (03PS2) 10Cwhite: logstash: restore partition name back to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1081156 (https://phabricator.wikimedia.org/T377422) [13:57:48] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-(api-ext|web): create "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079572 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [13:57:56] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.20.0-a26 [core] (wmf/1.43.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1080773 (https://phabricator.wikimedia.org/T377287) (owner: 10C. Scott Ananian) [13:58:26] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1080770|Bump wikimedia/parsoid to 0.20.0-a26 (T377287)]], [[gerrit:1080773|Bump wikimedia/parsoid to 0.20.0-a26 (T377287)]] [13:58:29] finally [13:58:30] urbanecm: finally done [13:58:43] T377287: PHP Notice: Undefined property: Wikimedia\Parsoid\NodeData\DataMw::$parts - https://phabricator.wikimedia.org/T377287 [13:58:59] so the 'testing' on this is just confirming no more log messages appear, and evidence of absence is hard when it's only deployed to testservers [13:59:14] but when the time comes I can test parsoid and confirm that i didn't blow up the world, at least [13:59:16] so you're saying just sync? 
[13:59:23] okay, i'll wait [13:59:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:59:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:59:51] i'll give it a 30 second test at least for due diligence :) [14:00:03] sounds good [14:00:43] (03PS36) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [14:00:45] !log urbanecm@deploy2002 cscott, urbanecm: Backport for [[gerrit:1080770|Bump wikimedia/parsoid to 0.20.0-a26 (T377287)]], [[gerrit:1080773|Bump wikimedia/parsoid to 0.20.0-a26 (T377287)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:00:50] cscott: go ahead then! [14:01:01] (03PS4) 10Arnaudb: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) [14:01:02] that was fast! [14:03:19] urbanecm: looks good [14:03:31] !log urbanecm@deploy2002 cscott, urbanecm: Continuing with sync [14:03:33] proceeding [14:05:21] (03PS1) 10DCausse: cirrus-streaming-updater: bump to v20241017132903-67693a7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081172 (https://phabricator.wikimedia.org/T373459) [14:08:07] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080770|Bump wikimedia/parsoid to 0.20.0-a26 (T377287)]], [[gerrit:1080773|Bump wikimedia/parsoid to 0.20.0-a26 (T377287)]] (duration: 09m 41s) [14:08:16] cscott: done! 
[14:08:28] T377287: PHP Notice: Undefined property: Wikimedia\Parsoid\NodeData\DataMw::$parts - https://phabricator.wikimedia.org/T377287 [14:08:28] (03CR) 10Cwhite: [C:03+2] logstash: force log field to string [puppet] - 10https://gerrit.wikimedia.org/r/1081155 (https://phabricator.wikimedia.org/T377422) (owner: 10Cwhite) [14:08:52] (03PS37) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [14:09:00] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2005.codfw.wmnet with OS bookworm [14:10:21] thanks for deploying urbanecm [14:10:23] urbanecm: thanks! [14:16:54] (03CR) 10Tiziano Fogli: [C:03+1] logstash: force log field to string (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081155 (https://phabricator.wikimedia.org/T377422) (owner: 10Cwhite) [14:17:53] (03PS1) 10DCausse: rdf-streaming-updater: bump image to flink-1.17.1-rdf-0.3.149 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081175 (https://phabricator.wikimedia.org/T371874) [14:18:52] jouncebot: nowandnext [14:18:52] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [14:18:52] In 0 hour(s) and 41 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1500) [14:19:36] (03PS5) 10Volans: sre.switchdc.databases: allow to select a section [cookbooks] - 10https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144) [14:25:39] (03PS38) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) [14:26:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70265 and previous config saved to /var/cache/conftool/dbconfig/20241017-142643-arnaudb.json [14:27:01] (03PS5) 10Arnaudb: sre.mysql.upgrade: add depool/pool 
logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) [14:27:06] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:27:07] (03CR) 10CI reject: [V:04-1] mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [14:28:02] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [14:31:04] (03CR) 10Cwhite: [C:03+2] logstash: force log field to string (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1081155 (https://phabricator.wikimedia.org/T377422) (owner: 10Cwhite) [14:31:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2005.codfw.wmnet with reason: host reimage [14:36:30] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump to v20241017132903-67693a7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081172 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse) [14:37:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:30] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump to v20241017132903-67693a7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081172 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse) [14:38:19] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:38:24] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:38:44] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:38:53] 10ops-eqiad, 
06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238146 (10cmooney) [14:39:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:39:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:39:40] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:41:02] (03PS1) 10Scott French: hieradata: add "next" releases of mw-(web|api-ext) [puppet] - 10https://gerrit.wikimedia.org/r/1080793 (https://phabricator.wikimedia.org/T377040) [14:41:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70266 and previous config saved to /var/cache/conftool/dbconfig/20241017-144150-arnaudb.json [14:42:26] (03PS1) 10Bking: airflow: disable network egress by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081178 (https://phabricator.wikimedia.org/T374948) [14:43:12] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:43:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589#10238164 (10aborrero) 05In progress→03Resolved thanks! 
[14:43:27] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:44:28] (03PS6) 10Tiziano Fogli: kafka: port mirror maker alerts from icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [14:44:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238176 (10cmooney) [14:47:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70267 and previous config saved to /var/cache/conftool/dbconfig/20241017-144717-ladsgroup.json [14:47:25] 10ops-eqiad, 06DBA, 06DC-Ops: db1166 is not coming back online - https://phabricator.wikimedia.org/T377464#10238192 (10Ladsgroup) Force a reboot via mgmt console brought it back online. Worth checking what's going on. [14:47:48] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:48:03] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:48:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:48:07] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:48:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:48:18] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:48:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:48:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:49:25] !log bking@deploy2002 helmfile [dse-k8s-eqiad] 
START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:49:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:50:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:50:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P70268 and previous config saved to /var/cache/conftool/dbconfig/20241017-145030-ladsgroup.json [14:50:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:50:49] 10ops-eqiad, 06DBA, 06DC-Ops: db1166 is not coming back online - https://phabricator.wikimedia.org/T377464#10238201 (10Ladsgroup) I'm repooling the host for now. If you need it shut down, let me know. [14:51:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238216 (10cmooney) [14:51:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:51:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:53:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:53:43] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:54:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:54:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:56:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:56:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] 
DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:56:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70269 and previous config saved to /var/cache/conftool/dbconfig/20241017-145657-arnaudb.json [14:57:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:57:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:57:54] (03PS1) 10Herron: jaeger: bump to 1.62-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081181 (https://phabricator.wikimedia.org/T376904) [14:59:28] (03CR) 10Alexandros Kosiaris: [C:03+1] hieradata: add "next" releases of mw-(web|api-ext) [puppet] - 10https://gerrit.wikimedia.org/r/1080793 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [14:59:45] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:59:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:00:01] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:00:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:00:05] jeena and andre: gettimeofday() says it's time for Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1500) [15:01:12] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: bump image to flink-1.17.1-rdf-0.3.149 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081175 (https://phabricator.wikimedia.org/T371874) (owner: 10DCausse) [15:02:13] (03Merged) 10jenkins-bot: rdf-streaming-updater: bump image to flink-1.17.1-rdf-0.3.149 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081175 (https://phabricator.wikimedia.org/T371874) (owner: 10DCausse) [15:02:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P70270 and previous config saved to /var/cache/conftool/dbconfig/20241017-150224-ladsgroup.json [15:03:33] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:03:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:05:10] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:05:25] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:05:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P70271 and previous config saved to /var/cache/conftool/dbconfig/20241017-150535-ladsgroup.json [15:09:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10238283 (10phaultfinder) [15:12:02] (03PS7) 10Tiziano Fogli: kafka: port mirror maker alerts from icinga to alertmanager [alerts] 
- 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [15:12:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70272 and previous config saved to /var/cache/conftool/dbconfig/20241017-151204-arnaudb.json [15:12:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:12:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:12:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P70273 and previous config saved to /var/cache/conftool/dbconfig/20241017-151216-arnaudb.json [15:12:32] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:15:36] (03PS8) 10Tiziano Fogli: kafka: port mirror maker alerts from icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [15:17:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P70274 and previous config saved to /var/cache/conftool/dbconfig/20241017-151731-ladsgroup.json [15:18:54] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P70275 and previous config saved to /var/cache/conftool/dbconfig/20241017-152040-ladsgroup.json [15:22:15] (03CR) 10Jcrespo: "Sadly because of 
https://phabricator.wikimedia.org/T371351#10238327 This, which I proposed (all my fault) doesn't work. It would be nice t" [cookbooks] - 10https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: 10Volans) [15:23:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [15:23:26] (03CR) 10Jcrespo: "Sorry to make you work, Riccardo, more than necessary- I designed it but at least I also was the one who tested it and detected it not wor" [cookbooks] - 10https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: 10Volans) [15:24:48] (03CR) 10CDanis: [C:03+1] jaeger: bump to 1.62-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081181 (https://phabricator.wikimedia.org/T376904) (owner: 10Herron) [15:25:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238331 (10cmooney) [15:26:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238348 (10cmooney) [15:28:53] (03CR) 10Volans: "No worries at all. I'll drop the fix part."
[cookbooks] - 10https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: 10Volans) [15:29:29] (03PS9) 10Tiziano Fogli: kafka: port mirror maker alerts from icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [15:32:08] (03PS1) 10Hnowlan: Reduce limits and requests for various services on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081191 [15:32:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T376905)', diff saved to https://phabricator.wikimedia.org/P70276 and previous config saved to /var/cache/conftool/dbconfig/20241017-153238-ladsgroup.json [15:32:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238391 (10cmooney) [15:32:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [15:32:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance [15:32:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T376905)', diff saved to https://phabricator.wikimedia.org/P70277 and previous config saved to /var/cache/conftool/dbconfig/20241017-153257-ladsgroup.json [15:33:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238395 (10cmooney) [15:35:00] (03CR) 10Tiziano Fogli: "I've brought the comments from Puppet/Icinga here as well." 
[alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [15:35:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P70278 and previous config saved to /var/cache/conftool/dbconfig/20241017-153546-ladsgroup.json [15:39:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2005.codfw.wmnet with OS bookworm [15:39:43] (03CR) 10Scott French: [C:03+1] Reduce limits and requests for various services on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081191 (owner: 10Hnowlan) [15:40:27] (03CR) 10Hnowlan: [C:03+2] Reduce limits and requests for various services on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081191 (owner: 10Hnowlan) [15:40:56] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:41:30] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:42:47] (03Merged) 10jenkins-bot: Reduce limits and requests for various services on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081191 (owner: 10Hnowlan) [15:43:39] (03PS10) 10Tiziano Fogli: kafka: port mirror maker alerts from icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [15:44:27] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:44:34] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:44:45] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:44:56] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:45:46] 06SRE, 06Infrastructure-Foundations, 06Traffic: NetworkProbeLimit cookie should set samesite attribute - 
https://phabricator.wikimedia.org/T342624#10238465 (10Krinkle) This change introduces the following error, repeated in the console for me when logged-in. Note that, unlike the original message in the task... [15:46:01] (03CR) 10Jcrespo: [C:03+1] "This part worked flawlessly" [cookbooks] - 10https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144) (owner: 10Volans) [15:47:16] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:47:19] (03CR) 10Jcrespo: [C:03+1] "all good based on tests at T374972" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [15:47:39] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:47:53] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [15:48:41] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [15:48:44] (03CR) 10Jcrespo: [C:03+1] "Tested both working and with an (unintended) failure due to my bad logic, caught it at the end with: **MASTER_TO db2230.codfw.wmnet wrong " [cookbooks] - 10https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [15:48:51] (03PS1) 10Vgutierrez: profile: PoC spec hiera host lookup not working as expected [puppet] - 10https://gerrit.wikimedia.org/r/1081195 [15:50:22] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:50:58] (03CR) 10CI reject: [V:04-1] profile: PoC spec hiera host lookup not working as expected [puppet] - 10https://gerrit.wikimedia.org/r/1081195 (owner: 10Vgutierrez) [15:51:00] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:51:45] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [15:52:17] !log hnowlan@deploy1003 helmfile [staging] DONE 
helmfile.d/services/device-analytics: apply [15:55:31] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/php-1.43.0-wmf.11/includes/libs/http/MultiHttpClient.p... - https://phabricator.wikimedia.org/T369186#10238512 [15:56:05] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:56:08] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:56:15] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [15:56:33] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [15:57:47] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:57:50] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:57:56] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply [15:57:58] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:58:07] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [15:58:10] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:58:24] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [15:59:16] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [15:59:22] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [15:59:35] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [15:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10238532 (10phaultfinder) [16:00:04] jhathaway and rzl: #bothumor When your 
hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:04:24] (03PS1) 10Scott French: shellbox: reduce cpu resource requests in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081196 [16:10:43] (03PS2) 10Vgutierrez: profile: Fix puppetserver spec test [puppet] - 10https://gerrit.wikimedia.org/r/1081195 [16:10:58] (03PS1) 10Brouberol: airflow: remove custom network policy when the scheduler is running outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081200 (https://phabricator.wikimedia.org/T374948) [16:11:41] (03PS3) 10Vgutierrez: profile: Fix puppetserver spec test [puppet] - 10https://gerrit.wikimedia.org/r/1081195 [16:12:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P70279 and previous config saved to /var/cache/conftool/dbconfig/20241017-161242-arnaudb.json [16:13:12] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:13:54] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:44] (03CR) 10Btullis: [C:03+1] airflow: remove custom network policy when the scheduler is running outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081200 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [16:17:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10238641 (10RobH) [16:18:15] (03CR) 10BCornwall: 
"Not sure how relevant this is but "hostname" is a legacy fact while "networking.fqdn" is not:" [puppet] - 10https://gerrit.wikimedia.org/r/1081195 (owner: 10Vgutierrez) [16:19:07] (03CR) 10Hnowlan: [C:03+1] shellbox: reduce cpu resource requests in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081196 (owner: 10Scott French) [16:19:51] (03PS2) 10Brouberol: airflow: remove custom network policy when the scheduler is running outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081200 (https://phabricator.wikimedia.org/T374948) [16:21:20] (03CR) 10Vgutierrez: "I'd keep it till we don't switch to `networking.hostname` on the production environment:" [puppet] - 10https://gerrit.wikimedia.org/r/1081195 (owner: 10Vgutierrez) [16:22:33] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/php-1.43.0-wmf.11/includes/libs/http/MultiHttpClient.p... 
- https://phabricator.wikimedia.org/T369186#10238667 [16:23:14] (03CR) 10Btullis: [C:03+1] airflow: remove custom network policy when the scheduler is running outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081200 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [16:24:23] (03CR) 10Brouberol: [C:03+2] airflow: remove custom network policy when the scheduler is running outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081200 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [16:27:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P70280 and previous config saved to /var/cache/conftool/dbconfig/20241017-162749-arnaudb.json [16:28:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:29:59] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T377059#10238692 (10VRiley-WMF) 05Open→03Resolved Power supply has been replaced. Closing this ticket. 
[16:33:17] (03Abandoned) 10BCornwall: apache: Redirect sco Wiktionary to sco Wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [16:33:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T376905)', diff saved to https://phabricator.wikimedia.org/P70281 and previous config saved to /var/cache/conftool/dbconfig/20241017-163324-ladsgroup.json [16:34:14] (03CR) 10Tchanders: [C:03+1] Implement redirects to meta's Special:GlobalContributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081138 (https://phabricator.wikimedia.org/T376612) (owner: 10STran) [16:35:22] (03CR) 10BCornwall: [C:03+1] Remove obsolete api records [dns] - 10https://gerrit.wikimedia.org/r/1080295 (owner: 10Clément Goubert) [16:36:29] (03CR) 10BCornwall: [C:03+1] "Do these need to have any "reserved for" comments like other unused numbers?" [dns] - 10https://gerrit.wikimedia.org/r/1080295 (owner: 10Clément Goubert) [16:37:13] (03PS1) 10Brouberol: airflow: fix missing configmap when the scheduler isn't deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081210 (https://phabricator.wikimedia.org/T374948) [16:37:43] (03CR) 10Btullis: [C:03+1] airflow: fix missing configmap when the scheduler isn't deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081210 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [16:38:50] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:39:28] (03CR) 10Brouberol: [C:03+2] airflow: fix missing configmap when the scheduler isn't deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081210 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [16:40:42] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:41:14] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:42:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P70282 and previous config saved to /var/cache/conftool/dbconfig/20241017-164256-arnaudb.json [16:43:19] (03CR) 10Brouberol: [C:03+2] ATS: add mapping for airflow-analytics-test [puppet] - 10https://gerrit.wikimedia.org/r/1079361 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [16:43:43] (03CR) 10Scott French: [C:03+2] shellbox: reduce cpu resource requests in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081196 (owner: 10Scott French) [16:44:47] (03Merged) 10jenkins-bot: shellbox: reduce cpu resource requests in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081196 (owner: 10Scott French) [16:48:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70283 and previous config saved to /var/cache/conftool/dbconfig/20241017-164830-ladsgroup.json [16:49:01] !log phab2002 T377396 - fix UIDs/GIDs for phab-related system users: vcs: uid 496 -> 497 | aphlict: uid 497 -> uid 496, gid 497 -> gid 496 | chown aphlict:aphlict /var/log/aphlict | chown aphlict:aphlict /run/aphlict [16:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:24] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:49:43] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [16:49:52] T377396: PuppetFailure - phab2002 - https://phabricator.wikimedia.org/T377396 [16:50:13] FIRING: [2x] ProbeDown: Service phab2002:25 has failed probes (tcp_phabricator_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:31] oof, this is a reboot, already done [16:50:35] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [16:50:53] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:50:54] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [16:51:21] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:52:09] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:52:34] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:55:13] RESOLVED: [2x] ProbeDown: Service phab2002:25 has failed probes (tcp_phabricator_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:18] !log phab2002 T377396 - reboot | in addition to /etc/passwd also fix aphlict GID in /etc/group | fixed puppet run which can now create group vcs. now equivalent to prod server phab1004. 
[16:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:48] T377396: PuppetFailure - phab2002 - https://phabricator.wikimedia.org/T377396 [16:58:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P70284 and previous config saved to /var/cache/conftool/dbconfig/20241017-165803-arnaudb.json [16:58:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [16:58:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [16:58:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70285 and previous config saved to /var/cache/conftool/dbconfig/20241017-165814-arnaudb.json [16:58:23] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1700). [17:00:05] swfrench-wmf: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1700). [17:00:44] here, and starting work momentarily [17:01:09] (03CR) 10Scott French: "Thank you both for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1079572 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:01:11] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): create "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079572 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:02:38] (03Merged) 10jenkins-bot: mw-(api-ext|web): create "next" releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079572 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:03:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70286 and previous config saved to /var/cache/conftool/dbconfig/20241017-170337-ladsgroup.json [17:05:30] (03PS1) 10BryanDavis: bitu: Add some stewards to the list of account managers [puppet] - 10https://gerrit.wikimedia.org/r/1081220 (https://phabricator.wikimedia.org/T359820) [17:06:39] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:07:22] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:08:22] 06SRE, 10SRE-Access-Requests: Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490 (10JEbe-WMF) 03NEW [17:08:36] (03PS7) 10Cwhite: logstash: parse new containerd log format [puppet] - 10https://gerrit.wikimedia.org/r/1080603 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [17:08:47] (03CR) 10CI reject: [V:04-1] logstash: parse new containerd log format [puppet] - 10https://gerrit.wikimedia.org/r/1080603 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [17:11:07] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-10-17-122158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081222 [17:12:39] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-10-17-122158-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1081222 (owner: 10BryanDavis) [17:12:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:13:09] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:13:39] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-10-17-122158-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081222 (owner: 10BryanDavis) [17:13:52] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10238917 (10Stevemunene) [17:14:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [17:14:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10238933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye [17:14:25] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:14:48] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:15:07] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:15:57] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:16:28] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:16:38] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:16:45] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:17:08] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:18:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 
(T376905)', diff saved to https://phabricator.wikimedia.org/P70287 and previous config saved to /var/cache/conftool/dbconfig/20241017-171844-ladsgroup.json [17:19:44] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:19:48] (03PS1) 10JMeybohm: etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4 [puppet] - 10https://gerrit.wikimedia.org/r/1081224 (https://phabricator.wikimedia.org/T362408) [17:19:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:20:03] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081224 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [17:20:29] (03PS2) 10JMeybohm: etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4 [puppet] - 10https://gerrit.wikimedia.org/r/1081224 (https://phabricator.wikimedia.org/T362408) [17:20:51] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1081224 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [17:22:41] (03PS2) 10Scott French: hieradata: add "next" releases of mw-(web|api-ext) [puppet] - 10https://gerrit.wikimedia.org/r/1080793 (https://phabricator.wikimedia.org/T377040) [17:23:35] (03CR) 10Scott French: [C:03+2] hieradata: add "next" releases of mw-(web|api-ext) [puppet] - 10https://gerrit.wikimedia.org/r/1080793 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [17:24:00] (03CR) 10Giuseppe Lavagetto: "LGTM as fix, see my comment on the added tests." 
[puppet] - 10https://gerrit.wikimedia.org/r/1081195 (owner: 10Vgutierrez) [17:29:36] (03PS1) 10Dzahn: cloud/devtools: do NOT bind service IP on gerrit test instances [puppet] - 10https://gerrit.wikimedia.org/r/1081225 (https://phabricator.wikimedia.org/T363196) [17:30:15] (03CR) 10Dzahn: [C:03+2] cloud/devtools: do NOT bind service IP on gerrit test instances [puppet] - 10https://gerrit.wikimedia.org/r/1081225 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [17:31:58] !log swfrench@deploy2002 Started scap sync-world: Testing scap after mw-api-ext / mw-web next release bring up - T377040 [17:32:50] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [17:34:17] (03CR) 10CDanis: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/1080276 (https://phabricator.wikimedia.org/T376291) (owner: 10Cathal Mooney) [17:34:52] !log swfrench@deploy2002 Finished scap sync-world: Testing scap after mw-api-ext / mw-web next release bring up - T377040 (duration: 02m 54s) [17:37:14] (03PS2) 10Scott French: mw-(api-ext|web): remove "next" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080794 (https://phabricator.wikimedia.org/T377040) [17:43:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:43:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:45:07] (03PS1) 10Brouberol: airflow-analytic-test: fix OIDC client id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081228 (https://phabricator.wikimedia.org/T374948) [17:45:47] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10239057 (10Dzahn) 05In progress→03Stalled Basically done. All that is missing is we haven't assigned a service IP to this machine. 
In puppet it is just set to not bind... [17:50:23] (03Abandoned) 10Bking: airflow: disable network egress by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081178 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [17:50:40] (03CR) 10Bking: [C:03+2] airflow-analytic-test: fix OIDC client id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081228 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [17:50:56] (03CR) 10Bking: [V:03+2 C:03+2] airflow-analytic-test: fix OIDC client id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081228 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [17:51:19] (03CR) 10Btullis: [C:03+1] airflow-analytic-test: fix OIDC client id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081228 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [17:51:48] (03Merged) 10jenkins-bot: airflow-analytic-test: fix OIDC client id [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081228 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [17:54:55] forgot to say, I am done with the infra window [17:55:33] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th): Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10239141 (10Ottomata) a:03Ottomata [17:55:35] 06SRE, 06Data-Platform-SRE, 10Data-Engineering (Q2 2024 October 1st - December 31th): Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10239142 (10Ahoelzl) [17:55:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:56:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [17:58:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70288 and 
previous config saved to /var/cache/conftool/dbconfig/20241017-175841-arnaudb.json [17:59:02] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:00:05] jeena and andre: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T1800). [18:01:12] (03PS1) 10Brouberol: airflow-analytic-test: disable remote logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081230 (https://phabricator.wikimedia.org/T374948) [18:11:23] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081233 (https://phabricator.wikimedia.org/T375658) [18:11:24] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081233 (https://phabricator.wikimedia.org/T375658) (owner: 10TrainBranchBot) [18:12:09] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081233 (https://phabricator.wikimedia.org/T375658) (owner: 10TrainBranchBot) [18:12:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) (owner: 10Pppery) [18:13:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70289 and previous config saved to /var/cache/conftool/dbconfig/20241017-181348-arnaudb.json [18:16:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2081.codfw.wmnet with OS bullseye [18:16:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - 
https://phabricator.wikimedia.org/T371400#10239225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye execute... [18:19:03] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.27 refs T375658 [18:19:25] T375658: 1.43.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T375658 [18:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377317#10239232 (10phaultfinder) [18:22:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:23:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [18:23:39] (03CR) 10Bking: [C:03+2] airflow-analytic-test: disable remote logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081230 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [18:25:02] (03Merged) 10jenkins-bot: airflow-analytic-test: disable remote logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081230 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [18:28:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70290 and previous config saved to /var/cache/conftool/dbconfig/20241017-182855-arnaudb.json [18:29:01] (03CR) 10Pppery: "Did a quick check for instances that need NamespaceDupes.php to be run:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) (owner: 10Pppery) [18:37:41] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-(api-ext|web): remove "next" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080794 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:41:06] jeena: OK for me to deploy a new release of scap? 
[18:41:26] yeah all good [18:41:31] thx [18:41:43] !log dancy@deploy2002 Installing scap version "4.111.0" for 210 hosts [18:44:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70291 and previous config saved to /var/cache/conftool/dbconfig/20241017-184402-arnaudb.json [18:44:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:45:57] !log dancy@deploy2002 Installation of scap version "4.111.0" completed for 210 hosts [18:47:01] !log dancy@deploy2002 Started scap sync-world: testing scap 4.111.0 [18:48:00] !log mwscript-k8s --comment=T377360 -f -- extensions/Flow/maintenance/FlowFixInconsistentBoards.php --wiki=wikidatawiki # T377360 [18:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:24] T377360: Run fixInconsistentBoards on Wikidata - https://phabricator.wikimedia.org/T377360 [18:48:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:48:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:48:58] (03PS1) 10Scott French: wmnet: A and PTR records for mw-(web|api-ext)-next in svc [dns] - 10https://gerrit.wikimedia.org/r/1080778 (https://phabricator.wikimedia.org/T377040) [18:49:17] (03PS1) 10Scott French: service: add base configuration for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1080788 (https://phabricator.wikimedia.org/T377040) [18:49:46] !log dancy@deploy2002 Finished scap sync-world: testing scap 4.111.0 (duration: 02m 44s) [18:51:34] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1080794 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:51:36] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): remove "next" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080794 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:52:44] (03Merged) 10jenkins-bot: mw-(api-ext|web): remove "next" release values overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080794 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:53:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:53:10] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [18:53:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:53:47] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [18:55:44] (03CR) 10RLazarus: [C:03+1] wmnet: A and PTR records for mw-(web|api-ext)-next in svc [dns] - 10https://gerrit.wikimedia.org/r/1080778 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:56:41] (03PS1) 10Dzahn: push miscweb/static-codereview to image 2024-10-17-175203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081241 (https://phabricator.wikimedia.org/T363771) [18:58:23] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10239372 (10mpopov) I approve membership in `airflow-analytics-product-admins` (Olja will need to approve as @JEbe-WMF's man... 
[18:59:53] (03PS1) 10Dzahn: cloud/devtools: disable lfs data syncing on gerrit test instance [puppet] - 10https://gerrit.wikimedia.org/r/1081244 (https://phabricator.wikimedia.org/T363196) [19:00:43] (03CR) 10Dzahn: [C:03+2] cloud/devtools: disable lfs data syncing on gerrit test instance [puppet] - 10https://gerrit.wikimedia.org/r/1081244 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [19:01:33] (03CR) 10RLazarus: [C:03+1] service: add base configuration for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1080788 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:02:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:02:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [19:05:45] (03CR) 10Urbanecm: [C:04-1] "-1 for visibility" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [19:06:39] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1080788 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:06:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:06:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:06:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T376905)', diff saved to https://phabricator.wikimedia.org/P70292 and previous config saved to /var/cache/conftool/dbconfig/20241017-190655-ladsgroup.json [19:07:56] !log dancy@deploy2002 Installing scap version "4.112.0" for 210 hosts [19:12:14] (03CR) 10Scott French: [C:03+2] service: add base configuration for mw-(web|api-ext)-next [puppet] - 10https://gerrit.wikimedia.org/r/1080788 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:13:14] !log dancy@deploy2002 Installing scap version "4.112.0" for 1 hosts [19:15:28] !log dancy@deploy2002 Started scap sync-world: testing https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/484 [19:16:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T376905)', diff saved to https://phabricator.wikimedia.org/P70293 and previous config saved to /var/cache/conftool/dbconfig/20241017-191601-ladsgroup.json [19:18:14] !log dancy@deploy2002 Finished scap sync-world: testing https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/484 (duration: 02m 46s) [19:23:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [19:24:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:24:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:24:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:24:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:24:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T376905)', diff saved to https://phabricator.wikimedia.org/P70294 and previous config saved to /var/cache/conftool/dbconfig/20241017-192424-ladsgroup.json [19:25:40] (03CR) 10Scott French: [C:03+2] wmnet: A and PTR records for mw-(web|api-ext)-next in svc [dns] - 10https://gerrit.wikimedia.org/r/1080778 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [19:26:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [19:31:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [19:31:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P70295 and previous config saved to 
/var/cache/conftool/dbconfig/20241017-193108-ladsgroup.json [19:33:01] !log ran authdns-update to pick up records for mw-(web|api-ext)-next in svc - T377040 [19:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:24] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [19:33:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T376905)', diff saved to https://phabricator.wikimedia.org/P70296 and previous config saved to /var/cache/conftool/dbconfig/20241017-193358-ladsgroup.json [19:39:35] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1081250 [19:46:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P70297 and previous config saved to /var/cache/conftool/dbconfig/20241017-194615-ladsgroup.json [19:49:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P70298 and previous config saved to /var/cache/conftool/dbconfig/20241017-194905-ladsgroup.json [19:52:17] (03PS8) 10Cwhite: logstash: parse new containerd log format [puppet] - 10https://gerrit.wikimedia.org/r/1080603 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [19:55:53] (03PS3) 10Herron: grafana-loki: add systemd override and bump max open files [puppet] - 10https://gerrit.wikimedia.org/r/1081250 (https://phabricator.wikimedia.org/T377502) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241017T2000). [20:00:05] Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:10] here [20:01:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T376905)', diff saved to https://phabricator.wikimedia.org/P70299 and previous config saved to /var/cache/conftool/dbconfig/20241017-200122-ladsgroup.json [20:01:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [20:01:36] I might be able to deploy. [20:01:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [20:01:45] * kindrobot looks at the patches [20:01:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T376905)', diff saved to https://phabricator.wikimedia.org/P70300 and previous config saved to /var/cache/conftool/dbconfig/20241017-200147-ladsgroup.json [20:02:23] OK, I can deploy! [20:02:42] It's one patch that does a lot of stuff because I got fed up with stuff languishing forever and did it all at once [20:02:49] * cjming bows to kindrobot [20:03:45] Did you see my comments on the patch about needing to run NamespaceDupes on a bunch of wikis? [20:04:03] I did not. [20:04:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P70301 and previous config saved to /var/cache/conftool/dbconfig/20241017-200412-ladsgroup.json [20:06:28] * kindrobot reads [20:07:08] Pppery: could you give me the exact commands that need to be run? [20:10:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T376905)', diff saved to https://phabricator.wikimedia.org/P70302 and previous config saved to /var/cache/conftool/dbconfig/20241017-201051-ladsgroup.json [20:11:15] * kindrobot is just not familiar with NamespaceDupes [20:11:22] I know. 
I'm thinking [20:11:59] I think just `namespaceDupes --fix` should work on the wiki [20:12:18] And it will give you a dry run of what it would do if you run it without `--fix` [20:12:43] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes should help [20:12:54] Thanks tyler [20:12:54] Thanks thcipriani ! [20:13:22] In this case with one exception the patch is adding namespace aliases not new namespaces, but the same concept should work [20:14:43] Oh, thcipriani, we need to update that page to use the new mwscript-k8s script [20:14:55] ah, I was just wondering about that [20:14:55] OK, kicking off the backport [20:16:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) (owner: 10Pppery) [20:17:10] (03Merged) 10jenkins-bot: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) (owner: 10Pppery) [20:17:26] !log kindrobot@deploy2002 Started scap sync-world: Backport for [[gerrit:1080078|Configure namespaces, sitenames, and timezones for new wikis (T377160 T375102 T375017 T375424 T376572 T377088 T374644 T375024 T374815 T375095 T375433 T360303 T363256 T360310)]] [20:17:27] This is going to take a while to test once it's on mwdebug, just because it configures things for so many wikis [20:18:00] no problem, you're the only one, so we've got time :) [20:18:23] T377160: Post-creation work for annwiki - https://phabricator.wikimedia.org/T377160 [20:18:24] T375102: Post-creation work for nrwiki - https://phabricator.wikimedia.org/T375102 [20:18:24] T375017: Post-creation work for rskwiki - https://phabricator.wikimedia.org/T375017 [20:18:25] T375424: Post-creation work for tddwiki - https://phabricator.wikimedia.org/T375424 [20:18:25] T376572: Post-creation work for ibawiki - 
https://phabricator.wikimedia.org/T376572 [20:18:25] T377088: Post-creation work for bclwikisource - https://phabricator.wikimedia.org/T377088 [20:18:26] T374644: Post-creation work for moswiki - https://phabricator.wikimedia.org/T374644 [20:18:26] T375024: Post-creation work for madwiktionary - https://phabricator.wikimedia.org/T375024 [20:18:26] T374815: Post-creation work for kgewiki - https://phabricator.wikimedia.org/T374815 [20:18:27] T375095: Post-creation work for gorwikiquote - https://phabricator.wikimedia.org/T375095 [20:18:27] T375433: Post-creation work for shnwikinews - https://phabricator.wikimedia.org/T375433 [20:18:28] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303 [20:18:29] T363256: Post-creation work for kaawiktionary - https://phabricator.wikimedia.org/T363256 [20:18:29] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [20:18:40] I love you stashbot [20:18:46] heh [20:19:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T376905)', diff saved to https://phabricator.wikimedia.org/P70303 and previous config saved to /var/cache/conftool/dbconfig/20241017-201919-ladsgroup.json [20:19:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [20:19:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [20:19:40] !log kindrobot@deploy2002 pppery, kindrobot: Backport for [[gerrit:1080078|Configure namespaces, sitenames, and timezones for new wikis (T377160 T375102 T375017 T375424 T376572 T377088 T374644 T375024 T374815 T375095 T375433 T360303 T363256 T360310)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:19:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T376905)', diff saved to https://phabricator.wikimedia.org/P70304 and 
previous config saved to /var/cache/conftool/dbconfig/20241017-201944-ladsgroup.json [20:19:49] Looking now [20:20:05] Pppery: ready to test. Ping me when you're done. (I'll be doing some code reviews.) [20:21:26] bclwikisource namespace: seems to work as expected [20:21:35] (still more to do, doing each thing one by one) [20:21:54] (03CR) 10Cwhite: [C:03+2] logstash: parse new containerd log format (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1080603 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [20:22:03] no rush, we have ~30 more minutes in the window [20:22:33] (budgeting time for maintenance scripts) [20:23:02] bewwiki namespaces: seem to work as expected [20:25:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P70305 and previous config saved to /var/cache/conftool/dbconfig/20241017-202558-ladsgroup.json [20:27:26] kus.wikipedia.org has something off but it's not caused by my patch, it's a preexisting issue so I'm continuing [20:29:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T376905)', diff saved to https://phabricator.wikimedia.org/P70306 and previous config saved to /var/cache/conftool/dbconfig/20241017-202911-ladsgroup.json [20:31:09] Finished checking namespaces, all of them are fine to continue. 
Still have to do site names and timezones [20:31:27] ack [20:33:17] (03PS1) 10Dzahn: gerrit: make parameter lfs_sync_dest optional [puppet] - 10https://gerrit.wikimedia.org/r/1081257 (https://phabricator.wikimedia.org/T363196) [20:34:29] (03CR) 10Cwhite: [C:03+2] ci: capture job completion timer metrics [puppet] - 10https://gerrit.wikimedia.org/r/1080400 (https://phabricator.wikimedia.org/T233089) (owner: 10Cwhite) [20:36:09] (03CR) 10Dzahn: [C:03+2] "even when LFS data syncing is disabled, puppet still failed because it wanted to know a destination for the sync, so -> https://gerrit.wik" [puppet] - 10https://gerrit.wikimedia.org/r/1081244 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [20:36:30] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1081257/4318/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1081257 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn) [20:37:28] (03PS1) 10Bking: analytics_test_cluster: add secret [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) [20:37:53] Finished checking timezones, now have to do site names [20:38:31] (03PS2) 10Bking: analytics_test_cluster: add secret [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) [20:38:39] (03CR) 10Bking: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [20:38:51] Pppery: status check? 
[20:39:01] See above [20:39:04] still checking site names [20:39:08] ack ty [20:41:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P70307 and previous config saved to /var/cache/conftool/dbconfig/20241017-204105-ladsgroup.json [20:42:34] Kindrobot: Looks good, proceed [20:43:06] And sorry it took so long, when I prepared this patch I didn't realize testing it would take almost 20 minutes [20:43:49] absolutely no problem [20:43:52] syncing now [20:43:56] !log kindrobot@deploy2002 pppery, kindrobot: Continuing with sync [20:44:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P70308 and previous config saved to /var/cache/conftool/dbconfig/20241017-204418-ladsgroup.json [20:48:41] !log kindrobot@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080078|Configure namespaces, sitenames, and timezones for new wikis (T377160 T375102 T375017 T375424 T376572 T377088 T374644 T375024 T374815 T375095 T375433 T360303 T363256 T360310)]] (duration: 31m 15s) [20:48:59] T377160: Post-creation work for annwiki - https://phabricator.wikimedia.org/T377160 [20:48:59] T375102: Post-creation work for nrwiki - https://phabricator.wikimedia.org/T375102 [20:49:00] T375017: Post-creation work for rskwiki - https://phabricator.wikimedia.org/T375017 [20:49:00] T375424: Post-creation work for tddwiki - https://phabricator.wikimedia.org/T375424 [20:49:01] T376572: Post-creation work for ibawiki - https://phabricator.wikimedia.org/T376572 [20:49:01] T377088: Post-creation work for bclwikisource - https://phabricator.wikimedia.org/T377088 [20:49:01] T374644: Post-creation work for moswiki - https://phabricator.wikimedia.org/T374644 [20:49:02] T375024: Post-creation work for madwiktionary - https://phabricator.wikimedia.org/T375024 [20:49:02] T374815: Post-creation work for kgewiki - 
https://phabricator.wikimedia.org/T374815 [20:49:03] T375095: Post-creation work for gorwikiquote - https://phabricator.wikimedia.org/T375095 [20:49:03] T375433: Post-creation work for shnwikinews - https://phabricator.wikimedia.org/T375433 [20:49:03] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303 [20:49:04] T363256: Post-creation work for kaawiktionary - https://phabricator.wikimedia.org/T363256 [20:49:04] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [20:50:04] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:50:18] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:50:18] Beginning NamespaceDupes [20:51:56] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:52:16] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:56:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T376905)', diff saved to https://phabricator.wikimedia.org/P70309 and previous config saved to /var/cache/conftool/dbconfig/20241017-205612-ladsgroup.json [20:56:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [20:56:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [20:56:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:56:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:56:56] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T376905)', diff saved to https://phabricator.wikimedia.org/P70310 and previous config saved to /var/cache/conftool/dbconfig/20241017-205655-ladsgroup.json [20:58:35] How is the namespaceDupes run going? [20:59:13] finished [20:59:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P70311 and previous config saved to /var/cache/conftool/dbconfig/20241017-205925-ladsgroup.json [20:59:32] I'm going to upload the logs as a paste, just in case there's anything interesting [20:59:55] Thanks [21:01:19] !log ran mwscript-k8s -f --comment="https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1080078/comments/02a9334e_cd3e7a0e" -- namespaceDupes.php on: bclwikisource, bewwiki, gorwikiquote, iglwiki, kaawiktionary, kgewiki, kuswiki, madwiktionary, moswiki, nrwiki, rskwiki, shnwikinews, and tddwiki [21:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:13] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10239955 (10Eevans) [21:04:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T376905)', diff saved to https://phabricator.wikimedia.org/P70312 and previous config saved to /var/cache/conftool/dbconfig/20241017-210428-ladsgroup.json [21:04:51] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10239959 (10Eevans) >>! In T377490#10239372, @mpopov wrote: > I approve membership in `airflow-analytics-product-admins` > >...
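Editorial sketch, not part of the channel log: the 21:01:19 entry above records one namespaceDupes run per wiki. A minimal way to assemble those command lines for review might look like the following. It only builds and prints strings and executes nothing. The flag spellings (`-f`, `--comment`, `--`, `--wiki=`, `--fix`, `--add-prefix`) are copied from commands quoted in this log; the helper name `namespace_dupes_argv`, the placeholder comment value, and the idea that `--fix` and `--add-prefix` are passed together are assumptions, since the log does not show the exact per-wiki invocations.

```python
import shlex

# Wikis named in the 21:01:19 !log entry.
WIKIS = [
    "bclwikisource", "bewwiki", "gorwikiquote", "iglwiki", "kaawiktionary",
    "kgewiki", "kuswiki", "madwiktionary", "moswiki", "nrwiki", "rskwiki",
    "shnwikinews", "tddwiki",
]


def namespace_dupes_argv(wiki, comment, fix=False, add_prefix=None):
    """Build (but do not run) an argv list for one wiki.

    Hypothetical helper: argument order mirrors the mwscript-k8s
    invocations quoted in the log, nothing more.
    """
    argv = [
        "mwscript-k8s", "-f", f"--comment={comment}", "--",
        "namespaceDupes.php", f"--wiki={wiki}",
    ]
    if fix:
        # Per the chat, omitting --fix gives a dry run.
        argv.append("--fix")
    if add_prefix is not None:
        # Assumption: combined with --fix when resolving conflicts.
        argv += ["--add-prefix", add_prefix]
    return argv


# Print dry-run command lines for review; "T377160" is a placeholder comment.
for wiki in WIKIS:
    print(shlex.join(namespace_dupes_argv(wiki, "T377160")))
```

Printing the dry-run form first matches the advice given in the channel at 20:12:18: run without `--fix`, read the output, then decide whether to rerun with it.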
[21:07:34] kindrobot: oh sweet, I was just realizing I hadn't updated the docs to point deployers at mwscript-k8s :) [21:07:52] (03PS1) 10Fabfur: haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [21:08:27] !log results of de-duping: https://phabricator.wikimedia.org/P70313 [21:08:27] (03CR) 10CI reject: [V:04-1] haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [21:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:41] I can't see that paste [21:08:54] There shouldn't be anything private there [21:08:57] You may need to sign into phabricator [21:09:15] No, I'm signed in [21:09:20] Access Denied: Restricted Paste [21:09:21] You do not have permission to view this object. [21:09:21] Users with the "Can View" capability: [21:09:22]     This object has a custom policy controlling who can take this action. [21:09:22]     The author of a paste can always view and edit it. [21:10:24] try now?
[21:10:30] yep [21:10:31] works [21:10:33] great [21:10:41] (03PS2) 10Fabfur: haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [21:11:00] I know it's probably safe, but just given that I don't read every line of the output, I want to use an abundance of caution [21:11:14] (03CR) 10CI reject: [V:04-1] haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [21:11:31] !log UTC late backport window finished <3 [21:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:02] Thanks Pppery, thcipriani, and everyone else for their service o7 [21:12:11] One last thing - there were a few conflicts for the bewwiki namespaceDupes - you should probably re-run the script for bewwiki with `--add-prefix T360310` to clean those up [21:12:11] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [21:12:12] (03PS3) 10Fabfur: haproxykafka: start working on haproxykafka puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1081264 (https://phabricator.wikimedia.org/T374128) [21:12:23] ack [21:12:45] kindrobot: thank you for volunteering to run the backport window <3 [21:14:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T376905)', diff saved to https://phabricator.wikimedia.org/P70314 and previous config saved to /var/cache/conftool/dbconfig/20241017-211432-ladsgroup.json [21:14:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:14:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:14:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T376905)', diff saved to
https://phabricator.wikimedia.org/P70315 and previous config saved to /var/cache/conftool/dbconfig/20241017-211458-ladsgroup.json
[21:16:08] Pppery: done, paste has been updated at the bottom
[21:16:12] Thanks, that worked
[21:16:31] my pleasure
[21:16:36] I now have to tell the Global Sysops to clean up by moving the pages with that prefix where they belong, but that's not your fault
[21:16:53] good luck o7
[21:19:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P70316 and previous config saved to /var/cache/conftool/dbconfig/20241017-211935-ladsgroup.json
[21:22:24] (PS1) RLazarus: deployment_server: mwscript-k8s logging cleanups [puppet] - https://gerrit.wikimedia.org/r/1081265 (https://phabricator.wikimedia.org/T377292)
[21:24:11] I'll be back next week with another similar patch covering something I missed in that one, but it's just one wiki rather than over a dozen so it will be much quicker
[21:25:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T376905)', diff saved to https://phabricator.wikimedia.org/P70317 and previous config saved to /var/cache/conftool/dbconfig/20241017-212536-ladsgroup.json
[21:34:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P70318 and previous config saved to /var/cache/conftool/dbconfig/20241017-213442-ladsgroup.json
[21:40:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P70319 and previous config saved to /var/cache/conftool/dbconfig/20241017-214043-ladsgroup.json
[21:46:54] SRE, SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137#10240103 (Ladsgroup) The old reports are too old and the logs have been purged but for the
Zimbabwe $5 picture, I found two logs that might be useful: https:/...
[21:48:52] (PS1) Pppery: Configure settings for ann, nrwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1081267
[21:49:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T376905)', diff saved to https://phabricator.wikimedia.org/P70320 and previous config saved to /var/cache/conftool/dbconfig/20241017-214949-ladsgroup.json
[21:49:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[21:50:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[21:50:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T376905)', diff saved to https://phabricator.wikimedia.org/P70321 and previous config saved to /var/cache/conftool/dbconfig/20241017-215014-ladsgroup.json
[21:50:29] (PS2) Pppery: Configure settings for ann, nrwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1081267
[21:52:55] (PS3) Pppery: Configure settings for ann, nrwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1081267 (https://phabricator.wikimedia.org/T375102)
[21:55:47] (PS1) Bking: airflow: make 'secret_key' configurable [puppet] - https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948)
[21:55:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P70322 and previous config saved to /var/cache/conftool/dbconfig/20241017-215550-ladsgroup.json
[21:56:21] (CR) CI reject: [V:-1] airflow: make 'secret_key' configurable [puppet] - https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[21:56:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T376905)', diff saved to
https://phabricator.wikimedia.org/P70323 and previous config saved to /var/cache/conftool/dbconfig/20241017-215648-ladsgroup.json
[21:59:36] (CR) Scott French: [C:+1] deployment_server: mwscript-k8s logging cleanups [puppet] - https://gerrit.wikimedia.org/r/1081265 (https://phabricator.wikimedia.org/T377292) (owner: RLazarus)
[22:01:02] (PS2) Bking: airflow: make 'secret_key' configurable [puppet] - https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948)
[22:01:37] (CR) CI reject: [V:-1] airflow: make 'secret_key' configurable [puppet] - https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[22:06:56] (PS3) Bking: airflow: make 'secret_key' configurable [puppet] - https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948)
[22:07:20] (CR) CI reject: [V:-1] airflow: make 'secret_key' configurable [puppet] - https://gerrit.wikimedia.org/r/1081268 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[22:10:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T376905)', diff saved to https://phabricator.wikimedia.org/P70324 and previous config saved to /var/cache/conftool/dbconfig/20241017-221057-ladsgroup.json
[22:11:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[22:11:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[22:11:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T376905)', diff saved to https://phabricator.wikimedia.org/P70325 and previous config saved to /var/cache/conftool/dbconfig/20241017-221123-ladsgroup.json
[22:11:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P70326 and previous
config saved to /var/cache/conftool/dbconfig/20241017-221155-ladsgroup.json
[22:31:44] (PS1) Dzahn: cloud/devtools: set a non-existing lfs data sync target [puppet] - https://gerrit.wikimedia.org/r/1081270 (https://phabricator.wikimedia.org/T363196)
[22:32:16] (CR) Dzahn: [C:+2] "https://en.wikipedia.org/wiki/Example.com" [puppet] - https://gerrit.wikimedia.org/r/1081270 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[22:34:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P70329 and previous config saved to /var/cache/conftool/dbconfig/20241017-223443-ladsgroup.json
[22:41:42] (PS1) Dzahn: cloud/devtools: turn gerrit lfs_sync_dest into an array [puppet] - https://gerrit.wikimedia.org/r/1081273 (https://phabricator.wikimedia.org/T363196)
[22:41:55] (PS2) Dzahn: cloud/devtools: turn gerrit lfs_sync_dest into an array [puppet] - https://gerrit.wikimedia.org/r/1081273 (https://phabricator.wikimedia.org/T363196)
[22:42:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T376905)', diff saved to https://phabricator.wikimedia.org/P70330 and previous config saved to /var/cache/conftool/dbconfig/20241017-224209-ladsgroup.json
[22:42:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[22:42:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[22:43:01] (CR) Dzahn: [C:+2] cloud/devtools: turn gerrit lfs_sync_dest into an array [puppet] - https://gerrit.wikimedia.org/r/1081273 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[22:49:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P70331 and previous config saved to
/var/cache/conftool/dbconfig/20241017-224950-ladsgroup.json
[22:52:57] (PS1) Dzahn: cloud/devtools/gerrit-bullseye: mask service, no monitoring [puppet] - https://gerrit.wikimedia.org/r/1081277 (https://phabricator.wikimedia.org/T363196)
[22:53:11] (CR) CI reject: [V:-1] cloud/devtools/gerrit-bullseye: mask service, no monitoring [puppet] - https://gerrit.wikimedia.org/r/1081277 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[22:53:29] (PS2) Dzahn: cloud/devtools/gerrit-bullseye: mask service, no monitoring [puppet] - https://gerrit.wikimedia.org/r/1081277 (https://phabricator.wikimedia.org/T363196)
[22:55:01] (CR) Dzahn: [C:+2] "Note how there is also already "profile::gerrit::bind_service_ip: false", so no effect on listening on anything." [puppet] - https://gerrit.wikimedia.org/r/1081277 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[22:59:30] (PS1) Dduvall: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115)
[23:00:46] SRE, SRE-Access-Requests, Data-Platform-SRE (2024.09.28 - 2024.10.18): Requesting access to airflow-analytics-product-admins for jebe - https://phabricator.wikimedia.org/T377490#10240359 (Eevans)
[23:00:48] (PS4) Scott French: P:trafficserver: extend x-wikimedia-debug-routing for mwdebug-next [puppet] - https://gerrit.wikimedia.org/r/1072638 (https://phabricator.wikimedia.org/T372605)
[23:01:31] (CR) CI reject: [V:-1] deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: Dduvall)
[23:03:24] (PS2) Dduvall: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115)
[23:03:58] (CR) CI reject: [V:-1]
deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: Dduvall)
[23:04:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T376905)', diff saved to https://phabricator.wikimedia.org/P70332 and previous config saved to /var/cache/conftool/dbconfig/20241017-230457-ladsgroup.json
[23:05:02] (PS1) Eevans: Add jebe to airflow-analytics-product-admins per access request [puppet] - https://gerrit.wikimedia.org/r/1081285 (https://phabricator.wikimedia.org/T377490)
[23:05:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance
[23:05:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance
[23:05:58] (CR) Scott French: "Thanks, Valentin!" [puppet] - https://gerrit.wikimedia.org/r/1072638 (https://phabricator.wikimedia.org/T372605) (owner: Scott French)
[23:06:28] (PS3) Dduvall: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115)
[23:07:02] (CR) CI reject: [V:-1] deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115) (owner: Dduvall)
[23:08:34] (PS4) Dduvall: deployment_server::mediawiki: Execute scap mwscript/mwshell as mwbuilder [puppet] - https://gerrit.wikimedia.org/r/1081281 (https://phabricator.wikimedia.org/T369115)
[23:10:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance
[23:10:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with
reason: Maintenance
[23:10:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T376905)', diff saved to https://phabricator.wikimedia.org/P70333 and previous config saved to /var/cache/conftool/dbconfig/20241017-231037-ladsgroup.json
[23:16:12] (PS1) Dzahn: cloud/devtools: set service IP to existing gerrit.devtools.wmcloud.org. [puppet] - https://gerrit.wikimedia.org/r/1081286 (https://phabricator.wikimedia.org/T363196)
[23:16:26] (PS2) Dzahn: cloud/devtools: set service IP to existing gerrit.devtools.wmcloud.org. [puppet] - https://gerrit.wikimedia.org/r/1081286 (https://phabricator.wikimedia.org/T363196)
[23:16:29] (CR) CI reject: [V:-1] cloud/devtools: set service IP to existing gerrit.devtools.wmcloud.org. [puppet] - https://gerrit.wikimedia.org/r/1081286 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[23:18:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T376905)', diff saved to https://phabricator.wikimedia.org/P70334 and previous config saved to /var/cache/conftool/dbconfig/20241017-231835-ladsgroup.json
[23:23:21] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT
[23:33:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P70335 and previous config saved to /var/cache/conftool/dbconfig/20241017-233342-ladsgroup.json
[23:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1081289
[23:38:41] (CR)
TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1081289 (owner: TrainBranchBot)
[23:48:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P70336 and previous config saved to /var/cache/conftool/dbconfig/20241017-234849-ladsgroup.json
[23:55:15] (PS3) Cwhite: logstash: restore partition name back to k8s [puppet] - https://gerrit.wikimedia.org/r/1081156 (https://phabricator.wikimedia.org/T377422)
[23:57:31] (CR) Cwhite: [C:+2] logstash: restore partition name back to k8s [puppet] - https://gerrit.wikimedia.org/r/1081156 (https://phabricator.wikimedia.org/T377422) (owner: Cwhite)