[00:09:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1078495 (owner: 10TrainBranchBot) [00:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209115 (10phaultfinder) [00:44:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:08:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.26 [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1078500 (https://phabricator.wikimedia.org/T375657) [01:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.26 [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1078500 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [01:39:38] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.26 [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1078500 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [01:55:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [01:58:31] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:59:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0200) [02:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209205 (10phaultfinder) [02:26:56] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:13] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0300) [03:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:59] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078508 (https://phabricator.wikimedia.org/T375657) [03:02:01] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078508 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [03:02:46] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078508 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [03:03:11] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.43.0-wmf.26 refs T375657 [03:03:14] T375657: 1.43.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T375657 [03:12:59] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:39:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209234 (10phaultfinder) [03:50:55] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.43.0-wmf.26 refs T375657 (duration: 47m 44s) [03:50:58] T375657: 1.43.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T375657 [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0400) [04:01:00] !log mwpresync@deploy2002 Pruned MediaWiki: 1.43.0-wmf.23 (duration: 00m 58s) [04:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209287 (10phaultfinder) [04:44:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209372 (10phaultfinder) [04:44:50] FIRING: 
KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:34:53] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209411 (10phaultfinder) [05:55:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:59:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0600) [06:00:05] marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209430 (10phaultfinder) [06:14:16] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10209446 (10santhosh) @isarantopoulos Agreed, let us recheck after two weeks. From our team perspective,... 
[06:22:32] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [06:26:56] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:34:29] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [06:34:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10209455 (10phaultfinder) [06:45:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: T374215', diff saved to https://phabricator.wikimedia.org/P69492 and previous config saved to /var/cache/conftool/dbconfig/20241008-064548-arnaudb.json [06:45:55] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [06:48:57] (03PS1) 10Slyngshede: P:idm Add Taavi to list of "Account Manager". [puppet] - 10https://gerrit.wikimedia.org/r/1078535 (https://phabricator.wikimedia.org/T359820) [06:50:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078535 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [06:50:45] (03CR) 10Slyngshede: [C:03+2] P:idm Add Taavi to list of "Account Manager". [puppet] - 10https://gerrit.wikimedia.org/r/1078535 (https://phabricator.wikimedia.org/T359820) (owner: 10Slyngshede) [06:50:46] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Use ?? 
instead of default value in getRawVal() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077417 (https://phabricator.wikimedia.org/T376245) (owner: 10Fomafix) [06:52:05] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10209471 (10MoritzMuehlenhoff) @isarantopoulos Can you please ping this task once team-based permissions... [06:53:15] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10209472 (10MoritzMuehlenhoff) 05Open→03Stalled p:05Triage→03Medium [06:55:55] (03CR) 10Jelto: "thanks for uploading a workaround so quickly :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078460 (owner: 10Volans) [06:59:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to merge, all approvals are in and an NDA is on record." [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [07:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:00:14] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [07:00:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: T374215', diff saved to https://phabricator.wikimedia.org/P69493 and previous config saved to /var/cache/conftool/dbconfig/20241008-070053-arnaudb.json [07:01:03] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [07:01:13] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on pediapress.com - https://phabricator.wikimedia.org/T375761#10209484 (10MoritzMuehlenhoff) [07:10:21] !log depooling wdqs1013 (lag) [07:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:59] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:13:44] (03CR) 10Alexandros Kosiaris: [C:03+1] "TIL, I wasn't aware of this. Nice find. 
I think the change itself won't hurt (even if it doesn't work and we need the fallback plan), but " [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [07:15:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:15:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 15%: T374215', diff saved to https://phabricator.wikimedia.org/P69494 and previous config saved to /var/cache/conftool/dbconfig/20241008-071559-arnaudb.json [07:16:02] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [07:19:33] (03PS5) 10Ayounsi: sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) [07:19:36] (03PS1) 10Jelto: profile::requesttracker: delay blackbox checks for 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1078540 (https://phabricator.wikimedia.org/T376580) [07:20:34] (03PS2) 10Ayounsi: sre.hosts.provision: make UEFI opt-out [cookbooks] - 10https://gerrit.wikimedia.org/r/1078539
[07:22:10] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: 10Scott French) [07:22:37] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4244/console" [puppet] - 10https://gerrit.wikimedia.org/r/1078540 (https://phabricator.wikimedia.org/T376580) (owner: 10Jelto) [07:25:12] (03CR) 10Ebrahim: "I think this is worth to have so when LiquidThread is disabled this will be noticed better." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 (owner: 10Jdlrobson) [07:26:04] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4245/" [puppet] - 10https://gerrit.wikimedia.org/r/1078540 (https://phabricator.wikimedia.org/T376580) (owner: 10Jelto)
[07:31:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: T374215', diff saved to https://phabricator.wikimedia.org/P69495 and previous config saved to /var/cache/conftool/dbconfig/20241008-073104-arnaudb.json [07:31:08] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [07:39:17] (03CR) 10Ayounsi: [C:03+1] "Approach overall LGTM. If it gets too complex to proceed through DHCP it might become more viable to set the HTTP boot URL via Redfish and" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [07:42:28] (03CR) 10Ayounsi: "Should that be in the Spicerack module instead, to benefit other use-cases as we expand our dependency on Redfish ?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [07:43:34] (03CR) 10Ayounsi: Add efi support to partman (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [07:44:01] !log uploaded golang-github-jvgutierrez-go-etcd-harness 1.0.0 to apt.wm.o (bookworm-wikimedia) - T376600 [07:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:05] T376600: Provide debian packages for liberica - https://phabricator.wikimedia.org/T376600 [07:46:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: T374215', diff saved to https://phabricator.wikimedia.org/P69496 and previous config saved to /var/cache/conftool/dbconfig/20241008-074609-arnaudb.json [07:46:12] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [07:57:44] (03CR) 10Alexandros Kosiaris: [C:03+1] "+1 as a bandaid, but this smells of envoy and XFF weirdness. We should probably solve it there and then revert this." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [07:58:14] (03CR) 10Alexandros Kosiaris: [C:03+1] "Same comment as for the dependent commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063228 (https://phabricator.wikimedia.org/T372470) (owner: 10Hnowlan) [08:00:05] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T0800) [08:01:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: T374215', diff saved to https://phabricator.wikimedia.org/P69497 and previous config saved to /var/cache/conftool/dbconfig/20241008-080115-arnaudb.json [08:01:18] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [08:01:25] (03CR) 10Muehlenhoff: "Looks good, just a few typos inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) (owner: 10Slyngshede) [08:01:59] 👋 I'll be deploying the train this morning, rolling in 5 minutes [08:03:45] (03CR) 10Alexandros Kosiaris: [C:03+1] deployment_server: Set internal docker registry name by default [puppet] - 10https://gerrit.wikimedia.org/r/1078381 (https://phabricator.wikimedia.org/T376608) (owner: 10JMeybohm) [08:04:18] (03PS3) 10Slyngshede: Password change form. [software/bitu] - 10https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) [08:04:26] (03CR) 10Slyngshede: Password change form. 
(033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) (owner: 10Slyngshede) [08:04:53] (03PS1) 10Brouberol: Make SparkHistoryTestServiceUnavailable less sensitive to small metric flaps [alerts] - 10https://gerrit.wikimedia.org/r/1078602 [08:06:09] (03PS2) 10Brouberol: Make SparkHistoryServiceUnavailable less sensitive to small metric flaps [alerts] - 10https://gerrit.wikimedia.org/r/1078602 [08:07:07] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078604 (https://phabricator.wikimedia.org/T375657) [08:07:08] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078604 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [08:07:48] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078604 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [08:15:27] (03CR) 10Muehlenhoff: "Looks good, one typo inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: 10Slyngshede) [08:15:31] (03PS1) 10WMDE-Fisch: [config] Rename moved gadget name setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) [08:16:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: T374215', diff saved to https://phabricator.wikimedia.org/P69498 and previous config saved to /var/cache/conftool/dbconfig/20241008-081620-arnaudb.json [08:16:24] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [08:18:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) (owner: 10Slyngshede) [08:18:37] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1078607 (https://phabricator.wikimedia.org/T375881) [08:19:17] 06SRE, 06DBA, 10Sustainability (Incident Followup), 07Wikimedia-production-error: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10209632 (10ABran-WMF) →14Duplicate dup:03T376387 [08:19:49] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.26 refs T375657 [08:19:51] T375657: 1.43.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T375657 [08:19:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:20:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance [08:20:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: maintenance [08:20:48] !log repooling wdqs1013 [08:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:04] (03CR) 10Slyngshede: [C:03+2] Password change form. [software/bitu] - 10https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) (owner: 10Slyngshede) [08:21:06] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 (https://phabricator.wikimedia.org/T372405) (owner: 10AikoChou) [08:24:15] (03Merged) 10jenkins-bot: Password change form. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1076197 (https://phabricator.wikimedia.org/T365370) (owner: 10Slyngshede) [08:24:42] (03CR) 10Giuseppe Lavagetto: [C:03+2] git::replicated_local_repo: set mode of post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1078438 (owner: 10Giuseppe Lavagetto) [08:29:26] (03CR) 10STran: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078607 (https://phabricator.wikimedia.org/T375881) (owner: 10STran) [08:30:39] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078607 (https://phabricator.wikimedia.org/T375881) (owner: 10STran) [08:31:30] (03CR) 10Ayounsi: [C:03+1] "LGTM, the doc needs to be updated as well to match the new alerts/monitoring https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas" [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [08:36:42] (03CR) 10JMeybohm: [C:03+2] deployment_server: Set internal docker registry name by default [puppet] - 10https://gerrit.wikimedia.org/r/1078381 (https://phabricator.wikimedia.org/T376608) (owner: 10JMeybohm) [08:44:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:45:15] (03PS1) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [08:51:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [08:51:52] (03CR) 10Giuseppe Lavagetto: [C:03+1] Remove kubelet systemd unit dependency to docker.service [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1078447 
(https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [08:53:04] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [08:53:27] (03CR) 10JMeybohm: [C:03+2] Remove kubelet systemd unit dependency to docker.service [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1078447 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [08:53:41] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [08:54:13] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [08:55:07] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [08:55:35] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [08:56:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [08:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:59:36] (03CR) 10Elukey: "For the moment I'd prefer to avoid Spicerack since everything is in flux, and we haven't reached a final config yet. 
I am planning to also" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:00:41] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 (owner: 10Volans) [09:01:51] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [09:06:31] !log Ran `mwscript-k8s --comment="T376340" -- extensions/GlobalBlocking/maintenance/UpdateAutoBlockParentIdColumn.php --wiki=aawikibooks` [09:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:34] T376340: Update gb_autoblock_parent_id to use '0' instead of 'null' as the default - https://phabricator.wikimedia.org/T376340 [09:08:50] 10SRE-tools, 06Data-Persistence-SRE, 10Spicerack: mysql_legacy data_directory getter - https://phabricator.wikimedia.org/T376701 (10ABran-WMF) 03NEW [09:10:44] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:11:06] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:11:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2036.codfw.wmnet to cluster codfw and group C [09:12:36] !log Maintenance script for T376340 finished [09:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:39] T376340: Update gb_autoblock_parent_id to use '0' instead of 'null' as the default - https://phabricator.wikimedia.org/T376340 [09:12:46] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy data_directory getter - https://phabricator.wikimedia.org/T376701#10209765 (10ABran-WMF) 05Open→03In progress [09:14:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2036.codfw.wmnet to cluster codfw and group C 
[09:17:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1005.eqiad.wmnet [09:19:37] (03PS3) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [09:19:37] (03PS2) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [09:19:44] (03PS1) 10Muehlenhoff: Switch cloudcephosd1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078617 (https://phabricator.wikimedia.org/T349619) [09:19:47] (03PS2) 10Stevemunene: Change an-worker117[67] to use reuse partman recipe. [puppet] - 10https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) [09:20:15] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:20:42] (03PS1) 10Arnaudb: mariadb: add data directory accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) [09:21:00] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078617 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:23:18] (03CR) 10Elukey: [C:03+1] Fix issues reported by newer pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 (owner: 10Volans) [09:23:22] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [09:23:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:25:48] !log imported kubernetes 1.23.14-4 to component/kubernetes123 (buster, bullseye, bookworm) - T362408 [09:25:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:53] T362408: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408 [09:26:22] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:27:35] (03CR) 10Brouberol: [C:03+2] Make SparkHistoryServiceUnavailable less sensitive to small metric flaps [alerts] - 10https://gerrit.wikimedia.org/r/1078602 (owner: 10Brouberol) [09:29:04] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy data_directory getter - https://phabricator.wikimedia.org/T376701#10209803 (10ABran-WMF) p:05Triage→03Medium [09:29:43] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:30:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1005.eqiad.wmnet [09:32:16] (03CR) 10Elukey: [C:03+1] "Left a nit, you choose if you want to add it or not :)" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:32:30] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [09:32:55] (03PS4) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [09:32:55] (03PS3) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [09:33:14] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove kubelet systemd unit dependency to docker.service [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1078447 (https://phabricator.wikimedia.org/T362408) 
(owner: 10JMeybohm) [09:33:29] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:36:50] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:37:29] (03CR) 10Volans: [C:03+1] "LGTM, once well tested we could also make it the default and remove the additional logic." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [09:37:51] (03PS6) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [09:37:51] (03PS6) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [09:38:42] (03CR) 10Clément Goubert: [C:03+1] mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: 10Scott French) [09:38:45] (03CR) 10JMeybohm: "I do, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:39:17] (03CR) 10Alexandros Kosiaris: [C:04-1] "Minor readability comment, approach LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:39:19] (03CR) 10Volans: "optional approach inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078539 (owner: 10Ayounsi) [09:42:29] (03PS7) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [09:42:29] (03PS7) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [09:42:38] (03CR) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:43:21] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:46:29] (03PS5) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [09:46:29] (03PS4) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [09:47:45] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:48:06] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:48:34] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host 
sretest2001.codfw.wmnet with OS bookworm [09:48:58] (03CR) 10Elukey: [C:03+2] Fix issues reported by newer pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 (owner: 10Volans) [09:49:28] (03PS6) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [09:49:37] (03PS5) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [09:50:19] (03CR) 10Elukey: "Manually tested on sretest2001, it seems to work as expected (BMC reset it, then applied etc..)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:50:58] (03CR) 10Elukey: "This is not particularly urgent since it is a problem only for ml-* hosts." [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:52:55] !log installing freetype bugfix updates from Bookworm point update [09:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:45] (03CR) 10Lucas Werkmeister (WMDE): hawiki: Add temporary tagline for Vector-2022 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) (owner: 10Ammarpad) [09:59:34] (03CR) 10Volans: [C:03+1] "Makes sense, lgtm." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [09:59:50] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1000) [10:00:41] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706 (10klausman) 03NEW [10:02:23] (03CR) 10Volans: sre.hosts.provision: vary BIOS settings for Supermicro (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:04:19] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [10:04:57] (03CR) 10Elukey: "Me and Janis reviewed https://github.com/openstack/swift/blob/master/swift/common/middleware/ratelimit.py (the blame suggests the code did" [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [10:05:42] (03CR) 10AikoChou: [C:03+2] ml-services: update ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 (https://phabricator.wikimedia.org/T372405) (owner: 10AikoChou) [10:05:44] (03CR) 10Volans: "missing tests." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: 10Arnaudb) [10:06:40] 06SRE-OnFire, 06Data-Engineering, 06serviceops, 10Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994#10209907 (10Clement_Goubert) 05Open→03Resolved No more action needed on this incident. [10:06:44] (03Merged) 10jenkins-bot: ml-services: update ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 (https://phabricator.wikimedia.org/T372405) (owner: 10AikoChou) [10:07:07] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [10:08:01] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Engineering, 06Data-Platform-SRE, and 3 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068#10209933 (10Clement_Goubert) [10:08:57] (03CR) 10JMeybohm: [C:03+2] k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:09:15] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Race condition on puppetdb in sre.hosts.rename cookbook - https://phabricator.wikimedia.org/T374351#10209938 (10Clement_Goubert) 05Open→03Resolved I don't think this has reoccurred during the rest of the rename campaign, resolving [10:09:23] !log disabled puppet on all P:kubernetes::node [10:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:52] (03CR) 10Volans: [C:03+1] "In general LGTM, one question on the tests." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:15:53] (03PS2) 10Arnaudb: mariadb: add data directory accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) [10:15:58] (03PS8) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [10:15:58] (03PS1) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078626 (https://phabricator.wikimedia.org/T362408) [10:16:00] (03PS7) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [10:16:00] (03PS6) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [10:16:15] (03CR) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:16:26] (03CR) 10Arnaudb: mariadb: add data directory accessor (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: 10Arnaudb) [10:16:51] (03PS2) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078626 (https://phabricator.wikimedia.org/T362408) [10:16:51] (03PS9) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [10:17:16] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078626 (https://phabricator.wikimedia.org/T362408) (owner: 
10JMeybohm) [10:18:43] (03CR) 10Volans: sre.hosts.provision: vary BIOS settings for Supermicro (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:19:06] (03CR) 10JMeybohm: [C:03+2] k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078626 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:20:25] (03CR) 10Volans: "Still valid. I know it's pointless, but helps to keep the test stats clean without adding comments in the code to ignore the lines for cov" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: 10Arnaudb) [10:26:01] !log elukey@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [10:26:11] !log re-enable puppet on all P:kubernetes::node [10:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:56] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:50] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:33:01] (03CR) 10Clément Goubert: [C:03+1] [DNM] service: move mwdebug-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1072796 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [10:36:33] !log updated kubernetes 1.23.14-3 -> 1.23.14-4 on P:kubernetes::node - T362408 [10:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:36] T362408: Migration to containerd and away from docker - https://phabricator.wikimedia.org/T362408 [10:36:56] RESOLVED: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:49] (03PS7) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [10:41:07] (03CR) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:41:57] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:42:08] (03PS8) 10Elukey: sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) [10:43:26] (03CR) 10Elukey: [C:03+1] "Thanks for the change!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [10:43:32] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10210179 (10phaultfinder) [10:45:27] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bookworm [10:47:53] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:48:26] (03CR) 10Elukey: [C:03+1] "LGTM! Riccardo's point about tests is good, worth to follow up (will leave the" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:49:17] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: vary BIOS settings for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:49:21] (03CR) 10Elukey: [V:03+2 C:03+2] sre.hosts.provision: vary BIOS settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078613 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:49:57] !log elukey@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [10:49:58] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [10:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:53:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet [10:53:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for 
draining ganeti node ganeti2009.codfw.wmnet [10:53:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet [10:53:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10210219 (10ops-monitoring-bot) Draining ganeti2009.codfw.wmnet of running VMs [10:54:17] 06SRE, 10MW-on-K8s, 06serviceops: Update Parsoid wikitech documentation following mw-on-k8s migration - https://phabricator.wikimedia.org/T370646#10210213 (10Clement_Goubert) 05Open→03Resolved p:05Triage→03Low a:03Clement_Goubert [10:55:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet [10:55:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet [10:55:20] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10210221 (10ops-monitoring-bot) Draining ganeti2009.codfw.wmnet of running VMs [10:55:40] (03CR) 10Clément Goubert: [C:04-1] "Maybe we should hold off on this, at least for mw-api-ext, until WME has finished their baseline sync." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: 10Scott French) [10:55:47] (03CR) 10JMeybohm: [C:03+2] k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:57:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 936.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:58:56] (03PS2) 10Ammarpad: hawiki: Add temporary tagline for Vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) [10:59:13] (03PS1) 10Elukey: sre.hosts.provision: fix self.device_model_slug [cookbooks] - 10https://gerrit.wikimedia.org/r/1078636 (https://phabricator.wikimedia.org/T365372) [10:59:38] (03CR) 10Elukey: "/me plays "Shame! Shame! Shame!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1078636 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:59:53] (03CR) 10Ammarpad: hawiki: Add temporary tagline for Vector-2022 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) (owner: 10Ammarpad) [11:02:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 817.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:02:32] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10210228 (10MoritzMuehlenhoff) [11:06:32] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [11:09:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [11:11:42] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: fix self.device_model_slug [cookbooks] - 10https://gerrit.wikimedia.org/r/1078636 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [11:12:59] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:13:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:16:54] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [11:17:47] (03CR) 10Elukey: [C:03+1] "I tried the following local test:" [software/spicerack] 
- 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [11:27:33] (03CR) 10Stevemunene: Change an-worker117[67] to use reuse partman recipe. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [11:28:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bookworm [11:29:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1006.eqiad.wmnet [11:30:13] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2002.codfw.wmnet [11:30:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2002.codfw.wmnet [11:30:24] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2001.codfw.wmnet [11:33:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2001.codfw.wmnet [11:34:27] (03CR) 10Kamila Součková: [C:03+2] analytics_privatedata_users: add seanleong-wmde [puppet] - 10https://gerrit.wikimedia.org/r/1077441 (https://phabricator.wikimedia.org/T376034) (owner: 10Kamila Součková) [11:34:29] (03PS1) 10Muehlenhoff: Switch cloudcephosd1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078641 (https://phabricator.wikimedia.org/T349619) [11:35:16] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bookworm [11:36:29] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078641 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:40:30] 10ops-codfw, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10210271 (10MoritzMuehlenhoff) @Jhancock.wm If we have the DIMMs for puppetserver2003 already 
available, then let's proceed with that and handle 2001/2002 later when it's shipped over fro... [11:41:17] (03CR) 10Muehlenhoff: [C:03+2] Update point of contact for contracts formerly managed by Jean-Rene Branaa [puppet] - 10https://gerrit.wikimedia.org/r/1078434 (owner: 10Muehlenhoff) [11:43:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1006.eqiad.wmnet [11:47:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10210273 (10kamila) 05In progress→03Resolved [11:49:41] (03CR) 10Dzahn: [C:03+1] profile::requesttracker: delay blackbox checks for 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1078540 (https://phabricator.wikimedia.org/T376580) (owner: 10Jelto) [11:50:45] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10210281 (10Milimetric) This isn't alerting right now as far as I can tell, but we have new information that's probably related to the orig... 
[11:52:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1007.eqiad.wmnet [11:53:03] (03PS1) 10Muehlenhoff: Switch cloudcephosd1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078644 (https://phabricator.wikimedia.org/T349619) [11:54:40] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078644 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:56:41] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1200) [12:01:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [12:05:02] (03PS8) 10Tiziano Fogli: ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) [12:11:29] (03CR) 10Tiziano Fogli: [C:03+2] ripeatlas: move measurements checks to prom/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1069117 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [12:13:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1007.eqiad.wmnet [12:14:53] (03PS1) 10Ladsgroup: Remove flow from techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078645 (https://phabricator.wikimedia.org/T332022) [12:15:32] (03CR) 10CI reject: [V:04-1] Remove flow from techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078645 (https://phabricator.wikimedia.org/T332022) (owner: 10Ladsgroup) [12:15:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1008.eqiad.wmnet [12:16:48] (03PS2) 10Ladsgroup: Remove flow from techconductwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1078645 (https://phabricator.wikimedia.org/T332022) [12:17:50] (03PS1) 10Muehlenhoff: Switch cloudcephosd1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078649 (https://phabricator.wikimedia.org/T349619) [12:19:43] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bookworm [12:20:43] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078649 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:23:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet [12:25:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1008.eqiad.wmnet [12:26:33] !log remove ganeti2009 from active nodes T376594 [12:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:45] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [12:28:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [12:28:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10210370 (10ops-monitoring-bot) Draining ganeti2036.codfw.wmnet of running VMs [12:29:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [12:29:37] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host an-conf1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:31:15] FIRING: ProbeDown: Service ganeti2009:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [12:32:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:33:15] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:36:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-conf1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:38:10] jouncebot: nowandnext [12:38:10] For the next 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1200) [12:38:11] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1300) [12:38:20] (03CR) 10Ladsgroup: [C:03+2] Remove flow from techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078645 (https://phabricator.wikimedia.org/T332022) (owner: 10Ladsgroup) [12:39:02] (03Merged) 10jenkins-bot: Remove flow from techconductwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078645 (https://phabricator.wikimedia.org/T332022) (owner: 10Ladsgroup) [12:39:04] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078645 (https://phabricator.wikimedia.org/T332022) (owner: 10Ladsgroup) [12:39:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:39:50] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1078645|Remove flow from techconductwiki 
(T332022)]] [12:39:58] T332022: [Epic] Undeploying StructuredDiscussions (Flow) - https://phabricator.wikimedia.org/T332022 [12:42:14] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1078645|Remove flow from techconductwiki (T332022)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:42:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2005.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:43:54] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:43:55] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10210397 (10elukey) The new version of the cookbook is deployed, I am running it on insetup hosts listed in T376121 so we can apply the sa... [12:44:36] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:45:06] !log installing lua5.4 bugfix updates [12:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:30] (03CR) 10FNegri: [C:03+2] "LGTM, I'll merge this and update the repo, following https://wikitech.wikimedia.org/wiki/Reprepro#Updating_external_repositories" [puppet] - 10https://gerrit.wikimedia.org/r/1078420 (https://phabricator.wikimedia.org/T362867) (owner: 10Raymond Ndibe) [12:47:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:47:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - 
https://phabricator.wikimedia.org/T374536#10210420 (10MoritzMuehlenhoff) [12:49:18] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078645|Remove flow from techconductwiki (T332022)]] (duration: 09m 27s) [12:49:20] T332022: [Epic] Undeploying StructuredDiscussions (Flow) - https://phabricator.wikimedia.org/T332022 [12:50:15] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:50:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] hawiki: Add temporary tagline for Vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) (owner: 10Ammarpad) [12:51:15] RESOLVED: ProbeDown: Service ganeti2009:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:46] jouncebot: nowandnext [12:51:46] For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1200) [12:51:46] In 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1300) [12:53:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:53:49] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:55:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069293 (https://phabricator.wikimedia.org/T170001) (owner: 10MusikAnimal) 
[12:55:48] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ganeti2036.codfw.wmnet [12:55:58] 10SRE-tools, 06Data-Persistence-SRE, 10Spicerack: mysql_legacy: SQL query quote escape - https://phabricator.wikimedia.org/T376712 (10ABran-WMF) 03NEW [12:57:02] !log dropping povwatch_log on all.dblist (T54924 and T376627) [12:57:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [12:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:06] T54924: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924 [12:57:06] T376627: Drop ad-hoc tables in production - https://phabricator.wikimedia.org/T376627 [12:57:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10210467 (10MoritzMuehlenhoff) [12:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:58:03] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host krb1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [12:58:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host krb1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:00:03] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy: SQL query quote escape - https://phabricator.wikimedia.org/T376712#10210469 (10ABran-WMF) p:05Triage→03Medium [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. 
All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1300). [13:00:05] Ammar, Ammar, and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] o/ [13:00:42] feel free to deploy if you want Lucas_WMDE :) [13:00:59] ok, sure ^^ [13:01:10] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10210484 (10elukey) [13:01:30] (03PS1) 10Muehlenhoff: Remove ganeti2009/ganeti2010 from Ganeti role [puppet] - 10https://gerrit.wikimedia.org/r/1078660 (https://phabricator.wikimedia.org/T376594) [13:01:52] Lucas_WMDE OK [13:02:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) (owner: 10Ammarpad) [13:03:14] (03Merged) 10jenkins-bot: hawiki: Add temporary tagline for Vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) (owner: 10Ammarpad) [13:03:38] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1078396|hawiki: Add temporary tagline for Vector-2022 (T376049)]] [13:03:41] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049 [13:03:52] jouncebot: next [13:03:52] In 1 hour(s) and 56 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1500) [13:05:23] (03PS1) 10Volans: Fix issues reported by pylint >3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078663 [13:06:02] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host parsoidtest1001.mgmt.eqiad.wmnet with 
chassis set policy GRACEFUL_RESTART [13:06:04] !log lucaswerkmeister-wmde@deploy2002 ammarpad, lucaswerkmeister-wmde: Backport for [[gerrit:1078396|hawiki: Add temporary tagline for Vector-2022 (T376049)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:06:19] Ammar: please test the logo change :) [13:07:13] I see the tagline on https://ha.wikipedia.org/wiki/Babban_shafi (after a force-reload) [13:07:16] @Lucas_WMDE It works correctly [13:07:19] !log lucaswerkmeister-wmde@deploy2002 ammarpad, lucaswerkmeister-wmde: Continuing with sync [13:07:21] \o/ [13:07:30] Thank you [13:08:49] (03PS1) 10Jelto: wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) [13:09:10] (03CR) 10Arnaudb: [V:03+1 C:03+1] swift: avoid rate-limit for the Docker account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [13:09:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parsoidtest1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:10:10] (03CR) 10Snwachukwu: Change New Eventschemas Git URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [13:10:17] (03CR) 10Jelto: [V:03+1 C:03+2] profile::requesttracker: delay blackbox checks for 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1078540 (https://phabricator.wikimedia.org/T376580) (owner: 10Jelto) [13:11:04] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host deploy1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:11:32] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10210529 (10elukey) [13:11:55] !log lucaswerkmeister-wmde@deploy2002 Finished scap
sync-world: Backport for [[gerrit:1078396|hawiki: Add temporary tagline for Vector-2022 (T376049)]] (duration: 08m 17s) [13:11:58] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049 [13:12:14] (03PS1) 10Muehlenhoff: Point irc.w.o to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1078665 (https://phabricator.wikimedia.org/T376014) [13:13:50] (03CR) 10Slyngshede: [C:03+1] "Awesome, looks good." [dns] - 10https://gerrit.wikimedia.org/r/1078665 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:14:13] I’m trying to look through T376446 and understand if there’s actually consensus for the config change [13:14:13] T376446: Enable $wgMFCollapseSectionsByDefault on English Wiktionary - https://phabricator.wikimedia.org/T376446 [13:14:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host deploy1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:14:27] Jdlrobson: are you around by any chance? [13:15:37] Lucas_WMDE Jon agreed it's okay. Note that what they're discussing is whether they should do it via JS on wiki if the config change is declined [13:15:46] (03CR) 10Alexandros Kosiaris: [C:03+1] "I 'd say with a DP member +1e it, we are good to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [13:15:59] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-staging2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:16:10] (03CR) 10CI reject: [V:04-1] Fix issues reported by pylint >3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078663 (owner: 10Volans) [13:16:13] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:17:28] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:17:43] Ammar: but Jon also argued *against* the code that would implement the “expand if only one section” behavior, if I understand correctly [13:17:48] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078663 (owner: 10Volans) [13:18:20] (honestly I don’t understand the comment at https://phabricator.wikimedia.org/T376446#10204121 at all. Is it supposed to read “If this is for logged *in* users this is fine”?) [13:18:26] Lucas_WMDE okay fine, let's leave it [13:19:20] Ammar: I’ll leave a comment on the task [13:19:21] Jon is talking about the JS snippet posted in the comment immediately before that [13:20:27] (03PS1) 10Alexandros Kosiaris: mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) [13:20:52] alright, let’s proceed with the CodeMirror change then [13:20:57] TheresNoTime: want to self-service https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069293 ? 
[13:21:23] (03CR) 10CI reject: [V:04-1] mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: 10Alexandros Kosiaris) [13:21:26] Lucas_WMDE: I've closed terminals now, do you mind doing it? [13:21:30] sure [13:22:10] * Lucas_WMDE does a cheeky grep for CodeMirrorRTL in /srv/mediawiki-staging just to check [13:23:04] (03PS2) 10Alexandros Kosiaris: mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) [13:23:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069293 (https://phabricator.wikimedia.org/T170001) (owner: 10MusikAnimal) [13:23:44] (03PS1) 10Elukey: sre.hosts.provision: fix supermicro amd virtualization settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1078667 (https://phabricator.wikimedia.org/T365372) [13:24:17] (03CR) 10CI reject: [V:04-1] mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: 10Alexandros Kosiaris) [13:24:22] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [13:24:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [13:24:28] (03Merged) 10jenkins-bot: Remove $wgCodeMirrorRTL temporary feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069293 (https://phabricator.wikimedia.org/T170001) (owner: 10MusikAnimal) [13:24:55] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1069293|Remove $wgCodeMirrorRTL temporary feature flag (T170001 T357795)]] [13:24:58] (03PS2) 10Elukey: sre.hosts.provision: fix supermicro amd virtualization settings [cookbooks] - 
10https://gerrit.wikimedia.org/r/1078667 (https://phabricator.wikimedia.org/T365372) [13:24:59] T170001: Support CodeMirror syntax highlighting on RTL wikis - https://phabricator.wikimedia.org/T170001 [13:25:00] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [13:26:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10210578 (10Papaul) @klausman thank you for opening the task. Will it be possible for us to have the info on what DIMM(s) is having issues? T... [13:26:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10210561 (10Papaul) a:05Papaul→03None [13:26:53] TheresNoTime: I’m guessing there won’t be much to test on mwdebug for this? [13:26:57] seeing as it’s unused in the code [13:27:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, musikanimal: Backport for [[gerrit:1069293|Remove $wgCodeMirrorRTL temporary feature flag (T170001 T357795)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:27:15] Lucas_WMDE: nothing to test really, feel free to sync [13:27:18] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, musikanimal: Continuing with sync [13:28:03] (03CR) 10Elukey: [C:03+1] "\o/" [dns] - 10https://gerrit.wikimedia.org/r/1078665 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:31:05] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Decommission the alert1001 and alert2001 hosts - https://phabricator.wikimedia.org/T372607#10210589 (10Papaul) @andrea.denisse hello fyi; if you have this type of case where you have to decom a server in eqiad and the same in codfw best practice will... 
[13:31:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:31:52] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1069293|Remove $wgCodeMirrorRTL temporary feature flag (T170001 T357795)]] (duration: 06m 56s) [13:31:56] T170001: Support CodeMirror syntax highlighting on RTL wikis - https://phabricator.wikimedia.org/T170001 [13:31:56] T357795: CodeMirror 6 deployment - https://phabricator.wikimedia.org/T357795 [13:32:00] TheresNoTime: all done [13:32:08] Lucas_WMDE: thanks! :D [13:33:05] it’s probably a bit early to hope for a reply from Jon at T376446 [13:33:05] T376446: Enable $wgMFCollapseSectionsByDefault on English Wiktionary - https://phabricator.wikimedia.org/T376446 [13:33:18] so I think the enwiktionary config change will have to be postponed, sorry Ammar [13:33:35] no problem [13:33:49] !log UTC afternoon backport+config window done [13:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:40] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy: SQL query quote escape - https://phabricator.wikimedia.org/T376712#10210606 (10ABran-WMF) [13:35:04] (03PS3) 10Alexandros Kosiaris: mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) [13:35:05] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy data_directory getter - https://phabricator.wikimedia.org/T376701#10210609 (10ABran-WMF) [13:35:54] (03PS1) 10Herron: thanos-rule: add logstash_sli_availability:bool [puppet] - 10https://gerrit.wikimedia.org/r/1078671 (https://phabricator.wikimedia.org/T376638) [13:36:06] (03CR) 10Elukey: [C:03+1] Fix issues 
reported by pylint >3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078663 (owner: 10Volans) [13:37:48] (03CR) 10Zabe: [C:03+2] s5: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078412 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [13:37:50] (03PS1) 10Alexandros Kosiaris: mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078672 [13:38:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10210612 (10klausman) These are the most recent entries from ipmi SEL: `115 | Sep-30-2024 | 01:30:13 | ECC Uncorr Err | Memory... [13:38:33] (03Merged) 10jenkins-bot: s5: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078412 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [13:39:03] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1078412|s5: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [13:39:07] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [13:39:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10210627 (10Jclark-ctr) @Eevans if you can update site.pp for insetup so we can complete this ticket. 
thanks [13:40:37] (03CR) 10Elukey: [C:03+2] tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [13:41:15] !log zabe@deploy2002 zabe: Backport for [[gerrit:1078412|s5: Reduce revision-slots cache expiry to 60 seconds (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:36] (03CR) 10Arnaudb: [V:03+1 C:03+1] swift: avoid rate-limit for the Docker account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [13:41:41] !log zabe@deploy2002 zabe: Continuing with sync [13:42:49] (03CR) 10JMeybohm: [C:03+1] wikidata-query-gui: remove experimental endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078664 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:43:56] (03CR) 10Volans: [C:03+1] "LGTM, I'm affected too :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078667 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:44:07] (03CR) 10Zabe: [C:03+2] Stop setting wgAbuseFilterActorTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078415 (https://phabricator.wikimedia.org/T188180) (owner: 10Zabe) [13:44:57] (03Merged) 10jenkins-bot: Stop setting wgAbuseFilterActorTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078415 (https://phabricator.wikimedia.org/T188180) (owner: 10Zabe) [13:46:00] (03CR) 10Elukey: "I like it a lot, left a comment to address Riccardo's concerns. 
I think that there may be some rebase changes to make since I added new ch" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [13:46:14] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078412|s5: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 07m 10s) [13:46:17] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [13:46:27] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: fix supermicro amd virtualization settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1078667 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [13:46:45] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1078415|Stop setting wgAbuseFilterActorTableSchemaMigrationStage (T188180)]] [13:46:48] T188180: Read from and write to `actor` table in AbuseFilter - https://phabricator.wikimedia.org/T188180 [13:47:20] (03PS1) 10Slyngshede: Speed holes. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078675 [13:48:53] !log zabe@deploy2002 zabe: Backport for [[gerrit:1078415|Stop setting wgAbuseFilterActorTableSchemaMigrationStage (T188180)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:48:54] (03PS3) 10Slyngshede: ldapbackend: Remove post_save signal for user models. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) [13:49:01] (03CR) 10Slyngshede: ldapbackend: Remove post_save signal for user models. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: 10Slyngshede) [13:49:13] !log zabe@deploy2002 zabe: Continuing with sync [13:49:35] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-staging2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:50:09] (03CR) 10Herron: [C:03+2] thanos-rule: add logstash_sli_availability:bool [puppet] - 10https://gerrit.wikimedia.org/r/1078671 (https://phabricator.wikimedia.org/T376638) (owner: 10Herron) [13:50:25] (03CR) 10CI reject: [V:04-1] tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [13:52:49] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:53:00] (03PS1) 10JMeybohm: Migrate kubestage1003 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078677 (https://phabricator.wikimedia.org/T362408) [13:53:02] (03PS1) 10JMeybohm: Migrate kubestage1004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078678 (https://phabricator.wikimedia.org/T362408) [13:53:49] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078415|Stop setting wgAbuseFilterActorTableSchemaMigrationStage (T188180)]] (duration: 07m 03s) [13:53:51] T188180: Read from and write to `actor` table in AbuseFilter - https://phabricator.wikimedia.org/T188180 [13:54:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589#10210732 (10aborrero) checked the server today. No kernel panic. [13:58:04] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10210766 (10ssingh) >>! 
In T375014#10205990, @Volans wrote: > @ssingh what do you think of the above draft patch proposal?... [13:59:57] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:03:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:04:41] (03PS1) 10Ammarpad: sdwiki: Add new logo and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078680 (https://phabricator.wikimedia.org/T376536) [14:05:00] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:05:33] (03PS1) 10Herron: pyrra: add logstash-availability SLO [puppet] - 10https://gerrit.wikimedia.org/r/1078681 (https://phabricator.wikimedia.org/T376638) [14:08:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-misc2001 to codfw - jhancock@cumin2002" [14:08:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-misc2001 to codfw - jhancock@cumin2002" [14:08:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:08:41] (03CR) 10Herron: [C:03+2] pyrra: add logstash-availability SLO [puppet] - 10https://gerrit.wikimedia.org/r/1078681 (https://phabricator.wikimedia.org/T376638) (owner: 10Herron) [14:10:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1009.eqiad.wmnet [14:12:45] (03PS1) 10Tiziano Fogli: fix: ripeatlas puppet cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) [14:12:50] (03PS1) 10Muehlenhoff: Switch cloudcephosd1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078683 (https://phabricator.wikimedia.org/T349619) [14:13:33] (03CR) 10CI reject: [V:04-1] fix: 
ripeatlas puppet cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:13:36] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc2001 [14:13:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc2001 [14:13:54] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078683 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:14:07] (03CR) 10Kamila Součková: [C:03+1] Migrate kubestage1003 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078677 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:14:22] (03PS1) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [14:15:06] (03PS2) 10Tiziano Fogli: fix: ripeatlas puppet cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) [14:15:07] (03CR) 10CI reject: [V:04-1] prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:15:32] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev [14:15:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb2004-dev [14:15:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [14:16:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev'] [14:16:43] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - 
https://phabricator.wikimedia.org/T376121#10210910 (10elukey) [14:16:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [14:17:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudlb2004-dev'] [14:17:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [14:18:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10210943 (10elukey) @MoritzMuehlenhoff I can't re-run the provision cookbook on these nodes since the Redfish license is still not u... [14:19:53] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:20:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) (owner: 10Slyngshede) [14:21:13] (03PS2) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [14:22:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:22:24] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc2001 [14:22:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc2001 [14:22:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1009.eqiad.wmnet [14:22:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev'] [14:23:12] (03PS3) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) 
[14:23:29] (03CR) 10Hashar: "py39-unit failed due to:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [14:23:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1010.eqiad.wmnet [14:23:50] (03CR) 10CI reject: [V:04-1] prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [14:25:21] (03PS1) 10Muehlenhoff: Switch cloudcephosd1010 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078688 (https://phabricator.wikimedia.org/T349619) [14:25:52] (03PS4) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [14:26:08] (03PS4) 10JHathaway: dhcp: Add option to omit sending filename to a vendor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 [14:27:20] (03CR) 10JHathaway: "I agree, although I don't think this change increases the complexity very much." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [14:27:32] (03PS1) 10Brouberol: airflow: expose non-sensitive configuration in the web UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078689 [14:28:23] (03CR) 10JHathaway: Add efi support to partman (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [14:28:32] (03PS3) 10Tiziano Fogli: fix: ripeatlas puppet cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) [14:28:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589#10211007 (10Jclark-ctr) @aborrero i did just update idrac from 4.4 to 7.0. unrelated. but since i was logged in and causes no reboot. let us know if you w... 
[14:30:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [14:30:03] (03PS4) 10Tiziano Fogli: fix: ripeatlas puppet cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) [14:30:05] (03CR) 10Muehlenhoff: [C:03+2] Switch cloudcephosd1010 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1078688 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:30:14] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10211008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [14:30:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet [14:31:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10211011 (10ops-monitoring-bot) Draining ganeti2010.codfw.wmnet of running VMs [14:31:09] (03CR) 10Ayounsi: sre.hosts.provision: initial UEFI support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [14:31:26] (03PS5) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [14:31:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10211018 (10MoritzMuehlenhoff) [14:31:53] (03CR) 10Ayounsi: "Sounds good, thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [14:32:11] (03Abandoned) 10Ssingh: purged: set use_pki to true in magru [puppet] - 10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:32:40] (03PS11) 10JHathaway: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [14:33:12] (03Abandoned) 10Ssingh: purged: add Puppet overrides to use cfssl for certs in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1032106 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:34:12] (03PS12) 10JHathaway: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [14:34:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1010.eqiad.wmnet [14:34:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet [14:35:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet [14:35:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10211037 (10ops-monitoring-bot) Draining ganeti2010.codfw.wmnet of running VMs [14:35:58] (03PS2) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1077753 [14:36:06] (03CR) 10Ssingh: [C:03+2] tlsproxy: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075612 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [14:36:07] (03CR) 10Aklapper: [C:03+2] Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1077753 (owner: 10Pppery) [14:36:11] (03CR) 10Aklapper: [V:03+2 C:03+2] Update translations 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1077753 (owner: 10Pppery) [14:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:39] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:36:54] (03CR) 10Ayounsi: redfish: add UEFI functions (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [14:38:00] (03PS10) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [14:38:27] (03CR) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [14:39:48] (03CR) 10JHathaway: [C:03+2] dhcp: Add option to omit sending filename to a vendor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [14:40:38] (03CR) 10Ayounsi: sre.hosts.provision: make UEFI opt-out (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078539 (owner: 10Ayounsi) [14:40:48] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:41:08] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10211044 (10elukey) Next steps: * Apply https://gerrit.wikimedia.org/r/1078380 during tomorrow's MW Maintenance Window and retest... 
[14:41:18] !log installing python-aiosmtpd security updates [14:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:53] (03PS6) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [14:42:39] (03CR) 10Ayounsi: sre.hosts.provision: make UEFI opt-out (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078539 (owner: 10Ayounsi) [14:42:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10211045 (10andrea.denisse) Hi @Jclark-ctr @VRiley-WMF, this alert is still firing, please advise. [14:43:43] (03CR) 10Tiziano Fogli: [C:03+2] fix: ripeatlas puppet cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1078682 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10211050 (10phaultfinder) [14:44:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589#10211051 (10aborrero) >>! In T376589#10211007, @Jclark-ctr wrote: > @aborrero i did just update idrac from 4.4 to 7.0. unrelated. but since i was logged in a... [14:45:11] (03CR) 10Scott French: "Ah, yeah that's a good point now that we know that's coming. 
If we think it's likely that we'd need to add capacity to support that, then " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: 10Scott French) [14:49:41] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [14:49:58] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:50:55] (03PS3) 10Scott French: service: move mwdebug-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1072796 (https://phabricator.wikimedia.org/T372604) [14:50:55] (03PS3) 10Scott French: [DNM] service: move mwdebug-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1072798 (https://phabricator.wikimedia.org/T372604) [14:51:13] (03CR) 10Alexandros Kosiaris: [C:04-1] "LGTM on principle, couple of inline notes." [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [14:51:24] (03CR) 10Scott French: "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072796 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [14:51:38] (03PS11) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [14:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [14:56:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [14:56:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:50] !log mr1-magru ongoing maintenance [14:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] eoghan, jelto, arnoldokoth, and mutante: gettimeofday() says it's time for SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1500) [15:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: version upgrade [15:01:48] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: version upgrade [15:02:11] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: version upgrade [15:02:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: version upgrade [15:02:38] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phabricator.wikimedia.org with reason: version upgrade [15:02:40] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phabricator.wikimedia.org with reason: version upgrade [15:02:54] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: version upgrade [15:02:55] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: version upgrade [15:03:27] !log brennen@deploy2002 Started deploy [phabricator/deployment@40a63c9]: test deploy phab2002 for T376720 [15:03:30] T376720: Deploy Phabricator/Phorge 2024-10-08 - https://phabricator.wikimedia.org/T376720 [15:03:54] !log brennen@deploy2002 Finished deploy [phabricator/deployment@40a63c9]: test deploy phab2002 for T376720 (duration: 00m 26s) [15:04:00] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 
10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [15:04:27] !log brennen@deploy2002 Started deploy [phabricator/deployment@40a63c9]: deploy phab1004 for T376720 [15:05:35] !log brennen@deploy2002 Finished deploy [phabricator/deployment@40a63c9]: deploy phab1004 for T376720 (duration: 01m 07s) [15:07:57] (03CR) 10David Caro: "Hmm, I htin" [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [15:08:15] (03CR) 10David Caro: "Oops, ignore this one xd" [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [15:09:29] (03PS12) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [15:11:05] (03PS1) 10Hnowlan: Remove RunSingleJobStdin script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078700 (https://phabricator.wikimedia.org/T369048) [15:12:41] (03CR) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [15:12:59] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:15:08] (03PS7) 10Arturo Borrero Gonzalez: prometheus: add kernel-panic detector [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) [15:19:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [15:19:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudlb2004-dev'] [15:19:53] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [15:20:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudlb2004-dev'] [15:20:46] (03PS4) 10Hnowlan: php-cli: include mercurius in 8.1 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) [15:22:10] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [15:26:07] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [15:27:16] (03CR) 10Snwachukwu: Change New Eventschemas Git URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [15:30:00] 06SRE-OnFire, 10Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#10211164 (10Eevans) Q: Should this be a part of the MVP (i.e. Day 1), or saved for a subsequent iteration? I'm wondering whether —from a testing/rollout strategy if nothi... 
[15:31:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev'] [15:32:34] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e5-eqiad [15:32:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e5-eqiad [15:33:03] (03CR) 10David Caro: prometheus: add kernel-panic detector (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078684 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [15:33:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [15:33:15] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10211167 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [15:33:23] (03PS1) 10Tiziano Fogli: ripeatlas: clean up resource defs after deletion [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) [15:33:38] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f5-eqiad [15:33:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f5-eqiad [15:33:55] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f6-eqiad [15:34:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f6-eqiad [15:34:11] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e6-eqiad [15:34:18] (03PS2) 10Tiziano Fogli: ripeatlas: clean up resource defs after deletion [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) [15:34:19] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls 
(exit_code=0) for network device lsw1-e6-eqiad [15:34:27] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-e7-eqiad [15:34:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e7-eqiad [15:34:38] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [15:34:44] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f7-eqiad [15:34:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f7-eqiad [15:37:59] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:38:41] (03CR) 10JHathaway: [C:03+2] Add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:40:09] (03CR) 10Tiziano Fogli: "Just to close the clean-up activity." [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [15:41:35] !log mr1-magru end of maintenance [15:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:04] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10211194 (10elukey) I dumped all the files stored in swift in a text file on ms-fe1009, and ran the following: ` from pprint import pprint pr... 
[15:49:53] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10211237 (10Papaul) [15:50:08] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10211239 (10Papaul) [15:51:31] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726 (10Dreamy_Jazz) 03NEW [15:51:48] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211288 (10Dreamy_Jazz) [15:52:34] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211289 (10Dreamy_Jazz) [15:54:30] (03PS1) 10Eevans: aqs1022: change to role(insetup::data_persistence) [puppet] - 10https://gerrit.wikimedia.org/r/1078706 (https://phabricator.wikimedia.org/T372514) [15:56:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10211296 (10Eevans) @Jclark-ctr site.pp was updated back when the host was being ordered (see [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+... 
[15:57:21] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211297 (10Dreamy_Jazz) [15:58:41] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211298 (10Dreamy_Jazz) [15:59:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10211299 (10ayounsi) About phase 1. I checked the pfw1 config and steps here. Gave some feedback over IRC. Overall lgtm. I didn't check phase 2 yet, will do to... [15:59:03] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211295 (10Dreamy_Jazz) @Legoktm do you still have a use case... [16:00:05] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:39] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211313 (10Dreamy_Jazz) [16:02:15] rzl: jhathaway: if there's nothing planned for the puppet window, any objections if I use your spot for an LVS service turnup? 
[16:02:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:02:37] swfrench-wmf: please do [16:03:07] swfrench-wmf: all yours [16:03:29] awesome, thanks! [16:05:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-misc2001 to codfw - jhancock@cumin2002" [16:05:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-misc2001 to codfw - jhancock@cumin2002" [16:05:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:06:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev [16:06:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb2004-dev [16:06:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:06:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:08:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:08:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:08:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudlb2004-dev [16:08:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudlb2004-dev [16:09:17] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:10:36] (03PS1) 10Giuseppe Lavagetto: python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707 [16:10:36] (03PS1) 10Giuseppe Lavagetto: fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) [16:10:37] (03PS1) 10Giuseppe Lavagetto: profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) [16:12:57] (03CR) 10CI reject: [V:04-1] python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707 (owner: 10Giuseppe Lavagetto) [16:13:10] (03CR) 10CI reject: [V:04-1] profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [16:13:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:15:16] (03CR) 10Snwachukwu: Change New Eventschemas Git URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [16:17:05] (03CR) 10Btullis: [C:03+2] Change New Eventschemas Git URLs [puppet] - 10https://gerrit.wikimedia.org/r/1071891 (https://phabricator.wikimedia.org/T366836) (owner: 10Snwachukwu) [16:23:55] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad [16:24:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad [16:25:56] !log sukhe@cumin1002 START - Cookbook 
sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad [16:26:38] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad [16:28:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudlb2004-dev.codfw.wmnet [16:36:56] (03PS1) 10Dreamy Jazz: Remove wgGlobalBlockingAllowGlobalAccountBlocks as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078712 [16:36:57] (03PS1) 10Dreamy Jazz: Define wgGlobalBlockingEnableAutoblocks as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078713 (https://phabricator.wikimedia.org/T374853) [16:37:17] !log disable Puppet fleet-wide for puppetmaster1001 hardware maintenance [16:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:35] jouncebot: nowandnext [16:37:35] For the next 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1600) [16:37:35] In 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1700) [16:38:14] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10211475 (10MoritzMuehlenhoff) [16:38:43] swfrench-wmf: Have you completed your work? I'd like to merge some config changes. [16:38:59] Not a priority, so if you are still working on it mine can wait. [16:39:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudlb2004-dev.codfw.wmnet [16:39:55] Dreamy_Jazz: I'm on hold for the moment due to ongoing puppet maintenance, so assuming you mean mediawiki-config backports or the like, no objections on my end [16:40:09] Yeah. A mediawiki-config backport. 
[16:40:12] Thanks [16:40:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudlb2004-dev.codfw.wmnet [16:40:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10211486 (10Jhancock.wm) [16:40:49] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10211476 (10bking) 05Open→03Resolved a:03bking The VM `wdqs-categories1001` has been provisioned successfully, so... [16:41:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078713 (https://phabricator.wikimedia.org/T374853) (owner: 10Dreamy Jazz) [16:41:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078712 (owner: 10Dreamy Jazz) [16:41:54] (03Merged) 10jenkins-bot: Remove wgGlobalBlockingAllowGlobalAccountBlocks as unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078712 (owner: 10Dreamy Jazz) [16:41:56] (03Merged) 10jenkins-bot: Define wgGlobalBlockingEnableAutoblocks as false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078713 (https://phabricator.wikimedia.org/T374853) (owner: 10Dreamy Jazz) [16:42:24] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1078713|Define wgGlobalBlockingEnableAutoblocks as false (T374853)]], [[gerrit:1078712|Remove wgGlobalBlockingAllowGlobalAccountBlocks as unused]] [16:42:27] T374853: Update the GlobalBlockManager service to support global autoblocks - https://phabricator.wikimedia.org/T374853 [16:43:31] (03PS3) 10Jdlrobson: Expand Vector 2022 roll out and support local variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) [16:43:48] !log jmm@cumin2002 START - 
Cookbook sre.hosts.downtime for 1:00:00 on puppetserver1001.eqiad.wmnet with reason: RAM expansion [16:44:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetserver1001.eqiad.wmnet with reason: RAM expansion [16:44:17] 10ops-eqiad, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10211505 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ce403085-c2bd-4793-9f81-85a1032718c8) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with re... [16:44:36] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1078713|Define wgGlobalBlockingEnableAutoblocks as false (T374853)]], [[gerrit:1078712|Remove wgGlobalBlockingAllowGlobalAccountBlocks as unused]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:44:41] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [16:44:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cloudlb2004-dev.codfw.wmnet [16:47:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:48:13] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10211511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:48:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:48:30] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): 
Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10211512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed w... [16:49:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet [16:49:15] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078713|Define wgGlobalBlockingEnableAutoblocks as false (T374853)]], [[gerrit:1078712|Remove wgGlobalBlockingAllowGlobalAccountBlocks as unused]] (duration: 06m 50s) [16:49:18] T374853: Update the GlobalBlockManager service to support global autoblocks - https://phabricator.wikimedia.org/T374853 [16:49:59] I'm finished with my config changes. [16:50:04] *mediawiki-config [16:50:33] great, thanks! [16:57:09] !log enable Puppet fleet-wide for puppetmaster1001 hardware maintenance [16:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1700) [17:02:51] 10ops-eqiad, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10211576 (10MoritzMuehlenhoff) 05Open→03Resolved a:03VRiley-WMF [17:03:58] !log ran disable-puppet on 'A:lvs and 
(A:eqiad or A:codfw)' - T372604 [17:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:15] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:04:33] (03CR) 10Scott French: [C:03+2] service: move mwdebug-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1072796 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:04:39] (03CR) 10Ladsgroup: [C:03+1] "sanity checked" [puppet] - 10https://gerrit.wikimedia.org/r/1078706 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans) [17:05:17] 10ops-codfw, 06SRE, 06DC-Ops: codfw puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10211586 (10MoritzMuehlenhoff) [17:07:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:05] !log ran and enabled puppet-agent on 'A:lvs and A:eqiad' - T372604 [17:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:05] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T372604) [17:12:08] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:12:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:12:39] (03PS5) 10Jdlrobson: Dark mode: Make LiquidThreads namespace explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 [17:13:16] (03PS6) 10Jdlrobson: Dark mode: Make LiquidThreads namespace exclusion explicit 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 [17:13:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 (owner: 10Jdlrobson) [17:13:43] (03CR) 10Jdlrobson: [C:03+1] Turn on mobile support for Parsoid Read Views (but not on talk pages) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [17:17:28] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:17:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T372604) [17:17:56] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:21:53] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T372604) [17:25:39] (03PS1) 10Bvibber: Switch iOS back-compat video transcodes from HLS to regular QuickTime [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078715 (https://phabricator.wikimedia.org/T363966) [17:27:23] (03CR) 10RLazarus: mw-script: Add prometheus-statsd-exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: 10Alexandros Kosiaris) [17:27:53] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T372604) [17:27:56] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:28:41] (03CR) 10Jforrester: [C:03+1] Switch iOS back-compat video transcodes from HLS to 
regular QuickTime [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078715 (https://phabricator.wikimedia.org/T363966) (owner: 10Bvibber) [17:31:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078715 (https://phabricator.wikimedia.org/T363966) (owner: 10Bvibber) [17:32:05] (03PS1) 10CDanis: recursor: Use NSes in eqiad+codfw private IP space [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) [17:32:15] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [17:33:44] (03PS2) 10CDanis: recursor: Use NSes in eqiad+codfw private IP space [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) [17:33:46] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [17:34:38] !log ran and enabled puppet-agent on 'A:lvs and A:codfw' - T372604 [17:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:41] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:35:04] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T372604) [17:35:50] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T372604) [17:36:43] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737 (10ssingh) 03NEW [17:37:12] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom 
Config interchanged) - https://phabricator.wikimedia.org/T376737#10211781 (10ssingh) p:05Triage→03High [17:39:08] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T372604) [17:39:57] (03PS3) 10CDanis: recursor: Use NSes in eqiad+codfw private IP space [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) [17:40:01] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [17:45:07] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T372604) [17:45:13] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:45:43] (03PS4) 10Scott French: service: move mwdebug-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1072798 (https://phabricator.wikimedia.org/T372604) [17:45:56] (03CR) 10RLazarus: [C:03+1] "LGTM after the other one, thanks for cleaning this up!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078672 (owner: 10Alexandros Kosiaris) [17:46:06] (03PS4) 10CDanis: recursor: Use NSes in eqiad+codfw private IP space [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) [17:46:10] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [17:48:57] (03CR) 10Brouberol: [C:03+1] Change an-worker117[67] to use reuse partman recipe. 
[puppet] - 10https://gerrit.wikimedia.org/r/1077913 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [17:49:11] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10211831 (10VRiley-WMF) a:03VRiley-WMF [17:50:07] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211824 (10Legoktm) >>! In T376726#10211294, @Dreamy_Jazz wro... [17:52:59] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, 10Temporary accounts (Blockers to minor pilot wiki deployment): Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10211856 (10Dreamy_Jazz) >>! In T376726#10211824, @Legoktm wro... [17:53:36] (03CR) 10RLazarus: [C:03+1] service: move mwdebug-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1072798 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:54:07] (03PS1) 10Herron: alertmanager-irc: improve ErrorBudgetBurn SLO alert text [puppet] - 10https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) [17:56:20] (03CR) 10Scott French: "Thanks, Reuven!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1072798 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:56:23] (03CR) 10Scott French: [C:03+2] service: move mwdebug-next to production [puppet] - 10https://gerrit.wikimedia.org/r/1072798 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:56:45] (03PS1) 10CDanis: ferm: allow DNS traffic against k8s control planes [puppet] - 10https://gerrit.wikimedia.org/r/1078719 (https://phabricator.wikimedia.org/T344171) [17:56:59] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078719 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [17:57:33] (03PS1) 10RLazarus: mediawiki: Allow setting mwscript job activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) [17:59:15] (03PS1) 10RLazarus: deployment_server: Add --timeout flag to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1078721 (https://phabricator.wikimedia.org/T376099) [17:59:44] (03CR) 10CDanis: mediawiki: Allow setting mwscript job activeDeadlineSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) (owner: 10RLazarus) [18:01:45] (03PS1) 10Muehlenhoff: Restore access for ssastry [puppet] - 10https://gerrit.wikimedia.org/r/1078722 [18:04:07] (03CR) 10Muehlenhoff: [C:03+2] Restore access for ssastry [puppet] - 10https://gerrit.wikimedia.org/r/1078722 (owner: 10Muehlenhoff) [18:04:34] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10211918 (10wiki_willy) a:03RobH [18:22:51] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10211985 (10RobH) I chatted with @ssingh about this via IRC: The directions will be to pull the 8 of 9 
misc hosts and 8 cp hosts out of the racks. These... [18:26:33] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [18:31:18] (03PS5) 10CDanis: recursor: Use NSes in eqiad+codfw private IP space [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) [18:31:25] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [18:31:46] (03CR) 10CDanis: recursor: Use NSes in eqiad+codfw private IP space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [18:32:41] (03PS1) 10Elukey: sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1078726 (https://phabricator.wikimedia.org/T365372) [18:34:50] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:35:19] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [18:35:32] ^ sigh, wrong click lol [18:35:42] it's very unforgiving [18:36:12] https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&from=now-1h&to=now heh [18:38:21] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:38:49] (03CR) 10Ssingh: [C:03+1] "NOOP on doh*, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [18:39:09] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:39:38] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:40:29] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:40:54] !log 💙cdanis@cumin1002.eqiad.wmnet ~ 🕝☕ sudo cumin A:dnsbox 'disable-puppet "cdanis rolling out T344171 Ie7d5091bca40"' [18:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:57] T344171: Reverse DNS for k8s pods IPs - https://phabricator.wikimedia.org/T344171 [18:41:16] (03CR) 10CDanis: [C:03+2] recursor: Use NSes in eqiad+codfw private IP space [puppet] - 10https://gerrit.wikimedia.org/r/1078716 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [18:41:36] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:41:54] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:43:13] !log 💔cdanis@cumin1002.eqiad.wmnet ~ 🕝☕ sudo cumin -b1 -s120 A:dnsbox 'run-puppet-agent --enable "cdanis rolling out T344171 Ie7d5091bca40"' [18:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for pfw1 lo0 - pt1979@cumin2002" [18:43:59] (03PS3) 10Scott French: wmnet: add geoip discovery DYNA record for mw-debug-next [dns] - 10https://gerrit.wikimedia.org/r/1072794 (https://phabricator.wikimedia.org/T372604) [18:44:39] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for pfw1 lo0 - pt1979@cumin2002" [18:44:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:45:02] (03PS4) 10Scott French: wmnet: add geoip discovery DYNA record for mwdebug-next [dns] - 10https://gerrit.wikimedia.org/r/1072794 (https://phabricator.wikimedia.org/T372604) [18:46:18] (03CR) 10Ssingh: [C:03+1] wmnet: add geoip discovery DYNA record for mwdebug-next [dns] - 10https://gerrit.wikimedia.org/r/1072794 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:47:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:47:56] 06SRE, 06Data-Persistence, 06Data-Platform, 10Dumps-Generation, and 3 others: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10212078 (10kostajh) [18:49:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212090 (10phaultfinder) [18:50:51] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=mwdebug-next,name=codfw [reason: pooling mwdebug-next in codfw to match mwdebug - T372604] [18:50:54] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [18:51:07] 06SRE, 06Data-Persistence, 06Data-Platform, 10Dumps-Generation, and 3 others: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10212065 (10kostajh) [18:51:41] (03CR) 10Scott French: [C:03+2] wmnet: add geoip discovery DYNA record for mwdebug-next [dns] - 10https://gerrit.wikimedia.org/r/1072794 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:52:00] FIRING: 
[2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:52:46] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1078702 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [18:53:23] 10ops-magru, 06SRE, 06Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10212100 (10ssingh) Thanks for writing it down @RobH. 1. Ganeti hosts: I think we can simply point to another installserver if this means doing this in o... [18:54:56] !log ran authdns-update on dns1004 to pick up mwdebug-next record - T372604 [18:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:39] 06SRE, 06Data-Platform, 10Dumps-Generation, 06Trust and Safety Product Team, and 2 others: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10212111 (10Ladsgroup) [18:58:59] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:59:20] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:02:11] (03PS2) 10RLazarus: mediawiki: Allow setting mwscript job activeDeadlineSeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) [19:04:08] (03CR) 10RLazarus: mediawiki: Allow setting mwscript job activeDeadlineSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) (owner: 10RLazarus) [19:04:52] 06SRE, 06Data-Engineering, 06Data-Platform, 
10Dumps-Generation, and 3 others: Hide autoblocks from the globalblocks table database dump - https://phabricator.wikimedia.org/T376726#10212150 (10Ottomata) [19:09:21] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212159 (10Papaul) @Jgreen @Dwisehaupt when do you think you will have time to relocate the 4 servers in the table that have "YES" on the the New U space colu... [19:10:31] (03PS1) 10Hashar: README.md: doc loading a plugin from the browser [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1078735 [19:14:48] (03CR) 10Hashar: "Isabelle once tested one of the JavaScript plugin using the Chromium extension, I have since found the browser supports editing a remote J" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1078735 (owner: 10Hashar) [19:15:14] (03CR) 10Scott French: [C:03+1] mediawiki: Allow setting mwscript job activeDeadlineSeconds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) (owner: 10RLazarus) [19:16:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10212178 (10Jhancock.wm) looks like B1 is the problem. I do have a stick we can replace it with. we can do this first thing in the morning on... 
[19:17:09] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376537#10212189 (10VRiley-WMF) →14Duplicate dup:03T376094 [19:18:14] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - ms-be1077 / logging-hd1005 - https://phabricator.wikimedia.org/T376094#10212187 (10VRiley-WMF) [19:25:49] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Decommission the alert1001 and alert2001 hosts - https://phabricator.wikimedia.org/T372607#10212226 (10VRiley-WMF) [19:26:06] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Decommission the alert1001 and alert2001 hosts - https://phabricator.wikimedia.org/T372607#10212227 (10VRiley-WMF) 05Open→03Resolved [19:27:18] (03PS1) 10Herron: thanos-rule: adjust logstash-availability:bool recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1078739 (https://phabricator.wikimedia.org/T376638) [19:27:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [19:28:02] Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:30:14] (03CR) 10Scott French: deployment_server: Add --timeout flag to mwscript-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078721 (https://phabricator.wikimedia.org/T376099) (owner: 10RLazarus) [19:30:57] (03CR) 10Herron: [C:03+2] thanos-rule: adjust logstash-availability:bool recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1078739 (https://phabricator.wikimedia.org/T376638) (owner: 10Herron) [19:31:12] (03PS2) 10Herron: alertmanager-irc: improve ErrorBudgetBurn SLO alert text [puppet] - 10https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) [19:31:22] 
(03PS2) 10Herron: thanos-rule: adjust logstash-availability:bool recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1078739 (https://phabricator.wikimedia.org/T376638) [19:31:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10212252 (10VRiley-WMF) Hey @andrea.denisse We can continue to try to troubleshoot this error. Currently, it isn't showing any hardware fault through the iDRAC. Ho... [19:32:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:34:21] (03CR) 10Herron: [V:03+2 C:03+2] thanos-rule: adjust logstash-availability:bool recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1078739 (https://phabricator.wikimedia.org/T376638) (owner: 10Herron) [19:37:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:37:59] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:39:25] (03PS3) 10Herron: alertmanager-irc: improve ErrorBudgetBurn SLO alert text [puppet] - 10https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) [19:41:56] FIRING: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:42:50] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:46:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:51:31] (03PS5) 10Dzahn: gerrit: include gerrit profile in insetup::gerrit for testing [puppet] - 10https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) [19:54:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [19:54:48] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10212323 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [19:58:22] (03PS2) 10RLazarus: deployment_server: Add --timeout flag to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1078721 (https://phabricator.wikimedia.org/T376099) [19:58:28] (03CR) 10RLazarus: deployment_server: Add --timeout flag to mwscript-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1078721 (https://phabricator.wikimedia.org/T376099) (owner: 10RLazarus) [20:00:04] (03CR) 10Dzahn: [C:03+2] "This now makes gerrit2003 as similar to gerrit2002 as possible.. including the lfs data sync.. 
it's ok because it just pulls from the acti" [puppet] - 10https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T2000). [20:00:05] Jdlrobson and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] o/ [20:02:37] o/ [20:03:14] hi - i can deploy but i'm not sure what the current deployment server is - it's been a while - give me a sec [20:03:56] it's deploy2002.codfw.wmnet :) [20:04:03] ah - thank you [20:04:27] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:07:13] Jdlrobson: i'll start with your patches [20:07:46] (03PS4) 10Jdlrobson: Expand Vector 2022 roll out and support local variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) [20:09:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) (owner: 10Jdlrobson) [20:09:47] (03Merged) 10jenkins-bot: Expand Vector 2022 roll out and support local variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) (owner: 10Jdlrobson) [20:10:14] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1078454|Expand Vector 2022 roll out and support local variants (T375549)]] [20:10:15] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt backup1012 - jclark@cumin1002" [20:10:18] T375549: Deploy Vector 2022 as default to various sites - 
https://phabricator.wikimedia.org/T375549 [20:10:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt backup1012 - jclark@cumin1002" [20:10:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:11:50] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host backup1012 [20:11:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1012 [20:12:38] !log cjming@deploy2002 jdlrobson, cjming: Backport for [[gerrit:1078454|Expand Vector 2022 roll out and support local variants (T375549)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:38] Jdlrobson: lmk when to sync - 1st patch up on test servers [20:14:15] cjming: looking thanks [20:21:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T376758 (10ops-monitoring-bot) 03NEW [20:22:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T376758#10212446 (10VRiley-WMF) →14Duplicate dup:03T374540 [20:22:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10212443 (10VRiley-WMF) [20:23:24] hey cjming sorry this is taking a bit longer. We are checking the Vector 2022 one right? [20:23:33] i believe so [20:23:41] I'm not seeing it on the debug servers [20:23:48] hmm [20:23:49] am a bit confused if this is caching related or not [20:24:18] ah there we go it's working now [20:24:23] yay! [20:24:29] so ok to sync? [20:24:31] and looks good to sync! thank you! 
[20:24:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:24:37] !log cjming@deploy2002 jdlrobson, cjming: Continuing with sync [20:24:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:26:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit2003.wikimedia.org with reason: applying gerrit profile [20:26:46] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit2003.wikimedia.org with reason: applying gerrit profile [20:28:53] (03CR) 10Dzahn: [C:03+2] "noop on prod hosts. on new host: have to follow-up re: "Evaluation Error: Unknown variable: 'passwords::gerrit::gerrit_email_key'"" [puppet] - 10https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:29:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2003.wikimedia.org with reason: applying gerrit profile [20:29:23] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2003.wikimedia.org with reason: applying gerrit profile [20:29:42] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078454|Expand Vector 2022 roll out and support local variants (T375549)]] (duration: 19m 28s) [20:29:45] T375549: Deploy Vector 2022 as default to various sites - https://phabricator.wikimedia.org/T375549 [20:29:59] (03PS7) 10Jdlrobson: Dark mode: Make LiquidThreads namespace exclusion explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 [20:31:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 (owner: 10Jdlrobson) [20:31:46] (03Merged) 10jenkins-bot: Dark mode: Make LiquidThreads namespace exclusion 
explicit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1072562 (owner: 10Jdlrobson) [20:32:12] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1072562|Dark mode: Make LiquidThreads namespace exclusion explicit]] [20:34:22] !log cjming@deploy2002 jdlrobson, cjming: Backport for [[gerrit:1072562|Dark mode: Make LiquidThreads namespace exclusion explicit]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:34:51] Jdlrobson: 1st patch should be live, 2nd patch up on mwdebug [20:35:26] cjming: awesome [20:37:07] cjming: thanks LGTM! please sync [20:37:12] !log cjming@deploy2002 jdlrobson, cjming: Continuing with sync [20:39:46] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10212511 (10Dzahn) Based on the example content, I am thinking maybe those few users just understand how it works. So that @wikipedia.org won't go to her... [20:40:42] (03PS1) 10Dzahn: gerrit: move passwords include from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/1078748 (https://phabricator.wikimedia.org/T372804) [20:41:35] (03CR) 10Dzahn: [C:03+2] "need something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1078748 next" [puppet] - 10https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:42:11] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1072562|Dark mode: Make LiquidThreads namespace exclusion explicit]] (duration: 09m 58s) [20:42:27] Jdlrobson: 2nd patch should be live! [20:42:34] bvibber: still around? 
[20:42:47] cjming: yep :D [20:42:50] cool [20:43:05] (03PS2) 10Bvibber: Switch iOS back-compat video transcodes from HLS to regular QuickTime [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078715 (https://phabricator.wikimedia.org/T363966) [20:43:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078715 (https://phabricator.wikimedia.org/T363966) (owner: 10Bvibber) [20:44:02] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1078748/4249/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1078748 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:44:12] bvibber: presumably i should just sync when ready? [20:44:23] cjming: go for it :D should be safe [20:44:38] (i can confirm on testwiki that the config is as expected :D) [20:44:44] (03Merged) 10jenkins-bot: Switch iOS back-compat video transcodes from HLS to regular QuickTime [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078715 (https://phabricator.wikimedia.org/T363966) (owner: 10Bvibber) [20:44:44] (once it's synced to debug) [20:44:58] gtk - will do [20:45:07] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1078715|Switch iOS back-compat video transcodes from HLS to regular QuickTime (T363966)]] [20:45:11] T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966 [20:45:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q1): Degraded RAID on prometheus1008 - https://phabricator.wikimedia.org/T374540#10212529 (10andrea.denisse) I troubleshooted the error, there were 2 drives that were not added to the RAID array. I'm syncing data to the last of those drive, I'...
[20:46:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10212530 (10Jclark-ctr) [20:47:15] thanks cjming for the help today! [20:47:18] !log cjming@deploy2002 bvibber, cjming: Backport for [[gerrit:1078715|Switch iOS back-compat video transcodes from HLS to regular QuickTime (T363966)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:59] (03CR) 10Eevans: [C:03+2] aqs1022: change to role(insetup::data_persistence) [puppet] - 10https://gerrit.wikimedia.org/r/1078706 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans) [20:48:04] Jdlrobson: np - ur welcome! [20:48:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10212532 (10Jclark-ctr) I was having issues connecting to mgmt on a new server i just racked Might have to do with troubleshooting @elukey is doing... 
[20:48:10] (03CR) 10Dzahn: [V:03+1 C:03+2] "no changes detected on prod hosts, merging to fix puppet on new machine" [puppet] - 10https://gerrit.wikimedia.org/r/1078748 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:48:18] !log cjming@deploy2002 bvibber, cjming: Continuing with sync [20:48:26] good so far :D [20:49:56] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212539 (10Papaul) [20:52:04] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:52:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212551 (10Eevans) @Jclark-ctr OK, it's `insetup::data_persistence` [20:52:47] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078715|Switch iOS back-compat video transcodes from HLS to regular QuickTime (T363966)]] (duration: 07m 39s) [20:52:49] T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966 [20:52:56] bvibber: should be live! [20:53:16] cjming: looks good thanks!! [20:53:23] nice :) [20:53:26] confirmed new config is in effect \o/ [20:53:32] woohoo! [20:53:39] now to learn how to use k8s maint scripts to do backfill of transcodes lol [20:53:49] good luck! 
[20:53:52] thx :D [20:54:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:54:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1022.eqiad.wmnet with OS bullseye [20:54:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212555 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aqs1022.eqiad.wmnet with OS bullseye [20:54:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1022.eqiad.wmnet with OS bullseye [20:54:48] !log end of UTC late backport window [20:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212556 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aqs1022.eqiad.wmnet with OS bullseye execut... 
[20:56:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aqs1022.eqiad.wmnet with OS bullseye [20:56:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212561 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aqs1022.eqiad.wmnet with OS bullseye [20:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:59:12] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1022.eqiad.wmnet with reason: host reimage [21:01:20] (03CR) 10Dzahn: [V:03+1 C:03+2] "after this the puppet part fully works, just the initial "scap deploy-local" doesn't yet." [puppet] - 10https://gerrit.wikimedia.org/r/1078748 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:02:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1022.eqiad.wmnet with reason: host reimage [21:08:19] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212583 (10Papaul) [21:10:52] (03PS1) 10Dzahn: gerrit: sync lfs data also to new machine [puppet] - 10https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) [21:16:39] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [21:17:28] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:21:11] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [21:21:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1022.eqiad.wmnet with OS bullseye [21:21:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aqs1022.eqiad.wmnet with OS bullseye comple... [21:22:07] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1078721 (https://phabricator.wikimedia.org/T376099) (owner: 10RLazarus) [21:22:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212622 (10Jclark-ctr) @Eevans thanks it is finished now [21:23:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10212620 (10Jclark-ctr) a:05Eevans→03Jclark-ctr [21:26:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [21:26:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10212628 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host backup1012.eqiad.wmnet with OS bookworm [21:34:38] !log gerrit2003 - sudo -u gerrit-deploy /usr/bin/scap deploy-local --repo gerrit/gerrit -D log_json:False (for some reason this fails in puppet but works manually) T372804 T257317 T317412 [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:44] T372804: setup
gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804 [21:34:44] T257317: scap deploy --init on deployment server fails on first puppet run - https://phabricator.wikimedia.org/T257317 [21:34:45] T317412: Automate Gerrit deployment steps - https://phabricator.wikimedia.org/T317412 [21:35:06] !log running requeueTranscodes in k8s maint to clean up ios video transcodes (T363966) [21:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:09] T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966 [21:36:42] (03CR) 10Dzahn: [C:03+2] "also needed: sudo -u gerrit-deploy /usr/bin/scap deploy-local --repo gerrit/gerrit -D log_json:False" [puppet] - 10https://gerrit.wikimedia.org/r/1074477 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:41:26] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on gerrit2003.wikimedia.org with reason: initial gerrit deploy wip [21:41:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on gerrit2003.wikimedia.org with reason: initial gerrit deploy wip [21:47:28] RESOLVED: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:57:55] 10ops-eqiad, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376764 (10phaultfinder) 03NEW [21:59:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:59:58] !log removing 3 files for legal compliance [21:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:11:37] !log removing 3 files for legal compliance [22:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:56] (03PS1) 10Dzahn: gerrit: avoid duplicate declaration error on first setup [puppet] - 10https://gerrit.wikimedia.org/r/1078759 (https://phabricator.wikimedia.org/T372804) [22:16:25] !log removing 1 file for legal compliance [22:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:24:38] (03CR) 10Dzahn: "well.. this does a couple things that are definitely needed, like nftables rules and hosts_allowed, but it doesn't setup the actual rsync " [puppet] - 10https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [22:29:07] (03PS2) 10Dzahn: gerrit: avoid duplicate declaration error on first setup [puppet] - 10https://gerrit.wikimedia.org/r/1078759 (https://phabricator.wikimedia.org/T372804) [22:32:13] !log removing 3 files for legal compliance [22:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:18] !log removing 1 file for legal compliance [22:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:41] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212771 (10Papaul) @Jhancock.wm we are going to put civi2001 on the new switch on port 7 since on U6 we have a 2U server so we will just be using port 6 and po... 
[22:38:38] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212778 (10Papaul) [22:42:01] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1078759/4252/" [puppet] - 10https://gerrit.wikimedia.org/r/1078759 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [22:50:01] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop confirmed on prod machines. expected fix on new machine not working though." [puppet] - 10https://gerrit.wikimedia.org/r/1078759 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [22:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:52:27] (03PS1) 10Jforrester: Update Z669x references to Z609x [extensions/WikiLambda] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1078762 [22:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10212802 (10phaultfinder) [23:01:08] (03PS1) 10ZhaoFJx: zhwiki: Allow event-organizer self remove usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) [23:08:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:28:50] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: 10ZhaoFJx) [23:37:59] FIRING: [2x] CertAlmostExpired: Certificate for service 
echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1078765 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1078765 (owner: 10TrainBranchBot)