[00:01:56] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:02:40] (03PS1) 10Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [00:03:22] (03CR) 10CI reject: [V:04-1] Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [00:03:23] (03CR) 10Pppery: "Uploaded an alternate approach at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1078122." [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [00:04:32] (03PS2) 10Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [00:04:56] (03PS3) 10Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [00:05:03] (03PS4) 10Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [00:05:27] (03PS5) 10Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [00:07:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1078121 (owner: 10TrainBranchBot) [00:09:50] (03CR) 10Pppery: "The idea is to do this for the other instances where a 
Wikipedia also hosts another Wikimedia project as a namespace. I'm starting with Sc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [00:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10205519 (10phaultfinder) [00:28:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 970.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:33:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 1.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:57:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:09:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:00:29] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:31:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:12:58] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10205559 (10phaultfinder) [03:58:31] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:01:56] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:44:48] 10ops-eqiad, 06SRE, 
06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10205564 (10phaultfinder) [04:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:09:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10205566 (10phaultfinder) [05:09:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:00:29] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:18:32] (03PS2) 10Jelto: wikidata-query-gui: fix port already in use issue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077919 (https://phabricator.wikimedia.org/T350793) 
[06:22:23] (03PS1) 10Muehlenhoff: Remove LDAP access for jrbranaa [puppet] - 10https://gerrit.wikimedia.org/r/1078227 [06:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10205596 (10phaultfinder) [06:25:28] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for jrbranaa [puppet] - 10https://gerrit.wikimedia.org/r/1078227 (owner: 10Muehlenhoff) [06:29:17] (03CR) 10JMeybohm: [C:03+1] wikidata-query-gui: fix port already in use issue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077919 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [06:30:50] (03PS1) 10Muehlenhoff: Remove LDAP access for wdoran [puppet] - 10https://gerrit.wikimedia.org/r/1078228 [06:31:14] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:33:35] (03PS2) 10Muehlenhoff: Remove LDAP access for wdoran [puppet] - 10https://gerrit.wikimedia.org/r/1078228 [06:36:41] (03PS3) 10Tiziano Fogli: kafka: port mirror maker alerts to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [06:41:14] RESOLVED: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:41:56] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:42:26] 10SRE-Access-Requests: Aceess to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585 (10KartikMistry) 03NEW [06:43:12] 10SRE-Access-Requests, 10LPL Essential (LPL Essential 2024 Jul-Sep): Aceess to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10205619 (10KartikMistry) [06:45:29] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:46:42] 10SRE-Access-Requests, 10LPL Essential (LPL Essential 2024 Jul-Sep): Aceess to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10205621 (10santhosh) @isarantopoulos As discussed, @KartikMistry will be deploying recommendation API for LPL team. If he can get access to... [06:48:02] (03CR) 10Muehlenhoff: [C:03+2] Remove LDAP access for wdoran [puppet] - 10https://gerrit.wikimedia.org/r/1078228 (owner: 10Muehlenhoff) [06:50:52] (03PS4) 10Tiziano Fogli: kafka: port mirror maker alerts to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) [06:51:26] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077412 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [06:58:02] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: fix port already in use issue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077919 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [06:59:11] (03Merged) 10jenkins-bot: wikidata-query-gui: fix port already in use issue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077919 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! 
You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T0700).
[07:00:05] msz2001, Ammar, and Ammar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:55] o/
[07:04:04] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[07:04:20] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 64315
[07:04:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64315
[07:04:29] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[07:12:30] (CR) Tiziano Fogli: [C:-1] kafka: port mirror maker alerts to alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli)
[07:12:58] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:13:57] Msz2001: I can deploy
[07:14:19] Thanks! Please deploy
[07:15:30] ahh that is logos.php that got manually adjusted isn't it?
[07:15:50] so that change must be made to the YAML file and then a script should be run to update the logos.php file, isn't it?
[07:16:11] :)
[07:17:19] I just reverted the original change, I thought in such case the script run is not needed
[07:17:46] :)
[07:18:04] ah yeah sorry I mixed it up
[07:18:54] (CR) Hashar: [C:+2] Revert "wikimaniawiki: Update logos to 2024" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077422 (https://phabricator.wikimedia.org/T376292) (owner: Msz2001)
[07:19:28] (Merged) jenkins-bot: Revert "wikimaniawiki: Update logos to 2024" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077422 (https://phabricator.wikimedia.org/T376292) (owner: Msz2001)
[07:19:30] (CR) CI reject: [V:-1] logos: Sync config.yaml and logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1077392 (https://phabricator.wikimedia.org/T374430) (owner: Ammarpad)
[07:19:31] (CR) CI reject: [V:-1] hawiki: Add temporary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1077400 (https://phabricator.wikimedia.org/T376049) (owner: Ammarpad)
[07:19:34] oops
[07:19:43] Msz2001: I am doing your change now
[07:19:55] Okay
[07:20:24] poor stashbot
[07:20:41] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077422|Revert "wikimaniawiki: Update logos to 2024"]]
[07:22:52] I have restarted stashbot but that does not make it join back
[07:22:55] :/
[07:24:41] let's kill sal as well
[07:24:53] ah I can't
[07:24:54] :D
[07:27:39] Msz2001: your patch is still in progress
[07:27:42] Ammar: good morning!
:)
[07:30:51] hashar good morning
[07:31:07] !log hashar@deploy2002 msz2001, hashar: Backport for [[gerrit:1077422|Revert "wikimaniawiki: Update logos to 2024"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:31:27] (CR) Hashar: [C:+2] "CI failed due to an infrastructure issue (DNS entries could not be resolved)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1077392 (https://phabricator.wikimedia.org/T374430) (owner: Ammarpad)
[07:31:49] hashar: Can confirm it works
[07:31:51] !log hashar@deploy2002 msz2001, hashar: Continuing with sync
[07:31:58] Msz2001: awesome :) thank you for the patch:
[07:31:59] !
[07:32:10] (Merged) jenkins-bot: logos: Sync config.yaml and logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1077392 (https://phabricator.wikimedia.org/T374430) (owner: Ammarpad)
[07:32:12] (Merged) jenkins-bot: hawiki: Add temporary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1077400 (https://phabricator.wikimedia.org/T376049) (owner: Ammarpad)
[07:32:15] Ammar: I will deploy your two changes next. They have failed CI because of some issue in our infrastructure
[07:35:41] hashar: Okay thanks
[07:36:11] Ammar: is it that logos.php should never be changed manually but always generated from a modification made to logos/config.yaml?
[07:36:22] I feel like CI should enforce it and prevent manual changes made to the php file
[07:37:44] hashar I thought so. But I am only doing it now, because if I run the scripts on the hawiki logo change, the changes get added there
[07:37:53] yeah +1
[07:38:47] 07:38:42 K8s deployment progress: 33% (ok: 807; fail: 0; left: 1625) -
[07:42:21] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077422|Revert "wikimaniawiki: Update logos to 2024"]] (duration: 21m 40s)
[07:42:55] Thanks!
[07:43:09] Msz2001: thank you for taking care of logos updates!
:b
[07:43:12] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1077392|logos: Sync config.yaml and logos.php (T374430)]], [[gerrit:1077400|hawiki: Add temporary logo (T376049)]]
[07:43:15] T374430: Change logos in Arabic Wikipedia - https://phabricator.wikimedia.org/T374430
[07:43:15] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049
[07:45:15] !log hashar@deploy2002 ammarpad, hashar: Backport for [[gerrit:1077392|logos: Sync config.yaml and logos.php (T374430)]], [[gerrit:1077400|hawiki: Add temporary logo (T376049)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:46:18] (PS1) Slyngshede: ldapbackend: Remove post_save signal for user models. [software/bitu] - https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601)
[07:47:13] ah
[07:48:44] Ammar: I don't see the logo :/
[07:48:50] hashar is it ready?
[07:48:58] on the debug servers ye
[07:48:59] s
[07:49:30] ah https://ha.wikipedia.org/wiki/Babban_shafi?useskin=monobook
[07:49:32] it shows up there
[07:49:59] !log hashar@deploy2002 ammarpad, hashar: Continuing with sync
[07:50:00] hashar oh yes, I couldn't see it WikimediaDebug either
[07:50:13] that works with the old monobook skin though
[07:50:24] maybe some other variable needs to be adjusted for the default skin?
[07:51:22] Ok, it works now even with Vector: https://ha.wikipedia.org/wiki/Babban_shafi (WikimediaDebug)
[07:52:52] It's okay for me (note it's not expected to work with the default skin, Vector-2022, that needs separate patch)
[07:53:00] ahh ok
[07:54:19] (CR) Volans: "comment inline" [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: JHathaway)
[07:54:31] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077392|logos: Sync config.yaml and logos.php (T374430)]], [[gerrit:1077400|hawiki: Add temporary logo (T376049)]] (duration: 11m 19s)
[07:54:35] T374430: Change logos in Arabic Wikipedia - https://phabricator.wikimedia.org/T374430
[07:54:35] T376049: Requesting temporary logo change for ha.wiki - https://phabricator.wikimedia.org/T376049
[07:55:22] (PS1) Elukey: registry: expand the HTTP Accept headers [docker-images/docker-report] - https://gerrit.wikimedia.org/r/1078345
[07:55:38] Ammar: all done!
[07:56:03] !log UTC morning backport window completed
[07:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T374215 db1233 depool as clone source for db1246', diff saved to https://phabricator.wikimedia.org/P69471 and previous config saved to /var/cache/conftool/dbconfig/20241007-075611-arnaudb.json
[07:56:14] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215
[07:56:30] hashar Thank you!
[07:56:34] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [07:57:09] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [07:58:31] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:00:16] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [08:01:29] (03CR) 10Elukey: "Totally understand the push back, and I want to make clear that it is not me nitpicking :( My main concern is that we are touching the van" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [08:01:34] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [08:01:57] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [08:02:16] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [08:02:19] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] [08:02:32] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [08:02:38] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] (duration: 00m 18s) [08:10:35] 06SRE, 10SRE-Access-Requests, 10LPL Essential (LPL Essential 2024 Jul-Sep): Aceess to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10205773 (10MoritzMuehlenhoff) @calbon This needs your approval. 
[08:12:48] (03CR) 10Volans: "I totally agree with Luca, at the very least the new mechanism need to be tested with normal reimages in Boot mode for all OSes (buster, b" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [08:13:05] 06SRE, 10SRE-Access-Requests, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10205776 (10santhosh) [08:13:48] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: disable cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/1078346 [08:14:10] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: disable cadvisor [puppet] - 10https://gerrit.wikimedia.org/r/1078346 [08:14:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10205777 (10phaultfinder) [08:15:49] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078346 (owner: 10Arturo Borrero Gonzalez) [08:24:11] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] [08:24:24] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] (duration: 00m 13s) [08:24:58] (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "this service may be useful to collect cgroup usage stats, see https://wikitech.wikimedia.org/wiki/Cadvisor" [puppet] - 10https://gerrit.wikimedia.org/r/1078346 (owner: 10Arturo Borrero Gonzalez) [08:29:03] (03CR) 10Ayounsi: [C:03+1] efi: add efi boot files on apt server [puppet] - 10https://gerrit.wikimedia.org/r/1078020 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [08:29:37] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [08:29:50] FIRING: [2x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node 
Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:31:02] (03PS1) 10Arturo Borrero Gonzalez: prometheus: cadvisor: declare dependency on network being online [puppet] - 10https://gerrit.wikimedia.org/r/1078349 [08:31:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:32:17] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@1699d34]: Refine staging fixes [airflow-dags@1699d34f] [08:32:55] Hi! An hour ago, I had my patch about changing Wikimania wiki logo deployed. However, it turns out, that (apart from deploying a patch) it's needed to purge the relevant image from cache (https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Change_the_logo_of_a_Wikimedia_wiki). Can I ask someone who has rights to purge them? These files were changed: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1077422 [08:33:07] 06SRE, 10SRE-Access-Requests, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10205814 (10KartikMistry) @Nikerabbit This needs your approval from the LPL team side. 
[08:37:00] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@1699d34]: Refine staging fixes [airflow-dags@1699d34f] (duration: 04m 43s) [08:39:24] (03PS1) 10Arturo Borrero Gonzalez: keepalived: delcare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 [08:39:43] (03CR) 10CI reject: [V:04-1] keepalived: delcare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [08:41:01] (03PS2) 10Arturo Borrero Gonzalez: keepalived: delcare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 [08:41:13] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [08:44:34] (03PS3) 10Arturo Borrero Gonzalez: keepalived: delcare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 [08:44:50] FIRING: [2x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:46:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:49:05] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [08:49:21] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10205869 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [08:51:46] (03PS4) 10Arturo Borrero Gonzalez: keepalived: declare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 [08:55:03] (03PS1) 10Muehlenhoff: Turn ganeti203[56] into Ganeti servers [puppet] - 
10https://gerrit.wikimedia.org/r/1078352 (https://phabricator.wikimedia.org/T376594) [08:55:59] (03CR) 10Elukey: "Tried live on build2001 and got this other nice msg:" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1078345 (owner: 10Elukey) [08:56:41] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [08:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:58:02] (03CR) 10Muehlenhoff: [C:03+2] Turn ganeti203[56] into Ganeti servers [puppet] - 10https://gerrit.wikimedia.org/r/1078352 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [08:59:23] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 10Spicerack: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596 (10ABran-WMF) 03NEW [09:01:42] (03PS4) 10Ayounsi: sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) [09:03:06] (03CR) 10JMeybohm: [C:03+2] kubelet: Remove --pod-infra-container-image when using containerd [puppet] - 10https://gerrit.wikimedia.org/r/1077412 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:03:29] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10205940 (10ABran-WMF) 05Open→03In progress p:05Triage→03Medium a:03ABran-WMF [09:04:40] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10205946 (10Volans) Spicerack 
has support for prometheus, why not getting the metrics directly from there? [09:07:12] (03CR) 10Arnaudb: sre.mysql.pool: add two new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [09:11:28] (03CR) 10Volans: sre.mysql.pool: add two new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [09:12:59] (03CR) 10Arnaudb: sre.mysql.pool: add two new cookbooks (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T374026) (owner: 10Volans) [09:13:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [09:14:31] (03CR) 10David Caro: keepalived: declare the service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [09:18:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 1%: T374215', diff saved to https://phabricator.wikimedia.org/P69473 and previous config saved to /var/cache/conftool/dbconfig/20241007-091854-arnaudb.json [09:18:57] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [09:19:31] (03PS5) 10Arturo Borrero Gonzalez: keepalived: declare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 [09:19:47] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [09:19:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: T374215', diff saved to https://phabricator.wikimedia.org/P69474 and previous config saved to /var/cache/conftool/dbconfig/20241007-091953-arnaudb.json [09:20:27] (03CR) 10Arturo Borrero Gonzalez: keepalived: declare the service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero 
Gonzalez) [09:21:12] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10205989 (10ABran-WMF) I was unaware of that feature, its way better indeed :) [09:21:14] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10205990 (10Volans) @ssingh what do you think of the above draft patch proposal? If that works for you I'll complete it and... [09:25:12] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [09:25:45] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1078025 (https://phabricator.wikimedia.org/T376528) (owner: 10Aklapper) [09:27:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'missing commit', diff saved to https://phabricator.wikimedia.org/P69476 and previous config saved to /var/cache/conftool/dbconfig/20241007-092714-arnaudb.json [09:30:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: maintenance [09:30:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: maintenance [09:31:40] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589#10206010 (10aborrero) [09:32:28] (03PS1) 10Muehlenhoff: Remove Will Doran as approver for dumps/snapshots access groups [puppet] - 10https://gerrit.wikimedia.org/r/1078358 [09:33:46] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): cloudgw1002: network interface problem - https://phabricator.wikimedia.org/T376589#10206023 (10aborrero) hey @VRiley-WMF or @Jclark-ctr have you seen this error before on any network card or 
related? rings any bell? Do you think that upgrading the NIC... [09:34:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 2%: T374215', diff saved to https://phabricator.wikimedia.org/P69477 and previous config saved to /var/cache/conftool/dbconfig/20241007-093359-arnaudb.json [09:34:03] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [09:38:17] 06SRE, 06Infrastructure-Foundations, 06serviceops, 07Datacenter-Switchover, 13Patch-For-Review: sre.discovery.datacenter should support switching the active/passive services to the other datacenter - https://phabricator.wikimedia.org/T335364#10206034 (10Clement_Goubert) The code hasn't been reviewed and... [09:43:03] (03Abandoned) 10Slyngshede: Test Account blocking [software/bitu] - 10https://gerrit.wikimedia.org/r/1066723 (owner: 10Slyngshede) [09:43:42] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078358 (owner: 10Muehlenhoff) [09:44:13] (03Abandoned) 10Slyngshede: P:ircstream allow config to switch between UDP and SSE. [puppet] - 10https://gerrit.wikimedia.org/r/1077386 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [09:49:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 5%: T374215', diff saved to https://phabricator.wikimedia.org/P69478 and previous config saved to /var/cache/conftool/dbconfig/20241007-094904-arnaudb.json [09:49:08] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [09:51:11] (03PS1) 10Lucas Werkmeister (WMDE): tables-catalog: Add WikibaseQualityConstraints table [puppet] - 10https://gerrit.wikimedia.org/r/1078361 (https://phabricator.wikimedia.org/T363581) [09:51:50] (03CR) 10Lucas Werkmeister (WMDE): "Cognate should also be added but that’s in x1 and I don’t know how that’s handled here." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078361 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [09:53:10] (03CR) 10Hashar: [C:04-1] "Gerrit is fronted by Apache 2 and I suppose it is using `mod_ssl`. From https://httpd.apache.org/docs/2.4/mod/mod_ssl.html , the module ex" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [09:54:31] (03CR) 10FNegri: [C:03+1] keepalived: declare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [09:54:32] (03CR) 10Hashar: "@Jelto you can merge it anytime. Puppet will eventually update the HTML page and Gerrit should serve it immediately :)" [puppet] - 10https://gerrit.wikimedia.org/r/1078025 (https://phabricator.wikimedia.org/T376528) (owner: 10Aklapper) [09:55:11] (03PS2) 10Lucas Werkmeister (WMDE): tables-catalog: Add WikibaseQualityConstraints table [puppet] - 10https://gerrit.wikimedia.org/r/1078361 (https://phabricator.wikimedia.org/T363581) [09:56:28] (03CR) 10Jelto: [C:03+2] Update "Reset Password" URI in Gerrit footer from wikitech to idm [puppet] - 10https://gerrit.wikimedia.org/r/1078025 (https://phabricator.wikimedia.org/T376528) (owner: 10Aklapper) [09:56:55] andre: ^ :) [09:57:00] thank you jelto ! [09:57:15] np :) [09:59:59] !log uploaded golang-github-flyingmutant-rapid 1.1.0 to apt.wm.o (bookworm-wikimedia) - T376600 [10:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:04] T376600: Provide debian packages for liberica - https://phabricator.wikimedia.org/T376600 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1000) [10:00:20] (03PS1) 10Lucas Werkmeister (WMDE): tables-catalog: Add EntitySchema table [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) [10:00:59] eh? 
[10:02:16] (03CR) 10Hnowlan: [C:03+2] thumbor: disable expensive counter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078043 (https://phabricator.wikimedia.org/T372470) (owner: 10Hnowlan) [10:02:33] andre: you made a patch to update the "Reset Password" link on https://gerrit.wikimedia.org/r/login/ [10:02:39] and that one got merged/deployed ! [10:02:45] merci [10:02:46] the link now points to the IDM [10:02:49] success! [10:03:35] (03Merged) 10jenkins-bot: thumbor: disable expensive counter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078043 (https://phabricator.wikimedia.org/T372470) (owner: 10Hnowlan) [10:04:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: T374215', diff saved to https://phabricator.wikimedia.org/P69480 and previous config saved to /var/cache/conftool/dbconfig/20241007-100410-arnaudb.json [10:04:16] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [10:04:18] (03CR) 10Lucas Werkmeister (WMDE): "Hm, I just noticed that `wb_id_counters` was marked as derivative 🤔" [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [10:04:20] (03PS5) 10Hnowlan: poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T339863) [10:05:19] (03PS7) 10Ayounsi: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 [10:08:25] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] keepalived: declare the service [puppet] - 10https://gerrit.wikimedia.org/r/1078350 (owner: 10Arturo Borrero Gonzalez) [10:09:22] (03PS1) 10Lucas Werkmeister (WMDE): tables-catalog: Add PropertySuggester table [puppet] - 10https://gerrit.wikimedia.org/r/1078369 (https://phabricator.wikimedia.org/T363581) [10:10:06] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host 
gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [10:11:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [10:11:56] FIRING: [2x] SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:09] !log installing Linux 6.1.112 on Bookworm systems [10:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:58] (03CR) 10CI reject: [V:04-1] redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:17:04] !log uploaded golang-github-cloudflare-ipvs 0.10.2 to apt.wm.o (bookworm-wikimedia) - T376600 [10:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:06] T376600: Provide debian packages for liberica - https://phabricator.wikimedia.org/T376600 [10:18:00] (03CR) 10Ayounsi: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:19:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: T374215', diff saved to https://phabricator.wikimedia.org/P69481 and previous config saved to /var/cache/conftool/dbconfig/20241007-101914-arnaudb.json [10:19:17] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [10:21:14] (03PS3) 10Lucas Werkmeister (WMDE): tables-catalog: Add WikibaseQualityConstraints table [puppet] - 10https://gerrit.wikimedia.org/r/1078361 (https://phabricator.wikimedia.org/T363581) [10:21:15] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add WikibaseQualityConstraints table [puppet] - 10https://gerrit.wikimedia.org/r/1078361 
(https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [10:21:17] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add WikibaseQualityConstraints table [puppet] - 10https://gerrit.wikimedia.org/r/1078361 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [10:24:21] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078358 (owner: 10Muehlenhoff) [10:25:32] (03CR) 10Ladsgroup: "The canonical part here is "what happens if it gets dropped?" If revision table gets dropped, we are doomed and only way is to recover fro" [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [10:26:14] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [10:28:56] (03CR) 10CI reject: [V:04-1] redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:31:04] (03PS1) 10JMeybohm: cumin/aliases: Add containerd roles to wikikube aliases [puppet] - 10https://gerrit.wikimedia.org/r/1078374 (https://phabricator.wikimedia.org/T362408) [10:31:25] (03PS1) 10Btullis: ceph: Add the mds caps to the mgr and admin keyrings [puppet] - 10https://gerrit.wikimedia.org/r/1078375 (https://phabricator.wikimedia.org/T376402) [10:31:30] (03CR) 10Muehlenhoff: [C:03+2] Remove Will Doran as approver for dumps/snapshots access groups [puppet] - 10https://gerrit.wikimedia.org/r/1078358 (owner: 10Muehlenhoff) [10:31:44] (03CR) 10Ayounsi: "The error seems unrelated to the patch." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [10:32:18] (03PS1) 10JMeybohm: kubernetes/staging: Add role master_stacked_containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078376 (https://phabricator.wikimedia.org/T362408) [10:32:19] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4234/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078375 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [10:32:52] (03CR) 10Brouberol: [C:03+1] ceph: Add the mds caps to the mgr and admin keyrings [puppet] - 10https://gerrit.wikimedia.org/r/1078375 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [10:33:30] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4235/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078374 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [10:34:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: T374215', diff saved to https://phabricator.wikimedia.org/P69482 and previous config saved to /var/cache/conftool/dbconfig/20241007-103420-arnaudb.json [10:34:23] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [10:34:24] (03CR) 10Btullis: [V:03+1 C:03+2] ceph: Add the mds caps to the mgr and admin keyrings [puppet] - 10https://gerrit.wikimedia.org/r/1078375 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [10:43:13] (03PS1) 10Btullis: ceph: Add mds caps to the admin keyring [puppet] - 10https://gerrit.wikimedia.org/r/1078377 (https://phabricator.wikimedia.org/T376402) [10:43:39] (03CR) 10JMeybohm: [V:03+1 C:03+2] cumin/aliases: Add containerd roles to wikikube aliases [puppet] - 10https://gerrit.wikimedia.org/r/1078374 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) 
[10:44:10] (03CR) 10Brouberol: [C:03+1] ceph: Add mds caps to the admin keyring [puppet] - 10https://gerrit.wikimedia.org/r/1078377 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [10:44:42] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4236/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078377 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [10:45:10] (03CR) 10Btullis: [V:03+1 C:03+2] ceph: Add mds caps to the admin keyring [puppet] - 10https://gerrit.wikimedia.org/r/1078377 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [10:47:24] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage2001.codfw.wmnet [10:47:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage2001.codfw.wmnet [10:47:51] (03CR) 10Ladsgroup: [C:03+2] dumps: Stop fetching custom Wikitech dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077440 (https://phabricator.wikimedia.org/T374114) (owner: 10Majavah) [10:49:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: T374215', diff saved to https://phabricator.wikimedia.org/P69483 and previous config saved to /var/cache/conftool/dbconfig/20241007-104925-arnaudb.json [10:49:28] T374215: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215 [10:49:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10206289 (10phaultfinder) [10:49:50] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage2002.codfw.wmnet [10:49:54] !log Started MediaModeration scanning script after it crashed for commonswiki - https://wikitech.wikimedia.org/wiki/MediaModeration [10:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:17] !log Started 2 day scan 
on enwiki for MediaModeration to catchup with monthly request limit - https://wikitech.wikimedia.org/wiki/MediaModeration [10:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:19] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Add iba to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1078086 (https://phabricator.wikimedia.org/T376568) (owner: 10Gerrit maintenance bot) [10:52:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage2002.codfw.wmnet [10:57:19] (03PS2) 10Lucas Werkmeister (WMDE): tables-catalog: Add EntitySchema table [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) [10:57:19] (03PS2) 10Lucas Werkmeister (WMDE): tables-catalog: Add PropertySuggester table [puppet] - 10https://gerrit.wikimedia.org/r/1078369 (https://phabricator.wikimedia.org/T363581) [10:57:21] (03CR) 10Lucas Werkmeister (WMDE): "> My long term plan is to backup canonical tables more often and do integrity check on them so if they are not really needed, better to sk" [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [11:01:19] (03CR) 10Ladsgroup: "haha, fair but the overhead of one more table is actually high, the data itself is not that much of concern 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [11:01:43] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add EntitySchema table [puppet] - 10https://gerrit.wikimedia.org/r/1078367 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [11:04:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: T374215', diff saved to https://phabricator.wikimedia.org/P69484 and previous config saved to /var/cache/conftool/dbconfig/20241007-110430-arnaudb.json [11:04:34] T374215: db1246 crashed, doesn't 
reboot cleanly - https://phabricator.wikimedia.org/T374215 [11:04:46] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add PropertySuggester table [puppet] - 10https://gerrit.wikimedia.org/r/1078369 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [11:05:43] 10SRE-tools, 06Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593#10206342 (10ayounsi) I re-ran John's script: ===== Hosts that require Manual upgrade (53): ====== 2.30.30.30 (1) puppetmaster1001 ====== 2.50.50.50 (38) an-launcher1002, an-presto1001, an-p... [11:09:13] (03PS1) 10Elukey: swift: avoid rate-limit for the Docker account [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) [11:10:18] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4237/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [11:10:27] (03CR) 10Elukey: "From https://github.com/openstack/swift/blob/master/etc/proxy-server.conf-sample#L807 it seems that the syntax should be correct :)" [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [11:10:43] (03CR) 10Volans: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [11:12:58] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:16:21] !log uploaded golang-github-mtchavez-jenkins 1.0.0 to apt.wm.o (bookworm-wikimedia) - T376600 [11:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:23] T376600: Provide debian packages for liberica - 
https://phabricator.wikimedia.org/T376600 [11:18:23] (03PS1) 10JMeybohm: deployment_server: Set internal docker registry name by default [puppet] - 10https://gerrit.wikimedia.org/r/1078381 (https://phabricator.wikimedia.org/T376608) [11:20:34] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [11:22:08] (03PS1) 10Btullis: ceph: correct the mgr permissions [puppet] - 10https://gerrit.wikimedia.org/r/1078382 (https://phabricator.wikimedia.org/T376402) [11:23:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4238/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078382 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [11:23:54] (03CR) 10Btullis: [V:03+1 C:03+2] ceph: correct the mgr permissions [puppet] - 10https://gerrit.wikimedia.org/r/1078382 (https://phabricator.wikimedia.org/T376402) (owner: 10Btullis) [11:25:00] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:25:29] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:25:38] (03PS1) 10Volans: docs: removed deprecated call to sphinx_rtd_theme [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078383 [11:29:39] (03CR) 10Jbond: [C:03+1] Admin data matrix: show ldap_only_users, too [puppet] - 10https://gerrit.wikimedia.org/r/1069229 (owner: 10Thcipriani) [11:29:43] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:29:52] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[11:31:29] (03CR) 10Jbond: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:31:50] (03CR) 10Jbond: [C:03+1] sre.switchdc.databases.prepare: add check [cookbooks] - 10https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [11:33:44] (03CR) 10Jbond: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1075612 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [11:45:50] 10ops-eqiad, 06DC-Ops: Upgrade puppetmaster1001 iDRAC - https://phabricator.wikimedia.org/T376611 (10ayounsi) 03NEW p:05Triage→03High [11:46:49] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused transport hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1078385 [11:48:34] (03CR) 10CI reject: [V:04-1] cloudgw: cleanup unused transport hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1078385 (owner: 10Arturo Borrero Gonzalez) [11:49:33] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: cleanup unused transport hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1078385 [11:49:45] (03CR) 10Jbond: sre.hosts.reimage: add UEFI HTTP Boot support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [11:51:08] (03PS1) 10Hnowlan: thumbor: add mcrouter config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078386 [11:52:23] (03CR) 10Jbond: [C:03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [11:52:54] (03CR) 10Jbond: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1077725 (owner: 10JHathaway) [11:53:21] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [11:54:38] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078385 (owner: 
10Arturo Borrero Gonzalez) [11:54:43] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [11:54:50] (03CR) 10Volans: [C:03+2] "Self-merging to unblock other CI runs" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078383 (owner: 10Volans) [11:54:54] (03CR) 10Jbond: [V:03+1] "im happy either way. This adds some flexibility if its needed in the future but also easy to revert the previous change. let me know whi" [puppet] - 10https://gerrit.wikimedia.org/r/969701 (https://phabricator.wikimedia.org/T349915) (owner: 10Jbond) [11:57:55] (03CR) 10Jbond: "@jhathaway this is a patch set to get rid of the $nameservers global variable. Let me know if you are interested in shepherding this, oth" [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [11:58:31] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:17] (03CR) 10Volans: "Question and comment inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [12:00:35] (03PS2) 10Brouberol: Make it possible to deploy provisioner without the snahshotter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077873 (https://phabricator.wikimedia.org/T376406) [12:00:35] (03PS2) 10Brouberol: Run the driver-registrar as root [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077874 (https://phabricator.wikimedia.org/T376406) [12:00:35] (03PS2) 10Brouberol: Disable the priviledged security context of the liveness-prometheus container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077875 
(https://phabricator.wikimedia.org/T376406) [12:00:35] (03PS2) 10Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) [12:00:36] (03PS1) 10Brouberol: Make it possible to create several storage classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078387 (https://phabricator.wikimedia.org/T376406) [12:02:43] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [12:03:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [12:05:57] (03Merged) 10jenkins-bot: docs: removed deprecated call to sphinx_rtd_theme [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078383 (owner: 10Volans) [12:07:00] (03PS10) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [12:07:00] (03CR) 10Jelto: "I tested this with `test-cookbook` and beside the normal downtime `27fe49f5-f909-47e8-b06c-3fd35fb79c04` (for `gitlab1003`) another downti" [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:08:01] (03PS2) 10Brouberol: Make it possible to create several storage classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078387 (https://phabricator.wikimedia.org/T376406) [12:08:01] (03PS3) 10Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) [12:09:29] (03PS4) 10Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077878 
(https://phabricator.wikimedia.org/T376406) [12:10:04] (03PS5) 10Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) [12:13:18] (03PS2) 10JMeybohm: kubernetes/staging: Add role master_stacked_containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078376 (https://phabricator.wikimedia.org/T362408) [12:13:18] (03PS1) 10JMeybohm: containerd: Enable unprivileged icmp and binding to ports < 1024 [puppet] - 10https://gerrit.wikimedia.org/r/1078391 (https://phabricator.wikimedia.org/T362408) [12:13:43] (03CR) 10Volans: sre.gitlab.upgrade: also use the service name for the downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:19:44] (03PS1) 10Ammarpad: hawiki: Add temporary tagline for Vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078396 (https://phabricator.wikimedia.org/T376049) [12:20:53] (03PS11) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [12:21:14] (03PS4) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) [12:21:36] (03CR) 10Arnaudb: mariadb: clone cookbook maintenance (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [12:21:45] (03PS1) 10Ammarpad: enwiktionary: Enable $wgMFCollapseSectionsByDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078397 (https://phabricator.wikimedia.org/T376446) [12:26:12] (03PS5) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) [12:26:18] (03CR) 10Arnaudb: mariadb: clone cookbook maintenance (031 comment) 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [12:27:15] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: cleanup unused transport hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/1078385 (owner: 10Arturo Borrero Gonzalez) [12:34:46] (03PS1) 10Brouberol: Provision dummy secrets for ceph-csi users [labs/private] - 10https://gerrit.wikimedia.org/r/1078401 [12:35:16] (03PS2) 10Brouberol: Provision dummy secrets for ceph-csi users [labs/private] - 10https://gerrit.wikimedia.org/r/1078401 (https://phabricator.wikimedia.org/T376407) [12:37:28] (03CR) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:43:42] (03PS1) 10Brouberol: ceph: provision the dse-k8s-csi-cephfs user capabilities [puppet] - 10https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) [12:44:14] (03PS2) 10Brouberol: ceph: provision the dse-k8s-csi-cephfs user capabilities [puppet] - 10https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) [12:44:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:45:08] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4239/console" [puppet] - 10https://gerrit.wikimedia.org/r/1078402 (https://phabricator.wikimedia.org/T376407) (owner: 10Brouberol) [12:47:05] Hi! In the morning window, I had my patch about changing Wikimania wiki logo deployed. 
However, it turns out, that (apart from deploying a patch) it's needed to purge the relevant image from cache (https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests#Change_the_logo_of_a_Wikimedia_wiki). Can I ask someone who has rights to purge them? These [12:47:05] files were changed: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1077422 [12:49:28] * Lucas_WMDE tries to dig up that tool which gave you the URLs [12:50:25] ok apparently I misremembered and it’s not in fact a web tool https://github.com/theresnotime/purge-logos-from-patch [12:51:34] I got those links by searching documentations, because logos appear to be cached (they change to newer only using WikimediaDebug) [12:52:18] jouncebot: now [12:52:19] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [12:53:11] !log printf 'https://en.wikipedia.org/static/images/%s\n' 'mobile/copyright/wikimaniawiki-wordmark.svg' 'project-logos/wikimaniawiki-1.5x.png' 'project-logos/wikimaniawiki-2x.png' 'project-logos/wikimaniawiki.png' 'icons/wikimaniawiki.svg' | mwscript-k8s --attach -- purgeList enwiki # T376292 [12:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:18] T376292: Change Wikimania wiki logo from 2024 to generic - https://phabricator.wikimedia.org/T376292 [12:53:31] Lucas_WMDE: https://logos-purge.toolforge.org/ [12:53:33] Msz2001: better now? [12:53:45] Yes, thanks! 
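(Editorial note on the 12:53:11 `!log` entry: the `printf` call there expands one format string, `https://en.wikipedia.org/static/images/%s\n`, once per argument, so the purge script receives five URLs on stdin, one per line. The sketch below rebuilds that same list in Python purely as an illustration. The names `BASE`, `PATHS`, `purge_urls` and `as_stdin` are hypothetical and are not part of any Wikimedia tooling; only the base URL and the five paths come from the log.)

```python
# Hypothetical sketch: rebuild the URL list that the logged printf command
# pipes into purgeList. Only BASE and PATHS are taken from the log entry;
# the function names are made up for this illustration.
BASE = "https://en.wikipedia.org/static/images/"

PATHS = [
    "mobile/copyright/wikimaniawiki-wordmark.svg",
    "project-logos/wikimaniawiki-1.5x.png",
    "project-logos/wikimaniawiki-2x.png",
    "project-logos/wikimaniawiki.png",
    "icons/wikimaniawiki.svg",
]


def purge_urls(paths, base=BASE):
    """Return one absolute URL per relative image path."""
    return [base + path for path in paths]


def as_stdin(paths, base=BASE):
    """Format the URLs one per line, as printf '...%s\\n' would emit them."""
    return "".join(url + "\n" for url in purge_urls(paths, base))


if __name__ == "__main__":
    # Print the list; in the log this output is piped to the purge script.
    print(as_stdin(PATHS), end="")
```

(The list is kept as data and the formatting as a separate function so that the same paths could be reused for a different base URL; whether that is useful in practice is an assumption, not something the log states.)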
[12:53:54] TheresNoTime: too late, but thanks ^^ [12:53:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [12:54:20] ^^ [12:56:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077800 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [12:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1300). [13:00:05] cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] * Lucas_WMDE filed T376616 for the issue noticed during deployment [13:00:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [13:01:28] (03CR) 10C. Scott Ananian: [C:03+1] scandium is being replaced by parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [13:01:35] (03CR) 10C. Scott Ananian: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [13:01:38] anyway, I can deploy! [13:01:43] thanks! 
i'm here [13:01:47] should be pretty quick [13:02:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2035.codfw.wmnet to cluster codfw and group C [13:02:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2035.codfw.wmnet to cluster codfw and group C [13:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077800 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [13:02:54] (03Merged) 10jenkins-bot: scandium is being replaced by parsoidtest1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077800 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [13:03:10] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1077800|scandium is being replaced by parsoidtest1001 (T363402)]] [13:03:19] T363402: parsoidtest1001 implementation tracking - https://phabricator.wikimedia.org/T363402 [13:03:57] (03PS1) 10Muehlenhoff: Add ganeti203[56] as Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1078405 (https://phabricator.wikimedia.org/T376594) [13:04:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367856)', diff saved to https://phabricator.wikimedia.org/P69485 and previous config saved to /var/cache/conftool/dbconfig/20241007-130409-ladsgroup.json [13:04:13] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:04:50] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti203[56] as Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1078405 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [13:05:15] !log lucaswerkmeister-wmde@deploy2002 arlolra, lucaswerkmeister-wmde: Backport for [[gerrit:1077800|scandium is being replaced by parsoidtest1001 (T363402)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:23] cscott: 
is the change testable on mwdebug? [13:05:37] i don't think i can test that on testservers, no, other than to verify that nothing has exploded in the logs [13:05:40] ok [13:05:49] !log lucaswerkmeister-wmde@deploy2002 arlolra, lucaswerkmeister-wmde: Continuing with sync [13:05:54] it looks safe enough to me so let’s go ahead I think [13:06:14] :thumbsup [13:06:27] er, 👍 [13:10:24] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077800|scandium is being replaced by parsoidtest1001 (T363402)]] (duration: 07m 14s) [13:10:27] T363402: parsoidtest1001 implementation tracking - https://phabricator.wikimedia.org/T363402 [13:11:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2035.codfw.wmnet to cluster codfw and group C [13:11:58] done? [13:12:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2035.codfw.wmnet to cluster codfw and group C [13:12:58] thanks! [13:14:22] is this the right channel to look for folks who can give a C+2 on a puppet patch? [13:14:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1077803 is the other side of that config patch, but I only have C+1 rights on that repo. [13:15:00] I think it’s approximately the right channel but I can’t do it either ^^ [13:15:04] #wikimedia-sre might also work [13:15:15] jouncebot: nowandnext [13:15:15] For the next 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1300) [13:15:16] In 2 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1530) [13:15:25] * Lucas_WMDE done deploying btw [13:15:34] Thanks. 
I might deploy shortly [13:15:40] thanks again lucas [13:16:14] np :) [13:16:26] (03PS1) 10Dreamy Jazz: Update globalblocks 'gb_address' index to allow autoblocks [extensions/GlobalBlocking] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1078406 (https://phabricator.wikimedia.org/T376052) [13:16:31] 06SRE, 10SRE-Access-Requests, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10206691 (10isarantopoulos) @santhosh @KartikMistry Unless this is urgent, we would prefer to provide access after implementing T37... [13:16:55] (03CR) 10Dreamy Jazz: [C:03+2] Update globalblocks 'gb_address' index to allow autoblocks [extensions/GlobalBlocking] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1078406 (https://phabricator.wikimedia.org/T376052) (owner: 10Dreamy Jazz) [13:18:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1078406 (https://phabricator.wikimedia.org/T376052) (owner: 10Dreamy Jazz) [13:18:43] Dreamy_Jazz: out of curiosity – I’m guessing GlobalBlockManager.php is the only “important” part of that backport? 
[13:18:51] Yup [13:18:52] (and the production schema change will happen separately) [13:18:55] ok :) [13:19:02] The production schema change has already happened [13:19:10] nice [13:19:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P69486 and previous config saved to /var/cache/conftool/dbconfig/20241007-131915-ladsgroup.json [13:20:00] cscott: you could also try https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241008T1600 btw but I would hope that someone will review the puppet change sooner than that ^^ [13:24:53] (03CR) 10Vgutierrez: hiera: Switch to digicert-2024 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [13:28:36] (03Merged) 10jenkins-bot: Update globalblocks 'gb_address' index to allow autoblocks [extensions/GlobalBlocking] (wmf/1.43.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1078406 (https://phabricator.wikimedia.org/T376052) (owner: 10Dreamy Jazz) [13:28:54] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1078406|Update globalblocks 'gb_address' index to allow autoblocks (T376052)]] [13:28:59] T376052: Allow autoblocks on IP addresses in the globalblocks table for IP addresses which are already globally blocked - https://phabricator.wikimedia.org/T376052 [13:30:52] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1078406|Update globalblocks 'gb_address' index to allow autoblocks (T376052)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:31:06] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:32:12] (03CR) 10JMeybohm: [C:03+2] containerd: Enable unprivileged icmp and binding to ports < 1024 [puppet] - 10https://gerrit.wikimedia.org/r/1078391 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [13:34:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1192', diff saved to https://phabricator.wikimedia.org/P69487 and previous config saved to /var/cache/conftool/dbconfig/20241007-133422-ladsgroup.json [13:34:35] (03PS2) 10Slyngshede: ldapbackend: Remove post_save signal for user models. [software/bitu] - 10https://gerrit.wikimedia.org/r/1078343 (https://phabricator.wikimedia.org/T346601) [13:35:30] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4241/co" [puppet] - 10https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [13:35:43] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078406|Update globalblocks 'gb_address' index to allow autoblocks (T376052)]] (duration: 06m 49s) [13:35:46] T376052: Allow autoblocks on IP addresses in the globalblocks table for IP addresses which are already globally blocked - https://phabricator.wikimedia.org/T376052 [13:35:52] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch to digicert-2024 certificates [puppet] - 10https://gerrit.wikimedia.org/r/1077711 (https://phabricator.wikimedia.org/T368560) (owner: 10Vgutierrez) [13:35:59] I'm done deploying [13:36:20] !log UTC afternoon backport+config window done [13:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:40] (03CR) 10Hnowlan: [V:03+1 C:03+2] scandium is being replaced by parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [13:37:20] !log switching to digicert-2024 certificates on esams, eqsin, drmrs and magru [13:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [13:38:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1076974 
(https://phabricator.wikimedia.org/T376108) (owner: 10Slyngshede) [13:38:33] (03CR) 10Volans: [C:03+1] "LGTM, feel free to ignore the optional nit ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [13:40:35] 10SRE-tools, 10conftool, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#10206767 (10Volans) 05Open→03Resolved This has been released and tested. Resolving. [13:40:57] (03PS1) 10Zabe: s5: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078412 (https://phabricator.wikimedia.org/T183490) [13:43:32] (03CR) 10JMeybohm: "Looks correct. Although I'm worried the whitelist settings only apply to `account_*` rate limits and not to the `container_*` ones..." [puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [13:44:14] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team, 10LPL Essential (LPL Essential 2024 Jul-Sep): Access to deploy recommendation API ML service for kartik - https://phabricator.wikimedia.org/T376585#10206774 (10isarantopoulos) [13:49:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367856)', diff saved to https://phabricator.wikimedia.org/P69488 and previous config saved to /var/cache/conftool/dbconfig/20241007-134929-ladsgroup.json [13:49:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [13:49:32] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:49:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [13:49:42] (03CR) 10Elukey: "This is a good point, didn't think about it.. It is not clear from the docs, and there seems to be no container_whitelist_* option afaics." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [13:49:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [13:49:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T367856)', diff saved to https://phabricator.wikimedia.org/P69489 and previous config saved to /var/cache/conftool/dbconfig/20241007-134950-ladsgroup.json [13:51:02] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10206800 (10Jhancock.wm) 05Open→03Resolved [13:55:41] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Decommission the alert1001 and alert2001 hosts - https://phabricator.wikimedia.org/T372607#10206808 (10Jhancock.wm) [13:55:57] (03CR) 10Slyngshede: [C:03+2] Menu: Allow users on mobile to close the menu. [software/bitu] - 10https://gerrit.wikimedia.org/r/1076974 (https://phabricator.wikimedia.org/T376108) (owner: 10Slyngshede) [13:59:22] (03Merged) 10jenkins-bot: Menu: Allow users on mobile to close the menu. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1076974 (https://phabricator.wikimedia.org/T376108) (owner: 10Slyngshede) [14:00:00] (03PS1) 10Zabe: Stop setting wgAbuseFilterActorTableSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078415 (https://phabricator.wikimedia.org/T188180) [14:01:04] (03PS3) 10JMeybohm: kubernetes/staging: Add role master_stacked_containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078376 (https://phabricator.wikimedia.org/T362408) [14:01:04] (03PS1) 10JMeybohm: wikikube-staging-codfw: Migrate kubestage2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078416 (https://phabricator.wikimedia.org/T362408) [14:01:26] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: also use the service name for the downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [14:02:23] (03CR) 10Kamila Součková: [C:03+1] poolcounter: introduce allowlist to skip rate limit [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1063217 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [14:08:36] (03CR) 10JMeybohm: [C:03+2] wikikube-staging-codfw: Migrate kubestage2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078416 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:08:39] (03CR) 10JMeybohm: [C:03+2] kubernetes/staging: Add role master_stacked_containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078376 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:10:52] (03PS1) 10JMeybohm: wikikube-staging-codfw: Migrate kubestage2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078417 (https://phabricator.wikimedia.org/T362408) [14:10:58] (03PS9) 10Arturo Borrero Gonzalez: cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) [14:11:56] FIRING: [2x] SystemdUnitFailed: git_pull_charts.service on 
deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:19] (03CR) 10JMeybohm: [C:03+2] wikikube-staging-codfw: Migrate kubestage2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078417 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:15:23] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [14:16:12] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wikikube-worker2092.codfw.wmnet with reason: Degraded RAID [14:16:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2092.codfw.wmnet with reason: Degraded RAID [14:16:31] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10206920 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d5aed8b0-eaca-4555-b388-ad989b1c0dd9) set by kamila@cumin1002 for 7 days, 0:00:00 on 1 host(s... [14:17:09] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Decommission the alert1001 and alert2001 hosts - https://phabricator.wikimedia.org/T372607#10206921 (10Jhancock.wm) a:03VRiley-WMF [14:17:54] (03CR) 10Jelto: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [14:18:41] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bookworm [14:23:26] (03CR) 10Alexandros Kosiaris: "Thanks. 
Most of this patch was already submitted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024400 and https://gerrit.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [14:24:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 828.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:28:10] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [14:29:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 812.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:31:41] (03PS1) 10Raymond Ndibe: aptrepo: add k8s 1.28 repos [puppet] - 10https://gerrit.wikimedia.org/r/1078420 (https://phabricator.wikimedia.org/T362867) [14:34:21] (03PS12) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [14:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:55] (03CR) 10Alexandros Kosiaris: [C:03+1] sre.discovery.datacenter: exclude kartotherian-ssl 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [14:39:33] (03CR) 10Alexandros Kosiaris: [C:03+1] sre.discovery.datacenter: exclude swift-https [cookbooks] - 10https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [14:40:35] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [14:43:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [14:46:21] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [14:54:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10207022 (10phaultfinder) [15:00:57] !log ongoing maintenance on mr1-esams [15:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10207062 (10jcrespo) Any ETA for this and the codfw equivalent, DC-ops? I know there may be some delays due to the vendor peculiarities, but my "Need b... 
[15:08:01] (03PS6) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) [15:12:58] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:13:02] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetmaster1001.eqiad.wmnet [15:13:17] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts puppetmaster1001.eqiad.wmnet [15:14:33] (03PS2) 10Muehlenhoff: Remove puppetserver1002 from active puppet servers [dns] - 10https://gerrit.wikimedia.org/r/1076899 (https://phabricator.wikimedia.org/T376058) [15:15:02] (03PS8) 10Ayounsi: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 [15:15:04] (03CR) 10Ayounsi: redfish: add UEFI functions (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [15:17:20] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver1002 from active puppet servers [dns] - 10https://gerrit.wikimedia.org/r/1076899 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [15:23:32] (03PS1) 10Volans: Fix issues reported by newer pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 [15:23:49] jelto: ^^^ patch to fix CI, now the problem is to find all the people to add to it :) [15:23:55] * volans starts looking at git blame [15:25:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetserver1002.eqiad.wmnet with reason: RAM expansion [15:25:51] (03CR) 10Ayounsi: sre.hosts.reimage: add UEFI HTTP Boot support (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [15:25:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on puppetserver1002.eqiad.wmnet with reason: RAM expansion [15:25:59] (03CR) 10CI reject: [V:04-1] redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [15:26:04] (03PS4) 10Alexandros Kosiaris: Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) [15:26:11] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10207140 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6406f203-9647-4330-aa02-83cb4e8485b0) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and... [15:26:24] (03CR) 10CI reject: [V:04-1] Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) (owner: 10Alexandros Kosiaris) [15:26:35] (03Abandoned) 10Alexandros Kosiaris: Switch scandium references to parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [15:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1530). 
[15:30:46] (03PS5) 10Alexandros Kosiaris: Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) [15:37:28] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:01] (03PS1) 10Muehlenhoff: Revert "Remove puppetserver1002 from active puppet servers" [dns] - 10https://gerrit.wikimedia.org/r/1078431 (https://phabricator.wikimedia.org/T376058) [15:38:28] (03PS2) 10Ayounsi: Add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:39:40] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove puppetserver1002 from active puppet servers" [dns] - 10https://gerrit.wikimedia.org/r/1078431 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [15:40:11] (03PS3) 10Ayounsi: Add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:40:17] (03CR) 10CI reject: [V:04-1] Add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:41:09] (03PS5) 10Hashar: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) [15:41:57] (03CR) 10Hashar: "Rebased to clear a conflict with I0eb8d9ba39ece2447665d30704f9790062fb0511" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [15:42:06] (03CR) 10CI reject: [V:04-1] Add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:42:07] (03PS4) 10Ayounsi: Add efi support to partman [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:44:25] (03PS1) 10Muehlenhoff: Remove 
puppetserver1003 from active puppet servers [dns] - 10https://gerrit.wikimedia.org/r/1078432 (https://phabricator.wikimedia.org/T376058) [15:46:11] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver1003 from active puppet servers [dns] - 10https://gerrit.wikimedia.org/r/1078432 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [15:47:28] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:47:29] (03PS6) 10Alexandros Kosiaris: Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T376632) [15:48:02] (03CR) 10JHathaway: Add efi support to partman (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077740 (owner: 10JHathaway) [15:48:10] (03PS1) 10Muehlenhoff: Update point of contact for contracts formerly managed by Jean-Rene Branaa [puppet] - 10https://gerrit.wikimedia.org/r/1078434 [15:49:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetserver1003.eqiad.wmnet with reason: RAM expansion [15:49:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetserver1003.eqiad.wmnet with reason: RAM expansion [15:50:08] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10207237 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9145360a-ee65-45b6-a805-4fa59cb47d42) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and... 
[15:51:16] (03CR) 10CI reject: [V:04-1] tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [15:52:21] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10207239 (10Papaul) [15:52:28] (03PS1) 10Giuseppe Lavagetto: Add first version to deploy of hiddenparma [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1078435 [15:54:40] (03PS2) 10Giuseppe Lavagetto: Add first version to deploy of hiddenparma [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1078435 (https://phabricator.wikimedia.org/T371782) [15:56:37] (03PS3) 10JHathaway: efi: add efi boot files on apt server [puppet] - 10https://gerrit.wikimedia.org/r/1078020 (https://phabricator.wikimedia.org/T373519) [15:57:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [15:57:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [15:58:31] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:36] (03PS1) 10Giuseppe Lavagetto: git::replicated_local_repo: set mode of post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1078438 [15:59:16] (03PS1) 10Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [15:59:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2220.codfw.wmnet with reason: 
Maintenance [15:59:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [16:01:04] (03CR) 10JHathaway: [C:03+2] "Thanks for the review @john.r.bond+wmf-gerrit@gmail.com" [puppet] - 10https://gerrit.wikimedia.org/r/1077725 (owner: 10JHathaway) [16:03:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [16:03:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [16:06:05] (03PS1) 10Muehlenhoff: Revert "Remove puppetserver1003 from active puppet servers" [dns] - 10https://gerrit.wikimedia.org/r/1078442 (https://phabricator.wikimedia.org/T376058) [16:09:08] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078434 (owner: 10Muehlenhoff) [16:10:46] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - 10https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:11:14] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove puppetserver1003 from active puppet servers" [dns] - 10https://gerrit.wikimedia.org/r/1078442 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [16:15:25] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10207279 (10MoritzMuehlenhoff) [16:16:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bookworm [16:19:58] (03PS1) 10JMeybohm: Remove kubelet systemd unit dependency to docker.service [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1078447 (https://phabricator.wikimedia.org/T362408) [16:24:03] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet 
with OS bookworm [16:26:09] (03CR) 10JHathaway: "thanks @john.r.bond+wmf-gerrit@gmail.com definitely interested, I have been trying to move forward on getting rid of realm.pp." [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:26:35] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm [16:31:37] (03PS1) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [16:31:38] (03PS1) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [16:32:32] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10207364 (10elukey) Some notes: ml-serve* Supermicro nodes are AMD CPU based, so some BIOS settings don't apply to... [16:32:33] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10207365 (10Jhancock.wm) [16:35:57] (03CR) 10JMeybohm: "Agreed. Maybe test this first and if it does not change a thing, bump container_listing_ratelimit_200 temporarily." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078380 (https://phabricator.wikimedia.org/T376285) (owner: 10Elukey) [16:36:24] (03PS6) 10Scott French: hieradata: add mw-debug "next" release to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1077481 (https://phabricator.wikimedia.org/T372604) [16:37:39] (03PS1) 10Jdlrobson: Expand Vector 2022 roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) [16:44:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:45:59] (03PS2) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [16:45:59] (03PS2) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [16:47:03] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [16:47:47] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1077481 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [16:51:01] (03CR) 10Alexandros Kosiaris: scandium is being replaced by parsoidtest1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077803 (https://phabricator.wikimedia.org/T363402) (owner: 10Arlolra) [16:52:10] (03CR) 10Hashar: "`prospector` / `pylint` are not pinned to a specific version and thus end up failing due to some upstream new release. 
But that is outsid" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485) (owner: 10Hashar) [16:54:54] (03PS3) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [16:54:54] (03PS3) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [16:55:59] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [16:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [17:00:04] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1700). [17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T1700). 
[17:00:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10207459 (10Jhancock.wm) [17:01:21] will be getting started shortly [17:01:36] (03CR) 10Scott French: [C:03+2] mw-debug: add initial "next" release (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072764 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:02:53] (03Merged) 10jenkins-bot: mw-debug: add initial "next" release (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1072764 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:02:56] (03PS4) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [17:02:56] (03PS4) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [17:03:03] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [17:06:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:06:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:07:15] (03PS1) 10Tiziano Fogli: kafka: remove mirror maker alerts from icinga [puppet] - 10https://gerrit.wikimedia.org/r/1078456 (https://phabricator.wikimedia.org/T370153) [17:12:26] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:12:44] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:18:02] (03CR) 10Scott French: [C:03+2] hieradata: add mw-debug "next" release to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1077481 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:24:28] (03CR) 10Xcollazo: 
[C:03+1] dumps: Update legal.html file to list different licences for Wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072268 (owner: 10Jforrester) [17:25:34] (03PS2) 10Jforrester: dumps: Update legal.html file to list different licences for Wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1072268 [17:26:26] !log swfrench@deploy2002 Started scap sync-world: Testing scap after mw-debug next bring-up - T372604 [17:26:29] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [17:29:11] !log swfrench@deploy2002 Finished scap sync-world: Testing scap after mw-debug next bring-up - T372604 (duration: 02m 45s) [17:33:00] (03CR) 10Scott French: "Scap has now created `/etc/helmfile-defaults/mediawiki/release/mw-debug-next.yaml`, so this is good to go." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078007 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:34:12] (03PS2) 10Scott French: mw-debug: remove temporary release value override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078007 (https://phabricator.wikimedia.org/T372604) [17:39:55] (03CR) 10RLazarus: [C:03+1] mw-debug: remove temporary release value override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078007 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:40:38] (03CR) 10Scott French: [C:03+2] mw-debug: remove temporary release value override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078007 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:41:56] (03Merged) 10jenkins-bot: mw-debug: remove temporary release value override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078007 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [17:45:33] alright, I believe I am now done with the infra window [17:52:10] (03PS1) 10Volans: Temporary limit prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078459 
[17:52:27] (03PS2) 10Volans: Fix issues reported by newer pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 [17:52:27] (03PS1) 10Volans: Temporary limit prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1078460 [17:54:32] (03PS3) 10Scott French: service: add basic configuration for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) [17:54:32] (03PS2) 10Scott French: [DNM] service: move mwdebug-next to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1072796 (https://phabricator.wikimedia.org/T372604) [17:56:11] (03PS1) 10Brouberol: Import ceph-csi-cephfs chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) [17:56:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [17:57:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [18:02:31] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:03:00] (03CR) 10Scott French: [C:03+2] service: add basic configuration for mwdebug-next [puppet] - 10https://gerrit.wikimedia.org/r/1071933 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:05:00] (03CR) 10CI reject: [V:04-1] Fix issues reported by newer pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 (owner: 10Volans) [18:05:49] (03CR) 10Volans: [C:03+2] "Self-merging to unblock CI waiting for reviews on Icd5c48498b11bf5d86cfdc791451488037675a43" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078460 (owner: 10Volans) [18:08:47] (03CR) 10Volans: [C:03+2] "Self-merging to unblock CI for other patches." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1078459 (owner: 10Volans) [18:10:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:11:56] FIRING: [2x] SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-misc2001 to codfw - jhancock@cumin2002" [18:14:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding mc-misc2001 to codfw - jhancock@cumin2002" [18:14:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:18:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:19:27] (03Merged) 10jenkins-bot: Temporary limit prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1078460 (owner: 10Volans) [18:19:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:19:46] (03PS3) 10Volans: Fix issues reported by newer pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/1078427 [18:20:12] (03Merged) 10jenkins-bot: Temporary limit prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078459 (owner: 10Volans) [18:22:28] !log running `git restore helmfile.d/services/thumbor/values.yaml` on deploy1003 to unblock git-pull timer [18:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:06] (03PS6) 10Hashar: tox: only install flake8 when running flake8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069226 
(https://phabricator.wikimedia.org/T372485) [18:23:10] (CR) Legoktm: [C:-1] "+1 to the concept, -1 to the grammar :)" [software/klaxon] - https://gerrit.wikimedia.org/r/1078077 (owner: Reedy) [18:23:16] (PS9) Ayounsi: redfish: add UEFI functions [software/spicerack] - https://gerrit.wikimedia.org/r/1077661 [18:23:51] (PS2) Elukey: sre.hosts.provision: avoid a reboot if BIOS settings are already good [cookbooks] - https://gerrit.wikimedia.org/r/1078439 (https://phabricator.wikimedia.org/T365372) [18:24:04] (PS13) Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) [18:26:44] (PS5) Scott French: sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) [18:26:56] FIRING: [2x] SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:16] ^ this should be fine now, but it might take a bit for that to be reflected here [18:41:06] (CR) Scott French: "Thanks for the reviews, Alexandros!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:41:08] (03CR) 10Scott French: [C:03+2] sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:43:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073258 (https://phabricator.wikimedia.org/T373022) (owner: 10Esanders) [18:53:42] (03Merged) 10jenkins-bot: sre.discovery.datacenter: exclude kartotherian-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/1075625 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:59:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10207963 (10phaultfinder) [19:12:58] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:24:39] (03CR) 10BCornwall: [C:03+2] tlsproxy: Remove rsa-2048 certs [puppet] - 10https://gerrit.wikimedia.org/r/1075612 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [19:26:50] (03CR) 10Scott French: [C:03+2] sre.discovery.datacenter: exclude swift-https [cookbooks] - 10https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [19:40:15] (03Merged) 10jenkins-bot: sre.discovery.datacenter: exclude swift-https [cookbooks] - 10https://gerrit.wikimedia.org/r/1077111 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [19:46:51] (03PS1) 10BCornwall: haproxy: Reorder acmecerts to demote rsa-2048 [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) [19:47:49] (03PS2) 
10BCornwall: haproxy: Reorder acmecerts to demote rsa-2048 [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) [19:55:06] (03PS5) 10JMeybohm: k8s/kubelet: Make kubelet.service depend on container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) [19:55:06] (03PS5) 10JMeybohm: k8s/kubelet: Remove absent containerd specific systemd override [puppet] - 10https://gerrit.wikimedia.org/r/1078451 (https://phabricator.wikimedia.org/T362408) [19:55:37] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078450 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [19:56:42] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [19:56:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev'] [19:57:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev'] [19:58:31] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T2000). Please do the needful. [20:00:05] derenrich, Ammar, Ammar, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:17] o7 [20:00:20] i can deploy today [20:00:20] o/ [20:00:56] Ammar doesn't appear to be around yet [20:00:59] (03CR) 10BBlack: [C:03+1] haproxy: Reorder acmecerts to demote rsa-2048 [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:01:04] (03PS2) 10DErenrich: disable the Add A Fact QuickSurvey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076835 [20:01:05] (03CR) 10Urbanecm: [C:03+2] disable the Add A Fact QuickSurvey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076835 (owner: 10DErenrich) [20:01:11] (03PS2) 10Esanders: Enable EditCheck on ru.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073258 (https://phabricator.wikimedia.org/T373022) [20:01:14] (03CR) 10Urbanecm: [C:03+2] Enable EditCheck on ru.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073258 (https://phabricator.wikimedia.org/T373022) (owner: 10Esanders) [20:01:18] let's get started [20:01:47] (03Merged) 10jenkins-bot: disable the Add A Fact QuickSurvey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076835 (owner: 10DErenrich) [20:01:57] (03Merged) 10jenkins-bot: Enable EditCheck on ru.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1073258 (https://phabricator.wikimedia.org/T373022) (owner: 10Esanders) [20:04:24] (03PS1) 10Effie Mouzeli: site.pp: add mc-misc2* servers [puppet] - 10https://gerrit.wikimedia.org/r/1078469 (https://phabricator.wikimedia.org/T372800) [20:07:13] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: add mc-misc2* servers [puppet] - 10https://gerrit.wikimedia.org/r/1078469 (https://phabricator.wikimedia.org/T372800) (owner: 10Effie Mouzeli) [20:12:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [20:12:47] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - 
https://phabricator.wikimedia.org/T370678#10208231 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [20:12:52] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1076835|disable the Add A Fact QuickSurvey on enwiki]], [[gerrit:1073258|Enable EditCheck on ru.wiki (T373022)]] [20:12:56] T373022: Enable EditCheck in Russian Wikipedia - https://phabricator.wikimedia.org/T373022 [20:13:39] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:14:52] !log urbanecm@deploy2002 esanders, derenrich, urbanecm: Backport for [[gerrit:1076835|disable the Add A Fact QuickSurvey on enwiki]], [[gerrit:1073258|Enable EditCheck on ru.wiki (T373022)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:07] Kemayo: derenrich: can you test at mwdebug, please? [20:15:24] Sure, just a second. [20:15:38] one second [20:15:41] thanks [20:15:45] urbanecm: Works. [20:15:49] (03CR) 10Vgutierrez: [C:03+1] "thx for taking care of this one. This will close the gap between digicert and LE configurations" [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:15:51] thanks for confirming! [20:15:54] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10208242 (10nisrael) Hi all, I met with Lisa today and we retrieved an example of one of the responses she's received. I'm attaching an image of it and I... [20:15:58] LGTM [20:16:01] (03PS3) 10C. 
Scott Ananian: Turn on mobile support for Parsoid Read Views (but not on talk pages) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) [20:16:01] !log urbanecm@deploy2002 esanders, derenrich, urbanecm: Continuing with sync [20:16:03] proceeding! [20:16:04] thanks both [20:16:35] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4243/co" [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:17:00] (03CR) 10C. Scott Ananian: "This is good to go as soon as we get the ok from web." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [20:20:33] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076835|disable the Add A Fact QuickSurvey on enwiki]], [[gerrit:1073258|Enable EditCheck on ru.wiki (T373022)]] (duration: 07m 41s) [20:20:36] T373022: Enable EditCheck in Russian Wikipedia - https://phabricator.wikimedia.org/T373022 [20:20:37] deployed [20:20:43] 06SRE, 06Infrastructure-Foundations, 10Mail: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10208262 (10jhathaway) >>! In T375643#10208242, @nisrael wrote: > I met with Lisa today and we retrieved an example of one of the responses she's received... [20:20:51] Ammar not around it looks [20:24:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10208277 (10jijiki) >>! In T372800#10199506, @Jhancock.wm wrote: > @jijiki hi, we got the servers in this week and are going to be racking them today. Could you... 
[20:27:05] (03CR) 10BCornwall: [V:03+1 C:03+2] haproxy: Reorder acmecerts to demote rsa-2048 [puppet] - 10https://gerrit.wikimedia.org/r/1078468 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [20:39:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10208401 (10Ottomata) Approved! [20:44:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:54:09] (03CR) 10Ottomata: [C:03+1] Add an hdfs_file type and provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074478 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [20:57:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:00:04] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241007T2100). Please do the needful. [21:23:11] (03PS2) 10JHathaway: dhcp: Add option to omit sending filename to a vendor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 [21:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade puppetmaster1001 iDRAC - https://phabricator.wikimedia.org/T376611#10208508 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr updated firmware to most latest version racadm getsysinfo RAC Information: RAC Date/Time = Mon Oct 7 21:23:29 2024 Firmwa... 
[21:25:13] (03CR) 10JHathaway: "Makes sense, patch updated, let me know what you think, with your suggestion updating the tests is not needed, which is great!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [21:29:35] (03CR) 10JHathaway: [C:03+2] efi: add efi boot files on apt server [puppet] - 10https://gerrit.wikimedia.org/r/1078020 (https://phabricator.wikimedia.org/T373519) (owner: 10JHathaway) [21:32:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [21:33:06] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10208555 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed... [21:34:05] (03CR) 10CI reject: [V:04-1] dhcp: Add option to omit sending filename to a vendor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 (owner: 10JHathaway) [21:37:01] (03PS1) 10Scott French: mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) [21:37:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 870.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:40:24] (03PS3) 10JHathaway: dhcp: Add option to omit sending filename to a vendor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077804 [21:40:50] (03PS3) 10AikoChou: ml-services: update ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 
(https://phabricator.wikimedia.org/T372405) [21:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 871.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:47:02] (03PS4) 10AikoChou: ml-services: update ref-quality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077309 (https://phabricator.wikimedia.org/T372405) [21:49:54] (03CR) 10Jdlrobson: [C:03+1] "This is fine provided we have the shared understanding that roll out of Parsoid for default read views is only ready for Wikivoyage projec" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [21:50:40] (03PS9) 10JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [21:51:54] (03CR) 10JHathaway: [C:03+1] sre.hosts.provision: initial UEFI support [cookbooks] - 10https://gerrit.wikimedia.org/r/1077377 (https://phabricator.wikimedia.org/T373519) (owner: 10Ayounsi) [21:55:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 806.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:58:08] (03PS10) 10JHathaway: redfish: add UEFI functions [software/spicerack] - 10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [21:59:46] (03CR) 10JHathaway: redfish: add UEFI functions (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1077661 (owner: 10Ayounsi) [22:00:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 806.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:07:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 855.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 860.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:17:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 806.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:22:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 833.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 935.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:26:56] FIRING: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:31:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 859.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:31:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 867.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:36:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 803.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:04:55] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10208921 (10phaultfinder) [23:12:58] FIRING: [8x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:15:54] (03CR) 10C. Scott Ananian: "The mobile flag in parser migration is /after/ everything else, so this is implicitly only true for places where we've rolled out parsoid " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [23:18:49] (03PS4) 10C. Scott Ananian: Turn on mobile support for Parsoid Read Views (but not on talk pages) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) [23:19:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) (owner: 10Jdlrobson) [23:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1078495 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1078495 (owner: 10TrainBranchBot) [23:51:24] (03PS2) 10Jdlrobson: Expand Vector 2022 roll out and support local variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) [23:52:02] (03CR) 10CI reject: [V:04-1] Expand Vector 2022 roll out and support 
local variants [mediawiki-config] - https://gerrit.wikimedia.org/r/1078454 (https://phabricator.wikimedia.org/T375549) (owner: Jdlrobson) [23:58:31] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown