[00:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[00:01:11] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:10:43] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:24:57] SRE, MediaWiki-Debug-Logger, observability, Observability-Logging, Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#10640178 (Pppery)
[00:38:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128057
[00:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128057 (owner: TrainBranchBot)
[00:49:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128057 (owner: TrainBranchBot)
[00:51:11] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:54:52] (CR) Tacsipacsi: search-redirect: Handle $_GET potential vulnerability scanning (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: Jforrester)
[01:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:49] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128059
[01:08:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128059 (owner: TrainBranchBot)
[01:28:15] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128059 (owner: TrainBranchBot)
[01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/8e93d155635316ec7caba9a8066787b4cd39e47b01b908a0d067cb574d23b01f/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:54:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640230 (phaultfinder)
[02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:13:43] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:42:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:27:57] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[03:54:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640249 (phaultfinder)
[04:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[04:54:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640274 (phaultfinder)
[05:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:27] (PS1) KartikMistry: Update cxserver to 2025-03-14-045617-production [deployment-charts] - https://gerrit.wikimedia.org/r/1128066 (https://phabricator.wikimedia.org/T382294)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:57] (PS1) KartikMistry: MinT: staging: Increase rediness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889)
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.14s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:35:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.14s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:40:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:42:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T0700). nyaa~
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:12:07] (PS1) Marostegui: installserver: Do not reimage db2243 [puppet] - https://gerrit.wikimedia.org/r/1128215
[07:12:17] (CR) Arnaudb: [C:+1] mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[07:13:34] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[07:14:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[07:15:09] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section test-s4
[07:16:41] SRE, Commons, Wikimedia-production-error: https://commons.wikimedia.org/w/index.php?curid=162194998 - URL shows an exception instead of either a file description page or a 404response if there was no page associated with the curid/mediaid - https://phabricator.wikimedia.org/T389031#10640405 (A_smart_k...
[07:16:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section test-s4
[07:17:10] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[07:17:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[07:19:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640415 (phaultfinder)
[07:20:42] (CR) Marostegui: [C:+2] installserver: Do not reimage db2243 [puppet] - https://gerrit.wikimedia.org/r/1128215 (owner: Marostegui)
[07:21:48] SRE, Commons, Wikimedia-production-error: Fatal exception of type "LogicException" when visiting some curid URLs on Wikimedia Commons - https://phabricator.wikimedia.org/T389031#10640416 (A_smart_kitten)
[07:24:00] (CR) Slyngshede: idp-test: add Phabricator test instance client (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[07:24:16] (CR) Slyngshede: "That would be me :-)" [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[07:27:10] SRE, Commons, Wikimedia-production-error: Fatal exception of type "LogicException" when visiting some curid URLs on Wikimedia Commons - https://phabricator.wikimedia.org/T389031#10640432 (A_smart_kitten)
[07:27:58] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[07:28:20] !log marostegui@cumin2002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[07:28:44] SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034 (MoritzMuehlenhoff) NEW
[07:29:04] (PS1) Filippo Giunchedi: base: don't show diff for phaste config [puppet] - https://gerrit.wikimedia.org/r/1128225
[07:29:06] !log marostegui@cumin2002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[07:32:16] (PS1) Filippo Giunchedi: prometheus: disable 'accelerator' cadvisor metric [puppet] - https://gerrit.wikimedia.org/r/1128319 (https://phabricator.wikimedia.org/T388632)
[07:37:41] (PS1) Filippo Giunchedi: pontoon: add hosts-for-role command [puppet] - https://gerrit.wikimedia.org/r/1128330
[07:39:14] (PS1) Muehlenhoff: Remove vrook from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1128334
[07:41:15] (CR) Filippo Giunchedi: [C:+2] pontoon: add hosts-for-role command [puppet] - https://gerrit.wikimedia.org/r/1128330 (owner: Filippo Giunchedi)
[07:47:01] (CR) Slyngshede: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1128334 (owner: Muehlenhoff)
[07:47:20] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1128322 (owner: Muehlenhoff)
[07:48:21] (Abandoned) Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:48:31] (Abandoned) Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:48:43] (Abandoned) Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:48:50] (CR) Muehlenhoff: [C:+2] Remove vrook from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1128334 (owner: Muehlenhoff)
[07:49:58] (Abandoned) Brouberol: airflow: mount the hadoop configuration in the webserver and scheduler pods [deployment-charts] - https://gerrit.wikimedia.org/r/1123527 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[07:50:42] (CR) Muehlenhoff: [C:+2] Add cn=bitu-account-managers to list of groups to drop on offboarding [puppet] - https://gerrit.wikimedia.org/r/1128322 (owner: Muehlenhoff)
[07:55:55] (PS4) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[07:56:18] (CR) CI reject: [V:-1] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[07:57:27] (CR) Brouberol: [C:+2] mediawiki: render configmaps when dumps are enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:57:34] (CR) CI reject: [V:-1] mediawiki: render configmaps when dumps are enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[08:00:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640484 (phaultfinder)
[08:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[08:02:57] (PS5) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:05:32] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5081/co" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:13:37] (CR) Elukey: [C:+1] "LGTM, I don't see anything weird in the config. I think that we should coordinate on the deployment procedure, so that we can test properl" [puppet] - https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: Simon04)
[08:15:14] (CR) Arnaudb: [C:+2] nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:16:40] (CR) Elukey: [C:+1] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:16:46] (CR) Elukey: [C:+1] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:16:55] (CR) Elukey: [C:+1] services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:16:56] (PS1) Tiziano Fogli: nrpe/monitoring-plugins-standard: fix deps [puppet] - https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680)
[08:17:11] (CR) Elukey: [C:+2] Revert "Temporary revert changeprop/changeprop-jobqueue to node 18 images" [deployment-charts] - https://gerrit.wikimedia.org/r/1127939 (owner: Aaron Schulz)
[08:17:40] sre-alert-triage, serviceops: Alert in need of triage: WidespreadPuppetFailure - https://phabricator.wikimedia.org/T389037 (LSobanski) NEW
[08:17:48] (PS6) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:18:42] sre-alert-triage, serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T389038 (LSobanski) NEW
[08:19:12] (PS1) Arnaudb: Revert "nftables: add a newline at the end of GERRIT_ABUSERS lines" [puppet] - https://gerrit.wikimedia.org/r/1128337
[08:19:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640549 (phaultfinder)
[08:20:02] (CR) Brouberol: [V:+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:20:47] (CR) Brouberol: [V:+1] "@btullis@wikimedia.org @xcollazo@wikimedia.org I took the liberty to rework the patch to make sure that hive-site.xml shows the required d" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:22:33] (CR) Arnaudb: [C:+2] Revert "nftables: add a newline at the end of GERRIT_ABUSERS lines" [puppet] - https://gerrit.wikimedia.org/r/1128337 (owner: Arnaudb)
[08:23:04] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[08:23:14] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[08:23:29] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[08:23:39] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:28:45] (PS2) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[08:30:23] !log updated bookworm installer image to Bookworm 12.10 T389034
[08:30:24] (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[08:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:27] T389034: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034
[08:39:35] (Abandoned) Brouberol: mediawiki: render configmaps when dumps are enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[08:41:15] (PS1) Brouberol: mediawiki-dumps-legacy: enabled the dumps feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1128339 (https://phabricator.wikimedia.org/T388378)
[08:42:24] (CR) Majavah: "fwiw, this would've worked with double quotes instead of single ones" [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:43:32] (CR) Arnaudb: [C:+2] "TIL, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:45:16] (PS7) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:45:58] (PS2) Arnaudb: nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - https://gerrit.wikimedia.org/r/1128338 (https://phabricator.wikimedia.org/T388783)
[08:45:58] (CR) Arnaudb: "Given https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127527/comments/c1e73adc_1c086bad I can also redo I6fecea2faa7774b21e68fdf70ce" [puppet] - https://gerrit.wikimedia.org/r/1128338 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:46:18] RECOVERY - Disk space on maps1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops
[08:46:32] !log freed 28G of disk space on maps1009
[08:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:06] (PS3) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[08:48:19] (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[08:49:03] (PS1) Brouberol: xmldumps-backup: enable the worker script to be called from any path [dumps] - https://gerrit.wikimedia.org/r/1128342 (https://phabricator.wikimedia.org/T388378)
[08:50:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:28] (CR) Slyngshede: [C:+2] P:firewall absent conntrack_table_size monitoring. [puppet] - https://gerrit.wikimedia.org/r/1126503 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:55:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:17] sre-alert-triage, serviceops: Alert in need of triage: WidespreadPuppetFailure - https://phabricator.wikimedia.org/T389037#10640633 (MoritzMuehlenhoff) Open→Declined This is caused by WIP setup nodes for the parallel Bookworm cluster, but not affecting any production workloads.
[08:59:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640636 (phaultfinder)
[09:00:26] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:36] SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10640642 (MoritzMuehlenhoff)
[09:04:40] SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10640643 (MoritzMuehlenhoff) p:Triage→Medium
[09:11:28] (PS1) Elukey: service: move kartotherian-k8s-ssl fully on k8s [puppet] - https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926)
[09:14:03] (CR) JMeybohm: [C:+2] shellbox-video: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:14:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[09:15:26] RESOLVED: [4x] SystemdUnitFailed: export_smart_data_dump.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:32] (CR) Tiziano Fogli: "AFAICS, you don't have enough points to trigger the alert with an evaluation interval of 2 minutes." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:16:56] (PS1) Elukey: service: set kartotherian and kartotherian-ssl to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1128344 (https://phabricator.wikimedia.org/T389042)
[09:16:58] (PS1) Elukey: service: set kartotherian and kartotherian-ssl to service_setup [puppet] - https://gerrit.wikimedia.org/r/1128345 (https://phabricator.wikimedia.org/T389042)
[09:17:01] (PS1) Elukey: service, conftool-data: final removal for unused Kartotherian configs [puppet] - https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042)
[09:19:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[09:23:31] (PS4) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[09:23:59] (CR) JMeybohm: [C:+2] services/*: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:24:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640704 (phaultfinder)
[09:24:42] (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:24:58] (CR) Ayounsi: "thanks, I tried it locally, but still seeing the same error." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:25:15] !log installing intel-microcode security updates
[09:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:33] (PS1) Elukey: maps: remove Kartotherian from bare metal nodes [puppet] - https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042)
[09:28:28] (PS1) Kamila Součková: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T341984)
[09:28:30] (PS1) Kamila Součková: Update wikikube-staging codfw pod ip range [puppet] - https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T386232)
[09:28:50] (PS5) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[09:30:09] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680) (owner: Tiziano Fogli)
[09:30:57] (CR) Btullis: [C:+1] "Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1128339 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:31:02] (Merged) jenkins-bot: shellbox-video: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:31:11] (CR) Ayounsi: [C:+1] Support setting custom arp-policer on CR interfaces [homer/public] - https://gerrit.wikimedia.org/r/1127592 (https://phabricator.wikimedia.org/T384774) (owner: Cathal Mooney)
[09:32:05] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: enabled the dumps feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1128339 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:32:11] (CR) Ayounsi: "Not sure why local docker CI doesn't behave. But at long as it passes here..." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:32:18] (CR) Btullis: [C:+1] xmldumps-backup: enable the worker script to be called from any path [dumps] - https://gerrit.wikimedia.org/r/1128342 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:33:56] (PS1) Marostegui: db-production.php: Disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1128351 (https://phabricator.wikimedia.org/T388626)
[09:34:09] (PS1) Kamila Součková: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045)
[09:35:23] (CR) TrainBranchBot: [C:+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128351 (https://phabricator.wikimedia.org/T388626) (owner: Marostegui)
[09:36:19] (CR) Brouberol: [C:+2] xmldumps-backup: enable the worker script to be called from any path [dumps] - https://gerrit.wikimedia.org/r/1128342 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:36:46] (CR) JMeybohm: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[09:36:49] (Merged) jenkins-bot: db-production.php: Disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1128351 (https://phabricator.wikimedia.org/T388626) (owner: Marostegui)
[09:37:28] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]]
[09:37:32] T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[09:38:12] (CR) JMeybohm: [C:+1] Update wikikube-staging codfw pod ip range [puppet] - https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T386232) (owner: Kamila Součková)
[09:38:14] (PS2) Kamila Součková: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045)
[09:38:58] (PS3) Kamila Součková: Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045)
[09:39:08] (Merged) jenkins-bot: services/*: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:39:58] (CR) JMeybohm: "We need to change admission_plugins as well, see:" [puppet] - https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[09:40:54] (CR) Vgutierrez: [C:+2] site,hiera: Reimage lvs3010 as liberica [puppet] - https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[09:40:57] (PS2) Kamila Součková: Update wikikube-staging eqiad pod ip range [puppet] - https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T389045)
[09:43:13] (CR) Vgutierrez: [C:+2] "thanks for the thorough reviews <3" [cookbooks] - https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[09:45:42] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3010.esams.wmnet with OS bookworm
[09:47:10] (PS4) Kamila Součková: Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045)
[09:49:37] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:49:57] (Merged) jenkins-bot: sre.loadbalancer: upgrade/restart cookbook for liberica [cookbooks] - https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: Vgutierrez)
[09:50:21] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:50:22] ^^ BGP alert is lvs3010 getting reimaged
[09:50:25] T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[09:50:26] !log marostegui@deploy2002 marostegui: Continuing with sync
[09:51:20] (PS2) Kamila Součková: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045)
[09:51:32] (CR) Kamila Součková: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[09:55:17] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs4010.ulsfo.wmnet} and A:liberica
[09:55:51] (CR) Btullis: Enable lock transaction management in the hive metastore on hadoop_test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[09:56:20] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.upgrade (exit_code=1) upgradeing P{lvs4010.ulsfo.wmnet} and A:liberica
[09:57:05] (CR) Ayounsi: [C:+1] "Nice!" [cookbooks] - https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1000)
[10:00:54] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]] (duration: 23m 25s)
[10:00:58] T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[10:01:28] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es6
[10:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:02:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es6
[10:02:33] (CR) Alexandros Kosiaris: [C:-1] "Couple of comments inline."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry) [10:04:57] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128356 [10:05:06] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es7 [10:05:38] (03PS1) 10Muehlenhoff: Bitu: Add approval config for airflow-research-ops [puppet] - 10https://gerrit.wikimedia.org/r/1128357 [10:05:38] (03PS1) 10Muehlenhoff: Bitu: Add obsolete test config [puppet] - 10https://gerrit.wikimedia.org/r/1128358 [10:06:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es7 [10:06:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128356 (owner: 10Marostegui) [10:06:25] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1 [10:06:58] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128356 (owner: 10Marostegui) [10:07:10] (03PS1) 10Kamila Součková: Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) [10:07:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1 [10:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:07:17] !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]] [10:07:51] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s6 [10:09:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s6 [10:09:35] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s7 [10:09:38] (03PS2) 10Kamila Součková: Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) [10:09:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [10:10:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s7 [10:10:51] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s8 [10:10:55] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [10:10:59] (03PS1) 10Vgutierrez: sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) [10:11:18] (03CR) 10JMeybohm: [C:03+1] Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [10:11:42] (03CR) 
10JMeybohm: [C:03+1] admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [10:11:45] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:11:49] (03CR) 10JMeybohm: [C:03+1] Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [10:12:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s8 [10:12:53] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s5 [10:13:49] !log marostegui@deploy2002 marostegui: Continuing with sync [10:14:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s5 [10:14:28] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s4 [10:15:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s4 [10:16:10] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s3 [10:17:03] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [10:17:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s3 [10:17:29] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from 
codfw to eqiad for section s2 [10:18:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s2 [10:19:29] !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s1 [10:21:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s1 [10:21:58] !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]] (duration: 14m 41s) [10:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:23:12] (03PS1) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) [10:26:22] (03PS1) 10Muehlenhoff: preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) [10:26:33] (03PS2) 10Muehlenhoff: preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) [10:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:27:43] (03PS3) 
10Muehlenhoff: preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) [10:28:21] (03CR) 10Elukey: [C:03+1] preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) (owner: 10Muehlenhoff) [10:29:34] (03PS2) 10Vgutierrez: sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) [10:29:55] (03CR) 10CI reject: [V:04-1] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:30:20] (03CR) 10Muehlenhoff: [C:03+2] preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) (owner: 10Muehlenhoff) [10:30:44] jouncebot: nowandnext [10:30:44] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1000) [10:30:45] In 2 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300) [10:30:47] (03CR) 10Volans: [C:03+1] "Better :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [10:32:10] * MichaelG_WMF is interested in running a few low-risk maintenance scripts to clean up a few GrowthExperiments tables before the backport window, but nothing urgent [10:32:47] (03PS2) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) [10:32:50] (03PS1) 10Ladsgroup: media: Make SvgHandler respect physicalWidth when building URL for thumb [core] (wmf/1.44.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1128365 (https://phabricator.wikimedia.org/T360589) [10:32:57] (03PS1) 10Ladsgroup: findBadBlobs: Allow for timestamp based search via --scan-to [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) [10:33:26] (03CR) 10Ladsgroup: [C:03+2] media: Make SvgHandler respect physicalWidth when building URL for thumb [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128365 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:33:31] (03CR) 10Ladsgroup: [C:03+2] findBadBlobs: Allow for timestamp based search via --scan-to [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [10:37:22] !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs3010.esams.wmnet with OS bookworm [10:38:01] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3010.esams.wmnet with OS bookworm [10:38:54] (03PS3) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) [10:39:33] (03PS1) 10Brouberol: airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 [10:39:33] (03PS1) 10Brouberol: airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) [10:41:59] (03PS8) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [10:42:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:56] (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [10:43:57] !log restarting dbprov2005 T389052 [10:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:02] T389052: dbprov2005 lost network link - https://phabricator.wikimedia.org/T389052 [10:44:06] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [10:44:46] (03PS2) 10Majavah: P:wmcs: wikireplicas: Drop module_deps view [puppet] - 10https://gerrit.wikimedia.org/r/1128014 (https://phabricator.wikimedia.org/T388982) [10:44:49] (03CR) 10Ladsgroup: [V:03+2 C:03+2] P:wmcs: wikireplicas: Drop module_deps view [puppet] - 10https://gerrit.wikimedia.org/r/1128014 (https://phabricator.wikimedia.org/T388982) (owner: 10Majavah) [10:44:50] jouncebot: nowandnext [10:44:50] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1000) [10:44:50] In 2 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300) [10:45:10] (03PS2) 10Majavah: P:wmcs: wikireplicas: Fix fr_actor not being exposed [puppet] - 10https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491) [10:45:17] (03CR) 10Ladsgroup: [V:03+2 C:03+2] P:wmcs: wikireplicas: Fix fr_actor not being exposed [puppet] - 10https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491) (owner: 10Majavah) [10:45:40] 
(03PS1) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1128369 (https://phabricator.wikimedia.org/T388388) [10:45:42] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [10:46:58] (03CR) 10Kamila Součková: [C:03+1] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:46:59] (03PS9) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [10:47:18] * Dreamy_Jazz Is interested in doing a few config changes and a maintenance script run for https://phabricator.wikimedia.org/T387205 [10:47:21] (03CR) 10JMeybohm: [C:03+2] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:47:32] (03Merged) 10jenkins-bot: media: Make SvgHandler respect physicalWidth when building URL for thumb [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128365 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:47:58] (03PS1) 10Volans: setup.py: limit kafka-python version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128370 [10:47:58] (03PS1) 10Volans: constants: replace path to old Puppet CA [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128371 [10:48:00] (03CR) 10Elukey: [C:03+1] "I like the approach and it is way more simpler than the introduction of the ACLs. 
We should be careful in rolling out this change but it i" [puppet] - 10https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [10:48:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [10:49:09] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [10:49:28] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10641020 (10MoritzMuehlenhoff) [10:49:35] (03Merged) 10jenkins-bot: sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez) [10:49:37] (03CR) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [10:50:14] (03Merged) 10jenkins-bot: findBadBlobs: Allow for timestamp based search via --scan-to [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [10:50:33] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] [10:50:38] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - 
https://phabricator.wikimedia.org/T351953 [10:50:38] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:51:34] Amir1: Will you be done backporting after this sync-world? I would like to do some backporting after you if there is time. [10:52:06] sure. I have an extra deploy after this but it can wait for a bit (and should) [10:53:01] (03PS1) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 [10:53:29] (03CR) 10CI reject: [V:04-1] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [10:53:35] (03Merged) 10jenkins-bot: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:54:37] (03PS2) 10Brouberol: airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) [10:54:38] (03PS1) 10Brouberol: airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 [10:54:42] Dreamy_Jazz: would you mind adding this noop patch to your deploys too? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1127894 totally fine if not possible [10:55:03] (03PS2) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 [10:55:11] Sure, I can do that. 
[10:55:11] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:55:45] !log kamila@cumin1002 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster staging-eqiad: k8s upgrade [10:56:00] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641045 (10jcrespo) [10:56:21] (03PS1) 10Esanders: VE: Disable upcoming mobile insert menu everywhere except test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) [10:56:27] (03PS3) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 [10:56:29] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641051 (10jcrespo) Probably a loose cable. If not, a card failure (doesn't look like it from the mgmt log) or a switch port misconfig/issue. [10:56:35] Thanks! 
[10:56:37] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [10:56:41] (03PS2) 10Brouberol: airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 [10:56:56] (03CR) 10CI reject: [V:04-1] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [10:57:01] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage [10:57:36] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [10:58:12] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs5006.eqsin.wmnet} and A:liberica [10:58:12] (03PS1) 10Dreamy Jazz: Re-enable the 'temporary-account-viewer' group for migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205) [10:58:14] (03PS1) 10Dreamy Jazz: Unset the old 'checkuser-temporary-account-viewer' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205) [10:59:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs5006.eqsin.wmnet} and A:liberica [10:59:33] (03PS1) 10Ladsgroup: Bump thumbnail steps ratio to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) [10:59:39] (03PS4) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 [10:59:41] volans: ^^ now it worked as expected :D [10:59:55] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [10:59:58] 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641089 (10jcrespo) 
[11:00:51] (03PS2) 10Dreamy Jazz: Re-enable the 'temporary-account-viewer' group for migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205) [11:01:12] (03CR) 10Btullis: [C:03+1] airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 (owner: 10Brouberol) [11:01:41] (03CR) 10Kamila Součková: [C:03+2] Update wikikube-staging eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [11:01:45] (03CR) 10Kamila Součková: [C:03+2] Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [11:01:50] (03CR) 10Kamila Součková: [C:03+2] Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [11:01:53] /13/13 [11:01:56] (03CR) 10Kamila Součková: [C:03+2] admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [11:01:58] uff err :) [11:02:09] elukey: feeling lucky today? 
[11:02:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage [11:02:41] vgutierrez: nice [11:02:43] vgutierrez: not so much :D [11:02:58] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster1003.eqiad.wmnet are marked down but pooled: k8s-ingress-staging_30443: Servers kubestage1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:02:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster1004.eqiad.wmnet are marked down but pooled: k8s-ingress-staging_30443: Servers kubestage1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:03:15] this is kamila_ and me, updating staging-eqiad [11:04:39] (03CR) 10Filippo Giunchedi: [C:03+1] nrpe/monitoring-plugins-standard: fix deps [puppet] - 10https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680) (owner: 10Tiziano Fogli) [11:04:40] vgutierrez: I just deployed a fix that makes svg files also respect steps (which they didn't because I was too stupid to test it for svg files), they should get a bump to 10%, I will also bump to 15% a bit later (the same stuff, a 5% bump every day) [11:04:42] FIRING: JobUnavailable: Reduced availability for job liberica in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:43] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] (duration: 14m 09s) [11:04:47] T351953: Various old 
revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [11:04:48] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:04:49] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[5005-5006].eqsin.wmnet} and A:liberica [11:04:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [11:04:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127894 (https://phabricator.wikimedia.org/T373037) (owner: 10Hashar) [11:05:04] (03PS5) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 [11:05:08] Amir1: cool :D [11:05:11] Dreamy_Jazz: the floor, as you can see, is yours [11:05:20] Thanks! 
[11:05:24] vgutierrez: let me know if upload starts to struggle [11:05:39] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [11:05:46] (03Merged) 10jenkins-bot: Re-enable the 'temporary-account-viewer' group for migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [11:05:48] (03Merged) 10jenkins-bot: Remove obsolete $wgParserCacheNewKeySchemaRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127894 (https://phabricator.wikimedia.org/T373037) (owner: 10Hashar) [11:06:07] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1128375|Re-enable the 'temporary-account-viewer' group for migration (T387205)]], [[gerrit:1127894|Remove obsolete $wgParserCacheNewKeySchemaRatio (T373037)]] [11:06:12] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [11:06:12] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:06:18] (03CR) 10Filippo Giunchedi: [C:03+2] site: provision prometheus100[78] with role prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:06:41] (03Merged) 10jenkins-bot: Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [11:07:06] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs[5005-5006].eqsin.wmnet} and A:liberica [11:07:09] 10 [11:07:24] (03Merged) 10jenkins-bot: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková) [11:08:25] (03PS1) 10Kevin Bazira: 
ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) [11:08:48] (03CR) 10Elukey: [C:03+1] setup.py: limit kafka-python version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128370 (owner: 10Volans) [11:09:36] (03CR) 10Elukey: [C:03+1] constants: replace path to old Puppet CA [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128371 (owner: 10Volans) [11:10:25] !log dreamyjazz@deploy2002 dreamyjazz, hashar: Backport for [[gerrit:1128375|Re-enable the 'temporary-account-viewer' group for migration (T387205)]], [[gerrit:1127894|Remove obsolete $wgParserCacheNewKeySchemaRatio (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:10:28] !log dreamyjazz@deploy2002 dreamyjazz, hashar: Continuing with sync [11:11:05] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-magru [11:11:21] I will run the migrateUserGroup.php maintenance script and then deploy another config patch shortly after that [11:11:22] (03PS2) 10Btullis: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) [11:11:36] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:52] Doing that to avoid creating too many translation errors caused by two groups having the same display name being defined at the same time. [11:12:34] So an increase in "group0 has the same name as group1" errors in logstash is expected until I've finished [11:13:13] It importantly won't cause any problems for the end user, just an increased rate of logs in logstash. 
[11:13:31] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-magru [11:14:13] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:14:42] RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:14:49] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:15:16] (03CR) 10Elukey: [C:03+1] "@jgiannelos@wikimedia.org lemme know if you have concerns about it. I don't think it should change much for Kartotherian, but we'd probabl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: 10DCausse) [11:16:35] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[4008-4009].ulsfo.wmnet,lvs5004.eqsin.wmnet} and A:liberica [11:17:38] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:17:47] (03CR) 10Zoe: [C:03+1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127571 (owner: 10PipelineBot) [11:18:40] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128375|Re-enable the 'temporary-account-viewer' group for migration (T387205)]], [[gerrit:1127894|Remove obsolete $wgParserCacheNewKeySchemaRatio (T373037)]] (duration: 12m 32s) [11:18:45] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [11:18:45] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:19:04] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade 
(exit_code=0) upgradeing P{lvs[4008-4009].ulsfo.wmnet,lvs5004.eqsin.wmnet} and A:liberica [11:19:20] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:19:31] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:19:33] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:19:36] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:57] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:20:30] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:20:38] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128381 [11:20:42] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:20:49] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:21:09] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:21:27] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:21:59] 06SRE, 10SRE-swift-storage, 07Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10641179 (10MatthewVernon) It's not the same issue - those two files have thumbs in different containers (`wikipedia-commons-local-thumb.c6` a... [11:22:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3010.esams.wmnet with OS bookworm [11:22:22] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. 
[11:23:32] !log Ran `mwscript migrateUserGroup.php --wiki=testwiki checkuser-temporary-account-viewer temporary-account-viewer` [11:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:47] (03CR) 10Máté Szabó: [C:03+1] Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [11:25:41] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs3009 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) [11:26:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [11:26:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10641197 (10phaultfinder) [11:29:35] (03CR) 10Volans: [C:03+2] setup.py: limit kafka-python version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128370 (owner: 10Volans) [11:30:02] (03CR) 10Volans: [C:03+2] constants: replace path to old Puppet CA [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128371 (owner: 10Volans) [11:30:03] Running `mwscript migrateUserGroup.php --wiki=X checkuser-temporary-account-viewer temporary-account-viewer` for all wikis with temporary accounts enabled or known (testwiki, loginwiki, test2wiki, metawiki, cswikiversity, igwiki, itwikiquote, swwiki, shwiki, fawiktionary, jawikibooks, zh_yuewiki, dawiki, srwiki, rowiki, nowiki, metawiki) [11:30:09] !log Running `mwscript migrateUserGroup.php --wiki=X checkuser-temporary-account-viewer temporary-account-viewer` for all wikis with temporary accounts enabled or known (testwiki, loginwiki, test2wiki, metawiki, cswikiversity, igwiki, itwikiquote, swwiki, shwiki, fawiktionary, jawikibooks, zh_yuewiki, dawiki, srwiki, rowiki, nowiki, metawiki) [11:30:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:08] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:32:52] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:33:38] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:33:56] The migrate script is going to take a while for metawiki, so if anyone wants to deploy in the meantime, feel free. [11:35:36] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:35:38] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:35:43] 06SRE, 10Thumbor: Thumbnail failures on some SVGs - https://phabricator.wikimedia.org/T389060 (10MatthewVernon) 03NEW [11:36:07] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:36:08] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:36:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:36:17] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:36:20] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:36:30] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:36:52] @Dreamy_Jazz I have a maintenance script to run to clean up a GrowthExperiments table for eswiki+ptwiki+idwiki+arzwiki. Should be low risk (have run that exact same script for others already without any trouble) - that ok? 
[11:36:58] (03CR) 10Clément Goubert: [C:03+1] switchdc: delete Job objects for mw-cron due to library support [cookbooks] - 10https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [11:36:59] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:37:25] MichaelG_WMF: Yeah, should be fine to run that. [11:37:39] 👍 [11:37:39] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:37:58] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:38:10] Especially as the wikis you're running it on don't overlap with the ones in my maintenance script run [11:38:20] !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:38:31] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:38:41] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [11:38:43] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:38:51] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:39:36] !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=eswiki --db-table --verbose --force 2>&1 | tee ~/eswiki-dbtable.txt` [11:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:23] (03Merged) 10jenkins-bot: setup.py: limit kafka-python version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128370 (owner: 10Volans) [11:40:24] (03Merged) 10jenkins-bot: constants: replace path to old Puppet CA [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128371 (owner: 10Volans) [11:40:36] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:40:38] (03CR) 10Tiziano Fogli: [C:03+2] nrpe/monitoring-plugins-standard: fix deps [puppet] - 10https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680) (owner: 10Tiziano Fogli) [11:41:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:41:45] !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=ptwiki --db-table --verbose --force 2>&1 | tee ~/ptwiki-dbtable.txt` [11:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [11:43:29] !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=idwiki --db-table --verbose --force 2>&1 | tee ~/idwiki-dbtable.txt` [11:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:34] !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:44:06] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:44:16] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:44:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster staging-eqiad: k8s upgrade [11:45:03] !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=arzwiki --db-table --verbose --force 2>&1 | tee ~/arzwiki-dbtable.txt` [11:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:58] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [11:46:00] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:47:04] !log `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=arzwiki --search-index --verbose 2>&1 | tee ~/arzwiki-searchindex.txt` [11:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:08] (03PS1) 10Vgutierrez: hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) [11:48:35] (03PS2) 10Vgutierrez: hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) [11:49:04] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/apertium: apply [11:49:07] !log `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=shwiki --search-index --verbose 2>&1 | tee ~/shwiki-searchindex.txt` [11:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:21] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/api-gateway: apply [11:49:22] (03CR) 10Jgiannelos: [C:03+2] changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: 10Jgiannelos) [11:49:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [11:50:27] Alright, I ran all the scripts I wanted to run, and I think I'm done. 
[11:50:46] (03Merged) 10jenkins-bot: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: 10Jgiannelos) [11:51:24] FIRING: [2x] ProbeDown: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:26] MichaelG_WMF: Can I encourage you to try and run them with mwscript-k8s if possible next time? [11:52:11] claime: Can I do that by now as someone who is not a deployer and is only part of the `restricted` group? [11:52:29] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/changeprop: apply [11:52:32] MichaelG_WMF: Ah, let me check, I think that's been fixed, or at least is in the process of being fixed [11:52:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [11:52:46] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/changeprop-jobqueue: apply [11:52:48] claime: if that is possible by now, then I would be happy to be pointed to some tutorial or get some training around k8s [11:53:11] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/chart-renderer: apply [11:53:21] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/cirrus-streaming-updater: apply [11:53:43] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/citoid: apply [11:53:57] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/commons-impact-analytics: apply [11:54:28] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/cxserver: apply [11:54:53] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/data-gateway: apply [11:55:11] !log kamila@deploy2002 helmfile [staging] OK 
helmfile.d/services/developer-portal: apply [11:55:27] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/device-analytics: apply [11:55:45] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/echostore: apply [11:55:45] (03CR) 10KartikMistry: "Noted. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry) [11:56:04] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/edit-analytics: apply [11:56:24] RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:25] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/editor-analytics: apply [11:56:54] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-analytics: apply [11:57:24] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-analytics-external: apply [11:57:57] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-logging-external: apply [11:58:13] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-main: apply [11:58:28] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventstreams: apply [11:59:04] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventstreams-internal: apply [11:59:20] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/geo-analytics: apply [11:59:39] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/image-suggestion: apply [12:00:21] (03PS2) 10KartikMistry: MinT: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889) [12:01:04] (03PS1) 
10Btullis: dumps: Stop using the analytics replicas for misc dumps [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) [12:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [12:01:28] MichaelG_WMF: It would seem like mwscript-k8s isn't completely ready for restricted users afaict, sorry for pinging you about it [12:01:48] (03PS2) 10Btullis: dumps: Stop using the analytics replicas for misc dumps [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) [12:02:01] claime: All good, I look forward to it when it is ready :) [12:02:03] Doc's here if you fancy a read https://wikitech.wikimedia.org/wiki/Mwscript-k8s [12:02:16] 👀 [12:02:27] (03CR) 10KartikMistry: MinT: staging: Increase rediness probe (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry) [12:02:58] (03PS2) 10KartikMistry: MinT: staging: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) [12:04:45] (03PS1) 10Jgiannelos: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 [12:06:43] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) (owner: 10Btullis) [12:07:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:09:10] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/linkrecommendation: apply [12:13:45] (03CR) 10Jgiannelos: [C:03+1] "We can deploy on staging, run a difftest with prod, then deploy to prod." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: 10DCausse) [12:14:54] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [12:15:15] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [12:15:27] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/kartotherian: apply [12:16:04] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:16:06] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/mathoid: apply [12:16:08] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:16:21] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/media-analytics: apply [12:17:09] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/miscweb: apply [12:17:33] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/mobileapps: apply [12:17:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:17:47] !log kamila@deploy2002 helmfile [staging] OK 
helmfile.d/services/page-analytics: apply [12:18:04] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/proton: apply [12:18:37] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/push-notifications: apply [12:18:53] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/ratelimit: apply [12:19:04] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/rdf-streaming-updater: apply [12:19:28] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/recommendation-api: apply [12:19:43] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/rest-gateway: apply [12:20:01] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/sessionstore: apply [12:20:35] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox: apply [12:20:37] (03CR) 10Jforrester: "As written, this disables it everywhere including test2wiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) (owner: 10Esanders) [12:20:53] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-constraints: apply [12:21:16] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-media: apply [12:21:44] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-syntaxhighlight: apply [12:22:06] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-timeline: apply [12:22:36] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-video: apply [12:23:14] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/termbox: apply [12:23:25] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/thumbor: apply [12:23:40] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/toolhub: apply [12:24:01] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/wikidata-query-gui: apply [12:24:14] (03CR) 10Sérgio Lopes: [C:03+1] "lgtm" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos) [12:24:24] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/wikifeeds: apply [12:24:57] (03CR) 10Jgiannelos: [C:03+2] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos) [12:25:08] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/wikifunctions: apply [12:25:32] !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/zotero: apply [12:25:33] (03PS3) 10Clément Goubert: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [12:26:31] (03Merged) 10jenkins-bot: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos) [12:26:50] btullis: sorry about that, my CRs were not up to date and I thought the feature flags were not merged [12:27:07] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [12:27:13] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [12:30:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172) (owner: 10Anzx) [12:32:19] (03CR) 10Ayounsi: "Maybe the end of fixes like I3c2da307ed6059fab5581a26b36d249d6cf9ddb6 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [12:33:26] (03PS3) 10Btullis: dumps: Stop using the analytics replicas for misc dumps [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) [12:34:46] (03PS1) 10Slyngshede: Handle empty query on block user page [software/bitu] - 10https://gerrit.wikimedia.org/r/1128399 (https://phabricator.wikimedia.org/T385947) [12:36:58] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, and 2 others: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10641387 (10BTullis) Hello. Just FYI, we are planning to switch snapshot1016 back to using the core database serv... [12:37:25] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128358 (owner: 10Muehlenhoff) [12:38:27] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [12:38:33] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [12:40:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [12:41:08] (03CR) 10Slyngshede: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1128357 (owner: 10Muehlenhoff) [12:44:22] 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065 (10BTullis) 03NEW [12:44:52] 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#10641421 (10BTullis) [12:50:47] (03CR) 10Gkyziridis: [C:03+1] "Thank you Kevin." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [12:50:56] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:51:42] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 464776160 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:58] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:52:19] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:52:42] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:53:16] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:53:40] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:54:13] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:54:22] (03PS1) 10Ayounsi: Sandbox vlan, allow return http(s) monitoring traffic [homer/public] - 10https://gerrit.wikimedia.org/r/1128401 (https://phabricator.wikimedia.org/T388419) [12:55:39] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review, George!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [12:56:53] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse) [12:57:24] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [12:57:36] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira) [12:57:50] (03CR) 10Btullis: [C:03+1] airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 (owner: 10Brouberol) [12:58:29] (03CR) 10Btullis: [C:03+2] Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (https://phabricator.wikimedia.org/T388472) (owner: 10Pppery) [12:59:03] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:00:03] (03PS1) 10Gergő Tisza: Revert "Disable new WebAuthn credentials creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300). [13:00:05] tgr, MichaelG_WMF, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:07] (03CR) 10Clément Goubert: [C:04-1] "@dzahn@wikimedia.org Can you add an httpbb test for that redirect to `modules/profile/files/httpbb/appserver/test_redirects.yaml` like the" [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [13:00:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [13:00:18] * MichaelG_WMF is here [13:00:26] o/ [13:00:34] (just added one more patch) [13:00:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [13:00:46] I guess I should deploy since most patches are mine [13:00:50] My change only affects a maintenance script that runs on an hourly timer - nothing to directly test [13:00:53] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T389038#10641477 (10Clement_Goubert) →14Duplicate dup:03T383032 [13:01:01] @tgr_ Thank you :) [13:01:15] o/ [13:02:30] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:02:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:03:21] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is 
CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-04-18 13:03:09. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:03:21] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-04-18 13:03:09. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [13:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172) (owner: 10Anzx) [13:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [13:05:59] (03Merged) 10jenkins-bot: Revert "Disable new WebAuthn credentials creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [13:06:02] (03Merged) 10jenkins-bot: sqwiktionary: update logo, wordmark, tagline and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172) (owner: 10Anzx) [13:06:04] (03Merged) 10jenkins-bot: Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [13:06:20] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 
53656 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:22] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] [13:06:31] T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [13:06:32] T389064: Notify WebAuthn users about SUL3 changes - https://phabricator.wikimedia.org/T389064 [13:06:32] T342172: Icons: sqwiktionary logo icon should be localized to language - https://phabricator.wikimedia.org/T342172 [13:06:32] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [13:07:21] (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs3009 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:08:00] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs3009.esams.wmnet with reason: depooled before reimage [13:08:03] (03CR) 10Ssingh: [C:03+1] hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [13:08:06] !log depooling lvs3009 before being reimaged - T384477 [13:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:11] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [13:10:07] (03CR) 10Gergő Tisza: [C:03+2] Enable credentials change 
special pages on SUL3 shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127965 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza) [13:10:17] !log tgr@deploy2002 tgr, migr, anzx: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:30] tgr_: looking [13:11:03] all good from my side, nothing to actively test [13:11:08] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs3009 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [13:11:31] tgr_: looks good [13:11:43] !log tgr@deploy2002 tgr, migr, anzx: Continuing with sync [13:11:44] (03PS2) 10Clément Goubert: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) [13:11:55] (03CR) 10Btullis: [C:03+1] "Good stuff, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:12:38] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:03] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 (10ayounsi) 03NEW p:05Triage→03High [13:13:45] (03CR) 10Btullis: [C:03+1] Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:14:13] !log ayounsi@cumin1002 START - Cookbook sre.network.debug for Netbox interface ID 20595 [13:14:25] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID 20595 [13:14:45] (03PS1) 10Slyngshede: P:firewall remove connection tracking monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694) [13:17:15] (03CR) 10MSantos: [C:03+1] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos) [13:17:57] echo 'https://en.wikipedia.org/static/images/icons/sqwiktionary.svg' | mwscript purgeList.php [13:17:57] echo 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-tagline-sq.svg' | mwscript purgeList.php [13:17:57] echo 'https://en.wikipedia.org/static/images/project-logos/sqwiktionary.png' | mwscript purgeList.php [13:17:57] echo 'https://en.wikipedia.org/static/images/project-logos/sqwiktionary-1.5x.png' | mwscript purgeList.php [13:17:57] echo 'https://en.wikipedia.org/static/images/project-logos/sqwiktionary-2x.png' | mwscript purgeList.php [13:17:59] (03PS1) 10Muehlenhoff: Double conntrack table size on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1128406 [13:18:42] !log 
vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3009.esams.wmnet with OS bookworm [13:18:57] echo 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-tagline-sq.svg' | mwscript purgeList.php [13:19:12] anzx: should I run those? or are you doing it? [13:19:27] tgr_: please run those [13:19:49] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] (duration: 13m 27s) [13:19:56] T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [13:19:56] T389064: Notify WebAuthn users about SUL3 changes - https://phabricator.wikimedia.org/T389064 [13:19:56] T342172: Icons: sqwiktionary logo icon should be localized to language - https://phabricator.wikimedia.org/T342172 [13:19:57] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [13:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:20:43] (03Merged) 10jenkins-bot: Enable credentials change special pages on SUL3 shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127965 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza) [13:22:20] anzx: done [13:22:52] tgr_: thank you for deploying [13:23:01] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127965|Enable 
credentials change special pages on SUL3 shared domain (T362715)]] [13:23:05] T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715 [13:23:17] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10641632 (10ayounsi) [13:24:37] jouncebot: nowandnext [13:24:37] For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300) [13:24:37] In 2 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530) [13:24:55] tgr_: hii, let me know when you're done [13:25:01] (03CR) 10Vgutierrez: [C:03+2] hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [13:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:25:33] !log upgrading HAProxy to version 3.1 in cp5032 (upload) - T386796 [13:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:37] T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796 [13:25:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:25:48] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [13:25:57] 
10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [13:26:02] Amir1: let me know if it's urgent, I'll be deploying for a while [13:26:17] nah, it's not [13:27:33] !log tgr@deploy2002 tgr: Backport for [[gerrit:1127965|Enable credentials change special pages on SUL3 shared domain (T362715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:31:34] !log uploaded HAProxy 3.1.5 to apt.wm.o (bullseye-wikimedia) component thirdparty/haproxy31 - T386796 [13:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:38] T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796 [13:33:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [13:34:26] (03CR) 10Muehlenhoff: "(PCC failure on Puppet 5 is fine, we only use this role on Puppet 7 and max_files (as used by the role) isn't in Puppet 5 yet)" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [13:35:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:38:00] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage [13:38:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm [13:38:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641698 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm [13:39:44] !log begin moving k8s prometheus instances from prometheus2005 to prometheus2007 - T383232 [13:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:48] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [13:40:02] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [13:40:38] !log tgr@deploy2002 tgr: Continuing with sync [13:41:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage [13:44:56] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10641719 (10ssingh) [Stalled until further discussion] [13:45:16] (03CR) 10Elukey: [C:03+1] Double conntrack table size on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1128406 (owner: 10Muehlenhoff) [13:45:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [13:46:27] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:46:44] (03PS1) 10Muehlenhoff: Switch ganeti1029 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1128412 [13:47:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10641720 (10ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs [13:47:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641721 (10MatthewVernon) OK, so the reimage isn't working 
because the SSDs are both RAID-0 arrays rather than JBOD. I'm going to try and un-RAID them, JBOD them, and try anothe... [13:47:06] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641722 (10Jclark-ctr) @Marostegui looks like db1257 is not in site.pp is currently failing Reimage [13:47:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [13:47:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [13:47:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641726 (10Jclark-ctr) a:05Papaul→03Marostegui [13:47:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10641727 (10ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs [13:48:44] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127965|Enable credentials change special pages on SUL3 shared domain (T362715)]] (duration: 25m 42s) [13:48:48] T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715 [13:49:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641741 (10Marostegui) a:05Marostegui→03Papaul >>! In T384979#10641722, @Jclark-ctr wrote: > @Marostegui looks like db1257 is not in site.pp is currently failing Reimage It is:... 
[13:49:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:49:16] (03CR) 10Brouberol: [C:03+2] airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 (owner: 10Brouberol) [13:49:19] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [13:49:22] (03CR) 10Brouberol: [C:03+2] airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 (owner: 10Brouberol) [13:50:16] (03CR) 10Xcollazo: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:50:25] RECOVERY - Host dbprov2005 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [13:50:39] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [13:51:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641744 (10Jclark-ctr) @Marostegui sorry i forgot to update my repo before i was checking site.pp locally its monday [13:51:12] 
10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641745 (10Jclark-ctr) a:05Papaul→03Jclark-ctr [13:51:14] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641747 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated both ends. looks like it was the switch side. port is... [13:51:42] (03CR) 10Elukey: [C:03+2] kartotherian: use wdqs-internal-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: 10DCausse) [13:51:42] (03Merged) 10jenkins-bot: airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 (owner: 10Brouberol) [13:51:43] (03Merged) 10jenkins-bot: airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [13:51:45] (03Merged) 10jenkins-bot: airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 (owner: 10Brouberol) [13:51:47] (03Merged) 10jenkins-bot: Fix some SUL3 shared domain settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [13:52:05] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127648|Fix some SUL3 shared domain settings (T388218)]] [13:52:08] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218 [13:52:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641756 (10Marostegui) >>! 
In T384979#10641744, @Jclark-ctr wrote: > @Marostegui sorry i forgot to update my repo before i was checking site.pp locally its monday > ☕☕ [13:52:41] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:54:57] (03PS10) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:55:01] (03CR) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:55:14] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:55:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:56:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:56:39] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:56:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [13:57:03] ^^BGP alert triggered by lvs3009 reimage [13:57:12] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5086/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:57:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-main: apply [13:57:39] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:52] 06SRE, 10Observability-Alerting, 06Traffic, 10SRE Observability (FY2024/2025-Q3): Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10641773 (10tappof) 05Open→03Resolved It looks like the patch has fixed the problem. I'm closing the task.... [13:58:40] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [13:59:43] (03CR) 10Brouberol: [V:03+1] Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [13:59:44] (03PS11) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [14:00:07] 06SRE, 10Observability-Alerting, 06Traffic, 10SRE Observability (FY2024/2025-Q3): Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10641781 (10ssingh) Thanks for taking care of it; can confirm resolved! [14:00:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10641782 (10Ottomata) Hi @BCornwall ! group owner approval for analytics-privatedata-users [[ https://github.com/wikimedi... 
[14:01:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3009.esams.wmnet with OS bookworm [14:01:18] (03PS1) 10TrainBranchBot: Revert "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128417 [14:01:19] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as I9e513f7ba97f281c9e60fc110fb7331c4e47385d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [14:03:55] (03CR) 10Gergő Tisza: "Fails with" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [14:04:07] (03PS1) 10Vgutierrez: hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) [14:04:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128417 (owner: 10TrainBranchBot) [14:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:04:25] (03PS2) 10Vgutierrez: hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) [14:04:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:05:01] (03Merged) 10jenkins-bot: Revert "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128417 (owner: 10TrainBranchBot) [14:05:20] !log 
tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128417|Revert "Fix some SUL3 shared domain settings"]] [14:05:56] (03CR) 10Gergő Tisza: [C:03+2] Try both SUL2 and SUL3 central domain for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [14:06:41] (03CR) 10Ssingh: [C:03+1] hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:06:50] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:09:43] !log tgr@deploy2002 trainbranchbot, tgr: Backport for [[gerrit:1128417|Revert "Fix some SUL3 shared domain settings"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:51] !log repool lvs3009 running liberica - T384477 [14:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [14:09:59] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009.esams.wmnet} and A:liberica (T384477) [14:10:16] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009.esams.wmnet} and A:liberica (T384477) [14:10:43] !log tgr@deploy2002 trainbranchbot, tgr: Continuing with sync [14:11:21] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [14:12:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:12:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:13:45] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:14:35] (03Merged) 10jenkins-bot: Try both SUL2 and SUL3 central domain for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [14:14:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:14:43] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [14:16:37] (03CR) 10Xcollazo: [C:03+1] "Agree it is worth pursuing this change to confirm or deny whether the Analytics replicas are the culprit of the slowdown discussed in http" [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) (owner: 10Btullis) [14:16:43] (03CR) 10Btullis: [C:03+2] dumps: Stop using the analytics replicas for misc dumps [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) (owner: 10Btullis) [14:16:51] (03PS1) 10Filippo Giunchedi: hieradata: add new prometheus hw to prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128420 (https://phabricator.wikimedia.org/T383232) [14:17:12] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128417|Revert "Fix some SUL3 shared domain settings"]] (duration: 11m 52s) [14:17:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:18:16] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs3008 as liberica [puppet] - 
10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) [14:18:20] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: add new prometheus hw to prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128420 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:18:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1257.eqiad.wmnet with OS bookworm [14:18:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm executed with errors: - db1257 (**FAIL**)... [14:18:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:19:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:19:55] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127952|Try both SUL2 and SUL3 central domain for autologin (T375796)]] [14:19:59] T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796 [14:20:49] (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs3008 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:22:22] (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs3008 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [14:22:28] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:22:42] !log depooling lvs3008 before being reimaged - T384477 [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] T384477: Replace pybal 
with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [14:23:37] !log tgr@deploy2002 tgr: Backport for [[gerrit:1127952|Try both SUL2 and SUL3 central domain for autologin (T375796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:41] 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot account - https://phabricator.wikimedia.org/T388662#10641905 (10ssingh) The user does not seem to be part of https://ldap.toolforge.org/user/barrybrowsertestbot of any sensitive groups. For disabling an account and on checking internally with SRE, t... [14:24:17] (03PS1) 10Vgutierrez: hiera: Upgrade to HAProxy 3.1 on cp5024 (text) [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) [14:24:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:24:46] (03CR) 10Jasmine: [C:03+1] wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:25:08] (03CR) 10Ssingh: [C:03+1] hiera: Upgrade to HAProxy 3.1 on cp5024 (text) [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [14:25:29] PROBLEM - pybal on lvs3008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:25:39] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:25:46] urgh.. 
forgot the downtime, sorry about the noise [14:25:52] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [14:25:54] (03CR) 10Jasmine: [C:03+1] wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:25:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be207... [14:26:09] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs3008.esams.wmnet with reason: depooled before reimage [14:26:21] (03CR) 10Hashar: [C:03+1] Remove unnecessary boolean statement for $wmgIncreaseDefaultVectorFontSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127929 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:26:42] (03CR) 10Jasmine: [C:03+1] geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:27:01] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [14:27:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [14:27:17] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [14:27:44] (03CR) 10Hashar: "`$wgVectorZebraDesign` and some other settings were removed recently by 
Ic4876a91ec1b2cedcf68d4f257e518837e15da89" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:27:50] (03CR) 10Hashar: [C:03+1] Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:28:32] (03PS4) 10Scott French: deployment_server: Support PHP version selection in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) [14:28:42] FIRING: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:06] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3008.esams.wmnet with OS bookworm [14:29:22] (03CR) 10Vgutierrez: [C:03+2] hiera: Upgrade to HAProxy 3.1 on cp5024 (text) [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [14:29:42] !log upgrading HAProxy to version 3.1 in cp5024 (text) - T386796 [14:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:45] T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796 [14:30:13] !incidents [14:30:14] 5747 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:30:14] 5745 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:30:14] 5744 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:30:14] 5743 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad) [14:30:21] jouncebot: nowandnext [14:30:21] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [14:30:21] In 0 hour(s) 
and 59 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530) [14:30:55] I am going to deploy a bunch of clean up patches for mediawiki-config [14:31:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm [14:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:31:20] (03CR) 10Scott French: [C:03+2] deployment_server: Support PHP version selection in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [14:32:28] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10641954 (10ssingh) [14:32:59] oh [14:33:16] (03PS1) 10Ayounsi: Add transit/peering in/out port saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) [14:33:33] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:33:38] 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10641961 (10ssingh) https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Deployment_Groups dictates the user be added to the Gerrit group `wmf-deployment` which I have just done.... [14:33:46] tgr_: are you still doing the deployment of "Try both SUL2 and SUL3 central domain for autologin" [14:33:46] (03CR) 10Brouberol: [C:03+1] "Nicely done!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [14:34:59] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat!" [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:35:34] I guess it needs a bit of testing :) [14:35:45] I will do the clean up patches later, they are not urgent [14:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:36:29] hashar: just about to roll it back [14:36:44] :-\ [14:36:44] (03PS1) 10Elukey: kartotherian: update statsd's config ttl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128430 (https://phabricator.wikimedia.org/T388860) [14:36:44] !log tgr@deploy2002 Sync cancelled. [14:36:51] poor SUL! 
[14:37:12] (03PS1) 10TrainBranchBot: Revert "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128431 [14:37:13] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as If0f231ae556fdf5e7ea242dc8cc24a8be8c1e343" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [14:37:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128431 (owner: 10TrainBranchBot) [14:38:25] (03CR) 10Jasmine: [C:03+1] deployment: switch deploy servers to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1127074 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:39:25] (03PS1) 10Elukey: kartotherian: simplify the readinessProble's path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 [14:40:09] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [14:40:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642002 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be207... 
[14:40:36] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [14:40:38] (03CR) 10Jasmine: [C:03+1] wmnet: point deploy server at eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:40:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye [14:41:04] (03CR) 10Jgiannelos: [C:03+1] kartotherian: update statsd's config ttl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128430 (https://phabricator.wikimedia.org/T388860) (owner: 10Elukey) [14:41:42] (03CR) 10Elukey: [C:03+2] kartotherian: update statsd's config ttl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128430 (https://phabricator.wikimedia.org/T388860) (owner: 10Elukey) [14:41:56] (03CR) 10Ayounsi: "Cathal, let me know what you think of this approach, it's more basic than what you suggested on the task." 
[alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [14:42:06] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: cleanup instance functionality [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:42:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:39] (03PS2) 10Filippo Giunchedi: prometheus: cleanup instance functionality [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) [14:42:40] (03CR) 10Jasmine: [C:03+1] mw-(web|api-ext): scale up in anticipation of switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127859 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:42:50] (03CR) 10Ayounsi: [C:03+2] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [14:43:46] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:43:48] (03CR) 10Muehlenhoff: [C:03+1] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi) [14:43:53] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:44:02] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:44:04] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: cleanup instance functionality [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:44:36] FIRING: GatewayBackendErrorsHigh: api-gateway: 
elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:45:18] (03Merged) 10jenkins-bot: Revert "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128431 (owner: 10TrainBranchBot) [14:45:26] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [14:45:33] (03PS2) 10Filippo Giunchedi: hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) [14:45:39] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128431|Revert "Try both SUL2 and SUL3 central domain for autologin"]] [14:45:51] !incidents [14:45:52] 5749 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:45:52] 5747 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:45:52] 5745 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:45:52] 5744 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [14:45:52] 5743 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad) [14:47:11] (03PS1) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 [14:47:16] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642034 (10ssingh) From 
https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-users,... [14:47:23] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi) [14:47:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642035 (10ssingh) [14:48:00] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage [14:48:52] tgr_: still deploying? [14:49:32] Amir1: this is the last scap [14:49:51] noted [14:50:59] !log tgr@deploy2002 trainbranchbot, tgr: Backport for [[gerrit:1128431|Revert "Try both SUL2 and SUL3 central domain for autologin"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:51:24] !log herron@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [14:51:26] !log herron@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [14:51:42] !log herron@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [14:51:46] !log tgr@deploy2002 trainbranchbot, tgr: Continuing with sync [14:51:57] !log herron@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [14:52:09] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage [14:52:32] (03CR) 10Xcollazo: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [14:53:04] !log herron@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. [14:53:07] !log herron@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[14:53:47] (03CR) 10Vgutierrez: [C:03+2] lists: Offer RSA+ECDSA certificates on lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1127066 (https://phabricator.wikimedia.org/T385067) (owner: 10Vgutierrez) [14:54:36] RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:54:42] (03PS2) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 [14:54:42] (03PS1) 10Ayounsi: type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438 [14:56:03] moritzm: this ^ is flapping awfully much today, but apparently it's a pattern for quite a few days now. https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=api-gateway&from=now-2d&to=now. 
I'll open a task to ML, unless you got one already [14:56:27] (03PS2) 10Ayounsi: type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438 [14:56:27] (03PS3) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 [14:56:40] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi) [14:56:43] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi) [14:56:59] (03CR) 10CI reject: [V:04-1] Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi) [14:58:03] akosiaris: I spoke to Ilias earlier, they are working on a fix already, should be ready today or tomorrow [14:58:14] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128431|Revert "Try both SUL2 and SUL3 central domain for autologin"]] (duration: 12m 35s) [14:58:18] Amir1: done, sorry it took so long [14:58:42] (03CR) 10CI reject: [V:04-1] Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi) [14:58:43] hashar: ^ (but Amir1 asked first :) [14:58:50] o/ [14:58:54] yeah no worries [14:59:02] (03CR) 10Ayounsi: "forgot something important in the regex. It's now working as expected as you can see in the chained CR." 
[puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi) [14:59:03] I will do the clean up patches later this week or next week [15:00:15] (03CR) 10Brouberol: [C:03+1] type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi) [15:00:40] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:07] (03CR) 10Vgutierrez: [C:03+2] exim: Use RSA+ECDSA certificates for lists [puppet] - 10https://gerrit.wikimedia.org/r/1127933 (https://phabricator.wikimedia.org/T385067) (owner: 10Vgutierrez) [15:03:18] hashar: I'm in a meeting so feel free to go ahead [15:03:42] RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [15:04:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:04:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:06:02] !log 
brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:06:40] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:52] ^^ BGP alert triggered by lvs3008 being reimaged [15:08:40] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:23] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [15:10:44] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage [15:11:42] FIRING: [2x] JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3008.esams.wmnet with OS bookworm [15:13:57] moritzm: ok, cool, thanks! 
[15:14:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage [15:14:30] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642215 (10RobH) Draft of directions: > Support, > > We just had an optic fail on one of our router to switch links, and need the switch side optic swapped out with spa... [15:16:10] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10642230 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Both exim and apache2 have been reconfigured to offer RSA+E... [15:16:18] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver2004.codfw.wmnet with OS bookworm [15:16:42] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps ratio to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [15:17:23] (03PS1) 10Brouberol: Upgrade airflow-providers-cncf-kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128445 (https://phabricator.wikimedia.org/T388378) [15:17:47] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 362355672 and 53 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:17:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [15:18:09] (03Merged) 10jenkins-bot: Bump thumbnail steps ratio to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [15:18:26] !log ladsgroup@deploy2002 
Started scap sync-world: Backport for [[gerrit:1128377|Bump thumbnail steps ratio to 15% (T360589)]] [15:18:30] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [15:18:47] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 299200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:18:55] (03PS1) 10Vgutierrez: hiera: Restore BGP priority for lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) [15:19:21] (03CR) 10Ssingh: [C:03+1] hiera: Restore BGP priority for lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:19:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:19:47] (03CR) 10Muehlenhoff: [C:03+1] "This was signed off in the SRE IF meeting. @Alex, you can proceed with deploying" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:20:38] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642258 (10RobH) Had the option for 'normal work' which must be planned in work hours and 24 hours in advance (with time zone changes that means if I entered it now, it woul... 
[15:21:26] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642262 (10RobH) a:03RobH [15:21:42] (03PS12) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [15:22:20] (03CR) 10Vgutierrez: [C:03+2] hiera: Restore BGP priority for lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:23:02] (03CR) 10Ssingh: [C:03+1] service: move kartotherian-k8s-ssl fully on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:23:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1128377|Bump thumbnail steps ratio to 15% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:24:13] (03PS1) 10Vgutierrez: cumin: Update (liberica|lvs)-esams aliases [puppet] - 10https://gerrit.wikimedia.org/r/1128448 (https://phabricator.wikimedia.org/T384477) [15:25:27] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:25:31] (03CR) 10Ssingh: [C:03+1] service: set kartotherian and kartotherian-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128344 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [15:25:54] (03PS1) 10Vgutierrez: hiera: Clean-up lvs::balancer keys for non-core DCs [puppet] - 10https://gerrit.wikimedia.org/r/1128449 (https://phabricator.wikimedia.org/T384477) [15:26:13] (03CR) 10Ssingh: [C:03+1] "Yes, please coordinate with us as you mention in the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/1128345 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [15:27:00] (03CR) 10Ssingh: [C:03+1] service, conftool-data: final removal for unused Kartotherian configs [puppet] - 10https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [15:27:06] 
06SRE, 06Infrastructure-Foundations: Review Broadcom's storcli binary - https://phabricator.wikimedia.org/T388628#10642295 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [15:27:51] (03CR) 10Ssingh: [C:03+1] cumin: Update (liberica|lvs)-esams aliases [puppet] - 10https://gerrit.wikimedia.org/r/1128448 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:28:35] (03CR) 10Vgutierrez: [C:03+2] cumin: Update (liberica|lvs)-esams aliases [puppet] - 10https://gerrit.wikimedia.org/r/1128448 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:28:50] (03CR) 10Jdlrobson: [C:03+1] Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [15:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530) [15:30:18] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [15:31:01] !log repool lvs3008 running liberica - T384477 [15:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477 [15:31:10] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3008.esams.wmnet} and A:liberica (T384477) [15:31:28] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3008.esams.wmnet} and A:liberica (T384477) [15:31:47] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128377|Bump thumbnail steps ratio to 15% (T360589)]] (duration: 13m 20s) [15:31:50] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [15:32:14] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10642338 (10phaultfinder) 
[15:32:15] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [15:32:50] (03CR) 10Clément Goubert: [C:03+1] service: move kartotherian-k8s-ssl fully on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [15:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 14.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:33:30] (03CR) 10Herron: [C:03+1] prometheus: disable 'accelerator' cadvisor metric [puppet] - 10https://gerrit.wikimedia.org/r/1128319 (https://phabricator.wikimedia.org/T388632) (owner: 10Filippo Giunchedi) [15:34:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2075.codfw.wmnet with OS bullseye [15:34:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye completed: - ms-be2075 (**PASS**... 
[15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:15] (03CR) 10Ayounsi: [C:03+2] type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi) [15:37:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:29] (03Abandoned) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi) [15:37:51] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [15:38:05] jouncebot: refresh [15:38:06] I refreshed my knowledge about deployments. [15:38:43] (03PS1) 10Vgutierrez: hieradata: Use codfw etcd cluster in liberica@(ulsfo|eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) [15:38:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642387 (10BCornwall) Ugh, sorry about that. But we do still nee @ATsay-WMF to approve, right? 
[15:39:00] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:39:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T385814#10642389 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:39:38] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:09] (03CR) 10Elukey: "@cwhite@wikimedia.org Hi! I didn't notice the change in the kartotherian's chart until I deployed a new change today. This is the error th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [15:41:10] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642395 (10ssingh) >>! In T388693#10642387, @BCornwall wrote: > Ugh, sorry about that. But we do still nee @ATsay-WMF to... 
[15:43:12] (03CR) 10Ssingh: [C:03+1] hieradata: Use codfw etcd cluster in liberica@(ulsfo|eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:43:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:44:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2089 [15:44:07] (03CR) 10JMeybohm: [C:03+2] k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1128369 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [15:44:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2089 [15:44:28] (03PS7) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) [15:44:37] (03CR) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [15:44:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:45] jouncebot: now [15:44:45] For the next 0 hour(s) and 15 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530) [15:44:47] (03CR) 10Bking: "Done" [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [15:45:22] jan_drewniak: would there be any 
conflict with any portal deployments happening right now if I wanted to try to get a security patch out? [15:45:57] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1128449/5088/" [puppet] - 10https://gerrit.wikimedia.org/r/1128449 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:46:14] (03CR) 10Vgutierrez: [C:03+2] hiera: Clean-up lvs::balancer keys for non-core DCs [puppet] - 10https://gerrit.wikimedia.org/r/1128449 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:46:15] (03PS3) 10Filippo Giunchedi: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [15:46:25] jouncebot: nowandnext [15:46:25] For the next 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530) [15:46:25] In 1 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700) [15:46:25] In 1 hour(s) and 13 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700) [15:46:38] (03PS4) 10Filippo Giunchedi: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [15:46:43] One portals update is done, I'd like to deploy [15:46:50] (03PS2) 10Scott French: Disable cookie-based enrollment in 8.1 (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) [15:47:08] Dreamy_Jazz: skipping the portals deploy this week, feel free to deploy. 
[15:47:11] (03PS5) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [15:47:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642410 (10MatthewVernon) Finally got the reimage to work; I'll leave this host overnight, and then check the kernel log tomorrow. [15:47:26] Thanks. sbassett: Do you want to deploy the security patch first? [15:47:46] (03CR) 10Ebernhardson: [C:03+2] cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [15:47:54] Dreamy_jazz: sure, that would be great. Should be quick? One mw core file affected for .20. [15:47:58] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [15:48:03] (03CR) 10Vgutierrez: [C:03+2] hieradata: Use codfw etcd cluster in liberica@(ulsfo|eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [15:48:08] (03CR) 10Cwhite: "Updated 30d->720h per findings from elukey: Iba0264c01df67083bf7d29bf6fe632811a56e0ef" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [15:48:59] (03Merged) 10jenkins-bot: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [15:49:00] (03CR) 10Filippo Giunchedi: "Thank you for the patch! I've tweaked the topic naming slightly to use k8s-mw prefix." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [15:49:16] Sure. Let me know when you are done. 
[15:50:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp30[66-72,74-80].esams.wmnet} and A:cp for 9.2.9-1wm1 [15:51:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:56:28] (03CR) 10Brouberol: [C:03+2] Upgrade airflow-providers-cncf-kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128445 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:57:17] (03CR) 10Alexandros Kosiaris: [C:03+1] Disable cookie-based enrollment in 8.1 (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:57:32] (03PS1) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1128458 [15:57:37] (03CR) 10Clément Goubert: [C:03+1] "Discussed out of band: The topics will get autocreated, logstash fetches from k8s-*, and the topics don't need any special config." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [15:58:12] (03PS23) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [15:58:59] (03PS4) 10Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) [15:59:32] (03PS1) 10Elukey: role::ml_k8s: extend nrpe_check_disk_options to allow containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128461 (https://phabricator.wikimedia.org/T387854) [15:59:34] (03PS1) 10Elukey: role::ml_k8s::staging::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128462 (https://phabricator.wikimedia.org/T387854) [15:59:35] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) [16:00:03] (03CR) 10Btullis: [C:03+1] Upgrade airflow-providers-cncf-kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128445 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:00:11] (03CR) 10Dzahn: "thanks! Done. added the test. I don't have to manually add the VirtualServer, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [16:02:25] (03CR) 10JMeybohm: [C:03+2] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1128458 (owner: 10JMeybohm) [16:02:34] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs[5004-5006].eqsin.wmnet,lvs[4008-4009].ulsfo.wmnet} and A:liberica [16:03:09] sbassett: Any update on deploying the security patch? [16:03:51] Dreamy_jazz: Almost done [16:03:56] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5089/co" [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [16:03:59] Thanks. [16:04:11] Forgot that the security deploy doesn't log when it starts. 
[16:04:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:05:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:06:25] Dreamy_Jazz: 48% k8s restarted… [16:07:11] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs[5004-5006].eqsin.wmnet,lvs[4008-4009].ulsfo.wmnet} and A:liberica [16:07:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:09:46] !log Deployed security patch for T387478 [16:10:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:12:17] Dreamy_Jazz: all yours [16:12:23] Thanks! 
[16:13:33] (03PS2) 10Dreamy Jazz: Unset the old 'checkuser-temporary-account-viewer' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205) [16:13:44] (03PS1) 10Elukey: admin_ng: set request and limits the same for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128465 (https://phabricator.wikimedia.org/T386926) [16:13:48] !log bounce asw1-b12 et-0/0/48 - T389071 [16:13:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [16:14:23] (03PS24) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [16:15:00] (03Merged) 10jenkins-bot: Unset the old 'checkuser-temporary-account-viewer' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [16:15:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2089'] [16:15:20] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] [16:15:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2089'] [16:15:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:16:24] (03PS6) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [16:16:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye [16:16:37] 10ops-codfw, 06SRE, 
10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10642546 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw... [16:17:01] (03PS1) 10Brouberol: Fix typo in image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128466 (https://phabricator.wikimedia.org/T388378) [16:17:51] (03PS7) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [16:19:31] (03CR) 10Brouberol: [C:03+2] Fix typo in image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128466 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [16:20:01] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:20:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:20:16] (03PS1) 10JMeybohm: profile::kubernetes::client: install kubectl 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) [16:20:35] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [16:20:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [16:21:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [16:22:21] (03CR) 10CI reject: [V:04-1] profile::kubernetes::client: install kubectl 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [16:23:38] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5090/co" [puppet] - 10https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [16:23:53] (03Abandoned) 10Jforrester: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090435 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [16:23:57] (03Abandoned) 10Jforrester: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1090425 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [16:24:01] (03Abandoned) 10Jforrester: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1089924 (https://phabricator.wikimedia.org/T375662) (owner: 10TrainBranchBot) [16:26:08] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642593 (10ayounsi) remote hands replaced the optic, but the issue persists. Looking closer at it it converts the 40G port into 4x10G lanes. This might be because lane 1 is... 
[16:27:01] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] (duration: 11m 41s) [16:28:07] (03PS2) 10JMeybohm: profile::kubernetes::client: install kubectl 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) [16:28:12] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [16:29:11] (03PS8) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [16:36:34] (03PS1) 10Vgutierrez: Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - 10https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) [16:36:34] (03PS1) 10Vgutierrez: Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - 10https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) [16:36:51] (03CR) 10Ssingh: [C:03+1] Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - 10https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:36:54] (03CR) 10Ssingh: [C:03+1] Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - 10https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:36:59] (03CR) 10CI reject: [V:04-1] Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - 10https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:37:08] (03CR) 10CI reject: [V:04-1] Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - 10https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:37:37] (03PS2) 10Vgutierrez: Revert "hiera: Upgrade to 
HAProxy 3.1 on cp5024 (text)" [puppet] - 10https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) [16:37:52] (03PS2) 10Vgutierrez: Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - 10https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) [16:38:10] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642675 (10RobH) IRC update: We asked them to swap both optic and fiber patch to reduce complexity in troubleshooting. > Support, > > Background: For some reason this li... [16:38:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:38:41] (03CR) 10Ssingh: Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - 10https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:38:46] (03CR) 10Ssingh: Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - 10https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:39:08] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - 10https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:39:38] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:39:48] !log downgrading HAProxy to version 2.8 in cp5024 (text) - T386796 [16:40:23] (03CR) 10Stoyofuku-wmf: [C:03+1] "The list is getting small!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: 10Jdlrobson) [16:40:52] logmsgbot is toasted? 
[16:40:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:24] vgutierrez: do you mean stashbot? [16:41:30] Lucas_WMDE: sorry, yes [16:41:34] * Lucas_WMDE looks [16:41:59] (03PS1) 10Cwhite: statsd_exporter: bugfix set ttl to associated variable [puppet] - 10https://gerrit.wikimedia.org/r/1128471 (https://phabricator.wikimedia.org/T359497) [16:42:03] oh god, quit half an hour ago ._. [16:44:11] (03PS1) 10Elukey: installserver: set puppetserver2004 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) [16:44:38] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2327 [16:45:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2327 [16:45:01] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T389096 (10Olivierpeyronnet) 03NEW Closing this task as invalid due to missing information. 
[16:45:01] vgutierrez: try again [16:45:08] * Lucas_WMDE scrolls up to see who else needs to relog stuff [16:45:12] !log downgrading HAProxy to version 2.8 in cp5024 (text) - T386796 [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:16] T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796 [16:45:17] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2327 [16:45:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2327 [16:45:22] Lucas_WMDE: thx <3 [16:45:42] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - 10https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [16:46:13] !log downgrading HAProxy to version 2.8 in cp5032 (upload) - T386796 [16:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:36] FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [16:46:44] !incidents [16:46:44] 5750 (UNACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [16:46:44] XioNoX, jhathaway, Dreamy_Jazz, brouberol, moritzm: FYI, y’all might want to re-log some messages of the past ca. 
35 minutes, stashbot quit IRC and we didn’t notice for a bit :( [16:46:44] 5749 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [16:46:45] 5747 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [16:46:45] 5745 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [16:46:45] 5744 (RESOLVED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [16:46:45] 5743 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad) [16:46:53] !ack 5750 [16:46:54] 5750 (ACKED) GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad) [16:47:02] I 'll silence this for a week or so [16:47:08] Thank you! [16:47:18] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] [16:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:22] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [16:47:23] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:26] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [16:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:30] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] (duration: 11m 41s) [16:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:58] 06SRE, 10Maps, 06Traffic: Allow Wikimedia 
Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099 (10Olivierpeyronnet) 03NEW [16:48:52] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099#10642769 (10Olivierpeyronnet) I am a master's student working on my thesis about the postal and telegraph network in Bosnia-Herzegovina before World War I. My research requires mapping... [16:49:05] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10642771 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:50:44] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10642786 (10jcrespo) For the help part (not the approval), feel free to ping me, not the highest expert, but I can help with the commands if needed. [16:56:04] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T389096#10642818 (10Aklapper) →14Duplicate dup:03T389099 [16:56:06] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099#10642820 (10Aklapper) [16:57:11] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099#10642826 (10Aklapper) 05Open→03Declined Hi @Olivierpeyronnet, maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Affiliates. We are not... [16:58:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:00:05] swfrench and cwhite: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700). [17:00:05] ryankemper: #bothumor Q:How do functions break up? 
A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700). [17:01:01] o/ [17:01:15] !log silence GatewayBackendErrorsHigh lw_inference_reference_need_cluster in eqiad for 1 week [17:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:49] c.white and I will get started on this in a couple of minutes [17:02:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:03:16] (03CR) 10Scott French: [C:03+1] move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [17:04:39] (03PS2) 10Esanders: VE: Disable upcoming mobile insert menu everywhere except test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) [17:04:49] (03CR) 10Esanders: "Oops, forgot to stage." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) (owner: 10Esanders) [17:05:49] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [17:07:41] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [17:10:15] (03CR) 10Clément Goubert: mediawiki: add rewrite for rt.wikimedia.org to wikitech page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [17:14:40] (03PS3) 10Jforrester: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) [17:14:40] (03CR) 10Jforrester: search-redirect: Handle $_GET potential vulnerability scanning (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [17:15:43] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@4eb42a4]: search: drop export_queries_to_relforge [17:16:12] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@4eb42a4]: search: drop export_queries_to_relforge (duration: 00m 29s) [17:17:55] cwhite: ready to go? 
:) [17:17:59] (03CR) 10Cwhite: [C:03+2] move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [17:18:10] 🚀 [17:19:47] (03PS5) 10Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) [17:19:54] (03CR) 10Clément Goubert: [C:03+1] admin_ng: set request and limits the same for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128465 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [17:20:09] (03CR) 10Dzahn: "oh yea, I think you are right, this should be a funnel. we want to redirect all URLs to the same target. amended." [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [17:20:19] (03Merged) 10jenkins-bot: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [17:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10643081 (10phaultfinder) [17:24:37] (03CR) 10DLynch: [C:03+1] VE: Disable upcoming mobile insert menu everywhere except test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) (owner: 10Esanders) [17:25:15] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:27:15] cwhite: mediawiki statsd exporter diffs look like what we expect - just the addition of the ttl (720h) and the chart version bump [17:27:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10643107 (10ATsay-WMF) Approved [17:27:43] \o/ [17:27:45] I'll start with mw-debug to confirm it 
updates, and then update the exporters in the other namespaces [17:28:04] (03PS12) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:28:11] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10643112 (10ssingh) [17:28:23] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:28:29] (03PS13) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:28:41] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:28:59] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:29:12] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002" [17:29:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:29:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002" [17:29:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:31] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5092/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:32:30] cwhite: mw-debug updates succeeded, and I just curl'd /metrics on one of the exporter pods to 
confirm that I see mw metrics [17:34:12] cwhite: anything you want to spot check before I move ahead with the other deployments? [17:34:44] (03PS14) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:35:05] swfrench-wmf: Good to proceed :) [17:35:43] off we go, then [17:35:47] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:36:05] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:36:11] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:36:25] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:36:30] (03PS2) 10Jdlrobson: Enable Donation banner on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768) [17:36:31] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:36:34] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:36:41] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:36:52] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [17:36:57] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:37:03] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [17:37:10] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [17:37:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:37:29] !log swfrench@deploy2002 helmfile 
[eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:37:46] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:38:00] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:38:06] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:38:14] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:39:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:40:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:40:51] no issues in eqiad - moving on codfw [17:41:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:41:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:41:36] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:41:50] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:41:55] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp30[66-72,74-80].esams.wmnet} and A:cp for 9.2.9-1wm1 [17:41:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:41:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:42:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:42:16] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:42:22] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [17:42:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply 
[17:42:35] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:42:47] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:43:02] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:43:12] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:43:18] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:43:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:43:58] !log applied https://gerrit.wikimedia.org/r/1117638 to mediawiki statsd exporters [17:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002" [17:44:05] cwhite: all done :) [17:44:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002" [17:44:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:44:35] nice! looks good from my end! [17:45:18] awesome, thanks [17:45:24] alright, on to the next thing [17:45:34] thank you! 
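(Annotation, not part of the channel traffic.) The exporter rollout above is logged as paired `helmfile [dc] START` / `DONE` entries, so per-service apply times can be read straight off the timestamps. A minimal sketch, assuming only the entry shape visible in this log: the regular expression and the `helmfile_durations` helper are hypothetical names written for this note, not part of any Wikimedia tooling, and a START/DONE pair that straddles midnight is not handled.

```python
import re
from datetime import datetime

# Matches SAL-style entries of the shape seen in this log, e.g.
# [17:36:11] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
# (pattern is an assumption derived from the lines above, not an official format spec)
ENTRY = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] !log \S+ helmfile "
    r"\[(\w[\w-]*)\] (START|DONE) helmfile\.d/services/([\w-]+): (?:apply|sync)"
)


def helmfile_durations(log_text):
    """Return (datacenter, service, seconds) for each START/DONE pair, in DONE order."""
    started = {}   # (datacenter, service) -> time of the most recent START
    results = []
    for match in ENTRY.finditer(log_text):
        stamp, dc, phase, service = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        key = (dc, service)
        if phase == "START":
            started[key] = when
        elif key in started:
            # DONE with a matching START: record the elapsed whole seconds
            delta = when - started.pop(key)
            results.append((dc, service, int(delta.total_seconds())))
    return results


sample = (
    "[17:36:11] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply "
    "[17:36:25] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply "
    "[17:41:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply "
    "[17:41:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply"
)

for dc, service, seconds in helmfile_durations(sample):
    print(f"{dc} {service}: {seconds}s")
```

A DONE entry with no preceding START is skipped rather than guessed at, which keeps the sketch safe to run on a partial excerpt of the channel log.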
[17:45:47] no problem at all [17:46:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:47:44] (03Merged) 10jenkins-bot: Disable cookie-based enrollment in 8.1 (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:48:02] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1128451|Disable cookie-based enrollment in 8.1 (cleanup) (T383845)]] [17:48:06] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:49:19] (03PS1) 10Ladsgroup: changeprop-jobqueue: Bump categorymembership job concurrancy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128482 [17:51:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye [17:51:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10643256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw... 
[17:52:01] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1128451|Disable cookie-based enrollment in 8.1 (cleanup) (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:54:13] !log swfrench@deploy2002 swfrench: Continuing with sync [17:57:50] (03PS2) 10Ssingh: Add dsantamaria to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1127996 (https://phabricator.wikimedia.org/T388693) (owner: 10BCornwall) [17:57:51] swfrench: I have the new scap release ready with your fix and spiderpig stuff. Lemme know when it's safe to deploy it. [17:58:56] (03CR) 10Brouberol: [V:03+1] "It should be all good. Sorry it took many attempts, I'm not at ny best :D" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [18:00:07] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128484 [18:00:38] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128451|Disable cookie-based enrollment in 8.1 (cleanup) (T383845)]] (duration: 12m 35s) [18:00:41] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:00:55] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10643294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2089.codfw.wmn... 
[18:01:15] (03CR) 10Ssingh: [C:03+2] Add dsantamaria to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1127996 (https://phabricator.wikimedia.org/T388693) (owner: 10BCornwall) [18:03:00] alright, I am done with the UTC-late infra window :) [18:04:10] dancy: please let me know when you're done, I have a tiny and quick portals update deploy [18:04:37] Amir1: Please go fist [18:04:41] *first [18:04:53] dancy: apologies, missed your message - all yours! [18:05:05] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10643304 (10ssingh) 05In progress→03Resolved @DSantamaria: This has been merged, please try accessing Superset ~30... [18:05:07] thanks! [18:05:11] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [18:05:26] (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128484 (owner: 10Ladsgroup) [18:06:25] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128484 (owner: 10Ladsgroup) [18:14:13] (03CR) 10Xcollazo: [V:03+1 C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis) [18:15:23] (03PS1) 10Sohom Datta: Lua: Prevent PHP errors in production from displayNumber lookup [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) [18:16:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) (owner: 10Sohom Datta) [18:20:01] !log ladsgroup@deploy2002 Synchronized portals/wikipedia.org/assets: wikimedia.org updates (T373204) (duration: 12m 38s) [18:20:05] T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204 [18:22:41] !log ladsgroup@deploy2002 Synchronized portals: wikimedia.org updates (T373204) (duration: 02m 38s) [18:23:20] (03PS1) 10Máté Szabó: GlobalContributions: Use unique CentralAuth tokens per request [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) [18:23:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) (owner: 10Máté Szabó) [18:25:29] (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: Bump categorymembership job concurrancy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128482 (owner: 10Ladsgroup) [18:29:35] (03CR) 10JHathaway: [C:03+1] installserver: set puppetserver2004 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey) [18:39:58] (03PS1) 10Gergő 
Tisza: Re-apply "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) [18:40:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [18:44:54] (03PS1) 10Scott French: hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) [18:44:55] (03PS1) 10Scott French: mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) [18:45:07] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115985 (owner: 10Ncmonitor) [18:45:18] !log dancy@deploy2002 Installing scap version "4.141.1" for 204 host(s) [18:46:56] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1092931 (owner: 10Ncmonitor) [18:50:09] !log dancy@deploy2002 Installing scap version "4.141.1" for 1 host(s) [18:50:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10643476 (10Jhancock.wm) almost done with this, just fighting with the puppet server [18:51:03] !log dancy@deploy2002 Installation of scap version "4.141.1" completed for 1 hosts [18:53:07] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:eqsin and A:cp for 9.2.9-1wm1 [18:53:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[18:54:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:00:42] (03PS1) 10Gergő Tisza: Do not trigger edge login on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128501 [19:01:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128501 (owner: 10Gergő Tisza) [19:01:44] (03CR) 10Ladsgroup: "(We are not deploying this right now since this is before dc switchover and there is a db maint freeze, while this is not having any impac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125556 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [19:03:34] Reedy: sbassett: are you using the security window today? I have a bunch of SUL3 related fixes that won't fit into the normal backport window. If you are not using the security window, I'd like to steal it. [19:04:06] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T385896, xfer categories jnl) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards [19:04:10] T385896: Deploy wdqs-categories on wdqs-main/wdqs-internal-main hosts - https://phabricator.wikimedia.org/T385896 [19:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [19:04:56] this data transfer will depool one host in wdqs-main and one in wdqs-internal-main. 
this shouldn't cause any alerts, but i'll be watching [19:07:02] (03PS1) 10Gergő Tisza: Re-apply "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) [19:07:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [19:08:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza) [19:08:48] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T385896, xfer categories jnl) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards [19:09:29] (03CR) 10Tacsipacsi: search-redirect: Handle $_GET potential vulnerability scanning (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [19:12:44] (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128505 (https://phabricator.wikimedia.org/T380572) [19:15:19] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 105108 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [19:15:24] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128505 
(https://phabricator.wikimedia.org/T380572) (owner: 10Ebernhardson) [19:16:25] (03CR) 10Giuseppe Lavagetto: [C:03+1] hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [19:16:55] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [19:17:02] (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128505 (https://phabricator.wikimedia.org/T380572) (owner: 10Ebernhardson) [19:20:01] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:20:12] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:20:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10643585 (10phaultfinder) [19:21:26] (03CR) 10Federico Ceratto: "In order to progress the CR do you want me to rollback the last commits and move them to a different CR?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [19:27:05] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:27:13] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:28:01] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:28:09] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:32:40] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:32:48] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:45:21] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4051.ulsfo.wmnet [19:45:28] (03PS1) 10Gergő Tisza: Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) [19:45:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza) [19:48:03] (03CR) 10Ebernhardson: [C:03+2] Bump changelog version for sudachi analyzer [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1126663 (https://phabricator.wikimedia.org/T386868) (owner: 10Ryan Kemper) [19:48:06] (03PS2) 10Gergő Tisza: Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) [19:50:50] (03PS1) 10BCornwall: upgrade cp4051 to Varnish 7.1 [puppet] 
- 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) [19:52:41] (03CR) 10Ssingh: [C:03+1] upgrade cp4051 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:52:55] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:53:31] (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp4051 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:54:41] !log amastilovic@deploy2002 Started deploy [airflow-dags/analytics@f0d67b6]: Keeping up with the Kubernetes migration [19:55:14] !log amastilovic@deploy2002 Finished deploy [airflow-dags/analytics@f0d67b6]: Keeping up with the Kubernetes migration (duration: 00m 46s) [19:57:31] PROBLEM - Webrequests Varnishkafka log producer on cp4051 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:58:20] (03CR) 10DCausse: [C:03+1] wdqs categories: switch to internal-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124535 (https://phabricator.wikimedia.org/T375520) (owner: 10Ryan Kemper) [19:58:31] RECOVERY - Webrequests Varnishkafka log producer on cp4051 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T2000). 
[20:00:05] Jdlrobson, bvibber, Sohom_Datta, mszabo, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] o/ [20:00:36] moin [20:00:38] I have a lot of patches, I'll deploy them in the next window [20:00:42] o/ [20:00:48] o/ [20:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [20:01:08] o/ [20:02:42] also, I can deploy [20:03:05] thank you :) [20:03:09] Jdlrobson: bvibber: can the three config patches go together? [20:03:27] mine should be able to play well with others [20:03:57] tgr_: yes [20:04:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: 10Jdlrobson) [20:04:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [20:04:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber) [20:04:36] (03PS1) 10Scott French: php8.1: fix comment typo in Dockerfile.template [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) [20:05:21] (03Merged) 10jenkins-bot: Enable Vector 2022 on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: 10Jdlrobson) [20:05:24] (03Merged) 10jenkins-bot: Enable Donation banner on Catalan Wikipedia 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson) [20:06:09] (03Merged) 10jenkins-bot: Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber) [20:06:24] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4051.ulsfo.wmnet [20:06:29] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127677|Enable Vector 2022 on Wikidata (T387154)]], [[gerrit:1127155|Enable Donation banner on Catalan Wikipedia (T387768)]], [[gerrit:1127976|Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf (T385917)]] [20:06:35] T387154: Enable Vector 2022 in Wikidata.org by default - https://phabricator.wikimedia.org/T387154 [20:06:35] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [20:06:36] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [20:10:29] (03CR) 10Scott French: "Ah, that's good to know! 
I think I'd slightly prefer to remove it, if only to avoid future confusion as to where to source the package fro" [puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:10:40] !log tgr@deploy2002 bvibber, jdlrobson, tgr: Backport for [[gerrit:1127677|Enable Vector 2022 on Wikidata (T387154)]], [[gerrit:1127155|Enable Donation banner on Catalan Wikipedia (T387768)]], [[gerrit:1127976|Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf (T385917)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:41] (03CR) 10Scott French: [C:03+2] aptrepo: remove component/pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:11:36] mine confirmed good [20:12:36] Wikidata.org: good to go [20:12:41] Just checking ca.wikipedia.org now [20:12:57] Also good to go [20:13:04] tgr thanks! [20:13:17] (03CR) 10Scott French: [V:03+2] "No functional change in built images." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:13:33] !log tgr@deploy2002 bvibber, jdlrobson, tgr: Continuing with sync [20:14:00] (03CR) 10Gergő Tisza: [C:03+2] Lua: Prevent PHP errors in production from displayNumber lookup [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) (owner: 10Sohom Datta) [20:14:31] (03CR) 10Scott French: [V:03+2] "Self-merging, as this does not in any way affect built images." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:14:46] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: fix comment typo in Dockerfile.template [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [20:15:17] (03Merged) 10jenkins-bot: Lua: Prevent PHP errors in production from displayNumber lookup [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) (owner: 10Sohom Datta) [20:16:08] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10643831 (10Papaul) [20:19:57] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127677|Enable Vector 2022 on Wikidata (T387154)]], [[gerrit:1127155|Enable Donation banner on Catalan Wikipedia (T387768)]], [[gerrit:1127976|Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf (T385917)]] (duration: 13m 28s) [20:20:03] T387154: Enable Vector 2022 in Wikidata.org by default - https://phabricator.wikimedia.org/T387154 [20:20:03] T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768 [20:20:04] T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917 [20:20:15] <3 [20:20:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:21:52] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128489|Lua: Prevent PHP errors in production from 
displayNumber lookup (T383924)]] [20:21:56] T383924: ProofreadPage\Pagination\PageNotInPaginationException: $page does not belong to the pagination - https://phabricator.wikimedia.org/T383924 [20:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:26:32] !log tgr@deploy2002 soda, tgr: Backport for [[gerrit:1128489|Lua: Prevent PHP errors in production from displayNumber lookup (T383924)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:27:19] Works! [20:27:28] !log tgr@deploy2002 soda, tgr: Continuing with sync [20:27:51] (03CR) 10Gergő Tisza: [C:03+2] GlobalContributions: Use unique CentralAuth tokens per request [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) (owner: 10Máté Szabó) [20:29:01] thanks tgr_ for your help today! 
[20:32:12] +1, thank you :) [20:32:47] yw [20:32:57] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4043.ulsfo.wmnet [20:33:47] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128489|Lua: Prevent PHP errors in production from displayNumber lookup (T383924)]] (duration: 11m 54s) [20:33:50] T383924: ProofreadPage\Pagination\PageNotInPaginationException: $page does not belong to the pagination - https://phabricator.wikimedia.org/T383924 [20:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:36:21] (PS1) BCornwall: upgrade cp4043 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) [20:38:49] (Merged) jenkins-bot: GlobalContributions: Use unique CentralAuth tokens per request [extensions/CheckUser] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) (owner: Máté Szabó) [20:38:51] (CR) Ssingh: [C:+1] upgrade cp4043 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [20:39:49] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128493|GlobalContributions: Use unique CentralAuth tokens per request (T384717)]] [20:39:53] T384717: Investigate external API call error on 
Special:GlobalContributions - https://phabricator.wikimedia.org/T384717 [20:40:24] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [20:40:41] (CR) BCornwall: [V:+1 C:+2] upgrade cp4043 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [20:43:41] !log tgr@deploy2002 tgr, mszabo: Backport for [[gerrit:1128493|GlobalContributions: Use unique CentralAuth tokens per request (T384717)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:43:49] looking [20:44:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:45:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye [20:45:33] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10644008 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw... 
[20:46:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:48:50] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet [20:49:13] tgr: looks ok [20:50:15] !log tgr@deploy2002 tgr, mszabo: Continuing with sync [20:50:30] (CR) Gergő Tisza: [C:+2] Do not trigger edge login on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128501 (owner: Gergő Tisza) [20:50:33] (CR) Gergő Tisza: [C:+2] Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) (owner: Gergő Tisza) [20:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:57:26] (Merged) jenkins-bot: Do not trigger edge login on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 
https://gerrit.wikimedia.org/r/1128501 (owner: Gergő Tisza) [20:57:26] (Merged) jenkins-bot: Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) (owner: Gergő Tisza) [20:57:49] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128493|GlobalContributions: Use unique CentralAuth tokens per request (T384717)]] (duration: 18m 00s) [20:57:53] T384717: Investigate external API call error on Special:GlobalContributions - https://phabricator.wikimedia.org/T384717 [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T2100). [21:00:52] Reedy, sbassett, Maryum, manfredi: do you plan to use the window? I have another hour's worth of patches to go [21:01:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:06:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:06:58] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128501|Do not trigger edge login on the shared domain]], [[gerrit:1128515|Do not initiate central login on the 
passive central domain (T388218)]] [21:07:02] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218 [21:08:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644083 (phaultfinder) [21:08:57] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:eqsin and A:cp for 9.2.9-1wm1 [21:10:43] !log tgr@deploy2002 tgr: Backport for [[gerrit:1128501|Do not trigger edge login on the shared domain]], [[gerrit:1128515|Do not initiate central login on the passive central domain (T388218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:50] !log ran `reprepro --delete clearvanished` to complete removal of unused component/pcre2 - T386006 [21:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:54] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [21:17:25] (PS1) Ryan Kemper: sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811) [21:18:11] (PS2) Ryan Kemper: sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811) [21:18:34] (CR) Bking: [C:+2] sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper) [21:18:37] (CR) Bking: [V:+2 C:+2] sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - https://gerrit.wikimedia.org/r/1128533 
(https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper) [21:26:03] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 247 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 242, active_shards: 242, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 245, delayed_unassigned_shards: 0, number_of_pending_tasks: 85, numbe [21:26:03] flight_fetch: 0, task_max_waiting_in_queue_millis: 16264, active_shards_percent_as_number: 49.48875255623722 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:07] !log tgr@deploy2002 tgr: Continuing with sync [21:26:15] PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 247 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 242, active_shards: 242, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 247, delayed_unassigned_shards: 0, number_of_pe [21:26:15] sks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 49.48875255623722 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2089.codfw.wmnet with OS bullseye [21:29:06] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [21:29:11] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [21:29:13] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission 
testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10644131 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2089.codfw.wmn... [21:30:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [21:30:41] (PS1) Ryan Kemper: sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - https://gerrit.wikimedia.org/r/1128536 [21:32:51] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128501|Do not trigger edge login on the shared domain]], [[gerrit:1128515|Do not initiate central login on the passive central domain (T388218)]] (duration: 25m 53s) [21:32:57] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218 [21:34:29] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [21:34:33] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [21:34:51] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) (owner: Gergő Tisza) [21:35:32] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [21:35:44] (Merged) jenkins-bot: Re-apply "Fix some SUL3 shared domain settings" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) (owner: Gergő Tisza) [21:36:03] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128496|Re-apply "Fix some SUL3 shared domain settings" (T388218)]] [21:38:20] (CR) CI reject: [V:-1] sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - https://gerrit.wikimedia.org/r/1128536 (owner: Ryan Kemper) [21:40:48] !log tgr@deploy2002 tgr: Backport for [[gerrit:1128496|Re-apply "Fix some SUL3 shared domain settings" (T388218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:40:52] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218 [21:48:47] (PS1) Eevans: corto: set production irc channels in /srv/git/private [puppet] - https://gerrit.wikimedia.org/r/1128538 [21:49:48] (CR) Eevans: [C:+2] corto: set production irc channels in /srv/git/private [puppet] - https://gerrit.wikimedia.org/r/1128538 (owner: Eevans) [22:01:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644245 (phaultfinder) [22:24:28] (Abandoned) BCornwall: ncmonitor: Ignore wikipediacreators.com [puppet] - https://gerrit.wikimedia.org/r/1115996 (owner: BCornwall) [22:27:01] (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor) [22:32:17] (PS1) BCornwall: ncmonitor: rm edit-for-pay domains from ignorelist [puppet] - https://gerrit.wikimedia.org/r/1128543 [22:34:03] (PS2) BCornwall: ncmonitor: rm edit-for-pay domains from ignorelist [puppet] - https://gerrit.wikimedia.org/r/1128543 [22:34:11] !log tgr@deploy2002 tgr: Continuing with sync [22:34:11] (PS1) Btullis: data-platform: Fix the dashboard URL of the stats 
server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) [22:35:27] (CR) CI reject: [V:-1] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: Btullis) [22:37:27] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1126676/5096/lists1004.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1126676 (https://phabricator.wikimedia.org/T388354) (owner: Dzahn) [22:37:31] (PS1) BCornwall: ncredir: Redirect seized edit-for-pay domains [puppet] - https://gerrit.wikimedia.org/r/1128545 [22:38:28] (PS1) Cwhite: logstash: add ids to plugins where missing [puppet] - https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072) [22:38:46] (CR) Dzahn: [V:+1 C:+2] "compiler output shows how it affects stewards-l but not other lists" [puppet] - https://gerrit.wikimedia.org/r/1126676 (https://phabricator.wikimedia.org/T388354) (owner: Dzahn) [22:39:43] (PS1) Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 [22:40:07] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5097/co" [puppet] - https://gerrit.wikimedia.org/r/1128543 (owner: BCornwall) [22:40:07] (CR) CI reject: [V:-1] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [22:40:14] (CR) Bking: [C:+1] opensearch: drop minimum_master_nodes [puppet] - https://gerrit.wikimedia.org/r/1125478 (owner: DCausse) [22:40:38] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128496|Re-apply "Fix some SUL3 shared domain settings" (T388218)]] (duration: 64m 35s) 
[22:40:42] T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218 [22:40:47] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5098/co" [puppet] - https://gerrit.wikimedia.org/r/1128545 (owner: BCornwall) [22:41:31] (PS2) Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 [22:41:57] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [22:41:58] (CR) CI reject: [V:-1] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [22:41:59] (CR) Dzahn: [C:+1] "lgtm. another change will then add them to ncredir service?" 
[puppet] - https://gerrit.wikimedia.org/r/1128543 (owner: BCornwall) [22:42:04] (PS2) Cwhite: logstash: add ids to plugins where missing [puppet] - https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072) [22:42:21] (CR) Ebernhardson: "I'm not 100% sure this is the best solution, ideally i think we would want a way to tell opensearch to look in a directory other than the " [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [22:45:40] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) (owner: Gergő Tisza) [22:45:46] (CR) BCornwall: [V:+1 C:+2] "Indeed, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128545/" [puppet] - https://gerrit.wikimedia.org/r/1128543 (owner: BCornwall) [22:47:30] (PS2) Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) [22:48:43] (CR) CI reject: [V:-1] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: Btullis) [22:50:19] (PS3) Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 [22:50:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644539 (phaultfinder) [22:50:42] (CR) CI reject: [V:-1] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [22:52:54] (Merged) jenkins-bot: Re-apply "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 
https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) (owner: Gergő Tisza) [22:53:13] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128502|Re-apply "Try both SUL2 and SUL3 central domain for autologin" (T375796)]] [22:53:16] T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796 [22:53:36] (CR) Dzahn: [C:+1] "lgtm. nice solution" [puppet] - https://gerrit.wikimedia.org/r/1128545 (owner: BCornwall) [22:54:27] (PS4) Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 [22:55:08] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [22:57:11] !log tgr@deploy2002 tgr: Backport for [[gerrit:1128502|Re-apply "Try both SUL2 and SUL3 central domain for autologin" (T375796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:59:17] (PS5) Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - https://gerrit.wikimedia.org/r/1128547 [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T2300) [23:00:42] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128547 (owner: Ebernhardson) [23:01:23] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f9fc2c741c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitec [23:01:23] dia.org/wiki/Search%23Administration [23:01:37] PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS 
on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https:/ [23:02:03] h.wikimedia.org/wiki/PyBal [23:02:03] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https:/ [23:02:03] h.wikimedia.org/wiki/PyBal [23:02:11] PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [23:02:26] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm [23:02:37] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644582 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm [23:04:17] ^^ looking at the Cloudelastic alerts now [23:04:26] they should clear shortly [23:04:42] FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not 
in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [23:05:35] (CR) BCornwall: [V:+1 C:+2] ncredir: Redirect seized edit-for-pay domains [puppet] - https://gerrit.wikimedia.org/r/1128545 (owner: BCornwall) [23:08:05] (PS3) Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) [23:09:07] (PS4) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor) [23:09:18] (CR) CI reject: [V:-1] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: Btullis) [23:09:45] (PS1) Dzahn: lists::automation: explain how this can sync mailman list members [puppet] - https://gerrit.wikimedia.org/r/1128551 (https://phabricator.wikimedia.org/T388354) [23:11:06] (CR) Pppery: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor) [23:11:28] (CR) Pppery: [C:+1] ncredir: Redirect seized edit-for-pay domains [puppet] - https://gerrit.wikimedia.org/r/1128545 (owner: BCornwall) [23:12:15] (PS4) Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) [23:13:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:13:03] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[23:13:11] RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 715 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [23:13:23] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: green, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 764, active_shards: 1531, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks [23:13:23] (CR) Dzahn: [C:+2] lists::automation: explain how this can sync mailman list members [puppet] - https://gerrit.wikimedia.org/r/1128551 (https://phabricator.wikimedia.org/T388354) (owner: Dzahn) [23:13:23] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 261, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:13:37] RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 715 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [23:14:20] (CR) Pppery: ncmonitor: rm edit-for-pay domains from ignorelist (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1128543 (owner: BCornwall) [23:15:17] (CR) BCornwall: [V:+1 C:+2] ncmonitor: rm edit-for-pay domains from ignorelist (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1128543 (owner: BCornwall) [23:16:43] (PS1) BCornwall: ncmonitor/ncredir: Add two more edit-for-pay sites [puppet] - https://gerrit.wikimedia.org/r/1128559 [23:21:03] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but 
pooled: cloudelasticlb_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https:/ [23:21:03] h.wikimedia.org/wiki/PyBal [23:21:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https:/ [23:21:03] h.wikimedia.org/wiki/PyBal [23:22:03] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:22:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:22:06] ^^ again, these should clear momentarily [23:23:21] (CR) Pppery: [C:+1] ncmonitor/ncredir: Add two more edit-for-pay sites [puppet] - https://gerrit.wikimedia.org/r/1128559 (owner: BCornwall) [23:23:41] (CR) BCornwall: [C:+2] "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1128559 (owner: BCornwall) [23:29:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:33:37] !log tgr@deploy2002 tgr: Continuing with sync [23:34:19] (PS1) Gergő Tisza: Do not schedule edge login recursively [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) [23:35:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) (owner: Gergő Tisza) [23:37:43] !log zabe@mwmaint2002:~$ cat group0.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php {} --deletedump /home/zabe/afl_text_table_deletedump/{} --dump /home/zabe/afl_text_table_dump/{}" # T381599 [23:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:46] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [23:39:58] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128502|Re-apply "Try both SUL2 and SUL3 central domain for autologin" (T375796)]] (duration: 46m 45s) [23:40:02] T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796 [23:41:14] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) (owner: Gergő Tisza) [23:43:14] 
(Merged) jenkins-bot: Do not schedule edge login recursively [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) (owner: Gergő Tisza) [23:43:34] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] [23:43:38] T389132: Neverending edge login - https://phabricator.wikimedia.org/T389132 [23:47:26] !log tgr@deploy2002 tgr: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:48:39] (PS4) Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) [23:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:52:55] !log tgr@deploy2002 tgr: Continuing with sync [23:59:09] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] (duration: 15m 35s) [23:59:13] T389132: Neverending edge login - https://phabricator.wikimedia.org/T389132