[00:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[00:01:11] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:10:43] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:24:57] <wikibugs>	 SRE, MediaWiki-Debug-Logger, observability, Observability-Logging, Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#10640178 (Pppery)
[00:38:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128057
[00:38:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128057 (owner: TrainBranchBot)
[00:49:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128057 (owner: TrainBranchBot)
[00:51:11] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:54:52] <wikibugs>	 (CR) Tacsipacsi: search-redirect: Handle $_GET potential vulnerability scanning (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: Jforrester)
[01:05:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:49] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128059
[01:08:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128059 (owner: TrainBranchBot)
[01:28:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128059 (owner: TrainBranchBot)
[01:46:29] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/8e93d155635316ec7caba9a8066787b4cd39e47b01b908a0d067cb574d23b01f/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:54:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640230 (phaultfinder)
[02:06:29] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:13:43] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:42:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:27:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[03:54:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640249 (phaultfinder)
[04:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[04:54:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640274 (phaultfinder)
[05:05:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:27] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-03-14-045617-production [deployment-charts] - https://gerrit.wikimedia.org/r/1128066 (https://phabricator.wikimedia.org/T382294)
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:57] <wikibugs>	 (PS1) KartikMistry: MinT: staging: Increase rediness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889)
[05:56:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:33:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.14s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:35:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.14s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:40:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:42:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T0700). nyaa~
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:12:07] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2243 [puppet] - https://gerrit.wikimedia.org/r/1128215
[07:12:17] <wikibugs>	 (CR) Arnaudb: [C:+1] mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[07:13:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[07:14:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[07:15:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section test-s4
[07:16:41] <wikibugs>	 SRE, Commons, Wikimedia-production-error: https://commons.wikimedia.org/w/index.php?curid=162194998 - URL shows an exception instead of either a file description page or a 404response if there was no page associated with the curid/mediaid - https://phabricator.wikimedia.org/T389031#10640405 (A_smart_k...
[07:16:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section test-s4
[07:17:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[07:17:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[07:19:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640415 (phaultfinder)
[07:20:42] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db2243 [puppet] - https://gerrit.wikimedia.org/r/1128215 (owner: Marostegui)
[07:21:48] <wikibugs>	 SRE, Commons, Wikimedia-production-error: Fatal exception of type "LogicException" when visiting some curid URLs on Wikimedia Commons - https://phabricator.wikimedia.org/T389031#10640416 (A_smart_kitten)
[07:24:00] <wikibugs>	 (CR) Slyngshede: idp-test: add Phabricator test instance client (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[07:24:16] <wikibugs>	 (CR) Slyngshede: "That would be me :-)" [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[07:27:10] <wikibugs>	 SRE, Commons, Wikimedia-production-error: Fatal exception of type "LogicException" when visiting some curid URLs on Wikimedia Commons - https://phabricator.wikimedia.org/T389031#10640432 (A_smart_kitten)
[07:27:58] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[07:28:20] <logmsgbot>	 !log marostegui@cumin2002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[07:28:44] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034 (MoritzMuehlenhoff) NEW
[07:29:04] <wikibugs>	 (PS1) Filippo Giunchedi: base: don't show diff for phaste config [puppet] - https://gerrit.wikimedia.org/r/1128225
[07:29:06] <logmsgbot>	 !log marostegui@cumin2002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[07:32:16] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: disable 'accelerator' cadvisor metric [puppet] - https://gerrit.wikimedia.org/r/1128319 (https://phabricator.wikimedia.org/T388632)
[07:37:41] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: add hosts-for-role command [puppet] - https://gerrit.wikimedia.org/r/1128330
[07:39:14] <wikibugs>	 (PS1) Muehlenhoff: Remove vrook from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1128334
[07:41:15] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: add hosts-for-role command [puppet] - https://gerrit.wikimedia.org/r/1128330 (owner: Filippo Giunchedi)
[07:47:01] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1128334 (owner: Muehlenhoff)
[07:47:20] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1128322 (owner: Muehlenhoff)
[07:48:21] <wikibugs>	 (Abandoned) Brouberol: rbac: deploy the airflow-dumps ClusterRole to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1126920 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:48:31] <wikibugs>	 (Abandoned) Brouberol: airflow: allow binding the airflow-dumps ClusterRole to the airflow SA [deployment-charts] - https://gerrit.wikimedia.org/r/1126921 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:48:43] <wikibugs>	 (Abandoned) Brouberol: airflow-analytics-test: bind the airflow-dumps clusterRole [deployment-charts] - https://gerrit.wikimedia.org/r/1126922 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:48:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove vrook from Icinga config [puppet] - https://gerrit.wikimedia.org/r/1128334 (owner: Muehlenhoff)
[07:49:58] <wikibugs>	 (Abandoned) Brouberol: airflow: mount the hadoop configuration in the webserver and scheduler pods [deployment-charts] - https://gerrit.wikimedia.org/r/1123527 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[07:50:42] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add cn=bitu-account-managers to list of groups to drop on offboarding [puppet] - https://gerrit.wikimedia.org/r/1128322 (owner: Muehlenhoff)
[07:55:55] <wikibugs>	 (PS4) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[07:56:18] <wikibugs>	 (CR) CI reject: [V:-1] Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[07:57:27] <wikibugs>	 (CR) Brouberol: [C:+2] mediawiki: render configmaps when dumps are enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[07:57:34] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: render configmaps when dumps are enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[08:00:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640484 (phaultfinder)
[08:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[08:02:57] <wikibugs>	 (PS5) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:05:32] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5081/co" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:13:37] <wikibugs>	 (CR) Elukey: [C:+1] "LGTM, I don't see anything weird in the config. I think that we should coordinate on the deployment procedure, so that we can test properl" [puppet] - https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: Simon04)
[08:15:14] <wikibugs>	 (CR) Arnaudb: [C:+2] nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:16:40] <wikibugs>	 (CR) Elukey: [C:+1] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:16:46] <wikibugs>	 (CR) Elukey: [C:+1] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:16:55] <wikibugs>	 (CR) Elukey: [C:+1] services: update codfw changeprop/changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:16:56] <wikibugs>	 (PS1) Tiziano Fogli: nrpe/monitoring-plugins-standard: fix deps [puppet] - https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680)
[08:17:11] <wikibugs>	 (CR) Elukey: [C:+2] Revert "Temporary revert changeprop/changeprop-jobqueue to node 18 images" [deployment-charts] - https://gerrit.wikimedia.org/r/1127939 (owner: Aaron Schulz)
[08:17:40] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: WidespreadPuppetFailure - https://phabricator.wikimedia.org/T389037 (LSobanski) NEW
[08:17:48] <wikibugs>	 (PS6) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:18:42] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T389038 (LSobanski) NEW
[08:19:12] <wikibugs>	 (PS1) Arnaudb: Revert "nftables: add a newline at the end of GERRIT_ABUSERS lines" [puppet] - https://gerrit.wikimedia.org/r/1128337
[08:19:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640549 (phaultfinder)
[08:20:02] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:20:47] <wikibugs>	 (CR) Brouberol: [V:+1] "@btullis@wikimedia.org @xcollazo@wikimedia.org I took the liberty to rework the patch to make sure that hive-site.xml shows the required d" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:22:33] <wikibugs>	 (CR) Arnaudb: [C:+2] Revert "nftables: add a newline at the end of GERRIT_ABUSERS lines" [puppet] - https://gerrit.wikimedia.org/r/1128337 (owner: Arnaudb)
[08:23:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[08:23:14] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[08:23:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[08:23:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:28:45] <wikibugs>	 (PS2) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[08:30:23] <moritzm>	 !log updated bookworm installer image to Bookworm 12.10 T389034
[08:30:24] <wikibugs>	 (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[08:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:27] <stashbot>	 T389034: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034
[08:39:35] <wikibugs>	 (Abandoned) Brouberol: mediawiki: render configmaps when dumps are enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1126574 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[08:41:15] <wikibugs>	 (PS1) Brouberol: mediawiki-dumps-legacy: enabled the dumps feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1128339 (https://phabricator.wikimedia.org/T388378)
[08:42:24] <wikibugs>	 (CR) Majavah: "fwiw, this would've worked with double quotes instead of single ones" [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:43:32] <wikibugs>	 (CR) Arnaudb: [C:+2] "TIL, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1127527 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:45:16] <wikibugs>	 (PS7) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[08:45:58] <wikibugs>	 (PS2) Arnaudb: nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - https://gerrit.wikimedia.org/r/1128338 (https://phabricator.wikimedia.org/T388783)
[08:45:58] <wikibugs>	 (CR) Arnaudb: "Given https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127527/comments/c1e73adc_1c086bad I can also redo I6fecea2faa7774b21e68fdf70ce" [puppet] - https://gerrit.wikimedia.org/r/1128338 (https://phabricator.wikimedia.org/T388783) (owner: Arnaudb)
[08:46:18] <icinga-wm>	 RECOVERY - Disk space on maps1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops
[08:46:32] <moritzm>	 !log freed 28G of disk space on maps1009
[08:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:06] <wikibugs>	 (PS3) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[08:48:19] <wikibugs>	 (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[08:49:03] <wikibugs>	 (PS1) Brouberol: xmldumps-backup: enable the worker script to be called from any path [dumps] - https://gerrit.wikimedia.org/r/1128342 (https://phabricator.wikimedia.org/T388378)
[08:50:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:28] <wikibugs>	 (CR) Slyngshede: [C:+2] P:firewall absent conntrack_table_size monitoring. [puppet] - https://gerrit.wikimedia.org/r/1126503 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:55:25] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:17] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: WidespreadPuppetFailure - https://phabricator.wikimedia.org/T389037#10640633 (MoritzMuehlenhoff) Open→Declined This is caused by WIP setup nodes for the parallel Bookworm cluster, but not affecting any production workloads.
[08:59:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640636 (phaultfinder)
[09:00:26] <jinxer-wm>	 FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:36] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10640642 (MoritzMuehlenhoff)
[09:04:40] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10640643 (MoritzMuehlenhoff) p:Triage→Medium
[09:11:28] <wikibugs>	 (PS1) Elukey: service: move kartotherian-k8s-ssl fully on k8s [puppet] - https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926)
[09:14:03] <wikibugs>	 (CR) JMeybohm: [C:+2] shellbox-video: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:14:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[09:15:26] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: export_smart_data_dump.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:32] <wikibugs>	 (CR) Tiziano Fogli: "AFAICS, you don't have enough points to trigger the alert with an evaluation interval of 2 minutes." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:16:56] <wikibugs>	 (PS1) Elukey: service: set kartotherian and kartotherian-ssl to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1128344 (https://phabricator.wikimedia.org/T389042)
[09:16:58] <wikibugs>	 (PS1) Elukey: service: set kartotherian and kartotherian-ssl to service_setup [puppet] - https://gerrit.wikimedia.org/r/1128345 (https://phabricator.wikimedia.org/T389042)
[09:17:01] <wikibugs>	 (PS1) Elukey: service, conftool-data: final removal for unused Kartotherian configs [puppet] - https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042)
[09:19:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[09:23:31] <wikibugs>	 (PS4) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[09:23:59] <wikibugs>	 (CR) JMeybohm: [C:+2] services/*: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:24:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640704 (phaultfinder)
[09:24:42] <wikibugs>	 (CR) CI reject: [V:-1] Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:24:58] <wikibugs>	 (CR) Ayounsi: "thanks, I tried it locally, but still seeing the same error." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:25:15] <moritzm>	 !log installing intel-microcode security updates
[09:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:33] <wikibugs>	 (PS1) Elukey: maps: remove Kartotherian from bare metal nodes [puppet] - https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042)
[09:28:28] <wikibugs>	 (PS1) Kamila Součková: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T341984)
[09:28:30] <wikibugs>	 (PS1) Kamila Součková: Update wikikube-staging codfw pod ip range [puppet] - https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T386232)
[09:28:50] <wikibugs>	 (PS5) Ayounsi: Duplicate LibreNMS In/out interface errors [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641)
[09:30:09] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680) (owner: Tiziano Fogli)
[09:30:57] <wikibugs>	 (CR) Btullis: [C:+1] "Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1128339 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:31:02] <wikibugs>	 (Merged) jenkins-bot: shellbox-video: use the correct helm version in each cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1127048 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[09:31:11] <wikibugs>	 (CR) Ayounsi: [C:+1] Support setting custom arp-policer on CR interfaces [homer/public] - https://gerrit.wikimedia.org/r/1127592 (https://phabricator.wikimedia.org/T384774) (owner: Cathal Mooney)
[09:32:05] <wikibugs>	 (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: enabled the dumps feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1128339 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:32:11] <wikibugs>	 (CR) Ayounsi: "Not sure why local docker CI doesn't behave. But at long as it passes here..." [alerts] - https://gerrit.wikimedia.org/r/1127041 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[09:32:18] <wikibugs>	 (CR) Btullis: [C:+1] xmldumps-backup: enable the worker script to be called from any path [dumps] - https://gerrit.wikimedia.org/r/1128342 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:33:56] <wikibugs>	 (PS1) Marostegui: db-production.php: Disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1128351 (https://phabricator.wikimedia.org/T388626)
[09:34:09] <wikibugs>	 (PS1) Kamila Součková: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045)
[09:35:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128351 (https://phabricator.wikimedia.org/T388626) (owner: Marostegui)
[09:36:19] <wikibugs>	 (CR) Brouberol: [C:+2] xmldumps-backup: enable the worker script to be called from any path [dumps] - https://gerrit.wikimedia.org/r/1128342 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[09:36:46] <wikibugs>	 (CR) JMeybohm: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[09:36:49] <wikibugs>	 (Merged) jenkins-bot: db-production.php: Disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1128351 (https://phabricator.wikimedia.org/T388626) (owner: Marostegui)
[09:37:28] <logmsgbot>	 !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]]
[09:37:32] <stashbot>	 T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[09:38:12] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Update wikikube-staging codfw pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T386232) (owner: 10Kamila Součková)
[09:38:14] <wikibugs>	 (03PS2) 10Kamila Součková: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045)
[09:38:58] <wikibugs>	 (03PS3) 10Kamila Součková: Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045)
[09:39:08] <wikibugs>	 (03Merged) 10jenkins-bot: services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127947 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková)
[09:39:58] <wikibugs>	 (03CR) 10JMeybohm: "We need to change admission_plugins as well, see:" [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[09:40:54] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs3010 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1127853 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[09:40:57] <wikibugs>	 (03PS2) 10Kamila Součková: Update wikikube-staging eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T389045)
[09:43:13] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] "thanks for the thorough reviews <3" [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[09:45:42] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3010.esams.wmnet with OS bookworm
[09:47:10] <wikibugs>	 (03PS4) 10Kamila Součková: Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045)
[09:49:37] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:49:57] <wikibugs>	 (03Merged) 10jenkins-bot: sre.loadbalancer: upgrade/restart cookbook for liberica [cookbooks] - 10https://gerrit.wikimedia.org/r/1127537 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[09:50:21] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:50:22] <vgutierrez>	 ^^ BGP alert is lvs3010 getting reimaged
[09:50:25] <stashbot>	 T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[09:50:26] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[09:51:20] <wikibugs>	 (03PS2) 10Kamila Součková: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045)
[09:51:32] <wikibugs>	 (03CR) 10Kamila Součková: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[09:55:17] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgrading P{lvs4010.ulsfo.wmnet} and A:liberica
[09:55:51] <wikibugs>	 (03CR) 10Btullis: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[09:56:20] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.upgrade (exit_code=1) upgrading P{lvs4010.ulsfo.wmnet} and A:liberica
[09:57:05] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "Nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1000)
[10:00:54] <logmsgbot>	 !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128351|db-production.php: Disable writes on es6 (T388626)]] (duration: 23m 25s)
[10:00:58] <stashbot>	 T388626: Prepare databases circular replication for the DC switchover - https://phabricator.wikimedia.org/T388626
[10:01:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es6
[10:02:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:02:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es6
[10:02:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] "Couple of comments inline." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry)
[10:04:57] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128356
[10:05:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section es7
[10:05:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Bitu: Add approval config for airflow-research-ops [puppet] - 10https://gerrit.wikimedia.org/r/1128357
[10:05:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Bitu: Add obsolete test config [puppet] - 10https://gerrit.wikimedia.org/r/1128358
[10:06:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section es7
[10:06:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128356 (owner: 10Marostegui)
[10:06:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section x1
[10:06:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128356 (owner: 10Marostegui)
[10:07:10] <wikibugs>	 (03PS1) 10Kamila Součková: Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045)
[10:07:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section x1
[10:07:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:07:17] <logmsgbot>	 !log marostegui@deploy2002 Started scap sync-world: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]]
[10:07:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s6
[10:09:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s6
[10:09:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s7
[10:09:38] <wikibugs>	 (03PS2) 10Kamila Součková: Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045)
[10:09:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große)
[10:10:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s7
[10:10:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s8
[10:10:55] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[10:10:59] <wikibugs>	 (03PS1) 10Vgutierrez: sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369)
[10:11:18] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[10:11:42] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[10:11:45] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:11:49] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[10:12:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s8
[10:12:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s5
[10:13:49] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[10:14:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s5
[10:14:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s4
[10:15:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s4
[10:16:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s3
[10:17:03] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[10:17:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s3
[10:17:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s2
[10:18:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s2
[10:19:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from codfw to eqiad for section s1
[10:21:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the switch from codfw to eqiad for section s1
[10:21:58] <logmsgbot>	 !log marostegui@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128356|Revert "db-production.php: Disable writes on es6"]] (duration: 14m 41s)
[10:22:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:23:12] <wikibugs>	 (03PS1) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984)
[10:26:22] <wikibugs>	 (03PS1) 10Muehlenhoff: preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966)
[10:26:33] <wikibugs>	 (03PS2) 10Muehlenhoff: preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966)
[10:27:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:27:43] <wikibugs>	 (03PS3) 10Muehlenhoff: preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966)
[10:28:21] <wikibugs>	 (03CR) 10Elukey: [C:03+1] preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) (owner: 10Muehlenhoff)
[10:29:34] <wikibugs>	 (03PS2) 10Vgutierrez: sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369)
[10:29:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[10:30:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] preseed: Fix syntax for new elastic UEFI nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128364 (https://phabricator.wikimedia.org/T384966) (owner: 10Muehlenhoff)
[10:30:44] <Amir1>	 jouncebot: nowandnext
[10:30:44] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1000)
[10:30:45] <jouncebot>	 In 2 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300)
[10:30:47] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Better :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[10:32:10] * MichaelG_WMF is interested in running a few low-risk maintenance scripts to clean up a few GrowthExperiments tables before the backport window, but nothing urgent
[10:32:47] <wikibugs>	 (03PS2) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984)
[10:32:50] <wikibugs>	 (03PS1) 10Ladsgroup: media: Make SvgHandler respect physicalWidth when building URL for thumb [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128365 (https://phabricator.wikimedia.org/T360589)
[10:32:57] <wikibugs>	 (03PS1) 10Ladsgroup: findBadBlobs: Allow for timestamp based search via --scan-to [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953)
[10:33:26] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] media: Make SvgHandler respect physicalWidth when building URL for thumb [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128365 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:33:31] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] findBadBlobs: Allow for timestamp based search via --scan-to [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[10:37:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs3010.esams.wmnet with OS bookworm
[10:38:01] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3010.esams.wmnet with OS bookworm
[10:38:54] <wikibugs>	 (03PS3) 10JMeybohm: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984)
[10:39:33] <wikibugs>	 (03PS1) 10Brouberol: airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367
[10:39:33] <wikibugs>	 (03PS1) 10Brouberol: airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282)
[10:41:59] <wikibugs>	 (03PS8) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[10:42:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:42:56] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[10:43:57] <jynus>	 !log restarting dbprov2005 T389052
[10:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:02] <stashbot>	 T389052: dbprov2005 lost network link - https://phabricator.wikimedia.org/T389052
[10:44:06] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[10:44:46] <wikibugs>	 (03PS2) 10Majavah: P:wmcs: wikireplicas: Drop module_deps view [puppet] - 10https://gerrit.wikimedia.org/r/1128014 (https://phabricator.wikimedia.org/T388982)
[10:44:49] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] P:wmcs: wikireplicas: Drop module_deps view [puppet] - 10https://gerrit.wikimedia.org/r/1128014 (https://phabricator.wikimedia.org/T388982) (owner: 10Majavah)
[10:44:50] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:44:50] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1000)
[10:44:50] <jouncebot>	 In 2 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300)
[10:45:10] <wikibugs>	 (03PS2) 10Majavah: P:wmcs: wikireplicas: Fix fr_actor not being exposed [puppet] - 10https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491)
[10:45:17] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] P:wmcs: wikireplicas: Fix fr_actor not being exposed [puppet] - 10https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491) (owner: 10Majavah)
[10:45:40] <wikibugs>	 (03PS1) 10JMeybohm: k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1128369 (https://phabricator.wikimedia.org/T388388)
[10:45:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey)
[10:46:58] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[10:46:59] <wikibugs>	 (03PS9) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[10:47:18] * Dreamy_Jazz is interested in doing a few config changes and a maintenance script run for https://phabricator.wikimedia.org/T387205
[10:47:21] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[10:47:32] <wikibugs>	 (03Merged) 10jenkins-bot: media: Make SvgHandler respect physicalWidth when building URL for thumb [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128365 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:47:58] <wikibugs>	 (03PS1) 10Volans: setup.py: limit kafka-python version [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128370
[10:47:58] <wikibugs>	 (03PS1) 10Volans: constants: replace path to old Puppet CA [software/spicerack] - 10https://gerrit.wikimedia.org/r/1128371
[10:48:00] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "I like the approach and it is much simpler than the introduction of the ACLs. We should be careful in rolling out this change but it i" [puppet] - 10https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway)
[10:48:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[10:49:09] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[10:49:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10641020 (10MoritzMuehlenhoff)
[10:49:35] <wikibugs>	 (03Merged) 10jenkins-bot: sre.loadbalancer.upgrade: Fix liberica stop validation [cookbooks] - 10https://gerrit.wikimedia.org/r/1128361 (https://phabricator.wikimedia.org/T388369) (owner: 10Vgutierrez)
[10:49:37] <wikibugs>	 (03CR) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[10:50:14] <wikibugs>	 (03Merged) 10jenkins-bot: findBadBlobs: Allow for timestamp based search via --scan-to [core] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128366 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[10:50:33] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]]
[10:50:38] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[10:50:38] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:51:34] <Dreamy_Jazz>	 Amir1: Will you be done backporting after this sync-world? I would like to do some backporting after you if there is time.
[10:52:06] <Amir1>	 sure. I have an extra deploy after this but it can wait for a bit (and should)
[10:53:01] <wikibugs>	 (03PS1) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372
[10:53:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[10:53:35] <wikibugs>	 (03Merged) 10jenkins-bot: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1128363 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[10:54:37] <wikibugs>	 (03PS2) 10Brouberol: airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282)
[10:54:38] <wikibugs>	 (03PS1) 10Brouberol: airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373
[10:54:42] <Amir1>	 Dreamy_Jazz: would you mind adding this noop patch to your deploys too? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1127894 totally fine if not possible
[10:55:03] <wikibugs>	 (03PS2) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372
[10:55:11] <Dreamy_Jazz>	 Sure, I can do that.
[10:55:11] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:55:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster staging-eqiad: k8s upgrade
[10:56:00] <wikibugs>	 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641045 (10jcrespo)
[10:56:21] <wikibugs>	 (03PS1) 10Esanders: VE: Disable upcoming mobile insert menu everywhere except test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591)
[10:56:27] <wikibugs>	 (03PS3) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372
[10:56:29] <wikibugs>	 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641051 (10jcrespo) Probably a loose cable. If not, a card failure (doesn't look like it from the mgmt log) or a switch port misconfig/issue.
[10:56:35] <Amir1>	 Thanks!
[10:56:37] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[10:56:41] <wikibugs>	 (03PS2) 10Brouberol: airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373
[10:56:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[10:57:01] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage
[10:57:36] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:58:12] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgrading P{lvs5006.eqsin.wmnet} and A:liberica
[10:58:12] <wikibugs>	 (03PS1) 10Dreamy Jazz: Re-enable the 'temporary-account-viewer' group for migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205)
[10:58:14] <wikibugs>	 (03PS1) 10Dreamy Jazz: Unset the old 'checkuser-temporary-account-viewer' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205)
[10:59:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgrading P{lvs5006.eqsin.wmnet} and A:liberica
[10:59:33] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps ratio to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589)
[10:59:39] <wikibugs>	 (03PS4) 10Ayounsi: Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372
[10:59:41] <vgutierrez>	 volans: ^^ now it worked as expected :D
[10:59:55] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[10:59:58] <wikibugs>	 10ops-codfw, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641089 (10jcrespo)
[11:00:51] <wikibugs>	 (03PS2) 10Dreamy Jazz: Re-enable the 'temporary-account-viewer' group for migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205)
[11:01:12] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 (owner: 10Brouberol)
[11:01:41] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Update wikikube-staging eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1128350 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[11:01:45] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Update staging-eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1128349 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[11:01:50] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Update staging-eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) (owner: 10Kamila Součková)
[11:01:53] <elukey>	 /13/13
[11:01:56] <wikibugs>	 (CR) Kamila Součková: [C:+2] admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[11:01:58] <elukey>	 uff err :)
[11:02:09] <vgutierrez>	 elukey: feeling lucky today?
[11:02:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3010.esams.wmnet with reason: host reimage
[11:02:41] <volans>	 vgutierrez: nice
[11:02:43] <elukey>	 vgutierrez: not so much :D
[11:02:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster1003.eqiad.wmnet are marked down but pooled: k8s-ingress-staging_30443: Servers kubestage1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:02:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubestagemaster_6443: Servers kubestagemaster1004.eqiad.wmnet are marked down but pooled: k8s-ingress-staging_30443: Servers kubestage1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:03:15] <jayme>	 this is kamila_ and me, updating staging-eqiad
[11:04:39] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] nrpe/monitoring-plugins-standard: fix deps [puppet] - https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680) (owner: Tiziano Fogli)
[11:04:40] <Amir1>	 vgutierrez: I just deployed a fix that makes svg files also respect steps (which they didn't, because I was too stupid to test it for svg files). They should get a bump to 10%; I will also bump to 15% a bit later (same as before, a 5% bump every day)
[11:04:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job liberica in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:04:43] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] (duration: 14m 09s)
[11:04:47] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[11:04:48] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[11:04:49] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[5005-5006].eqsin.wmnet} and A:liberica
[11:04:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205) (owner: Dreamy Jazz)
[11:04:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1127894 (https://phabricator.wikimedia.org/T373037) (owner: Hashar)
[11:05:04] <wikibugs>	 (PS5) Ayounsi: Basic pattern check on preseed hostname regex [puppet] - https://gerrit.wikimedia.org/r/1128372
[11:05:08] <vgutierrez>	 Amir1: cool :D
[11:05:11] <Amir1>	 Dreamy_Jazz: the floor, as you can see, is yours
[11:05:20] <Dreamy_Jazz>	 Thanks!
[11:05:24] <Amir1>	 vgutierrez: let me know if upload starts to struggle
[11:05:39] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128372 (owner: Ayounsi)
[11:05:46] <wikibugs>	 (Merged) jenkins-bot: Re-enable the 'temporary-account-viewer' group for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/1128375 (https://phabricator.wikimedia.org/T387205) (owner: Dreamy Jazz)
[11:05:48] <wikibugs>	 (Merged) jenkins-bot: Remove obsolete $wgParserCacheNewKeySchemaRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/1127894 (https://phabricator.wikimedia.org/T373037) (owner: Hashar)
[11:06:07] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1128375|Re-enable the 'temporary-account-viewer' group for migration (T387205)]], [[gerrit:1127894|Remove obsolete $wgParserCacheNewKeySchemaRatio (T373037)]]
[11:06:12] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[11:06:12] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:06:18] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] site: provision prometheus100[78] with role prometheus [puppet] - https://gerrit.wikimedia.org/r/1127483 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi)
[11:06:41] <wikibugs>	 (Merged) jenkins-bot: Update staging-eqiad to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1128359 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[11:07:06] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs[5005-5006].eqsin.wmnet} and A:liberica
[11:07:09] <elukey>	 10
[11:07:24] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Change staging-eqiad pod ip range to 10.64.64.0/21 [deployment-charts] - https://gerrit.wikimedia.org/r/1128352 (https://phabricator.wikimedia.org/T389045) (owner: Kamila Součková)
[11:08:25] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970)
[11:08:48] <wikibugs>	 (CR) Elukey: [C:+1] setup.py: limit kafka-python version [software/spicerack] - https://gerrit.wikimedia.org/r/1128370 (owner: Volans)
[11:09:36] <wikibugs>	 (CR) Elukey: [C:+1] constants: replace path to old Puppet CA [software/spicerack] - https://gerrit.wikimedia.org/r/1128371 (owner: Volans)
[11:10:25] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, hashar: Backport for [[gerrit:1128375|Re-enable the 'temporary-account-viewer' group for migration (T387205)]], [[gerrit:1127894|Remove obsolete $wgParserCacheNewKeySchemaRatio (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:10:28] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, hashar: Continuing with sync
[11:11:05] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing A:liberica-magru
[11:11:21] <Dreamy_Jazz>	 I will run the migrateUserGroup.php maintenance script and then deploy another config patch shortly after that
[11:11:22] <wikibugs>	 (PS2) Btullis: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650)
[11:11:36] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:11:52] <Dreamy_Jazz>	 Doing that to avoid creating too many translation errors caused by two groups having the same display name being defined at the same time.
[11:12:34] <Dreamy_Jazz>	 So an increase in "group0 has the same name as group1" errors in logstash is expected until I've finished
[11:13:13] <Dreamy_Jazz>	 It importantly won't cause any problems for the end user, just an increased rate of logs in logstash.
[11:13:31] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing A:liberica-magru
[11:14:13] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:14:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job liberica in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:14:49] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:15:16] <wikibugs>	 (CR) Elukey: [C:+1] "@jgiannelos@wikimedia.org lemme know if you have concerns about it. I don't think it should change much for Kartotherian, but we'd probabl" [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[11:16:35] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[4008-4009].ulsfo.wmnet,lvs5004.eqsin.wmnet} and A:liberica
[11:17:38] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:17:47] <wikibugs>	 (CR) Zoe: [C:+1] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127571 (owner: PipelineBot)
[11:18:40] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128375|Re-enable the 'temporary-account-viewer' group for migration (T387205)]], [[gerrit:1127894|Remove obsolete $wgParserCacheNewKeySchemaRatio (T373037)]] (duration: 12m 32s)
[11:18:45] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[11:18:45] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:19:04] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs[4008-4009].ulsfo.wmnet,lvs5004.eqsin.wmnet} and A:liberica
[11:19:20] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:19:31] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:19:33] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:19:36] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:19:57] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:20:30] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:20:38] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1128381
[11:20:42] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:20:49] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:21:09] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:21:27] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:21:59] <wikibugs>	 SRE, SRE-swift-storage, Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10641179 (MatthewVernon) It's not the same issue - those two files have thumbs in different containers (`wikipedia-commons-local-thumb.c6` a...
[11:22:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3010.esams.wmnet with OS bookworm
[11:22:22] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:23:32] <Dreamy_Jazz>	 !log Ran `mwscript migrateUserGroup.php --wiki=testwiki checkuser-temporary-account-viewer temporary-account-viewer`
[11:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:47] <wikibugs>	 (CR) Máté Szabó: [C:+1] Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: Tchanders)
[11:25:41] <wikibugs>	 (PS1) Vgutierrez: site,hiera: Reimage lvs3009 as liberica [puppet] - https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477)
[11:26:25] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[11:26:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10641197 (phaultfinder)
[11:29:35] <wikibugs>	 (CR) Volans: [C:+2] setup.py: limit kafka-python version [software/spicerack] - https://gerrit.wikimedia.org/r/1128370 (owner: Volans)
[11:30:02] <wikibugs>	 (CR) Volans: [C:+2] constants: replace path to old Puppet CA [software/spicerack] - https://gerrit.wikimedia.org/r/1128371 (owner: Volans)
[11:30:03] <Dreamy_Jazz>	 Running `mwscript migrateUserGroup.php --wiki=X checkuser-temporary-account-viewer temporary-account-viewer` for all wikis with temporary accounts enabled or known (testwiki, loginwiki, test2wiki, metawiki, cswikiversity, igwiki, itwikiquote, swwiki, shwiki, fawiktionary, jawikibooks, zh_yuewiki, dawiki, srwiki, rowiki, nowiki, metawiki)
[11:30:09] <Dreamy_Jazz>	 !log Running `mwscript migrateUserGroup.php --wiki=X checkuser-temporary-account-viewer temporary-account-viewer` for all wikis with temporary accounts enabled or known (testwiki, loginwiki, test2wiki, metawiki, cswikiversity, igwiki, itwikiquote, swwiki, shwiki, fawiktionary, jawikibooks, zh_yuewiki, dawiki, srwiki, rowiki, nowiki, metawiki)
[11:30:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:08] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:32:52] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:33:38] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:33:56] <Dreamy_Jazz>	 The migrate script is going to take a while for metawiki, so if anyone wants to deploy in the meantime, feel free.
[11:35:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:35:38] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:35:43] <wikibugs>	 SRE, Thumbor: Thumbnail failures on some SVGs - https://phabricator.wikimedia.org/T389060 (MatthewVernon) NEW
[11:36:07] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:36:08] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:36:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:36:17] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:36:20] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:36:30] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:36:52] <MichaelG_WMF>	 @Dreamy_Jazz I have a maintenance script to run to clean up a GrowthExperiments table for eswiki+ptwiki+idwiki+arzwiki. Should be low risk (I have already run that exact same script for others without any trouble) - is that ok?
[11:36:58] <wikibugs>	 (CR) Clément Goubert: [C:+1] switchdc: delete Job objects for mw-cron due to library support [cookbooks] - https://gerrit.wikimedia.org/r/1127878 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[11:36:59] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:37:25] <Dreamy_Jazz>	 MichaelG_WMF: Yeah, should be fine to run that.
[11:37:39] <MichaelG_WMF>	 👍
[11:37:39] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:37:58] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:38:10] <Dreamy_Jazz>	 Especially as there is no overlap between those wikis and the ones in my maintenance script run
[11:38:20] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:38:31] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:38:41] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:38:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:38:51] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:39:36] <MichaelG_WMF>	 !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=eswiki --db-table --verbose --force 2>&1 | tee ~/eswiki-dbtable.txt`
[11:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:23] <wikibugs>	 (Merged) jenkins-bot: setup.py: limit kafka-python version [software/spicerack] - https://gerrit.wikimedia.org/r/1128370 (owner: Volans)
[11:40:24] <wikibugs>	 (Merged) jenkins-bot: constants: replace path to old Puppet CA [software/spicerack] - https://gerrit.wikimedia.org/r/1128371 (owner: Volans)
[11:40:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:40:38] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] nrpe/monitoring-plugins-standard: fix deps [puppet] - https://gerrit.wikimedia.org/r/1128336 (https://phabricator.wikimedia.org/T388680) (owner: Tiziano Fogli)
[11:41:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:41:45] <MichaelG_WMF>	 !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=ptwiki --db-table --verbose --force 2>&1 | tee ~/ptwiki-dbtable.txt`
[11:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:43:29] <MichaelG_WMF>	 !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=idwiki --db-table --verbose --force 2>&1 | tee ~/idwiki-dbtable.txt`
[11:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:34] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:44:06] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[11:44:16] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[11:44:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster staging-eqiad: k8s upgrade
[11:45:03] <MichaelG_WMF>	 !log running `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=arzwiki --db-table --verbose --force 2>&1 | tee ~/arzwiki-dbtable.txt`
[11:45:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:58] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:46:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:47:04] <MichaelG_WMF>	 !log `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=arzwiki --search-index --verbose 2>&1 | tee ~/arzwiki-searchindex.txt`
[11:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:08] <wikibugs>	 (PS1) Vgutierrez: hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796)
[11:48:35] <wikibugs>	 (PS2) Vgutierrez: hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796)
[11:49:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/apertium: apply
[11:49:07] <MichaelG_WMF>	 !log `time mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=shwiki --search-index --verbose 2>&1 | tee ~/shwiki-searchindex.txt`
[11:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:21] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/api-gateway: apply
[11:49:22] <wikibugs>	 (CR) Jgiannelos: [C:+2] changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: Jgiannelos)
[11:49:34] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[11:50:27] <MichaelG_WMF>	 Alright, I ran all the scripts I wanted to run, and I think I'm done.
[11:50:46] <wikibugs>	 (Merged) jenkins-bot: changeprop: Rollout more wikis for PCS/RESTBase sunset [deployment-charts] - https://gerrit.wikimedia.org/r/1126569 (https://phabricator.wikimedia.org/T388140) (owner: Jgiannelos)
[11:51:24] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:51:26] <claime>	 MichaelG_WMF: Can I encourage you to try and run them with mwscript-k8s if possible next time?
[11:52:11] <MichaelG_WMF>	 claime: Can I do that by now as someone who is not a deployer and only is part of the `restricted` group?
[11:52:29] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/changeprop: apply
[11:52:32] <claime>	 MichaelG_WMF: Ah, let me check, I think that's been fixed, or at least is in the process of being fixed
[11:52:36] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:52:46] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/changeprop-jobqueue: apply
[11:52:48] <MichaelG_WMF>	 claime: if that is possible by now, then I would be happy to be pointed to some tutorial or get some training around k8s
[11:53:11] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/chart-renderer: apply
[11:53:21] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/cirrus-streaming-updater: apply
[11:53:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/citoid: apply
[11:53:57] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/commons-impact-analytics: apply
[11:54:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/cxserver: apply
[11:54:53] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/data-gateway: apply
[11:55:11] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/developer-portal: apply
[11:55:27] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/device-analytics: apply
[11:55:45] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/echostore: apply
[11:55:45] <wikibugs>	 (CR) KartikMistry: "Noted. Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: KartikMistry)
[11:56:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/edit-analytics: apply
[11:56:24] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:56:25] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/editor-analytics: apply
[11:56:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-analytics: apply
[11:57:24] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-analytics-external: apply
[11:57:57] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-logging-external: apply
[11:58:13] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventgate-main: apply
[11:58:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventstreams: apply
[11:59:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/eventstreams-internal: apply
[11:59:20] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/geo-analytics: apply
[11:59:39] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/image-suggestion: apply
[12:00:21] <wikibugs>	 (PS2) KartikMistry: MinT: Increase liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889)
[12:01:04] <wikibugs>	 (PS1) Btullis: dumps: Stop using the analytics replicas for misc dumps [puppet] - https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255)
[12:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[12:01:28] <claime>	 MichaelG_WMF: It would seem like mwscript-k8s isn't completely ready for restricted users afaict, sorry for pinging you about it
[12:01:48] <wikibugs>	 (PS2) Btullis: dumps: Stop using the analytics replicas for misc dumps [puppet] - https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255)
[12:02:01] <MichaelG_WMF>	 claime: All good, I look forward to it when it is ready :)
[12:02:03] <claime>	 Doc's here if you fancy a read https://wikitech.wikimedia.org/wiki/Mwscript-k8s
[12:02:16] <MichaelG_WMF>	 👀
[12:02:27] <wikibugs>	 (CR) KartikMistry: MinT: staging: Increase rediness probe (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: KartikMistry)
[12:02:58] <wikibugs>	 (PS2) KartikMistry: MinT: staging: Increase liveness probe [deployment-charts] - https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889)
[12:04:45] <wikibugs>	 (PS1) Jgiannelos: changeprop: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1128388
[12:06:43] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) (owner: Btullis)
[12:07:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[12:09:10] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/linkrecommendation: apply
[12:13:45] <wikibugs>	 (CR) Jgiannelos: [C:+1] "We can deploy on staging, run a difftest with prod, then deploy to prod." [deployment-charts] - https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: DCausse)
[12:14:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[12:15:15] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[12:15:27] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/kartotherian: apply
[12:16:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:16:06] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/mathoid: apply
[12:16:08] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:16:21] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/media-analytics: apply
[12:17:09] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/miscweb: apply
[12:17:33] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/mobileapps: apply
[12:17:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[12:17:47] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/page-analytics: apply
[12:18:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/proton: apply
[12:18:37] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/push-notifications: apply
[12:18:53] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/ratelimit: apply
[12:19:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/rdf-streaming-updater: apply
[12:19:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/recommendation-api: apply
[12:19:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/rest-gateway: apply
[12:20:01] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/sessionstore: apply
[12:20:35] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox: apply
[12:20:37] <wikibugs>	 (CR) Jforrester: "As written, this disables it everywhere including test2wiki?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) (owner: Esanders)
[12:20:53] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-constraints: apply
[12:21:16] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-media: apply
[12:21:44] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-syntaxhighlight: apply
[12:22:06] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-timeline: apply
[12:22:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/shellbox-video: apply
[12:23:14] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/termbox: apply
[12:23:25] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/thumbor: apply
[12:23:40] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/toolhub: apply
[12:24:01] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/wikidata-query-gui: apply
[12:24:14] <wikibugs>	 (03CR) 10Sérgio Lopes: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos)
[12:24:24] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/wikifeeds: apply
[12:24:57] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos)
[12:25:08] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/wikifunctions: apply
[12:25:32] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] OK helmfile.d/services/zotero: apply
[12:25:33] <wikibugs>	 (03PS3) 10Clément Goubert: mediawiki: Use the servergroup to configure the dumps feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis)
[12:26:31] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos)
[12:26:50] <claime>	 btullis: sorry about that, my CRs were not up to date and I thought the feature flags were not merged
[12:27:07] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[12:27:13] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[12:30:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172) (owner: 10Anzx)
[12:32:19] <wikibugs>	 (03CR) 10Ayounsi: "Maybe the end of fixes like I3c2da307ed6059fab5581a26b36d249d6cf9ddb6 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[12:33:26] <wikibugs>	 (03PS3) 10Btullis: dumps: Stop using the analytics replicas for misc dumps [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255)
[12:34:46] <wikibugs>	 (03PS1) 10Slyngshede: Handle empty query on block user page [software/bitu] - 10https://gerrit.wikimedia.org/r/1128399 (https://phabricator.wikimedia.org/T385947)
[12:36:58] <wikibugs>	 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, and 2 others: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10641387 (10BTullis) Hello. Just FYI, we are planning to switch snapshot1016 back to using the core database serv...
[12:37:25] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128358 (owner: 10Muehlenhoff)
[12:38:27] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[12:38:33] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[12:40:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[12:41:08] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1128357 (owner: 10Muehlenhoff)
[12:44:22] <wikibugs>	 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065 (10BTullis) 03NEW
[12:44:52] <wikibugs>	 10ops-eqiad, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#10641421 (10BTullis)
[12:50:47] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "Thank you Kevin." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[12:50:56] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[12:51:42] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 464776160 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:51:58] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[12:52:19] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[12:52:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:53:16] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[12:53:40] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[12:54:13] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[12:54:22] <wikibugs>	 (03PS1) 10Ayounsi: Sandbox vlan, allow return http(s) monitoring traffic [homer/public] - 10https://gerrit.wikimedia.org/r/1128401 (https://phabricator.wikimedia.org/T388419)
[12:55:39] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review, George!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[12:56:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1127120 (https://phabricator.wikimedia.org/T387553) (owner: 10Andrea Denisse)
[12:57:24] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[12:57:36] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128378 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[12:57:50] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 (owner: 10Brouberol)
[12:58:29] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (https://phabricator.wikimedia.org/T388472) (owner: 10Pppery)
[12:59:03] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[13:00:03] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "Disable new WebAuthn credentials creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300).
[13:00:05] <jouncebot>	 tgr, MichaelG_WMF, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <wikibugs>	 (03CR) 10Clément Goubert: [C:04-1] "@dzahn@wikimedia.org Can you add an httpbb test for that redirect to `modules/profile/files/httpbb/appserver/test_redirects.yaml` like the" [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn)
[13:00:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza)
[13:00:18] * MichaelG_WMF is here
[13:00:26] <tgr_>	 o/
[13:00:34] <tgr_>	 (just added one more patch)
[13:00:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[13:00:46] <tgr_>	 I guess I should deploy since most patches are mine
[13:00:50] <MichaelG_WMF>	 My change only affects a maintenance script that runs on an hourly timer - nothing to directly test
[13:00:53] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance cumin1002:9100) - https://phabricator.wikimedia.org/T389038#10641477 (10Clement_Goubert) →14Duplicate dup:03T383032
[13:01:01] <MichaelG_WMF>	 @tgr_ Thank you :)
[13:01:15] <anzx>	 o/
[13:02:30] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:02:32] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:03:21] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-04-18 13:03:09. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:03:21] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-04-18 13:03:09. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:04:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza)
[13:04:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172) (owner: 10Anzx)
[13:04:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große)
[13:05:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Disable new WebAuthn credentials creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128403 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza)
[13:06:02] <wikibugs>	 (03Merged) 10jenkins-bot: sqwiktionary: update logo, wordmark, tagline and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172) (owner: 10Anzx)
[13:06:04] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126533 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große)
[13:06:20] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53656 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:06:22] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:06:22] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]]
[13:06:31] <stashbot>	 T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402
[13:06:32] <stashbot>	 T389064: Notify WebAuthn users about SUL3 changes - https://phabricator.wikimedia.org/T389064
[13:06:32] <stashbot>	 T342172: Icons: sqwiktionary logo icon should be localized to language - https://phabricator.wikimedia.org/T342172
[13:06:32] <stashbot>	 T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250
[13:07:21] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs3009 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[13:08:00] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs3009.esams.wmnet with reason: depooled before reimage
[13:08:03] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez)
[13:08:06] <vgutierrez>	 !log depooling lvs3009 before being reimaged - T384477
[13:08:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:11] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[13:10:07] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Enable credentials change special pages on SUL3 shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127965 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza)
[13:10:17] <logmsgbot>	 !log tgr@deploy2002 tgr, migr, anzx: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:10:30] <anzx>	 tgr_: looking
[13:11:03] <MichaelG_WMF>	 all good from my side, nothing to actively test
[13:11:08] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs3009 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128382 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[13:11:31] <anzx>	 tgr_: looks good 
[13:11:43] <logmsgbot>	 !log tgr@deploy2002 tgr, migr, anzx: Continuing with sync
[13:11:44] <wikibugs>	 (03PS2) 10Clément Goubert: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335)
[13:11:55] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Good stuff, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:12:38] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:13:03] <wikibugs>	 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 (10ayounsi) 03NEW p:05Triage→03High
[13:13:45] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:14:13] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.debug for Netbox interface ID 20595
[13:14:25] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.debug (exit_code=99) for Netbox interface ID 20595
[13:14:45] <wikibugs>	 (03PS1) 10Slyngshede: P:firewall remove connection tracking monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694)
[13:17:15] <wikibugs>	 (03CR) 10MSantos: [C:03+1] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128388 (owner: 10Jgiannelos)
[13:17:57] <anzx>	 echo 'https://en.wikipedia.org/static/images/icons/sqwiktionary.svg' | mwscript purgeList.php
[13:17:57] <anzx>	 echo 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-tagline-sq.svg' | mwscript purgeList.php
[13:17:57] <anzx>	 echo 'https://en.wikipedia.org/static/images/project-logos/sqwiktionary.png' | mwscript purgeList.php
[13:17:57] <anzx>	 echo 'https://en.wikipedia.org/static/images/project-logos/sqwiktionary-1.5x.png' | mwscript purgeList.php
[13:17:57] <anzx>	 echo 'https://en.wikipedia.org/static/images/project-logos/sqwiktionary-2x.png' | mwscript purgeList.php
[13:17:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Double conntrack table size on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1128406
[13:18:42] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3009.esams.wmnet with OS bookworm
[13:18:57] <anzx>	 echo 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-tagline-sq.svg' | mwscript purgeList.php
[13:19:12] <tgr_>	 anzx: should I run those? or are you doing it?
[13:19:27] <anzx>	 tgr_: please run those 
[13:19:49] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128403|Revert "Disable new WebAuthn credentials creation" (T378402 T389064)]], [[gerrit:1128032|sqwiktionary: update logo, wordmark, tagline and icon (T342172)]], [[gerrit:1126533|Growth: eswiki+cswiki - enable new way of refreshing LinkRecommendations (T386250)]] (duration: 13m 27s)
[13:19:56] <stashbot>	 T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402
[13:19:56] <stashbot>	 T389064: Notify WebAuthn users about SUL3 changes - https://phabricator.wikimedia.org/T389064
[13:19:56] <stashbot>	 T342172: Icons: sqwiktionary logo icon should be localized to language - https://phabricator.wikimedia.org/T342172
[13:19:57] <stashbot>	 T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250
[13:20:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:20:43] <wikibugs>	 (03Merged) 10jenkins-bot: Enable credentials change special pages on SUL3 shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127965 (https://phabricator.wikimedia.org/T362715) (owner: 10Gergő Tisza)
[13:22:20] <tgr_>	 anzx: done
[13:22:52] <anzx>	 tgr_: thank you for deploying 
[13:23:01] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127965|Enable credentials change special pages on SUL3 shared domain (T362715)]]
[13:23:05] <stashbot>	 T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715
[13:23:17] <wikibugs>	 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10641632 (10ayounsi)
[13:24:37] <Amir1>	 jouncebot: nowandnext
[13:24:37] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1300)
[13:24:37] <jouncebot>	 In 2 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530)
[13:24:55] <Amir1>	 tgr_: hii, let me know when you're done
[13:25:01] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Test HAProxy 3.1 in cp5032 (upload) [puppet] - 10https://gerrit.wikimedia.org/r/1128384 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez)
[13:25:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:25:33] <vgutierrez>	 !log upgrading HAProxy to version 3.1 in cp5032 (upload) - T386796
[13:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:37] <stashbot>	 T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796
[13:25:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:25:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[13:25:57] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[13:26:02] <tgr_>	 Amir1: let me know if it's urgent, I'll be deploying for a while
[13:26:17] <Amir1>	 nah, it's not
[13:27:33] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1127965|Enable credentials change special pages on SUL3 shared domain (T362715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:31:34] <vgutierrez>	 !log uploaded HAProxy 3.1.5 to apt.wm.o (bullseye-wikimedia) component thirdparty/haproxy31 - T386796
[13:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:38] <stashbot>	 T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796
[13:33:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[13:34:26] <wikibugs>	 (03CR) 10Muehlenhoff: "(PCC failure on Puppet 5 is fine, we only use this role on Puppet 7 and max_files (as used by the role) isn't in Puppet 5 yet)" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[13:35:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:38:00] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage
[13:38:49] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm
[13:38:55] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641698 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm
[13:39:44] <godog>	 !log begin moving k8s prometheus instances from prometheus2005 to prometheus2007 - T383232
[13:39:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:48] <stashbot>	 T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232
[13:40:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: move remaining k8s instances to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1126934 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[13:40:38] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[13:41:18] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3009.esams.wmnet with reason: host reimage
[13:44:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10641719 (10ssingh) [Stalled until further discussion]
[13:45:16] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Double conntrack table size on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/1128406 (owner: 10Muehlenhoff)
[13:45:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[13:46:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:46:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti1029 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1128412
[13:47:04] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10641720 (10ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[13:47:05] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641721 (10MatthewVernon) OK, so the reimage isn't working because the SSDs are both RAID-0 arrays rather than JBOD. I'm going to try and un-RAID them, JBOD them, and try anothe...
[13:47:06] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641722 (10Jclark-ctr)  @Marostegui looks like db1257 is not in site.pp is currently failing Reimage
[13:47:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[13:47:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[13:47:38] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641726 (10Jclark-ctr) a:05Papaul→03Marostegui
[13:47:44] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10641727 (10ops-monitoring-bot) Draining ganeti1029.eqiad.wmnet of running VMs
[13:48:44] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127965|Enable credentials change special pages on SUL3 shared domain (T362715)]] (duration: 25m 42s)
[13:48:48] <stashbot>	 T362715: Move credentials change to central login wiki - https://phabricator.wikimedia.org/T362715
[13:49:00] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641741 (10Marostegui) a:05Marostegui→03Papaul >>! In T384979#10641722, @Jclark-ctr wrote: >  @Marostegui looks like db1257 is not in site.pp is currently failing Reimage  It is:...
[13:49:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:49:16] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 (owner: 10Brouberol)
[13:49:19] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[13:49:22] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 (owner: 10Brouberol)
[13:50:16] <wikibugs>	 (03CR) 10Xcollazo: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:50:25] <icinga-wm>	 RECOVERY - Host dbprov2005 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[13:50:39] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:50:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[13:51:07] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641744 (10Jclark-ctr) @Marostegui  sorry i forgot to update my repo before i was checking site.pp locally   its monday
[13:51:12] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641745 (10Jclark-ctr) a:05Papaul→03Jclark-ctr
[13:51:14] <wikibugs>	 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 10database-backups, 06DC-Ops: hw troubleshooting: network link loss of dbprov2005 - https://phabricator.wikimedia.org/T389052#10641747 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated both ends. looks like it was the switch side. port is...
[13:51:42] <wikibugs>	 (03CR) 10Elukey: [C:03+2] kartotherian: use wdqs-internal-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127856 (https://phabricator.wikimedia.org/T388860) (owner: 10DCausse)
[13:51:42] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: introduce a way to display custom messages in the UI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128367 (owner: 10Brouberol)
[13:51:43] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-test-k8s: display a custom message explaining the migration status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128368 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol)
[13:51:45] <wikibugs>	 (03Merged) 10jenkins-bot: airflow-main: add a temporary info message about the ongoing migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128373 (owner: 10Brouberol)
[13:51:47] <wikibugs>	 (03Merged) 10jenkins-bot: Fix some SUL3 shared domain settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[13:52:05] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127648|Fix some SUL3 shared domain settings (T388218)]]
[13:52:08] <stashbot>	 T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218
[13:52:10] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641756 (10Marostegui) >>! In T384979#10641744, @Jclark-ctr wrote: > @Marostegui  sorry i forgot to update my repo before i was checking site.pp locally   its monday >    ☕☕
[13:52:41] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[13:54:57] <wikibugs>	 (03PS10) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:55:01] <wikibugs>	 (03CR) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:55:14] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[13:55:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:56:24] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[13:56:39] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:56:41] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:56:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[13:57:03] <vgutierrez>	 ^^BGP alert triggered by lvs3009 reimage
[13:57:12] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5086/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:57:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[13:57:39] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:57:52] <wikibugs>	 06SRE, 10Observability-Alerting, 06Traffic, 10SRE Observability (FY2024/2025-Q3): Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10641773 (10tappof) 05Open→03Resolved It looks like the patch has fixed the problem.  I'm closing the task....
[13:58:40] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[13:59:43] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[13:59:44] <wikibugs>	 (03PS11) 10Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[14:00:07] <wikibugs>	 06SRE, 10Observability-Alerting, 06Traffic, 10SRE Observability (FY2024/2025-Q3): Icinga check_curl plugin is broken on bullseye and bookworm hosts - https://phabricator.wikimedia.org/T388680#10641781 (10ssingh) Thanks for taking care of it; can confirm resolved!
[14:00:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10641782 (10Ottomata) Hi @BCornwall !  group owner approval for analytics-privatedata-users [[ https://github.com/wikimedi...
[14:01:10] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3009.esams.wmnet with OS bookworm
[14:01:18] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128417
[14:01:19] <wikibugs>	 (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as I9e513f7ba97f281c9e60fc110fb7331c4e47385d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[14:03:55] <wikibugs>	 (03CR) 10Gergő Tisza: "Fails with" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127648 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[14:04:07] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477)
[14:04:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128417 (owner: 10TrainBranchBot)
[14:04:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:04:25] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477)
[14:04:54] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[14:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128417 (owner: 10TrainBranchBot)
[14:05:20] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128417|Revert "Fix some SUL3 shared domain settings"]]
[14:05:56] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Try both SUL2 and SUL3 central domain for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[14:06:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[14:06:50] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Restore lvs3009 BGP priority [puppet] - 10https://gerrit.wikimedia.org/r/1128418 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[14:09:43] <logmsgbot>	 !log tgr@deploy2002 trainbranchbot, tgr: Backport for [[gerrit:1128417|Revert "Fix some SUL3 shared domain settings"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:09:51] <vgutierrez>	 !log repool lvs3009 running liberica - T384477
[14:09:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:55] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[14:09:59] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009.esams.wmnet} and A:liberica (T384477)
[14:10:16] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009.esams.wmnet} and A:liberica (T384477)
[14:10:43] <logmsgbot>	 !log tgr@deploy2002 trainbranchbot, tgr: Continuing with sync
[14:11:21] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[14:12:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:12:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:13:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[14:14:35] <wikibugs>	 (03Merged) 10jenkins-bot: Try both SUL2 and SUL3 central domain for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[14:14:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[14:14:43] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[14:16:37] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "Agree it is worth pursuing this change to confirm or deny whether the Analytics replicas are the culprit of the slowdown discussed in http" [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) (owner: 10Btullis)
[14:16:43] <wikibugs>	 (03CR) 10Btullis: [C:03+2] dumps: Stop using the analytics replicas for misc dumps [puppet] - 10https://gerrit.wikimedia.org/r/1128386 (https://phabricator.wikimedia.org/T386255) (owner: 10Btullis)
[14:16:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: add new prometheus hw to prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128420 (https://phabricator.wikimedia.org/T383232)
[14:17:12] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128417|Revert "Fix some SUL3 shared domain settings"]] (duration: 11m 52s)
[14:17:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:18:16] <wikibugs>	 (03PS1) 10Vgutierrez: site,hiera: Reimage lvs3008 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477)
[14:18:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: add new prometheus hw to prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128420 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:18:21] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1257.eqiad.wmnet with OS bookworm
[14:18:27] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10641868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm executed with errors: - db1257 (**FAIL**)...
[14:18:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:19:28] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[14:19:55] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127952|Try both SUL2 and SUL3 central domain for autologin (T375796)]]
[14:19:59] <stashbot>	 T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796
[14:20:49] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs3008 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[14:22:22] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] site,hiera: Reimage lvs3008 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1128421 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[14:22:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[14:22:42] <vgutierrez>	 !log depooling lvs3008 before being reimaged - T384477
[14:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:46] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[14:23:37] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1127952|Try both SUL2 and SUL3 central domain for autologin (T375796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:23:41] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot account - https://phabricator.wikimedia.org/T388662#10641905 (10ssingh) The user does not seem to be part of https://ldap.toolforge.org/user/barrybrowsertestbot of any sensitive groups. For disabling an account and on checking internally with SRE, t...
[14:24:17] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Upgrade to HAProxy 3.1 on cp5024 (text) [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796)
[14:24:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[14:24:46] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:25:08] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Upgrade to HAProxy 3.1 on cp5024 (text) [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez)
[14:25:29] <icinga-wm>	 PROBLEM - pybal on lvs3008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:25:39] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:25:46] <vgutierrez>	 urgh.. forgot the downtime, sorry about the noise
[14:25:52] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye
[14:25:54] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:25:58] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be207...
[14:26:09] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs3008.esams.wmnet with reason: depooled before reimage
[14:26:21] <wikibugs>	 (03CR) 10Hashar: [C:03+1] Remove unnecessary boolean statement for $wmgIncreaseDefaultVectorFontSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127929 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson)
[14:26:42] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:27:01] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[14:27:13] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10641914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[14:27:17] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez)
[14:27:44] <wikibugs>	 (03CR) 10Hashar: "`$wgVectorZebraDesign` and some other settings were removed recently by Ic4876a91ec1b2cedcf68d4f257e518837e15da89" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson)
[14:27:50] <wikibugs>	 (03CR) 10Hashar: [C:03+1] Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson)
[14:28:32] <wikibugs>	 (03PS4) 10Scott French: deployment_server: Support PHP version selection in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917)
[14:28:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:29:06] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reimage for host lvs3008.esams.wmnet with OS bookworm
[14:29:22] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Upgrade to HAProxy 3.1 on cp5024 (text) [puppet] - 10https://gerrit.wikimedia.org/r/1128428 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez)
[14:29:42] <vgutierrez>	 !log upgrading HAProxy to version 3.1 in cp5024 (text) - T386796
[14:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:45] <stashbot>	 T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796
[14:30:13] <akosiaris>	 !incidents
[14:30:14] <sirenbot>	 5747 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:30:14] <sirenbot>	 5745 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:30:14] <sirenbot>	 5744 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:30:14] <sirenbot>	 5743 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad)
[14:30:21] <hashar>	 jouncebot: nowandnext
[14:30:21] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 59 minute(s)
[14:30:21] <jouncebot>	 In 0 hour(s) and 59 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530)
[14:30:55] <hashar>	 I am going to deploy a bunch of clean up patches for mediawiki-config
[14:31:08] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm
[14:31:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:31:20] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: Support PHP version selection in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1126697 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French)
[14:32:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10641954 (10ssingh)
[14:32:59] <hashar>	 oh
[14:33:16] <wikibugs>	 (03PS1) 10Ayounsi: Add transit/peering in/out port saturation alert [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052)
[14:33:33] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:33:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10641961 (10ssingh) https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Deployment_Groups dictates the user be added to the Gerrit group `wmf-deployment` which I have just done....
[14:33:46] <hashar>	 tgr_: are you still doing the deployment of "Try both SUL2 and SUL3 central domain for autologin"?
[14:33:46] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[14:34:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Neat!" [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[14:35:34] <hashar>	 I guess it needs a bit of testing :)
[14:35:45] <hashar>	 I will do the clean up patches later, they are not urgent
[14:36:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:36:29] <tgr_>	 hashar: just about to roll it back
[14:36:44] <hashar>	 :-\
[14:36:44] <wikibugs>	 (03PS1) 10Elukey: kartotherian: update statsd's config ttl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128430 (https://phabricator.wikimedia.org/T388860)
[14:36:44] <logmsgbot>	 !log tgr@deploy2002 Sync cancelled.
[14:36:51] <hashar>	 poor SUL!
[14:37:12] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128431
[14:37:13] <wikibugs>	 (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as If0f231ae556fdf5e7ea242dc8cc24a8be8c1e343" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1127952 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[14:37:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128431 (owner: 10TrainBranchBot)
[14:38:25] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] deployment: switch deploy servers to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1127074 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:39:25] <wikibugs>	 (03PS1) 10Elukey: kartotherian: simplify the readinessProble's path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432
[14:40:09] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye
[14:40:18] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642002 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be207...
[14:40:36] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye
[14:40:38] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] wmnet: point deploy server at eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127073 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:40:43] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye
[14:41:04] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] kartotherian: update statsd's config ttl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128430 (https://phabricator.wikimedia.org/T388860) (owner: 10Elukey)
[14:41:42] <wikibugs>	 (03CR) 10Elukey: [C:03+2] kartotherian: update statsd's config ttl [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128430 (https://phabricator.wikimedia.org/T388860) (owner: 10Elukey)
[14:41:56] <wikibugs>	 (03CR) 10Ayounsi: "Cathal, let me know what you think of this approach, it's more basic than what you suggested on the task." [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi)
[14:42:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: cleanup instance functionality [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:42:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:39] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: cleanup instance functionality [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232)
[14:42:40] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] mw-(web|api-ext): scale up in anticipation of switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127859 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan)
[14:42:50] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[14:43:46] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:43:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] Basic pattern check on preseed hostname regex [puppet] - 10https://gerrit.wikimedia.org/r/1128372 (owner: 10Ayounsi)
[14:43:53] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[14:44:02] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[14:44:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: cleanup instance functionality [puppet] - 10https://gerrit.wikimedia.org/r/1127029 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:44:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[14:45:18] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128431 (owner: 10TrainBranchBot)
[14:45:26] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[14:45:33] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232)
[14:45:39] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128431|Revert "Try both SUL2 and SUL3 central domain for autologin"]]
[14:45:51] <akosiaris>	 !incidents
[14:45:52] <sirenbot>	 5749 (ACKED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:45:52] <sirenbot>	 5747 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:45:52] <sirenbot>	 5745 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:45:52] <sirenbot>	 5744 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[14:45:52] <sirenbot>	 5743 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad)
[14:47:11] <wikibugs>	 (03PS1) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437
[14:47:16] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642034 (10ssingh) From https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-users,...
[14:47:23] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi)
[14:47:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642035 (10ssingh)
[14:48:00] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage
[14:48:52] <Amir1>	 tgr_: still deploying?
[14:49:32] <tgr_>	 Amir1: this is the last scap
[14:49:51] <Amir1>	 noted
[14:50:59] <logmsgbot>	 !log tgr@deploy2002 trainbranchbot, tgr: Backport for [[gerrit:1128431|Revert "Try both SUL2 and SUL3 central domain for autologin"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:51:24] <logmsgbot>	 !log herron@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[14:51:26] <logmsgbot>	 !log herron@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[14:51:42] <logmsgbot>	 !log herron@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[14:51:46] <logmsgbot>	 !log tgr@deploy2002 trainbranchbot, tgr: Continuing with sync
[14:51:57] <logmsgbot>	 !log herron@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[14:52:09] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3008.esams.wmnet with reason: host reimage
[14:52:32] <wikibugs>	 (03CR) 10Xcollazo: Enable lock transaction management in the hive metastore on hadoop_test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[14:53:04] <logmsgbot>	 !log herron@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'.
[14:53:07] <logmsgbot>	 !log herron@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'.
[14:53:47] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] lists: Offer RSA+ECDSA certificates on lists.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/1127066 (https://phabricator.wikimedia.org/T385067) (owner: 10Vgutierrez)
[14:54:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[14:54:42] <wikibugs>	 (03PS2) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437
[14:54:42] <wikibugs>	 (03PS1) 10Ayounsi: type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438
[14:56:03] <akosiaris>	 moritzm: this ^ is flapping awfully much today, but apparently it's a pattern for quite a few days now. https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=api-gateway&from=now-2d&to=now. I'll open a task to ML, unless you got one already
[14:56:27] <wikibugs>	 (03PS2) 10Ayounsi: type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438
[14:56:27] <wikibugs>	 (03PS3) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437
[14:56:40] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi)
[14:56:43] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi)
[14:56:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi)
[14:58:03] <moritzm>	 akosiaris: I spoke to Ilias earlier, they are working on a fix already, should be ready today or tomorrow
[14:58:14] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128431|Revert "Try both SUL2 and SUL3 central domain for autologin"]] (duration: 12m 35s)
[14:58:18] <tgr_>	 Amir1: done, sorry it took so long
[14:58:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi)
[14:58:43] <tgr_>	 hashar: ^ (but Amir1 asked first :)
[14:58:50] <hashar>	 o/
[14:58:54] <hashar>	 yeah no worries
[14:59:02] <wikibugs>	 (03CR) 10Ayounsi: "forgot something important in the regex. It's now working as expected as you can see in the chained CR." [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi)
[14:59:03] <hashar>	 I will do the clean up patches later this week or next week
[15:00:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi)
[15:00:40] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:02:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] exim: Use RSA+ECDSA certificates for lists [puppet] - 10https://gerrit.wikimedia.org/r/1127933 (https://phabricator.wikimedia.org/T385067) (owner: 10Vgutierrez)
[15:03:18] <Amir1>	 hashar: I'm in a meeting so feel free to go ahead
[15:03:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:04:42] <jinxer-wm>	 FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[15:04:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:04:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:05:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:06:02] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[15:06:40] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:52] <vgutierrez>	 ^^ BGP alert triggered by lvs3008 being reimaged
[15:08:40] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:09:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: cleanup k8s-mlstaging from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1127030 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[15:10:44] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage
[15:11:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job pybal in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:44] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3008.esams.wmnet with OS bookworm
[15:13:57] <akosiaris>	 moritzm: ok, cool, thanks!
[15:14:26] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage
[15:14:30] <wikibugs>	 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642215 (10RobH) Draft of directions:     > Support, >  > We just had an optic fail on one of our router to switch links, and need the switch side optic swapped out with spa...
[15:16:10] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10642230 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Both exim and apache2 have been reconfigured to offer RSA+E...
[15:16:18] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host puppetserver2004.codfw.wmnet with OS bookworm
[15:16:42] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps ratio to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[15:17:23] <wikibugs>	 (03PS1) 10Brouberol: Upgrade airflow-providers-cncf-kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128445 (https://phabricator.wikimedia.org/T388378)
[15:17:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 362355672 and 53 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:17:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[15:18:09] <wikibugs>	 (03Merged) 10jenkins-bot: Bump thumbnail steps ratio to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128377 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[15:18:26] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1128377|Bump thumbnail steps ratio to 15% (T360589)]]
[15:18:30] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[15:18:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 299200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:18:55] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Restore BGP priority for lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477)
[15:19:21] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Restore BGP priority for lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:19:32] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:19:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "This was signed off in the SRE IF meeting. @Alex, you can proceed with deploying" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy)
[15:20:38] <wikibugs>	 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642258 (10RobH) Had the option for 'normal work' which must be planned in work hours and 24 hours in advance (with time zone changes that means if I entered it now, it woul...
[15:21:26] <wikibugs>	 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642262 (10RobH) a:03RobH
[15:21:42] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547
[15:22:20] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Restore BGP priority for lvs3008 [puppet] - 10https://gerrit.wikimedia.org/r/1128446 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:23:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] service: move kartotherian-k8s-ssl fully on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey)
[15:23:06] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1128377|Bump thumbnail steps ratio to 15% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:24:13] <wikibugs>	 (03PS1) 10Vgutierrez: cumin: Update (liberica|lvs)-esams aliases [puppet] - 10https://gerrit.wikimedia.org/r/1128448 (https://phabricator.wikimedia.org/T384477)
[15:25:27] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[15:25:31] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] service: set kartotherian and kartotherian-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128344 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey)
[15:25:54] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Clean-up lvs::balancer keys for non-core DCs [puppet] - 10https://gerrit.wikimedia.org/r/1128449 (https://phabricator.wikimedia.org/T384477)
[15:26:13] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Yes, please coordinate with us as you mention in the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/1128345 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey)
[15:27:00] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] service, conftool-data: final removal for unused Kartotherian configs [puppet] - 10https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey)
[15:27:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Review Broadcom's storcli binary - https://phabricator.wikimedia.org/T388628#10642295 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff
[15:27:51] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] cumin: Update (liberica|lvs)-esams aliases [puppet] - 10https://gerrit.wikimedia.org/r/1128448 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:28:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] cumin: Update (liberica|lvs)-esams aliases [puppet] - 10https://gerrit.wikimedia.org/r/1128448 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:28:50] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson)
[15:30:05] <jouncebot>	 jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530)
[15:30:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[15:31:01] <vgutierrez>	 !log repool lvs3008 running liberica - T384477
[15:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:04] <stashbot>	 T384477: Replace pybal with liberica on the PoPs - https://phabricator.wikimedia.org/T384477
[15:31:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3008.esams.wmnet} and A:liberica (T384477)
[15:31:28] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3008.esams.wmnet} and A:liberica (T384477)
[15:31:47] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128377|Bump thumbnail steps ratio to 15% (T360589)]] (duration: 13m 20s)
[15:31:50] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[15:32:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10642338 (10phaultfinder)
[15:32:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[15:32:50] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] service: move kartotherian-k8s-ssl fully on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1128343 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey)
[15:33:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 14.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:33:30] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: disable 'accelerator' cadvisor metric [puppet] - 10https://gerrit.wikimedia.org/r/1128319 (https://phabricator.wikimedia.org/T388632) (owner: 10Filippo Giunchedi)
[15:34:39] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2075.codfw.wmnet with OS bullseye
[15:34:46] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye completed: - ms-be2075 (**PASS**...
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:15] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] type Install_server::Preseed_host::Name fix regex [puppet] - 10https://gerrit.wikimedia.org/r/1128438 (owner: 10Ayounsi)
[15:37:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:29] <wikibugs>	 (03Abandoned) 10Ayounsi: Testing if breaking change is caught by CI [puppet] - 10https://gerrit.wikimedia.org/r/1128437 (owner: 10Ayounsi)
[15:37:51] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[15:38:05] <jnuche>	 jouncebot: refresh
[15:38:06] <jouncebot>	 I refreshed my knowledge about deployments.
[15:38:43] <wikibugs>	 (03PS1) 10Vgutierrez: hieradata: Use codfw etcd cluster in liberica@(ulsfo|eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477)
[15:38:48] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642387 (10BCornwall) Ugh, sorry about that. But we do still need @ATsay-WMF to approve, right?
[15:39:00] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:39:12] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T385814#10642389 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:39:38] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:41:09] <wikibugs>	 (03CR) 10Elukey: "@cwhite@wikimedia.org Hi! I didn't notice the change in the kartotherian's chart until I deployed a new change today. This is the error th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite)
[15:41:10] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10642395 (10ssingh) >>! In T388693#10642387, @BCornwall wrote: > Ugh, sorry about that. But we do still need @ATsay-WMF to...
[15:43:12] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hieradata: Use codfw etcd cluster in liberica@(ulsfo|eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:43:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:44:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2089
[15:44:07] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] k8s::client: Allow for install of all kubectl versions [puppet] - 10https://gerrit.wikimedia.org/r/1128369 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm)
[15:44:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2089
[15:44:28] <wikibugs>	 (03PS7) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745)
[15:44:37] <wikibugs>	 (03CR) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking)
[15:44:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:44:45] <sbassett>	 jouncebot: now
[15:44:45] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530)
[15:44:47] <wikibugs>	 (03CR) 10Bking: "Done" [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking)
[15:45:22] <sbassett>	 jan_drewniak: would there be any conflict with any portal deployments happening right now if I wanted to try to get a security patch out?
[15:45:57] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1128449/5088/" [puppet] - 10https://gerrit.wikimedia.org/r/1128449 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:46:14] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Clean-up lvs::balancer keys for non-core DCs [puppet] - 10https://gerrit.wikimedia.org/r/1128449 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:46:15] <wikibugs>	 (03PS3) 10Filippo Giunchedi: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert)
[15:46:25] <Dreamy_Jazz>	 jouncebot: nowandnext
[15:46:25] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1530)
[15:46:25] <jouncebot>	 In 1 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700)
[15:46:25] <jouncebot>	 In 1 hour(s) and 13 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700)
[15:46:38] <wikibugs>	 (03PS4) 10Filippo Giunchedi: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert)
[15:46:43] <Dreamy_Jazz>	 One portals update is done, I'd like to deploy
[15:46:50] <wikibugs>	 (03PS2) 10Scott French: Disable cookie-based enrollment in 8.1 (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845)
[15:47:08] <jan_drewniak>	 Dreamy_Jazz: skipping the portals deploy this week, feel free to deploy.
[15:47:11] <wikibugs>	 (03PS5) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497)
[15:47:12] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10642410 (10MatthewVernon) Finally got the reimage to work; I'll leave this host overnight, and then check the kernel log tomorrow.
[15:47:26] <Dreamy_Jazz>	 Thanks. sbassett: Do you want to deploy the security patch first?
[15:47:46] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking)
[15:47:54] <sbassett>	 Dreamy_jazz: sure, that would be great.  Should be quick?  One mw core file affected for .20.
[15:47:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[15:48:03] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hieradata: Use codfw etcd cluster in liberica@(ulsfo|eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1128452 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez)
[15:48:08] <wikibugs>	 (03CR) 10Cwhite: "Updated 30d->720h per findings from elukey: Iba0264c01df67083bf7d29bf6fe632811a56e0ef" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite)
[15:48:59] <wikibugs>	 (03Merged) 10jenkins-bot: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking)
[15:49:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the patch! I've tweaked the topic naming slightly to use k8s-mw prefix." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert)
[15:49:16] <Dreamy_Jazz>	 Sure. Let me know when you are done.
[15:50:38] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp30[66-72,74-80].esams.wmnet} and A:cp for 9.2.9-1wm1
[15:51:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[15:56:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Upgrade airflow-providers-cncf-kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128445 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[15:57:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Disable cookie-based enrollment in 8.1 (cleanup) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[15:57:32] <wikibugs>	 (03PS1) 10JMeybohm: Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1128458
[15:57:37] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "Discussed out of band: The topics will get autocreated, logstash fetches from k8s-*, and the topics don't need any special config." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert)
[15:58:12] <wikibugs>	 (03PS23) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945)
[15:58:59] <wikibugs>	 (03PS4) 10Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[15:59:32] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s: extend nrpe_check_disk_options to allow containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128461 (https://phabricator.wikimedia.org/T387854)
[15:59:34] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::staging::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128462 (https://phabricator.wikimedia.org/T387854)
[15:59:35] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854)
[16:00:03] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Upgrade airflow-providers-cncf-kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128445 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol)
[16:00:11] <wikibugs>	 (03CR) 10Dzahn: "thanks! Done. added the test. I don't have to manually add the VirtualServer, right?" [puppet] - 10https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn)
[16:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[16:02:25] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Revert "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1128458 (owner: 10JMeybohm)
[16:02:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs[5004-5006].eqsin.wmnet,lvs[4008-4009].ulsfo.wmnet} and A:liberica
[16:03:09] <Dreamy_Jazz>	 sbassett: Any update on deploying the security patch?
[16:03:51] <sbassett>	 Dreamy_jazz: Almost done
[16:03:56] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5089/co" [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[16:03:59] <Dreamy_Jazz>	 Thanks.
[16:04:11] <Dreamy_Jazz>	 Forgot that the security deploy doesn't log when it starts.
[16:04:44] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:05:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:06:25] <sbassett>	 Dreamy_Jazz: 48% k8s restarted…
[16:07:11] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs[5004-5006].eqsin.wmnet,lvs[4008-4009].ulsfo.wmnet} and A:liberica
[16:07:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:09:46] <sbassett>	 !log Deployed security patch for T387478
[16:10:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[16:12:17] <sbassett>	 Dreamy_Jazz: all yours
[16:12:23] <Dreamy_Jazz>	 Thanks!
[16:13:33] <wikibugs>	 (03PS2) 10Dreamy Jazz: Unset the old 'checkuser-temporary-account-viewer' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205)
[16:13:44] <wikibugs>	 (03PS1) 10Elukey: admin_ng: set request and limits the same for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128465 (https://phabricator.wikimedia.org/T386926)
[16:13:48] <XioNoX>	 !log bounce asw1-b12 et-0/0/48 - T389071
[16:13:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205) (owner: Dreamy Jazz)
[16:14:23] <wikibugs>	 (PS24) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945)
[16:15:00] <wikibugs>	 (Merged) jenkins-bot: Unset the old 'checkuser-temporary-account-viewer' group [mediawiki-config] - https://gerrit.wikimedia.org/r/1128376 (https://phabricator.wikimedia.org/T387205) (owner: Dreamy Jazz)
[16:15:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2089']
[16:15:20] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]]
[16:15:33] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be2089']
[16:15:54] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[16:16:24] <wikibugs>	 (PS6) Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497)
[16:16:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye
[16:16:37] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10642546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw...
[16:17:01] <wikibugs>	 (PS1) Brouberol: Fix typo in image tag [deployment-charts] - https://gerrit.wikimedia.org/r/1128466 (https://phabricator.wikimedia.org/T388378)
[16:17:51] <wikibugs>	 (PS7) Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497)
[16:19:31] <wikibugs>	 (CR) Brouberol: [C:+2] Fix typo in image tag [deployment-charts] - https://gerrit.wikimedia.org/r/1128466 (https://phabricator.wikimedia.org/T388378) (owner: Brouberol)
[16:20:01] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:20:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[16:20:16] <wikibugs>	 (PS1) JMeybohm: profile::kubernetes::client: install kubectl 1.31 [puppet] - https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388)
[16:20:35] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[16:20:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[16:21:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[16:22:21] <wikibugs>	 (CR) CI reject: [V:-1] profile::kubernetes::client: install kubectl 1.31 [puppet] - https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[16:23:38] <wikibugs>	 (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5090/co" [puppet] - https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[16:23:53] <wikibugs>	 (Abandoned) Jforrester: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1090435 (https://phabricator.wikimedia.org/T375662) (owner: TrainBranchBot)
[16:23:57] <wikibugs>	 (Abandoned) Jforrester: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1090425 (https://phabricator.wikimedia.org/T375662) (owner: TrainBranchBot)
[16:24:01] <wikibugs>	 (Abandoned) Jforrester: Branch commit for wmf/1.44.0-wmf.3 [core] (wmf/1.44.0-wmf.3) - https://gerrit.wikimedia.org/r/1089924 (https://phabricator.wikimedia.org/T375662) (owner: TrainBranchBot)
[16:26:08] <wikibugs>	 ops-drmrs, Infrastructure-Foundations, netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642593 (ayounsi) remote hands replaced the optic, but the issue persists.  Looking closer at it it converts the 40G port into 4x10G lanes. This might be because lane 1 is...
[16:27:01] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] (duration: 11m 41s)
[16:28:07] <wikibugs>	 (PS2) JMeybohm: profile::kubernetes::client: install kubectl 1.31 [puppet] - https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388)
[16:28:12] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[16:29:11] <wikibugs>	 (PS8) Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497)
[16:36:34] <wikibugs>	 (PS1) Vgutierrez: Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796)
[16:36:34] <wikibugs>	 (PS1) Vgutierrez: Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796)
[16:36:51] <wikibugs>	 (CR) Ssingh: [C:+1] Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:36:54] <wikibugs>	 (CR) Ssingh: [C:+1] Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:36:59] <wikibugs>	 (CR) CI reject: [V:-1] Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:37:08] <wikibugs>	 (CR) CI reject: [V:-1] Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:37:37] <wikibugs>	 (PS2) Vgutierrez: Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796)
[16:37:52] <wikibugs>	 (PS2) Vgutierrez: Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796)
[16:38:10] <wikibugs>	 ops-drmrs, Infrastructure-Foundations, netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10642675 (RobH) IRC update: We asked them to swap both optic and fiber patch to reduce complexity in troubleshooting.    > Support, >  > Background: For some reason this li...
[16:38:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:38:41] <wikibugs>	 (CR) Ssingh: Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:38:46] <wikibugs>	 (CR) Ssingh: Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:39:08] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert "hiera: Upgrade to HAProxy 3.1 on cp5024 (text)" [puppet] - https://gerrit.wikimedia.org/r/1128469 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:39:38] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:39:48] <vgutierrez>	 !log downgrading HAProxy to version 2.8 in cp5024 (text) - T386796
[16:40:23] <wikibugs>	 (CR) Stoyofuku-wmf: [C:+1] "The list is getting small!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: Jdlrobson)
[16:40:52] <vgutierrez>	 logmsgbot is toasted?
[16:40:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:41:24] <Lucas_WMDE>	 vgutierrez: do you mean stashbot?
[16:41:30] <vgutierrez>	 Lucas_WMDE: sorry, yes
[16:41:34] * Lucas_WMDE looks
[16:41:59] <wikibugs>	 (PS1) Cwhite: statsd_exporter: bugfix set ttl to associated variable [puppet] - https://gerrit.wikimedia.org/r/1128471 (https://phabricator.wikimedia.org/T359497)
[16:42:03] <Lucas_WMDE>	 oh god, quit half an hour ago ._.
[16:44:11] <wikibugs>	 (PS1) Elukey: installserver: set puppetserver2004 for UEFI [puppet] - https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274)
[16:44:38] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:44:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2327
[16:45:00] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2327
[16:45:01] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T389096 (Olivierpeyronnet) NEW Closing this task as invalid due to missing information.
[16:45:01] <Lucas_WMDE>	 vgutierrez: try again
[16:45:08] * Lucas_WMDE scrolls up to see who else needs to relog stuff
[16:45:12] <vgutierrez>	 !log downgrading HAProxy to version 2.8 in cp5024 (text) - T386796
[16:45:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:16] <stashbot>	 T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796
[16:45:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2327
[16:45:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2327
[16:45:22] <vgutierrez>	 Lucas_WMDE: thx <3
[16:45:42] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert "hiera: Test HAProxy 3.1 in cp5032 (upload)" [puppet] - https://gerrit.wikimedia.org/r/1128470 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[16:46:13] <vgutierrez>	 !log downgrading HAProxy to version 2.8 in cp5032 (upload) - T386796
[16:46:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[16:46:44] <akosiaris>	 !incidents
[16:46:44] <sirenbot>	 5750 (UNACKED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[16:46:44] <Lucas_WMDE>	 XioNoX, jhathaway, Dreamy_Jazz, brouberol, moritzm:  FYI, y’all might want to re-log some messages of the past ca. 35 minutes, stashbot quit IRC and we didn’t notice for a bit :(
[16:46:44] <sirenbot>	 5749 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[16:46:45] <sirenbot>	 5747 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[16:46:45] <sirenbot>	 5745 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[16:46:45] <sirenbot>	 5744 (RESOLVED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[16:46:45] <sirenbot>	 5743 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl1002:6443 probes/custom eqiad)
[16:46:53] <akosiaris>	 !ack 5750
[16:46:54] <sirenbot>	 5750 (ACKED)  GatewayBackendErrorsHigh sre (lw_inference_reference_need_cluster api-gateway eqiad)
[16:47:02] <akosiaris>	 I'll silence this for a week or so
[16:47:08] <brett>	 Thank you!
[16:47:18] <Dreamy_Jazz>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]]
[16:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:22] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[16:47:23] <Dreamy_Jazz>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:26] <Dreamy_Jazz>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[16:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:30] <Dreamy_Jazz>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128376|Unset the old 'checkuser-temporary-account-viewer' group (T387205)]] (duration: 11m 41s)
[16:47:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:58] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099 (Olivierpeyronnet) NEW
[16:48:52] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099#10642769 (Olivierpeyronnet) I am a master's student working on my thesis about the postal and telegraph network in Bosnia-Herzegovina before World War I. My research requires mapping...
[16:49:05] <wikibugs>	 SRE, DBA, vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10642771 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[16:50:44] <wikibugs>	 SRE, DBA, vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10642786 (jcrespo) For the help part (not the approval), feel free to ping me, not the highest expert, but I can help with the commands if needed.
[16:56:04] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T389096#10642818 (Aklapper) →Duplicate dup:T389099
[16:56:06] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099#10642820 (Aklapper)
[16:57:11] <wikibugs>	 SRE, Maps, Traffic: Allow Wikimedia Maps usage on my research about Bosnia - https://phabricator.wikimedia.org/T389099#10642826 (Aklapper) Open→Declined Hi @Olivierpeyronnet, maps.wikimedia.org tiles may only be used by Wikimedia wikis, and sites hosted by Wikimedia Affiliates. We are not...
[16:58:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:00:05] <jouncebot>	 swfrench and cwhite: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700).
[17:00:05] <jouncebot>	 ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T1700).
[17:01:01] <swfrench-wmf>	 o/
[17:01:15] <akosiaris>	 !log silence GatewayBackendErrorsHigh lw_inference_reference_need_cluster in eqiad for 1 week
[17:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:49] <swfrench-wmf>	 c.white and I will get started on this in a couple of minutes
[17:02:27] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:03:16] <wikibugs>	 (CR) Scott French: [C:+1] move statsd config to statsd-global, bump statsd chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: Cwhite)
[17:04:39] <wikibugs>	 (PS2) Esanders: VE: Disable upcoming mobile insert menu everywhere except test2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591)
[17:04:49] <wikibugs>	 (CR) Esanders: "Oops, forgot to stage." [mediawiki-config] - https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) (owner: Esanders)
[17:05:49] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[17:07:41] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[17:10:15] <wikibugs>	 (CR) Clément Goubert: mediawiki: add rewrite for rt.wikimedia.org to wikitech page (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[17:14:40] <wikibugs>	 (PS3) Jforrester: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019)
[17:14:40] <wikibugs>	 (CR) Jforrester: search-redirect: Handle $_GET potential vulnerability scanning (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: Jforrester)
[17:15:43] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@4eb42a4]: search: drop export_queries_to_relforge
[17:16:12] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@4eb42a4]: search: drop export_queries_to_relforge (duration: 00m 29s)
[17:17:55] <swfrench-wmf>	 cwhite: ready to go? :)
[17:17:59] <wikibugs>	 (CR) Cwhite: [C:+2] move statsd config to statsd-global, bump statsd chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: Cwhite)
[17:18:10] <cwhite>	 🚀
[17:19:47] <wikibugs>	 (PS5) Dzahn: mediawiki: add rewrite for rt.wikimedia.org to wikitech page [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777)
[17:19:54] <wikibugs>	 (CR) Clément Goubert: [C:+1] admin_ng: set request and limits the same for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1128465 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[17:20:09] <wikibugs>	 (CR) Dzahn: "oh yea, I think you are right, this should be a funnel. we want to redirect all URLs to the same target. amended." [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[17:20:19] <wikibugs>	 (Merged) jenkins-bot: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: Cwhite)
[17:24:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10643081 (phaultfinder)
[17:24:37] <wikibugs>	 (CR) DLynch: [C:+1] VE: Disable upcoming mobile insert menu everywhere except test2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388591) (owner: Esanders)
[17:25:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:27:15] <swfrench-wmf>	 cwhite: mediawiki statsd exporter diffs look like what we expect - just the addition of the ttl (720h) and the chart version bump
[17:27:30] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10643107 (ATsay-WMF) Approved
[17:27:43] <cwhite>	 \o/
[17:27:45] <swfrench-wmf>	 I'll start with mw-debug to confirm it updates, and then update the exporters in the other namespaces
[17:28:04] <wikibugs>	 (PS12) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[17:28:11] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10643112 (ssingh)
[17:28:23] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:28:29] <wikibugs>	 (PS13) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[17:28:41] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:28:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:29:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002"
[17:29:15] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:29:49] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002"
[17:29:50] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:30:31] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5092/co" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[17:32:30] <swfrench-wmf>	 cwhite: mw-debug updates succeeded, and I just curl'd /metrics on one of the exporter pods to confirm that I see mw metrics
[17:34:12] <swfrench-wmf>	 cwhite: anything you want to spot check before I move ahead with the other deployments?
[17:34:44] <wikibugs>	 (PS14) Brouberol: Enable lock transaction management in the hive metastore on hadoop_test [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[17:35:05] <cwhite>	 swfrench-wmf: Good to proceed :)
[17:35:43] <swfrench-wmf>	 off we go, then
[17:35:47] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:36:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:36:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:36:25] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:36:30] <wikibugs>	 (PS2) Jdlrobson: Enable Donation banner on Catalan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768)
[17:36:31] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:36:34] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:36:41] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[17:36:52] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: Btullis)
[17:36:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[17:37:03] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[17:37:10] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[17:37:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[17:37:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[17:37:46] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:38:00] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:38:06] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[17:38:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[17:39:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:40:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:40:51] <swfrench-wmf>	 no issues in eqiad - moving on to codfw
[17:41:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:41:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:41:36] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:41:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:41:55] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp30[66-72,74-80].esams.wmnet} and A:cp for 9.2.9-1wm1
[17:41:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:41:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:42:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:42:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:42:22] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[17:42:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[17:42:35] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[17:42:47] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[17:43:02] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:43:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:43:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[17:43:23] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[17:43:58] <swfrench-wmf>	 !log applied https://gerrit.wikimedia.org/r/1117638 to mediawiki statsd exporters
[17:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002"
[17:44:05] <swfrench-wmf>	 cwhite: all done :)
[17:44:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns entry for msw2-codfw - pt1979@cumin2002"
[17:44:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:44:35] <cwhite>	 nice!  looks good from my end!
[17:45:18] <swfrench-wmf>	 awesome, thanks
[17:45:24] <swfrench-wmf>	 alright, on to the next thing
[17:45:34] <cwhite>	 thank you!
[17:45:47] <swfrench-wmf>	 no problem at all
[17:46:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[17:47:44] <wikibugs>	 (Merged) jenkins-bot: Disable cookie-based enrollment in 8.1 (cleanup) [mediawiki-config] - https://gerrit.wikimedia.org/r/1128451 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[17:48:02] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1128451|Disable cookie-based enrollment in 8.1 (cleanup) (T383845)]]
[17:48:06] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[17:49:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:49:19] <wikibugs>	 (PS1) Ladsgroup: changeprop-jobqueue: Bump categorymembership job concurrancy [deployment-charts] - https://gerrit.wikimedia.org/r/1128482
[17:51:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye
[17:51:40] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10643256 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw...
[17:52:01] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1128451|Disable cookie-based enrollment in 8.1 (cleanup) (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:54:13] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[17:57:50] <wikibugs>	 (03PS2) 10Ssingh: Add dsantamaria to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1127996 (https://phabricator.wikimedia.org/T388693) (owner: 10BCornwall)
[17:57:51] <dancy>	 swfrench: I have the new scap release ready with your fix and spiderpig stuff.  Lemme know when it's safe to deploy it.
[17:58:56] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "It should be all good. Sorry it took many attempts, I'm not at my best :D" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[18:00:07] <wikibugs>	 (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128484
[18:00:38] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128451|Disable cookie-based enrollment in 8.1 (cleanup) (T383845)]] (duration: 12m 35s)
[18:00:41] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:00:55] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10643294 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2089.codfw.wmn...
[18:01:15] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Add dsantamaria to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1127996 (https://phabricator.wikimedia.org/T388693) (owner: 10BCornwall)
[18:03:00] <swfrench-wmf>	 alright, I am done with the UTC-late infra window :)
[18:04:10] <Amir1>	 dancy: please let me know when you're done, I have a tiny and quick portals update deploy
[18:04:37] <dancy>	 Amir1: Please go first
[18:04:53] <swfrench-wmf>	 dancy: apologies, missed your message - all yours!
[18:05:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users group for DSantamaria - https://phabricator.wikimedia.org/T388693#10643304 (10ssingh) 05In progress→03Resolved @DSantamaria: This has been merged, please try accessing Superset ~30...
[18:05:07] <Amir1>	 thanks!
[18:05:11] <wikibugs>	 (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[18:05:26] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128484 (owner: 10Ladsgroup)
[18:06:25] <wikibugs>	 (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128484 (owner: 10Ladsgroup)
[18:14:13] <wikibugs>	 (03CR) 10Xcollazo: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1125486 (https://phabricator.wikimedia.org/T386854) (owner: 10Btullis)
[18:15:23] <wikibugs>	 (03PS1) 10Sohom Datta: Lua: Prevent PHP errors in production from displayNumber lookup [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924)
[18:16:38] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) (owner: 10Sohom Datta)
[18:20:01] <logmsgbot>	 !log ladsgroup@deploy2002 Synchronized portals/wikipedia.org/assets: wikimedia.org updates (T373204) (duration: 12m 38s)
[18:20:05] <stashbot>	 T373204: Wikimedia.org page redesign - https://phabricator.wikimedia.org/T373204
[18:22:41] <logmsgbot>	 !log ladsgroup@deploy2002 Synchronized portals: wikimedia.org updates (T373204) (duration: 02m 38s)
[18:23:20] <wikibugs>	 (03PS1) 10Máté Szabó: GlobalContributions: Use unique CentralAuth tokens per request [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717)
[18:23:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) (owner: 10Máté Szabó)
[18:25:29] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] changeprop-jobqueue: Bump categorymembership job concurrancy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128482 (owner: 10Ladsgroup)
[18:29:35] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] installserver: set puppetserver2004 for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1128473 (https://phabricator.wikimedia.org/T381274) (owner: 10Elukey)
[18:39:58] <wikibugs>	 (03PS1) 10Gergő Tisza: Re-apply "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218)
[18:40:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[18:44:54] <wikibugs>	 (03PS1) 10Scott French: hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845)
[18:44:55] <wikibugs>	 (03PS1) 10Scott French: mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845)
[18:45:07] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115985 (owner: 10Ncmonitor)
[18:45:18] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.141.1" for 204 host(s)
[18:46:56] <wikibugs>	 (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1092931 (owner: 10Ncmonitor)
[18:50:09] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.141.1" for 1 host(s)
[18:50:12] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10643476 (10Jhancock.wm) almost done with this, just fighting with the puppet server
[18:51:03] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.141.1" completed for 1 hosts
[18:53:07] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:eqsin and A:cp for 9.2.9-1wm1
[18:53:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[18:54:13] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[19:00:42] <wikibugs>	 (03PS1) 10Gergő Tisza: Do not trigger edge login on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128501
[19:01:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128501 (owner: 10Gergő Tisza)
[19:01:44] <wikibugs>	 (03CR) 10Ladsgroup: "(We are not deploying this right now since this is before dc switchover and there is a db maint freeze, while this is not having any impac" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125556 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup)
[19:03:34] <tgr_>	 Reedy: sbassett: are you using the security window today? I have a bunch of SUL3 related fixes that won't fit into the normal backport window. If you are not using the security window, I'd like to steal it.
[19:04:06] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T385896, xfer categories jnl) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards
[19:04:10] <stashbot>	 T385896: Deploy wdqs-categories on wdqs-main/wdqs-internal-main hosts - https://phabricator.wikimedia.org/T385896
[19:04:42] <jinxer-wm>	 FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[19:04:56] <ryankemper>	 this data transfer will depool one host in wdqs-main and one in wdqs-internal-main. this shouldn't cause any alerts, but i'll be watching
[19:07:02] <wikibugs>	 (03PS1) 10Gergő Tisza: Re-apply "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796)
[19:07:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[19:08:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127954 (https://phabricator.wikimedia.org/T384153) (owner: 10Gergő Tisza)
[19:08:48] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T385896, xfer categories jnl) xfer categories from wdqs1021.eqiad.wmnet -> wdqs1025.eqiad.wmnet, repooling both afterwards
[19:09:29] <wikibugs>	 (03CR) 10Tacsipacsi: search-redirect: Handle $_GET potential vulnerability scanning (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester)
[19:12:44] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128505 (https://phabricator.wikimedia.org/T380572)
[19:15:19] <icinga-wm>	 PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 105108 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops
[19:15:24] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128505 (https://phabricator.wikimedia.org/T380572) (owner: 10Ebernhardson)
[19:16:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] hieradata: migrate mw-wikifunctions to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128440 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[19:16:55] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-wikifunctions: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128439 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[19:17:02] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128505 (https://phabricator.wikimedia.org/T380572) (owner: 10Ebernhardson)
[19:20:01] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:20:12] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:20:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10643585 (10phaultfinder)
[19:21:26] <wikibugs>	 (03CR) 10Federico Ceratto: "In order to progress the CR do you want me to rollback the last commits and move them to a different CR?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[19:27:05] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:27:13] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:28:01] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:28:09] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:32:40] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[19:32:48] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:45:21] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4051.ulsfo.wmnet
[19:45:28] <wikibugs>	 (03PS1) 10Gergő Tisza: Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218)
[19:45:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[19:48:03] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] Bump changelog version for sudachi analyzer [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1126663 (https://phabricator.wikimedia.org/T386868) (owner: 10Ryan Kemper)
[19:48:06] <wikibugs>	 (03PS2) 10Gergő Tisza: Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218)
[19:50:50] <wikibugs>	 (03PS1) 10BCornwall: upgrade cp4051 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737)
[19:52:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] upgrade cp4051 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:52:55] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:53:31] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp4051 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128519 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[19:54:41] <logmsgbot>	 !log amastilovic@deploy2002 Started deploy [airflow-dags/analytics@f0d67b6]: Keeping up with the Kubernetes migration
[19:55:14] <logmsgbot>	 !log amastilovic@deploy2002 Finished deploy [airflow-dags/analytics@f0d67b6]: Keeping up with the Kubernetes migration (duration: 00m 46s)
[19:57:31] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp4051 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:58:20] <wikibugs>	 (03CR) 10DCausse: [C:03+1] wdqs categories: switch to internal-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124535 (https://phabricator.wikimedia.org/T375520) (owner: 10Ryan Kemper)
[19:58:31] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp4051 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T2000).
[20:00:05] <jouncebot>	 Jdlrobson, bvibber, Sohom_Datta, mszabo, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <tgr_>	 o/
[20:00:36] <mszabo>	 moin
[20:00:38] <tgr_>	 I have a lot of patches, I'll deploy them in the next window
[20:00:42] <Sohom_Datta>	 o/
[20:00:48] <bvibber>	 o/
[20:01:06] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[20:01:08] <Jdlrobson>	 o/
[20:02:42] <tgr_>	 also, I can deploy
[20:03:05] <mszabo>	 thank you :)
[20:03:09] <tgr_>	 Jdlrobson: bvibber: can the three config patches go together?
[20:03:27] <bvibber>	 mine should be able to play well with others 
[20:03:57] <Jdlrobson>	 tgr_: yes
[20:04:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: 10Jdlrobson)
[20:04:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson)
[20:04:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber)
[20:04:36] <wikibugs>	 (03PS1) 10Scott French: php8.1: fix comment typo in Dockerfile.template [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006)
[20:05:21] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector 2022 on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127677 (https://phabricator.wikimedia.org/T387154) (owner: 10Jdlrobson)
[20:05:24] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Donation banner on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127155 (https://phabricator.wikimedia.org/T387768) (owner: 10Jdlrobson)
[20:06:09] <wikibugs>	 (03Merged) 10jenkins-bot: Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127976 (https://phabricator.wikimedia.org/T385917) (owner: 10Bvibber)
[20:06:24] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4051.ulsfo.wmnet
[20:06:29] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1127677|Enable Vector 2022 on Wikidata (T387154)]], [[gerrit:1127155|Enable Donation banner on Catalan Wikipedia (T387768)]], [[gerrit:1127976|Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf (T385917)]]
[20:06:35] <stashbot>	 T387154: Enable Vector 2022 in Wikidata.org by default - https://phabricator.wikimedia.org/T387154
[20:06:35] <stashbot>	 T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768
[20:06:36] <stashbot>	 T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917
[20:10:29] <wikibugs>	 (03CR) 10Scott French: "Ah, that's good to know! I think I'd slightly prefer to remove it, if only to avoid future confusion as to where to source the package fro" [puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[20:10:40] <logmsgbot>	 !log tgr@deploy2002 bvibber, jdlrobson, tgr: Backport for [[gerrit:1127677|Enable Vector 2022 on Wikidata (T387154)]], [[gerrit:1127155|Enable Donation banner on Catalan Wikipedia (T387768)]], [[gerrit:1127976|Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf (T385917)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:10:41] <wikibugs>	 (03CR) 10Scott French: [C:03+2] aptrepo: remove component/pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/1125539 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[20:11:36] <bvibber>	 mine confirmed good
[20:12:36] <Jdlrobson>	 Wikidata.org: good to go
[20:12:41] <Jdlrobson>	 Just checking ca.wikipedia.org now
[20:12:57] <Jdlrobson>	 Also good to go
[20:13:04] <Jdlrobson>	 tgr thanks!
[20:13:17] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "No functional change in built images." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[20:13:33] <logmsgbot>	 !log tgr@deploy2002 bvibber, jdlrobson, tgr: Continuing with sync
[20:14:00] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Lua: Prevent PHP errors in production from displayNumber lookup [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) (owner: 10Sohom Datta)
[20:14:31] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "Self-merging, as this does not in any way affect built images." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[20:14:46] <wikibugs>	 (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: fix comment typo in Dockerfile.template [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1128522 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[20:15:17] <wikibugs>	 (03Merged) 10jenkins-bot: Lua: Prevent PHP errors in production from displayNumber lookup [extensions/ProofreadPage] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128489 (https://phabricator.wikimedia.org/T383924) (owner: 10Sohom Datta)
[20:16:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10643831 (10Papaul)
[20:19:57] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1127677|Enable Vector 2022 on Wikidata (T387154)]], [[gerrit:1127155|Enable Donation banner on Catalan Wikipedia (T387768)]], [[gerrit:1127976|Re-enable wgTrackGlobalJsonLinksNamespaces for JsonConf (T385917)]] (duration: 13m 28s)
[20:20:03] <stashbot>	 T387154: Enable Vector 2022 in Wikidata.org by default - https://phabricator.wikimedia.org/T387154
[20:20:03] <stashbot>	 T387768: Fix and QA donate link instrumentation - https://phabricator.wikimedia.org/T387768
[20:20:04] <stashbot>	 T385917: Deploy patch-gjlw_namespace_text.sql on x1.commonswiki for JsonConfig - https://phabricator.wikimedia.org/T385917
[20:20:15] <bvibber>	 <3
[20:20:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:21:52] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128489|Lua: Prevent PHP errors in production from displayNumber lookup (T383924)]]
[20:21:56] <stashbot>	 T383924: ProofreadPage\Pagination\PageNotInPaginationException: $page does not belong to the pagination - https://phabricator.wikimedia.org/T383924
[20:25:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:26:32] <logmsgbot>	 !log tgr@deploy2002 soda, tgr: Backport for [[gerrit:1128489|Lua: Prevent PHP errors in production from displayNumber lookup (T383924)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:27:19] <Sohom_Datta>	 Works!
[20:27:28] <logmsgbot>	 !log tgr@deploy2002 soda, tgr: Continuing with sync
[20:27:51] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] GlobalContributions: Use unique CentralAuth tokens per request [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) (owner: 10Máté Szabó)
[20:29:01] <Jdlrobson>	 thanks tgr_ for your help today!
[20:32:12] <Sohom_Datta>	 +1, thank you :)
[20:32:47] <tgr_>	 yw
[20:32:57] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4043.ulsfo.wmnet
[20:33:47] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128489|Lua: Prevent PHP errors in production from displayNumber lookup (T383924)]] (duration: 11m 54s)
[20:33:50] <stashbot>	 T383924: ProofreadPage\Pagination\PageNotInPaginationException: $page does not belong to the pagination - https://phabricator.wikimedia.org/T383924
[20:34:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:36:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:36:21] <wikibugs>	 (03PS1) 10BCornwall: upgrade cp4043 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737)
[20:38:49] <wikibugs>	 (03Merged) 10jenkins-bot: GlobalContributions: Use unique CentralAuth tokens per request [extensions/CheckUser] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128493 (https://phabricator.wikimedia.org/T384717) (owner: 10Máté Szabó)
[20:38:51] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] upgrade cp4043 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[20:39:49] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128493|GlobalContributions: Use unique CentralAuth tokens per request (T384717)]]
[20:39:53] <stashbot>	 T384717: Investigate external API call error on Special:GlobalContributions - https://phabricator.wikimedia.org/T384717
[20:40:24] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[20:40:41] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp4043 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1128527 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall)
[20:43:41] <logmsgbot>	 !log tgr@deploy2002 tgr, mszabo: Backport for [[gerrit:1128493|GlobalContributions: Use unique CentralAuth tokens per request (T384717)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:43:49] <mszabo>	 looking
[20:44:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:45:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye
[20:45:33] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10644008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2089.codfw...
[20:46:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:48:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:48:50] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet
[20:49:13] <mszabo>	 tgr: looks ok
[20:50:15] <logmsgbot>	 !log tgr@deploy2002 tgr, mszabo: Continuing with sync
[20:50:30] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Do not trigger edge login on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128501 (owner: 10Gergő Tisza)
[20:50:33] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[20:53:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:57:26] <wikibugs>	 (03Merged) 10jenkins-bot: Do not trigger edge login on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128501 (owner: 10Gergő Tisza)
[20:57:26] <wikibugs>	 (03Merged) 10jenkins-bot: Do not initiate central login on the passive central domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128515 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[20:57:49] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128493|GlobalContributions: Use unique CentralAuth tokens per request (T384717)]] (duration: 18m 00s)
[20:57:53] <stashbot>	 T384717: Investigate external API call error on Special:GlobalContributions - https://phabricator.wikimedia.org/T384717
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T2100).
[21:00:52] <tgr_>	 Reedy, sbassett, Maryum, manfredi: do you plan to use the window? I have another hour's worth of patches to go
[21:01:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:06:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:06:58] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128501|Do not trigger edge login on the shared domain]], [[gerrit:1128515|Do not initiate central login on the passive central domain (T388218)]]
[21:07:02] <stashbot>	 T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218
[21:08:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644083 (10phaultfinder)
[21:08:57] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:eqsin and A:cp for 9.2.9-1wm1
[21:10:43] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1128501|Do not trigger edge login on the shared domain]], [[gerrit:1128515|Do not initiate central login on the passive central domain (T388218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:10:50] <swfrench-wmf>	 !log ran `reprepro --delete clearvanished` to complete removal of unused component/pcre2 - T386006
[21:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:54] <stashbot>	 T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006
[21:17:25] <wikibugs>	 (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811)
[21:18:11] <wikibugs>	 (03PS2) 10Ryan Kemper: sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811)
[21:18:34] <wikibugs>	 (03CR) 10Bking: [C:03+2] sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper)
[21:18:37] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] sre.elasticsearch.rolling-operation: make plugin upgrade work for opensearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1128533 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper)
[21:26:03] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 247 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 242, active_shards: 242, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 245, delayed_unassigned_shards: 0, number_of_pending_tasks: 85, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 16264, active_shards_percent_as_number: 49.48875255623722 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:07] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[21:26:15] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 247 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 242, active_shards: 242, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 247, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 49.48875255623722 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:03] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2089.codfw.wmnet with OS bullseye
[21:29:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119
[21:29:11] <stashbot>	 T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[21:29:13] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10644131 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2089.codfw.wmn...
[21:30:10] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119
[21:30:41] <wikibugs>	 (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1128536
[21:32:51] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128501|Do not trigger edge login on the shared domain]], [[gerrit:1128515|Do not initiate central login on the passive central domain (T388218)]] (duration: 25m 53s)
[21:32:57] <stashbot>	 T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218
[21:34:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119
[21:34:33] <stashbot>	 T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119
[21:34:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[21:35:32] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119
[21:35:44] <wikibugs>	 (03Merged) 10jenkins-bot: Re-apply "Fix some SUL3 shared domain settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128496 (https://phabricator.wikimedia.org/T388218) (owner: 10Gergő Tisza)
[21:36:03] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128496|Re-apply "Fix some SUL3 shared domain settings" (T388218)]]
[21:38:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1128536 (owner: 10Ryan Kemper)
[21:40:48] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1128496|Re-apply "Fix some SUL3 shared domain settings" (T388218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:40:52] <stashbot>	 T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218
[21:48:47] <wikibugs>	 (03PS1) 10Eevans: corto: set production irc channels in /srv/git/private [puppet] - 10https://gerrit.wikimedia.org/r/1128538
[21:49:48] <wikibugs>	 (03CR) 10Eevans: [C:03+2] corto: set production irc channels in /srv/git/private [puppet] - 10https://gerrit.wikimedia.org/r/1128538 (owner: 10Eevans)
[22:01:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644245 (10phaultfinder)
[22:24:28] <wikibugs>	 (03Abandoned) 10BCornwall: ncmonitor: Ignore wikipediacreators.com [puppet] - 10https://gerrit.wikimedia.org/r/1115996 (owner: 10BCornwall)
[22:27:01] <wikibugs>	 (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor)
[22:32:17] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: rm edit-for-pay domains from ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1128543
[22:34:03] <wikibugs>	 (03PS2) 10BCornwall: ncmonitor: rm edit-for-pay domains from ignorelist [puppet] - 10https://gerrit.wikimedia.org/r/1128543
[22:34:11] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[22:34:11] <wikibugs>	 (03PS1) 10Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046)
[22:35:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: 10Btullis)
[22:37:27] <wikibugs>	 (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1126676/5096/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1126676 (https://phabricator.wikimedia.org/T388354) (owner: 10Dzahn)
[22:37:31] <wikibugs>	 (03PS1) 10BCornwall: ncredir: Redirect seized edit-for-pay domains [puppet] - 10https://gerrit.wikimedia.org/r/1128545
[22:38:28] <wikibugs>	 (03PS1) 10Cwhite: logstash: add ids to plugins where missing [puppet] - 10https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072)
[22:38:46] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "compiler output shows how it affects stewards-l but not other lists" [puppet] - 10https://gerrit.wikimedia.org/r/1126676 (https://phabricator.wikimedia.org/T388354) (owner: 10Dzahn)
[22:39:43] <wikibugs>	 (03PS1) 10Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547
[22:40:07] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5097/co" [puppet] - 10https://gerrit.wikimedia.org/r/1128543 (owner: 10BCornwall)
[22:40:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[22:40:14] <wikibugs>	 (03CR) 10Bking: [C:03+1] opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478 (owner: 10DCausse)
[22:40:38] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128496|Re-apply "Fix some SUL3 shared domain settings" (T388218)]] (duration: 64m 35s)
[22:40:42] <stashbot>	 T388218: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralDomainUtils::getWikiPageUrl() must be of the type string, null given - https://phabricator.wikimedia.org/T388218
[22:40:47] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5098/co" [puppet] - 10https://gerrit.wikimedia.org/r/1128545 (owner: 10BCornwall)
[22:41:31] <wikibugs>	 (03PS2) 10Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547
[22:41:57] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[22:41:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[22:41:59] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm. another change will then add them to ncredir service?" [puppet] - 10https://gerrit.wikimedia.org/r/1128543 (owner: 10BCornwall)
[22:42:04] <wikibugs>	 (03PS2) 10Cwhite: logstash: add ids to plugins where missing [puppet] - 10https://gerrit.wikimedia.org/r/1128546 (https://phabricator.wikimedia.org/T389072)
[22:42:21] <wikibugs>	 (03CR) 10Ebernhardson: "I'm not 100% sure this is the best solution, ideally i think we would want a way to tell opensearch to look in a directory other than the " [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[22:45:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[22:45:46] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] "Indeed, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128545/" [puppet] - 10https://gerrit.wikimedia.org/r/1128543 (owner: 10BCornwall)
[22:47:30] <wikibugs>	 (03PS2) 10Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046)
[22:48:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: 10Btullis)
[22:50:19] <wikibugs>	 (03PS3) 10Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547
[22:50:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10644539 (10phaultfinder)
[22:50:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[22:52:54] <wikibugs>	 (03Merged) 10jenkins-bot: Re-apply "Try both SUL2 and SUL3 central domain for autologin" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128502 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[22:53:13] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128502|Re-apply "Try both SUL2 and SUL3 central domain for autologin" (T375796)]]
[22:53:16] <stashbot>	 T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796
[22:53:36] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm. nice solution" [puppet] - 10https://gerrit.wikimedia.org/r/1128545 (owner: 10BCornwall)
[22:54:27] <wikibugs>	 (03PS4) 10Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547
[22:55:08] <wikibugs>	 (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[22:57:11] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1128502|Re-apply "Try both SUL2 and SUL3 central domain for autologin" (T375796)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:59:17] <wikibugs>	 (03PS5) 10Ebernhardson: opensearch: Symlink sudachi dictionary into per-instance config [puppet] - 10https://gerrit.wikimedia.org/r/1128547
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250317T2300)
[23:00:42] <wikibugs>	 (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1128547 (owner: 10Ebernhardson)
[23:01:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f9fc2c741c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:01:37] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:02:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8243: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:02:11] <icinga-wm>	 PROBLEM - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:02:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1257.eqiad.wmnet with OS bookworm
[23:02:37] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10644582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1257.eqiad.wmnet with OS bookworm
[23:04:17] <inflatador>	 ^^ looking at the Cloudelastic alerts now
[23:04:26] <inflatador>	 they should clear shortly
[23:04:42] <jinxer-wm>	 FIRING: CertManagerCertNotReady: Certificate istio-system/jaeger is not in a ready state (k8s-aux@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-aux&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[23:05:35] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] ncredir: Redirect seized edit-for-pay domains [puppet] - 10https://gerrit.wikimedia.org/r/1128545 (owner: 10BCornwall)
[23:08:05] <wikibugs>	 (03PS3) 10Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046)
[23:09:07] <wikibugs>	 (03PS4) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor)
[23:09:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046) (owner: 10Btullis)
[23:09:45] <wikibugs>	 (03PS1) 10Dzahn: lists::automation: explain how this can sync mailman list members [puppet] - 10https://gerrit.wikimedia.org/r/1128551 (https://phabricator.wikimedia.org/T388354)
[23:11:06] <wikibugs>	 (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1115984 (owner: 10Ncmonitor)
[23:11:28] <wikibugs>	 (03CR) 10Pppery: [C:03+1] ncredir: Redirect seized edit-for-pay domains [puppet] - 10https://gerrit.wikimedia.org/r/1128545 (owner: 10BCornwall)
[23:12:15] <wikibugs>	 (03PS4) 10Btullis: data-platform: Fix the dashboard URL of the stats server load alert [alerts] - 10https://gerrit.wikimedia.org/r/1128544 (https://phabricator.wikimedia.org/T373046)
[23:13:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:13:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:13:11] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Prod MW AppServer Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 715 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:13:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: green, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 764, active_shards: 1531, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks
[23:13:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] lists::automation: explain how this can sync mailman list members [puppet] - 10https://gerrit.wikimedia.org/r/1128551 (https://phabricator.wikimedia.org/T388354) (owner: 10Dzahn)
[23:13:23] <icinga-wm>	 ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 261, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:13:37] <icinga-wm>	 RECOVERY - WMF Cloud -Chi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 715 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:14:20] <wikibugs>	 (03CR) 10Pppery: ncmonitor: rm edit-for-pay domains from ignorelist (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1128543 (owner: 10BCornwall)
[23:15:17] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: rm edit-for-pay domains from ignorelist (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1128543 (owner: 10BCornwall)
[23:16:43] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor/ncredir: Add two more edit-for-pay sites [puppet] - 10https://gerrit.wikimedia.org/r/1128559
[23:21:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:21:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudelasticlb6_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_8443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb6_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled: cloudelasticlb_9443: Servers cloudelastic1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:22:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:22:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:22:06] <inflatador>	 ^^ again, these should clear momentarily
[23:23:21] <wikibugs>	 (03CR) 10Pppery: [C:03+1] ncmonitor/ncredir: Add two more edit-for-pay sites [puppet] - 10https://gerrit.wikimedia.org/r/1128559 (owner: 10BCornwall)
[23:23:41] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1128559 (owner: 10BCornwall)
[23:29:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:33:37] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[23:34:19] <wikibugs>	 (03PS1) 10Gergő Tisza: Do not schedule edge login recursively [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132)
[23:35:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) (owner: 10Gergő Tisza)
[23:37:43] <zabe>	 !log zabe@mwmaint2002:~$ cat group0.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php {} --deletedump /home/zabe/afl_text_table_deletedump/{} --dump /home/zabe/afl_text_table_dump/{}" # T381599
[23:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:46] <stashbot>	 T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[23:39:58] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128502|Re-apply "Try both SUL2 and SUL3 central domain for autologin" (T375796)]] (duration: 46m 45s)
[23:40:02] <stashbot>	 T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796
[23:41:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) (owner: 10Gergő Tisza)
[23:43:14] <wikibugs>	 (03Merged) 10jenkins-bot: Do not schedule edge login recursively [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1128560 (https://phabricator.wikimedia.org/T389132) (owner: 10Gergő Tisza)
[23:43:34] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]]
[23:43:38] <stashbot>	 T389132: Neverending edge login - https://phabricator.wikimedia.org/T389132
[23:47:26] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:48:39] <wikibugs>	 (03PS4) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715)
[23:49:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:52:55] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[23:59:09] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128560|Do not schedule edge login recursively (T389132)]] (duration: 15m 35s)
[23:59:13] <stashbot>	 T389132: Neverending edge login - https://phabricator.wikimedia.org/T389132