[00:08:06] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P93802 and previous config saved to /var/cache/conftool/dbconfig/20260604-000805-fceratto.json [00:09:12] (03PS1) 10RLazarus: scaffold: Bump mesh.service version from 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297258 [00:10:38] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Beta cluster haproxy does not support `warn-blocked-traffic-after` keyword - https://phabricator.wikimedia.org/T428052#11983573 (10bd808) [00:10:46] 06SRE, 10Beta-Cluster-Infrastructure, 06Traffic: Beta cluster haproxy does not support `warn-blocked-traffic-after` keyword - https://phabricator.wikimedia.org/T428052#11983578 (10bd808) [00:15:23] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 290850680 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:16:12] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [00:16:13] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [00:16:56] !incidents [00:16:56] 8056 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule@main) [00:16:57] 8057 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule@main) [00:16:57] 8054 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad) [00:16:57] 8053 (RESOLVED) ATSBackendErrorsHigh cache_upload sre 
(swift.discovery.wmnet eqiad) [00:17:23] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 57392 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:18:14] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P93803 and previous config saved to /var/cache/conftool/dbconfig/20260604-001813-fceratto.json [00:18:55] !log mwscript-k8s --follow --dblist=all -- extensions/timeline/maintenance/DeleteOldTimelineFiles.php --date 20210101000000 [00:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:11] (03PS1) 10Ottomata: EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297260 (https://phabricator.wikimedia.org/T425087) [00:21:12] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [00:21:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [00:21:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297260 (https://phabricator.wikimedia.org/T425087) (owner: 10Ottomata) [00:23:12] (03Merged) 10jenkins-bot: EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297260 (https://phabricator.wikimedia.org/T425087) (owner: 10Ottomata) 
[00:24:34] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1297260|EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion (T425087)]] [00:24:38] T425087: Send JSON access logs for dumps.wikimedia.org to Kafka - https://phabricator.wikimedia.org/T425087 [00:26:40] !log otto@deploy1003 otto: Backport for [[gerrit:1297260|EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion (T425087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:28:16] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11983600 (10Bawolff) >>! In T427949#11982519, @Nemoralis wrote: > I think one of the questions that needs to be asked here is, are many of these f... [00:28:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T426633)', diff saved to https://phabricator.wikimedia.org/P93804 and previous config saved to /var/cache/conftool/dbconfig/20260604-002821-fceratto.json [00:28:44] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: Maintenance [00:28:52] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T426633)', diff saved to https://phabricator.wikimedia.org/P93805 and previous config saved to /var/cache/conftool/dbconfig/20260604-002851-fceratto.json [00:39:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T426633)', diff saved to https://phabricator.wikimedia.org/P93806 and previous config saved to /var/cache/conftool/dbconfig/20260604-003914-fceratto.json [00:39:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:49:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P93807 and previous config saved to /var/cache/conftool/dbconfig/20260604-004922-fceratto.json [00:59:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P93808 and previous config saved to /var/cache/conftool/dbconfig/20260604-005929-fceratto.json [01:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:09:38] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T426633)', diff saved to https://phabricator.wikimedia.org/P93809 and previous config saved to /var/cache/conftool/dbconfig/20260604-010937-fceratto.json [01:09:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1297267 [01:09:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1297267 (owner: 10TrainBranchBot) [01:09:58] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1260.eqiad.wmnet with reason: Maintenance [01:10:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T426633)', diff saved to https://phabricator.wikimedia.org/P93810 and previous config saved to /var/cache/conftool/dbconfig/20260604-011005-fceratto.json [01:16:55] (03CR) 10RLazarus: mesh.service: Add TLS service ports for restricted_listeners (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 
(https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [01:17:32] (03PS2) 10RLazarus: Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296063 [01:17:32] (03PS2) 10RLazarus: Copy mesh.configuration 1.15.2 -> 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296064 [01:17:32] (03PS2) 10RLazarus: mesh.networkpolicy: Handle a services_proxy entry with no upstream.ips [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) [01:17:32] (03PS2) 10RLazarus: Copy mesh.service 1.2.0 -> 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296066 [01:17:33] (03PS3) 10RLazarus: mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) [01:17:34] (03PS3) 10RLazarus: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) [01:17:54] (03PS1) 10Pppery: Redirect unknown wikinews languages to portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297268 (https://phabricator.wikimedia.org/T427126) [01:18:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T426633)', diff saved to https://phabricator.wikimedia.org/P93811 and previous config saved to /var/cache/conftool/dbconfig/20260604-011818-fceratto.json [01:22:08] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1297267 (owner: 10TrainBranchBot) [01:24:34] (03CR) 10Pppery: wmf-config: Add new private1-eqsin subnets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [01:26:54] (03CR) 10RLazarus: mesh.networkpolicy: Handle a services_proxy entry with no upstream.ips (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [01:28:26] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P93812 and previous config saved to /var/cache/conftool/dbconfig/20260604-012826-fceratto.json [01:29:25] (03CR) 10RLazarus: mesh.service: Add TLS service ports for restricted_listeners (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [01:38:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P93813 and previous config saved to /var/cache/conftool/dbconfig/20260604-013833-fceratto.json [01:47:22] Yes. [01:47:27] er, wrong window, hi [01:48:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T426633)', diff saved to https://phabricator.wikimedia.org/P93814 and previous config saved to /var/cache/conftool/dbconfig/20260604-014841-fceratto.json [01:49:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1261.eqiad.wmnet with reason: Maintenance [01:49:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1261 (T426633)', diff saved to https://phabricator.wikimedia.org/P93815 and previous config saved to /var/cache/conftool/dbconfig/20260604-014909-fceratto.json [01:57:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T426633)', diff saved to https://phabricator.wikimedia.org/P93816 and previous config saved to /var/cache/conftool/dbconfig/20260604-015718-fceratto.json [02:01:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1281901 
(https://phabricator.wikimedia.org/T424413) (owner: 10Codename Noreste) [02:07:27] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P93817 and previous config saved to /var/cache/conftool/dbconfig/20260604-020726-fceratto.json [02:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P93818 and previous config saved to /var/cache/conftool/dbconfig/20260604-021734-fceratto.json [02:27:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T426633)', diff saved to https://phabricator.wikimedia.org/P93819 and previous config saved to /var/cache/conftool/dbconfig/20260604-022742-fceratto.json [02:28:02] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1262.eqiad.wmnet with reason: Maintenance [02:28:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T426633)', diff saved to https://phabricator.wikimedia.org/P93820 and previous config saved to /var/cache/conftool/dbconfig/20260604-022809-fceratto.json [02:33:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:20] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T426633)', diff saved to https://phabricator.wikimedia.org/P93821 and previous config saved to 
/var/cache/conftool/dbconfig/20260604-023619-fceratto.json [02:46:28] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P93822 and previous config saved to /var/cache/conftool/dbconfig/20260604-024627-fceratto.json [02:55:38] (03PS3) 10RLazarus: Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296063 [02:55:38] (03PS3) 10RLazarus: mesh.networkpolicy: Add ingress ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) [02:55:38] (03PS4) 10RLazarus: mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) [02:55:39] (03PS4) 10RLazarus: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) [02:55:40] (03PS3) 10RLazarus: function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) [02:55:41] (03PS3) 10RLazarus: wikifunctions: Add mesh.restricted_listeners port to orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863) [02:55:45] (03PS3) 10RLazarus: function-evaluator: Add outgoing Envoy config and egress policy for callbacks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863) [02:56:35] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P93823 and previous config saved to /var/cache/conftool/dbconfig/20260604-025634-fceratto.json [03:01:21] (03PS4) 10RLazarus: Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1296063 [03:01:21] (03PS4) 10RLazarus: mesh.networkpolicy: Add ingress ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) [03:01:21] (03PS5) 10RLazarus: mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) [03:01:22] (03PS5) 10RLazarus: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) [03:01:23] (03PS4) 10RLazarus: function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) [03:01:24] (03PS4) 10RLazarus: wikifunctions: Add mesh.restricted_listeners port to orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863) [03:01:28] (03PS4) 10RLazarus: function-evaluator: Add outgoing Envoy config and egress policy for callbacks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863) [03:06:43] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T426633)', diff saved to https://phabricator.wikimedia.org/P93824 and previous config saved to /var/cache/conftool/dbconfig/20260604-030642-fceratto.json [03:07:03] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1263.eqiad.wmnet with reason: Maintenance [03:07:11] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T426633)', diff saved to https://phabricator.wikimedia.org/P93825 and previous config saved to /var/cache/conftool/dbconfig/20260604-030710-fceratto.json [03:15:24] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T426633)', diff saved 
to https://phabricator.wikimedia.org/P93826 and previous config saved to /var/cache/conftool/dbconfig/20260604-031523-fceratto.json [03:25:31] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P93827 and previous config saved to /var/cache/conftool/dbconfig/20260604-032531-fceratto.json [03:34:08] (03CR) 10RLazarus: "No, you're right! It felt wrong in a vaguely scopey way -- mesh.networkpolicy doesn't depend on mesh.configuration, so why should the mesh" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [03:34:14] (03CR) 10RLazarus: mesh.networkpolicy: Add ingress ports for restricted_listeners (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [03:34:36] (03Abandoned) 10RLazarus: orchestrator: Add restricted_listeners ports to network egress policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296070 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [03:35:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P93828 and previous config saved to /var/cache/conftool/dbconfig/20260604-033538-fceratto.json [03:39:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:44:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T426633)', diff saved to https://phabricator.wikimedia.org/P93829 and previous config saved to /var/cache/conftool/dbconfig/20260604-034546-fceratto.json [04:07:37] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [04:10:37] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [04:18:39] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1002 is CRITICAL: PROCS CRITICAL: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [04:19:39] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1002 is OK: PROCS OK: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [05:01:26] (03CR) 10Kevin Bazira: [C:03+2] ml-services: add cope-b-a4b isvc to experimental ns (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297199 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [05:03:27] (03Merged) 10jenkins-bot: ml-services: add cope-b-a4b isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297199 (https://phabricator.wikimedia.org/T427497) (owner: 10Kevin Bazira) [05:09:14] FIRING: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:13:40] (03CR) 10Ayounsi: "Nice that's worth a try with test-cookbook." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) (owner: 10Cathal Mooney) [05:19:16] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [05:22:05] (03PS1) 10Giuseppe Lavagetto: requestctl_client: remove the absenting of the old package [puppet] - 10https://gerrit.wikimedia.org/r/1297288 [05:22:05] (03PS1) 10Giuseppe Lavagetto: requestctl: sync script [puppet] - 10https://gerrit.wikimedia.org/r/1297289 (https://phabricator.wikimedia.org/T428119) [05:22:07] (03PS1) 10Giuseppe Lavagetto: hiddenparma: switch to db-backed api tokens [puppet] - 10https://gerrit.wikimedia.org/r/1297290 (https://phabricator.wikimedia.org/T428119) [05:22:09] (03PS1) 10Giuseppe Lavagetto: requestctl: fetch api credentials from hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1297291 (https://phabricator.wikimedia.org/T428119) [05:23:27] (03CR) 10CI reject: [V:04-1] hiddenparma: switch to db-backed api tokens [puppet] - 10https://gerrit.wikimedia.org/r/1297290 (https://phabricator.wikimedia.org/T428119) (owner: 10Giuseppe Lavagetto) [05:25:37] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2215 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1297332 (https://phabricator.wikimedia.org/T428120) [05:27:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2215 with weight 0 T428120', diff saved to https://phabricator.wikimedia.org/P93830 and previous config saved to /var/cache/conftool/dbconfig/20260604-052722-marostegui.json [05:27:27] T428120: Switchover x1 master (db2191 -> db2215) - https://phabricator.wikimedia.org/T428120 [05:27:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Primary switchover x1 T428120 [05:27:48] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2215 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1297332 
(https://phabricator.wikimedia.org/T428120) (owner: 10Gerrit maintenance bot) [05:37:05] (03PS1) 10Giuseppe Lavagetto: Add new private info stub for hiddenparma [labs/private] - 10https://gerrit.wikimedia.org/r/1297493 (https://phabricator.wikimedia.org/T428119) [05:38:41] (03PS2) 10Giuseppe Lavagetto: requestctl: fetch api credentials from hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1297291 (https://phabricator.wikimedia.org/T428119) [05:44:39] (03PS1) 10Ayounsi: Sort webrequest_sampled_live dimensions alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/1297534 [05:44:49] !log Starting x1 codfw failover from db2191 to db2215 - T428120 [05:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:53] T428120: Switchover x1 master (db2191 -> db2215) - https://phabricator.wikimedia.org/T428120 [05:45:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2215 to x1 primary T428120', diff saved to https://phabricator.wikimedia.org/P93831 and previous config saved to /var/cache/conftool/dbconfig/20260604-054528-marostegui.json [05:46:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2191 T428120', diff saved to https://phabricator.wikimedia.org/P93832 and previous config saved to /var/cache/conftool/dbconfig/20260604-054614-marostegui.json [05:48:45] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [05:48:54] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[05:49:45] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [05:50:06] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 18 hosts with reason: Primary switchover x3 T427895 [05:50:10] T427895: Switchover x3 master (db1255 -> db1258) - https://phabricator.wikimedia.org/T427895 [05:50:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [05:50:22] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set db1258 with weight 0 T427895', diff saved to https://phabricator.wikimedia.org/P93833 and previous config saved to /var/cache/conftool/dbconfig/20260604-055021-cwilliams.json [05:50:25] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2191: Upgrading db2191.codfw.wmnet [05:50:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2191: Upgrading db2191.codfw.wmnet [05:51:45] (03PS1) 10Kosta Harlan: hCaptcha risk scores: VE plugin to collect risk scores for block notices [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297536 (https://phabricator.wikimedia.org/T426943) [05:52:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2191.codfw.wmnet with OS trixie [05:52:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297536 (https://phabricator.wikimedia.org/T426943) (owner: 10Kosta Harlan) [05:52:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) 
(owner: 10Harroyo-wmf) [05:52:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297200 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [05:53:22] !log Starting x3 eqiad failover from db1255 to db1258 - T427895 [05:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:47] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Set x3 eqiad as read-only for maintenance - T427895', diff saved to https://phabricator.wikimedia.org/P93834 and previous config saved to /var/cache/conftool/dbconfig/20260604-055346-cwilliams.json [05:54:30] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Promote db1258 to x3 primary and set section read-write T427895', diff saved to https://phabricator.wikimedia.org/P93835 and previous config saved to /var/cache/conftool/dbconfig/20260604-055429-cwilliams.json [05:58:44] (03CR) 10CWilliams: [C:03+2] mariadb: Promote db1258 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1296510 (https://phabricator.wikimedia.org/T427895) (owner: 10Gerrit maintenance bot) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T0600). 
[06:02:02] !log cwilliams@dns1004 START - running authdns-update [06:03:27] !log cwilliams@dns1004 END - running authdns-update [06:04:29] !log cwilliams@cumin1003 dbctl commit (dc=all): 'Depool db1255 T427895', diff saved to https://phabricator.wikimedia.org/P93836 and previous config saved to /var/cache/conftool/dbconfig/20260604-060428-cwilliams.json [06:04:33] T427895: Switchover x3 master (db1255 -> db1258) - https://phabricator.wikimedia.org/T427895 [06:06:46] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [06:11:02] (03PS1) 10Ayounsi: webrequest_sampled_live: add "kind: number" when relevant [puppet] - 10https://gerrit.wikimedia.org/r/1297539 [06:11:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2191.codfw.wmnet with reason: host reimage [06:11:46] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [06:12:47] !log cwilliams@cumin1003 START - Cookbook sre.mysql.major-upgrade [06:12:56] !log cwilliams@cumin1003 START - Cookbook sre.mysql.depool depool db1255: Upgrading db1255.eqiad.wmnet [06:13:06] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1255: Upgrading db1255.eqiad.wmnet [06:15:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2191.codfw.wmnet with reason: host reimage [06:16:48] !log cwilliams@cumin1003 START - Cookbook sre.hosts.reimage for host db1255.eqiad.wmnet with OS trixie [06:31:29] !log cwilliams@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1255.eqiad.wmnet with reason: host reimage [06:32:40] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2191.codfw.wmnet with OS trixie [06:35:32] !log 
cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1255.eqiad.wmnet with reason: host reimage [06:38:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2191: Migration of db2191.codfw.wmnet completed [06:40:21] (03PS3) 10Hashar: jenkins: ensure service is absent on new Jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:40:24] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:42:34] (03PS4) 10Hashar: jenkins: ensure service is absent on new Jenkins host [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:43:01] (03CR) 10Hashar: "> Hosts: O:jenkins results in:" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:43:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [06:51:45] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1255.eqiad.wmnet with OS trixie [06:52:14] (03CR) 10Marostegui: "Looks good to me but let's check with @rcoccioli@wikimedia.org" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [06:53:47] !log cwilliams@cumin1003 START - Cookbook sre.mysql.pool pool db1255: Migration of db1255.eqiad.wmnet completed [06:58:51] (03CR) 10Hashar: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1297236 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [07:00:05] Amir1, urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T0700). Please do the needful. 
[07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:20] (03CR) 10Cathal Mooney: [C:03+1] Loopback filter: allow internal traceroutes [homer/public] - 10https://gerrit.wikimedia.org/r/1296933 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:00:31] hi [07:02:07] btullis: there’s a deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1297260 that didn’t finish syncing yesterday [07:02:16] Should I sync it now? [07:03:31] !log otto@deploy1003 otto: Rolling back deployment [07:04:00] (03PS1) 10Kosta Harlan: Revert "EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297550 [07:04:04] !log otto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297260|EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion (T425087)]] (duration: 399m 30s) [07:04:08] T425087: Send JSON access logs for dumps.wikimedia.org to Kafka - https://phabricator.wikimedia.org/T425087 [07:04:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297550 (owner: 10Kosta Harlan) [07:05:14] (03Merged) 10jenkins-bot: Revert "EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297550 (owner: 10Kosta Harlan) [07:06:10] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1297550|Revert "EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion"]] [07:07:55] ottomata: I reverted https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1297260, because the change was synced to the test server but not actually deployed [07:08:17] !log kharlan@deploy1003 kharlan: 
Backport for [[gerrit:1297550|Revert "EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:08:45] !log kharlan@deploy1003 kharlan: Continuing with deployment [07:12:56] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297550|Revert "EventStreamConfig - webrequest.dumps.dev0 - enable canary events for hive ingestion"]] (duration: 06m 45s) [07:13:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297536 (https://phabricator.wikimedia.org/T426943) (owner: 10Kosta Harlan) [07:13:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297200 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [07:13:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) (owner: 10Harroyo-wmf) [07:14:45] (03Merged) 10jenkins-bot: hCaptcha: Enable risk-score collection for users blocked by IP blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297173 (https://phabricator.wikimedia.org/T424629) (owner: 10Harroyo-wmf) [07:15:08] (03Merged) 10jenkins-bot: hCaptcha risk scores: VE plugin to collect risk scores for block notices [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297536 (https://phabricator.wikimedia.org/T426943) (owner: 10Kosta Harlan) [07:24:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2191: Migration of db2191.codfw.wmnet completed [07:24:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [07:24:35] 
(03Merged) 10jenkins-bot: hCaptcha: Render a fresh mobile widget for each captcha attempt [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297200 (https://phabricator.wikimedia.org/T425929) (owner: 10Kosta Harlan) [07:25:09] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1297536|hCaptcha risk scores: VE plugin to collect risk scores for block notices (T426943)]], [[gerrit:1297200|hCaptcha: Render a fresh mobile widget for each captcha attempt (T425929)]], [[gerrit:1297173|hCaptcha: Enable risk-score collection for users blocked by IP blocks (T424629)]] [07:25:17] T426943: hCaptcha risk scores (MobileFrontend): Load ext.confirmEdit.hCaptcha when a block message is shown - https://phabricator.wikimedia.org/T426943 [07:25:17] T425929: Cannot publish after dismissing hCaptcha challenge triggered by AbuseFilter on mobile source editor - https://phabricator.wikimedia.org/T425929 [07:25:17] T424629: [epic] WE4.10.5 hCaptcha risk scores for blocked edit notices - https://phabricator.wikimedia.org/T424629 [07:25:48] (03CR) 10Ayounsi: [C:03+2] Loopback filter: allow internal traceroutes [homer/public] - 10https://gerrit.wikimedia.org/r/1296933 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:27:12] !log kharlan@deploy1003 kharlan, harroyo-wmf: Backport for [[gerrit:1297536|hCaptcha risk scores: VE plugin to collect risk scores for block notices (T426943)]], [[gerrit:1297200|hCaptcha: Render a fresh mobile widget for each captcha attempt (T425929)]], [[gerrit:1297173|hCaptcha: Enable risk-score collection for users blocked by IP blocks (T424629)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:27:13] (03Merged) 10jenkins-bot: Loopback filter: allow internal traceroutes [homer/public] - 10https://gerrit.wikimedia.org/r/1296933 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:29:57] !log kharlan@deploy1003 kharlan, harroyo-wmf: Continuing with deployment [07:32:21] (03PS1) 10Ayounsi: allow_traceroute add action accept [homer/public] - 10https://gerrit.wikimedia.org/r/1297560 (https://phabricator.wikimedia.org/T348120) [07:33:12] (03CR) 10Cathal Mooney: [C:03+1] allow_traceroute add action accept [homer/public] - 10https://gerrit.wikimedia.org/r/1297560 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:33:52] (03CR) 10Ayounsi: [C:03+2] allow_traceroute add action accept [homer/public] - 10https://gerrit.wikimedia.org/r/1297560 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:34:05] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297536|hCaptcha risk scores: VE plugin to collect risk scores for block notices (T426943)]], [[gerrit:1297200|hCaptcha: Render a fresh mobile widget for each captcha attempt (T425929)]], [[gerrit:1297173|hCaptcha: Enable risk-score collection for users blocked by IP blocks (T424629)]] (duration: 08m 56s) [07:34:13] T426943: hCaptcha risk scores (MobileFrontend): Load ext.confirmEdit.hCaptcha when a block message is shown - https://phabricator.wikimedia.org/T426943 [07:34:13] T425929: Cannot publish after dismissing hCaptcha challenge triggered by AbuseFilter on mobile source editor - https://phabricator.wikimedia.org/T425929 [07:34:13] T424629: [epic] WE4.10.5 hCaptcha risk scores for blocked edit notices - https://phabricator.wikimedia.org/T424629 [07:35:11] (03Merged) 10jenkins-bot: allow_traceroute add action accept [homer/public] - 10https://gerrit.wikimedia.org/r/1297560 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:38:56] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:39:17] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1255: Migration of db1255.eqiad.wmnet completed [07:39:18] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [07:40:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:40:56] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:41:13] (03PS1) 10Brouberol: idp: add the kafka-ui service [puppet] - 10https://gerrit.wikimedia.org/r/1297563 (https://phabricator.wikimedia.org/T428053) [07:41:15] (03PS1) 10Brouberol: trafficserver: enable access to kafka.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1297564 (https://phabricator.wikimedia.org/T428053) [07:41:16] (03PS1) 10Brouberol: Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) [07:41:18] (03PS1) 10Brouberol: Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) [07:41:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [07:41:58] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1224: Upgrading db1224.eqiad.wmnet [07:42:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool 
db1224: Upgrading db1224.eqiad.wmnet [07:42:44] (03PS1) 10Marostegui: control-mariadb-10.11-trixie: Change version [software] - 10https://gerrit.wikimedia.org/r/1297567 (https://phabricator.wikimedia.org/T427345) [07:43:01] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1224.eqiad.wmnet with OS trixie [07:44:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:45:50] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-trixie: Change version [software] - 10https://gerrit.wikimedia.org/r/1297567 (https://phabricator.wikimedia.org/T427345) (owner: 10Marostegui) [07:46:08] (03PS1) 10Ayounsi: allow_traceroute: restrict SRL term to IPv4 [homer/public] - 10https://gerrit.wikimedia.org/r/1297568 (https://phabricator.wikimedia.org/T348120) [07:46:21] (03Merged) 10jenkins-bot: control-mariadb-10.11-trixie: Change version [software] - 10https://gerrit.wikimedia.org/r/1297567 (https://phabricator.wikimedia.org/T427345) (owner: 10Marostegui) [07:47:19] (03CR) 10Brouberol: Define the kafka-ui chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:49:02] (03CR) 10Cathal Mooney: [C:03+1] allow_traceroute: restrict SRL term to IPv4 [homer/public] - 10https://gerrit.wikimedia.org/r/1297568 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:49:33] (03CR) 10JMeybohm: "Good catch, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1297101 (https://phabricator.wikimedia.org/T388969) (owner: 10Clément Goubert) [07:49:51] (03CR) 10Ayounsi: [C:03+2] allow_traceroute: restrict SRL term to IPv4 [homer/public] - 10https://gerrit.wikimedia.org/r/1297568 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:51:21] (03Merged) 10jenkins-bot: allow_traceroute: restrict SRL term to IPv4 [homer/public] - 10https://gerrit.wikimedia.org/r/1297568 (https://phabricator.wikimedia.org/T348120) (owner: 10Ayounsi) [07:53:32] (03PS1) 10Marostegui: db2249: Remove old note [puppet] - 10https://gerrit.wikimedia.org/r/1297620 [07:53:45] !log Install mariadb 10.11.17 on db2249 T427345 [07:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:49] T427345: Compile and package MariaDB 10.11.17 - https://phabricator.wikimedia.org/T427345 [07:54:06] (03CR) 10Brouberol: [C:04-1] "Temporary -1 until the app is properly deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1297564 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [07:55:27] (03CR) 10JMeybohm: kubernetes-1.31: Update systemd overrides and changelog. (031 comment) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [08:00:05] dancy and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T0800). 
nyaa~ [08:00:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage [08:02:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2249.codfw.wmnet with reason: upgrade [08:04:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1224.eqiad.wmnet with reason: host reimage [08:05:23] (03PS1) 10Cathal Mooney: ha-proxy: add TIFF image files to list of extensions for low-prio qos [puppet] - 10https://gerrit.wikimedia.org/r/1297621 (https://phabricator.wikimedia.org/T428098) [08:06:40] (03PS2) 10Brouberol: Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) [08:06:40] (03PS2) 10Brouberol: Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) [08:09:25] (03CR) 10Elukey: "Tried to reply, lemme know your thoughts!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [08:12:43] (03CR) 10AikoChou: "I'd suggest load-testing this before exposing it via rest-gateway — getting an idea of how many rps it can serve to set replicas/autoscali" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [08:13:19] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11984142 (10cmooney) >>! In T427393#11983173, @BCornwall wrote: > I was advised by @taavi to also update mediawiki-config's `wmf-config... 
[08:18:22] (03PS3) 10Tiziano Fogli: Overide CertAlmostExpired for network devices [alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [08:20:00] (03CR) 10Volans: "I think it would be less confusing if we were logging in both cases the same "value", so either the end time or the duration, but trying t" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [08:21:00] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [08:21:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1224.eqiad.wmnet with OS trixie [08:21:40] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [08:24:18] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [08:24:19] (03CR) 10CI reject: [V:04-1] Overide CertAlmostExpired for network devices [alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [08:24:33] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [08:25:00] marostegui@cumin1003 major-upgrade (PID 1052418) is awaiting input [08:25:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:26:55] (03CR) 10Tiziano Fogli: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [08:27:13] (03CR) 10Marostegui: [C:03+2] db2249: Remove old note [puppet] - 10https://gerrit.wikimedia.org/r/1297620 (owner: 10Marostegui) [08:29:00] !log daniel@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [08:29:21] (03CR) 10Tiziano Fogli: [C:03+1] "I uploaded a new version of the test with a warning-only scenario 
for lsw1-f2-codfw.mgmt.codfw.wmnet (to avoid having the same label set i" [alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [08:29:45] !log daniel@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [08:30:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:31:08] marostegui@cumin1003 major-upgrade (PID 1052418) is awaiting input [08:31:13] (03CR) 10Ozge: "Please note that the api is reading from a dict in memory." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [08:31:20] I know logmsgbot I am dealing with it! [08:31:44] (03CR) 10Elukey: [C:03+1] reuse-parts.sh: Allow to reuse swap with trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297186 (https://phabricator.wikimedia.org/T428078) (owner: 10JMeybohm) [08:32:22] (03CR) 10Elukey: [C:03+1] reuse-raid10-6dev.cfg: Fix swap reuse and grub-install on all disks [puppet] - 10https://gerrit.wikimedia.org/r/1297201 (https://phabricator.wikimedia.org/T428078) (owner: 10JMeybohm) [08:33:19] (03PS2) 10Giuseppe Lavagetto: requestctl_client: remove the absenting of the old package [puppet] - 10https://gerrit.wikimedia.org/r/1297288 [08:33:19] (03PS2) 10Giuseppe Lavagetto: requestctl: sync script [puppet] - 10https://gerrit.wikimedia.org/r/1297289 (https://phabricator.wikimedia.org/T428119) [08:33:19] (03PS2) 10Giuseppe Lavagetto: hiddenparma: switch to db-backed api tokens [puppet] - 10https://gerrit.wikimedia.org/r/1297290 (https://phabricator.wikimedia.org/T428119) [08:33:20] (03PS3) 10Giuseppe Lavagetto: requestctl: fetch api credentials from hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1297291 
(https://phabricator.wikimedia.org/T428119) [08:34:38] (03CR) 10CI reject: [V:04-1] hiddenparma: switch to db-backed api tokens [puppet] - 10https://gerrit.wikimedia.org/r/1297290 (https://phabricator.wikimedia.org/T428119) (owner: 10Giuseppe Lavagetto) [08:35:10] (03PS1) 10Kevin Bazira: ml-services: bump cope-b-a4b isvc memory to 64Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297622 [08:36:33] (03PS9) 10Ozge: ml-services: add editing-suggestions isvc to experimental (eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [08:36:51] (03CR) 10Ozge: ml-services: add editing-suggestions isvc to experimental (eqiad) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [08:39:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:40:36] (03CR) 10Btullis: Define the kafka-ui chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [08:42:08] (03PS10) 10Ozge: ml-services: add editing-suggestions isvc to experimental (eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [08:42:21] (03CR) 10Btullis: Define the kafka-ui multi-cluster helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [08:42:35] (03CR) 10Bartosz Wójtowicz: [C:03+1] ml-services: bump cope-b-a4b isvc memory to 64Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297622 (owner: 10Kevin Bazira) [08:42:36] (03CR) 10Ozge: "moved to ml-staging-codfw" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [08:43:14] (03PS11) 10Ozge: ml-services: add editing-suggestions isvc to experimental (eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [08:43:46] (03CR) 10Kevin Bazira: [C:03+2] ml-services: bump cope-b-a4b isvc memory to 64Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297622 (owner: 10Kevin Bazira) [08:43:51] (03CR) 10Btullis: "Looks good to me, as long as I/F are also happy with it." [puppet] - 10https://gerrit.wikimedia.org/r/1297563 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [08:43:55] (03CR) 10Btullis: [C:03+1] idp: add the kafka-ui service [puppet] - 10https://gerrit.wikimedia.org/r/1297563 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [08:44:28] (03PS1) 10Elukey: Modify rules to build sphinx on Bookworm and Trixie [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297624 (https://phabricator.wikimedia.org/T428024) [08:44:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:44:51] (03CR) 10Btullis: [C:03+1] "Looks good. Agree on holding until the app is deployed." [puppet] - 10https://gerrit.wikimedia.org/r/1297564 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [08:45:58] (03Merged) 10jenkins-bot: ml-services: bump cope-b-a4b isvc memory to 64Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297622 (owner: 10Kevin Bazira) [08:48:09] (03PS3) 10Blake: kubernetes-1.31: Update systemd overrides and changelog. 
[debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) [08:49:23] (03PS2) 10JMeybohm: reuse-parts.sh: Allow to reuse swap with trixie [puppet] - 10https://gerrit.wikimedia.org/r/1297186 (https://phabricator.wikimedia.org/T428078) [08:49:23] (03PS2) 10JMeybohm: reuse-raid10-6dev.cfg: Fix swap reuse and grub-install on all disks [puppet] - 10https://gerrit.wikimedia.org/r/1297201 (https://phabricator.wikimedia.org/T428078) [08:50:31] (03PS1) 10Giuseppe Lavagetto: Deploy pluggable authentication system [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1297626 (https://phabricator.wikimedia.org/T428119) [08:50:43] (03PS4) 10Blake: kubernetes-1.31: Update systemd overrides and changelog. [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) [08:50:54] (03PS5) 10Blake: kubernetes-1.31: Update systemd overrides and changelog. [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) [08:51:33] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Deploy pluggable authentication system [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1297626 (https://phabricator.wikimedia.org/T428119) (owner: 10Giuseppe Lavagetto) [08:52:06] (03PS10) 10Daniel Kinzler: EXPERIMENT: run smokepy tests via helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267959 (https://phabricator.wikimedia.org/T424825) [08:52:10] (03PS3) 10Brouberol: Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) [08:52:10] (03PS3) 10Brouberol: Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) [08:52:26] (03PS6) 10Blake: kubernetes-1.31: Update systemd service restart behaviour and changelog. 
[debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) [08:52:39] (03CR) 10Daniel Kinzler: [C:03+2] "Merging to trigger build pipeline. Will revert immediately." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267959 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [08:52:54] (03CR) 10Blake: kubernetes-1.31: Update systemd service restart behaviour and changelog. (031 comment) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [08:53:08] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Introduce pluggable authentication - oblivian@cumin1003" [08:53:10] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Introduce pluggable authentication - oblivian@cumin1003 [08:53:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:53:45] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1224: Migration of db1224.eqiad.wmnet completed [08:54:04] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Introduce pluggable authentication - oblivian@cumin1003 [08:54:06] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Introduce pluggable authentication - oblivian@cumin1003" [08:55:57] (03CR) 10Brouberol: Define the kafka-ui chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 
(https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [08:58:48] (03CR) 10Ilias Sarantopoulos: "Thanks! this LGTM, I left a comment regarding the commit msg." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [08:59:12] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: NetboxAccounting - https://phabricator.wikimedia.org/T428132 (10LSobanski) 03NEW [08:59:30] 07sre-alert-triage, 06ServiceOps new: Alert in need of triage: ProbeDown (instance sophroid:4252) - https://phabricator.wikimedia.org/T428133 (10LSobanski) 03NEW [08:59:34] (03PS8) 10Daniel Kinzler: EXPERIMENT: rest-gateway: Dockerize system tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286896 (https://phabricator.wikimedia.org/T424825) [08:59:40] (03CR) 10Daniel Kinzler: [C:03+2] EXPERIMENT: rest-gateway: Dockerize system tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286896 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [09:00:07] (03CR) 10Daniel Kinzler: [C:03+2] "Merging to trigger build pipeline. Will revert immediately." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1286896 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [09:00:43] (03PS12) 10Ozge: ml-services: add editing-suggestions isvc to experimental (ml-staging-codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) [09:00:43] (03PS2) 10Elukey: Modify rules to build sphinx on Bookworm and Trixie [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297624 (https://phabricator.wikimedia.org/T428024) [09:00:59] (03CR) 10Ozge: ml-services: add editing-suggestions isvc to experimental (ml-staging-codfw) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:01:37] (03CR) 10Brouberol: Define the kafka-ui multi-cluster helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:01:56] (03CR) 10Btullis: Define the kafka-ui chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:01:59] (03Merged) 10jenkins-bot: EXPERIMENT: rest-gateway: Dockerize system tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286896 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [09:02:05] (03CR) 10AikoChou: "Yeah, it should be very fast — probably one replica is enough, no autoscaling needed. Still good to have the number though, e.g. 
when the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:02:11] (03PS3) 10Elukey: Modify rules to build sphinx on Bookworm and Trixie [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297624 (https://phabricator.wikimedia.org/T428024) [09:02:13] (03CR) 10Btullis: [C:03+1] Define the kafka-ui multi-cluster helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:02:15] (03Merged) 10jenkins-bot: EXPERIMENT: run smokepy tests via helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267959 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [09:02:34] (03CR) 10Brouberol: Define the kafka-ui chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:03:05] (03CR) 10AikoChou: [C:03+1] ml-services: add editing-suggestions isvc to experimental (ml-staging-codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:03:13] (03CR) 10Ozge: [C:03+2] ml-services: add editing-suggestions isvc to experimental (ml-staging-codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:03:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:03:59] RESOLVED: CertAlmostExpired: Certificate for service lsw1-f1-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#lsw1-f1-codfw.mgmt.codfw.wmnet:32767 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:04:39] (03CR) 10Volans: [C:03+1] "LGTM, one caveat inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 (owner: 10Elukey) [09:05:16] (03Merged) 10jenkins-bot: ml-services: add editing-suggestions isvc to experimental (ml-staging-codfw) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:05:53] (03PS1) 10Ayounsi: Remove allow_traceroute from SRL [homer/public] - 10https://gerrit.wikimedia.org/r/1297628 [09:06:26] (03CR) 10Cathal Mooney: [C:03+1] Remove allow_traceroute from SRL [homer/public] - 10https://gerrit.wikimedia.org/r/1297628 (owner: 10Ayounsi) [09:06:52] (03PS3) 10Jcrespo: backup: Add job ids for read-only backups [puppet] - 10https://gerrit.wikimedia.org/r/1297081 (https://phabricator.wikimedia.org/T424661) [09:07:47] (03PS4) 10Brouberol: Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) [09:07:47] (03PS4) 10Brouberol: Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) [09:08:20] (03CR) 10Ayounsi: [C:03+2] Remove allow_traceroute from SRL [homer/public] - 10https://gerrit.wikimedia.org/r/1297628 (owner: 10Ayounsi) [09:09:40] (03Merged) 10jenkins-bot: Remove allow_traceroute from SRL [homer/public] - 10https://gerrit.wikimedia.org/r/1297628 (owner: 10Ayounsi) [09:10:23] (03CR) 10Ayounsi: [C:03+2] Overide CertAlmostExpired for network devices [alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [09:10:31] (03CR) 10Ayounsi: [C:03+2] "Awesome, thanks a lot!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [09:10:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:11:24] (03PS1) 10Daniel Kinzler: Revert "EXPERIMENT: rest-gateway: Dockerize system tests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297630 [09:11:34] (03CR) 10CI reject: [V:04-1] Revert "EXPERIMENT: rest-gateway: Dockerize system tests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297630 (owner: 10Daniel Kinzler) [09:11:44] (03PS1) 10Daniel Kinzler: Revert "EXPERIMENT: run smokepy tests via helm test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297631 [09:11:55] 06SRE, 06Infrastructure-Foundations, 10netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120#11984329 (10ayounsi) 05Open→03Resolved a:03ayounsi ` lsw1-a8-codfw> traceroute lo0.lsw1-a2-codfw.codfw.wmnet traceroute to lo0.lsw1-a2-codfw.co... [09:12:15] (03Merged) 10jenkins-bot: Overide CertAlmostExpired for network devices [alerts] - 10https://gerrit.wikimedia.org/r/1297163 (owner: 10Ayounsi) [09:12:27] (03CR) 10Ozge: [C:03+2] "ah I merged it before reading this message. Hopefully it should be quick next time though." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:14:16] 07sre-alert-triage, 06ServiceOps new: Alert in need of triage: ProbeDown (instance sophroid:4252) - https://phabricator.wikimedia.org/T428133#11984344 (10MLechvien-WMF) a:03jasmine_ @jasmine_ can you please take a look? cc @RLazarus Can we also see why this alert was not routed our way automatically? 
[09:15:39] (03PS5) 10Brouberol: Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) [09:15:39] (03PS5) 10Brouberol: Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) [09:16:43] (03PS2) 10Daniel Kinzler: Revert "EXPERIMENT: rest-gateway: Dockerize system tests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297630 [09:17:01] (03CR) 10Btullis: Define the kafka-ui chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:17:11] (03CR) 10Daniel Kinzler: [C:03+2] Revert "EXPERIMENT: run smokepy tests via helm test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297631 (owner: 10Daniel Kinzler) [09:17:47] (03CR) 10Elukey: "Tested for Bookworm and Trixie!" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297624 (https://phabricator.wikimedia.org/T428024) (owner: 10Elukey) [09:18:25] (03CR) 10Brouberol: Define the kafka-ui chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:18:33] (03CR) 10Daniel Kinzler: [C:03+2] "revert to deployed status quo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297630 (owner: 10Daniel Kinzler) [09:19:30] (03Merged) 10jenkins-bot: Revert "EXPERIMENT: run smokepy tests via helm test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297631 (owner: 10Daniel Kinzler) [09:20:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:20:54] (03Merged) 10jenkins-bot: Revert 
"EXPERIMENT: rest-gateway: Dockerize system tests" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297630 (owner: 10Daniel Kinzler) [09:20:57] (03PS4) 10Elukey: Upgrade config and code to allow Trixie builds [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297184 (https://phabricator.wikimedia.org/T428024) [09:21:59] (03PS1) 10Daniel Kinzler: rest-gateway: bump apit-gateway chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297633 [09:22:08] (03CR) 10CI reject: [V:04-1] rest-gateway: bump apit-gateway chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297633 (owner: 10Daniel Kinzler) [09:22:42] (03PS2) 10Daniel Kinzler: rest-gateway: bump apit-gateway chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297633 [09:23:29] (03CR) 10Daniel Kinzler: [C:03+2] "Make the chart the we reverted to the latest chart." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297633 (owner: 10Daniel Kinzler) [09:24:45] (03PS1) 10Jcrespo: mariadb: Add requestctl to the allow list on backups [puppet] - 10https://gerrit.wikimedia.org/r/1297634 (https://phabricator.wikimedia.org/T411111) [09:25:55] !log Running `/usr/local/bin/foreachwikiindblist "group0.dblist + group1.dblist - mediamoderation-continuous-scan.dblist" extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep=1 --poll-sleep=10 --verbose` [09:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:59] (03Merged) 10jenkins-bot: rest-gateway: bump apit-gateway chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297633 (owner: 10Daniel Kinzler) [09:26:19] !log Running `mwscript-k8s extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki="commonswiki" --use-jobqueue --poll-sleep=30 --sleep=60 --verbose` [09:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:52] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:27:34] (03CR) 10Btullis: [C:03+1] Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:27:49] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297184 (https://phabricator.wikimedia.org/T428024) (owner: 10Elukey) [09:28:23] (03CR) 10Brouberol: [C:03+2] Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:28:27] (03CR) 10Brouberol: [C:03+2] Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:28:41] (03CR) 10Marostegui: [C:03+1] mariadb: Add requestctl to the allow list on backups [puppet] - 10https://gerrit.wikimedia.org/r/1297634 (https://phabricator.wikimedia.org/T411111) (owner: 10Jcrespo) [09:30:15] (03Merged) 10jenkins-bot: Define the kafka-ui chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297565 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:30:34] (03CR) 10Jcrespo: [C:03+2] mariadb: Add requestctl to the allow list on backups [puppet] - 10https://gerrit.wikimedia.org/r/1297634 (https://phabricator.wikimedia.org/T411111) (owner: 10Jcrespo) [09:30:37] (03Merged) 10jenkins-bot: Define the kafka-ui multi-cluster helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297566 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:30:44] (03PS2) 10Jcrespo: mariadb: Add requestctl to the allow list on backups [puppet] - 10https://gerrit.wikimedia.org/r/1297634 (https://phabricator.wikimedia.org/T411111) [09:31:42] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:32:03] !log 
marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2057: Upgrading es2057.codfw.wmnet [09:32:24] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2057: Upgrading es2057.codfw.wmnet [09:33:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2057.codfw.wmnet with OS trixie [09:34:05] (03PS2) 10Brouberol: idp: add the kafka-ui service [puppet] - 10https://gerrit.wikimedia.org/r/1297563 (https://phabricator.wikimedia.org/T428053) [09:34:05] (03PS2) 10Brouberol: trafficserver: enable access to kafka.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1297564 (https://phabricator.wikimedia.org/T428053) [09:35:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/kafka-ui: apply [09:36:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/kafka-ui: apply [09:36:03] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:36:03] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:36:03] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [09:36:03] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [09:37:55] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/kafka-ui: apply [09:38:17] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/kafka-ui: apply [09:39:16] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1224: Migration of db1224.eqiad.wmnet completed [09:39:17] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [09:39:22] (03CR) 10JMeybohm: [C:03+1] kubernetes-1.31: Update systemd service restart behaviour and changelog. [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [09:39:47] (03CR) 10Brouberol: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1297564 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [09:39:57] (03CR) 10Blake: [C:03+2] kubernetes-1.31: Update systemd service restart behaviour and changelog. [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297128 (https://phabricator.wikimedia.org/T427065) (owner: 10Blake) [09:42:12] (03CR) 10CWilliams: Provide downtime duration information in sre.mysql cookbooks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [09:43:18] (03CR) 10Ilias Sarantopoulos: ml-services: add editing-suggestions isvc to experimental (ml-staging-codfw) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297106 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [09:44:02] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:44:04] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [09:44:04] RECOVERY - Confd vcl based reload on cp6013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:44:04] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:44:04] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [09:45:48] (03CR) 10Jcrespo: [C:03+2] mariadb: Add requestctl to the allow list on backups [puppet] - 10https://gerrit.wikimedia.org/r/1297634 (https://phabricator.wikimedia.org/T411111) (owner: 10Jcrespo) [09:47:14] (03CR) 10Elukey: Fix datetime-related and pytest warnings (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 (owner: 10Elukey) [09:47:25] (03CR) 10Elukey: [C:03+2] Fix datetime-related and pytest warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 (owner: 10Elukey) [09:47:34] (03CR) 10Elukey: Fix datetime-related and pytest warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293719 (owner: 10Elukey) [09:49:34] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2057.codfw.wmnet with reason: host reimage [09:49:56] (03PS1) 10Tiziano Fogli: slothslos/report2drive: add dummy drive secret [labs/private] - 10https://gerrit.wikimedia.org/r/1297640 (https://phabricator.wikimedia.org/T425795) [09:50:50] (03PS2) 10Tiziano Fogli: slothslos/report2drive: add dummy drive secret [labs/private] - 10https://gerrit.wikimedia.org/r/1297640 (https://phabricator.wikimedia.org/T425795) [09:50:52] (03PS1) 10Elukey: role::cache::{text,upload}: enable webrequest tagging in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1297641 (https://phabricator.wikimedia.org/T402512) [09:51:55] (03CR) 10Tiziano Fogli: [C:03+2] slothslos/report2drive: add dummy drive secret [labs/private] - 
10https://gerrit.wikimedia.org/r/1297640 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [09:52:09] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] slothslos/report2drive: add dummy drive secret [labs/private] - 10https://gerrit.wikimedia.org/r/1297640 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [09:52:10] (03CR) 10JMeybohm: [C:03+2] "I did run a some more reimage tests:" [puppet] - 10https://gerrit.wikimedia.org/r/1297186 (https://phabricator.wikimedia.org/T428078) (owner: 10JMeybohm) [09:52:16] (03CR) 10JMeybohm: [C:03+2] reuse-raid10-6dev.cfg: Fix swap reuse and grub-install on all disks [puppet] - 10https://gerrit.wikimedia.org/r/1297201 (https://phabricator.wikimedia.org/T428078) (owner: 10JMeybohm) [09:53:49] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:54:06] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2057.codfw.wmnet with reason: host reimage [09:54:35] (03CR) 10Volans: Provide downtime duration information in sre.mysql cookbooks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [09:56:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [09:57:14] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1179: Upgrading db1179.eqiad.wmnet [09:58:09] !log redoing m2 backups after grant change T411111 [09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:13] T411111: Database Creation request for requestctl.wikimedia.org - https://phabricator.wikimedia.org/T411111 [09:58:25] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1179: Upgrading db1179.eqiad.wmnet [09:59:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1179.eqiad.wmnet with OS trixie [09:59:32] (03PS1) 
10Btullis: dumps: web: Make nginx ECS log_format conform to the Event Platform schema [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1000) [10:00:24] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:04:10] (03CR) 10Brouberol: [C:03+2] idp: add the kafka-ui service [puppet] - 10https://gerrit.wikimedia.org/r/1297563 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [10:04:12] (03CR) 10Dpogorzelski: ml-services: add liftwing-openapi-server deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [10:04:18] (03CR) 10Brouberol: [C:03+2] trafficserver: enable access to kafka.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1297564 (https://phabricator.wikimedia.org/T428053) (owner: 10Brouberol) [10:05:10] (03PS1) 10Tiziano Fogli: slothslos/report2drive: adjust key name [labs/private] - 10https://gerrit.wikimedia.org/r/1297643 (https://phabricator.wikimedia.org/T425795) [10:05:47] (03CR) 10Tiziano Fogli: [C:03+2] slothslos/report2drive: adjust key name [labs/private] - 10https://gerrit.wikimedia.org/r/1297643 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [10:05:49] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] slothslos/report2drive: adjust key name [labs/private] - 10https://gerrit.wikimedia.org/r/1297643 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [10:06:16] (03CR) 10Ottomata: [C:03+1] dumps: web: Make nginx ECS log_format conform to the Event Platform schema [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:06:44] (03CR) 10Brouberol: [C:03+1] dumps: web: Make 
nginx ECS log_format conform to the Event Platform schema [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:09:31] (03CR) 10Ottomata: dumps: web: Make nginx ECS log_format conform to the Event Platform schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:10:44] (03CR) 10Dpogorzelski: liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1297168 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [10:11:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2057.codfw.wmnet with OS trixie [10:11:42] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, and 3 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11984536 (10BTullis) [10:11:43] (03CR) 10Jgiannelos: [C:03+2] tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295932 (owner: 10Jgiannelos) [10:12:28] (03CR) 10Ottomata: "Sorry, I meant to restart eventgate-analytics to pick this up but clearly forgot." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297260 (https://phabricator.wikimedia.org/T425087) (owner: 10Ottomata) [10:13:23] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [10:13:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2057: repool after upgrade [10:13:56] (03Merged) 10jenkins-bot: tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1295932 (owner: 10Jgiannelos) [10:14:00] kostajh: thank you, sorry about that. [10:14:22] (03CR) 10Ottomata: "Oh I didn't even finish syncing it. Sorry about that." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297550 (owner: 10Kosta Harlan) [10:15:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1179.eqiad.wmnet with reason: host reimage [10:15:35] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: apply [10:15:43] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [10:15:51] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [10:15:54] (03CR) 10Giuseppe Lavagetto: [C:03+1] role::cache::{text,upload}: enable webrequest tagging in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1297641 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [10:16:23] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [10:17:00] (03CR) 10Clément Goubert: ml-services: add liftwing-openapi-server deployment (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [10:18:51] (03CR) 10Clément Goubert: liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297168 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [10:18:54] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, and 3 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11984548 (10BTullis) [10:19:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1179.eqiad.wmnet with reason: host reimage [10:22:14] (03PS2) 10Audrey Penven: Update config for WikiProjects linking prototype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) [10:22:14] (03PS1) 10Audrey 
Penven: WikiProject links - remove 'text' config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297644 (https://phabricator.wikimedia.org/T427804) [10:23:37] (03CR) 10Cathal Mooney: "Thanks for the patch. Logic is good, probably should put the vlan names after but I asked taavi on irc about it in general. I think it m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [10:27:45] (03PS2) 10Btullis: dumps: web: Make nginx ECS log_format conform to the Event Platform schema [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) [10:29:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:30:06] (03CR) 10Audrey Penven: "I considered doing this to avoid interruption on Beta and Test, and I think this definitely makes sense in a production context." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [10:32:12] (03CR) 10Btullis: dumps: web: Make nginx ECS log_format conform to the Event Platform schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [10:34:05] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11984612 (10PantheraLeo1359531) Orthophotos are among the few media types that simultaneously serve as illustrations, historical records, geospati... 
[10:34:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:38:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1179.eqiad.wmnet with OS trixie [10:40:05] (03CR) 10Majavah: wmf-config: Add new private1-eqsin subnets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [10:42:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:46:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1179: Migration of db1179.eqiad.wmnet completed [10:48:16] (03PS1) 10Blake: kubernetes-1.31: fix changelog date format [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1297648 (https://phabricator.wikimedia.org/T427065) [10:55:13] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [10:57:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:58:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:59:08] !log marostegui@cumin1003 END (PASS) - 
Cookbook sre.mysql.pool (exit_code=0) pool es2057: repool after upgrade [10:59:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [10:59:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es2050: Upgrading es2050.codfw.wmnet [11:00:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es2050: Upgrading es2050.codfw.wmnet [11:00:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2050.codfw.wmnet with OS trixie [11:02:12] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11984649 (10jcrespo) I don't think anyone is disputing that orthophotos can be educationally useful. The question is whether storing massive numbe... [11:09:37] (03CR) 10CWilliams: Provide downtime duration information in sre.mysql cookbooks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [11:16:35] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2050.codfw.wmnet with reason: host reimage [11:20:24] (03CR) 10Clément Goubert: "Tagging Arnaud for potential changes to miscweb chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:23:05] (03CR) 10Clément Goubert: liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297168 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:23:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2050.codfw.wmnet with reason: host reimage [11:24:28] (03CR) 10Clément Goubert: [C:03+2] deployment-server: Symlink clusterinfo for cluster_alias [puppet] - 10https://gerrit.wikimedia.org/r/1297101 
(https://phabricator.wikimedia.org/T388969) (owner: 10Clément Goubert) [11:27:03] (03CR) 10Volans: Provide downtime duration information in sre.mysql cookbooks (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297126 (https://phabricator.wikimedia.org/T427780) (owner: 10CWilliams) [11:32:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1179: Migration of db1179.eqiad.wmnet completed [11:32:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [11:36:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:37:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db1220: Upgrading db1220.eqiad.wmnet [11:40:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1220: Upgrading db1220.eqiad.wmnet [11:40:47] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2050.codfw.wmnet with OS trixie [11:42:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1220.eqiad.wmnet with OS trixie [11:44:07] marostegui@cumin1003 major-upgrade (PID 1091846) is awaiting input [11:44:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:21] (03PS3) 10Gkyziridis: ml-services: add liftwing-openapi-server deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) [11:49:14] (03PS1) 10CDanis: cache: haproxy: enable_mlock 🚀codfw&drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1297655 [11:52:08] (03PS2) 10Gkyziridis: liftwing-openapi-server: Add new admin_ng service for serving OpenAPI specs [puppet] - 10https://gerrit.wikimedia.org/r/1297168 
(https://phabricator.wikimedia.org/T427902) [11:54:11] (03CR) 10Clément Goubert: ml-services: add liftwing-openapi-server deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [11:59:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1200) [12:02:24] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Okay to deploy at any time, should be a no-op. (The messages don’t exist yet – I only just merged the Extension:Wikidata.org change that a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [12:02:36] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] WikiProject links - remove 'text' config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297644 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [12:02:47] (03PS1) 10Cathal Mooney: netops: only alert on high optical power beyond safe threshold [alerts] - 10https://gerrit.wikimedia.org/r/1297664 [12:04:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1220.eqiad.wmnet with reason: host reimage [12:04:22] 06SRE, 10Wikimedia-Mailing-lists: Create new mailing lists: foundationbulletin@lists.wikimedia.org - https://phabricator.wikimedia.org/T428054#11984877 (10IKristiani-WMF) Thank you! bulletin@ is simpler but not clear which bulletin it refers to. foundationbulletin@ might be long but clearer. >>! In T428054#1... 
[12:04:34] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [12:04:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es2050: repool after upgrade [12:06:23] (03PS2) 10Cathal Mooney: netops: only alert on high optical power beyond safe threshold [alerts] - 10https://gerrit.wikimedia.org/r/1297664 [12:06:42] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "FWIW, for me it’s less about the broken functionality and more about logspam from PHP warnings about the missing array keys. (I *think* th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [12:07:37] (03PS1) 10Daniel Kinzler: EXPERIMENT: rest-gateway: Dockerize system tests (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297666 (https://phabricator.wikimedia.org/T424825) [12:07:38] (03CR) 10Ayounsi: [C:03+1] netops: only alert on high optical power beyond safe threshold [alerts] - 10https://gerrit.wikimedia.org/r/1297664 (owner: 10Cathal Mooney) [12:09:21] (03PS1) 10Daniel Kinzler: EXPERIMENT: run smokepy tests via helm test (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297668 (https://phabricator.wikimedia.org/T424825) [12:09:31] (03CR) 10CI reject: [V:04-1] EXPERIMENT: run smokepy tests via helm test (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297668 (https://phabricator.wikimedia.org/T424825) (owner: 10Daniel Kinzler) [12:09:36] (03PS2) 10Daniel Kinzler: EXPERIMENT: run smokepy tests via helm test (again) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297668 (https://phabricator.wikimedia.org/T424825) [12:12:14] (03PS4) 10Gkyziridis: ml-services: add liftwing-openapi-server deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297167 (https://phabricator.wikimedia.org/T427902) [12:12:24] (03CR) 10Cathal Mooney: [C:03+2] netops: only alert on high optical power beyond safe threshold [alerts] - 
10https://gerrit.wikimedia.org/r/1297664 (owner: 10Cathal Mooney) [12:12:42] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11984904 (10MatthewVernon) Wearing my WMF staff hat, I'd like to note that "we should not store uncompressed TIFFs in commons" is definitely our c... [12:15:37] (03Merged) 10jenkins-bot: netops: only alert on high optical power beyond safe threshold [alerts] - 10https://gerrit.wikimedia.org/r/1297664 (owner: 10Cathal Mooney) [12:17:18] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1297670 (owner: 10L10n-bot) [12:20:38] jouncebot: nowandnext [12:20:38] For the next 0 hour(s) and 39 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1200) [12:20:38] In 0 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1300) [12:20:44] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1220.eqiad.wmnet with OS trixie [12:21:38] Anyone object to me using scap? 
[12:22:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:22:56] (03PS3) 10Dreamy Jazz: wmf-config: Skip CAPTCHA for action=mcrundo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:23:04] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:27:58] (03CR) 10Elukey: [C:03+2] Upgrade config and code to allow Trixie builds [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297184 (https://phabricator.wikimedia.org/T428024) (owner: 10Elukey) [12:28:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1220: Migration of db1220.eqiad.wmnet completed [12:28:12] RECOVERY - Confd vcl based reload on cp6011 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [12:28:16] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, and 3 others: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#11984961 (10BTullis) Setting T425087 as a parent task, since t... 
[12:28:47] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1220 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1297683 (https://phabricator.wikimedia.org/T428158) [12:28:52] (03PS1) 10Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1297684 (https://phabricator.wikimedia.org/T428158) [12:29:22] (03PS1) 10Svantje Lilienthal: Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) [12:30:02] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [12:30:05] (03CR) 10CI reject: [V:04-1] wmf-config: Skip CAPTCHA for action=mcrundo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:30:12] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [12:30:12] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [12:31:20] (03CR) 10CI reject: [V:04-1] Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [12:34:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:34:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:34:55] (Failure was random castor failure) [12:37:12] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/kafka-ui: apply [12:37:30] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/kafka-ui: apply [12:41:34] gate-and-submit for mediawiki config patches is being very slow.... [12:42:33] (03CR) 10Scott French: [C:03+1] scaffold: Bump mesh.service version from 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297258 (owner: 10RLazarus) [12:42:51] All the jobs are waiting on castor... [12:44:57] 06SRE, 10Wikimedia-Mailing-lists: Create new mailing lists: foundationbulletin@lists.wikimedia.org - https://phabricator.wikimedia.org/T428054#11985041 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/foundation-bulletin.lists.wikimedia.org/ {{done}} [12:47:45] (03CR) 10WMDE-Fisch: "Seems like we have to set the disallowed wikis manually 🤔" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [12:47:53] (03CR) 10Dreamy Jazz: [V:03+2] "Castor is stuck. 
The jobs passed otherwise" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296557 (https://phabricator.wikimedia.org/T427612) (owner: 10Mpostoronca) [12:48:23] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1296557|wmf-config: Skip CAPTCHA for action=mcrundo (T427612)]] [12:48:28] T427612: hCaptcha: mcrundo cannot be used when hCaptcha is enabled for editing - https://phabricator.wikimedia.org/T427612 [12:48:33] Force merged the config patch as the jobs all passed with the only issue being castor not running for 10+ minutes [12:48:54] (03PS1) 10Cathal Mooney: network data.yaml: add new per-rack vlan ranges for eqiad ab refresh [puppet] - 10https://gerrit.wikimedia.org/r/1297685 (https://phabricator.wikimedia.org/T418012) [12:49:08] (03CR) 10Ladsgroup: [C:03+1] "I'll get it deployed later today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297268 (https://phabricator.wikimedia.org/T427126) (owner: 10Pppery) [12:50:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es2050: repool after upgrade [12:50:25] !log dreamyjazz@deploy1003 mpostoronca, dreamyjazz: Backport for [[gerrit:1296557|wmf-config: Skip CAPTCHA for action=mcrundo (T427612)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:52:01] (03CR) 10Ottomata: [C:03+1] dumps: web: Make nginx ECS log_format conform to the Event Platform schema [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [12:52:40] !log dreamyjazz@deploy1003 mpostoronca, dreamyjazz: Continuing with deployment [12:53:05] (03CR) 10Thiemo Kreuz (WMDE): "What if we change the default to true and exclude what we know needs to be excluded, including group2?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [12:54:32] (03PS2) 10Cathal Mooney: wmf-config: Update private subnets to include additions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [12:54:41] jouncebot: nowandnext [12:54:41] For the next 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1200) [12:54:41] In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1300) [12:55:02] (03CR) 10Cathal Mooney: wmf-config: Update private subnets to include additions (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [12:56:54] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296557|wmf-config: Skip CAPTCHA for action=mcrundo (T427612)]] (duration: 08m 30s) [12:57:06] T427612: hCaptcha: mcrundo cannot be used when hCaptcha is enabled for editing - https://phabricator.wikimedia.org/T427612 [12:57:13] (03PS3) 10Cathal Mooney: wmf-config: Update private subnets to include additions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297237 (https://phabricator.wikimedia.org/T427393) (owner: 10BCornwall) [12:59:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.major-upgrade [12:59:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool es1057: Upgrading es1057.eqiad.wmnet [13:00:02] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage - https://phabricator.wikimedia.org/T428161 (10Jclark-ctr) 03NEW [13:00:05] Lucas_WMDE, urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1300). [13:00:05] codenamenoreste: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool es1057: Upgrading es1057.eqiad.wmnet [13:00:25] I can deploy (there should also be another patch to deploy by yerdua_wmde in a moment ^^) [13:00:41] Does the patch in the window need discussion first? [13:00:42] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1057.eqiad.wmnet with OS trixie [13:00:44] Seems there is a -1 on it [13:00:57] (03CR) 10Scott French: [C:03+1] mesh.networkpolicy: Add ingress ports for restricted_listeners (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [13:01:03] I haven’t looked at it yet [13:01:04] * Lucas_WMDE looks [13:01:09] oh that [13:01:32] (03PS1) 10Ayounsi: gNMI TLS probe monitoring: add Nokia port [puppet] - 10https://gerrit.wikimedia.org/r/1297688 [13:01:59] yeah, sorry, I’m not deploying a config change that appears to actively circumvent a BoT decision [13:02:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [13:02:42] (03CR) 10Btullis: [C:03+2] dumps: web: Make nginx ECS log_format conform to the Event Platform schema [puppet] - 10https://gerrit.wikimedia.org/r/1297642 (https://phabricator.wikimedia.org/T291645) (owner: 10Btullis) [13:03:10] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage - 
https://phabricator.wikimedia.org/T428161#11985128 (10Jclark-ctr) @cmooney @ayounsi Following the NetOps sync on Tuesday, I verified the serial numbers of the fabric cards in storage and documented them Can you advise if these can be... [13:04:02] (03CR) 10Scott French: [C:03+1] mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [13:04:13] (commented on the phab task too, for visibility, since codenamenoreste is once again not online at the beginning of the window…) [13:04:14] (03CR) 10Awight: "This sounds like a good workaround." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [13:04:50] (03PS2) 10Ayounsi: gNMI TLS probe monitoring: add Nokia port [puppet] - 10https://gerrit.wikimedia.org/r/1297688 [13:04:58] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297688 (owner: 10Ayounsi) [13:05:30] yerdua_wmde: can you add your config change to the deployment schedule? 
[13:05:58] (03CR) 10Cathal Mooney: [C:03+1] gNMI TLS probe monitoring: add Nokia port [puppet] - 10https://gerrit.wikimedia.org/r/1297688 (owner: 10Ayounsi) [13:06:18] Lucas_WMDE: I think I just did [13:06:25] * Lucas_WMDE looks [13:06:31] indeed [13:06:41] ah, yes, 15:02 was the wikibugs message that I missed [13:07:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [13:08:17] (03CR) 10Ayounsi: [C:03+2] gNMI TLS probe monitoring: add Nokia port [puppet] - 10https://gerrit.wikimedia.org/r/1297688 (owner: 10Ayounsi) [13:09:05] CI waiting for castor-save-workspace-cache, as usual… [13:09:14] (03CR) 10Scott French: [C:03+1] mesh.service: Add TLS service ports for restricted_listeners (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [13:11:33] PROBLEM - Host db1224 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:11:53] I was waiting for 15 minutes on my config change [13:11:56] !ack [13:11:57] 8059 (ACKED) Host db1224 (paged) [13:12:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1224', diff saved to https://phabricator.wikimedia.org/P93875 and previous config saved to /var/cache/conftool/dbconfig/20260604-131219-marostegui.json [13:12:22] just depooled it [13:12:34] oh thanks [13:12:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: down [13:12:47] marostegui: do you want me to do anything? [13:13:02] I have https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1281901 to deploy [13:13:08] Amir1: all good [13:13:14] marostegui: <3 [13:13:14] I will take a look after the meeting [13:13:15] Thanks <3 [13:13:18] 1224 again? 
[13:13:20] Dreamy_Jazz: I used to have one of those “skeleton in front of PC” images captioned “waiting for jenkins” bookmarked but I think that got deleted at some point [13:13:29] :D [13:13:34] codenamenoreste: please see my comment on the task [13:13:37] (03Merged) 10jenkins-bot: Update config for WikiProjects linking prototype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1295978 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [13:13:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1220: Migration of db1220.eqiad.wmnet completed [13:13:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [13:14:01] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1295978|Update config for WikiProjects linking prototype (T427804)]] [13:14:06] T427804: [WIPR] Allow for translations of Wikiproject names - https://phabricator.wikimedia.org/T427804 [13:14:06] [morpheus voice] at last [13:14:44] I'll have a config change to do after you if it's just one? [13:14:58] federico3: https://bash.toolforge.org/quip/rDHSeYMB6FQ6iqKiHR7m :P [13:15:33] would T424413 even be closed as invalid anyway? I was busy and attempted to log in here Dreamy Jazz said I did not show up at the window [13:15:34] T424413: Create new custom namespaces (Report, WN) on cdo.wikipedia.org - https://phabricator.wikimedia.org/T424413 [13:16:01] RECOVERY - Host db1224 #page is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [13:16:04] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, audreypenven: Backport for [[gerrit:1295978|Update config for WikiProjects linking prototype (T427804)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:16:57] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11985241 (10FCeratto-WMF) 05Resolved→03Open The host crashed again. [13:17:12] RECOVERY - Confd vcl based reload on cp6010 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:17:12] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:17:13] see Wikinews#Languages that use Wikipedia to serve their Wikinews on Meta [13:17:15] yerdua_wmde: please test :) I guess the msg part is untestable but the Special:MyLanguage URL can be tested [13:17:27] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11985243 (10FCeratto-WMF) a:05FCeratto-WMF→03None [13:17:35] Pppery agreed with the change but others objected [13:17:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1057.eqiad.wmnet with reason: host reimage [13:17:57] codenamenoreste: I personally wouldn’t close the task at this point, but if enough other deployers refuse to deploy, it probably makes sense at some point [13:18:13] (but if someone else does deploy the change I won’t block it or anything) [13:18:30] it's just that you, Dreamy Jazz, and Neriah all agree on not deploying the change [13:18:55] Lucas_WMDE: taking a look now [13:19:12] PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 3 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:19:12] PROBLEM - Confd vcl based reload on cp6013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:20:27] (03CR) 10Scott French: [C:03+1] "Thanks for explaining, and I get what you mean." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [13:20:51] (03CR) 10Ssingh: [C:03+1] cache: haproxy: enable_mlock 🚀codfw&drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1297655 (owner: 10CDanis) [13:21:13] 06SRE, 10SRE-swift-storage, 06Commons, 10media-backups, 10MediaWiki-File-management: Uncompressed TIFFs on commons - https://phabricator.wikimedia.org/T427949#11985249 (10Ladsgroup) >>! In T427949#11984904, @MatthewVernon wrote: > Wearing my WMF staff hat, I'd like to note that "we should not store uncom... [13:21:27] (03CR) 10Elukey: [C:03+2] role::cache::{text,upload}: enable webrequest tagging in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1297641 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:21:41] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297624 (https://phabricator.wikimedia.org/T428024) (owner: 10Elukey) [13:22:10] (03CR) 10Elukey: [C:03+2] Modify rules to build sphinx on Bookworm and Trixie [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297624 (https://phabricator.wikimedia.org/T428024) (owner: 10Elukey) [13:23:43] I don't see the change, but I think I'm testing wrong [13:23:48] * Lucas_WMDE looks [13:23:53] where are you testing? [13:24:05] (03PS1) 10Ayounsi: gNMI TLS monitoring: add pfw and remove virtual-chassis [puppet] - 10https://gerrit.wikimedia.org/r/1297691 [13:24:35] beta, and I know I'm supposed to do something with the WikimediaDebug extension... 
It's on, but I don't see useful options [13:24:39] yerdua_wmde: I see a Special:MyLanguage link at https://test.wikidata.org/wiki/Q42 with WikimediaDebug [13:24:47] it has to be Test Wikidata [13:24:51] (or any other production wiki) [13:24:52] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297691 (owner: 10Ayounsi) [13:24:58] Beta changes can’t be tested to WikimediaDebug [13:25:06] they just get the config change 0–10 minutes after it’s been merged [13:25:12] (they = beta wikis) [13:25:23] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1057.eqiad.wmnet with reason: host reimage [13:25:43] ahhh, ok. I tried test and got confused. But now it's behaving like I expect [13:25:59] alright, then let’s deploy \o/ [13:26:03] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, audreypenven: Continuing with deployment [13:26:28] Dreamy_Jazz: btw can you add your change to the window already, for visibility? 
:) [13:27:02] It's not on gerrit just yet :D [13:27:07] I'll add it shortly [13:27:12] *wags finger* [13:27:13] ok ^^ [13:28:06] (03PS1) 10Dreamy Jazz: hCaptcha: Provide always challenge sitekey for account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297692 (https://phabricator.wikimedia.org/T421041) [13:28:11] There we go :D [13:28:17] yay ^^ [13:28:26] (03PS1) 10Ozge: ml-services: editing-suggestions updating storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297693 (https://phabricator.wikimedia.org/T427794) [13:28:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297692 (https://phabricator.wikimedia.org/T421041) (owner: 10Dreamy Jazz) [13:31:14] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1295978|Update config for WikiProjects linking prototype (T427804)]] (duration: 17m 13s) [13:31:16] Dreamy_Jazz: over to you :) [13:31:16] (03PS1) 10Jgiannelos: tegola: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297695 [13:31:18] T427804: [WIPR] Allow for translations of Wikiproject names - https://phabricator.wikimedia.org/T427804 [13:31:23] yerdua_wmde: all done ^^ [13:31:29] Thanks [13:31:34] \o/ [13:31:42] thanks! 
[13:31:49] (03CR) 10Cathal Mooney: [C:03+1] gNMI TLS monitoring: add pfw and remove virtual-chassis [puppet] - 10https://gerrit.wikimedia.org/r/1297691 (owner: 10Ayounsi) [13:31:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297692 (https://phabricator.wikimedia.org/T421041) (owner: 10Dreamy Jazz) [13:32:18] (03PS1) 10Tiziano Fogli: slothslos/report2drive: move secrets under grafana role [labs/private] - 10https://gerrit.wikimedia.org/r/1297696 (https://phabricator.wikimedia.org/T425795) [13:32:36] (03CR) 10Ayounsi: [C:03+2] gNMI TLS monitoring: add pfw and remove virtual-chassis [puppet] - 10https://gerrit.wikimedia.org/r/1297691 (owner: 10Ayounsi) [13:32:41] (03CR) 10Tiziano Fogli: [C:03+2] slothslos/report2drive: move secrets under grafana role [labs/private] - 10https://gerrit.wikimedia.org/r/1297696 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [13:32:43] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] slothslos/report2drive: move secrets under grafana role [labs/private] - 10https://gerrit.wikimedia.org/r/1297696 (https://phabricator.wikimedia.org/T425795) (owner: 10Tiziano Fogli) [13:32:44] (03Merged) 10jenkins-bot: hCaptcha: Provide always challenge sitekey for account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297692 (https://phabricator.wikimedia.org/T421041) (owner: 10Dreamy Jazz) [13:33:10] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297692|hCaptcha: Provide always challenge sitekey for account creation (T421041)]] [13:33:14] T421041: hCaptcha challenge not shown when triggering AbuseFilter on Special:CreateAccount - https://phabricator.wikimedia.org/T421041 [13:34:33] (03PS1) 10Ayounsi: Alert after the gNMI TLS cert expired [alerts] - 10https://gerrit.wikimedia.org/r/1297698 [13:35:12] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297692|hCaptcha: Provide always 
challenge sitekey for account creation (T421041)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:36:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11985347 (10Marostegui) The HW logs are empty but this host has crashed again twice today. One when I was doing a reimage and a second one after it got to production @VRiley-WMF is there something we can... [13:36:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: down [13:36:35] (03PS2) 10Thiemo Kreuz (WMDE): Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [13:37:20] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] "Done in patchset 2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [13:37:26] (03CR) 10CI reject: [V:04-1] Global rollout - Sub-ref deployments to Group 0, Group 1 and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [13:37:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11985353 (10VRiley-WMF) a:03VRiley-WMF [13:37:54] (03PS1) 10Marostegui: db1224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1297699 (https://phabricator.wikimedia.org/T427535) [13:38:02] (03CR) 10Btullis: wdqs-backend: Deployment chart for the WDQS triple-store (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286374 (https://phabricator.wikimedia.org/T425007) (owner: 10Trueg) [13:38:06] !log dreamyjazz@deploy1003 dreamyjazz: Rolling back deployment [13:38:36] !log dreamyjazz@deploy1003 Finished scap 
sync-world: Backport for [[gerrit:1297692|hCaptcha: Provide always challenge sitekey for account creation (T421041)]] (duration: 05m 27s) [13:38:40] T421041: hCaptcha challenge not shown when triggering AbuseFilter on Special:CreateAccount - https://phabricator.wikimedia.org/T421041 [13:38:57] (03CR) 10Marostegui: "@cwilliams@wikimedia.org @fceratto@wikimedia.org fyi" [puppet] - 10https://gerrit.wikimedia.org/r/1297699 (https://phabricator.wikimedia.org/T427535) (owner: 10Marostegui) [13:39:01] (03CR) 10Marostegui: [C:03+2] db1224: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1297699 (https://phabricator.wikimedia.org/T427535) (owner: 10Marostegui) [13:39:06] (03PS1) 10Dreamy Jazz: Revert "hCaptcha: Provide always challenge sitekey for account creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297700 [13:39:16] (03CR) 10Dreamy Jazz: [C:03+2] Revert "hCaptcha: Provide always challenge sitekey for account creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297700 (owner: 10Dreamy Jazz) [13:39:19] (03CR) 10Xcollazo: [C:03+1] "LGTM, thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1297256 (owner: 10Creynolds) [13:39:28] Decided to not sync that patch, reverting it now [13:39:34] Beyond that I think the window is done [13:39:44] jouncebot: nowandnext [13:39:44] For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1300) [13:39:44] In 0 hour(s) and 50 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1430) [13:40:17] (03Merged) 10jenkins-bot: Revert "hCaptcha: Provide always challenge sitekey for account creation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297700 (owner: 10Dreamy Jazz) [13:40:57] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297700|Revert "hCaptcha: Provide always challenge sitekey for account creation"]] [13:41:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1057.eqiad.wmnet with OS trixie [13:42:59] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297700|Revert "hCaptcha: Provide always challenge sitekey for account creation"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:43:55] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.major-upgrade (exit_code=99) [13:44:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool es1057: repool after upgrade [13:44:11] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11985407 (10VRiley-WMF) I will gather logs of this server and submit them to Dell. For them to replace the chassis takes a lot of convincing, so I'll see what I can do. There are oth... [13:46:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11985428 (10Marostegui) >>! 
In T427535#11985407, @VRiley-WMF wrote: > I will gather logs of this server and submit them to Dell. For them to replace the chassis takes a lot of convin... [13:46:52] (03CR) 10Btullis: [C:03+2] dumps: Clarify download types and refresh HTML dumps references [puppet] - 10https://gerrit.wikimedia.org/r/1297256 (owner: 10Creynolds) [13:47:03] !log sukhe@cp6011:~$ sudo -i varnish-frontend-restart [13:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:22] (03CR) 10Btullis: [C:03+2] "Looks good to me too. Merging and deploying." [puppet] - 10https://gerrit.wikimedia.org/r/1297256 (owner: 10Creynolds) [13:48:50] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11985434 (10cmooney) Actually @BCornwall I'm hoping to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1297232 , the goal... [13:49:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Observability-Logging: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T427748#11985435 (10colewhite) 05In progress→03Resolved Raid is rebuilt. Thanks for the hardware and prompt response! [13:50:14] Dreamy_Jazz: https://spiderpig.wikimedia.org/jobs/2189 is still waiting for your input, I think? [13:50:26] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [13:50:31] Whoops yes it is [13:50:40] :) [13:50:46] (I was just checking if we can log that the window is done yet) [13:50:54] I had assumed scap would detect a config revert as a beta deploy [13:50:58] But seemingly not [13:51:26] I… guess it could, yeah 🤔 [13:51:29] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:51:43] unless someone is worried that the :latest version of the mediawiki image is a bad one (which the rebuild fixes) [13:51:50] Yeah, good point [13:51:55] * Lucas_WMDE doesn’t know if our images even have :latest versions [13:51:59] I guess running scap again is harmless so... [13:52:12] yeah [13:53:14] (03CR) 10Thiemo Kreuz (WMDE): "Oh no." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297681 (https://phabricator.wikimedia.org/T425662) (owner: 10Svantje Lilienthal) [13:53:19] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297655 (owner: 10CDanis) [13:53:34] (03PS2) 10Tiziano Fogli: Alert after the gNMI TLS cert expired [alerts] - 10https://gerrit.wikimedia.org/r/1297698 (owner: 10Ayounsi) [13:53:41] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1297700 touches CommonSettings, it is not beta-only [13:54:34] Sure, though I would have thought a revert could be detected as "no change to CommonSettings.php over two commits" [13:54:35] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297700|Revert "hCaptcha: Provide always challenge sitekey for account creation"]] (duration: 13m 38s) [13:54:49] Though I can see how that could be problematic if someone edits the revert commit to add something else [13:56:08] (03CR) 10CDanis: [C:03+2] cache: haproxy: enable_mlock 🚀codfw&drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1297655 (owner: 10CDanis) [13:56:08] !log Afternoon UTC backport window done [13:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:30] just in time ^^ [13:56:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es2050 to es3 codfw primary T428050', diff saved to https://phabricator.wikimedia.org/P93878 and previous config saved to /var/cache/conftool/dbconfig/20260604-135631-marostegui.json [13:56:32] <_joe_> !log transferred requestctl api tokens for all ops to the db (T428119) [13:56:34] :D [13:56:36] T428050: 
Migrate es3 section to Debian Trixie - https://phabricator.wikimedia.org/T428050 [13:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:39] T428119: Switch api token management to be db-backed - https://phabricator.wikimedia.org/T428119 [13:58:31] (03CR) 10Tiziano Fogli: [C:03+1] Alert after the gNMI TLS cert expired [alerts] - 10https://gerrit.wikimedia.org/r/1297698 (owner: 10Ayounsi) [13:59:05] (03CR) 10Ayounsi: [C:03+2] Alert after the gNMI TLS cert expired [alerts] - 10https://gerrit.wikimedia.org/r/1297698 (owner: 10Ayounsi) [13:59:51] jouncebot: nowandnext [13:59:51] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1300) [13:59:51] In 0 hour(s) and 30 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1430) [13:59:56] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl_client: remove the absenting of the old package [puppet] - 10https://gerrit.wikimedia.org/r/1297288 (owner: 10Giuseppe Lavagetto) [14:00:04] If no one minds, I have another thing to deploy :D [14:00:19] (03PS1) 10Dreamy Jazz: Use the globalblock-local-status right over globalblock-whitelist [extensions/GlobalBlocking] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297704 (https://phabricator.wikimedia.org/T277942) [14:00:31] (03CR) 10Elukey: [C:03+1] tegola: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297695 (owner: 10Jgiannelos) [14:01:23] (03Merged) 10jenkins-bot: Alert after the gNMI TLS cert expired [alerts] - 10https://gerrit.wikimedia.org/r/1297698 (owner: 10Ayounsi) [14:01:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/GlobalBlocking] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297704 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [14:01:31] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296620 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [14:01:49] (03CR) 10Jgiannelos: [C:03+2] tegola: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297695 (owner: 10Jgiannelos) [14:02:36] (03CR) 10AikoChou: [C:03+1] ml-services: editing-suggestions updating storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297693 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [14:04:05] (03Merged) 10jenkins-bot: tegola: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297695 (owner: 10Jgiannelos) [14:04:38] (03Merged) 10jenkins-bot: Use the globalblock-local-status right over globalblock-whitelist [extensions/GlobalBlocking] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297704 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [14:04:51] (03CR) 10Dreamy Jazz: [C:03+2] core-Permissions: Stop assigning unused globalblock-whitelist right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296620 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [14:04:59] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [14:05:01] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [14:05:23] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:05:48] (03Merged) 10jenkins-bot: core-Permissions: Stop assigning unused globalblock-whitelist right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296620 (https://phabricator.wikimedia.org/T277942) (owner: 10Dreamy Jazz) [14:06:05] !log bump space for prometheus k8s-aux in eqiad [14:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:10] !log jgiannelos@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/tegola-vector-tiles: apply [14:06:17] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297704|Use the globalblock-local-status right over globalblock-whitelist (T277942)]], [[gerrit:1296620|core-Permissions: Stop assigning unused globalblock-whitelist right (T277942)]] [14:06:20] T277942: Address Voice and Tone issues in GlobalBlocking - https://phabricator.wikimedia.org/T277942 [14:06:47] (03PS1) 10Cathal Mooney: Add variable to set QSFP+ port to 4x10G mode [homer/public] - 10https://gerrit.wikimedia.org/r/1297706 (https://phabricator.wikimedia.org/T427056) [14:06:57] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:07:33] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [14:07:50] (03CR) 10Ozge: [C:03+2] ml-services: editing-suggestions updating storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297693 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [14:08:21] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297704|Use the globalblock-local-status right over globalblock-whitelist (T277942)]], [[gerrit:1296620|core-Permissions: Stop assigning unused globalblock-whitelist right (T277942)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:08:53] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [14:09:26] (03CR) 10Ozge: [V:03+2 C:03+2] ml-services: editing-suggestions updating storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297693 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [14:09:50] (03Merged) 10jenkins-bot: ml-services: editing-suggestions updating storage uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297693 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [14:10:23] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:11:13] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage - https://phabricator.wikimedia.org/T428161#11985556 (10ayounsi) End of support was 2021 according to https://support.juniper.net/support/eol/product/m_series/ so I'd say we can recycle them. Also the end of support for the `MPC-3D-16XGE... [14:12:21] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage - https://phabricator.wikimedia.org/T428161#11985576 (10Jclark-ctr) 05Open→03Resolved Thank you for checking @ayounsi we will add them with the servers [14:12:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) (owner: 10Arlolra) [14:12:51] (03CR) 10Ayounsi: [C:03+1] Add variable to set QSFP+ port to 4x10G mode [homer/public] - 10https://gerrit.wikimedia.org/r/1297706 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [14:12:58] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage - https://phabricator.wikimedia.org/T428161#11985584 (10cmooney) >>! 
In T428161#11985556, @ayounsi wrote: > End of support was 2021 according to https://support.juniper.net/support/eol/product/m_series/ > so I'd say we can recycle the... [14:13:03] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 0 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:13:03] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297704|Use the globalblock-local-status right over globalblock-whitelist (T277942)]], [[gerrit:1296620|core-Permissions: Stop assigning unused globalblock-whitelist right (T277942)]] (duration: 06m 46s) [14:13:07] T277942: Address Voice and Tone issues in GlobalBlocking - https://phabricator.wikimedia.org/T277942 [14:13:11] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 0h, 0 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:13:11] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 0h, 0 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:13:11] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 0h, 0 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:13:11] PROBLEM - Confd vcl based reload on cp6010 is CRITICAL: reload-vcl failed to run since 0h, 0 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:13:40] (03CR) 10Scott French: [C:03+1] wikifunctions: Add mesh.restricted_listeners port to orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [14:13:44] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage - https://phabricator.wikimedia.org/T428161#11985591 (10ayounsi) 05Resolved→03Open Keeping the task open to track the removal of the `MPC-3D-16XGE-SFPP` linecards. 
[14:14:02] (03PS1) 10Btullis: Move the definition of wdqs-v2 namespaces to the dse-k8s common values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297708 (https://phabricator.wikimedia.org/T422522) [14:14:10] (03CR) 10Cathal Mooney: [C:03+2] Add variable to set QSFP+ port to 4x10G mode [homer/public] - 10https://gerrit.wikimedia.org/r/1297706 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [14:15:29] (03Merged) 10jenkins-bot: Add variable to set QSFP+ port to 4x10G mode [homer/public] - 10https://gerrit.wikimedia.org/r/1297706 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [14:15:36] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:15:40] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [14:15:45] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:16:11] RECOVERY - Confd vcl based reload on cp6013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:16:27] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [14:16:31] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:16:34] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [14:16:37] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [14:16:41] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [14:17:28] 10ops-eqiad, 06SRE, 06DC-Ops: document Old line cards in eqiad Storage. 
and removal of MPC-3D-16XGE-SFPP line cards from CR1 and CR2 - https://phabricator.wikimedia.org/T428161#11985615 (10Jclark-ctr) [14:19:44] (03PS1) 10Btullis: Enable higher resource limites for wdqs-* namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297709 (https://phabricator.wikimedia.org/T422522) [14:20:25] (03PS1) 10Gkyziridis: dns: Add liftwing-openapi-server CNAME records [dns] - 10https://gerrit.wikimedia.org/r/1297710 (https://phabricator.wikimedia.org/T427902) [14:20:26] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:22:22] FIRING: CertAlmostExpired: gNMI TLS certificate for lsw1-f1-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:22:46] (03PS2) 10Btullis: Enable higher resource limits for wdqs-* namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297709 (https://phabricator.wikimedia.org/T422522) [14:25:19] (03CR) 10Ladsgroup: [C:03+1] "Time is a social construct anyway" [puppet] - 10https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804) (owner: 10Zabe) [14:25:24] (03PS2) 10Zabe: maintain-views: Loosen views for filerevision table [puppet] - 10https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804) [14:25:31] (03CR) 10Ladsgroup: [C:03+2] maintain-views: Loosen views for filerevision table [puppet] - 10https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804) (owner: 10Zabe) [14:25:34] (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Loosen views for filerevision table [puppet] - 10https://gerrit.wikimedia.org/r/1296029 (https://phabricator.wikimedia.org/T426804) (owner: 10Zabe) [14:25:56] jouncebot: nowandnext [14:25:57] No 
deployments scheduled for the next 0 hour(s) and 4 minute(s) [14:25:57] In 0 hour(s) and 4 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1430) [14:25:59] (03PS1) 10Dreamy Jazz: hCaptcha: Enable for MobileFrontend in some Group 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297711 (https://phabricator.wikimedia.org/T425940) [14:26:13] Anyone mind me using scap? [14:26:25] (03CR) 10Brouberol: [C:03+1] Move the definition of wdqs-v2 namespaces to the dse-k8s common values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297708 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [14:26:48] (03CR) 10Brouberol: [C:03+1] Enable higher resource limits for wdqs-* namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297709 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [14:26:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297711 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [14:27:33] (03CR) 10Btullis: [C:03+2] Move the definition of wdqs-v2 namespaces to the dse-k8s common values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297708 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [14:27:45] (03Merged) 10jenkins-bot: hCaptcha: Enable for MobileFrontend in some Group 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297711 (https://phabricator.wikimedia.org/T425940) (owner: 10Dreamy Jazz) [14:28:11] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297711|hCaptcha: Enable for MobileFrontend in some Group 2 wikis (T425940)]] [14:28:15] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [14:28:26] (03CR) 10Clément Goubert: "small nits otherwise lgtm" [dns] - 
10https://gerrit.wikimedia.org/r/1297710 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [14:28:49] (03CR) 10Clément Goubert: [C:03+1] dns: Add liftwing-openapi-server CNAME records [dns] - 10https://gerrit.wikimedia.org/r/1297710 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [14:29:00] (03PS2) 10Audrey Penven: WikiProject links - remove 'text' config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297644 (https://phabricator.wikimedia.org/T427804) [14:29:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool es1057: repool after upgrade [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1430) [14:30:13] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297711|hCaptcha: Enable for MobileFrontend in some Group 2 wikis (T425940)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:32:23] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [14:33:30] (03CR) 10Btullis: [C:03+2] Enable higher resource limits for wdqs-* namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297709 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [14:35:31] (03Merged) 10jenkins-bot: Move the definition of wdqs-v2 namespaces to the dse-k8s common values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297708 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [14:36:23] (03CR) 10Ssingh: "Nice work! Leaving for Brett's review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1297621 (https://phabricator.wikimedia.org/T428098) (owner: 10Cathal Mooney) [14:36:31] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297711|hCaptcha: Enable for MobileFrontend in some Group 2 wikis (T425940)]] (duration: 08m 20s) [14:36:35] T425940: hCaptcha: Rollout of MobileFrontend and VisualEditor integrations - https://phabricator.wikimedia.org/T425940 [14:37:08] Dreamy_Jazz: are you done? :) [14:37:13] I am done [14:37:28] alright [14:37:57] (03CR) 10Zabe: [C:03+2] Start reading from new file tables on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [14:38:55] (03Merged) 10jenkins-bot: Start reading from new file tables on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [14:39:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1270513|Start reading from new file tables on commons (T416548)]] [14:39:50] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [14:40:08] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f1-codfw [14:40:15] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw [14:41:50] !log zabe@deploy1003 zabe: Backport for [[gerrit:1270513|Start reading from new file tables on commons (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:42:11] (03PS2) 10Gkyziridis: dns: Add liftwing-openapi-server CNAME records [dns] - 10https://gerrit.wikimedia.org/r/1297710 (https://phabricator.wikimedia.org/T427902) [14:42:22] RESOLVED: CertAlmostExpired: gNMI TLS certificate for lsw1-f1-codfw.mgmt.codfw.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:42:32] (03Merged) 10jenkins-bot: Enable higher resource limits for wdqs-* namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297709 (https://phabricator.wikimedia.org/T422522) (owner: 10Btullis) [14:43:09] !log zabe@deploy1003 zabe: Continuing with deployment [14:43:45] !log zabe@deploy1003 sync-world aborted: Backport for [[gerrit:1270513|Start reading from new file tables on commons (T416548)]] (duration: 03m 58s) [14:44:07] (currently staying at canaries, will give it some time to see potential errors) [14:45:28] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [14:47:03] (03PS1) 10Ozge: ml-services: editing-suggestions eqiad deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) [14:48:54] (03CR) 10C. Scott Ananian: [C:03+1] Deploy PRV to 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) (owner: 10Arlolra) [14:49:38] !log btullis@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:50:14] !log btullis@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[14:52:21] !log zabe@deploy1003 Started scap sync-world: T416548 [14:52:25] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [14:52:52] (03PS1) 10Zabe: Revert "Start reading from new file tables on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297724 [14:52:59] (03CR) 10Zabe: [C:03+2] Revert "Start reading from new file tables on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297724 (owner: 10Zabe) [14:53:15] (03CR) 10JMeybohm: [C:03+2] kafka-main2007: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288918 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [14:54:17] (03Merged) 10jenkins-bot: Revert "Start reading from new file tables on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297724 (owner: 10Zabe) [14:56:05] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS trixie [14:57:31] !log zabe@deploy1003 Finished scap sync-world: T416548 (duration: 05m 10s) [14:57:35] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [14:59:14] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1297724|Revert "Start reading from new file tables on commons"]] [15:00:05] dancy and jnuche: That opportune time for a Train log triage deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1500). [15:00:53] (03CR) 10AikoChou: [C:03+1] "LGTM! only two nits :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [15:01:06] Skipping train log triage meeting due to staff meeting. [15:01:28] !log zabe@deploy1003 zabe: Backport for [[gerrit:1297724|Revert "Start reading from new file tables on commons"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [15:02:01] !log zabe@deploy1003 zabe: Continuing with deployment [15:03:16] (03PS1) 10Zabe: Revert^2 "Start reading from new file tables on commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297727 (https://phabricator.wikimedia.org/T416548) [15:05:44] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [15:06:14] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297724|Revert "Start reading from new file tables on commons"]] (duration: 07m 00s) [15:06:25] (03PS1) 10Sbisson: ptwiki: Disable Article Guidance experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297730 (https://phabricator.wikimedia.org/T426871) [15:08:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:08:40] jouncebot now [15:08:40] For the next 0 hour(s) and 51 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1500) [15:09:01] (03CR) 10BCornwall: [C:03+1] "Great, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1297621 (https://phabricator.wikimedia.org/T428098) (owner: 10Cathal Mooney) [15:09:27] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:29] I'm not sure what "Train log triage" is... is there any deployment happening at the moment? 
[15:13:14] Train log triage is a meeting where we go through the mediawiki error logs for the week and file tickets as needed. [15:13:28] You're welcome to deploy now [15:13:35] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [15:13:42] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1297621 (https://phabricator.wikimedia.org/T428098) (owner: 10Cathal Mooney) [15:14:02] dancy thank you! I have an important config change to deploy [15:15:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297730 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [15:16:26] (03CR) 10BCornwall: [V:03+1 C:03+2] ha-proxy: add TIFF image files to list of extensions for low-prio qos [puppet] - 10https://gerrit.wikimedia.org/r/1297621 (https://phabricator.wikimedia.org/T428098) (owner: 10Cathal Mooney) [15:16:55] (03Merged) 10jenkins-bot: ptwiki: Disable Article Guidance experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297730 (https://phabricator.wikimedia.org/T426871) (owner: 10Sbisson) [15:17:21] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1297730|ptwiki: Disable Article Guidance experiment (T426871)]] [15:17:25] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [15:18:47] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v12.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297735 [15:19:28] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage [15:19:33] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1297730|ptwiki: Disable Article Guidance experiment (T426871)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:20:38] !log sbisson@deploy1003 sbisson: Continuing with deployment [15:24:20] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage [15:24:47] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297730|ptwiki: Disable Article Guidance experiment (T426871)]] (duration: 07m 26s) [15:24:51] T426871: Enable AG experiment on phase 2 wikis - https://phabricator.wikimedia.org/T426871 [15:26:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297644 (https://phabricator.wikimedia.org/T427804) (owner: 10Audrey Penven) [15:28:55] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [15:32:27] PROBLEM - SSH on logstash1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:33:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:33:31] FIRING: [2x] ProbeDown: Service logstash1032:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1032:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:04] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v12.7.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1297735 (owner: 
10Elukey) [15:35:59] ladsgroup@cumin1003 update-views (PID 1139459) is awaiting input [15:36:17] RECOVERY - SSH on logstash1032 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:37:03] (03PS1) 10Elukey: Upstream release v12.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297738 [15:38:20] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v12.7.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1297738 (owner: 10Elukey) [15:38:31] RESOLVED: [2x] ProbeDown: Service logstash1032:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1032:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:20] (03PS2) 10Jforrester: abstractwiki-rust: Fetch cargo-chef from its own vendored-sources repo [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) [15:39:52] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [15:41:32] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2007.codfw.wmnet with OS trixie [15:44:48] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp5030.* [15:45:47] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986028 (10BCornwall) @cmooney I've depooled cp5030. Have fun! 
[15:51:23] (03PS1) 10Dreamy Jazz: hCaptcha: Move ConfirmEditCaptchaClass hook inside hCaptcha block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297740 (https://phabricator.wikimedia.org/T428183) [15:51:26] jouncebot: nowandnext [15:51:27] For the next 0 hour(s) and 8 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1500) [15:51:27] In 0 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1600) [15:51:34] Want to deploy to fix beta [15:52:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297740 (https://phabricator.wikimedia.org/T428183) (owner: 10Dreamy Jazz) [15:53:02] (03Merged) 10jenkins-bot: hCaptcha: Move ConfirmEditCaptchaClass hook inside hCaptcha block [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297740 (https://phabricator.wikimedia.org/T428183) (owner: 10Dreamy Jazz) [15:53:30] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297740|hCaptcha: Move ConfirmEditCaptchaClass hook inside hCaptcha block (T428183)]] [15:53:34] T428183: Beta deployments are not working because $doesEditApiInterfaceSupportHCaptcha is undefined - https://phabricator.wikimedia.org/T428183 [15:54:11] (03PS1) 10Dreamy Jazz: hCaptcha: Update MF interface name for instrumentation [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297741 (https://phabricator.wikimedia.org/T428178) [15:54:20] (03PS1) 10Dreamy Jazz: hCaptcha: Update MF interface name for instrumentation [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297742 (https://phabricator.wikimedia.org/T428178) [15:55:35] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297740|hCaptcha: Move ConfirmEditCaptchaClass hook inside hCaptcha block (T428183)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:59:25] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [16:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1600). [16:00:05] urbanecm and MatmaRex: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:58] uhh hi [16:01:09] Sorry for my deploy going into this window [16:01:12] Was fixing beta [16:01:18] Should be done soon [16:02:21] i forgot i scheduled this and i don't really know how puppet deployments work [16:02:28] Oh I see [16:02:38] It gets deployed by an SRE who was pinged [16:02:46] You just need to be here and say it's fine to deploy :D [16:02:51] (and testing if it's relevant) [16:02:53] and i'm trying to follow the big meeting anyway. 
i can reschedule [16:03:03] it's fine to deploy if someone wants to do it right now though, heh [16:03:09] !log uploaded spicerack_12.7.0 to apt.wikimedia.org bookworm-wikimedia,trixie-wikimedia [16:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:13] We'll see what SRE says [16:03:20] If they are not here for the window, then :D [16:03:51] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297740|hCaptcha: Move ConfirmEditCaptchaClass hook inside hCaptcha block (T428183)]] (duration: 10m 21s) [16:03:55] T428183: Beta deployments are not working because $doesEditApiInterfaceSupportHCaptcha is undefined - https://phabricator.wikimedia.org/T428183 [16:04:44] o/ looking [16:05:56] Thanks, for MatmaRex's change I approve it from the PSI side (we should get these alerts instead) [16:06:06] rzl: im here too for my puppet patch :) [16:06:15] (03CR) 10Jasmine: [C:03+2] kafka-main2008: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1288919 (https://phabricator.wikimedia.org/T427088) (owner: 10Jasmine) [16:06:27] It should be a practical no-op (the script is currently disabled via a feature flag) [16:06:41] (03CR) 10Jforrester: "Done; should I also add some blurb in the README about how I've done it for Rust, or are we (rightly) hoping this image continues to be a " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1297234 (https://phabricator.wikimedia.org/T427990) (owner: 10Jforrester) [16:06:46] 06SRE, 06Infrastructure-Foundations: Build spicerack for Trixie - https://phabricator.wikimedia.org/T428024#11986124 (10elukey) I was able to build spicerack for Trixie. Side note is that python-release needs a fix: ` Traceback (most recent call last): File "/home/elukey/Wikimedia/spicerack/setup.py", line... 
[16:07:56] Dreamy_Jazz: perfect, thanks for anticipating my ownership question :) [16:08:24] urandom: and yeah lgtm, I was just double-checking the cron syntax (both of them, lol) but of course you're good [16:08:27] er, sorry [16:08:28] urbanecm: ^ [16:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:00] (03CR) 10RLazarus: [C:03+2] growthexperiments.pp: Run cleanMentorList every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1296519 (https://phabricator.wikimedia.org/T427386) (owner: 10Urbanecm) [16:09:11] (03CR) 10RLazarus: [C:03+2] 'purge_temporary_accounts' job is owned by PSI, not MWP [puppet] - 10https://gerrit.wikimedia.org/r/1296664 (owner: 10Bartosz Dziewoński) [16:09:40] rzl: always better to check :) [16:09:41] merging both -- urbanecm it sounds like you won't need a test run, just let it start (and not do anything) on schedule? 
[16:10:29] (also assume you know and don't mind that sometimes it'll run on the 31st and then the next day on the 1st) [16:11:50] rzl: yes and yes [16:11:52] (03CR) 10Ozge: [C:03+2] ml-services: editing-suggestions eqiad deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [16:12:29] (03CR) 10Ozge: ml-services: editing-suggestions eqiad deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [16:12:46] thanks rzl [16:13:08] jasmine@cumin2002 reimage (PID 4193800) is awaiting input [16:13:34] (03PS2) 10Ozge: ml-services: editing-suggestions eqiad deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) [16:13:51] and Dreamy_Jazz re the alert-routing question, yeah it's configured at https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/modules/alertmanager/templates/alertmanager.yml.erb#342 but everything for that team name goes to slack [16:13:56] (03CR) 10Ozge: [V:03+2 C:03+2] ml-services: editing-suggestions eqiad deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [16:14:05] 06SRE, 06Data-Persistence, 06DBA: Build wmfdb-admin for Trixie - https://phabricator.wikimedia.org/T427900#11986145 (10elukey) Judgding from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/wmfdb/+/refs/heads/master/debian/changelog (if it is the right one) we copied bullseye version over t... [16:14:46] Thanks! 
[16:15:01] 06SRE, 10Observability-Alerting: ATS backend errors for performance.discovery.wmnet should not page - https://phabricator.wikimedia.org/T425299#11986146 (10hnowlan) 05Open→03Resolved a:03hnowlan [16:16:04] (03Merged) 10jenkins-bot: ml-services: editing-suggestions eqiad deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297717 (https://phabricator.wikimedia.org/T427794) (owner: 10Ozge) [16:17:25] !log ozge@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:17:41] (03PS1) 10Jasmine: Revert "kafka-main2008: apply host-level override in advance of trixie upgrade [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1297746 [16:18:34] (03CR) 10JMeybohm: [C:03+1] Revert "kafka-main2008: apply host-level override in advance of trixie upgrade [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1297746 (owner: 10Jasmine) [16:19:17] (03CR) 10Jasmine: [C:03+2] Revert "kafka-main2008: apply host-level override in advance of trixie upgrade [0]" [puppet] - 10https://gerrit.wikimedia.org/r/1297746 (owner: 10Jasmine) [16:26:09] Anyone mind me using scap [16:28:39] (03PS1) 10Cathal Mooney: netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 [16:29:01] (03PS2) 10Cathal Mooney: netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 [16:29:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297741 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [16:29:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297742 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [16:30:54] (03Merged) 10jenkins-bot: hCaptcha: 
Update MF interface name for instrumentation [extensions/WikimediaEvents] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297741 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [16:30:56] (03Merged) 10jenkins-bot: hCaptcha: Update MF interface name for instrumentation [extensions/ConfirmEdit] (wmf/1.47.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1297742 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [16:30:59] (03CR) 10CI reject: [V:04-1] netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 (owner: 10Cathal Mooney) [16:31:06] (03PS3) 10Cathal Mooney: netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 [16:31:27] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297741|hCaptcha: Update MF interface name for instrumentation (T428178)]], [[gerrit:1297742|hCaptcha: Update MF interface name for instrumentation (T428178)]] [16:31:33] T428178: hCaptcha: Instrumentation for hCaptcha executions is missing the MobileFrontend interface - https://phabricator.wikimedia.org/T428178 [16:32:28] (03PS1) 10Ozge: ml-services: makes editing-suggestions eqiad publicly available [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297748 (https://phabricator.wikimedia.org/T427794) [16:33:13] (03PS2) 10Ozge: ml-services: makes editing-suggestions eqiad publicly available [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297748 (https://phabricator.wikimedia.org/T427794) [16:33:25] (03CR) 10CI reject: [V:04-1] netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 (owner: 10Cathal Mooney) [16:33:27] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297741|hCaptcha: Update MF interface name for instrumentation (T428178)]], [[gerrit:1297742|hCaptcha: Update MF interface name for instrumentation (T428178)]] synced to 
the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:33:56] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:09] Still testing.... [16:36:11] (03PS1) 10CDanis: cache: haproxy: enable_mlock globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1297749 [16:36:41] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297749 (owner: 10CDanis) [16:36:45] (03CR) 10CI reject: [V:04-1] cache: haproxy: enable_mlock globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1297749 (owner: 10CDanis) [16:38:39] (03PS2) 10CDanis: cache: haproxy: enable_mlock globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1297749 [16:39:12] (03CR) 10CI reject: [V:04-1] cache: haproxy: enable_mlock globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1297749 (owner: 10CDanis) [16:39:23] (03PS3) 10CDanis: cache: haproxy: enable_mlock globally 🚀🌍 [puppet] - 10https://gerrit.wikimedia.org/r/1297749 [16:39:42] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1297749 (owner: 10CDanis) [16:41:17] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [16:42:41] (03PS4) 10Cathal Mooney: netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 [16:44:23] (03CR) 10CI reject: [V:04-1] netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 (owner: 10Cathal Mooney) [16:45:24] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297741|hCaptcha: Update MF interface name for instrumentation (T428178)]], [[gerrit:1297742|hCaptcha: Update MF interface name for instrumentation (T428178)]] (duration: 13m 
58s) [16:45:29] T428178: hCaptcha: Instrumentation for hCaptcha executions is missing the MobileFrontend interface - https://phabricator.wikimedia.org/T428178 [16:46:05] Going to deploy again [16:46:11] jouncebot: nowandnext [16:46:11] For the next 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1600) [16:46:11] In 0 hour(s) and 13 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1700) [16:46:11] In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1700) [16:46:37] (03PS1) 10Dreamy Jazz: hCaptcha: Update MF interface name for instrumentation [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297752 (https://phabricator.wikimedia.org/T428178) [16:48:32] (03PS4) 10Dreamy Jazz: hCaptcha: Update MF interface name for instrumentation [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297751 (https://phabricator.wikimedia.org/T428178) [16:48:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11986259 (10VRiley-WMF) @Marostegui as it turns out while in the process of trying to submit a ticket into Dell, this server has lost its warranty as of February 1st of this year. I will continue to look... 
[16:54:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297751 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [16:54:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297752 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [16:55:54] !log shift traffic off cr1-esams et-1/0/1 link to asw1-by27-esams T427056 [16:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:36] (03PS5) 10Cathal Mooney: netops: change dashboard used on the InterfaceDropPercent alert [alerts] - 10https://gerrit.wikimedia.org/r/1297747 [17:00:04] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1700). 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1700) [17:08:31] 06SRE, 06Traffic, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#11986310 (10BCornwall) a:05ssingh→03BCornwall [17:10:28] (03Merged) 10jenkins-bot: hCaptcha: Update MF interface name for instrumentation [extensions/ConfirmEdit] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297751 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [17:10:30] (03Merged) 10jenkins-bot: hCaptcha: Update MF interface name for instrumentation [extensions/WikimediaEvents] (wmf/1.47.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1297752 (https://phabricator.wikimedia.org/T428178) (owner: 10Dreamy Jazz) [17:11:02] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1297751|hCaptcha: Update MF interface name for instrumentation (T428178)]], [[gerrit:1297752|hCaptcha: Update MF interface name for instrumentation (T428178)]] [17:11:08] T428178: hCaptcha: Instrumentation for hCaptcha executions is missing the MobileFrontend interface - https://phabricator.wikimedia.org/T428178 [17:13:06] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1297751|hCaptcha: Update MF interface name for instrumentation (T428178)]], [[gerrit:1297752|hCaptcha: Update MF interface name for instrumentation (T428178)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:13:29] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with deployment [17:16:25] (03CR) 10BBlack: [C:03+1] "LGTM, thank you!" 
[dns] - 10https://gerrit.wikimedia.org/r/1297210 (https://phabricator.wikimedia.org/T428093) (owner: 10BCornwall) [17:17:41] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1297751|hCaptcha: Update MF interface name for instrumentation (T428178)]], [[gerrit:1297752|hCaptcha: Update MF interface name for instrumentation (T428178)]] (duration: 06m 40s) [17:17:45] T428178: hCaptcha: Instrumentation for hCaptcha executions is missing the MobileFrontend interface - https://phabricator.wikimedia.org/T428178 [17:27:37] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:28:37] (03PS1) 10Cathal Mooney: cr1-esams: move ospf from et-1/0/1.342 to ae0.342 [homer/public] - 10https://gerrit.wikimedia.org/r/1297755 (https://phabricator.wikimedia.org/T427056) [17:32:39] (03CR) 10Cathal Mooney: [C:03+2] cr1-esams: move ospf from et-1/0/1.342 to ae0.342 [homer/public] - 10https://gerrit.wikimedia.org/r/1297755 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [17:34:00] (03Merged) 10jenkins-bot: cr1-esams: move ospf from et-1/0/1.342 to ae0.342 [homer/public] - 10https://gerrit.wikimedia.org/r/1297755 (https://phabricator.wikimedia.org/T427056) (owner: 10Cathal Mooney) [17:42:41] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:46:48] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11986421 (10VRiley-WMF) Hey @Dzahn thank you! Currently we can move forward with this. Currently where the servers sit right now may be the best spot for them unl... 
[17:52:37] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:52:45] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:02] 42 [17:53:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Moving switches to make space for the refreshed switches. - https://phabricator.wikimedia.org/T428195 (10VRiley-WMF) 03NEW [18:00:05] dancy and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T1800) [18:00:09] o/ [18:01:03] (03PS1) 10TrainBranchBot: group2 to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297759 (https://phabricator.wikimedia.org/T423914) [18:01:06] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297759 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:01:46] (03CR) 10Ssingh: [C:03+2] "This was a mistake on my part. I ran it on text and not upload. Thanks to @cdobbins@wikimedia.org for pointing this out." [puppet] - 10https://gerrit.wikimedia.org/r/1296654 (https://phabricator.wikimedia.org/T117618) (owner: 10SBassett) [18:02:06] (03Merged) 10jenkins-bot: group2 to 1.47.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297759 (https://phabricator.wikimedia.org/T423914) (owner: 10TrainBranchBot) [18:02:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11986474 (10Marostegui) >>! In T427535#11986259, @VRiley-WMF wrote: > @Marostegui as it turns out while in the process of trying to submit a ticket into dell, this server has lost its warrenty as of Febur... 
[18:08:40] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.47.0-wmf.5 refs T423914 [18:08:44] T423914: 1.47.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T423914 [18:25:53] (03PS1) 10Cathal Mooney: eqsin: remove OSPF on ae0 direct link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1297763 (https://phabricator.wikimedia.org/T424611) [18:30:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:30:13] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:31:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:31:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:31:29] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11986630 (10VRiley-WMF) @fgiunchedi for these servers cloudcephosd1048, cloudcephosd1049, cloudcephosd1050, cloudcephosd1051 would we be able to schedual a day and time for them? I am free anyt... 
[18:36:09] (03PS1) 10Cathal Mooney: eqsin - remove reverse ptr include for 2001:df2:e500:fe05::/64 [dns] - 10https://gerrit.wikimedia.org/r/1297764 (https://phabricator.wikimedia.org/T424611) [18:36:19] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:36:58] (03CR) 10CI reject: [V:04-1] eqsin - remove reverse ptr include for 2001:df2:e500:fe05::/64 [dns] - 10https://gerrit.wikimedia.org/r/1297764 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:37:06] !log sukhe@cp6013:~$ sudo traffic_server -C clear_cache [18:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:05] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove IPs that had been used for eqsin cr links - cmooney@cumin1003" [18:40:55] (03PS2) 10Cathal Mooney: eqsin - remove reverse ptr include for 2001:df2:e500:fe05::/64 [dns] - 10https://gerrit.wikimedia.org/r/1297764 (https://phabricator.wikimedia.org/T424611) [18:42:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove IPs that had been used for eqsin cr links - cmooney@cumin1003" [18:42:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:54] (03CR) 10Ssingh: [C:03+1] eqsin - remove reverse ptr include for 2001:df2:e500:fe05::/64 [dns] - 10https://gerrit.wikimedia.org/r/1297764 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:50:00] (03CR) 10Cathal Mooney: [C:03+2] eqsin - remove reverse ptr include for 2001:df2:e500:fe05::/64 [dns] - 10https://gerrit.wikimedia.org/r/1297764 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:50:16] !log cmooney@dns2005 START - running authdns-update [18:51:32] !log cmooney@dns2005 END - running authdns-update [18:52:42] (03PS4) 10CDobbins: varnish: Add CSP report-only header 
value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [18:58:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:ae0 (Core: cr2-eqsin:ae0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:01:32] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986698 (10cmooney) >>! In T427393#11986028, @BCornwall wrote: > @cmooney I've depooled cp5030. Have fun! Thanks! [19:03:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:ae0 (Core: cr2-eqsin:ae0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:05:05] (03PS5) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [19:08:07] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS trixie [19:08:24] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host cp5030.eqsin.wmnet with... 
[19:08:36] !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host cp5030 [19:09:10] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:09:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:15] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5030 - cmooney@cumin1003" [19:14:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cp5030 - cmooney@cumin1003" [19:14:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:14:17] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache cp5030.eqsin.wmnet 27.0.132.10.in-addr.arpa 7.2.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [19:14:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cp5030.eqsin.wmnet 27.0.132.10.in-addr.arpa 7.2.0.0.0.0.0.0.2.3.1.0.0.1.0.0.1.0.1.0.0.0.5.e.2.f.d.0.1.0.0.2.ip6.arpa on all recursors [19:14:21] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cp5030 [19:15:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5030 [19:15:07] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cp5030 [19:18:49] (03PS1) 10Cathal Mooney: cp5030: change IPs in hieradata to match its new ones [puppet] - 10https://gerrit.wikimedia.org/r/1297768 (https://phabricator.wikimedia.org/T427393) [19:19:48] (03PS6) 10CDobbins: varnish: Add CSP report-only header 
value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [19:25:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11986729 (10VRiley-WMF) I'm going to look into CPU2 as that's what some of the logs are pointing to. [19:26:03] (03PS7) 10CDobbins: varnish: Add CSP report-only header value [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) [19:28:03] (03CR) 10Ssingh: varnish: Add CSP report-only header value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:28:53] (03CR) 10Ssingh: [C:03+1] cp5030: change IPs in hieradata to match its new ones [puppet] - 10https://gerrit.wikimedia.org/r/1297768 (https://phabricator.wikimedia.org/T427393) (owner: 10Cathal Mooney) [19:29:37] (03CR) 10Cathal Mooney: [C:03+2] cp5030: change IPs in hieradata to match its new ones [puppet] - 10https://gerrit.wikimedia.org/r/1297768 (https://phabricator.wikimedia.org/T427393) (owner: 10Cathal Mooney) [19:33:00] (03PS1) 10CDobbins: trying out `alias` to get rid of redundancy [puppet] - 10https://gerrit.wikimedia.org/r/1297769 [19:43:49] !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [19:48:26] (03CR) 10JHathaway: redfish: improve add_account with AccountTypes (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1293593 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [19:49:52] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [19:55:58] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11986777 (10VRiley-WMF) @Marostegui I have swapped the CPUs in their sockets. 
I noticed this in the report Record: 58 Date/Time: 05/28/2026 15:06:07 Source: system Severity: Critical Descr... [19:56:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11986778 (10VRiley-WMF) db1224 should be back up, it is showing the login screen [19:58:40] (03CR) 10CDobbins: varnish: Add CSP report-only header value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:00:05] RoanKattouw, urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T2000). [20:00:05] arlolra and yerdua_wmde: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1224 is unreachable - https://phabricator.wikimedia.org/T427535#11986782 (10Marostegui) >>! In T427535#11986777, @VRiley-WMF wrote: > @Marostegui I have swapped the CPUs in their sockets. I noticed this in the report > > Record: 58 > Date/Time: 05/28/2026 15:0... 
[20:00:20] o/ [20:01:08] I'll get started [20:01:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) (owner: 10Arlolra) [20:01:39] (03PS1) 10Brouberol: kafka-ui: connect to all kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297775 (https://phabricator.wikimedia.org/T428053) [20:02:30] (03Merged) 10jenkins-bot: Deploy PRV to 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296015 (https://phabricator.wikimedia.org/T427851) (owner: 10Arlolra) [20:02:48] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1296015|Deploy PRV to 6 wikis (T427851)]] [20:02:51] T427851: Parsoid Read Views to deploy ~2026-06-04 - https://phabricator.wikimedia.org/T427851 [20:03:39] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: Repurpose ganeti102[3456] for Zuul migration - https://phabricator.wikimedia.org/T427353#11986788 (10Dzahn) @VRiley-WMF Feel free to proceed ! thank you [20:04:22] (03PS2) 10CDobbins: trying out `alias` to get rid of redundancy [puppet] - 10https://gerrit.wikimedia.org/r/1297769 [20:04:45] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1296015|Deploy PRV to 6 wikis (T427851)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:06:12] !log arlolra@deploy1003 arlolra: Continuing with deployment [20:08:20] !log Running `/usr/local/bin/foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep=1 --poll-sleep=10 --verbose` [20:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:37] (03PS3) 10CDobbins: trying out `alias` to get rid of redundancy [puppet] - 10https://gerrit.wikimedia.org/r/1297769 [20:10:27] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1296015|Deploy PRV to 6 wikis (T427851)]] (duration: 07m 39s) [20:10:31] T427851: Parsoid Read Views to deploy ~2026-06-04 - https://phabricator.wikimedia.org/T427851 [20:10:33] (03PS4) 10CDobbins: trying out `alias` to get rid of redundancy [puppet] - 10https://gerrit.wikimedia.org/r/1297769 [20:11:19] Doesn't look like yerdua_wmde is around so I guess the backport window is done [20:13:08] (03CR) 10CDobbins: varnish: Add CSP report-only header value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:15:38] (03PS6) 10Dreamy Jazz: hCaptcha: Don't show AbuseFilter CAPTCHA for wbsetclaim API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1296550 (https://phabricator.wikimedia.org/T427608) [20:15:40] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:16:05] (03CR) 10BCornwall: [C:03+1] dns: Add liftwing-openapi-server CNAME records [dns] - 10https://gerrit.wikimedia.org/r/1297710 (https://phabricator.wikimedia.org/T427902) (owner: 10Gkyziridis) [20:18:08] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5030.eqsin.wmnet with OS trixie [20:18:23] (03CR) 10BCornwall: [C:03+1] wmnet: Update x1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1297684 
(https://phabricator.wikimedia.org/T428158) (owner: 10Gerrit maintenance bot) [20:18:25] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986822 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host cp5030.eqsin.wmnet with OS t... [20:18:37] (03CR) 10BCornwall: [C:03+2] Remove digicert CAA records from most domains [dns] - 10https://gerrit.wikimedia.org/r/1297210 (https://phabricator.wikimedia.org/T428093) (owner: 10BCornwall) [20:18:40] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:19:01] !log brett@dns1006 START - running authdns-update [20:20:31] !log brett@dns1006 END - running authdns-update [20:21:32] (03CR) 10Cathal Mooney: "Reimage worked fine on cp5030 with `--move-vlan`." [cookbooks] - 10https://gerrit.wikimedia.org/r/1297232 (https://phabricator.wikimedia.org/T427393) (owner: 10Cathal Mooney) [20:22:32] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#11986831 (10Pppery) Hmm ... The timeline extension tries to store an additional `.map` file for timelines with wikilinks. That must be the thing that is failing to save properly. I... [20:22:41] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986833 (10cmooney) Ok @BCornwall the reimage seemed to work fine with the `--move-vlan` tag. I updated the IPs in hiera so I think y... 
[20:22:51] (03CR) 10Ssingh: varnish: Add CSP report-only header value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1297217 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [20:26:56] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp6013.drmrs.wmnet,service=(cdn|ats-be) [20:27:20] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#11986847 (10MarioProtIV) >>! In T428063#11986815, @GuardianOfArcadia wrote: > According to the diff and the HTML comment which was added by //OreoStar-fait// it seems to have somethin... [20:27:32] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1100.eqiad.wmnet,service=(cdn|ats-be) [20:29:22] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#11986848 (10Pppery) The first German one uses "#tag:timeline" and "till:{{#time: d/m/Y}}". This causes the timeline to be dynamic, and be considered "edited" every day. The third Germa... 
[20:40:39] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8658/co" [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: 10Slyngshede) [20:42:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host releases1003.eqiad.wmnet with OS trixie [20:45:10] (03CR) 10BCornwall: [V:03+1 C:03+2] P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: 10Slyngshede) [20:50:54] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5030.* [20:53:20] (03CR) 10BCornwall: [V:03+1] "Will deploy on monday" [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: 10Slyngshede) [20:53:30] (03CR) 10BCornwall: [V:03+1 C:03+1] P:cache:haproxy add image generator information [puppet] - 10https://gerrit.wikimedia.org/r/1295921 (https://phabricator.wikimedia.org/T414338) (owner: 10Slyngshede) [20:56:18] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11986915 (10BCornwall) We discussed this and the general consensus seemed to be to just decomm the server and wait for the refresh which is happening shortly anyway. @ssingh Is that accurate, and if s... [20:58:02] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage [20:58:31] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986930 (10BCornwall) Indeed, cp5030 is doing well, thanks! For the remainder of instances, should there be a separate task (or some e... 
[21:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260604T2100) [21:02:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:04:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage [21:28:34] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases1003.eqiad.wmnet with OS trixie [22:02:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:06:20] (03CR) 10RLazarus: [C:03+2] scaffold: Bump mesh.service version from 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297258 (owner: 10RLazarus) [22:08:23] (03Merged) 10jenkins-bot: scaffold: Bump mesh.service version from 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1297258 (owner: 10RLazarus) [22:09:23] (03CR) 10RLazarus: [C:03+2] Copy mesh.configuration 1.15.2 -> 1.15.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296064 (owner: 10RLazarus) [22:09:56] (03CR) 10RLazarus: [C:03+2] mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:12:32] (03PS5) 10RLazarus: Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296063 [22:12:32] (03PS5) 10RLazarus: mesh.networkpolicy: Add ingress ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) [22:12:32] (03PS6) 10RLazarus: 
mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) [22:17:10] (03PS7) 10RLazarus: mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) [22:19:27] (03CR) 10RLazarus: [C:03+2] mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:21:28] (03Merged) 10jenkins-bot: mesh.configuration: Add restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296067 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:23:23] (03CR) 10RLazarus: [C:03+2] Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296063 (owner: 10RLazarus) [22:25:22] (03Merged) 10jenkins-bot: Copy mesh.networkpolicy 1.2.1 -> 1.2.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296063 (owner: 10RLazarus) [22:34:05] (03PS6) 10RLazarus: mesh.networkpolicy: Add ingress ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) [22:36:35] (03CR) 10RLazarus: [C:03+2] mesh.networkpolicy: Add ingress ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:38:35] (03Merged) 10jenkins-bot: mesh.networkpolicy: Add ingress ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296065 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:39:31] (03PS3) 10RLazarus: Copy mesh.service 1.2.0 -> 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296066 [22:39:31] (03PS6) 10RLazarus: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) [22:42:17] (03CR) 10RLazarus: [C:03+2] Copy mesh.service 1.2.0 -> 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296066 (owner: 10RLazarus) [22:44:20] (03Merged) 10jenkins-bot: Copy mesh.service 1.2.0 -> 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296066 (owner: 10RLazarus) [22:45:09] (03CR) 10RLazarus: [C:03+2] mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:47:12] (03Merged) 10jenkins-bot: mesh.service: Add TLS service ports for restricted_listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296068 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:49:29] (03PS4) 10Jasmine: k8s: add wikikube-worker2331 [puppet] - 10https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) [22:49:29] (03CR) 10Jasmine: "Rebased following [0], Modified regex" [puppet] - 10https://gerrit.wikimedia.org/r/1289022 (https://phabricator.wikimedia.org/T426688) (owner: 10Jasmine) [22:50:28] (03PS5) 10RLazarus: function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) [22:50:28] (03PS5) 10RLazarus: wikifunctions: Add mesh.restricted_listeners port to orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863) [22:50:28] (03PS5) 10RLazarus: function-evaluator: Add outgoing Envoy config and egress policy for callbacks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863) [22:57:17] (03CR) 10RLazarus: [C:03+2] function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 
(https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [22:59:42] (03Merged) 10jenkins-bot: function-{evaluator,orchestrator}: sextant update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296069 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [23:05:01] (03CR) 10RLazarus: "James or David: This is ready to go whenever the callback handler on the orchestrator is up and ready -- just let me know when. (And if it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296071 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [23:09:29] (03CR) 10RLazarus: "James or David: ... and *this* is ready to go any time after I4a2007d2; after we merge this, you can switch on the callbacks on the evalua" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1296072 (https://phabricator.wikimedia.org/T427863) (owner: 10RLazarus) [23:09:41] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:17] (03CR) 10Jasmine: [C:03+2] kafka-main1006: apply host-level override in advance of trixie upgrade [0] [puppet] - 10https://gerrit.wikimedia.org/r/1285474 (https://phabricator.wikimedia.org/T419216) (owner: 10Jasmine) [23:18:22] 06SRE: Rework ACLs on Kafka 3.x clusters - https://phabricator.wikimedia.org/T425528#11987201 (10colewhite) >>! In T425528#11981310, @elukey wrote: > @colewhite @tappof @andrea.denisse Hi! I have to add some ACLs to both Kafka logging clusters, I am going to add some rationale here and you can tell me what you t... 
[23:20:04] !log jasmine@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS trixie [23:32:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [23:36:15] !log jasmine@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage [23:40:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1297788 [23:40:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1297788 (owner: 10TrainBranchBot) [23:40:26] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1006.eqiad.wmnet with reason: host reimage [23:52:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [23:52:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1297788 (owner: 10TrainBranchBot) [23:52:55] jouncebot: nowandnext [23:52:55] No deployments scheduled for the next 6 hour(s) and 7 minute(s) [23:52:55] In 6 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260605T0600) [23:53:25] (03CR) 10TrainBranchBot: 
[C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297268 (https://phabricator.wikimedia.org/T427126) (owner: 10Pppery) [23:54:21] (03Merged) 10jenkins-bot: Redirect unknown wikinews languages to portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1297268 (https://phabricator.wikimedia.org/T427126) (owner: 10Pppery) [23:54:59] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1297268|Redirect unknown wikinews languages to portal (T427126)]] [23:55:03] T427126: Cleanup wikinews portal/incubator handling - https://phabricator.wikimedia.org/T427126 [23:56:58] !log ladsgroup@deploy1003 ladsgroup, pppery: Backport for [[gerrit:1297268|Redirect unknown wikinews languages to portal (T427126)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:57:45] !log jasmine@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1006.eqiad.wmnet with OS trixie [23:57:46] !log ladsgroup@deploy1003 ladsgroup, pppery: Continuing with deployment