[00:05:25] FIRING: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:37] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152445
[00:08:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152445 (owner: TrainBranchBot)
[00:10:49] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:28:35] FIRING: [2x] ProbeDown: Service logstash1025:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:27] RESOLVED: [2x] ProbeDown: Service logstash1025:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:30:34] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152445 (owner: TrainBranchBot)
[00:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:01:49] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:05:25] FIRING: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:55] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[04:14:55] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[04:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:58:49] ops-codfw, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10874139 (Marostegui) Thank you!
[05:00:50] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10874154 (Marostegui)
[05:01:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036 T395771', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
[05:01:53] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[05:02:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
[05:06:46] (PS1) Marostegui: mariadb: Productionize es2047 into es6 [puppet] - https://gerrit.wikimedia.org/r/1152451 (https://phabricator.wikimedia.org/T395771)
[05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:08:39] (CR) Marostegui: [C:+2] mariadb: Productionize es2047 into es6 [puppet] - https://gerrit.wikimedia.org/r/1152451 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[05:14:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1003, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:15:06] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of es2036.codfw.wmnet onto es2047.codfw.wmnet
[05:16:38] (PS1) Marostegui: instances.yaml: Add es2047 [puppet] - https://gerrit.wikimedia.org/r/1152452 (https://phabricator.wikimedia.org/T395771)
[05:17:47] (CR) Marostegui: [C:+2] instances.yaml: Add es2047 [puppet] - https://gerrit.wikimedia.org/r/1152452 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[05:19:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2047 to dbctl depooled T395771', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json
[05:20:04] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[05:38:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:39:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169 T395663', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json
[05:39:08] T395663: MariaDB 10.11.13 released - https://phabricator.wikimedia.org/T395663
[05:53:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json
[06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json
[06:11:36] (PS4) Phuedx: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:14:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:15:56] (PS5) Phuedx: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:16:29] (PS6) Phuedx: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:17:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:18:37] o/ I will be ~5 minutes late for the morning backport window but I'll be here :)
[06:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json
[06:34:47] (CR) Muehlenhoff: [C:+2] Update canary [puppet] - https://gerrit.wikimedia.org/r/1152188 (owner: Muehlenhoff)
[06:38:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json
[06:48:44] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet
[06:52:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet
[06:53:00] (PS1) KartikMistry: Enable the Contribute menu (6th group) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930)
[06:53:20] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[06:53:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to /var/cache/conftool/dbconfig/20250602-065331-root.json
[06:53:48] (CR) CI reject: [V:-1] Enable the Contribute menu (6th group) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: KartikMistry)
[06:57:02] (PS1) Slyngshede: data.yaml: pwaigi1- offboarding [puppet] - https://gerrit.wikimedia.org/r/1152559
[06:59:07] (CR) Muehlenhoff: data.yaml: pwaigi1- offboarding (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede)
[06:59:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org
[07:00:04] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T0700).
[07:00:04] phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:29] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org
[07:02:29] (PS2) Slyngshede: data.yaml: pwaigi1- offboarding [puppet] - https://gerrit.wikimedia.org/r/1152559
[07:02:43] (CR) Slyngshede: data.yaml: pwaigi1- offboarding (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede)
[07:05:34] (PS1) Marostegui: es1040: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1152560 (https://phabricator.wikimedia.org/T395647)
[07:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1040 T395647', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json
[07:06:05] T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647
[07:06:22] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede)
[07:08:31] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org
[07:08:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json
[07:08:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance
[07:09:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance
[07:12:13] o/ Here now
[07:12:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:12:41] ^ es7 issues
[07:12:46] RECOVERY - MariaDB memory on es1035 is OK: OK Memory 3% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:13:22] moritzm: Is it OK to deploy a Beta Cluster-only config change with that amount of ongoing errors or should I hold off?
[07:13:27] Sorry
[07:13:35] marostegui: ^^
[07:13:44] phuedx: yes
[07:13:56] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[07:13:59] Yes it's OK or yes to hold off? :D
[07:14:05] phuedx: you can proceed
[07:14:08] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org
[07:14:10] marostegui: ty ty
[07:15:50] Amir1, urandom, awight: OK for me to deploy a config change?
[07:15:56] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[07:17:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:20:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org
[07:21:16] Alright. It's been ~5 minutes. No deployers appear to be around/awake at the moment. Continuing
[07:21:33] (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[07:21:53] jmm@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[07:22:22] (Merged) jenkins-bot: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[07:22:40] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1152253|Beta Cluster: Support A/B experiments (T393918)]]
[07:22:42] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918
[07:22:44] (CR) Marostegui: [C:+2] es1040: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1152560 (https://phabricator.wikimedia.org/T395647) (owner: Marostegui)
[07:27:45] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[07:32:41] Waiting on build-and-push-container-images. Looking at the log, it's running but taking time
[07:34:07] phuedx: the first deploy of the week is doing a full rebuild of the mediawiki images
[07:34:27] hashar: TIL! Thanks for the clarification
[07:35:03] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874333 (ayounsi)
[07:35:14] that is because the base image is automatically rebuilt over the weekend, which in turn invalidates all the docker caching layers
[07:35:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json
[07:36:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet
[07:36:52] SRE-swift-storage, Commons, MediaWiki-Uploading: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10874334 (Aklapper)
[07:38:20] (CR) Ayounsi: [C:+1] EVPN_BGP: add peer-as to conf to match unicast and remove auto on bfd [homer/public] - https://gerrit.wikimedia.org/r/1152258 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[07:38:37] !log phuedx@deploy1003 phuedx, dr0ptp4kt: Backport for [[gerrit:1152253|Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:38:39] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918
[07:40:25] FIRING: [2x] SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:20] OK. I've verified that the stream config setting appears for the product_metrics.web_base stream on beta metawiki but not beta enwiki and not on any production wikis
[07:43:00] I've verified that the MetricsPlatform extension is still loaded in both the labs and production realms
[07:43:46] I've verified on beta enwiki and beta metawiki that there is an active logged-in experiment but I'm not in sample
[07:43:58] But that the ext.xLab RL module is still loaded
[07:44:11] Just checking that the above does not happen in the production realm
[07:47:59] Yup. The experiment is not running on enwiki or metawiki
[07:48:03] No errors in the console
[07:48:06] Continuing
[07:49:11] !log phuedx@deploy1003 phuedx, dr0ptp4kt: Continuing with sync
[07:50:25] FIRING: [2x] SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:30] Emperor: good morning fellow oncall!
[07:50:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json
[07:53:29] hi
[07:55:24] (CR) JMeybohm: [C:+1] "Cool. Let's go for it then 😊" [deployment-charts] - https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[07:58:38] (CR) JMeybohm: [C:-1] validating-admission-policies: add policy to permit hostPath mounts for mediawiki (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: Effie Mouzeli)
[07:58:40] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152253|Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s)
[07:58:43] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918
[07:59:25] I will continue to poke at the Beta Cluster for a while longer :)
[08:05:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json
[08:11:22] (PS1) Slyngshede: Permission management [software/bitu] - https://gerrit.wikimedia.org/r/1152635
[08:11:36] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm
[08:11:46] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874389 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm
[08:12:59] (PS1) Vgutierrez: varnish: Set wmfuniq experiment reload period to 30s [puppet] - https://gerrit.wikimedia.org/r/1152636 (https://phabricator.wikimedia.org/T391411)
[08:19:48] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874400 (ayounsi)
[08:19:49] (PS3) Majavah: P:toolforge::prometheus: Add components-api scrape endpoint [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: Raymond Ndibe)
[08:20:06] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152636 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[08:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json
[08:22:06] (CR) Majavah: [C:+2] P:toolforge::prometheus: Add components-api scrape endpoint [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: Raymond Ndibe)
[08:23:09] (PS1) Ayounsi: Add magru virtual IPs to network::subnets [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263)
[08:23:30] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263) (owner: Ayounsi)
[08:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:26:01] (CR) Muehlenhoff: [C:+1] "Looks good. The PCC failure for Puppet 5 is expected, since the manifests on install* use Puppet syntax from Puppet 7." [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263) (owner: Ayounsi)
[08:26:29] (CR) Ayounsi: [C:+2] Add magru virtual IPs to network::subnets [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263) (owner: Ayounsi)
[08:28:08] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2012 MB (3% inode=93%): /tmp 2012 MB (3% inode=93%): /var/tmp 2012 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[08:33:50] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet
[08:35:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:36:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to /var/cache/conftool/dbconfig/20250602-083559-root.json
[08:37:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:59] (PS5) Majavah: O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446)
[08:40:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:41:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:42:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:42:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:42:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet
[08:42:50] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874428 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir7003.magru.wmnet` - ncredir7003.magru.wmnet (**WARN**) - //Host not found...
[08:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:44:36] (CR) Vgutierrez: [C:+2] varnish: Set wmfuniq experiment reload period to 30s [puppet] - https://gerrit.wikimedia.org/r/1152636 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[08:45:08] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm
[08:45:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:45:22] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874450 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm executed with errors: - ncredir7003 (**FA...
[08:46:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:46:35] topranks: the CoreRouterInterfaceDown alert above is you?
[08:47:25] XioNoX: no public holiday here I’m not doing anything
[08:47:28] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet
[08:47:29] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:47:40] Oh sorry
[08:47:53] my bad yeah that link I enabled last week
[08:48:06] maybe a downtime that expired?
[08:48:07] Jenn is gonna look at it today, it didn’t come up
[08:48:19] anyway, will ack it for 24h
[08:48:20] I didn’t add the BGP yet but enabled it in netbox
[08:48:26] Thanks sry
[08:48:36] no pb at all!
[08:49:00] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:51:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json
[08:51:08] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:51:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:51:12] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:51:12] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7003.magru.wmnet on all recursors
[08:51:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:51:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors
[08:51:42] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:51:47] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:53:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:54:48] jmm@cumin1003 makevm (PID 31285) is awaiting input
[08:58:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:58:21] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet
[08:58:31] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm
[08:58:41] SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874510 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm
[08:59:11] ncredir7003? 🍿
[09:00:20] for https://phabricator.wikimedia.org/T394263
[09:00:46] I'm installing these initially with insetup, the actual service setup will be passed over in a separate task
[09:02:19] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet
[09:04:25] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10874556 (MatthewVernon) I think the "check the file is in a consistent (p...
[09:09:27] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [09:10:51] !log update gitlab-settings artifact retention to 6 month - T395014 [09:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:53] T395014: Check GitLab artifact retention time - https://phabricator.wikimedia.org/T395014 [09:13:23] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [09:13:56] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [09:14:54] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [09:20:13] (03CR) 10Majavah: [C:03+2] O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) (owner: 10Majavah) [09:20:37] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 10UploadWizard: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10874584 (10MatthewVernon) I've checked these objects in swift, and they are both present and co... [09:22:06] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [09:24:53] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage [09:25:46] (03PS1) 10Bartosz Wójtowicz: ml-services: Update docker images for production deployments. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) [09:27:44] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [09:28:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage [09:32:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:33:10] (03CR) 10Jcrespo: [C:03+2] "Thank you so much for handling this. This helps the dashboard being clean." [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [09:33:18] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [09:36:47] (03CR) 10Jcrespo: [C:03+2] "As an additional info, in case it helps, running `check_bacula.py` or `check_bacula.py ` at the bacula director host (it is a pyt" [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [09:40:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org [09:42:28] FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:44:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2039 T395647', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to /var/cache/conftool/dbconfig/20250602-094402-marostegui.json [09:44:09] T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647 [09:45:07] (03PS1) 10Marostegui: es2039: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152644 
(https://phabricator.wikimedia.org/T395647) [09:45:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance [09:45:58] (03CR) 10Marostegui: [C:03+2] es2039: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152644 (https://phabricator.wikimedia.org/T395647) (owner: 10Marostegui) [09:47:01] (03CR) 10Gmodena: [C:03+1] jobqueue: Set the host header in all jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [09:47:28] RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:49:08] (03CR) 10Gmodena: [C:03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [09:54:36] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10874669 (10ayounsi) That makes sense to me! +1 on removing the macros. 
[09:55:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json [09:55:42] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1152645 (https://phabricator.wikimedia.org/T395785) [09:55:52] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [09:59:00] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm [09:59:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet [09:59:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874692 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm completed: - ncredir7003 (**PASS**) - R... [09:59:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [10:00:07] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1000) [10:00:29] (03CR) 10Bartosz Wójtowicz: "Confirming that I verified all images and tags exist in our registry." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:02:05] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet [10:02:06] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [10:02:18] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning [10:02:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:18] (03PS1) 10Vgutierrez: varnish: Start using edge uniques config fetched from xlabs endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1152648 (https://phabricator.wikimedia.org/T391411) [10:05:08] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152648 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:05:15] (03PS1) 10Federico Ceratto: icinga: add Federico Ceratto to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1152647 [10:06:01] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" [10:06:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" [10:06:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:22] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors [10:06:25] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors [10:06:47] !log jmm@cumin1003 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" [10:06:52] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" [10:07:22] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm [10:07:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host durum7003.magru.wmnet with OS bookworm [10:08:56] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [10:10:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json [10:11:22] (03PS1) 10Clément Goubert: Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649 [10:12:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:43] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [10:15:31] (03PS2) 10Clément Goubert: Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649 [10:18:03] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [10:19:06] (03CR) 10Jcrespo: [C:03+1] icinga: add 
Federico Ceratto to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1152647 (owner: 10Federico Ceratto) [10:19:12] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:00] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet [10:25:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json [10:25:37] (03PS1) 10Marostegui: es2047: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1152652 (https://phabricator.wikimedia.org/T395771) [10:26:43] (03CR) 10Marostegui: [C:03+2] es2047: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1152652 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui) [10:27:08] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [10:28:42] (03CR) 10Klausman: ml-services: Update docker images for production deployments. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:29:12] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:11] (03CR) 10Klausman: [C:03+1] ml-services: Update docker images for production deployments. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:30:22] 07Puppet, 06DBA: labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846#10874763 (10jcrespo) Ignore my suggestion, sorry, it was done already. Current status: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/9ae1d81b4ae21559f66b6e6cd283d642814ac4cf/module... [10:30:53] (03CR) 10Federico Ceratto: [C:03+2] icinga: add Federico Ceratto to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1152647 (owner: 10Federico Ceratto) [10:31:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [10:32:37] (03CR) 10Gkyziridis: "LGTM! Please update commit message with the related changed to the model that you upload for articlequality/language-agnostic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:32:48] (03CR) 10Gkyziridis: [C:03+1] ml-services: Update docker images for production deployments. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:33:22] (03CR) 10Kamila Součková: [C:03+2] aux-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [10:34:10] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [10:35:09] (03Merged) 10jenkins-bot: aux-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [10:36:02] (03CR) 10Vgutierrez: [C:03+2] varnish: Start using edge uniques config fetched from xlabs endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1152648 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:36:59] (03PS2) 10Bartosz Wójtowicz: ml-services: Update docker images for production deployments and update AQLA model files. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) [10:37:48] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [10:40:23] !log kamila@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [10:40:32] !log kamila@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [10:40:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json [10:40:41] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply [10:41:02] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply [10:41:16] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update docker images for production deployments and update AQLA model files. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:41:44] (03CR) 10Kamila Součková: [C:03+2] "Done, and indeed nothing seems broken right now :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [10:44:00] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:44:20] (03CR) 10Kamila Součková: [C:03+1] Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649 (owner: 10Clément Goubert) [10:44:47] (03Merged) 10jenkins-bot: ml-services: Update docker images for production deployments and update AQLA model files. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [10:47:00] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:48:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning [10:54:19] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet [10:55:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json [10:58:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ldap-maint2001.codfw.wmnet [10:58:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm [10:58:49] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet [10:58:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host durum7003.magru.wmnet with OS bookworm completed: - durum7003 (**PASS**) - Remov... [11:10:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json [11:11:38] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org [11:11:40] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [11:14:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [11:15:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:15:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json [11:16:35] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" [11:17:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
doh7003.wikimedia.org - jmm@cumin1003" [11:17:02] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:17:02] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors [11:17:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors [11:17:27] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" [11:17:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" [11:18:36] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm [11:18:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host doh7003.wikimedia.org with OS bookworm [11:19:49] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet [11:23:06] jouncebot: nowandnext [11:23:06] No deployments scheduled for the next 1 hour(s) and 36 minute(s) [11:23:06] In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1300) [11:23:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet [11:23:52] (03CR) 10Clément Goubert: [C:03+2] Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649 (owner: 10Clément Goubert) [11:24:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json [11:30:49] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:31:34] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:32:29] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 10UploadWizard: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10875036 (10Lvova) > So Басков_переулок_19_СПб_02.jpg looks OK to me (I might have missed someth... [11:32:47] !log Manual run of cronjobs/generatecaptcha on k8s - T388531 [11:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:49] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [11:33:27] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [11:35:21] Reedy: i killed the pod and will have to rerun the script on mwmaint [11:36:58] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5738/console" [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy) [11:37:00] (03PS1) 10Clément Goubert: Revert^3 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152659 [11:37:08] (03CR) 10Clément Goubert: [V:03+2 C:03+2] Revert^3 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152659 (owner: 10Clément Goubert) [11:37:15] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [11:38:35] (03CR) 10Jelto: [V:03+1 C:03+2] profile::gitlab::runner: Resolve namservers to IPs [puppet] - 10https://gerrit.wikimedia.org/r/1152280 
(https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy) [11:39:52] !log cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - T388531 [11:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:54] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [11:40:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json [11:41:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:42:57] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [11:44:04] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10875073 (10MoritzMuehlenhoff) [11:44:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:45:33] Hmmm I'm getting a storage backend error Reedy, maybe Emperor too [11:45:35] An unknown error occurred in storage backend "global-swift-eqiad". 
[11:46:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:46:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [11:47:27] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [11:47:47] manual page, I fucked up [11:47:56] https://auth.wikimedia.org/enwiki/wiki/Special:CreateAccount [11:48:10] I broke account creation [11:48:32] Emperor: XioNoX ^^ [11:48:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning [11:49:01] claime: need help or it's a fyi? [11:49:04] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage [11:49:19] I think I need help from someone who knows swift [11:49:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:34] trying to find which container may need to be cleaned up [11:49:52] Then Emperor is probably a safe bet [11:50:25] FIRING: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:50:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:51:10] hmmm [11:51:27] darn it, I was having lunch [11:51:29] * Emperor here [11:51:42] marostegui@cumin1002 clone (PID 2353986) is awaiting input [11:51:45] claime: what's up? [11:51:58] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage [11:52:13] Emperor: -security [11:52:16] ack [11:52:31] Emperor: the page doesn't seem related to claime, I'm having a look at it [11:52:48] XioNoX: ack [11:54:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:26] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet [11:55:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json [11:55:20] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [11:57:11] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [11:59:09] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[12:00:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet [12:00:25] RESOLVED: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:47] (03CR) 10Andrew Bogott: [C:03+2] preseed: cloudcontrol2010-dev will have a 4-disk sw raid. [puppet] - 10https://gerrit.wikimedia.org/r/1152390 (https://phabricator.wikimedia.org/T393102) (owner: 10Andrew Bogott) [12:05:50] (03CR) 10Andrew Bogott: [C:03+2] Add site.pp entries for new ceph osds [puppet] - 10https://gerrit.wikimedia.org/r/1152439 (https://phabricator.wikimedia.org/T394333) (owner: 10Andrew Bogott) [12:06:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:07:33] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet [12:07:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:28] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10875153 (10Andrew) @Jhancock.wm I updated the preseed rule for this server and it should make a SW raid now. If it still fails you ca... 
[12:08:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:09:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875158 (10Andrew) a:05Andrew→03None Site.pp is updated and cloudcephosd1039 is drained and ready to... [12:09:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:10:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json [12:10:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [12:10:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T395241)', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json [12:10:42] !log 
mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render [12:10:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render [12:11:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet [12:11:49] !log cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - T388531 [12:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:53] T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531 [12:12:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm [12:12:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org [12:12:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10875165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host doh7003.wikimedia.org with OS bookworm completed: - doh7003 (**PASS**) - Removed... 
[12:13:12] 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796 (10MoritzMuehlenhoff) 03NEW [12:13:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:15:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:17:17] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet [12:17:19] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [12:20:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T395241)', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json [12:20:51] RESOLVED: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:22:11] !incidents [12:22:12] 6267 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:22:54] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" [12:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:24:12] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152670 [12:24:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" [12:24:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:34] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors [12:24:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors [12:25:00] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" [12:25:04] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" [12:25:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875237 (10Jclark-ctr) a:03Jclark-ctr [12:25:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875250 (10Jclark-ctr) [12:26:02] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm [12:26:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10875252 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm [12:30:15] (03PS1) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 [12:30:25] (03CR) 10Marostegui: [C:04-2] "This is WIP" [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [12:32:25] (03CR) 10CI reject: [V:04-1] ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [12:33:19] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org [12:34:46] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10875292 (10SLyngshede-WMF) 05In progress→03Resolved [12:35:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json [12:35:24] (03PS2) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 [12:36:07] (03PS1) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 [12:37:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org [12:37:20] (03PS2) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 [12:37:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning [12:37:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet [12:41:29] (03CR) 10Jelto: [C:03+1] "looks good to me!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:41:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10875303 (10SLyngshede-WMF) a:05WMDECyn→03SLyngshede-WMF [12:41:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10875304 (10SLyngshede-WMF) [12:42:41] (03PS3) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 [12:43:17] (03PS1) 10Slyngshede: data.yaml: add neslihanturan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1152677 (https://phabricator.wikimedia.org/T394395) [12:43:18] (03CR) 10Bartosz Wójtowicz: "Thanks for looking into this issue <3 Left just one small comment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos) [12:43:39] 10ops-eqiad, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875307 (10Jclark-ctr) [12:44:03] (03PS2) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) [12:44:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:18] (03PS3) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 [12:45:37] (03PS4) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 [12:45:38] (03CR) 10Marostegui: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [12:46:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:47:22] (03CR) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos) [12:49:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:10] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Looks great, thank you <3" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos) [12:50:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json [12:51:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:52:12] PROBLEM - Disk space on restbase2035 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 67229 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2035&var-datasource=codfw+prometheus/ops [12:53:20] (03PS1) 10Muehlenhoff: CAS: Add service definition for Zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304) [12:54:28] (03PS1) 10Aqu: Airflow: Increase k8s check frequency in analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152681 (https://phabricator.wikimedia.org/T369845) [12:55:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:57:12] 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796#10875321 (10ayounsi) For doh and durum, I suggest that we wait for the Bird contract work defined in T362392#10875314 to land. The alternative is to implemen... 
[12:57:46] (03PS1) 10Gkyziridis: ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) [12:58:34] (03CR) 10CI reject: [V:04-1] ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) (owner: 10Gkyziridis) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1300) [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:11] o/ [13:00:24] nothing in the calendar so far indeed [13:01:21] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos) [13:01:26] (03PS2) 10Gkyziridis: ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) [13:02:25] (03CR) 10CI reject: [V:04-1] ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) (owner: 10Gkyziridis) [13:02:52] Lucas_WMDE: Hi, I have a patch waiting since last Friday, any chance you can deploy it? I did not schedule it though. 
[13:03:14] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:03:31] (03Merged) 10jenkins-bot: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos) [13:04:05] bunnypranav: can you add it to the schedule now? [13:04:23] Sure! [13:04:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav) [13:04:37] Done. [13:05:16] * Lucas_WMDE looking [13:05:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T395241)', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json [13:05:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [13:05:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T395241)', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json [13:06:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:07:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:19] (03PS1) 
10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) [13:08:20] (03PS2) 10Lucas Werkmeister (WMDE): core-Namespaces: Add Page, Author to default search ns in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav) [13:08:30] (03PS2) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274 [13:08:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:25] RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:30] (03PS2) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) [13:09:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) 
(owner: 10Bunnypranav) [13:09:46] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet [13:10:00] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#10875354 (10ayounsi) I see that there are now enough free ports on cloudsw1-d5-eqiad, @Jclark-ctr @dcaro I'm wondering if you could resume the... [13:10:26] (03Merged) 10jenkins-bot: core-Namespaces: Add Page, Author to default search ns in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav) [13:10:39] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1152191|core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] [13:10:41] T395632: Change default namespaces of AdvancedSearch on Russian Wikisource - https://phabricator.wikimedia.org/T395632 [13:11:34] (03PS1) 10Bartosz Wójtowicz: ml-services: Update custom_env for reference-need revision model in staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) [13:12:26] (03PS3) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) [13:13:01] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [13:13:03] (03PS4) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) [13:13:36] (03PS5) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) [13:13:42] !log lucaswerkmeister-wmde@deploy1003 bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191|core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:14:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T395241)', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json [13:14:40] bunnypranav: please test :) [13:14:47] on it :) [13:14:57] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [13:15:33] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [13:15:36] Yup, all good. Thanks! 
:D [13:15:40] !log lucaswerkmeister-wmde@deploy1003 bunnypranav, lucaswerkmeister-wmde: Continuing with sync [13:15:43] great, thanks! [13:16:19] * bunnypranav smiles joyfully [13:16:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" [13:16:51] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390540#10875385 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:16:59] (03PS6) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 [13:16:59] (03CR) 10Marostegui: "@Ladsgroup@gmail.com maybe this is all we need? https://puppet-compiler.wmflabs.org/output/1152673/5743/db1152.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui) [13:17:14] (03Merged) 10jenkins-bot: ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [13:17:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" [13:17:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:19:27] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[13:19:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:19:48] (03CR) 10Ssingh: varnish: Don't let experiment_fetcher crash if endpoint is unavailable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:19:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:19:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:19:55] (03CR) 10Gergő Tisza: SUL3: Enable client hints data on the auth shared domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [13:20:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:20:41] (03CR) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:20:59] (03PS1) 10Marostegui: db1211: Make it sanitarium master for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152688 (https://phabricator.wikimedia.org/T390954) [13:21:41] (03CR) 10Marostegui: "Requires depooling and restarting mariadb on db1211" [puppet] - 10https://gerrit.wikimedia.org/r/1152688 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [13:21:44] (03CR) 10Marostegui: [C:03+2] db1211: Make it sanitarium master for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152688 
(https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [13:21:44] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:22:42] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152191|core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) [13:22:44] T395632: Change default namespaces of AdvancedSearch on Russian Wikisource - https://phabricator.wikimedia.org/T395632 [13:22:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [13:22:46] (03CR) 10Ssingh: [C:03+1] varnish: Don't let experiment_fetcher crash if endpoint is unavailable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:22:50] bunnypranav: should be done :) [13:23:07] Cool, thanks again! :) [13:24:20] !log UTC afternoon backport+config window done [13:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:12] (03CR) 10Muehlenhoff: data.yaml: add neslihanturan to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152677 (https://phabricator.wikimedia.org/T394395) (owner: 10Slyngshede) [13:27:26] bking@cumin2002 decommission (PID 3416233) is awaiting input [13:27:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [13:27:41] (03CR) 10Federico Ceratto: [C:03+1] "As discussed on IRC: The "ops" group and the URL look correct to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304) (owner: 10Muehlenhoff) [13:29:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json [13:32:49] (03PS3) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274 [13:33:50] jclark@cumin1002 provision (PID 2632669) is awaiting input [13:34:07] Lucas_WMDE: Just to confirm, there's no deployments happening now, right? [13:34:09] (03PS1) 10Marostegui: site.pp: Add db1211 to sanitarium master role [puppet] - 10https://gerrit.wikimedia.org/r/1152700 (https://phabricator.wikimedia.org/T390954) [13:34:19] phuedx: confirmed [13:34:23] not as far as I know, anyway [13:34:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:34:25] jclark@cumin1002 provision (PID 2632730) is awaiting input [13:34:27] Lucas_WMDE: Thanks [13:34:51] jclark@cumin1002 provision (PID 2632432) is awaiting input [13:34:53] jclark@cumin1002 provision (PID 2632542) is awaiting input [13:35:28] (03PS2) 10Marostegui: site.pp: Add db1211 to sanitarium master role [puppet] - 10https://gerrit.wikimedia.org/r/1152700 (https://phabricator.wikimedia.org/T390954) [13:36:00] Experiment Platform is about to run an end to end test. 
There should be minimal disruption but I wanted to make sure that nothing is currently in flight [13:36:31] (03CR) 10Marostegui: [C:03+2] site.pp: Add db1211 to sanitarium master role [puppet] - 10https://gerrit.wikimedia.org/r/1152700 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [13:37:01] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [13:37:07] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [13:37:07] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:37:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet [13:38:14] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet [13:39:16] (03PS2) 10KartikMistry: Enable the Contribute menu (6th group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) [13:40:01] jclark@cumin1002 provision (PID 2632542) is awaiting input [13:40:02] jclark@cumin1002 provision (PID 2632432) is awaiting input [13:40:04] jclark@cumin1002 provision (PID 2632730) is awaiting input [13:41:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [13:41:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:41:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:41:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:41:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:43:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:44:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json [13:49:36] bking@cumin2002 decommission (PID 3432509) is awaiting input [13:49:43] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm [13:49:44] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet [13:49:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10875488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm executed with errors: - prometheus7002... [13:51:37] (03CR) 10Majavah: "LGTM minus the few things inline. 
We can use codfw1dev cloudcontrols as the first tester for this" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [13:52:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10875495 (10Jclark-ctr) [13:52:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10875496 (10Jclark-ctr) @Stevemunene Finished upgrading drives to 8tb [13:52:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:54:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152716 [13:55:09] (03CR) 10CI reject: [V:04-1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152716 (owner: 10PipelineBot) [13:57:10] PROBLEM - Disk space on restbase2027 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 68649 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2027&var-datasource=codfw+prometheus/ops [13:59:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T395241)', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json [13:59:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:59:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T395241)', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-135945-fceratto.json [14:00:38] 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10875560 (10Nikerabbit) [14:01:52] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304) (owner: 10Muehlenhoff) [14:04:28] !log Enabling the SDS 2.4.11 Synthetic A/A Test in xLab [14:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:56] (03CR) 10JMeybohm: [C:03+1] add codfw to os-reports in service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1152308 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:06:24] (03CR) 10JMeybohm: [C:03+1] trafficserver: point os-reports to k8s record [puppet] - 10https://gerrit.wikimedia.org/r/1152305 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:08:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T395241)', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json [14:17:44] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-06-02-141244-production [puppet] - 10https://gerrit.wikimedia.org/r/1152742 [14:18:15] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-06-02-141244-production [puppet] - 10https://gerrit.wikimedia.org/r/1152742 (owner: 10Majavah) [14:22:42] (03PS1) 10Marostegui: db1154: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) [14:24:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json [14:28:03] 06SRE, 
06Fundraising-Backlog, 10fundraising-tech-ops: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10875724 (10Jgreen) Did the "get more trial time" step. [14:28:35] (03PS1) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) [14:29:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [14:31:05] (03CR) 10Ladsgroup: [C:03+1] "Later we can probably go with 1/4th of the value based on my measurements." [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [14:31:11] jouncebot: next [14:31:11] In 0 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1530) [14:31:22] meh, too close [14:31:35] (03CR) 10Marostegui: "I will do it now" [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [14:32:09] (03PS2) 10Marostegui: db1154: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) [14:32:25] (03CR) 10Marostegui: "Better to start more conservatively and then we can increase it if needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [14:32:47] (03PS2) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) [14:32:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [14:35:22] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:35:40] (03CR) 10Marostegui: [C:03+2] db1154: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [14:35:56] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:35:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet [14:36:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance [14:37:39] (03CR) 10Ladsgroup: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [14:39:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json [14:39:45] !log joal@deploy1003 Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] [14:40:28] (03PS3) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) [14:40:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [14:42:53] !log joal@deploy1003 Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) [14:43:22] !log joal@deploy1003 Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] [14:44:02] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:28] !log joal@deploy1003 Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) [14:44:51] !log joal@deploy1003 Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] [14:46:07] (03PS1) 10Vgutierrez: varnish: Provide basic logging and metrics for experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) [14:47:02] PROBLEM - Hadoop NodeManager on an-worker1135 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:48:05] (03PS4) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) [14:48:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [14:50:46] (03PS5) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) [14:50:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875836 (10Andrew) [14:52:29] (03PS1) 10Phuedx: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 [14:52:43] (03CR) 10Ssingh: [C:03+1] varnish: Provide basic logging and metrics for experiment_fetcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:52:47] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [14:53:12] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10875841 (10Dzahn) [14:53:18] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875843 (10Andrew) [14:54:14] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875845 (10Andrew) To simplify T394333, let's move 
cloudcephosd1046 to D5. That saves us having to move an already-in-service server. I've updated the racking details accordingly. [14:54:18] !log joal@deploy1003 Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) [14:54:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T395241)', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json [14:54:23] (03CR) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [14:54:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:54:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T395241)', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json [14:54:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875852 (10Andrew) [14:54:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10875853 (10Andrew) [14:55:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10875856 (10Andrew) After conversation with @Jclark-ctr we're going to move cloudcephosd1046 (part of T378828 and not yet networked or in service) instead... 
[14:56:00] 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796#10875858 (10ayounsi) Had a quick chat with Moritz and Sukhbir. We prefer not to wait for the Bird work to progress on setting up the Routed Ganeti cluster, s... [14:56:00] (03PS2) 10Phuedx: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 [14:56:47] (03PS6) 10Vgutierrez: varnish: Don't let wmfuniq_experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) [14:56:47] (03PS2) 10Vgutierrez: varnish: Provide basic logging and metrics for wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) [14:56:56] (03PS3) 10Phuedx: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 [14:57:11] 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796#10875859 (10ssingh) >>! In T395796#10875858, @ayounsi wrote: > Had a quick chat with Moritz and Sukhbir. > We prefer not to wait for the Bird work to progres... [14:57:26] (03CR) 10Ssingh: [C:03+1] varnish: Provide basic logging and metrics for wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:57:29] (03CR) 10Ssingh: [C:03+1] varnish: Don't let wmfuniq_experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:57:59] (03CR) 10Santiago Faci: [C:03+1] "looks good!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 (owner: 10Phuedx) [14:58:33] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:00:18] !log Disabled the SDS 2.4.11 Synthetic A/A Test in xLab [15:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:19] (03PS1) 10Santiago Faci: xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) [15:01:33] (03CR) 10CI reject: [V:04-1] xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:01:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T395241)', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json [15:02:02] (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:03:12] !log joal@deploy1003 Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] [15:03:19] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) [15:03:53] !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] [15:03:58] !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) [15:04:38] Is there room available for a config deployment? 
There are no active backports right now [15:05:13] (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:44] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [15:10:12] (03CR) 10Ebernhardson: cirrussearch: use correct port for snapshot monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [15:13:58] (03PS4) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274 [15:14:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json [15:15:16] (03CR) 10Bking: cirrussearch: use correct port for snapshot monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [15:15:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye [15:15:57] (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:16:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10875934 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2009.codfw.wmnet with OS bu... [15:16:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json [15:17:02] (03PS1) 10Marostegui: check_private_data_report: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152760 (https://phabricator.wikimedia.org/T390954) [15:17:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [15:19:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [15:19:59] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10875941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcontrol2010-dev.codfw.wmnet with OS boo... 
[15:21:54] !log jouncebot nowandnext [15:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:58] (03CR) 10Majavah: profile::memcached::instance: Add support for nftables-compatible config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [15:21:59] dangint [15:22:07] jouncebot: nowandnext [15:22:07] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [15:22:07] In 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1530) [15:22:52] phuedx: ^ looks like you should be clear [15:22:59] thcipriani: Thanks <3 [15:23:50] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [15:25:34] (03CR) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [15:25:41] (03PS5) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274 [15:26:30] !log joal@deploy1003 Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] [15:27:12] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) [15:27:18] (03CR) 10Bking: [C:03+2] cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking) [15:27:40] thcipriani: Just confirming a detail in the codebase. 
Then I'll proceed [15:27:53] ack [15:29:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json [15:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1530). Please do the needful. [15:30:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff) [15:32:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json [15:32:15] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152760 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [15:32:26] (03CR) 10Marostegui: [C:03+2] "[17:31:54] marostegui: on phone so can't do gerrit but 1152760 has my +1" [puppet] - 10https://gerrit.wikimedia.org/r/1152760 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui) [15:34:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 (owner: 10Phuedx) [15:34:14] (03CR) 10Hashar: "recheck after having reverted a faulty CI config change ( 8603b5e9181fecebee5ad171de61bdfe6c6947e5 )" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:34:47] (03Merged) 10jenkins-bot: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 (owner: 10Phuedx) [15:35:01] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1152757|Enable MetricsPlatform's experimentation feature]] [15:37:27] (03CR) 10Clare Ming: 
[C:03+2] xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:38:13] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1152757|Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:38:33] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:38:47] (03Merged) 10jenkins-bot: xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci) [15:39:21] Confirmed that the product_metrics.web_base stream is configured correctly in labs and production realms [15:39:33] Checking logs on enwiki [15:41:27] Logs for MetricsPlatform extension indicate that there's no config fetching going on, which is what we want [15:42:27] Continuing [15:42:33] !log phuedx@deploy1003 phuedx: Continuing with sync [15:42:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage [15:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json [15:44:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [15:46:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host 
reimage [15:47:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T395241)', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json [15:47:17] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:47:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance [15:47:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T395241)', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json [15:47:44] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:49:25] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152757|Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) [15:50:47] !log disable puppet on A:cp to merge CR: 1091330 [15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:03] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR 1091330] [15:53:24] (03CR) 10Ssingh: [V:03+1 C:03+2] trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [15:53:51] (03PS1) 10Máté Szabó: ORES: Allow using RRML for pre-save revert risk detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152770 (https://phabricator.wikimedia.org/T364705) [15:54:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T395241)', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json [15:55:32] !log enable puppet and run agent on cp7001 [15:55:33] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:45] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10876055 (10Dzahn) This seems like a continuation of T330944 from 2023. [15:59:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [15:59:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to /var/cache/conftool/dbconfig/20250602-155946-root.json [16:03:05] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR 1091330] [16:09:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json [16:14:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json [16:15:43] (03CR) 10A smart kitten: ores-extension: enable revertrisk filter for a list of wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) (owner: 10Gkyziridis) [16:18:09] (03CR) 10Vgutierrez: conftool: rm ats-be services cache nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1114074 (owner: 10BCornwall) [16:22:57] !log sudo cumin -b1 -s60 'A:cp and not P{cp7001*}' "depool cdn && sleep 10 && run-puppet-agent --enable 
'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" [16:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:24:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json [16:30:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json [16:36:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:36:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:40:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T395241)', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json [16:40:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [16:40:10] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10876331 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcontrol2010-dev.codfw.wmnet with OS bookwor... 
[16:40:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance [16:40:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T395241)', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json [16:43:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:44:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:47:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T395241)', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json [16:49:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm [16:50:07] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10876368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcontrol2010-dev.codfw.wmnet with OS boo... [16:50:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:52:49] (03PS1) 10Bking: elastic/cirrussearch: remove decomm'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1152775 (https://phabricator.wikimedia.org/T394350) [16:53:50] PROBLEM - Disk space on an-worker1117 is CRITICAL: DISK CRITICAL - free space: / 2124 MB (3% inode=94%): /tmp 2124 MB (3% inode=94%): /var/tmp 2124 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1700) [17:00:05] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1700). [17:01:22] (03PS1) 10Phuedx: ext.xLab: Send limited copies of stream configs [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) [17:01:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: 10Phuedx) [17:02:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json [17:04:02] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet [17:05:44] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet [17:08:27] (03CR) 10Jasmine: [C:03+2] wikikube: decommission 
wikikube-worker102[6-8].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1151759 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [17:15:46] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10876497 (10Jhancock.wm) a:05Jhancock.wm→03Andrew @Andrew not sure why but i can't get it to pxe at all anymore. Can you take a look for me? Thank you! [17:18:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json [17:20:55] jasmine@cumin1002 decommission (PID 2875802) is awaiting input [17:21:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876535 (10Jclark-ctr) @Andrew @dcaro Fyi these have Boss cards and are not supported with legacy bios [17:22:09] (03PS2) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152716 [17:22:59] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [17:23:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876552 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye [17:32:12] PROBLEM - Disk space on restbase2035 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68407 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2035&var-datasource=codfw+prometheus/ops [17:32:42] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 10UploadWizard: Some uploaded 
files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10876663 (10Umherirrender) Happens sometimes, {T17430} / T393952 Please create the file page wi... [17:33:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T395241)', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json [17:33:22] (03CR) 10Dzahn: [C:03+2] gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:33:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [17:33:49] (03PS1) 10Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 [17:38:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance [17:38:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T395241)', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json [17:39:44] (03CR) 10CI reject: [V:04-1] sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [17:42:04] (03CR) 10Ssingh: "Unrelated:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [17:44:02] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:45:47] (03CR) 10Dzahn: [C:03+2] "[ssh-connection]: Failed 
(UnsupportedCredentialItem) to execute: ssh://gerrit2@gerrit2003.wikimedia.org:22: org.eclipse.jgit.transport.Cre" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:45:50] (03PS1) 10Dzahn: gerrit: add ssh_host_rsa public key for gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804) [17:45:54] (03CR) 10Ssingh: "@rcoccioli@wikimedia.org: self.phabricator looks OK here for sre/discovery/datacenter.py but is failing CI. I can try to dig into this but" [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [17:46:16] (03PS2) 10Dzahn: gerrit: add ssh_host_rsa public key for gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804) [17:46:52] (03CR) 10Dzahn: "taken from /etc/ssh/ssh_host_rsa_key.pub" [puppet] - 10https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [17:47:02] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:47:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T395241)', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json [17:47:09] jclark@cumin1002 reimage (PID 2880646) is awaiting input [17:49:02] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152782" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:49:34] jasmine@cumin1002 decommission (PID 2905278) is awaiting input [17:50:05] (03CR) 10Dzahn: [C:03+2] "replication to 2002 seems just fine.. 
just there was no host key for 2003 yet." [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:50:15] !log jasmine@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet [17:53:01] (03PS1) 10Andrew Bogott: octavia: move octavia amphorae (and auth) to 'octavia' project [puppet] - 10https://gerrit.wikimedia.org/r/1152783 (https://phabricator.wikimedia.org/T393783) [18:00:25] (03CR) 10Andrew Bogott: [C:03+2] octavia: move octavia amphorae (and auth) to 'octavia' project [puppet] - 10https://gerrit.wikimedia.org/r/1152783 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:02:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json [18:02:21] !log jasmine@cumin1002 START - Cookbook sre.dns.netbox [18:02:26] (03CR) 10Dzahn: [C:03+2] gerrit: add ssh_host_rsa public key for gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [18:05:52] !log jasmine@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" [18:06:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [18:06:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye exe... 
[18:07:07] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [18:07:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye [18:09:01] jasmine@cumin1002 decommission (PID 2905278) is awaiting input [18:09:19] (03CR) 10Herron: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [18:10:15] !log jasmine@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" [18:10:15] !log jasmine@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:15] !log jasmine@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet [18:17:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json [18:21:20] (03CR) 10Dr0ptp4kt: [C:03+1] ext.xLab: Send limited copies of stream configs [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: 10Phuedx) [18:21:24] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 [18:21:53] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) [18:23:47] (03PS3) 10Ilias Sarantopoulos: 
ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: 10Gkyziridis) [18:23:48] !log include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia [18:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:48] !log include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia [18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:54] (03CR) 10BCornwall: "Not sure if this is a path we want to go down but this would be what's necessary to switch to using variables." [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:31:57] FIRING: [9x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:16] oh sigh [18:32:17] this is me [18:32:28] !incidents [18:32:29] 6274 (UNACKED) [9x] ProbeDown sre (probes/service) [18:32:29] 6267 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [18:32:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T395241)', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json [18:32:32] !ack 6274 [18:32:33] 6274 (ACKED) [9x] ProbeDown sre (probes/service) [18:32:45] fixing [18:32:51] thanks sukhe [18:32:58] should resolve [18:33:30] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn|ats-be) [18:33:58] sorry about that :] [18:34:30] FIRING: LibericaDiffFPCheck: Liberica instance lvs3010:9100 control plane status doesn't match with forwarding plane status - 
https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=esams&var-instance=lvs3010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [18:34:45] !log jasmine@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet [18:36:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [18:36:25] !incidents [18:36:25] 6274 (ACKED) [9x] ProbeDown sre (probes/service) [18:36:25] 6275 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [18:36:26] 6267 (RESOLVED) [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [18:36:30] !ack 6275 [18:36:31] 6275 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [18:36:44] should be resolving soon, this is related to the alert above [18:36:54] that's definitely me so nothing unrelated [18:36:57] RESOLVED: [9x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:37:01] cool :) [18:37:15] (not really since I messed up but yes, the resolution) [18:38:13] 😀 [18:38:15] jasmine@cumin1002 decommission (PID 2955991) is awaiting input [18:39:30] FIRING: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [18:39:55] from ulsfo, I found wikipedia.org, commons.wikimedia.org, doc.wikimedia.org etc unreachable for a hot minute there. 
[18:40:02] yes please [18:40:33] that was me -- sorry about that, I was assuming commands were being run in parallel and they were not. screen scrollback let me down. [18:40:55] thanks for reacting to quick that I did not even get the ext [18:40:57] textr [18:41:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [18:43:16] sukhe: yeah, just confirming I hit it to trying to load metawiki (I'm in the greater LA area), I have a traceroute from the time, but it looks like you know what the issue was :) [18:43:57] (03PS1) 10Andrew Bogott: preseed.yaml: try to use the boss card (hw raid1) for new cloudcephosds [puppet] - 10https://gerrit.wikimedia.org/r/1152789 (https://phabricator.wikimedia.org/T394333) [18:44:18] !log jasmine@cumin1002 START - Cookbook sre.dns.netbox [18:46:56] !log jasmine@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:57] !log jasmine@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker1028.eqiad.wmnet [18:47:01] (03CR) 10Andrew Bogott: [C:03+2] preseed.yaml: try to use the boss card (hw raid1) for new cloudcephosds [puppet] - 10https://gerrit.wikimedia.org/r/1152789 (https://phabricator.wikimedia.org/T394333) (owner: 10Andrew Bogott) [18:47:14] (03CR) 10BCornwall: "`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [18:50:07] (03PS1) 10Andrew Bogott: New cloudcephosds -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1152792 [18:50:45] (03PS2) 10Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 [18:51:11] (03CR) 10Ssingh: "Thanks, updated!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [18:53:10] retro from the above is the cookbook which automates what I was trying to do via cumin ^ it worked in the first attempt but it errored out and in the second attempt, I ran it without -b1 -s60 [18:53:15] the cookbook should prevent that from happening again [18:53:46] the full command was: [18:53:48] sudo cumin -b1 -s60 "A:cp and not A:cp-codfw and not P{cp7001* or cp1100* or cp1101* or cp1102* or cp1103* or cp1104*}" "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" [18:57:54] (03CR) 10CI reject: [V:04-1] sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [18:59:01] (03CR) 10Ssingh: "The unrelated cookbook error was not a red herring, even if the import order was not correct." [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [18:59:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra) [19:05:19] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3010.esams.wmnet} and A:liberica [19:05:33] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3010.esams.wmnet} and A:liberica [19:05:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3010.esams.wmnet} and A:liberica [19:05:52] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs3010.esams.wmnet} and A:liberica [19:06:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs3010.esams.wmnet} and A:liberica [19:06:15] !log sukhe@cumin1002 END 
(PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3010.esams.wmnet} and A:liberica [19:06:52] (03PS1) 10Jsn.sherman: Undeploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) [19:08:12] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [19:08:34] (03CR) 10BCornwall: [C:03+1] "Sorry, shouldn't have replied here. Unresolving for @rcoccioli@wikimedia.org to look at." [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [19:09:30] FIRING: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [19:09:44] on this, clearing these up [19:09:47] one down, two to go [19:13:25] (03PS1) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) [19:14:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [19:14:21] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3009.esams.wmnet} and A:liberica [19:14:36] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3009.esams.wmnet} and A:liberica [19:14:46] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3009.esams.wmnet} and A:liberica [19:14:55] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs3009.esams.wmnet} and A:liberica [19:14:57] (03PS2) 10Kimberly Sarabia: 
Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) [19:15:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs3009.esams.wmnet} and A:liberica [19:15:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3009.esams.wmnet} and A:liberica [19:17:10] RECOVERY - Disk space on restbase2027 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2027&var-datasource=codfw+prometheus/ops [19:19:30] FIRING: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [19:19:48] ^ going away now [19:19:51] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3008.esams.wmnet} and A:liberica [19:20:05] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3008.esams.wmnet} and A:liberica [19:20:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3008.esams.wmnet} and A:liberica [19:20:25] FIRING: SystemdUnitFailed: nfacctd.service on netflow3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:25] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs3008.esams.wmnet} and A:liberica [19:20:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs3008.esams.wmnet} and A:liberica [19:20:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3008.esams.wmnet} and 
A:liberica [19:24:30] RESOLVED: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck [19:25:19] (03PS3) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) [19:32:34] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [19:35:25] RESOLVED: SystemdUnitFailed: nfacctd.service on netflow3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [19:37:27] (03CR) 10Dzahn: [C:03+1] "+1 if puppet runs on the active host before doc2003, should be fine" [puppet] - 10https://gerrit.wikimedia.org/r/1152300 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [19:40:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [19:40:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
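[note] On the 18:53 retro above: the quoted cumin command carries -b1 -s60, which, as far as the retro describes it, is what kept the depool/restart/pool sequence to one cache host at a time with a pause in between; the second attempt ran without those flags. Below is a minimal sketch of that batching idea only. It is not the sre.cdn.roll-restart-ats cookbook and not cumin's implementation; the function names, the host names and the injected sleep are all hypothetical, chosen for illustration.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size hosts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(hosts), batch_size):
        yield list(hosts[start:start + batch_size])


def rolling_run(
    hosts: Sequence[str],
    action: Callable[[str], None],
    batch_size: int = 1,
    batch_sleep: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply action to hosts one batch at a time, pausing between batches.

    With batch_size=1 at most one host is out of service at any moment,
    which is the property the retro says was lost in the second attempt.
    """
    completed: List[str] = []
    groups = list(batches(hosts, batch_size))
    for index, group in enumerate(groups):
        for host in group:
            action(host)  # hypothetical stand-in for depool, restart, pool
            completed.append(host)
        if index < len(groups) - 1:
            sleep(batch_sleep)  # no pause needed after the final batch
    return completed


if __name__ == "__main__":
    # Hypothetical host names; sleep is stubbed out so the demo returns at once.
    rolling_run(
        ["cp-a", "cp-b", "cp-c"],
        action=lambda host: print("restarting", host),
        sleep=lambda seconds: None,
    )
```

Passing sleep as a parameter is only there so the pacing can be checked without actually waiting; a real run would leave the default.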
[19:46:33] (03PS1) 10Dzahn: gerrit: introduce second daemon_user name [puppet] - 10https://gerrit.wikimedia.org/r/1152810 (https://phabricator.wikimedia.org/T338470) [19:47:10] PROBLEM - Disk space on restbase2030 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 67805 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2030&var-datasource=codfw+prometheus/ops [19:49:28] T394955 [19:49:28] T394955: when servers are about to run out of disk, monitoring should notify the owners - https://phabricator.wikimedia.org/T394955 [19:52:02] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:55:02] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:55:26] 06SRE: restbase2030 running low on disk space - https://phabricator.wikimedia.org/T395845 (10Dzahn) 03NEW [19:56:40] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10877281 (10Dzahn) T394955 [19:56:49] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877285 (10Dzahn) T394955 
[19:59:14] 06SRE: restbase2030 running low on disk space - https://phabricator.wikimedia.org/T395845#10877302 (10Dzahn) also see T390630 [19:59:22] (03CR) 10Jdlrobson: [C:04-1] Simple summaries survey for English (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T2000). [20:00:05] phuedx, arlolra, and JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] I'm here [20:00:41] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877308 (10Dzahn) restbase2003 is soon running out and is alerting: T395845 if you could take a look at... [20:01:05] o/ [20:01:12] here [20:03:50] hi - i can deploy but maybe everyone in the queue can/wants to self-deploy? [20:04:12] since spiderpig is pure joy [20:04:19] I can self deploy [20:04:34] as can i [20:04:51] As can I [20:04:58] phuedx: do you want me to take care of your patch? and then i can pass onto arlolra + JSherman? [20:05:12] Sure. 
I'll stick around to verify it [20:06:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: 10Phuedx) [20:07:28] (03Merged) 10jenkins-bot: ext.xLab: Send limited copies of stream configs [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: 10Phuedx) [20:07:45] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1152779|ext.xLab: Send limited copies of stream configs (T391988)]] [20:07:48] T391988: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: Route experiment-oriented MediaWiki JavaScript-based events conditionally - https://phabricator.wikimedia.org/T391988 [20:10:02] !log cjming@deploy1003 cjming, phuedx: Backport for [[gerrit:1152779|ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:10:16] phuedx: can you verify? 
[20:10:41] cjming: On it [20:13:16] (03CR) 10Jdlrobson: [C:04-1] Simple summaries survey for English (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:14:46] (03PS3) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) [20:14:51] (03CR) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01) [20:15:14] (03PS4) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) [20:16:04] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: T395240 [20:16:16] cjming: Confirmed that there's nothing in the logs. I've also confirmed that minimal stream configs are being sent to the browser by the extension [20:16:19] LGTM [20:16:23] yay [20:16:27] !log cjming@deploy1003 cjming, phuedx: Continuing with sync [20:17:22] (03PS5) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) [20:18:13] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [20:18:23] (03CR) 10Jdlrobson: [C:03+1] "Ready to deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:21:19] andrew@cumin1002 reimage (PID 2969213) is awaiting input [20:22:17] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: T395240 [20:23:35] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:23:37] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152779|ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) [20:23:40] T391988: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: Route experiment-oriented MediaWiki JavaScript-based events conditionally - https://phabricator.wikimedia.org/T391988 [20:23:46] phuedx: should be live! 
[20:23:50] arlolra: all yours [20:23:56] thanks [20:24:00] * phuedx checks [20:24:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra) [20:25:19] (03Merged) 10jenkins-bot: Remove wgParserEnableLegacyHeadingDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra) [20:25:29] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1152165|Remove wgParserEnableLegacyHeadingDOM option (T371756)]] [20:25:32] T371756: [1.45] Remove wgParserEnableLegacyHeadingDOM option to disable new heading HTML - https://phabricator.wikimedia.org/T371756 [20:26:12] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 [20:26:37] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: T395240 [20:27:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:27:10] PROBLEM - Disk space on restbase2030 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 61828 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2030&var-datasource=codfw+prometheus/ops [20:27:21] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: T395240 [20:27:25] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1152165|Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:27:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 [20:29:04] !log arlolra@deploy1003 arlolra: Continuing with sync [20:29:16] (03PS1) 10BCornwall: lvs: Switch lvs1017/lvs1020 primary [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) [20:29:44] Hello. We added something last minute. If you cannot get to it, no worries. [20:30:37] hi kimberly_sarabia - happy to deploy your patch - JSerman, will you lmk when you're done? [20:30:50] cjming: sure thing! [20:31:14] cjming: tyty [20:31:47] (03PS2) 10BCornwall: lvs: Switch lvs1017/lvs1020 primary [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) [20:33:45] (03PS2) 10Jsn.sherman: Undeploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) [20:35:19] I'm loving the status column on spider pig [20:35:32] ++ [20:36:07] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152165|Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) [20:36:10] T371756: [1.45] Remove wgParserEnableLegacyHeadingDOM option to disable new heading HTML - https://phabricator.wikimedia.org/T371756 [20:36:24] JSherman: all yours [20:36:30] arlolra: thanks! [20:36:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [20:37:10] while looking at that progress bar... 
humming 'does whatever a spiderpig does' [20:38:31] (03Merged) 10jenkins-bot: Undeploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman) [20:38:46] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1152797|Undeploy first set of Patroller Tools surveys (T389401)]] [20:38:49] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:38:53] (03PS6) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) [20:39:26] mutante: 100% same [20:39:54] If someone were to patch it to have faint background music… [20:41:23] !log jsn@deploy1003 jsn: Backport for [[gerrit:1152797|Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:42:30] verifying... [20:45:03] !log jsn@deploy1003 jsn: Continuing with sync [20:46:05] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [20:47:03] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [20:51:42] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152797|Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) [20:51:47] T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401 [20:52:39] (03PS1) 10Dzahn: gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804) [20:52:41] (03CR) 10BCornwall: [C:04-2] "The incumbent code checks for `X-WMF-UUID` headers that have been set and passes the value in to `X-Analytics`. 
We need to figure out what" [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins) [20:52:54] JSherman: ok to take over? [20:53:00] cjming: all yours [20:53:05] ty! [20:53:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:53:57] (sorry for being slow, I was just spot checking w/o the debug host [20:54:14] no worries! [20:54:52] (03Merged) 10jenkins-bot: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [20:55:07] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1152801|Simple summaries survey for English (T389393)]] [20:55:09] T389393: Summaries: Create QuickSurvey for community prototype - https://phabricator.wikimedia.org/T389393 [20:55:36] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877528 (10Eevans) >>! In T390630#10877285, @Dzahn wrote: > {T394955} This one is a bit different to th... [20:55:40] 06SRE: restbase2030 running low on disk space - https://phabricator.wikimedia.org/T395845#10877529 (10Eevans) a:03Eevans [20:56:49] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [20:56:56] !log cjming@deploy1003 cjming, ksarabia: Backport for [[gerrit:1152801|Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:57:14] kimberly_sarabia ^^ [20:59:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" [20:59:01] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [20:59:36] cjming: LGTM! [20:59:46] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:59:51] !log cjming@deploy1003 cjming, ksarabia: Continuing with sync [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T2100) [21:01:16] (03CR) 10CDobbins: "Thanks for the feedback. That makes sense. 
I originally used git grep to try to find where it's being set, but all that came up was the if" [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins) [21:02:02] (03CR) 10Ryan Kemper: [C:03+2] elastic/cirrussearch: remove decomm'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1152775 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [21:04:53] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye [21:05:12] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet with OS bullseye [21:06:48] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152801|Simple summaries survey for English (T389393)]] (duration: 11m 41s) [21:06:51] T389393: Summaries: Create QuickSurvey for community prototype - https://phabricator.wikimedia.org/T389393 [21:09:46] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:11:43] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [21:11:54] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1152817/5749/lvs1017.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [21:12:11] RECOVERY - Disk space on restbase2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2035&var-datasource=codfw+prometheus/ops [21:12:38] (03CR) 10Ssingh: "@vgutierrez@wikimedia.org see above." 
[puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [21:13:47] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet [21:16:13] !log tgr@deploy1003 Locking from deployment [MediaWiki]: T395758 [21:16:16] (03PS2) 10Dzahn: gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804) [21:16:35] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877596 (10Scott_French) Thanks Eric and Daniel. +1 to Eric's articulation of how monitoring sessionstor... [21:16:45] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet|cirrussearch2056.codfw.wmnet|cirrussearch2057.codfw.wmnet|cirrussearch2058.codfw.wmnet|cirrussearch2059.codfw.wmnet|cirrussearch2060.codfw.wmnet|cirrussearch2091.codfw.wmnet [21:16:45] 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10877597 (10Eevans) [21:20:50] (03Abandoned) 10CDobbins: replace X-WMF-UUID with vmod_var variable [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins) [21:22:18] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for T395855 - bking@cumin2002 [21:22:20] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for T395855 - bking@cumin2002 [21:22:22] T395855: Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855 [21:23:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877622 (10Andrew) 
@Jclark-ctr, the new preseed recipe seems to work ok, 1048 is now reimaging properly. 1049 failed for me in a totally different way but... [21:25:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877624 (10Jclark-ctr) @Andrew thanks i was looking at 1048 right now also i see it imaging! yea i have not adjusted a few settings for the rest work on... [21:30:07] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10877637 (10bking) a:05bking→03None [21:32:16] (03PS1) 10Ryan Kemper: cirrus: add missing entry for cirrussearch2061 [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) [21:32:27] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [21:34:37] (03CR) 10Ryan Kemper: [C:03+2] cirrus: add missing entry for cirrussearch2061 [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [21:34:45] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [21:38:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [21:38:45] !log tgr@deploy1003 Unlocked for deployment [MediaWiki]: T395758 (duration: 22m 32s) [21:38:45] (03PS1) 10Ryan Kemper: cirrus: remove 6 codfw hosts from pybal [puppet] - 10https://gerrit.wikimedia.org/r/1152831 (https://phabricator.wikimedia.org/T390901) [21:44:03] RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:46:18] (03PS2) 10Andrew Bogott: New cloudcephosds -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1152792 [21:46:18] (03PS1) 10Andrew Bogott: eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) [21:47:03] PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:47:30] (03CR) 10Andrew Bogott: [C:03+2] New cloudcephosds -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1152792 (owner: 10Andrew Bogott) [21:47:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [21:50:17] preparing to do a security deploy with scap sync-world [21:51:00] to deploy some security patches and a config change in PrivateSettings.php [21:51:33] (03CR) 10Ladsgroup: [C:03+1] Drop Chart roll-out dblists, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) (owner: 10Jforrester) [21:52:05] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10877692 (10bking) Data Platform SRE steps are finished (we think). Sending to... 
[21:52:42] (03CR) 10Ladsgroup: [C:03+1] build: Rename the rarely-used 'typos' script to 'checkTypos' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151781 (owner: 10Jforrester) [21:53:32] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye [21:53:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bullseye [21:59:06] (03CR) 10Bking: [C:03+1] cirrus: add missing entry for cirrussearch2061 [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [22:01:33] (03PS1) 10Andrew Bogott: Move octavia secrets into deployment-specific subdirs, add for eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1152839 (https://phabricator.wikimedia.org/T393783) [22:01:34] (03PS1) 10Andrew Bogott: Correct the name of a fake octavia password [labs/private] - 10https://gerrit.wikimedia.org/r/1152840 (https://phabricator.wikimedia.org/T393783) [22:01:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [22:01:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877711 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye [22:03:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152831 (https://phabricator.wikimedia.org/T390901) (owner: 10Ryan Kemper) [22:03:29] (03PS2) 10Andrew Bogott: eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) [22:03:29] 
(03PS1) 10Andrew Bogott: Openstack octavia: move secrets into a codfw1dev subdir [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) [22:03:34] (03CR) 10Bking: [C:03+2] cirrus: remove 6 codfw hosts from pybal [puppet] - 10https://gerrit.wikimedia.org/r/1152831 (https://phabricator.wikimedia.org/T390901) (owner: 10Ryan Kemper) [22:04:14] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:06:28] (03CR) 10Andrew Bogott: [C:03+2] Move octavia secrets into deployment-specific subdirs, add for eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1152839 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:06:33] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Correct the name of a fake octavia password [labs/private] - 10https://gerrit.wikimedia.org/r/1152840 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:06:37] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Move octavia secrets into deployment-specific subdirs, add for eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1152839 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:07:01] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:07:09] (03PS2) 10Andrew Bogott: Openstack octavia: move secrets into a codfw1dev subdir. 
[puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) [22:07:11] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:08:59] !log scap sync-world finished to deploy several security bugs and PrivateSettings.php changes [22:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:01] (03CR) 10Andrew Bogott: [C:03+2] Openstack octavia: move secrets into a codfw1dev subdir. [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:13:28] (03Abandoned) 10Andrew Bogott: dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott) [22:14:12] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:16:04] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [22:16:56] (03PS3) 10Andrew Bogott: eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) [22:17:03] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [22:29:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye [22:29:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye [22:32:30] (03CR) 10Jdlrobson: 
Simple summaries survey for English (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia) [22:35:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [22:38:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [22:45:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [22:45:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877804 (10Jclark-ctr) [22:47:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [22:50:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [22:54:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [22:58:53] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [22:59:45] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 782, active_shards: 1853, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [22:59:45] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T2300) [23:05:43] (03PS1) 10Cwhite: logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) [23:08:04] (03CR) 10CI reject: [V:04-1] logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:09:52] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:10:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:10:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye [23:10:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877836 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bullseye com... [23:13:31] (03PS1) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) [23:15:12] (03PS2) 10Cwhite: logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) [23:15:46] (03CR) 10CI reject: [V:04-1] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [23:17:16] (03CR) 10CI reject: [V:04-1] logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:19:34] (03PS2) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) [23:20:25] (03PS3) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) [23:22:22] (03CR) 10CI reject: [V:04-1] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [23:22:34] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:22:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:22:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1050.eqiad.wmnet with OS bullseye [23:23:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877843 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye com... [23:24:04] (03PS4) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) [23:24:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:24:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:24:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [23:24:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye com... 
[23:25:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [23:25:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye [23:26:17] (03PS1) 10Ladsgroup: etcd: Remove ES clusters from "write clusters" if section is RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696) [23:26:21] (03CR) 10CI reject: [V:04-1] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite) [23:26:29] (03PS5) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) [23:30:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye [23:30:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10877852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be2008.codfw.wmnet with OS bull... 
[23:30:45] (03PS1) 10Scott French: deployment_server: Update the local helm cache in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1152854 (https://phabricator.wikimedia.org/T395521) [23:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152856 [23:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152856 (owner: 10TrainBranchBot) [23:50:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage [23:51:06] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152856 (owner: 10TrainBranchBot) [23:52:12] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [23:53:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage