[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:37] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152445
[00:08:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152445 (owner: TrainBranchBot)
[00:10:49] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:23:35] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:28:35] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service logstash1025:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service logstash1025:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:30:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1152445 (owner: TrainBranchBot)
[00:42:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:01:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:55] <icinga-wm>	 PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[04:14:55] <icinga-wm>	 RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[04:23:35] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:42:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:58:49] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install es204[78] - https://phabricator.wikimedia.org/T393106#10874139 (Marostegui) Thank you!
[05:00:50] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install es104[78] - https://phabricator.wikimedia.org/T393107#10874154 (Marostegui)
[05:01:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036 T395771', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
[05:01:53] <stashbot>	 T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[05:02:19] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
[05:06:46] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es2047 into es6 [puppet] - https://gerrit.wikimedia.org/r/1152451 (https://phabricator.wikimedia.org/T395771)
[05:07:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:08:39] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize es2047 into es6 [puppet] - https://gerrit.wikimedia.org/r/1152451 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[05:14:33] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1003, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:15:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of es2036.codfw.wmnet onto es2047.codfw.wmnet
[05:16:38] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add es2047 [puppet] - https://gerrit.wikimedia.org/r/1152452 (https://phabricator.wikimedia.org/T395771)
[05:17:47] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add es2047 [puppet] - https://gerrit.wikimedia.org/r/1152452 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[05:19:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2047 to dbctl depooled T395771', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json
[05:20:04] <stashbot>	 T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[05:38:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:39:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169 T395663', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json
[05:39:08] <stashbot>	 T395663: MariaDB 10.11.13 released - https://phabricator.wikimedia.org/T395663
[05:53:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json
[06:02:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json
[06:11:36] <wikibugs>	 (PS4) Phuedx: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:14:34] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:15:56] <wikibugs>	 (PS5) Phuedx: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:16:29] <wikibugs>	 (PS6) Phuedx: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:17:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[06:18:37] <phuedx>	 o/ I will be ~5 minutes late for the morning backport window but I'll be here :)
[06:23:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json
[06:34:47] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Update canary [puppet] - https://gerrit.wikimedia.org/r/1152188 (owner: Muehlenhoff)
[06:38:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json
[06:48:44] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet
[06:52:44] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet
[06:53:00] <wikibugs>	 (PS1) KartikMistry: Enable the Contribute menu (6th group) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930)
[06:53:20] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[06:53:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to /var/cache/conftool/dbconfig/20250602-065331-root.json
[06:53:48] <wikibugs>	 (CR) CI reject: [V:-1] Enable the Contribute menu (6th group) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: KartikMistry)
[06:57:02] <wikibugs>	 (PS1) Slyngshede: data.yaml: pwaigi1- offboarding [puppet] - https://gerrit.wikimedia.org/r/1152559
[06:59:07] <wikibugs>	 (CR) Muehlenhoff: data.yaml: pwaigi1- offboarding (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede)
[06:59:38] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T0700).
[07:00:04] <jouncebot>	 phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:29] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org
[07:02:29] <wikibugs>	 (PS2) Slyngshede: data.yaml: pwaigi1- offboarding [puppet] - https://gerrit.wikimedia.org/r/1152559
[07:02:43] <wikibugs>	 (CR) Slyngshede: data.yaml: pwaigi1- offboarding (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede)
[07:05:34] <wikibugs>	 (PS1) Marostegui: es1040: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1152560 (https://phabricator.wikimedia.org/T395647)
[07:06:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1040 T395647', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json
[07:06:05] <stashbot>	 T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647
[07:06:22] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1152559 (owner: Slyngshede)
[07:08:31] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org
[07:08:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json
[07:08:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance
[07:09:57] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance
[07:12:13] <phuedx>	 o/ Here now
[07:12:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:12:41] <marostegui>	 ^ es7 issues
[07:12:46] <icinga-wm>	 RECOVERY - MariaDB memory on es1035 is OK: OK Memory 3% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:13:22] <phuedx>	 moritzm: Is it OK to deploy a Beta Cluster-only config change with that amount of ongoing errors or should I hold off?
[07:13:27] <phuedx>	 Sorry
[07:13:35] <phuedx>	 marostegui: ^^
[07:13:44] <marostegui>	 phuedx: yes 
[07:13:56] <icinga-wm>	 PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[07:13:59] <phuedx>	 Yes it's OK or yes to hold off? :D
[07:14:05] <marostegui>	 phuedx: you can proceed
[07:14:08] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org
[07:14:10] <phuedx>	 marostegui: ty ty
[07:15:50] <phuedx>	 Amir1, urandom, awight: OK for me to deploy a config change?
[07:15:56] <icinga-wm>	 RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[07:17:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:20:22] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org
[07:21:16] <phuedx>	 Alright. It's been ~5 minutes. No deployers appear to be around/awake at the moment. Continuing
[07:21:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[07:21:53] <stashbot>	 jmm@cumin1003: Failed to log message to wiki. Somebody should check the error logs.
[07:22:22] <wikibugs>	 (Merged) jenkins-bot: Beta Cluster: Support A/B experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1152253 (https://phabricator.wikimedia.org/T393918) (owner: Dr0ptp4kt)
[07:22:40] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1152253|Beta Cluster: Support A/B experiments (T393918)]]
[07:22:42] <stashbot>	 T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918
[07:22:44] <wikibugs>	 (CR) Marostegui: [C:+2] es1040: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1152560 (https://phabricator.wikimedia.org/T395647) (owner: Marostegui)
[07:27:45] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[07:32:41] <phuedx>	 Waiting on build-and-push-container-images. Looking at the log, it's running but taking time
[07:34:07] <hashar>	 phuedx: the first deploy of the week is doing a full rebuild of the mediawiki images
[07:34:27] <phuedx>	 hashar: TIL! Thanks for the clarification
[07:35:03] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874333 (ayounsi)
[07:35:14] <hashar>	 that is because the base image is automatically rebuilt over the weekend, which in turn invalidates all the docker caching layers
[07:35:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json
[07:36:04] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet
[07:36:52] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10874334 (Aklapper)
[07:38:20] <wikibugs>	 (CR) Ayounsi: [C:+1] EVPN_BGP: add peer-as to conf to match unicast and remove auto on bfd [homer/public] - https://gerrit.wikimedia.org/r/1152258 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[07:38:37] <logmsgbot>	 !log phuedx@deploy1003 phuedx, dr0ptp4kt: Backport for [[gerrit:1152253|Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:38:39] <stashbot>	 T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918
[07:40:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:20] <phuedx>	 OK. I've verified that the stream config setting appears for the product_metrics.web_base stream on beta metawiki but not beta enwiki and not on any production wikis
[07:43:00] <phuedx>	 I've verified that the MetricsPlatform extension is still loaded in both the labs and production realms
[07:43:46] <phuedx>	 I've verified on beta enwiki and beta metawiki that there is an active logged-in experiment but I'm not in sample
[07:43:58] <phuedx>	 But that the ext.xLab RL module is still loaded
[07:44:11] <phuedx>	 Just checking that the above does not happen in the production realm
[07:47:59] <phuedx>	 Yup. The experiment is not running on enwiki or metawiki
[07:48:03] <phuedx>	 No errors in the console
[07:48:06] <phuedx>	 Continuing
[07:49:11] <logmsgbot>	 !log phuedx@deploy1003 phuedx, dr0ptp4kt: Continuing with sync
[07:50:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:30] <XioNoX>	 Emperor: good morning fellow oncall!
[07:50:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json
[07:53:29] <Emperor>	 hi
[07:55:24] <wikibugs>	 (CR) JMeybohm: [C:+1] "Cool. Let's go for it then 😊" [deployment-charts] - https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[07:58:38] <wikibugs>	 (CR) JMeybohm: [C:-1] validating-admission-policies: add policy to permit hostPath mounts for mediawiki (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: Effie Mouzeli)
[07:58:40] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152253|Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s)
[07:58:43] <stashbot>	 T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918
[07:59:25] <phuedx>	 I will continue to poke at the Beta Cluster for a while longer :)
[08:05:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json
[08:11:22] <wikibugs>	 (PS1) Slyngshede: Permission management [software/bitu] - https://gerrit.wikimedia.org/r/1152635
[08:11:36] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm
[08:11:46] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874389 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm
[08:12:59] <wikibugs>	 (PS1) Vgutierrez: varnish: Set wmfuniq experiment reload period to 30s [puppet] - https://gerrit.wikimedia.org/r/1152636 (https://phabricator.wikimedia.org/T391411)
[08:19:48] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874400 (ayounsi)
[08:19:49] <wikibugs>	 (PS3) Majavah: P:toolforge::prometheus: Add components-api scrape endpoint [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: Raymond Ndibe)
[08:20:06] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152636 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[08:20:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json
[08:22:06] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge::prometheus: Add components-api scrape endpoint [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: Raymond Ndibe)
[08:23:09] <wikibugs>	 (PS1) Ayounsi: Add magru virtual IPs to network::subnets [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263)
[08:23:30] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263) (owner: Ayounsi)
[08:23:35] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:26:01] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good. The PCC failure for Puppet 5 is expected, since the manifests on install* use Puppet syntax from Puppet 7." [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263) (owner: Ayounsi)
[08:26:29] <wikibugs>	 (CR) Ayounsi: [C:+2] Add magru virtual IPs to network::subnets [puppet] - https://gerrit.wikimedia.org/r/1152637 (https://phabricator.wikimedia.org/T394263) (owner: Ayounsi)
[08:28:08] <icinga-wm>	 PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2012 MB (3% inode=93%): /tmp 2012 MB (3% inode=93%): /var/tmp 2012 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops
[08:33:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet
[08:35:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:36:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to /var/cache/conftool/dbconfig/20250602-083559-root.json
[08:37:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:59] <wikibugs>	 (PS5) Majavah: O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446)
[08:40:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:41:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:42:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:42:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:42:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet
[08:42:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874428 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir7003.magru.wmnet` - ncredir7003.magru.wmnet (**WARN**)   - //Host not found...
[08:42:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/3 (Core: ssw1-f1-codfw:et-0/0/31 {#changeme_cwdm2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:44:36] <wikibugs>	 (CR) Vgutierrez: [C:+2] varnish: Set wmfuniq experiment reload period to 30s [puppet] - https://gerrit.wikimedia.org/r/1152636 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[08:45:08] <logmsgbot>	 !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm
[08:45:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:45:22] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874450 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm executed with errors: - ncredir7003 (**FA...
[08:46:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:46:35] <XioNoX>	 topranks: the CoreRouterInterfaceDown alert above is you?
[08:47:25] <topranks>	 XioNoX: no, public holiday here, I’m not doing anything
[08:47:28] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet
[08:47:29] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:47:40] <topranks>	 Oh sorry
[08:47:53] <topranks>	 my bad, yeah, that's the link I enabled last week
[08:48:06] <XioNoX>	 maybe a downtime that expired?
[08:48:07] <topranks>	 Jenn is gonna look at it today, it didn’t come up
[08:48:19] <XioNoX>	 anyway, will ack it for 24h
[08:48:20] <topranks>	 I didn’t add the BGP yet but enabled it in netbox
[08:48:26] <topranks>	 Thanks sry
[08:48:36] <XioNoX>	 no pb at all!
[08:49:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:51:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json
[08:51:08] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:51:12] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:51:12] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:51:12] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7003.magru.wmnet on all recursors
[08:51:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:51:16] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors
[08:51:42] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:51:47] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003"
[08:53:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:54:48] <logmsgbot>	 jmm@cumin1003 makevm (PID 31285) is awaiting input
[08:58:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:58:21] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet
[08:58:31] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm
[08:58:41] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874510 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm
[08:59:11] <vgutierrez>	 ncredir7003? 🍿
[09:00:20] <moritzm>	 for https://phabricator.wikimedia.org/T394263
[09:00:46] <moritzm>	 I'm installing these initially with insetup, the actual service setup will be passed over in a separate task
[09:02:19] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet
[09:04:25] <wikibugs>	 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10874556 (10MatthewVernon) I think the "check the file is in a consistent (p...
[09:09:27] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet
[09:10:51] <jelto>	 !log update gitlab-settings artifact retention to 6 months - T395014
[09:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:53] <stashbot>	 T395014: Check GitLab artifact retention time - https://phabricator.wikimedia.org/T395014
[09:13:23] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet
[09:13:56] <icinga-wm>	 PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[09:14:54] <icinga-wm>	 RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[09:20:13] <wikibugs>	 (03CR) 10Majavah: [C:03+2] O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) (owner: 10Majavah)
[09:20:37] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 10UploadWizard: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10874584 (10MatthewVernon) I've checked these objects in swift, and they are both present and co...
[09:22:06] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org
[09:24:53] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage
[09:25:46] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: ml-services: Update docker images for production deployments. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865)
[09:27:44] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org
[09:28:28] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage
[09:32:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:33:10] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "Thank you so much for handling this. This helps the dashboard being clean." [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn)
[09:33:18] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org
[09:36:47] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "As an additional info, in case it helps, running `check_bacula.py` or `check_bacula.py <jobname>` at the bacula director host (it is a pyt" [puppet] - 10https://gerrit.wikimedia.org/r/1152330 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn)
[09:40:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org
[09:42:28] <jinxer-wm>	 FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:44:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2039 T395647', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to /var/cache/conftool/dbconfig/20250602-094402-marostegui.json
[09:44:09] <stashbot>	 T395647: Migrate es7 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395647
[09:45:07] <wikibugs>	 (03PS1) 10Marostegui: es2039: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152644 (https://phabricator.wikimedia.org/T395647)
[09:45:15] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance
[09:45:58] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2039: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152644 (https://phabricator.wikimedia.org/T395647) (owner: 10Marostegui)
[09:47:01] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] jobqueue: Set the host header in all jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris)
[09:47:28] <jinxer-wm>	 RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:49:08] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris)
[09:54:36] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10874669 (10ayounsi) That makes sense to me! +1 on removing the macros.
[09:55:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json
[09:55:42] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1152645 (https://phabricator.wikimedia.org/T395785)
[09:55:52] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet
[09:59:00] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm
[09:59:01] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet
[09:59:05] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874692 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm completed: - ncredir7003 (**PASS**)   - R...
[09:59:37] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet
[10:00:07] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1000)
[10:00:29] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: "Confirming that I verified all images and tags exist in our registry." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:02:05] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet
[10:02:06] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[10:02:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning
[10:02:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:03:18] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Start using edge uniques config fetched from xlabs endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1152648 (https://phabricator.wikimedia.org/T391411)
[10:05:08] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152648 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[10:05:15] <wikibugs>	 (03PS1) 10Federico Ceratto: icinga: add Federico Ceratto to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1152647
[10:06:01] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003"
[10:06:21] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003"
[10:06:21] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:06:22] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors
[10:06:25] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors
[10:06:47] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003"
[10:06:52] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003"
[10:07:22] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm
[10:07:38] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host durum7003.magru.wmnet with OS bookworm
[10:08:56] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[10:10:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json
[10:11:22] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649
[10:12:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:12:43] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[10:15:31] <wikibugs>	 (03PS2) 10Clément Goubert: Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649
[10:18:03] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet
[10:19:06] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] icinga: add Federico Ceratto to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1152647 (owner: 10Federico Ceratto)
[10:19:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:00] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet
[10:25:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json
[10:25:37] <wikibugs>	 (03PS1) 10Marostegui: es2047: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1152652 (https://phabricator.wikimedia.org/T395771)
[10:26:43] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2047: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1152652 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui)
[10:27:08] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet
[10:28:42] <wikibugs>	 (03CR) 10Klausman: ml-services: Update docker images for production deployments. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:29:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:30:11] <wikibugs>	 (03CR) 10Klausman: [C:03+1] ml-services: Update docker images for production deployments. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:30:22] <wikibugs>	 07Puppet, 06DBA: labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846#10874763 (10jcrespo) Ignore my suggestion, sorry, it was done already. Current status: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/9ae1d81b4ae21559f66b6e6cd283d642814ac4cf/module...
[10:30:53] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] icinga: add Federico Ceratto to authorized users [puppet] - 10https://gerrit.wikimedia.org/r/1152647 (owner: 10Federico Ceratto)
[10:31:05] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet
[10:32:37] <wikibugs>	 (03CR) 10Gkyziridis: "LGTM! Please update commit message with the related changes to the model that you upload for articlequality/language-agnostic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:32:48] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] ml-services: Update docker images for production deployments. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:33:22] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] aux-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková)
[10:34:10] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage
[10:35:09] <wikibugs>	 (03Merged) 10jenkins-bot: aux-k8s-services/*: use the correct helm version in each cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková)
[10:36:02] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] varnish: Start using edge uniques config fetched from xlabs endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1152648 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[10:36:59] <wikibugs>	 (03PS2) 10Bartosz Wójtowicz: ml-services: Update docker images for production deployments and update AQLA model files. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865)
[10:37:48] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage
[10:40:23] <logmsgbot>	 !log kamila@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply
[10:40:32] <logmsgbot>	 !log kamila@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply
[10:40:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json
[10:40:41] <logmsgbot>	 !log kamila@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply
[10:41:02] <logmsgbot>	 !log kamila@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply
[10:41:16] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update docker images for production deployments and update AQLA model files. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:41:44] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] "Done, and indeed nothing seems broken right now :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127950 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková)
[10:44:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:44:20] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649 (owner: 10Clément Goubert)
[10:44:47] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Update docker images for production deployments and update AQLA model files. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152643 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[10:47:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:48:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning
[10:54:19] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet
[10:55:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json
[10:58:05] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet
[10:58:49] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm
[10:58:49] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet
[10:58:57] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host durum7003.magru.wmnet with OS bookworm completed: - durum7003 (**PASS**)   - Remov...
[11:10:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json
[11:11:38] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org
[11:11:40] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[11:14:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[11:15:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:15:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json
[11:16:35] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003"
[11:17:02] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003"
[11:17:02] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:17:02] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors
[11:17:05] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors
[11:17:27] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003"
[11:17:32] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003"
[11:18:36] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm
[11:18:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10874940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host doh7003.wikimedia.org with OS bookworm
[11:19:49] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet
[11:23:06] <claime>	 jouncebot: nowandnext
[11:23:06] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 36 minute(s)
[11:23:06] <jouncebot>	 In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1300)
[11:23:32] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet
[11:23:52] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert^2 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152649 (owner: 10Clément Goubert)
[11:24:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json
[11:30:49] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[11:31:34] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[11:32:29] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading, 10UploadWizard: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10875036 (10Lvova) > So Басков_переулок_19_СПб_02.jpg looks OK to me (I might have missed someth...
[11:32:47] <claime>	 !log Manual run of cronjobs/generatecaptcha on k8s - T388531
[11:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:49] <stashbot>	 T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531
[11:33:27] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org
[11:35:21] <claime>	 Reedy: i killed the pod and will have to rerun the script on mwmaint
[11:36:58] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5738/console" [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy)
[11:37:00] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^3 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152659
[11:37:08] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] Revert^3 "mw::maintenance: Migrate generatecaptcha to mw-cron" [puppet] - 10https://gerrit.wikimedia.org/r/1152659 (owner: 10Clément Goubert)
[11:37:15] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org
[11:38:35] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] profile::gitlab::runner: Resolve namservers to IPs [puppet] - 10https://gerrit.wikimedia.org/r/1152280 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy)
[11:39:52] <claime>	 !log cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service  - T388531
[11:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:54] <stashbot>	 T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531
[11:40:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json
[11:41:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:42:57] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org
[11:44:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10875073 (10MoritzMuehlenhoff)
[11:44:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:45:29] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:45:33] <claime>	 Hmmm I'm getting a storage backend error Reedy, maybe Emperor too
[11:45:35] <claime>	 An unknown error occurred in storage backend "global-swift-eqiad".
[11:46:45] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:46:55] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org
[11:47:27] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[11:47:47] <claime>	 manual page, I fucked up
[11:47:56] <claime>	 https://auth.wikimedia.org/enwiki/wiki/Special:CreateAccount
[11:48:10] <claime>	 I broke account creation
[11:48:32] <claime>	 Emperor: XioNoX ^^
[11:48:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
[11:49:01] <XioNoX>	 claime: need help or it's a fyi?
[11:49:04] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage
[11:49:19] <claime>	 I think I need help from someone who knows swift
[11:49:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:34] <claime>	 trying to find which container may need to be cleaned up
[11:49:52] <XioNoX>	 Then Emperor is probably a safe bet
[11:50:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:50:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[11:51:10] <XioNoX>	 hmmm
[11:51:27] <Emperor>	 darn it, I was having lunch
[11:51:29] * Emperor here
[11:51:42] <logmsgbot>	 marostegui@cumin1002 clone (PID 2353986) is awaiting input
[11:51:45] <Emperor>	 claime: what's up?
[11:51:58] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage
[11:52:13] <claime>	 Emperor: -security
[11:52:16] <Emperor>	 ack
[11:52:31] <XioNoX>	 Emperor: the page doesn't seem related to claime, I'm having a look at it
[11:52:48] <Emperor>	 XioNoX: ack
[11:54:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:26] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet
[11:55:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json
[11:55:20] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
[11:57:11] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[11:59:09] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[12:00:17] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
[12:00:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: backup-kdc-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] preseed: cloudcontrol2010-dev will have a 4-disk sw raid. [puppet] - 10https://gerrit.wikimedia.org/r/1152390 (https://phabricator.wikimedia.org/T393102) (owner: 10Andrew Bogott)
[12:05:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Add site.pp entries for new ceph osds [puppet] - 10https://gerrit.wikimedia.org/r/1152439 (https://phabricator.wikimedia.org/T394333) (owner: 10Andrew Bogott)
[12:06:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:07:33] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
[12:07:34] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:08:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10875153 (10Andrew) @Jhancock.wm I updated the preseed rule for this server and it should make a SW raid now. If it still fails you ca...
[12:08:30] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:08:56] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:09:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875158 (10Andrew) a:05Andrew→03None Site.pp is updated and cloudcephosd1039 is drained and ready to...
[12:09:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:26] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:10:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
[12:10:34] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[12:10:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T395241)', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
[12:10:42] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
[12:10:51] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
[12:11:32] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
[12:11:49] <claime>	 !log cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service  - T388531
[12:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:53] <stashbot>	 T388531: Migrate Security-Team jobs to mw-cron - https://phabricator.wikimedia.org/T388531
[12:12:03] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm
[12:12:04] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org
[12:12:09] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10875165 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host doh7003.wikimedia.org with OS bookworm completed: - doh7003 (**PASS**)   - Removed...
[12:13:12] <wikibugs>	 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796 (10MoritzMuehlenhoff) 03NEW
[12:13:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:15:51] <jinxer-wm>	 FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:17:17] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
[12:17:19] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[12:20:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T395241)', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json
[12:20:51] <jinxer-wm>	 RESOLVED: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[12:22:11] <claime>	 !incidents
[12:22:12] <sirenbot>	 6267 (RESOLVED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[12:22:54] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
[12:23:35] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:24:12] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152670
[12:24:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
[12:24:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:24:34] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors
[12:24:38] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors
[12:25:00] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
[12:25:04] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
[12:25:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875237 (10Jclark-ctr) a:03Jclark-ctr
[12:25:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875250 (10Jclark-ctr)
[12:26:02] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
[12:26:08] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10875252 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm
[12:30:15] <wikibugs>	 (03PS1) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673
[12:30:25] <wikibugs>	 (03CR) 10Marostegui: [C:04-2] "This is WIP" [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui)
[12:32:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui)
[12:33:19] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
[12:34:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10875292 (10SLyngshede-WMF) 05In progress→03Resolved
[12:35:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
[12:35:24] <wikibugs>	 (03PS2) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673
[12:36:07] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675
[12:37:05] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
[12:37:20] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675
[12:37:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
[12:37:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
[12:41:29] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "looks good to me!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:41:45] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10875303 (10SLyngshede-WMF) a:05WMDECyn→03SLyngshede-WMF
[12:41:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10875304 (10SLyngshede-WMF)
[12:42:41] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675
[12:43:17] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: add neslihanturan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1152677 (https://phabricator.wikimedia.org/T394395)
[12:43:18] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: "Thanks for looking into this issue <3 Left just one small comment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos)
[12:43:39] <wikibugs>	 10ops-eqiad, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875307 (10Jclark-ctr)
[12:44:03] <wikibugs>	 (03PS2) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185)
[12:44:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:18] <wikibugs>	 (03PS3) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673
[12:45:37] <wikibugs>	 (03PS4) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675
[12:45:38] <wikibugs>	 (03CR) 10Marostegui: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui)
[12:46:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:47:22] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: reduce cpu usage in ml-staging for ref-need (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos)
[12:49:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:50:10] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+1] "Looks great, thank you <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos)
[12:50:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
[12:51:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:52:12] <icinga-wm>	 PROBLEM - Disk space on restbase2035 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 67229 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2035&var-datasource=codfw+prometheus/ops
[12:53:20] <wikibugs>	 (03PS1) 10Muehlenhoff: CAS: Add service definition for Zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304)
[12:54:28] <wikibugs>	 (03PS1) 10Aqu: Airflow: Increase k8s check frequency in analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152681 (https://phabricator.wikimedia.org/T369845)
[12:55:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:57:12] <wikibugs>	 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796#10875321 (10ayounsi) For doh and durum, I suggest that we wait for the Bird contract work defined in T362392#10875314 to land. The alternative is to implemen...
[12:57:46] <wikibugs>	 (03PS1) 10Gkyziridis: ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964)
[12:58:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) (owner: 10Gkyziridis)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1300)
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:11] <Lucas_WMDE>	 o/
[13:00:24] <Lucas_WMDE>	 nothing in the calendar so far indeed
[13:01:21] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos)
[13:01:26] <wikibugs>	 (03PS2) 10Gkyziridis: ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964)
[13:02:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) (owner: 10Gkyziridis)
[13:02:52] <bunnypranav>	 Lucas_WMDE: Hi, I have a patch waiting since last Friday, any chance you can deploy it? I did not schedule it though.
[13:03:14] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:03:31] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: reduce cpu usage in ml-staging for ref-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152675 (owner: 10Ilias Sarantopoulos)
[13:04:05] <Lucas_WMDE>	 bunnypranav: can you add it to the schedule now?
[13:04:23] <bunnypranav>	 Sure!
[13:04:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav)
[13:04:37] <bunnypranav>	 Done.
[13:05:16] * Lucas_WMDE looking
[13:05:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T395241)', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
[13:05:41] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[13:05:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T395241)', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
[13:06:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:07:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:08:19] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411)
[13:08:20] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): core-Namespaces: Add Page, Author to default search ns in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav)
[13:08:30] <wikibugs>	 (03PS2) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274
[13:08:30] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:08:56] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:25] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:26] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:09:30] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411)
[13:09:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav)
[13:09:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
[13:10:00] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#10875354 (10ayounsi) I see that there are now enough free ports on cloudsw1-d5-eqiad, @Jclark-ctr @dcaro I'm wondering if you could resume the...
[13:10:26] <wikibugs>	 (03Merged) 10jenkins-bot: core-Namespaces: Add Page, Author to default search ns in ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152191 (https://phabricator.wikimedia.org/T395632) (owner: 10Bunnypranav)
[13:10:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1152191|core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
[13:10:41] <stashbot>	 T395632: Change default namespaces of AdvancedSearch on Russian Wikisource - https://phabricator.wikimedia.org/T395632
[13:11:34] <wikibugs>	 (03PS1) 10Bartosz Wójtowicz: ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865)
[13:12:26] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411)
[13:13:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[13:13:03] <wikibugs>	 (03PS4) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411)
[13:13:36] <wikibugs>	 (03PS5) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411)
[13:13:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191|core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:14:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T395241)', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
[13:14:40] <Lucas_WMDE>	 bunnypranav: please test :)
[13:14:47] <bunnypranav>	 on it :)
[13:14:57] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[13:15:33] <wikibugs>	 (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[13:15:36] <bunnypranav>	 Yup, all good. Thanks! :D
[13:15:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 bunnypranav, lucaswerkmeister-wmde: Continuing with sync
[13:15:43] <Lucas_WMDE>	 great, thanks!
[13:16:19] * bunnypranav smiles joyfully
[13:16:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51  - jclark@cumin1002"
[13:16:51] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390540#10875385 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[13:16:59] <wikibugs>	 (03PS6) 10Marostegui: ms1: Move hosts to objectstash role [puppet] - 10https://gerrit.wikimedia.org/r/1152673
[13:16:59] <wikibugs>	 (03CR) 10Marostegui: "@Ladsgroup@gmail.com maybe this is all we need? https://puppet-compiler.wmflabs.org/output/1152673/5743/db1152.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1152673 (owner: 10Marostegui)
[13:17:14] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: Update custom_env for reference-need revision model in staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152686 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz)
[13:17:32] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51  - jclark@cumin1002"
[13:17:32] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:19:27] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[13:19:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:19:48] <wikibugs>	 (03CR) 10Ssingh: varnish: Don't let experiment_fetcher crash if endpoint is unavailable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:19:49] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:19:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:19:55] <wikibugs>	 (03CR) 10Gergő Tisza: SUL3: Enable client hints data on the auth shared domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01)
[13:20:11] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:20:41] <wikibugs>	 (03CR) 10Vgutierrez: varnish: Don't let experiment_fetcher crash if endpoint is unavailable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:20:59] <wikibugs>	 (03PS1) 10Marostegui: db1211: Make it sanitarium master for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152688 (https://phabricator.wikimedia.org/T390954)
[13:21:41] <wikibugs>	 (03CR) 10Marostegui: "Requires depooling and restarting mariadb on db1211" [puppet] - 10https://gerrit.wikimedia.org/r/1152688 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[13:21:44] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1211: Make it sanitarium master for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152688 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[13:21:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:22:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152191|core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s)
[13:22:44] <stashbot>	 T395632: Change default namespaces of AdvancedSearch on Russian Wikisource - https://phabricator.wikimedia.org/T395632
[13:22:46] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[13:22:46] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] varnish: Don't let experiment_fetcher crash if endpoint is unavailable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:22:50] <Lucas_WMDE>	 bunnypranav: should be done :)
[13:23:07] <bunnypranav>	 Cool, thanks again! :)
[13:24:20] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:12] <wikibugs>	 (03CR) 10Muehlenhoff: data.yaml: add neslihanturan to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152677 (https://phabricator.wikimedia.org/T394395) (owner: 10Slyngshede)
[13:27:26] <logmsgbot>	 bking@cumin2002 decommission (PID 3416233) is awaiting input
[13:27:37] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[13:27:41] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "As discussed on IRC: The "ops" group and the URL look correct to me." [puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304) (owner: 10Muehlenhoff)
[13:29:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json
[13:32:49] <wikibugs>	 (03PS3) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274
[13:33:50] <logmsgbot>	 jclark@cumin1002 provision (PID 2632669) is awaiting input
[13:34:07] <phuedx>	 Lucas_WMDE: Just to confirm, there are no deployments happening now, right?
[13:34:09] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Add db1211 to sanitarium master role [puppet] - 10https://gerrit.wikimedia.org/r/1152700 (https://phabricator.wikimedia.org/T390954)
[13:34:19] <Lucas_WMDE>	 phuedx: confirmed
[13:34:23] <Lucas_WMDE>	 not as far as I know, anyway
[13:34:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:34:25] <logmsgbot>	 jclark@cumin1002 provision (PID 2632730) is awaiting input
[13:34:27] <phuedx>	 Lucas_WMDE: Thanks
[13:34:51] <logmsgbot>	 jclark@cumin1002 provision (PID 2632432) is awaiting input
[13:34:53] <logmsgbot>	 jclark@cumin1002 provision (PID 2632542) is awaiting input
[13:35:28] <wikibugs>	 (03PS2) 10Marostegui: site.pp: Add db1211 to sanitarium master role [puppet] - 10https://gerrit.wikimedia.org/r/1152700 (https://phabricator.wikimedia.org/T390954)
[13:36:00] <phuedx>	 Experiment Platform is about to run an end to end test. There should be minimal disruption but I wanted to make sure that nothing is currently in flight
[13:36:31] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] site.pp: Add db1211 to sanitarium master role [puppet] - 10https://gerrit.wikimedia.org/r/1152700 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[13:37:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[13:37:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[13:37:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:37:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet
[13:38:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet
[13:39:16] <wikibugs>	 (03PS2) 10KartikMistry: Enable the Contribute menu (6th group) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930)
[13:40:01] <logmsgbot>	 jclark@cumin1002 provision (PID 2632542) is awaiting input
[13:40:02] <logmsgbot>	 jclark@cumin1002 provision (PID 2632432) is awaiting input
[13:40:04] <logmsgbot>	 jclark@cumin1002 provision (PID 2632730) is awaiting input
[13:41:00] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[13:41:26] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:41:30] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:41:33] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:41:43] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:43:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:44:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json
[13:49:36] <logmsgbot>	 bking@cumin2002 decommission (PID 3432509) is awaiting input
[13:49:43] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
[13:49:44] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
[13:49:53] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10875488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host prometheus7002.magru.wmnet with OS bookworm executed with errors: - prometheus7002...
[13:51:37] <wikibugs>	 (03CR) 10Majavah: "LGTM minus the few things inline. We can use codfw1dev cloudcontrols as the first tester for this" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[13:52:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10875495 (10Jclark-ctr)
[13:52:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10875496 (10Jclark-ctr) @Stevemunene  Finished upgrading drives to 8tb
[13:52:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:54:50] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152716
[13:55:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152716 (owner: 10PipelineBot)
[13:57:10] <icinga-wm>	 PROBLEM - Disk space on restbase2027 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 68649 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2027&var-datasource=codfw+prometheus/ops
[13:59:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T395241)', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json
[13:59:38] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[13:59:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T395241)', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json
[14:00:38] <wikibugs>	 10SRE-swift-storage, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX), 13Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10875560 (10Nikerabbit)
[14:01:52] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1152680 (https://phabricator.wikimedia.org/T395304) (owner: 10Muehlenhoff)
[14:04:28] <phuedx>	 !log Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
[14:04:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:56] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] add codfw to os-reports in service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1152308 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[14:06:24] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] trafficserver: point os-reports to k8s record [puppet] - 10https://gerrit.wikimedia.org/r/1152305 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[14:08:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T395241)', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json
[14:17:44] <wikibugs>	 (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-06-02-141244-production [puppet] - 10https://gerrit.wikimedia.org/r/1152742
[14:18:15] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-06-02-141244-production [puppet] - 10https://gerrit.wikimedia.org/r/1152742 (owner: 10Majavah)
[14:22:42] <wikibugs>	 (03PS1) 10Marostegui: db1154: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954)
[14:24:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json
[14:28:03] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10875724 (10Jgreen) Did the "get more trial time" step.
[14:28:35] <wikibugs>	 (03PS1) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717)
[14:29:02] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[14:31:05] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "Later we can probably go with 1/4th of the value based on my measurements." [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[14:31:11] <marostegui>	 jouncebot: next
[14:31:11] <jouncebot>	 In 0 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1530)
[14:31:22] <marostegui>	 meh, too close
[14:31:35] <wikibugs>	 (03CR) 10Marostegui: "I will do it now" [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[14:32:09] <wikibugs>	 (03PS2) 10Marostegui: db1154: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954)
[14:32:25] <wikibugs>	 (03CR) 10Marostegui: "Better to start more conservatively and then we can increase it if needed." [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[14:32:47] <wikibugs>	 (03PS2) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717)
[14:32:54] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[14:35:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[14:35:40] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1154: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[14:35:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[14:35:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:35:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet
[14:36:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance
[14:37:39] <wikibugs>	 (03CR) 10Ladsgroup: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1152743 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[14:39:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json
[14:39:45] <logmsgbot>	 !log joal@deploy1003 Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f]
[14:40:28] <wikibugs>	 (03PS3) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717)
[14:40:37] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[14:42:53] <logmsgbot>	 !log joal@deploy1003 Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s)
[14:43:22] <logmsgbot>	 !log joal@deploy1003 Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f]
[14:44:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:28] <logmsgbot>	 !log joal@deploy1003 Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s)
[14:44:51] <logmsgbot>	 !log joal@deploy1003 Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f]
[14:46:07] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Provide basic logging and metrics for experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411)
[14:47:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:48:05] <wikibugs>	 (03PS4) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717)
[14:48:19] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[14:50:46] <wikibugs>	 (03PS5) 10Bking: cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717)
[14:50:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875836 (10Andrew)
[14:52:29] <wikibugs>	 (03PS1) 10Phuedx: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757
[14:52:43] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] varnish: Provide basic logging and metrics for experiment_fetcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[14:52:47] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[14:53:12] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10875841 (10Dzahn)
[14:53:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875843 (10Andrew)
[14:54:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g - https://phabricator.wikimedia.org/T378828#10875845 (10Andrew) To simplify T394333, let's move cloudcephosd1046 to D5. That saves us having to move an already-in-service server. I've updated the racking details accordingly.
[14:54:18] <logmsgbot>	 !log joal@deploy1003 Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s)
[14:54:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T395241)', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json
[14:54:23] <wikibugs>	 (03CR) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[14:54:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[14:54:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T395241)', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json
[14:54:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10875852 (10Andrew)
[14:54:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10875853 (10Andrew)
[14:55:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10875856 (10Andrew) After conversation with @Jclark-ctr we're going to move cloudcephosd1046 (part of T378828 and not yet networked or in service) instead...
[14:56:00] <wikibugs>	 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796#10875858 (10ayounsi) Had a quick chat with Moritz and Sukhbir. We prefer not to wait for the Bird work to progress on setting up the Routed Ganeti cluster, s...
[14:56:00] <wikibugs>	 (03PS2) 10Phuedx: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757
[14:56:47] <wikibugs>	 (03PS6) 10Vgutierrez: varnish: Don't let wmfuniq_experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411)
[14:56:47] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Provide basic logging and metrics for wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411)
[14:56:56] <wikibugs>	 (03PS3) 10Phuedx: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757
[14:57:11] <wikibugs>	 06SRE, 06Traffic: Move durum7003 / doh7003 / doh7003 into service and decom doh7002 / durum7002 / ncredir7002 - https://phabricator.wikimedia.org/T395796#10875859 (10ssingh) >>! In T395796#10875858, @ayounsi wrote: > Had a quick chat with Moritz and Sukhbir. > We prefer not to wait for the Bird work to progres...
[14:57:26] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] varnish: Provide basic logging and metrics for wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1152754 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[14:57:29] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] varnish: Don't let wmfuniq_experiment_fetcher crash if endpoint is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1152685 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[14:57:59] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] "looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 (owner: 10Phuedx)
[14:58:33] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:00:18] <phuedx>	 !log Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
[15:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:19] <wikibugs>	 (03PS1) 10Santiago Faci: xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899)
[15:01:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci)
[15:01:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T395241)', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json
[15:02:02] <wikibugs>	 (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci)
[15:03:12] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c]
[15:03:19] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s)
[15:03:53] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f]
[15:03:58] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s)
[15:04:38] <phuedx>	 Is there room available for a config deployment? There are no active backports right now
[15:05:13] <wikibugs>	 (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci)
[15:07:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:44] <wikibugs>	 (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[15:10:12] <wikibugs>	 (03CR) 10Ebernhardson: cirrussearch: use correct port for snapshot monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[15:13:58] <wikibugs>	 (03PS4) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274
[15:14:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json
[15:15:16] <wikibugs>	 (03CR) 10Bking: cirrussearch: use correct port for snapshot monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[15:15:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
[15:15:57] <wikibugs>	 (03CR) 10Santiago Faci: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci)
[15:16:10] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10875934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2009.codfw.wmnet with OS bu...
[15:16:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json
[15:17:02] <wikibugs>	 (03PS1) 10Marostegui: check_private_data_report: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152760 (https://phabricator.wikimedia.org/T390954)
[15:17:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:42] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[15:19:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[15:19:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10875941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcontrol2010-dev.codfw.wmnet with OS boo...
[15:21:54] <thcipriani>	 !log jouncebot nowandnext
[15:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:58] <wikibugs>	 (03CR) 10Majavah: profile::memcached::instance: Add support for nftables-compatible config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[15:21:59] <thcipriani>	 dangit
[15:22:07] <thcipriani>	 jouncebot: nowandnext
[15:22:07] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 7 minute(s)
[15:22:07] <jouncebot>	 In 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1530)
[15:22:52] <thcipriani>	 phuedx: ^ looks like you should be clear
[15:22:59] <phuedx>	 thcipriani: Thanks <3
[15:23:50] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[15:25:34] <wikibugs>	 (03CR) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[15:25:41] <wikibugs>	 (03PS5) 10Muehlenhoff: profile::memcached::instance: Add support for nftables-compatible config [puppet] - 10https://gerrit.wikimedia.org/r/1152274
[15:26:30] <logmsgbot>	 !log joal@deploy1003 Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552]
[15:27:12] <logmsgbot>	 !log joal@deploy1003 Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s)
[15:27:18] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: use correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1152746 (https://phabricator.wikimedia.org/T395717) (owner: 10Bking)
[15:27:40] <phuedx>	 thcipriani: Just confirming a detail in the codebase. Then I'll proceed
[15:27:53] <thcipriani>	 ack
[15:29:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json
[15:30:05] <jouncebot>	 jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1530). Please do the needful.
[15:30:16] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152274 (owner: 10Muehlenhoff)
[15:32:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json
[15:32:15] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] check_private_data_report: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1152760 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[15:32:26] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] "[17:31:54]  <Amir1> marostegui: on phone so can't do gerrit but 1152760 has my +1" [puppet] - 10https://gerrit.wikimedia.org/r/1152760 (https://phabricator.wikimedia.org/T390954) (owner: 10Marostegui)
[15:34:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 (owner: 10Phuedx)
[15:34:14] <wikibugs>	 (03CR) 10Hashar: "recheck after having reverted a faulty CI config change ( 8603b5e9181fecebee5ad171de61bdfe6c6947e5 )" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci)
[15:34:47] <wikibugs>	 (03Merged) 10jenkins-bot: Enable MetricsPlatform's experimentation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152757 (owner: 10Phuedx)
[15:35:01] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1152757|Enable MetricsPlatform's experimentation feature]]
[15:37:27] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: 10Santiago Faci)
[15:38:13] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1152757|Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:38:33] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:38:47] <wikibugs>	 (Merged) jenkins-bot: xLab: Deploying xLab v0.6.4 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1152759 (https://phabricator.wikimedia.org/T392899) (owner: Santiago Faci)
[15:39:21] <phuedx>	 Confirmed that the product_metrics.web_base stream is configured correctly in labs and production realms
[15:39:33] <phuedx>	 Checking logs on enwiki
[15:41:27] <phuedx>	 Logs for MetricsPlatform extension indicate that there's no config fetching going on, which is what we want
[15:42:27] <phuedx>	 Continuing
[15:42:33] <logmsgbot>	 !log phuedx@deploy1003 phuedx: Continuing with sync
[15:42:34] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
[15:44:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json
[15:44:44] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[15:46:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
[15:47:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T395241)', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json
[15:47:17] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[15:47:28] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[15:47:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T395241)', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json
[15:47:44] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[15:49:25] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152757|Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s)
[15:50:47] <sukhe>	 !log disable puppet on A:cp to merge CR: 1091330
[15:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:03] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR 1091330]
[15:53:24] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] trafficserver: explicitly specify user/group for systemd unit [puppet] - https://gerrit.wikimedia.org/r/1091330 (owner: Ssingh)
[15:53:51] <wikibugs>	 (PS1) Máté Szabó: ORES: Allow using RRML for pre-save revert risk detection [mediawiki-config] - https://gerrit.wikimedia.org/r/1152770 (https://phabricator.wikimedia.org/T364705)
[15:54:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T395241)', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json
[15:55:32] <sukhe>	 !log enable puppet and run agent on cp7001
[15:55:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:45] <wikibugs>	 SRE, Fundraising-Backlog, fundraising-tech-ops, Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10876055 (Dzahn) This seems like a continuation of T330944 from 2023.
[15:59:45] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[15:59:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to /var/cache/conftool/dbconfig/20250602-155946-root.json
[16:03:05] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR 1091330]
[16:09:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json
[16:14:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json
[16:15:43] <wikibugs>	 (CR) A smart kitten: ores-extension: enable revertrisk filter for a list of wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T391964) (owner: Gkyziridis)
[16:18:09] <wikibugs>	 (CR) Vgutierrez: conftool: rm ats-be services cache nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1114074 (owner: BCornwall)
[16:22:57] <sukhe>	 !log sudo cumin -b1 -s60 'A:cp and not P{cp7001*}' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn"
[16:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:35] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:24:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json
[16:30:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json
[16:36:00] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:36:20] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:40:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T395241)', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json
[16:40:05] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[16:40:10] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10876331 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcontrol2010-dev.codfw.wmnet with OS bookwor...
[16:40:23] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[16:40:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T395241)', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json
[16:43:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:44:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:47:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T395241)', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json
[16:49:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
[16:50:07] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10876368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcontrol2010-dev.codfw.wmnet with OS boo...
[16:50:51] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:52:49] <wikibugs>	 (PS1) Bking: elastic/cirrussearch: remove decomm'd hosts [puppet] - https://gerrit.wikimedia.org/r/1152775 (https://phabricator.wikimedia.org/T394350)
[16:53:50] <icinga-wm>	 PROBLEM - Disk space on an-worker1117 is CRITICAL: DISK CRITICAL - free space: / 2124 MB (3% inode=94%): /tmp 2124 MB (3% inode=94%): /var/tmp 2124 MB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1117&var-datasource=eqiad+prometheus/ops
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1700)
[17:00:05] <jouncebot>	 ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T1700).
[17:01:22] <wikibugs>	 (PS1) Phuedx: ext.xLab: Send limited copies of stream configs [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988)
[17:01:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: Phuedx)
[17:02:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json
[17:04:02] <logmsgbot>	 !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet
[17:05:44] <logmsgbot>	 !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet
[17:08:27] <wikibugs>	 (CR) Jasmine: [C:+2] wikikube: decommission wikikube-worker102[6-8].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1151759 (https://phabricator.wikimedia.org/T383227) (owner: Jasmine)
[17:15:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102#10876497 (Jhancock.wm) a:Jhancock.wm→Andrew @Andrew not sure why but i can't get it to pxe at all anymore. Can you take a look for me? Thank you!
[17:18:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json
[17:20:55] <logmsgbot>	 jasmine@cumin1002 decommission (PID 2875802) is awaiting input
[17:21:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876535 (Jclark-ctr) @Andrew  @dcaro   Fyi these have Boss cards and are not supported with  legacy bios
[17:22:09] <wikibugs>	 (PS2) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1152716
[17:22:59] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[17:23:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876552 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[17:32:12] <icinga-wm>	 PROBLEM - Disk space on restbase2035 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68407 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2035&var-datasource=codfw+prometheus/ops
[17:32:42] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading, UploadWizard: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page - https://phabricator.wikimedia.org/T395773#10876663 (Umherirrender) Happens sometimes, {T17430} / T393952  Please create the file page wi...
[17:33:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T395241)', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json
[17:33:22] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: add a second replica, start replicating to gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[17:33:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[17:33:49] <wikibugs>	 (PS1) Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781
[17:38:43] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[17:38:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T395241)', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json
[17:39:44] <wikibugs>	 (CR) CI reject: [V:-1] sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[17:42:04] <wikibugs>	 (CR) Ssingh: "Unrelated:" [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[17:44:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:45:47] <wikibugs>	 (CR) Dzahn: [C:+2] "[ssh-connection]: Failed (UnsupportedCredentialItem) to execute: ssh://gerrit2@gerrit2003.wikimedia.org:22: org.eclipse.jgit.transport.Cre" [puppet] - https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[17:45:50] <wikibugs>	 (PS1) Dzahn: gerrit: add ssh_host_rsa public key for gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804)
[17:45:54] <wikibugs>	 (CR) Ssingh: "@rcoccioli@wikimedia.org: self.phabricator looks OK here for sre/discovery/datacenter.py but is failing CI. I can try to dig into this but" [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[17:46:16] <wikibugs>	 (PS2) Dzahn: gerrit: add ssh_host_rsa public key for gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804)
[17:46:52] <wikibugs>	 (CR) Dzahn: "taken from /etc/ssh/ssh_host_rsa_key.pub" [puppet] - https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[17:47:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:47:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T395241)', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json
[17:47:09] <logmsgbot>	 jclark@cumin1002 reimage (PID 2880646) is awaiting input
[17:49:02] <wikibugs>	 (CR) Dzahn: [C:+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152782" [puppet] - https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[17:49:34] <logmsgbot>	 jasmine@cumin1002 decommission (PID 2905278) is awaiting input
[17:50:05] <wikibugs>	 (CR) Dzahn: [C:+2] "replication to 2002 seems just fine.. just there was no host key for 2003 yet." [puppet] - https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[17:50:15] <logmsgbot>	 !log jasmine@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet
[17:53:01] <wikibugs>	 (PS1) Andrew Bogott: octavia: move octavia amphorae (and auth) to 'octavia' project [puppet] - https://gerrit.wikimedia.org/r/1152783 (https://phabricator.wikimedia.org/T393783)
[18:00:25] <wikibugs>	 (CR) Andrew Bogott: [C:+2] octavia: move octavia amphorae (and auth) to 'octavia' project [puppet] - https://gerrit.wikimedia.org/r/1152783 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[18:02:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json
[18:02:21] <logmsgbot>	 !log jasmine@cumin1002 START - Cookbook sre.dns.netbox
[18:02:26] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: add ssh_host_rsa public key for gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1152782 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn)
[18:05:52] <logmsgbot>	 !log jasmine@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
[18:06:45] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:06:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876902 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye exe...
[18:07:07] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:07:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10876904 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[18:09:01] <logmsgbot>	 jasmine@cumin1002 decommission (PID 2905278) is awaiting input
[18:09:19] <wikibugs>	 (CR) Herron: [C:+1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[18:10:15] <logmsgbot>	 !log jasmine@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
[18:10:15] <logmsgbot>	 !log jasmine@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:10:15] <logmsgbot>	 !log jasmine@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet
[18:17:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json
[18:21:20] <wikibugs>	 (CR) Dr0ptp4kt: [C:+1] ext.xLab: Send limited copies of stream configs [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: Phuedx)
[18:21:24] <logmsgbot>	 !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6
[18:21:53] <logmsgbot>	 !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s)
[18:23:47] <wikibugs>	 (PS3) Ilias Sarantopoulos: ores-extension: enable revertrisk filter for a list of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1152682 (https://phabricator.wikimedia.org/T395823) (owner: Gkyziridis)
[18:23:48] <brett>	 !log include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia
[18:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:48] <brett>	 !log include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia
[18:24:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:54] <wikibugs>	 (CR) BCornwall: "Not sure if this is a path we want to go down but this would be what's necessary to switch to using variables." [puppet] - https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: BCornwall)
[18:31:57] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:32:16] <sukhe>	 oh sigh
[18:32:17] <sukhe>	 this is me
[18:32:28] <sukhe>	 !incidents
[18:32:29] <sirenbot>	 6274 (UNACKED)  [9x] ProbeDown sre (probes/service)
[18:32:29] <sirenbot>	 6267 (RESOLVED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[18:32:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T395241)', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json
[18:32:32] <sukhe>	 !ack 6274
[18:32:33] <sirenbot>	 6274 (ACKED)  [9x] ProbeDown sre (probes/service)
[18:32:45] <sukhe>	 fixing
[18:32:51] <urandom>	 thanks sukhe 
[18:32:58] <sukhe>	 should resolve
[18:33:30] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn|ats-be)
[18:33:58] <sukhe>	 sorry about that :]
[18:34:30] <jinxer-wm>	 FIRING: LibericaDiffFPCheck: Liberica instance lvs3010:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?var-site=esams&var-instance=lvs3010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck
[18:34:45] <logmsgbot>	 !log jasmine@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet
[18:36:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[18:36:25] <sukhe>	 !incidents
[18:36:25] <sirenbot>	 6274 (ACKED)  [9x] ProbeDown sre (probes/service)
[18:36:25] <sirenbot>	 6275 (UNACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[18:36:26] <sirenbot>	 6267 (RESOLVED)  [5x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[18:36:30] <sukhe>	 !ack 6275
[18:36:31] <sirenbot>	 6275 (ACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[18:36:44] <sukhe>	 should be resolving soon, this is related to the alert above
[18:36:54] <sukhe>	 that's definitely me so nothing unrelated
[18:36:57] <jinxer-wm>	 RESOLVED: [9x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:37:01] <sukhe>	 cool :)
[18:37:15] <sukhe>	 (not really since I messed up but yes, the resolution)
[18:38:13] <urandom>	 😀
[18:38:15] <logmsgbot>	 jasmine@cumin1002 decommission (PID 2955991) is awaiting input
[18:39:30] <jinxer-wm>	 FIRING: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck  - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck
[18:39:55] <Krinkle>	 from ulsfo, I found wikipedia.org, commons.wikimedia.org, doc.wikimedia.org etc unreachable for a hot minute there.
[18:40:02] <sukhe>	 yes please
[18:40:33] <sukhe>	 that was me -- sorry about that, I was assuming commands were being run in parallel and they were not. screen scrollback let me down.
[18:40:55] <mutante>	 thanks for reacting so quickly that I did not even get the text
[18:41:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[18:43:16] <greg-g>	 sukhe: yeah, just confirming I hit it too, trying to load metawiki (I'm in the greater LA area). I have a traceroute from the time, but it looks like you know what the issue was :)
[18:43:57] <wikibugs>	 (PS1) Andrew Bogott: preseed.yaml: try to use the boss card (hw raid1) for new cloudcephosds [puppet] - https://gerrit.wikimedia.org/r/1152789 (https://phabricator.wikimedia.org/T394333)
[18:44:18] <logmsgbot>	 !log jasmine@cumin1002 START - Cookbook sre.dns.netbox
[18:46:56] <logmsgbot>	 !log jasmine@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:46:57] <logmsgbot>	 !log jasmine@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker1028.eqiad.wmnet
[18:47:01] <wikibugs>	 (CR) Andrew Bogott: [C:+2] preseed.yaml: try to use the boss card (hw raid1) for new cloudcephosds [puppet] - https://gerrit.wikimedia.org/r/1152789 (https://phabricator.wikimedia.org/T394333) (owner: Andrew Bogott)
[18:47:14] <wikibugs>	 (CR) BCornwall: "`" [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[18:50:07] <wikibugs>	 (PS1) Andrew Bogott: New cloudcephosds -> puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1152792
[18:50:45] <wikibugs>	 (PS2) Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781
[18:51:11] <wikibugs>	 (CR) Ssingh: "Thanks, updated!" [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[18:53:10] <sukhe>	 retro from the above: the cookbook automates what I was trying to do via cumin ^ the first attempt worked but then errored out, and in the second attempt I ran it without -b1 -s60
[18:53:15] <sukhe>	 the cookbook should prevent that from happening again
[18:53:46] <sukhe>	 the full command was:
[18:53:48] <sukhe>	 sudo cumin -b1 -s60 "A:cp and not A:cp-codfw and not P{cp7001* or cp1100* or cp1101* or cp1102* or cp1103* or cp1104*}" "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn"
[18:57:54] <wikibugs>	 (CR) CI reject: [V:-1] sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[18:59:01] <wikibugs>	 (CR) Ssingh: "The unrelated cookbook error was not a red herring, even if the import order was not correct." [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[18:59:42] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: Arlolra)
[19:05:19] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3010.esams.wmnet} and A:liberica
[19:05:33] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3010.esams.wmnet} and A:liberica
[19:05:43] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3010.esams.wmnet} and A:liberica
[19:05:52] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs3010.esams.wmnet} and A:liberica
[19:06:13] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs3010.esams.wmnet} and A:liberica
[19:06:15] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3010.esams.wmnet} and A:liberica
[19:06:52] <wikibugs>	 (PS1) Jsn.sherman: Undeploy first set of Patroller Tools surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401)
[19:08:12] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[19:08:34] <wikibugs>	 (CR) BCornwall: [C:+1] "Sorry, shouldn't have replied here. Unresolving for @rcoccioli@wikimedia.org to look at." [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[19:09:30] <jinxer-wm>	 FIRING: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck  - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck
[19:09:44] <sukhe>	 on this, clearing these up
[19:09:47] <sukhe>	 one down, two to go
[19:13:25] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393)
[19:14:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[19:14:21] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3009.esams.wmnet} and A:liberica
[19:14:36] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3009.esams.wmnet} and A:liberica
[19:14:46] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3009.esams.wmnet} and A:liberica
[19:14:55] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs3009.esams.wmnet} and A:liberica
[19:14:57] <wikibugs>	 (03PS2) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393)
[19:15:19] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs3009.esams.wmnet} and A:liberica
[19:15:21] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3009.esams.wmnet} and A:liberica
[19:17:10] <icinga-wm>	 RECOVERY - Disk space on restbase2027 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2027&var-datasource=codfw+prometheus/ops
[19:19:30] <jinxer-wm>	 FIRING: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck  - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck
[19:19:48] <sukhe>	 ^ going away now
[19:19:51] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3008.esams.wmnet} and A:liberica
[19:20:05] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3008.esams.wmnet} and A:liberica
[19:20:15] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3008.esams.wmnet} and A:liberica
[19:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: nfacctd.service on netflow3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:20:25] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin pooling P{lvs3008.esams.wmnet} and A:liberica
[19:20:45] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs3008.esams.wmnet} and A:liberica
[19:20:47] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3008.esams.wmnet} and A:liberica
[19:24:30] <jinxer-wm>	 RESOLVED: [3x] LibericaDiffFPCheck: Liberica instance lvs3008:9100 control plane status doesn't match with forwarding plane status - https://wikitech.wikimedia.org/wiki/Liberica#LibericaDiffFPCheck  - https://alerts.wikimedia.org/?q=alertname%3DLibericaDiffFPCheck
[19:25:19] <wikibugs>	 (03PS3) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393)
[19:32:34] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[19:35:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: nfacctd.service on netflow3003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:36:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[19:37:27] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "+1 if puppet runs on the active host before doc2003, should be fine" [puppet] - 10https://gerrit.wikimedia.org/r/1152300 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth)
[19:40:34] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[19:40:51] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[19:46:33] <wikibugs>	 (03PS1) 10Dzahn: gerrit: introduce second daemon_user name [puppet] - 10https://gerrit.wikimedia.org/r/1152810 (https://phabricator.wikimedia.org/T338470)
[19:47:10] <icinga-wm>	 PROBLEM - Disk space on restbase2030 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 67805 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2030&var-datasource=codfw+prometheus/ops
[19:49:28] <mutante>	 T394955
[19:49:28] <stashbot>	 T394955: when servers are about to run out of disk, monitoring should notify the owners  - https://phabricator.wikimedia.org/T394955
[19:52:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:52:12] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[19:55:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:55:26] <wikibugs>	 06SRE: restbase2030 running low on disk space - https://phabricator.wikimedia.org/T395845 (10Dzahn) 03NEW
[19:56:40] <wikibugs>	 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10877281 (10Dzahn) T394955
[19:56:49] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877285 (10Dzahn) T394955
[19:59:14] <wikibugs>	 06SRE: restbase2030 running low on disk space - https://phabricator.wikimedia.org/T395845#10877302 (10Dzahn) also see T390630
[19:59:22] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] Simple summaries survey for English (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T2000).
[20:00:05] <jouncebot>	 phuedx, arlolra, and JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <JSherman>	 I'm here
[20:00:41] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877308 (10Dzahn) restbase2030 is soon running out and is alerting: T395845 if you could take a look at...
[20:01:05] <phuedx>	 o/
[20:01:12] <arlolra>	 here
[20:03:50] <cjming>	 hi - i can deploy but maybe everyone in the queue can/wants to self-deploy?
[20:04:12] <cjming>	 since spiderpig is pure joy
[20:04:19] <JSherman>	 I can self deploy
[20:04:34] <arlolra>	 as can i
[20:04:51] <phuedx>	 As can I
[20:04:58] <cjming>	 phuedx: do you want me to take care of your patch? and then i can pass onto arlolra + JSherman?
[20:05:12] <phuedx>	 Sure. I'll stick around to verify it
[20:06:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: 10Phuedx)
[20:07:28] <wikibugs>	 (03Merged) 10jenkins-bot: ext.xLab: Send limited copies of stream configs [extensions/MetricsPlatform] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152779 (https://phabricator.wikimedia.org/T391988) (owner: 10Phuedx)
[20:07:45] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1152779|ext.xLab: Send limited copies of stream configs (T391988)]]
[20:07:48] <stashbot>	 T391988: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: Route experiment-oriented MediaWiki JavaScript-based events conditionally - https://phabricator.wikimedia.org/T391988
[20:10:02] <logmsgbot>	 !log cjming@deploy1003 cjming, phuedx: Backport for [[gerrit:1152779|ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:10:16] <cjming>	 phuedx: can you verify?
[20:10:41] <phuedx>	 cjming: On it
[20:13:16] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] Simple summaries survey for English (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[20:14:46] <wikibugs>	 (03PS3) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185)
[20:14:51] <wikibugs>	 (03CR) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: 10D3r1ck01)
[20:15:14] <wikibugs>	 (03PS4) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393)
[20:16:04] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: T395240
[20:16:16] <phuedx>	 cjming: Confirmed that there's nothing in the logs. I've also confirmed that minimal stream configs are being sent to the browser by the extension
[20:16:19] <phuedx>	 LGTM
[20:16:23] <cjming>	 yay
[20:16:27] <logmsgbot>	 !log cjming@deploy1003 cjming, phuedx: Continuing with sync
[20:17:22] <wikibugs>	 (03PS5) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393)
[20:18:13] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[20:18:23] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "Ready to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[20:21:19] <logmsgbot>	 andrew@cumin1002 reimage (PID 2969213) is awaiting input
[20:22:17] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: T395240
[20:23:35] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:23:37] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152779|ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
[20:23:40] <stashbot>	 T391988: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: Route experiment-oriented MediaWiki JavaScript-based events conditionally - https://phabricator.wikimedia.org/T391988
[20:23:46] <cjming>	 phuedx: should be live!
[20:23:50] <cjming>	 arlolra: all yours
[20:23:56] <arlolra>	 thanks
[20:24:00] * phuedx checks
[20:24:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra)
[20:25:19] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wgParserEnableLegacyHeadingDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra)
[20:25:29] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1152165|Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
[20:25:32] <stashbot>	 T371756: [1.45] Remove wgParserEnableLegacyHeadingDOM option to disable new heading HTML - https://phabricator.wikimedia.org/T371756
[20:26:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
[20:26:37] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: T395240
[20:27:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[20:27:10] <icinga-wm>	 PROBLEM - Disk space on restbase2030 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 61828 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2030&var-datasource=codfw+prometheus/ops
[20:27:21] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: T395240
[20:27:25] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1152165|Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:27:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
[20:29:04] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Continuing with sync
[20:29:16] <wikibugs>	 (03PS1) 10BCornwall: lvs: Switch lvs1017/lvs1020 primary [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145)
[20:29:44] <kimberly_sarabia>	 Hello. We added something last minute. If you cannot get to it, no worries. 
[20:30:37] <cjming>	 hi kimberly_sarabia - happy to deploy your patch - JSherman, will you lmk when you're done?
[20:30:50] <JSherman>	 cjming: sure thing!
[20:31:14] <kimberly_sarabia>	 cjming: tyty
[20:31:47] <wikibugs>	 (03PS2) 10BCornwall: lvs: Switch lvs1017/lvs1020 primary [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145)
[20:33:45] <wikibugs>	 (03PS2) 10Jsn.sherman: Undeploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401)
[20:35:19] <JSherman>	 I'm loving the status column on spider pig
[20:35:32] <cjming>	 ++
[20:36:07] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152165|Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
[20:36:10] <stashbot>	 T371756: [1.45] Remove wgParserEnableLegacyHeadingDOM option to disable new heading HTML - https://phabricator.wikimedia.org/T371756
[20:36:24] <arlolra>	 JSherman: all yours
[20:36:30] <JSherman>	 arlolra: thanks!
[20:36:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[20:37:10] <mutante>	 while looking at that progress bar... humming 'does whatever a spiderpig does'
[20:38:31] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy first set of Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152797 (https://phabricator.wikimedia.org/T389401) (owner: 10Jsn.sherman)
[20:38:46] <logmsgbot>	 !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1152797|Undeploy first set of Patroller Tools surveys (T389401)]]
[20:38:49] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:38:53] <wikibugs>	 (03PS6) 10Kimberly Sarabia: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393)
[20:39:26] <JSherman>	 mutante: 100% same
[20:39:54] <phuedx>	 If someone were to patch it to have faint background music…
[20:41:23] <logmsgbot>	 !log jsn@deploy1003 jsn: Backport for [[gerrit:1152797|Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:42:30] <JSherman>	 verifying...
[20:45:03] <logmsgbot>	 !log jsn@deploy1003 jsn: Continuing with sync
[20:46:05] <icinga-wm>	 PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[20:47:03] <icinga-wm>	 RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[20:51:42] <logmsgbot>	 !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152797|Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s)
[20:51:47] <stashbot>	 T389401: Deploy first set of Patroller Tools surveys - https://phabricator.wikimedia.org/T389401
[20:52:39] <wikibugs>	 (03PS1) 10Dzahn: gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804)
[20:52:41] <wikibugs>	 (03CR) 10BCornwall: [C:04-2] "The incumbent code checks for `X-WMF-UUID` headers that have been set and passes the value in to `X-Analytics`. We need to figure out what" [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins)
[20:52:54] <cjming>	 JSherman: ok to take over?
[20:53:00] <JSherman>	 cjming: all yours
[20:53:05] <cjming>	 ty!
[20:53:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[20:53:57] <JSherman>	 (sorry for being slow, I was just spot checking w/o the debug host)
[20:54:14] <cjming>	 no worries!
[20:54:52] <wikibugs>	 (03Merged) 10jenkins-bot: Simple summaries survey for English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[20:55:07] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1152801|Simple summaries survey for English (T389393)]]
[20:55:09] <stashbot>	 T389393: Summaries: Create QuickSurvey for community prototype - https://phabricator.wikimedia.org/T389393
[20:55:36] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877528 (10Eevans) >>! In T390630#10877285, @Dzahn wrote: > {T394955}  This one is a bit different to th...
[20:55:40] <wikibugs>	 06SRE: restbase2030 running low on disk space - https://phabricator.wikimedia.org/T395845#10877529 (10Eevans) a:03Eevans
[20:56:49] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[20:56:56] <logmsgbot>	 !log cjming@deploy1003 cjming, ksarabia: Backport for [[gerrit:1152801|Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:57:14] <cjming>	 kimberly_sarabia ^^
[20:59:01] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[20:59:01] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[20:59:36] <kimberly_sarabia>	 cjming: LGTM!
[20:59:46] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:59:51] <logmsgbot>	 !log cjming@deploy1003 cjming, ksarabia: Continuing with sync
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T2100)
[21:01:16] <wikibugs>	 (03CR) 10CDobbins: "Thanks for the feedback. That makes sense. I originally used git grep to try to find where it's being set, but all that came up was the if" [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins)
[21:02:02] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] elastic/cirrussearch: remove decomm'd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1152775 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking)
[21:04:53] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
[21:05:12] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet with OS bullseye
[21:06:48] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152801|Simple summaries survey for English (T389393)]] (duration: 11m 41s)
[21:06:51] <stashbot>	 T389393: Summaries: Create QuickSurvey for community prototype - https://phabricator.wikimedia.org/T389393
[21:09:46] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate mobileapps.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:11:43] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[21:11:54] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1152817/5749/lvs1017.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall)
[21:12:11] <icinga-wm>	 RECOVERY - Disk space on restbase2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2035&var-datasource=codfw+prometheus/ops
[21:12:38] <wikibugs>	 (03CR) 10Ssingh: "@vgutierrez@wikimedia.org see above." [puppet] - 10https://gerrit.wikimedia.org/r/1152817 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall)
[21:13:47] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet
[21:16:13] <logmsgbot>	 !log tgr@deploy1003 Locking from deployment [MediaWiki]: T395758
[21:16:16] <wikibugs>	 (03PS2) 10Dzahn: gerrit: replace gerrit2003 RSA host key with ed25519 host key [puppet] - 10https://gerrit.wikimedia.org/r/1152819 (https://phabricator.wikimedia.org/T372804)
[21:16:35] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10877596 (10Scott_French) Thanks Eric and Daniel. +1 to Eric's articulation of how monitoring sessionstor...
[21:16:45] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet|cirrussearch2056.codfw.wmnet|cirrussearch2057.codfw.wmnet|cirrussearch2058.codfw.wmnet|cirrussearch2059.codfw.wmnet|cirrussearch2060.codfw.wmnet|cirrussearch2091.codfw.wmnet
[21:16:45] <wikibugs>	 06SRE: restbase2030 (and others) running low on disk space - https://phabricator.wikimedia.org/T395845#10877597 (10Eevans)
[21:20:50] <wikibugs>	 (03Abandoned) 10CDobbins: replace X-WMF-UUID with vmod_var variable [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins)
[21:22:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for T395855 - bking@cumin2002
[21:22:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for T395855 - bking@cumin2002
[21:22:22] <stashbot>	 T395855: Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855
[21:23:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877622 (10Andrew) @Jclark-ctr, the new preseed recipe seems to work ok, 1048 is now reimaging properly. 1049 failed for me in a totally different way but...
[21:25:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877624 (10Jclark-ctr) @Andrew  thanks i was looking at 1048 right now also i see it imaging!  yea i have not adjusted a few settings for the rest work on...
[21:30:07] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10877637 (10bking) a:05bking→03None
[21:32:16] <wikibugs>	 (03PS1) 10Ryan Kemper: cirrus: add missing entry for cirrussearch2061 [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610)
[21:32:27] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper)
[21:34:37] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] cirrus: add missing entry for cirrussearch2061 [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper)
[21:34:45] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[21:38:09] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
[21:38:45] <logmsgbot>	 !log tgr@deploy1003 Unlocked for deployment [MediaWiki]: T395758 (duration: 22m 32s)
[21:38:45] <wikibugs>	 (03PS1) 10Ryan Kemper: cirrus: remove 6 codfw hosts from pybal [puppet] - 10https://gerrit.wikimedia.org/r/1152831 (https://phabricator.wikimedia.org/T390901)
[21:44:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:46:18] <wikibugs>	 (03PS2) 10Andrew Bogott: New cloudcephosds -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1152792
[21:46:18] <wikibugs>	 (03PS1) 10Andrew Bogott: eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783)
[21:47:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:47:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] New cloudcephosds -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1152792 (owner: 10Andrew Bogott)
[21:47:49] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[21:50:17] <maryum>	 preparing to do a security deploy with scap sync-world
[21:51:00] <maryum>	 to deploy some security patches and a config change in PrivateSettings.php
[21:51:33] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] Drop Chart roll-out dblists, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) (owner: 10Jforrester)
[21:52:05] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350#10877692 (10bking) Data Platform SRE steps are finished (we think). Sending to...
[21:52:42] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] build: Rename the rarely-used 'typos' script to 'checkTypos' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151781 (owner: 10Jforrester)
[21:53:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
[21:53:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bullseye
[21:59:06] <wikibugs>	 (03CR) 10Bking: [C:03+1] cirrus: add missing entry for cirrussearch2061 [puppet] - 10https://gerrit.wikimedia.org/r/1152830 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper)
[22:01:33] <wikibugs>	 (03PS1) 10Andrew Bogott: Move octavia secrets into deployment-specific subdirs, add for eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1152839 (https://phabricator.wikimedia.org/T393783)
[22:01:34] <wikibugs>	 (03PS1) 10Andrew Bogott: Correct the name of a fake octavia password [labs/private] - 10https://gerrit.wikimedia.org/r/1152840 (https://phabricator.wikimedia.org/T393783)
[22:01:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye
[22:01:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877711 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye
[22:03:22] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152831 (https://phabricator.wikimedia.org/T390901) (owner: 10Ryan Kemper)
[22:03:29] <wikibugs>	 (03PS2) 10Andrew Bogott: eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783)
[22:03:29] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack octavia: move secrets into a codfw1dev subdir [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783)
[22:03:34] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrus: remove 6 codfw hosts from pybal [puppet] - 10https://gerrit.wikimedia.org/r/1152831 (https://phabricator.wikimedia.org/T390901) (owner: 10Ryan Kemper)
[22:04:14] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:06:28] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Move octavia secrets into deployment-specific subdirs, add for eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1152839 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:06:33] <wikibugs>	 (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Correct the name of a fake octavia password [labs/private] - 10https://gerrit.wikimedia.org/r/1152840 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:06:37] <wikibugs>	 (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Move octavia secrets into deployment-specific subdirs, add for eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1152839 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:07:01] <wikibugs>	 (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:07:09] <wikibugs>	 (03PS2) 10Andrew Bogott: Openstack octavia: move secrets into a codfw1dev subdir. [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783)
[22:07:11] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:08:59] <maryum>	 !log scap sync-world finished deploying PrivateSettings.php changes and fixes for several security bugs
[22:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack octavia: move secrets into a codfw1dev subdir. [puppet] - 10https://gerrit.wikimedia.org/r/1152841 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:13:28] <wikibugs>	 (03Abandoned) 10Andrew Bogott: dnsrecursor: make network-timeout configurable, reduce for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott)
[22:14:12] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:16:04] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[22:16:56] <wikibugs>	 (03PS3) 10Andrew Bogott: eqiad1: install Octavia lbaas [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783)
[22:17:03] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152833 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[22:29:02] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye
[22:29:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye
[22:32:30] <wikibugs>	 (03CR) 10Jdlrobson: Simple summaries survey for English (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152801 (https://phabricator.wikimedia.org/T389393) (owner: 10Kimberly Sarabia)
[22:35:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
[22:38:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
[22:45:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
[22:45:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877804 (10Jclark-ctr)
[22:47:51] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
[22:50:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
[22:54:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
[22:58:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:59:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 782, active_shards: 1853, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[22:59:45] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250602T2300)
[23:05:43] <wikibugs>	 (03PS1) 10Cwhite: logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565)
[23:08:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[23:09:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:10:18] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:10:18] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye
[23:10:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877836 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.wmnet with OS bullseye com...
[23:13:31] <wikibugs>	 (03PS1) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956)
[23:15:12] <wikibugs>	 (03PS2) 10Cwhite: logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565)
[23:15:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite)
[23:17:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[23:19:34] <wikibugs>	 (03PS2) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956)
[23:20:25] <wikibugs>	 (03PS3) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956)
[23:22:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite)
[23:22:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:22:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:22:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bullseye
[23:23:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877843 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.wmnet with OS bullseye com...
[23:24:04] <wikibugs>	 (03PS4) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956)
[23:24:28] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:24:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:24:45] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
[23:24:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.wmnet with OS bullseye com...
[23:25:21] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[23:25:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10877846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1048.eqiad.wmnet with OS bullseye
[23:26:17] <wikibugs>	 (03PS1) 10Ladsgroup: etcd: Remove ES clusters from "write clusters" if section is RO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152853 (https://phabricator.wikimedia.org/T395696)
[23:26:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956) (owner: 10Cwhite)
[23:26:29] <wikibugs>	 (03PS5) 10Cwhite: logstash: add test helper and unit tests for dlq_transformer [puppet] - 10https://gerrit.wikimedia.org/r/1152852 (https://phabricator.wikimedia.org/T368956)
[23:30:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
[23:30:32] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10877852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be2008.codfw.wmnet with OS bull...
[23:30:45] <wikibugs>	 (03PS1) 10Scott French: deployment_server: Update the local helm cache in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1152854 (https://phabricator.wikimedia.org/T395521)
[23:38:25] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152856
[23:38:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152856 (owner: 10TrainBranchBot)
[23:50:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
[23:51:06] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152856 (owner: 10TrainBranchBot)
[23:52:12] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[23:53:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage