[00:00:28] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_drmrs [00:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:23] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_drmrs [00:09:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143694 [00:09:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143694 (owner: 10TrainBranchBot) [00:27:34] RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops [00:29:42] RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops [00:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:42:06] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:46:14] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:47:00] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:10:27] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1143694 (owner: 10TrainBranchBot) [02:46:12] (03CR) 10RLazarus: [C:03+2] deployment_server: Add --env to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [03:12:48] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:36:55] FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:51:55] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [04:01:49] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [04:01:54] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 05s) [04:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:02:22] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [04:02:28] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 05s) [04:02:51] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [04:02:57] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 05s) [04:03:17] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [04:03:22] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 05s) [04:03:36] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host [04:03:42] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: deploy to freshly reimaged host (duration: 00m 06s) [04:11:50] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing 
files, repooling source-only afterwards [04:11:57] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [04:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:37:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:42:06] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:42:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:07:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:09:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2014.codfw.wmnet -> wdqs2007.codfw.wmnet 
w/ force delete existing files, repooling source-only afterwards [05:09:59] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [05:11:55] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:12:31] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2010.codfw.wmnet w/ force delete existing files, repooling neither afterwards [05:12:42] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1012.eqiad.wmnet -> wdqs1013.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [05:30:52] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cumin1003.eqiad.wmnet with reason: WIP new Bookworm host [05:35:47] (03PS1) 10Muehlenhoff: Stop installing prometheus-node-exporter on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1143703 (https://phabricator.wikimedia.org/T371375) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250509T0600) [06:07:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1012.eqiad.wmnet -> wdqs1013.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [06:07:10] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [06:10:34] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer 
(exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2010.codfw.wmnet w/ force delete existing files, repooling neither afterwards [06:11:17] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:11:55] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:17] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:25:57] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling neither afterwards [06:26:01] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [06:26:03] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling neither afterwards [06:26:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer 
wikidata_main from wdqs2007.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling neither afterwards [06:26:26] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2010.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling neither afterwards [06:27:09] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1012.eqiad.wmnet -> wdqs1014.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [06:27:32] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1013.eqiad.wmnet -> wdqs1015.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [06:30:43] (03PS1) 10Jelto: gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143709 (https://phabricator.wikimedia.org/T393498) [06:32:41] (03CR) 10Jelto: [C:03+2] gerrit: add more IPs to abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143709 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [06:48:51] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10806526 (10BWojtowicz-WMF) Hello @Eevans, I've tem... [06:54:27] (03CR) 10Elukey: [C:03+1] test-cookbook: expand help message [puppet] - 10https://gerrit.wikimedia.org/r/1143485 (owner: 10Volans) [06:56:55] (03CR) 10Elukey: [C:03+1] "IIUC you are moving it to cumin2002 because of the upcoming 1003 change? 
If so please add it to the commit msg for clarity, then merge at " [puppet] - 10https://gerrit.wikimedia.org/r/1143486 (owner: 10Volans) [06:57:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:58:16] <_joe_> uhm gerrit down again? [06:58:25] <_joe_> wfm [06:59:29] I think it is just very slow to respond to health checks, it is very slow for me [06:59:42] but jelto seems working on it https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143709 [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250509T0700) [07:01:05] yes there was a short traffic burst [07:02:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:02:39] jelto: <3 [07:12:28] (03CR) 10Majavah: Stop installing prometheus-node-exporter on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143703 (https://phabricator.wikimedia.org/T371375) (owner: 10Muehlenhoff) [07:15:29] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [07:22:13] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1012.eqiad.wmnet -> wdqs1014.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [07:22:16] T388134: Drop support for the full Wikidata graph from 
query.wikidata.org - https://phabricator.wikimedia.org/T388134 [07:22:32] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs1013.eqiad.wmnet -> wdqs1015.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [07:22:33] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling neither afterwards [07:23:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2010.codfw.wmnet -> wdqs2012.codfw.wmnet w/ force delete existing files, repooling neither afterwards [07:24:00] FIRING: [12x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:05] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2010.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling neither afterwards [07:26:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10806569 (10MatthewVernon) Thanks! 
:) [07:26:55] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:27:24] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10806570 (10MatthewVernon) [07:30:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10806581 (10MoritzMuehlenhoff) [07:31:17] (03PS1) 10Giuseppe Lavagetto: gerrit: ban old browsers/OS fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/1143712 [07:42:45] (03PS2) 10Muehlenhoff: Stop installing prometheus-ethtool-exporter on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1143703 (https://phabricator.wikimedia.org/T371375) [07:43:05] (03CR) 10Muehlenhoff: Stop installing prometheus-ethtool-exporter on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143703 (https://phabricator.wikimedia.org/T371375) (owner: 10Muehlenhoff) [07:47:47] (03CR) 10Jelto: [C:03+1] "this looks good to me, happy to test it. 
Although I have a small concern we might block a few legitimate users with `Windows\ NT\ 6\.[0-3]" [puppet] - 10https://gerrit.wikimedia.org/r/1143712 (owner: 10Giuseppe Lavagetto) [07:50:53] (03PS2) 10Giuseppe Lavagetto: gerrit: ban old browsers/OS fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/1143712 [07:51:55] FIRING: [12x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:52:02] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1143712 (owner: 10Giuseppe Lavagetto) [07:53:24] (03CR) 10Elukey: [C:03+1] gerrit: ban old browsers/OS fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/1143712 (owner: 10Giuseppe Lavagetto) [07:54:22] (03CR) 10Jelto: [C:03+2] gerrit: ban old browsers/OS fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/1143712 (owner: 10Giuseppe Lavagetto) [07:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [07:57:08] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7efe2347ced0: Failed to establish a new connection: [Errno 113 [07:57:08] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [07:57:55] !log imported puppet-agent 7.23.0-1+wmf13u1 to component/puppet7 for 
trixie-wikimedia T392790 [07:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:59] T392790: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790 [07:58:38] (03PS1) 10Muehlenhoff: puppet: On Trixie install Puppet 7 from component/puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/1143746 (https://phabricator.wikimedia.org/T392790) [07:58:50] (03PS2) 10Muehlenhoff: puppet: On Trixie install Puppet 7 from component/puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/1143746 (https://phabricator.wikimedia.org/T392790) [07:59:08] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1708, active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: [07:59:08] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:46] (03PS1) 10Muehlenhoff: Add library hint for librabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/1143747 [08:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:46] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:09:30] !log volans@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin1003.eqiad.wmnet with reason: Release v0.9.0 - volans@cumin2002 [08:10:29] !log volans@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1003.eqiad.wmnet with reason: Release v0.9.0 - volans@cumin2002 [08:14:31] (03PS3) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) [08:14:31] (03PS2) 10Vgutierrez: cache::haproxy: Drop incoming X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) [08:14:31] (03PS3) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) [08:17:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T388134, bring new main graph hosts into service) xfer wikidata_main from wdqs2010.codfw.wmnet -> wdqs2013.codfw.wmnet w/ force delete existing files, repooling neither afterwards [08:17:10] T388134: Drop support for the full Wikidata graph from query.wikidata.org - https://phabricator.wikimedia.org/T388134 [08:19:00] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:19:23] (03PS4) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) [08:19:23] (03PS3) 10Vgutierrez: cache::haproxy: Drop incoming 
X-Experiment-Enrollments header [puppet] - 10https://gerrit.wikimedia.org/r/1143608 (https://phabricator.wikimedia.org/T391411) [08:19:23] (03PS4) 10Vgutierrez: hiera: Split ATS cache on X-Experiment-Enrollments [puppet] - 10https://gerrit.wikimedia.org/r/1143603 (https://phabricator.wikimedia.org/T391411) [08:20:29] (03CR) 10Vgutierrez: trafficserver: Allow splitting the cache by HTTP header content (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [08:21:55] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:40] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for librabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/1143747 (owner: 10Muehlenhoff) [08:36:39] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:42:06] 
FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:47:08] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f95af580f10: Failed to establish a new connection: [Errno 113 [08:47:08] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:48:08] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, active_primary_shards: 1708, active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_task [08:48:08] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:51:47] (03PS1) 10Fabfur: cache: remove unused allowed_methods check from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) [08:51:53] (03CR) 10Vgutierrez: [C:03+1] varnish: Replace X-IS-ALT-DOMAIN with variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [08:53:06] (03CR) 10Alexandros Kosiaris: [C:04-1] "You need to also update module.json to list the new version. Also an inline pedantic comment. 
I 'd also add a fixture to define stats_conf" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:55:40] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [09:00:29] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Move OpenSSH server config away from using a Puppet template - https://phabricator.wikimedia.org/T393762 (10MoritzMuehlenhoff) 03NEW [09:00:45] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Move OpenSSH server config away from using a Puppet template - https://phabricator.wikimedia.org/T393762#10806747 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [09:01:30] (03PS2) 10Fabfur: cache: remove unused allowed_methods check from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) [09:02:13] (03CR) 10Fabfur: "Not sure about removing entirely the modules/varnish/files/tests/text/40-allowed-methods.vtc file" [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [09:03:05] !log zabe@deploy1003:~$ mwscript-k8s --comment="T393372" --follow -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwikibooks --logwiki=metawiki 'Adityaindumdum' 'Renamed user a71c8354dc822ea0d3aab24d1ce886f02c25fe91' [09:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:08] T393372: Unblock stuck global rename of Renamed_user_a71c8354dc822ea0d3aab24d1ce886f02c25fe91 - https://phabricator.wikimedia.org/T393372 [09:05:15] !log zabe@deploy1003:~$ mwscript-k8s --comment="T393761" --follow -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=amwiki --logwiki=metawiki 'Jeroen' 'Retireduser-vfs199s31yvbtxsfmygg' [09:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:18] 
T393761: Unblock stuck global rename of Retireduser-vfs199s31yvbtxsfmygg - https://phabricator.wikimedia.org/T393761 [09:13:20] (03CR) 10Vgutierrez: "you need to remove it or tests will fail otherwise, what's the doubt?" [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [09:13:28] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Move OpenSSH server config away from using a Puppet template - https://phabricator.wikimedia.org/T393762#10806774 (10MoritzMuehlenhoff) The default Debian config only ships these seven config directives, apart of that it uses the OpenSSH server defaults: `... [09:16:08] (03CR) 10JMeybohm: modules: allow to config envoy's stats_config in mesh.configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:18:46] (03PS9) 10Hnowlan: mw::maintenance: move refreshLinkRecommendations job to shared object [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) [09:20:30] (03CR) 10Fabfur: "I think we can proceed both on cache and upload deployment-prep hosts, the setup is similar to prod one (haproxy handles port 80 and 443)" [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [09:22:37] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10806780 (10JMeybohm) Thanks. I've re-added the drive to the md array, will check back when the resync has completed [09:26:07] (03CR) 10Joely Rooke WMDE: [C:03+1] Create feature flags for resolving Wikibase item labels on Watchlist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [09:27:48] (03CR) 10Joely Rooke WMDE: [C:03+1] "I can't +2 this since it has to be merged + deployed in a BACON window. 
I'll show you how to do this on Monday :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) (owner: 10Neslihan Turan) [09:29:29] (03PS5) 10Elukey: modules: initial fork of mesh.configuration 1.12 in 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143586 (https://phabricator.wikimedia.org/T391333) [09:29:30] (03PS8) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [09:30:45] (03CR) 10Elukey: "I changed the module.json in the previous patch of the chain, but it was probably confusing so I added it here directly." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:31:40] (03CR) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [09:36:24] (03PS1) 10Slyngshede: gerrit: better match for old macOS [puppet] - 10https://gerrit.wikimedia.org/r/1143758 [09:36:47] (03CR) 10Vgutierrez: "is varnish there unreachable from the outside as well? we got a pure UDS setup in production" [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [09:38:02] (03CR) 10Ladsgroup: [C:04-1] "you need to add it to DB_LISTS (and in a follow up patch, start referring to it in config). 
Also if you want to be sure, these are wikis t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [09:40:51] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [09:42:36] (03CR) 10Jelto: [C:03+2] gerrit: better match for old macOS [puppet] - 10https://gerrit.wikimedia.org/r/1143758 (owner: 10Slyngshede) [09:50:42] !log imported debmonitor-client 0.4.0-3+deb13u1 for trixie-wikimedia T391083 [09:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:45] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [09:54:54] (03PS1) 10Muehlenhoff: Record LDAP access for bwojtowicz [puppet] - 10https://gerrit.wikimedia.org/r/1143765 [09:55:18] PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:56:14] (03CR) 10CI reject: [V:04-1] Record LDAP access for bwojtowicz [puppet] - 10https://gerrit.wikimedia.org/r/1143765 (owner: 10Muehlenhoff) [09:56:44] PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:58:37] (03PS2) 10Muehlenhoff: Record LDAP access for bwojtowicz [puppet] - 10https://gerrit.wikimedia.org/r/1143765 [10:02:02] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for bwojtowicz [puppet] - 10https://gerrit.wikimedia.org/r/1143765 (owner: 10Muehlenhoff) [10:02:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker 
hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10806848 (10Stevemunene) an-worker1177 umount succeeded, proceeded to Remove the RAID0 drive configurations for all 12 d... [10:03:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10806849 (10Stevemunene) [10:04:09] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555#10806851 (10Stevemunene) 05Open→03Resolved [10:04:10] (03PS1) 10Dr0ptp4kt: Stream config for edge uniques on prod cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 [10:04:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10806852 (10MoritzMuehlenhoff) [10:05:18] (03PS2) 10Dr0ptp4kt: Stream config for edge uniques on prod cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 (https://phabricator.wikimedia.org/T391959) [10:06:30] (03PS1) 10Filippo Giunchedi: pontoon: switch to vxlan/ipv6-dualstack [puppet] - 10https://gerrit.wikimedia.org/r/1143774 [10:06:30] (03PS1) 10Filippo Giunchedi: benthos: remove execute permissions from yaml/env files [puppet] - 10https://gerrit.wikimedia.org/r/1143775 [10:06:30] (03PS1) 10Filippo Giunchedi: pontoon: fix pipx venv path [puppet] - 10https://gerrit.wikimedia.org/r/1143776 [10:06:31] (03PS1) 10Filippo Giunchedi: logstash: support reload via SIGHUP [puppet] - 10https://gerrit.wikimedia.org/r/1143777 [10:06:32] (03PS1) 10Filippo Giunchedi: pontoon: fix heap memory for logstash [puppet] - 10https://gerrit.wikimedia.org/r/1143778 [10:06:36] (03PS1) 10Filippo Giunchedi: pontoon: fix curator settings [puppet] - 10https://gerrit.wikimedia.org/r/1143779 [10:08:44] RECOVERY - Hadoop NodeManager on an-worker1145 is 
OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:10:24] (03PS2) 10Filippo Giunchedi: pontoon: fix pipx venv path [puppet] - 10https://gerrit.wikimedia.org/r/1143776 [10:10:29] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: fix pipx venv path [puppet] - 10https://gerrit.wikimedia.org/r/1143776 (owner: 10Filippo Giunchedi) [10:10:58] (03PS2) 10Filippo Giunchedi: pontoon: fix heap memory for logstash [puppet] - 10https://gerrit.wikimedia.org/r/1143778 [10:11:02] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: fix heap memory for logstash [puppet] - 10https://gerrit.wikimedia.org/r/1143778 (owner: 10Filippo Giunchedi) [10:11:19] (03PS2) 10Filippo Giunchedi: pontoon: fix curator settings [puppet] - 10https://gerrit.wikimedia.org/r/1143779 [10:11:25] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: fix curator settings [puppet] - 10https://gerrit.wikimedia.org/r/1143779 (owner: 10Filippo Giunchedi) [10:12:37] (03CR) 10Elukey: [C:03+1] benthos: remove execute permissions from yaml/env files [puppet] - 10https://gerrit.wikimedia.org/r/1143775 (owner: 10Filippo Giunchedi) [10:13:24] (03PS2) 10Filippo Giunchedi: benthos: remove execute permissions from yaml/env files [puppet] - 10https://gerrit.wikimedia.org/r/1143775 [10:13:31] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] benthos: remove execute permissions from yaml/env files [puppet] - 10https://gerrit.wikimedia.org/r/1143775 (owner: 10Filippo Giunchedi) [10:14:22] (03CR) 10Majavah: [C:03+1] pontoon: switch to vxlan/ipv6-dualstack [puppet] - 10https://gerrit.wikimedia.org/r/1143774 (owner: 10Filippo Giunchedi) [10:17:08] (03PS2) 10Filippo Giunchedi: pontoon: switch to vxlan/ipv6-dualstack [puppet] - 10https://gerrit.wikimedia.org/r/1143774 [10:17:12] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: switch to vxlan/ipv6-dualstack [puppet] - 
10https://gerrit.wikimedia.org/r/1143774 (owner: 10Filippo Giunchedi) [10:17:18] (03CR) 10Fabfur: "Looks like the setup is the same as production" [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [10:20:49] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1005.eqiad.wmnet with OS bullseye [10:21:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host thanos-fe1005.eqiad.wmnet with OS bullseye [10:23:18] RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:28:22] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10806907 (10MatthewVernon) [10:29:20] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10806911 (10MatthewVernon) [10:29:54] (03CR) 10Vgutierrez: [C:03+1] cache: remove unused allowed_methods check from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1143755 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [10:32:08] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10806924 (10MatthewVernon) [10:32:54] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10806929 (10MatthewVernon) [10:35:42] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. 
[10:37:01] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1005.eqiad.wmnet with reason: host reimage [10:38:42] jelto@cumin1002 upgrade (PID 410518) is awaiting input [10:39:31] I know, I'll wait until the maintenance window starts [10:39:43] is something up with gerrit? I had a gerrit tab open in Firefox (granted, yes, an outdated version of it, but let's not get into that), refreshed the page, getting HTTP 403 consistently; the same happens in a different browser (Chromium) as well, which hopefully means it's not a browser-level issue; but I'm not seeing any discussion about this anywhere? [10:40:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1005.eqiad.wmnet with reason: host reimage [10:41:04] ashley: yes we had to block some old browsers for Gerrit. Is it possible to update your browsers and use a more recent one? [10:41:14] (03Abandoned) 10Hnowlan: mw-api-ext: bump replicas temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143629 (owner: 10Hnowlan) [10:41:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:46] (03PS3) 10JMeybohm: deployment_server: Remove special handling of ci user [puppet] - 10https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) [10:41:53] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [10:42:41] (03PS9) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 
(https://phabricator.wikimedia.org/T391333) [10:43:29] jelto: weird. were misbehaving AIs using such UAs or what? but guess I'll need to use a separate browser for gerrit; recent versions (by which I mean "version number ~100+") of Firefox have been an absolute trainwreck hence why I prefer using an older, more stable version (yet for gerrit I've had to use Chromium because whatever JS is used in gerrit require a lot of fancy new stuff :D) [10:43:55] (03CR) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [10:46:30] ashley: Unfortunately I can't share more information publicly (this is also the reason it was not discussed in a public task). I'd recommend to use a more recent browser for Gerrit (and in general for security reasons). If you still see the issue with a new browser let me know [10:46:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:46:40] (03PS9) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [10:47:42] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [10:48:20] (03PS10) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [10:51:03] (03PS6) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [10:52:15] (03CR) 10Ladsgroup: "I think the biggest risk here is that setting up a new container is taking long enough time for the data to be stale already. I don't thin" [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [10:55:07] well luckily spoofing the UA took me about as long as it took to google the latest Firefox UA and add a custom profile to dev tools ;-) a real shame that these kind of breaking, unannounced, intentional changes happen without sufficient discussion, hopefully that could be fixed sooner rather than later (but I guess the old adage about "temporary" changes applies here as well) [10:58:46] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250509T0700) [11:00:05] jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250509T1100). 
[11:01:51] mvernon@cumin1002 reimage (PID 431431) is awaiting input [11:02:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" [11:02:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1005.eqiad.wmnet with OS bullseye [11:02:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806971 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host thanos-fe1005.eqiad.wmnet with OS bullseye... [11:03:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10806973 (10MatthewVernon) @VRiley-WMF I hope you don't mind, but I saw the failed reimage in my mail this morning and thought I'd take a look to see... [11:06:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [11:06:56] FIRING: ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:07:05] ^ expected and should resolve soon [11:07:24] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, we'll need another patch to enable the ethtool stats in node-exporter but that should be easy enough to get in place in time. 
Only " [puppet] - 10https://gerrit.wikimedia.org/r/1143703 (https://phabricator.wikimedia.org/T371375) (owner: 10Muehlenhoff) [11:11:56] RESOLVED: ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:47] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10807046 (10cmooney) @fgiunchedi hey just wondering what the best way forward might be if we want to get node exporter exposing ethtool stats for our hosts.... [11:16:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:17:09] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:38:33] (03CR) 10Jelto: [C:03+2] Use buildkit wmf-v0.21.1 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1143671 (https://phabricator.wikimedia.org/T393731) (owner: 10Ahmon Dancy) [11:38:35] (03CR) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [11:45:16] !log update toolforge arc-enabled exim4 packages (component/exim4-arc) to latest in debian 12 T356171 [11:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:19] T356171: Enable ARC support in Toolforge - https://phabricator.wikimedia.org/T356171 [11:49:32] (03CR) 10JMeybohm: [C:03+2] deployment_server: Remove special handling of ci user [puppet] - 10https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [11:50:57] (03CR) 10Alexandros Kosiaris: "Oh, I didn't expect it in 
the previous patch tbh. My bad. Fixture should go in a files under _scaffold/service/_skel/.fixtures, named like" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [11:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [11:56:57] 06SRE, 13Patch-Needs-Improvement, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360#10807127 (10JMeybohm) 05Stalled→03Invalid Given the changes done in the parent task have been reverted and CI systems n... [11:58:13] (03PS1) 10JMeybohm: Remove ci namespace from wikikube staging clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143802 (https://phabricator.wikimedia.org/T288629) [12:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:07] (03PS1) 10Kamila Součková: mw-cron/updatequerypages: Migrate Mostcategories,Mostlinkedtemplates [puppet] - 10https://gerrit.wikimedia.org/r/1143803 (https://phabricator.wikimedia.org/T388534) [12:06:09] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fea4487ced0: Failed to establish a new connection: [Errno 113 [12:06:09] te to host)) 
https://wikitech.wikimedia.org/wiki/Search%23Administration [12:07:09] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, active_primary_shards: 1708, active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_task [12:07:09] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:12:34] (03PS1) 10Alexandros Kosiaris: admin: Update akosiaris dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1143808 [12:19:36] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143803 (https://phabricator.wikimedia.org/T388534) (owner: 10Kamila Součková) [12:21:55] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:51] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [12:30:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [12:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:51:33] !log upload prometheus-blackbox-exporter 0.26.0-0~bpo12+1 to bookworm-wikimedia - T385022 [12:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:36] T385022: Upgrade blackbox-exporter and reduce logging - https://phabricator.wikimedia.org/T385022 [12:53:11] (03CR) 10Alexandros Kosiaris: [C:03+2] admin: Update akosiaris dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1143808 (owner: 10Alexandros Kosiaris) [12:56:15] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10807265 (10fgiunchedi) >>! In T371375#10807046, @cmooney wrote: > @fgiunchedi hey just wondering what the best way forward might be if we want to get node... 
[13:00:03] (03PS1) 10Filippo Giunchedi: prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) [13:27:38] thcipriani, jeena: I would like an emergency deploy (maybe two) for T393641 (not critical but community is unhappy and as a CSS fix it should be low risk), are SRE okay with a deployment? (I already have a spiderpig^W deployer) [13:27:39] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) - https://phabricator.wikimedia.org/T393641 [13:28:12] right now what I want to backport is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1143793, there might be a second fix (almost certainly also CSS-only) later today [13:29:26] (03PS1) 10Slyngshede: Account block: update templates [software/bitu] - 10https://gerrit.wikimedia.org/r/1143820 (https://phabricator.wikimedia.org/T393779) [13:30:26] (03PS1) 10MVernon: apus: bring new frontend apus-fe1003 into service [puppet] - 10https://gerrit.wikimedia.org/r/1143821 (https://phabricator.wikimedia.org/T389632) [13:32:46] !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided) [13:32:57] (03PS11) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [13:33:45] (03PS1) 10MVernon: thanos: remove thanos-fe200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1143824 (https://phabricator.wikimedia.org/T391352) [13:34:25] (03CR) 10CI reject: [V:04-1] modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [13:36:55] !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 04m 10s) [13:40:05] (03CR) 10Jcrespo: [C:03+1] apus: bring 
new frontend apus-fe1003 into service [puppet] - 10https://gerrit.wikimedia.org/r/1143821 (https://phabricator.wikimedia.org/T389632) (owner: 10MVernon) [13:40:11] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fa3be3d0fd0: Failed to establish a new connection: [Errno 113 [13:40:11] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [13:42:29] ^^ checking on the Elastic alert. EQIAD is depooled so no user impact AFAIK [13:43:01] 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10807449 (10MatthewVernon) [13:43:03] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10807450 (10MatthewVernon) [13:43:11] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, active_primary_shards: 1708, active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_task [13:43:11] mber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:47:20] (03PS14) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [13:47:20] (03CR) 10Vgutierrez: "tests are (now) happy for both text and upload." 
[puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:48:42] (03PS1) 10Lucas Werkmeister (WMDE): Bump wikibase-data-values-value-view to HEAD [extensions/Wikibase] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143835 (https://phabricator.wikimedia.org/T389633) [13:49:23] ^ going ahead with that emergency deploy now FTR (I got a +1 from cdanis elsewhere and no objections) [13:50:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143835 (https://phabricator.wikimedia.org/T389633) (owner: 10Lucas Werkmeister (WMDE)) [13:57:20] also I just realized, if anyone does have objections, you don’t even have to yell at me, you can do stuff in spiderpig yourself :P [13:57:53] (03Abandoned) 10Andrew Bogott: cinder: use 'cinder' service user rather than 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/1143612 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:58:23] (03PS12) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [14:02:15] incidentally, what does Interrupt Job do in SpiderPig? [14:02:34] I assume Kill Job (not recommended) is SIGTERM (SIGKILL?)… is Interrupt Job SIGSTOP? 
[14:02:49] (03CR) 10Ottomata: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:03:05] (or SIGTSTP idk) [14:04:33] PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:07:05] (03Merged) 10jenkins-bot: Bump wikibase-data-values-value-view to HEAD [extensions/Wikibase] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143835 (https://phabricator.wikimedia.org/T389633) (owner: 10Lucas Werkmeister (WMDE)) [14:07:29] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1143835|Bump wikibase-data-values-value-view to HEAD (T389633 T393641)]] [14:07:33] T389633: [Darkmode] Update colours on statement editing state on Items - https://phabricator.wikimedia.org/T389633 [14:07:33] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) 
- https://phabricator.wikimedia.org/T393641 [14:12:36] (03CR) 10Scott French: [C:03+1] mw-cron/updatequerypages: Migrate Mostcategories,Mostlinkedtemplates [puppet] - 10https://gerrit.wikimedia.org/r/1143803 (https://phabricator.wikimedia.org/T388534) (owner: 10Kamila Součková) [14:12:52] (03CR) 10Federico Ceratto: [C:03+1] "The hosts are pinging but are depooled and the hostnames match the names in the related task" [puppet] - 10https://gerrit.wikimedia.org/r/1143824 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [14:14:10] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1143835|Bump wikibase-data-values-value-view to HEAD (T389633 T393641)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:14] T389633: [Darkmode] Update colours on statement editing state on Items - https://phabricator.wikimedia.org/T389633 [14:14:15] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) - https://phabricator.wikimedia.org/T393641 [14:14:33] 10ops-codfw, 06DC-Ops: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393784 (10phaultfinder) 03NEW [14:14:34] 10ops-codfw, 06DC-Ops: Alert for device lsw1-f3-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T393785 (10phaultfinder) 03NEW [14:15:09] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [14:21:41] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143835|Bump wikibase-data-values-value-view to HEAD (T389633 T393641)]] (duration: 14m 12s) [14:21:45] T389633: [Darkmode] Update colours on statement editing state on Items - https://phabricator.wikimedia.org/T389633 [14:21:46] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) 
- https://phabricator.wikimedia.org/T393641 [14:22:08] * Lucas_WMDE done deploying for now [14:24:33] !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided) [14:25:03] !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 00m 31s) [14:29:28] !log fab@deploy1003 Started deploy [airflow-dags/research@e3ccac9]: (no justification provided) [14:30:06] !log fab@deploy1003 Finished deploy [airflow-dags/research@e3ccac9]: (no justification provided) (duration: 00m 38s) [14:30:33] RECOVERY - Hadoop NodeManager on an-worker1113 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:35:17] (03CR) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:36:12] (03CR) 10JHathaway: [C:03+1] puppet: On Trixie install Puppet 7 from component/puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/1143746 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [14:43:19] PROBLEM - Hadoop DataNode on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [14:43:33] Lucas_WMDE: thanks for the heads up and fixing things as usual [14:43:41] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:43:46] np [14:43:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen 
behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:44:41] PROBLEM - Hadoop DataNode on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [14:44:59] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:47:47] 06SRE, 10Wikimedia-Mailing-lists: Remove brion@wikimedia.org from admins for wikita-l mailing list - https://phabricator.wikimedia.org/T393787#10807703 (10Ladsgroup) You were the only owner of that mailing list, I removed you and added myself as interim (FWIW, in mm3 you don't need to have a password for each... [14:49:09] 06SRE, 10Wikimedia-Mailing-lists: Remove brion@wikimedia.org from admins for wikita-l mailing list - https://phabricator.wikimedia.org/T393787#10807719 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [14:53:58] (03PS5) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) [14:54:27] (03CR) 10Ahmon Dancy: [C:03+1] Remove ci namespace from wikikube staging clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143802 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [14:56:05] 06SRE, 10Wikimedia-Mailing-lists: Remove brion@wikimedia.org from admins for wikita-l mailing list - https://phabricator.wikimedia.org/T393787#10807747 (10bvibber) Thanks! 
<3 [14:56:56] (03CR) 10Cathal Mooney: "No diff with the latest patch set. We need to add some code to check if the "link_peer" is a patch panel front port, and if so see what's" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [14:58:17] FIRING: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:01] (03PS6) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) [15:03:09] (03PS1) 10Lucas Werkmeister (WMDE): Bump wikibase-data-values-value-view to HEAD [extensions/Wikibase] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143851 (https://phabricator.wikimedia.org/T389633) [15:03:17] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:47] ^ gonna deploy this one soon (another CSS-only fix), fyi thcipriani jeena [15:05:23] (03CR) 10Cwhite: [C:03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1143650 (owner: 10Herron) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:47] Lucas_WMDE: ack. 
Iaside: it's weird, if gerrit knows about the submodule, why not link me to the submodule?) [15:07:57] *(aside [15:08:12] idk, would be useful yeah [15:09:07] (03PS13) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [15:09:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143851 (https://phabricator.wikimedia.org/T389633) (owner: 10Lucas Werkmeister (WMDE)) [15:11:45] (03CR) 10Ottomata: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:12:22] (03CR) 10Ottomata: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:13:13] (03CR) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:13:32] (03PS14) 10Elukey: modules: allow to config envoy's stats_config in mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) [15:13:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:17:46] does anyone happen to know why SpiderPig wraps `f`s in the console in f? 
[15:17:50] so far I can’t find it in codesearch [15:17:56] (possibly it’s something in my browser o_O) [15:18:07] (03CR) 10Elukey: "@Aj" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [15:19:45] meh, looks like it’s in xterm.js https://github.com/xtermjs/xterm.js/blob/e9c547c1c6b67e9f09c24ccc007e19305f536e60/src/browser/renderer/dom/DomRendererRowFactory.ts#L464 [15:19:55] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you !" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143587 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [15:21:46] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10807881 (10cmooney) >>! In T371375#10807265, @fgiunchedi wrote: > If I read the task correctly we can enable it only on >= bookworm hosts (?) in which case... [15:21:54] huh, I am now a layer deeper in wondering why change letter spacing [15:22:09] presumably it *wants* to make sure it aligns correctly [15:22:15] (03CR) 10Ottomata: varnish: Allow /beacon/v2/event to hit origin servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:22:15] even if you have a non-monospace font, maybe? [15:22:29] or at least one of those ill-advised monospaced fonts that still ligaturize fi → fi etc. 
[15:22:41] what it *does* in practice on my system is make the letters misaligned :D [15:23:02] heh, of course [15:23:10] and if I kick out the letter-spacing with !important then the table (“the following changes are scheduled for backport”) aligns better 🙃 [15:23:24] but whatever, if it’s in a library then it doesn’t need a phab task, I can live with a slightly misaligned table [15:23:49] would’ve been different if it was in scap itself ^^ [15:24:19] Lucas_WMDE: I hope to eventually eliminate ASCII art from SpiderPig and replace instances with proper web elements. [15:24:48] :o [15:24:56] though sounds like this one must be in the job log? [15:25:02] since xterm.js [15:25:18] it is, but dancy controls that too I assume ^^ [15:25:19] ah, the job log.. [15:25:27] (03Merged) 10jenkins-bot: Bump wikibase-data-values-value-view to HEAD [extensions/Wikibase] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143851 (https://phabricator.wikimedia.org/T389633) (owner: 10Lucas Werkmeister (WMDE)) [15:25:33] oh, right, now I know which ASCII art you mean [15:25:37] but yeah I was talking about the copy in the job log [15:25:42] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1143851|Bump wikibase-data-values-value-view to HEAD (T389633 T393641)]] [15:25:43] I clicked the other one away, idk if that one was misaligned [15:25:46] T389633: [Darkmode] Update colours on statement editing state on Items - https://phabricator.wikimedia.org/T389633 [15:25:46] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) - https://phabricator.wikimedia.org/T393641 [15:26:13] dancy controls all :) [15:26:45] I'm looking at https://spiderpig.wikimedia.org/jobs/36. Can you point me to the part that looks weird? 
[15:27:34] I should really just enable file uploads in my bouncer :D [15:27:36] one sec [15:28:42] dancy: https://tmp.lucaswerkmeister.de/spiderpig-table.png [15:29:18] (this is in Firefox and the font is apparently Nimbus Mono PS) [15:32:20] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1143851|Bump wikibase-data-values-value-view to HEAD (T389633 T393641)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:32:23] T389633: [Darkmode] Update colours on statement editing state on Items - https://phabricator.wikimedia.org/T389633 [15:32:24] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) - https://phabricator.wikimedia.org/T393641 [15:33:21] testing [15:33:58] Thx. Looks okay in my browsers (Chrome and Firefox). My Chrome is using Liberation Mono which looks great. My Firefox uses something different that doesn't look as good (the line drawing characters don't touch) but is otherwise aligned correctly. Not sure how to find out what font it is using. [15:34:26] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [15:35:32] dancy: if you click the tiny arrow in the right spot you can switch to the fonts panel https://tmp.lucaswerkmeister.de/devtools.png [15:35:37] it’s quite hidden [15:36:23] Indeed! [15:36:27] Ok. Firefox is using Nimbus Mono PS [15:36:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:36:49] 06SRE, 10Observability-Metrics: Set a predefined time window in Pyrra's configuration to measure SLOs with - https://phabricator.wikimedia.org/T393796 (10elukey) 03NEW [15:37:38] if I add font-family: Liberation Mono then it still looks misaligned (probably because xterm.js measured the character widths already and now has no reason to reevaluate its decision to add that letter-spacing) [15:37:44] looks generally nicer though ^^ [15:37:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:38:03] but strange that you get different results in the same browser and font o_O [15:41:04] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143851|Bump wikibase-data-values-value-view to HEAD (T389633 T393641)]] (duration: 15m 22s) [15:41:08] T389633: [Darkmode] Update colours on statement editing state on Items - https://phabricator.wikimedia.org/T389633 [15:41:08] T393641: [Bug] color changes in statement UI for old Vector skin (and potentially others?) 
- https://phabricator.wikimedia.org/T393641 [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:19] 06SRE, 10Observability-Metrics: Every Grafana dashboard generated by Pyrra contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797 (10elukey) 03NEW [15:42:39] 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10807951 (10elukey) I created two subtasks for the major problems, let's discuss in there separately :) [15:42:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:46:18] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for thiemowmde - https://phabricator.wikimedia.org/T393798 (10thcipriani) 03NEW [15:46:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for thiemowmde - https://phabricator.wikimedia.org/T393798#10807965 (10thcipriani) For clarity: I filed this task as a followup to a request for [[https://wikitech.wikimedia.org/wiki/Scap/SpiderPig|spiderpig access]]. `deployment` membership is curr... 
[15:49:34] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1053 to cirrussearch1053 [15:49:45] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.rename (exit_code=93) from elastic1053 to cirrussearch1053 [15:55:32] * Lucas_WMDE done deploying btw [15:57:31] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1053 to cirrussearch1053 [15:57:45] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:57:51] !log bking@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:57:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from elastic1053 to cirrussearch1053 [15:59:50] (03PS1) 10Andrew Bogott: cinder/epoxy: remove resource_filters.json from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1143861 (https://phabricator.wikimedia.org/T393791) [15:59:52] (03PS1) 10Andrew Bogott: cinder/dalmatian: upgrade resource_filters.json to match upstream latest [puppet] - 10https://gerrit.wikimedia.org/r/1143862 (https://phabricator.wikimedia.org/T393791) [16:01:14] (03CR) 10Andrew Bogott: [C:03+2] cinder/epoxy: remove resource_filters.json from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1143861 (https://phabricator.wikimedia.org/T393791) (owner: 10Andrew Bogott) [16:01:17] (03CR) 10Andrew Bogott: [C:03+2] cinder/dalmatian: upgrade resource_filters.json to match upstream latest [puppet] - 10https://gerrit.wikimedia.org/r/1143862 (https://phabricator.wikimedia.org/T393791) (owner: 10Andrew Bogott) [16:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:12] (03PS1) 10Bking: cirrussearch: Add new hostnames to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) [16:06:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141516 (owner: 10Krinkle) [16:07:06] (03Merged) 10jenkins-bot: noc: Fix "Class MWMultiVersion not found" in wiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141516 (owner: 10Krinkle) [16:07:18] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1141516|noc: Fix "Class MWMultiVersion not found" in wiki.php]] [16:07:26] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10808026 (10cmooney) >>! In T371375#10807881, @cmooney wrote: > Let me double check and report back. So it seems an unprivaledged user can get these stats... [16:08:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:08:24] (03PS2) 10Bking: cirrussearch: Add new hostnames to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) [16:08:33] (03PS1) 10Andrew Bogott: cinder/epoxy: don't install resource_filters.json [puppet] - 10https://gerrit.wikimedia.org/r/1143867 (https://phabricator.wikimedia.org/T393791) [16:10:42] (03CR) 10Andrew Bogott: [C:03+2] cinder/epoxy: don't install resource_filters.json [puppet] - 10https://gerrit.wikimedia.org/r/1143867 (https://phabricator.wikimedia.org/T393791) (owner: 10Andrew Bogott) [16:14:00] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1141516|noc: Fix "Class MWMultiVersion not found" in wiki.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:14:24] !log krinkle@deploy1003 krinkle: Continuing with sync [16:14:33] https://noc.wikimedia.org/wiki.php works again on mw-debug [16:17:10] (03PS1) 10CDanis: Switch status page to haproxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1143868 [16:19:06] !log mfossati@deploy1003 Started deploy 
[airflow-dags/platform_eng@bfb9c63]: bump image suggestions to 1.6.0 [16:20:41] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@bfb9c63]: bump image suggestions to 1.6.0 (duration: 01m 49s) [16:21:01] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1141516|noc: Fix "Class MWMultiVersion not found" in wiki.php]] (duration: 13m 42s) [16:21:55] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:39:37] (03PS1) 10Andrew Bogott: deisgnate policy.yaml: forward changes to version epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1143870 [16:41:09] (03CR) 10Andrew Bogott: [C:03+2] deisgnate policy.yaml: forward changes to version epoxy [puppet] - 10https://gerrit.wikimedia.org/r/1143870 (owner: 10Andrew Bogott) [16:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent 
[16:47:31] (03CR) 10Bking: "You can ignore the experimental build failure, this CR doesn't have a Hosts: line and running PCC would not help validate this change." [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:48:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: 10Bernard Wang) [16:49:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143625 (https://phabricator.wikimedia.org/T386247) (owner: 10Bernard Wang) [16:50:52] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: Add new hostnames to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:55:11] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fb2ec438fd0: Failed to establish a new connection: [Errno 113 [16:55:11] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [16:58:11] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1708, active_shards: 5123, 
relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: [16:58:11] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:15] (03PS1) 10FNegri: site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm [puppet] - 10https://gerrit.wikimedia.org/r/1143871 (https://phabricator.wikimedia.org/T393733) [17:04:50] (03CR) 10Hnowlan: "Yeah, this is a good point and worth considering. However, because the container images will be reused and cached currently there shouldn'" [puppet] - 10https://gerrit.wikimedia.org/r/1143533 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [17:05:22] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:05:55] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:14:31] (03PS3) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) [17:14:58] (03CR) 10Andrew Bogott: [C:03+1] site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm [puppet] - 10https://gerrit.wikimedia.org/r/1143871 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [17:17:27] (03CR) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:17:29] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:18:38] (03PS4) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) [17:18:52] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:19:32] (03PS2) 10FNegri: site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm [puppet] - 10https://gerrit.wikimedia.org/r/1143871 (https://phabricator.wikimedia.org/T393733) [17:19:55] (03CR) 10Andrew Bogott: [C:03+1] site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm [puppet] - 10https://gerrit.wikimedia.org/r/1143871 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [17:21:28] (03CR) 10FNegri: [C:03+2] site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm [puppet] - 10https://gerrit.wikimedia.org/r/1143871 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [17:27:03] (03PS5) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) [17:29:04] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:29:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[1-4] - https://phabricator.wikimedia.org/T393733#10808192 (10fnegri) a:05fnegri→03None > update the site.pp file with the insetup role for your team Done in https://gerrit.wikimedia.o... [17:29:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#10808200 (10fnegri) [17:30:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#10808204 (10fnegri) Please note that we should skip clouddb1021 as it existed in the past and was decom'd in {T368518}. I updated the task to rename the 4 new h... 
[17:31:13] (03PS1) 10FNegri: site.pp: skip clouddb1021 as it existed in the past [puppet] - 10https://gerrit.wikimedia.org/r/1143876 (https://phabricator.wikimedia.org/T393733) [17:35:27] (03CR) 10Andrew Bogott: [C:03+1] site.pp: skip clouddb1021 as it existed in the past [puppet] - 10https://gerrit.wikimedia.org/r/1143876 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [17:35:56] (03CR) 10FNegri: [C:03+2] site.pp: skip clouddb1021 as it existed in the past [puppet] - 10https://gerrit.wikimedia.org/r/1143876 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [17:39:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808231 (10VRiley-WMF) @MatthewVernon I don't mind at all, thank you! So, yesterday I did make a change in the bios while I was waiting for the scrip... [17:39:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808232 (10VRiley-WMF) [17:39:54] (03PS6) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) [17:40:26] 06SRE, 10DNS, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10808234 (10Amire80) I support this. For just a bit more context, @zolfeqar brought it up in a conversation with me at the [[ https://meta.wikimedia.org/wiki/Central_Asian_WikiCon_2025 | Central Asi... 
[17:40:32] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10808236 (10NBaca-WMF) As @Jdlrobson-WMF 's manager, provided he has filled out the above form I approve this request [17:42:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:43:11] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f228d57ced0: Failed to establish a new connection: [Errno 113 [17:43:11] te to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:44:11] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: green, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1708, active_shards: 5123, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: [17:44:11] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:45:23] ^^ Not sure why that alert is flapping, I'm not seeing any problems with production-search-omega-eqiad [17:45:37] Will start a task to investigate [17:47:09] (03CR) 10Bking: cirrussearch: Add cluster-specific domain name as a SAN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143633 (https://phabricator.wikimedia.org/T143553) (owner: 10Bking) [17:51:46] (03CR) 10Bking: 
[C:03+2] cirrussearch: Add new hostnames to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143865 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:51:51] (03CR) 10Cwhite: [C:03+2] logstash: support reload via SIGHUP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143777 (owner: 10Filippo Giunchedi) [17:52:59] (03PS1) 10Bking: Revert "cirrussearch: Add new hostnames to site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1143877 [17:53:19] (03CR) 10Bking: [V:03+2 C:03+2] Revert "cirrussearch: Add new hostnames to site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1143877 (owner: 10Bking) [17:53:33] * cwhite sees the revert, holds [18:00:33] cwhite feel free to merge, revert's done [18:03:32] ty! [18:11:01] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:13:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:13:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10808313 (10cmooney) p:05Triage→03Medium [18:14:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10808315 (10cmooney) [18:14:28] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt thanos-fe1006 - vriley@cumin1002" [18:14:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt thanos-fe1006 - vriley@cumin1002" [18:14:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [18:15:08] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe1006 [18:16:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:16:26] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe1006 [18:17:48] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:18:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:19:46] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:21:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:22:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808316 (10VRiley-WMF) [18:22:58] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt thanos-fe1007 - vriley@cumin1002" [18:23:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update 
mgmt thanos-fe1007 - vriley@cumin1002" [18:23:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:25] (03PS1) 10Ryan Kemper: wdqs-main: pull new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143886 (https://phabricator.wikimedia.org/T388134) [18:23:29] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe1007 [18:24:00] FIRING: ProbeDown: Service install1004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:14] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:24:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe1007 [18:24:46] (03PS2) 10Ryan Kemper: wdqs-main: pool new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143886 (https://phabricator.wikimedia.org/T388134) [18:25:18] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:26:55] RESOLVED: ProbeDown: Service install1004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:28] (03CR) 10Ryan Kemper: [C:03+2] wdqs-main: pool new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1143886 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [18:28:43] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1006.eqiad.wmnet with OS bullseye [18:28:49] 
10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1006.eqiad.wmnet with OS bullseye [18:39:03] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808376 (10VRiley-WMF) [18:50:47] (03PS1) 10Bking: cirrussearch: Add new hostnames to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143888 (https://phabricator.wikimedia.org/T388610) [18:51:43] vriley@cumin1002 provision (PID 493311) is awaiting input [18:56:20] !log ryankemper@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs1012.eqiad.wmnet|wdqs1013.eqiad.wmnet|wdqs1014.eqiad.wmnet|wdqs1015.eqiad.wmnet|wdqs2007.codfw.wmnet|wdqs2010.codfw.wmnet|wdqs2011.codfw.wmnet|wdqs2012.codfw.wmnet|wdqs2013.codfw.wmnet [19:01:55] vriley@cumin1002 provision (PID 493311) is awaiting input [19:06:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:13:08] (03CR) 10JHathaway: "would love a review of this @mmuhlenhoff@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [19:13:14] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: Add new hostnames to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143888 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:18:39] (03Abandoned) 10JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [19:20:08] (03CR) 10Bking: [C:03+2] cirrussearch: Add new hostnames to 
site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1143888 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:21:47] (03Abandoned) 10JHathaway: puppetserver: revert private repo settings [puppet] - 10https://gerrit.wikimedia.org/r/1133564 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [19:21:56] (03Abandoned) 10JHathaway: puppet: add an ACL puppet module [puppet] - 10https://gerrit.wikimedia.org/r/1125245 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [19:22:11] (03Abandoned) 10JHathaway: puppetserver: fix gitpuppet group on puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/1125246 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [19:23:31] (03PS1) 10Jgreen: Remove fran1001.frack.eqiad.wmnet from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1143889 (https://phabricator.wikimedia.org/T392818) [19:24:51] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with OS bullseye [19:25:05] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808450 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye [19:25:28] (03Abandoned) 10JHathaway: hiera: acme_chief: move community-crm to crm2001 [puppet] - 10https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:27:27] (03CR) 10Dwisehaupt: [C:03+1] "shipit" [puppet] - 10https://gerrit.wikimedia.org/r/1143889 (https://phabricator.wikimedia.org/T392818) (owner: 10Jgreen) [19:35:56] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [19:37:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] 
- https://phabricator.wikimedia.org/T389635#10808472 (10VRiley-WMF) [19:38:24] (03CR) 10Cwhite: [C:03+1] "PCC was a trackpad mis-click. LGTM when ready!" [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [19:45:24] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [19:45:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [19:48:58] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1006.eqiad.wmnet with OS bullseye [19:49:03] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808486 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1006.eqiad.wmnet with OS bullseye e... 
[19:50:46] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [19:50:57] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1054.eqiad.wmnet'] [19:54:07] (03CR) 10TChin: [C:03+1] "Ok, this time this definitely needs a backport :) After this, a version bump to the eventgate-analytics-external deployment chart is neede" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [19:55:11] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1006.eqiad.wmnet with OS bullseye [19:55:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808501 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1006.eqiad.wmnet with OS bullseye [19:55:35] (03PS8) 10Ebernhardson: search: add discovery records for secondary clusters [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) [19:55:35] (03PS1) 10Ebernhardson: search: cname specific search clusters to the lvs pool [dns] - 10https://gerrit.wikimedia.org/r/1143891 (https://phabricator.wikimedia.org/T143553) [20:00:07] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1053 to cirrussearch1053 [20:00:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:51] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming 
elastic1053 to cirrussearch1053 - bking@cumin2002" [20:04:22] !log bking@cumin2002 removed unrelated `fran1001` DNS record during a rename [20:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:57] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1053 to cirrussearch1053 - bking@cumin2002" [20:05:57] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:05:57] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1053 on all recursors [20:06:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1053 on all recursors [20:06:02] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1053 [20:06:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [20:07:49] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1053 [20:08:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1053 to cirrussearch1053 [20:09:21] !log jgreen@cumin1002 START - Cookbook sre.dns.netbox [20:11:46] !log jgreen@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:12:47] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission fran1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T393813#10808563 (10Jgreen) [20:13:59] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission fran1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T393813#10808566 (10Jgreen) I ran "sudo secure-cookbook sre.dns.netbox -t T393813 "Remove host 
fran1001.frack.eqiad.wmnet from DNS for decommissioning" on cumin1002 and it ended with PA... [20:14:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1053.eqiad.wmnet with OS bullseye [20:14:44] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1053 [20:14:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1053 [20:15:26] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1006.eqiad.wmnet with reason: host reimage [20:18:37] (03CR) 10Dr0ptp4kt: "Scheduled the backport deploy for 4 PM ET / 3 PM CT - https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2300130&oldid=2300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143772 (https://phabricator.wikimedia.org/T391959) (owner: 10Dr0ptp4kt) [20:18:58] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1006.eqiad.wmnet with reason: host reimage [20:20:34] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1053.eqiad.wmnet with OS bullseye [20:21:55] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:09] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:23:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1053.eqiad.wmnet with OS bullseye [20:23:23] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1053 [20:23:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1053 [20:24:17] PROBLEM - PyBal IPVS diff check on 
lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:30:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1053.eqiad.wmnet with OS bullseye [20:31:51] ^^ I hope I'm not to blame for that pybal mismatch, checking now [20:32:16] inflatador: so, I suspect this might be because elastic1053 was pooled during its rename [20:32:52] inflatador: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1020&service=PyBal+IPVS+diff+check yup 1053 source of error [20:32:57] damn! I thought I moved that one out of conftool already. Sorry swfrench-wmf and anyone else. Will get a patch up shortly [20:33:09] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:34:15] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:34:23] * swfrench-wmf scratches head [20:35:56] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:36:03] I renamed the host from `elastic1053` to `cirrussearch1053` ... the host wouldn't PXE boot, and came back up again with hostname `elastic1053`. Maybe that has something to do with it? 
[20:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:32] (03PS1) 10Bking: cirrussearch: remove elastic1053 from conftool config [puppet] - 10https://gerrit.wikimedia.org/r/1143894 (https://phabricator.wikimedia.org/T388610) [20:39:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:39:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1006.eqiad.wmnet with OS bullseye [20:39:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808623 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1006.eqiad.wmnet with OS bullseye c... [20:40:34] (03CR) 10Bking: [C:03+2] cirrussearch: remove elastic1053 from conftool config [puppet] - 10https://gerrit.wikimedia.org/r/1143894 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:40:51] (03CR) 10Bking: [C:03+2] "self-merging to prevent further pybal alerts." [puppet] - 10https://gerrit.wikimedia.org/r/1143894 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:41:39] inflatador: interesting! yeah, I'm not sure ... 
I wonder if there was a period where the IP (reverse) resolved to the new name, then that was reverted, but then it took 5m for that to TTL out? [20:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:42:33] in any case, https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging starts with depooling, then IIRC step #2 can include removing the old name from `conftool-data` (whilst also adding the new name) [20:43:53] yeah, it was just a mistake on my part....should've been removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143589 [20:44:57] ah, got it :) [20:45:13] thanks for hopping on this, and noticing the pybal alert! [20:46:01] (03CR) 10Ryan Kemper: [C:03+2] DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 (owner: 10Volans) [20:46:17] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on elastic1054.eqiad.wmnet with reason: downtime prior to decom [20:46:28] (03Abandoned) 10Ryan Kemper: [wip] wdqs: point query.wikidata.org to main graph [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138935 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [20:49:48] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1068 to cirrussearch1068 [20:50:12] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:52:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:53:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from elastic1068 to cirrussearch1068 [21:04:08] (03PS1) 10Bking: WIP: Add cirrussearch1122 as chi master-eligible [puppet] - 10https://gerrit.wikimedia.org/r/1143897 (https://phabricator.wikimedia.org/T388610) [21:05:07] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with 
OS bullseye [21:05:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808663 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye [21:07:33] PROBLEM - Disk space on arclamp2001 is CRITICAL: DISK CRITICAL - free space: /srv 10179 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops [21:09:21] (03CR) 10BCornwall: [C:03+1] varnish: Issue and handle WMF-Uniq cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [21:09:43] PROBLEM - Disk space on arclamp1001 is CRITICAL: DISK CRITICAL - free space: /srv 10205 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops [21:14:38] (03CR) 10BCornwall: search: add discovery records for secondary clusters (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1143617 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:16:46] (03CR) 10BCornwall: [C:03+1] search: cname specific search clusters to the lvs pool [dns] - 10https://gerrit.wikimedia.org/r/1143891 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:17:57] (03CR) 10BCornwall: [C:03+1] trafficserver: Allow splitting the cache by HTTP header content [puppet] - 10https://gerrit.wikimedia.org/r/1143599 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [21:33:37] (03PS2) 10BCornwall: site.pp: Add new insetup::traffic codfw cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139559 (https://phabricator.wikimedia.org/T392851) [21:34:55] 10ops-codfw, 06DC-Ops, 06Traffic, 
13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10808737 (10BCornwall) 05Open→03In progress [21:52:42] (03PS1) 10Dwisehaupt: community-civicrm: specify cfssl certs for postfix [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) [21:53:31] (03CR) 10Dwisehaupt: "@jhathaway@wikimedia.org I believe this is the change we need after your updates to the main postfix module to allow using cfssl certs." [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:54:12] (03CR) 10Dwisehaupt: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:55:32] (03CR) 10CI reject: [V:04-1] community-civicrm: specify cfssl certs for postfix [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:57:37] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1007.eqiad.wmnet with OS bullseye [21:57:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808764 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye e... 
[22:02:29] (03PS2) 10Dwisehaupt: community-civicrm: specify cfssl certs for postfix [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) [22:03:37] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:07:32] (03PS3) 10Dwisehaupt: community-civicrm: specify cfssl certs for postfix [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) [22:09:20] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-fe1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:10:05] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with OS bullseye [22:10:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808795 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye [22:16:15] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 85003MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [22:20:51] (03CR) 10Dwisehaupt: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [22:26:55] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:14] (03CR) 
10JHathaway: [C:03+2] community-civicrm: specify cfssl certs for postfix [puppet] - 10https://gerrit.wikimedia.org/r/1143910 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [23:02:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1007.eqiad.wmnet with OS bullseye [23:02:39] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10808859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host thanos-fe1007.eqiad.wmnet with OS bullseye e... [23:12:12] (03PS1) 10DDesouza: miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143911 (https://phabricator.wikimedia.org/T344471) [23:27:36] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143911 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [23:29:24] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143911 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [23:38:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1143913 [23:38:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1143913 (owner: 10TrainBranchBot) [23:50:48] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1143913 (owner: 10TrainBranchBot)