[00:00:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1112-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P75844 and previous config saved to /var/cache/conftool/dbconfig/20250507-000354-ladsgroup.json [00:08:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1142728 [00:08:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1142728 (owner: 10TrainBranchBot) [00:10:02] (03PS2) 10Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - 10https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) [00:10:11] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:12:02] (03CR) 10Andrew Bogott: [C:03+2] wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - 10https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [00:12:24] (03PS3) 10Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - 10https://gerrit.wikimedia.org/r/1142684 
(https://phabricator.wikimedia.org/T392539) [00:15:44] (03CR) 10Andrew Bogott: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [00:16:53] !log andrew@dns1004 START - running authdns-update [00:19:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T382778)', diff saved to https://phabricator.wikimedia.org/P75845 and previous config saved to /var/cache/conftool/dbconfig/20250507-001901-ladsgroup.json [00:19:04] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [00:19:17] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:19:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T382778)', diff saved to https://phabricator.wikimedia.org/P75846 and previous config saved to /var/cache/conftool/dbconfig/20250507-001924-ladsgroup.json [00:19:33] !log andrew@dns1004 END - running authdns-update [00:21:08] !log hmonroy@deploy1003 hmonroy, musikanimal: Backport for [[gerrit:1142714|Revert "JavaScript: ESLint 8.57.0" (T381577)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:21:11] T381577: Highlighting of syntax errors, warnings, infos for Wikitext editor - https://phabricator.wikimedia.org/T381577 [00:21:54] musikanimal can you take a look at testservers? [00:22:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10798579 (10Papaul) @VRiley-WMF After the move, the server is not booting into the OS , it is stuck at "loading initial ramdisk" when you get back on site can you please power down the server, make sur... 
[00:22:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T382778)', diff saved to https://phabricator.wikimedia.org/P75847 and previous config saved to /var/cache/conftool/dbconfig/20250507-002226-ladsgroup.json [00:26:25] !log hmonroy@deploy1003 hmonroy, musikanimal: Continuing with sync [00:28:51] (03PS1) 10Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - 10https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539) [00:29:28] (03PS2) 10Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames again [dns] - 10https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539) [00:29:35] (03PS3) 10Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames again [dns] - 10https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539) [00:30:32] (03CR) 10Andrew Bogott: [C:03+2] wikimediacloud.org: move codfw1dev rabbitmq cnames again [dns] - 10https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [00:30:37] !log andrew@dns1004 START - running authdns-update [00:32:55] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:33:17] !log andrew@dns1004 END - running authdns-update [00:37:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P75848 and previous config saved to /var/cache/conftool/dbconfig/20250507-003733-ladsgroup.json [00:39:48] !log hmonroy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142714|Revert "JavaScript: ESLint 8.57.0" (T381577)]] (duration: 47m 14s) [00:39:51] T381577: Highlighting of syntax errors, warnings, infos for Wikitext editor - 
https://phabricator.wikimedia.org/T381577 [00:43:26] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1142728 (owner: 10TrainBranchBot) [00:44:50] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev rabbit config: remove a comment that is no longer true [puppet] - 10https://gerrit.wikimedia.org/r/1142687 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [00:44:52] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev: remove rabbitmq from cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142688 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [00:52:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P75849 and previous config saved to /var/cache/conftool/dbconfig/20250507-005240-ladsgroup.json [00:57:04] (03PS1) 10Andrew Bogott: codfw1dev rabbitmq: remove contactgroups: wmcs-team-email from role hiera [puppet] - 10https://gerrit.wikimedia.org/r/1142740 (https://phabricator.wikimedia.org/T392539) [00:59:07] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035#10798610 (10Eevans) >>! In T307035#10797869, @Jclark-ctr wrote: > @Eevans is this still needed? or can it be resolved? It's still needed... but, I wonder when they're do for a refresh? They... 
[01:00:45] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev rabbitmq: remove contactgroups: wmcs-team-email from role hiera [puppet] - 10https://gerrit.wikimedia.org/r/1142740 (https://phabricator.wikimedia.org/T392539) (owner: 10Andrew Bogott) [01:06:55] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:07:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T382778)', diff saved to https://phabricator.wikimedia.org/P75850 and previous config saved to /var/cache/conftool/dbconfig/20250507-010748-ladsgroup.json [01:07:52] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [01:08:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:08:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T382778)', diff saved to https://phabricator.wikimedia.org/P75851 and previous config saved to /var/cache/conftool/dbconfig/20250507-010811-ladsgroup.json [01:11:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T382778)', diff saved to https://phabricator.wikimedia.org/P75852 and previous config saved to /var/cache/conftool/dbconfig/20250507-011114-ladsgroup.json [01:13:59] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:26:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P75853 and previous config saved to 
/var/cache/conftool/dbconfig/20250507-012621-ladsgroup.json [01:26:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:36:37] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:39:46] (03PS1) 10MusikAnimal: Hooks: disable if content model is unset AND CodeMirror beta is set [extensions/CodeEditor] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142754 (https://phabricator.wikimedia.org/T373711) [01:41:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P75854 and previous config saved to /var/cache/conftool/dbconfig/20250507-014128-ladsgroup.json [01:56:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T382778)', diff saved to https://phabricator.wikimedia.org/P75855 and previous config saved to /var/cache/conftool/dbconfig/20250507-015636-ladsgroup.json [01:56:39] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [01:56:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2188.codfw.wmnet with reason: Maintenance [01:56:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T382778)', diff saved to https://phabricator.wikimedia.org/P75856 and previous config saved to /var/cache/conftool/dbconfig/20250507-015658-ladsgroup.json [01:59:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T382778)', diff saved to https://phabricator.wikimedia.org/P75857 and previous config 
saved to /var/cache/conftool/dbconfig/20250507-015955-ladsgroup.json [02:07:13] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:15:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P75858 and previous config saved to /var/cache/conftool/dbconfig/20250507-021502-ladsgroup.json [02:30:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P75859 and previous config saved to /var/cache/conftool/dbconfig/20250507-023009-ladsgroup.json [02:32:55] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/CodeEditor] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142754 (https://phabricator.wikimedia.org/T373711) (owner: 10MusikAnimal) [02:34:23] (03Merged) 10jenkins-bot: Hooks: disable if content model is unset AND CodeMirror beta is set [extensions/CodeEditor] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142754 (https://phabricator.wikimedia.org/T373711) (owner: 10MusikAnimal) [02:34:52] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1142754|Hooks: disable if content model is unset AND CodeMirror beta is set (T373711)]] [02:34:55] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711 [02:41:36] !log tstarling@deploy1003 tstarling, musikanimal: Backport for [[gerrit:1142754|Hooks: disable if content model is unset AND CodeMirror beta is 
set (T373711)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [02:41:39] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711 [02:45:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T382778)', diff saved to https://phabricator.wikimedia.org/P75860 and previous config saved to /var/cache/conftool/dbconfig/20250507-024518-ladsgroup.json [02:45:21] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [02:45:34] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2202.codfw.wmnet with reason: Maintenance [02:46:32] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2212.codfw.wmnet with reason: Maintenance [02:46:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T382778)', diff saved to https://phabricator.wikimedia.org/P75861 and previous config saved to /var/cache/conftool/dbconfig/20250507-024638-ladsgroup.json [02:49:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T382778)', diff saved to https://phabricator.wikimedia.org/P75862 and previous config saved to /var/cache/conftool/dbconfig/20250507-024933-ladsgroup.json [02:50:42] (03PS4) 10Andrea Denisse: graphite: Allow x-grafana-device-id header in CORS config [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) [02:50:42] (03CR) 10Andrea Denisse: "Hi team, I tested this patch by editing the respective configuration file, making a request with Curl to see if the server replied with th" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse) [02:51:35] (03CR) 10Andrea Denisse: graphite: Allow x-grafana-device-id header in CORS config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142680 
(https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse) [02:58:33] !log tstarling@deploy1003 tstarling, musikanimal: Continuing with sync [03:04:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P75863 and previous config saved to /var/cache/conftool/dbconfig/20250507-030440-ladsgroup.json [03:06:59] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142754|Hooks: disable if content model is unset AND CodeMirror beta is set (T373711)]] (duration: 32m 06s) [03:07:02] T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711 [03:19:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P75864 and previous config saved to /var/cache/conftool/dbconfig/20250507-031947-ladsgroup.json [03:34:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T382778)', diff saved to https://phabricator.wikimedia.org/P75865 and previous config saved to /var/cache/conftool/dbconfig/20250507-033455-ladsgroup.json [03:34:59] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [03:35:11] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2216.codfw.wmnet with reason: Maintenance [03:35:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T382778)', diff saved to https://phabricator.wikimedia.org/P75866 and previous config saved to /var/cache/conftool/dbconfig/20250507-033518-ladsgroup.json [03:38:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T382778)', diff saved to https://phabricator.wikimedia.org/P75867 and previous config saved to /var/cache/conftool/dbconfig/20250507-033812-ladsgroup.json [03:53:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P75868 and previous config saved to /var/cache/conftool/dbconfig/20250507-035319-ladsgroup.json [04:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P75869 and previous config saved to /var/cache/conftool/dbconfig/20250507-040826-ladsgroup.json [04:23:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T382778)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250507-042334-ladsgroup.json [04:23:45] T382778: Optimize text table - https://phabricator.wikimedia.org/T382778 [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:06:55] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:16:55] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:26:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes 
(http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:03] (03CR) 10Muehlenhoff: [C:03+2] Extend package list to be installed from component/puppet7 on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1140716 (https://phabricator.wikimedia.org/T392790) (owner: 10Muehlenhoff) [05:29:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [05:34:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [05:40:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [05:41:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [05:41:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [05:41:55] FIRING: [19x] ProbeDown: Service ganeti1032:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:44:31] (03CR) 10Ayounsi: [C:03+2] esams: remove Tele2 transit [homer/public] - 10https://gerrit.wikimedia.org/r/1142539 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi) [05:45:03] (03Merged) 10jenkins-bot: esams: remove Tele2 transit [homer/public] - 10https://gerrit.wikimedia.org/r/1142539 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi) [05:46:07] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10798820 (10MoritzMuehlenhoff) The cause of the regression is now identified; the 
backport to 6.1. missed an depending patch: https... [05:48:52] !log decom Tele2 transit in esams - T393401 [05:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [05:57:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [05:58:46] (03PS1) 10Ayounsi: Remove Tele2 and Novacore from check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T0600) [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:01:55] FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:03:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [06:04:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [06:06:16] (03PS2) 10Ayounsi: Remove Tele2, Fiberring and Novacore from check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) [06:06:55] FIRING: [19x] ProbeDown: Service ganeti1033:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:06:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Nice idea to rely on the manufacturer fact to select storcli." [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [06:08:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [06:13:47] jmm@cumin2002 drain-node (PID 1021457) is awaiting input [06:18:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [06:19:11] (03PS1) 10RLazarus: mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) [06:19:19] (03PS1) 10RLazarus: deployment_server: Add --env to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) [06:20:46] (03PS2) 10RLazarus: mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) [06:20:53] (03CR) 10CI reject: [V:04-1] deployment_server: Add --env to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [06:24:04] (03CR) 10RLazarus: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [06:24:07] (03PS1) 10Muehlenhoff: Fix wdqs-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1142796 [06:24:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [06:25:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [06:26:55] FIRING: [19x] ProbeDown: Service ganeti1034:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:33:32] (03PS1) 10Kosta Harlan: temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) [06:51:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 263569 [06:52:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263569 [06:52:15] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268517 [06:52:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268517 [06:52:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264595 [06:52:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264595 [06:52:51] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 35847 [06:53:10] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 35847 [06:53:59] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268097 [06:54:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268097 [06:54:45] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 24441 [06:55:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 24441 [06:55:30] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61588 [06:55:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61588 [06:59:37] 10ops-ulsfo, 06SRE, 
06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10798932 (10ayounsi) 05Resolved→03Open Unfortunately we're not out of the wood yet... `cr3-ulsfo> show interfaces et-0/0/0 media` still shows lo... [07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:56] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10798948 (10JMeybohm) There where a bunch of IO errors on May 4th and 5th, so I would believe the disk needs replacement [07:06:06] (03CR) 10JMeybohm: [C:03+1] python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 (owner: 10Hashar) [07:07:49] (03CR) 10JMeybohm: [C:03+2] python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 (owner: 10Hashar) [07:08:20] (03CR) 10JMeybohm: [V:03+2 C:03+2] python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 (owner: 10Hashar) [07:11:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:13:48] (03PS1) 10Slyngshede: IDP-Test: Test installation of CAS 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1142929 [07:13:55] (03CR) 10Elukey: [C:03+2] profile::pyrra::filesystem::slo: enable alerts for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/1142596 (https://phabricator.wikimedia.org/T391852) (owner: 10Elukey) 
[07:14:53] (03CR) 10Slyngshede: [C:03+2] IDP-Test: Test installation of CAS 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1142929 (owner: 10Slyngshede) [07:14:58] !log slyngshede@dns1004 START - running authdns-update [07:16:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:17:34] !log slyngshede@dns1004 END - running authdns-update [07:18:08] (03CR) 10Elukey: raid: update facter and get-raid-status to allow storcli (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [07:18:12] (03PS3) 10Elukey: raid: update facter and get-raid-status to allow storcli [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) [07:20:07] (03CR) 10Elukey: raid: update facter and get-raid-status to allow storcli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [07:24:24] (03PS1) 10Arnaudb: gerrit: add abusers [puppet] - 10https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544) [07:25:17] (03CR) 10Elukey: [C:03+2] raid: update facter and get-raid-status to allow storcli [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [07:26:43] (03PS2) 10Arnaudb: gerrit: add abusers [puppet] - 10https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544) [07:29:19] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544) (owner: 10Arnaudb) [07:32:35] (03CR) 10Arnaudb: [C:03+2] gerrit: add abusers [puppet] - 10https://gerrit.wikimedia.org/r/1142930 
(https://phabricator.wikimedia.org/T393544) (owner: 10Arnaudb) [07:34:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:37:40] (03CR) 10Zabe: [C:03+2] SkinTemplate: Restore a string 'class' in tabAction() [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142715 (https://phabricator.wikimedia.org/T393504) (owner: 10Zabe) [07:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:39:45] FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [07:43:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:49:25] (03Merged) 10jenkins-bot: SkinTemplate: Restore a string 'class' in tabAction() [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142715 (https://phabricator.wikimedia.org/T393504) (owner: 10Zabe) [07:50:11] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1142715|SkinTemplate: Restore a string 'class' in tabAction() (T393504)]] [07:50:14] T393504: PHP Warning: Array to string 
conversion - https://phabricator.wikimedia.org/T393504 [07:51:53] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10799037 (10ayounsi) 05Open→03Resolved After chatting with Cathal, we decided to leave it as is, as moving ports requires intrusive changes (P... [07:56:51] !log zabe@deploy1003 zabe: Backport for [[gerrit:1142715|SkinTemplate: Restore a string 'class' in tabAction() (T393504)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:56:54] T393504: PHP Warning: Array to string conversion - https://phabricator.wikimedia.org/T393504 [08:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T0800) [08:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:13] (03CR) 10Hashar: "Thank you and that worked!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 (owner: 10Hashar) [08:02:44] !log zabe@deploy1003 zabe: Continuing with sync [08:03:39] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 (10ayounsi) 03NEW [08:03:49] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10799104 (10ayounsi) [08:04:34] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10799106 (10ayounsi) [08:05:19] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10799110 (10ayounsi) [08:09:13] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142715|SkinTemplate: Restore a string 'class' in tabAction() (T393504)]] (duration: 19m 01s) [08:09:16] T393504: PHP Warning: Array to string conversion - https://phabricator.wikimedia.org/T393504 [08:15:14] (03PS1) 10Elukey: raid::broadcom: fix perccli package name [puppet] - 10https://gerrit.wikimedia.org/r/1142978 (https://phabricator.wikimedia.org/T393146) [08:15:30] (03CR) 10Elukey: [V:03+2 C:03+2] raid::broadcom: fix perccli package name [puppet] - 10https://gerrit.wikimedia.org/r/1142978 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [08:16:57] (03PS2) 10Majavah: P:toolforge: Apply admin-root sudo policy to all instances [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) [08:17:21] (03CR) 10Majavah: P:toolforge: Apply admin-root sudo policy to all instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah) [08:24:17] (03PS3) 10AOkoth: wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 
(https://phabricator.wikimedia.org/T392128) [08:26:41] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [08:27:20] (03CR) 10Harroyo-wmf: [C:03+1] temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: 10Kosta Harlan) [08:28:36] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5473/console" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse) [08:31:04] (03CR) 10Volans: [C:03+1] "LGTM, but being a large "syntax" change the best way to ensure it works on all cases is testing it :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [08:31:49] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. 
I guess we can start thinking of moving this to alertmanager and possibly using some of the additional metadata - like groups - to " [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi) [08:33:17] (03CR) 10Volans: [C:03+2] elasticsearch: temporarily remove it from bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [08:34:09] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5474/console" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse) [08:35:07] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, I'm puzzled as to why PCC is not detecting the change though" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse) [08:37:06] (03CR) 10Muehlenhoff: raid: update facter and get-raid-status to allow storcli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [08:38:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [08:41:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [08:43:13] (03Merged) 10jenkins-bot: elasticsearch: temporarily remove it from bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [08:44:45] FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:45:08] (03CR) 10Muehlenhoff: [C:03+2] Switch krb2002 to 
nftables [puppet] - 10https://gerrit.wikimedia.org/r/1140143 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [08:46:32] (03CR) 10Volans: "question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [08:49:45] RESOLVED: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:54:04] !log update `host-inbound-traffic system-services` on pfw1-eqiad - T390052 [08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:08] T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 [08:55:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1035.eqiad.wmnet [08:55:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1035.eqiad.wmnet [08:56:55] FIRING: [19x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:57:21] (03CR) 10FNegri: [C:03+1] "Thanks, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah) [08:59:01] (03CR) 10Brouberol: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [08:59:28] (03PS3) 10Majavah: P:toolforge: Apply admin-root sudo policy to all instances [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) [09:01:13] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10799352 (10Volans) Sure, but we first need a decision on what's the standardized and correct way to commit automatic dbctl changes from cookbo... [09:02:25] (03CR) 10Majavah: [C:03+2] P:toolforge: Apply admin-root sudo policy to all instances [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah) [09:06:55] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:07:57] (03CR) 10Ayounsi: [C:03+2] Remove Tele2, Fiberring and Novacore from check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi) [09:09:59] (03PS1) 10AOkoth: gerrit: apache ratelimit test [puppet] - 10https://gerrit.wikimedia.org/r/1143019 [09:10:33] (03CR) 10Ayounsi: [C:03+2] "We already have the alert up and running : https://github.com/wikimedia/operations-alerts/blob/master/team-netops/bgp.yaml#L3" [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi) [09:13:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes 
(http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:31:56] 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565 (10MoritzMuehlenhoff) 03NEW [09:32:00] 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565#10799422 (10MoritzMuehlenhoff) p:05Triage→03High [09:34:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [09:36:04] 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565#10799428 (10hnowlan) a:03hnowlan I believe these are both safe to clean up, I'll handle it. 
[09:36:43] (03PS1) 10Elukey: raid: allow OK in general state for get-raid-status-broadcom.py [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) [09:38:46] jmm@cumin2002 drain-node (PID 1229267) is awaiting input [09:40:43] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1143025 [09:41:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [09:41:31] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1143025 (owner: 10Muehlenhoff) [09:41:58] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [09:42:26] (03CR) 10Elukey: [C:03+2] raid: allow OK in general state for get-raid-status-broadcom.py [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [09:43:59] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:46:55] FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:47:25] 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565#10799457 (10akosiaris) 05Open→03Resolved {{done}} [09:48:59] RESOLVED: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:51:33] (03PS4) 10Hnowlan: trafficserver: route all but zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [09:52:52] (03PS5) 10Hnowlan: trafficserver: route all but enwiki/zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) [09:54:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1036.eqiad.wmnet [09:54:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1036.eqiad.wmnet [09:55:43] (03PS3) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) [09:55:48] (03CR) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [09:55:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [09:56:31] (03CR) 10David Caro: raid: allow OK in general state for get-raid-status-broadcom.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) 
[09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:55] FIRING: [19x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:58] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1143025 (owner: 10Muehlenhoff) [09:57:04] (03CR) 10Jgiannelos: [C:03+1] trafficserver: route all but enwiki/zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) (owner: 10Hnowlan) [09:59:33] (03PS1) 10Elukey: raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) [09:59:46] (03CR) 10Elukey: [C:03+2] raid: allow OK in general state for get-raid-status-broadcom.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1000) [10:00:38] (03CR) 10CI reject: [V:04-1] raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [10:01:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [10:01:55] FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service 
on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:51] (03PS2) 10Elukey: raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) [10:08:38] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [10:08:46] (03CR) 10Elukey: [C:03+2] raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [10:10:15] (03CR) 10Hnowlan: [C:03+2] trafficserver: route all but enwiki/zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) (owner: 10Hnowlan) [10:11:00] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10799493 (10MoritzMuehlenhoff) [10:11:03] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10799494 (10MoritzMuehlenhoff) [10:11:55] FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1037.eqiad.wmnet [10:14:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1037.eqiad.wmnet [10:16:55] FIRING: [19x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [10:18:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [10:19:10] FYI, kubestagemaster1003 will briefly go down for a Ganeti reboot [10:19:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [10:21:09] PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:21] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on krb2002.codfw.wmnet with reason: update to Bookworm [10:22:28] 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10799527 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3de6b492-82de-43f4-8903-cb18d7303b18) set by jmm@cumin2002 for 3:00:00 on 1 host(s) and their services with reason: update t... 
[10:25:43] RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [10:25:57] FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:26:55] FIRING: [20x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:27:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [10:27:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [10:27:27] !log upgrading krb2002 to Bookworm T390863 [10:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:29] T390863: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863 [10:27:49] (03CR) 10Tchanders: [C:03+1] temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: 10Kosta Harlan) [10:28:59] FIRING: [21x] ProbeDown: Service ganeti1038:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:57] RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:31:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [10:33:13] (03PS2) 10Tchanders: Assign IP auto-reveal rights to certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492) [10:34:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [10:35:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [10:40:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [10:42:23] (03CR) 10Kamila Součková: [C:03+2] benthos/mw_accesslog_metrics: increase buffering [puppet] - 10https://gerrit.wikimedia.org/r/1142625 (owner: 10Kamila Součková) [10:42:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on 
k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet [10:46:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [10:46:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393574 (10ops-monitoring-bot) 03NEW [10:46:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393575 (10ops-monitoring-bot) 03NEW [10:47:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393576 (10ops-monitoring-bot) 03NEW [10:47:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393577 (10ops-monitoring-bot) 03NEW [10:47:52] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10799666 (10MoritzMuehlenhoff) [10:49:39] (03PS1) 10Elukey: icinga: update raid_handler.py with 'broadcom' instead of 'perccli' [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) [10:51:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [10:51:46] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet [10:56:55] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:57:00] (03CR) 10Elukey: "The RAID_TYPES variable seems not used in the raid_handler.py script, but I'd proceed anyway for consistency." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [10:57:01] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:57:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet [11:00:04] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1100). 
[11:00:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [11:01:04] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T393578 (10Seddon) 03NEW [11:01:21] FYI, kubestagemaster1004 and dse-k8s-etcd1002 will briefly go down for a Ganeti reboot [11:01:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [11:01:28] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579 (10Seddon) 03NEW [11:01:50] 06SRE, 10SRE-Access-Requests: Requesting access to for  - https://phabricator.wikimedia.org/T393578#10799730 (10Seddon) 05Open→03Invalid [11:03:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:39] PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:57] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:45] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [11:06:07] RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [11:06:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [11:06:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [11:07:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [11:08:16] (03PS10) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 [11:08:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed 
probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:14] (03PS13) 10Ayounsi: wmf-netbox use core Homer GraphQL based fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) [11:10:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:57] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:12:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1163.eqiad.wmnet with reason: Maintenance [11:12:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2203.codfw.wmnet with reason: Maintenance [11:15:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:46] (03PS1) 10Kamila Součková: mw-cron/CampaignEvents: Migrate aggregateanswers-{meta,office}wiki [puppet] - 10https://gerrit.wikimedia.org/r/1143058 (https://phabricator.wikimedia.org/T385867) [11:19:01] (03CR) 10Jelto: [C:03+2] gerrit: split Gerrit and Gitiles proxy pools [puppet] - 
10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [11:41:35] RECOVERY - Hadoop NodeManager on an-worker1193 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet [11:44:11] (03PS5) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) [11:46:46] jmm@cumin2002 drain-node (PID 1359506) is awaiting input [11:47:12] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10799897 (10cmassaro) @tappof Thank you! I am not actually sure. I'm looking at https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml,... 
[11:49:50] FYI, ml-etcd1001 will briefly go down for a Ganeti reboot [11:50:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:50:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:37] (03CR) 10Filippo Giunchedi: [C:03+1] icinga: update raid_handler.py with 'broadcom' instead of 'perccli' [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [11:53:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:48] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10799978 (10MoritzMuehlenhoff) >>! In T393140#10799897, @cmassaro wrote: > @tappof Thank you! I am not actually sure. I'm looking at https://phabricator.wikimedia.org/source/operations-pu... [11:55:12] jmm@cumin2002 drain-node (PID 1359506) is awaiting input [11:55:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [11:56:41] (03PS3) 10Ayounsi: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [11:56:54] (03CR) 10Muehlenhoff: [C:03+1] "It's used in parse_args() it seems." 
[puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [11:57:47] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:52] (03CR) 10Brouberol: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [11:58:57] (03CR) 10Ayounsi: WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [12:00:27] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [12:00:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [12:00:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet [12:01:48] (03CR) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:02] (03PS6) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) [12:05:32] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 [12:05:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet [12:07:36] (03CR) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:11:44] jmm@cumin2002 drain-node (PID 1382334) is awaiting input [12:12:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet [12:13:33] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587 (10Justman10000) 03NEW [12:14:24] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800019 (10Justman10000) And how to submit the SSH key? As file? Via text? [12:14:48] (03CR) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [12:15:40] (03CR) 10Hnowlan: [C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 (owner: 10Muehlenhoff) [12:17:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [12:18:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet [12:18:13] (03CR) 10Cathal Mooney: [C:03+1] "LGTM if you've tested it against all devices." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [12:18:19] (03CR) 10Cathal Mooney: [C:03+1] netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [12:18:30] (03CR) 10Cathal Mooney: [C:03+1] wmf-netbox use core Homer GraphQL based fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [12:20:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:45] (03PS1) 10Kamila Součková: mw-cron/WikimediaEvents: Migrate UpdatePeriodicMetrics jobs [puppet] - 10https://gerrit.wikimedia.org/r/1143073 (https://phabricator.wikimedia.org/T388542) [12:22:00] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:25:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:27:03] (03CR) 10Brouberol: [C:03+1] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:32:18] (03CR) 10Brouberol: "Small edit suggestion" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:35:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [12:38:01] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:38:28] !log installing imagemagick security updates [12:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:36] (03CR) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:41:46] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800154 (10Aklapper) 05Open→03Declined Per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups, `ops` is for SRE staff only. 
[12:41:52] jmm@cumin2002 drain-node (PID 1412672) is awaiting input [12:41:55] FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:01] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:11] (03PS7) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) [12:43:49] (03CR) 10Jelto: [C:03+2] gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [12:43:51] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800161 (10Aklapper) Also per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access this is the wrong form. Which docs are you following and why? Please also see T393499#1079... 
[12:44:04] (03PS2) 10Hashar: gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) [12:44:14] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-04-16-213143 to 2025-05-07-003410 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143074 (https://phabricator.wikimedia.org/T386239) [12:44:25] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143075 (https://phabricator.wikimedia.org/T386239) [12:44:44] (03CR) 10Brouberol: [C:03+1] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [12:45:46] (03PS8) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) [12:46:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet [12:47:04] (03CR) 10Jelto: [C:03+2] gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [12:48:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [12:51:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [12:57:00] 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10800203 (10NBaca-WMF) As Seddon’s manager I approve this request [12:58:04] !log [wikishared]> CREATE INDEX translation_last_updated_timestamp ON cx_translations (translation_last_updated_timestamp); (T392839) [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:08] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet [13:00:12] I can’t deploy today [13:01:18] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595 (10isarantopoulos) 03NEW [13:02:00] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800232 (10isarantopoulos) [13:04:50] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800237 (10isarantopoulos) [13:05:14] jmm@cumin2002 drain-node (PID 1437955) is awaiting input [13:05:36] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800239 (10Justman10000) >>! In T393587#10800154, @Aklapper wrote: > Per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups, `ops` is for SRE staff only. But I need `ops` for o... [13:05:47] (03CR) 10Btullis: [C:03+2] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:05:49] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 (owner: 10Muehlenhoff) [13:06:05] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800240 (10Justman10000) >>! 
In T393587#10800161, @Aklapper wrote: > Also per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access this is the wrong form. Which doc... [13:06:06] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 (owner: 10Muehlenhoff) [13:06:43] Amir1: do you think this is compatible with the CX queries? T393513 [13:06:43] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [13:06:53] (03CR) 10CI reject: [V:04-1] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:07:02] Daimona: I'm debugging [13:07:06] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [13:07:08] !log Restarted Apache httpd server on Gerrit server [13:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:11] was there a newer one than yesterday? [13:07:14] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [13:07:15] Okay great! Let me know if there's anything I can help with [13:07:22] No, this is the one from yesterday evening [13:07:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet [13:07:31] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800246 (10isarantopoulos) I approve adding Bartos... 
[13:07:41] It's similar to a pattern I saw a few days ago with a spike in open connections [13:07:43] yeah, that I'm looking at. There are still pieces in CX that are slow but I want to double check everything [13:07:52] !log installing poppler security updates [13:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:49] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:12:15] (03PS1) 10Jelto: Revert "gerrit: lower connections to Gitiles from 25 to 4" [puppet] - 10https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) [13:12:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet [13:12:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet [13:13:05] (03CR) 10Arnaudb: [C:03+1] "thanks for the quick revert" [puppet] - 10https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [13:13:59] (03CR) 10Elukey: [C:03+1] Pass krb2002 to Kerberos clients again [puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:15:14] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:15:22] (03CR) 10Elukey: "Oh right, it holds the 'choices', I missed it. Now I am wondering why it keeps working though." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [13:16:11] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800278 (10Aklapper) >>! 
In T393587#10800239, @Justman10000 wrote: > But I need `ops` for optimal working! Working on what? And //what exactly// makes you think so? So far I have found on... [13:16:54] (03CR) 10Btullis: [C:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:18:19] (03CR) 10MVernon: [C:03+2] swift: remove ms-be1060 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1140130 (https://phabricator.wikimedia.org/T392796) (owner: 10MVernon) [13:18:36] (03Merged) 10jenkins-bot: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:19:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet [13:19:41] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393598 (10ops-monitoring-bot) 03NEW [13:20:41] (03CR) 10Ssingh: icinga: skip services in wait_for_optimal if needed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [13:21:25] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:21:43] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:22:57] (03CR) 10Ayounsi: WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [13:23:08] 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10800315 (10MatthewVernon) [13:23:27] (03CR) 10MVernon: [C:03+2] swift: add ms-fe101[5,6] as new proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1140752 
(https://phabricator.wikimedia.org/T388886) (owner: 10MVernon) [13:24:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet [13:25:06] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800321 (10Aklapper) >>! In T393587#10800240, @Justman10000 wrote: > Which one should I follow? My question was "Which docs are you following and why?". This has remained unanswered. [13:25:21] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:25:34] (03CR) 10Cathal Mooney: [C:03+1] WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [13:46:21] (03CR) 10Herron: logs-api: add write/delete acl via htgroup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [13:46:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [13:47:04] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [13:47:15] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5481/co" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:48:15] (03CR) 10Elukey: [V:03+1] "Lemme know if it works now! :)" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:48:25] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800405 (10Aklapper) >>! 
In T393587#10800392, @Justman10000 wrote: >> Please provide a link to non-trivial, merged code changes of yours. > > I don't have one! Then I do not think that y... [13:49:57] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800416 (10Aklapper) > I just don't want to look stupid when I want to do something, but I can't because no permission! Looking stupid is much much more acceptable than not following http... [13:50:14] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1117 to cirrussearch1117 - bking@cumin2002" [13:50:19] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1117 to cirrussearch1117 - bking@cumin2002" [13:50:19] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:20] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1117 on all recursors [13:50:25] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1117 on all recursors [13:50:25] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1117 [13:50:31] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800429 (10Justman10000) >>! In T393587#10800405, @Aklapper wrote: >>>! In T393587#10800392, @Justman10000 wrote: >>> Please provide a link to non-trivial, merged code changes of... 
[13:50:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [13:51:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:59] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1117 [13:52:10] !log installing nginx security updates [13:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:31] 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10800438 (10Gehel) [13:52:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1117 to cirrussearch1117 [13:52:39] (03CR) 10Herron: [C:03+1] "LGTM! 
please see comment before submitting" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:53:50] (03PS1) 10Brouberol: mediawiki-dumps-legacy: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143101 (https://phabricator.wikimedia.org/T389784) [13:56:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:32] (03PS1) 10Btullis: Fix typo in mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143103 (https://phabricator.wikimedia.org/T389784) [13:56:51] (03CR) 10Elukey: [V:03+1] profile::pyrra::filesystem::slos: add test for revertrisk LA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:56:53] (03PS4) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) [13:57:02] !log sukhe@dns1004 START - running authdns-update [13:57:02] (03PS1) 10Jelto: gerrit: add more ranges to gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143104 (https://phabricator.wikimedia.org/T393498) [13:57:49] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1116.eqiad.wmnet with OS bullseye [13:58:27] (03Abandoned) 10Btullis: Fix typo in mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143103 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:58:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1117.eqiad.wmnet with OS bullseye [13:58:33] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: fix typo [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1143101 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol) [13:58:54] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1118 to cirrussearch1118 [13:59:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393574#10800460 (10elukey) 05Open→03Invalid My fault, related to T393146. [13:59:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393575#10800465 (10elukey) 05Open→03Invalid My fault, related to T393146. [13:59:18] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:59:19] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393576#10800470 (10elukey) 05Open→03Invalid My fault, related to T393146. [13:59:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393577#10800475 (10elukey) 05Open→03Invalid My fault, related to T393146. [13:59:41] !log sukhe@dns1004 END - running authdns-update [13:59:42] (03CR) 10Arnaudb: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1143104 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [14:00:02] (03CR) 10Jelto: [C:03+2] gerrit: add more ranges to gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143104 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1400) [14:00:22] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10800482 (10BTullis) 05Open→03Resolved This should be all working now @JVanderhoop-WMF - I'... 
[14:00:30] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143101 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol) [14:00:36] (03CR) 10Btullis: [C:03+2] Add scampos to the analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/1142679 (https://phabricator.wikimedia.org/T393066) (owner: 10Btullis) [14:01:12] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update evaluators from 2025-04-16-213143 to 2025-05-07-003410 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143074 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester) [14:01:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:35] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393598#10800488 (10elukey) 05Open→03Invalid My fault, related to T393146. 
[14:02:25] (03CR) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [14:02:55] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-04-16-213143 to 2025-05-07-003410 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143074 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester) [14:03:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:03:29] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1118 to cirrussearch1118 - bking@cumin2002" [14:03:48] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1118 to cirrussearch1118 - bking@cumin2002" [14:03:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:03:49] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1118 on all recursors [14:03:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1118 on all recursors [14:03:53] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1118 [14:04:32] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:05:26] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1118 [14:05:34] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:05:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:05:49] (03CR) 10Hnowlan: "We should remove the `rerendered_pcs_wikis` entry in 
helmfile.d/services/changeprop/values.yaml also." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos) [14:06:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1118 to cirrussearch1118 [14:06:57] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:07:17] !log sukhe@dns1004 START - running authdns-update [14:07:17] (03PS1) 10Brouberol: Fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143106 (https://phabricator.wikimedia.org/T389784) [14:07:39] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:07:41] (03CR) 10Hashar: [C:03+1] "From the doc at https://httpd.apache.org/docs/2.4/mod/mod_proxy.html `max` applies on a per child process. So with 5 child processes that " [puppet] - 10https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [14:07:50] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:08:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Requesting access to for  - https://phabricator.wikimedia.org/T393066#10800522 (10BTullis) 05In progress→03Resolved This should be working now @SCampos-WMF - Please feel free to let me... [14:08:52] (03CR) 10Brouberol: [C:03+2] Fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143106 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol) [14:08:52] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:08:59] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800527 (10Aklapper) Sure, please feel free to point to other meaningful technical contributions if there are no code contributions. > Is that why one don't give someone a chance? A chance... 
[14:09:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1116.eqiad.wmnet with reason: host reimage [14:09:21] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1118.eqiad.wmnet with OS bullseye [14:09:47] !log sukhe@dns1004 END - running authdns-update [14:09:58] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1117.eqiad.wmnet with reason: host reimage [14:10:00] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143075 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester) [14:10:36] (03PS2) 10Jgiannelos: pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 [14:10:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:11:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:11:34] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143075 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester) [14:12:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1116.eqiad.wmnet with reason: host reimage [14:12:03] jouncebot: nowandnext [14:12:03] For the next 0 hour(s) and 47 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1400) [14:12:03] In 2 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700) [14:12:34] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:13:03] !log gengh@deploy1003 helmfile 
[staging] DONE helmfile.d/services/wikifunctions: apply [14:13:58] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:14:48] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:15:05] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:15:20] (03CR) 10Elukey: [C:03+2] icinga: update raid_handler.py with 'broadcom' instead of 'perccli' [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [14:15:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1117.eqiad.wmnet with reason: host reimage [14:15:52] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:16:23] (03CR) 10Elukey: [C:03+2] "Self answering - it breaks before reaching any phabricator code, so it didn't create wrong/invalid tasks.." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey) [14:16:30] (03PS6) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) [14:18:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [14:18:41] (03CR) 10Scott French: [C:03+1] mw-cron/WikimediaEvents: Migrate UpdatePeriodicMetrics jobs [puppet] - 10https://gerrit.wikimedia.org/r/1143073 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [14:19:21] (03PS1) 10Jforrester: wikifunctions: Provide a quick shell script for testing prod status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143109 
(https://phabricator.wikimedia.org/T369741) [14:19:41] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-For-Review: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10800566 (10elukey) 05Open→03Resolved a:03elukey Summary: - Renamed the perccli nagios check to a more generic broadcom, tha... [14:20:38] (03CR) 10Scott French: [C:03+1] mw-cron/CampaignEvents: Migrate aggregateanswers-{meta,office}wiki [puppet] - 10https://gerrit.wikimedia.org/r/1143058 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [14:21:00] (03CR) 10Jforrester: [C:03+2] wikifunctions: Provide a quick shell script for testing prod status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143109 (https://phabricator.wikimedia.org/T369741) (owner: 10Jforrester) [14:22:05] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10800572 (10JVanderhoop-WMF) Thank you! Can confirm it works. 
[14:22:36] (03Merged) 10jenkins-bot: wikifunctions: Provide a quick shell script for testing prod status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143109 (https://phabricator.wikimedia.org/T369741) (owner: 10Jforrester) [14:22:53] (03CR) 10Scott French: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:23:02] (03PS1) 10Hnowlan: mw::maintenance: migrate all parsercache jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143110 (https://phabricator.wikimedia.org/T385800) [14:24:38] 06SRE, 06Traffic, 13Patch-For-Review: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10800593 (10CDobbins) a:05CDobbins→03ssingh [14:26:35] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1119 to cirrussearch1119 [14:26:48] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:27:39] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1081-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [14:27:51] (03PS3) 10Neslihan Turan: Create feature flags for resolving Wikibase item labels on Watchlist. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685) [14:27:55] (03PS1) 10Jelto: gerrit: add more abuse IPs [puppet] - 10https://gerrit.wikimedia.org/r/1143111 (https://phabricator.wikimedia.org/T393498) [14:29:28] FIRING: SystemdUnitCrashLoop: logstash.service crashloop on elastic1087:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:29:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1116.eqiad.wmnet with OS bullseye [14:29:41] (03CR) 10Jelto: [C:03+2] gerrit: add more abuse IPs [puppet] - 10https://gerrit.wikimedia.org/r/1143111 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto) [14:31:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:23] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1119 to cirrussearch1119 - bking@cumin2002" [14:34:28] FIRING: [9x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:35:51] (03CR) 10Andrea Denisse: [C:03+2] graphite: Allow x-grafana-device-id header in CORS config [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse) [14:36:28] bking@cumin2002 rename (PID 1527059) is awaiting input [14:37:03] (03CR) 10Scott French: "This is quite similar to the what I'm having to deal with for the refreshlinks jobs, which never adopted `sharded_periodic_job`." 
[puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:37:39] (03PS4) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) [14:38:02] (03CR) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:39:06] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1118.eqiad.wmnet with reason: host reimage [14:39:10] !log installing openjdk-17 security updates [14:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:28] FIRING: [10x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:39:37] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:40:35] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1119 to cirrussearch1119 - bking@cumin2002" [14:40:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:40:36] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1119 on all recursors [14:40:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1119 on all recursors [14:40:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1119 [14:41:39] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035#10800689 
(10Eevans) 05Open→03Declined >>! In T307035#10800347, @MatthewVernon wrote: > @Eevans refresh due Q2 next year per the procurement spreadsheet. Oh, thank you! I'm never quite... [14:41:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1117.eqiad.wmnet with OS bullseye [14:42:03] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate all parsercache jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143110 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan) [14:42:38] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1081-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [14:43:47] bking@cumin2002 rename (PID 1527059) is awaiting input [14:43:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1118.eqiad.wmnet with reason: host reimage [14:44:28] RESOLVED: [10x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:44:32] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, neat!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [14:44:59] (03CR) 10Scott French: [C:03+1] mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [14:47:07] (03PS1) 10Volans: CHANGELOG: add changelogs for release v10.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1143112 [14:47:50] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1119 [14:48:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1119 to cirrussearch1119 [14:50:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1143098 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [14:51:08] (03Abandoned) 10Bking: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:54:35] (03CR) 10Bking: [C:03+2] cirrus: disable completion indices in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1141903 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:55:01] (03CR) 10Scott French: deployment_server: Add --env to mwscript-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [14:56:29] (03PS3) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 [14:57:04] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [14:57:33] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1119.eqiad.wmnet with OS bullseye [14:58:06] FIRING: CirrusSearchThreadPoolRejectionsTooHigh: 
elastic1081-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [14:58:16] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1120 to cirrussearch1120 [14:59:15] (03CR) 10CI reject: [V:04-1] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [14:59:27] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1081* for thread pool rejections - bking@cumin2002 [14:59:31] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1081* for thread pool rejections - bking@cumin2002 [14:59:58] FIRING: [3x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1058:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:00:33] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:02:33] !log sukhe@dns1004 START - running authdns-update [15:02:39] FIRING: [3x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1060-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [15:03:25] PROBLEM - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1118 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [15:04:01] !log pool ms-fe1015 ms-fe1016 new frontends T388886 T391354 [15:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:04] T388886: Q4:rack/setup/install ms-fe101[56] - 
https://phabricator.wikimedia.org/T388886 [15:04:05] T391354: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354 [15:04:16] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1060*,elastic1081* for thread pool rejections - bking@cumin2002 [15:04:19] !log sukhe@dns1004 END - running authdns-update [15:04:19] RECOVERY - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1118 is OK: SSL OK - Certificate cirrussearch1118.eqiad.wmnet valid until 2025-06-04 14:58:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:04:21] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1060*,elastic1081* for thread pool rejections - bking@cumin2002 [15:04:30] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1015.eqiad.wmnet [15:04:30] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1015.eqiad.wmnet [15:04:31] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1015.eqiad.wmnet [15:04:31] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1015.eqiad.wmnet [15:04:39] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1016.eqiad.wmnet [15:04:47] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1016.eqiad.wmnet [15:04:55] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1016.eqiad.wmnet [15:04:58] FIRING: [27x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:05:03] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1016.eqiad.wmnet [15:05:49] (03CR) 10Hnowlan: 
[C:03+1] pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos) [15:06:04] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10800814 (10MatthewVernon) [15:06:11] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v10.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1143112 (owner: 10Volans) [15:06:14] !log sudo cumin -b1 -s10 'A:dnsbox' 'sudo -u authdns git -C /srv/authdns/git maintenance run' T393602 [15:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:17] T393602: Improving the time it takes to run authdns-update - https://phabricator.wikimedia.org/T393602 [15:06:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10800816 (10Stevemunene) Hosts are in a decommissioned state with no under replocated blocks {F59748220} {F59748234} Pro... 
[15:06:24] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1120 to cirrussearch1120 - bking@cumin2002" [15:06:40] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1120 to cirrussearch1120 - bking@cumin2002" [15:06:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:41] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1120 on all recursors [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:44] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1120 on all recursors [15:06:45] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1120 [15:07:25] (03CR) 10Volans: [C:03+2] elasticsearch: do not fail on Python 3.10+ [cookbooks] - 10https://gerrit.wikimedia.org/r/1143098 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [15:08:09] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1060*,elastic1081*,elastic1083* for thread pool rejections - bking@cumin2002 [15:08:13] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1060*,elastic1081*,elastic1083* for thread pool rejections - bking@cumin2002 [15:08:33] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1120 [15:09:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1120 to cirrussearch1120 [15:09:28] (03PS1) 10MVernon: hiera: remove 
ms-be1060 from swift storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1143118 (https://phabricator.wikimedia.org/T391354) [15:09:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1119.eqiad.wmnet with reason: host reimage [15:09:53] !log sukhe@dns1004 START - running authdns-update [15:09:58] FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:10:04] !log timing authdns-update for T393602 [15:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:13] FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:10:55] !log sukhe@dns1004 END - running authdns-update [15:12:31] bking@cumin2002 reimage (PID 1574550) is awaiting input [15:13:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1120.eqiad.wmnet with OS bullseye [15:14:01] (03CR) 10Jgiannelos: [C:03+2] pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos) [15:14:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1119.eqiad.wmnet with reason: host reimage [15:14:58] RESOLVED: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:15:37] (03Merged) 10jenkins-bot: pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos) [15:17:10] bking@cumin2002 rename (PID 1578957) is awaiting input [15:17:32] (03Merged) 
10jenkins-bot: CHANGELOG: add changelogs for release v10.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1143112 (owner: 10Volans) [15:17:32] (03Merged) 10jenkins-bot: elasticsearch: do not fail on Python 3.10+ [cookbooks] - 10https://gerrit.wikimedia.org/r/1143098 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans) [15:17:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10800851 (10Stevemunene) ` stevemunene@an-worker1156:~$ sudo disable-puppet "T390170 - hard drive replacement in progres... [15:18:56] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800853 (10Pppery) > Per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups, ops is for SRE staff only. FYI this isn't quite true - there have, at various times, been volunteers with `op... [15:20:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1118.eqiad.wmnet with OS bullseye [15:20:32] (03CR) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [15:21:14] (03CR) 10Eevans: [C:03+1] hiera: remove ms-be1060 from swift storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1143118 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [15:22:10] (03CR) 10Herron: [C:03+1] profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [15:24:09] RESOLVED: [3x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1060-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - 
https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [15:24:34] (03CR) 10MVernon: [C:03+2] hiera: remove ms-be1060 from swift storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1143118 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [15:26:12] !log mvernon@cumin1002 START - Cookbook sre.hosts.decommission for hosts ms-be1060.eqiad.wmnet [15:27:04] (03CR) 10Filippo Giunchedi: [C:03+2] sre: alert on Prometheus codfw/eqiad down [alerts] - 10https://gerrit.wikimedia.org/r/1142543 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [15:27:09] (03CR) 10Filippo Giunchedi: [C:03+2] sre: alert on webrequest-sampled not processed [alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [15:27:22] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] sre: alert on webrequest-sampled not processed [alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi) [15:28:43] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1121 to cirrussearch1121 [15:28:51] jouncebot: nowandnext [15:28:51] No deployments scheduled for the next 1 hour(s) and 31 minute(s) [15:28:51] In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700) [15:29:00] 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10800873 (10cmassaro) deploy1003.eqiad.wmnet is the one! I was able to log in there, but I've switched computers and now need access with my new SSH key. 
[15:29:13] (03PS1) 10CDanis: move geoip to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/1143123 [15:29:19] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis) [15:29:26] (03PS4) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 [15:29:30] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:29:37] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [15:29:42] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:29:45] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:29:55] PROBLEM - Host db1247 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:30:00] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:30:01] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: remove minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142553 (owner: 10Filippo Giunchedi) [15:30:04] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:30:08] !incidents [15:30:08] 6096 (UNACKED) Host db1247 (paged) - PING - Packet loss = 100% [15:30:08] 6095 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [15:30:12] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [15:30:13] !ack 6096 [15:30:14] 6096 (ACKED) Host db1247 (paged) - PING - Packet loss = 100% [15:30:14] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [15:30:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609 (10MatthewVernon) 03NEW [15:30:19] !log jgiannelos@deploy1003 helmfile 
[staging] START helmfile.d/services/changeprop: apply [15:30:23] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:30:30] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:30:42] 1247 really wants to be like 1246? [15:30:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10800891 (10MatthewVernon) @RobH Decom task is T393609. [15:31:01] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:31:13] (03CR) 10Herron: "thanks for the help!" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron) [15:31:15] cdanis: O [15:31:19] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:31:21] I'm around with hands if you need [15:31:36] swfrench-wmf: it's a s4 replica, so I think we just need to depool [15:31:43] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:31:58] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1120.eqiad.wmnet with reason: host reimage [15:32:00] cdanis: SGTM [15:32:13] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:32:13] RECOVERY - Host db1247 #page is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [15:32:28] !log cdanis@cumin1002 dbctl commit (dc=all): 'depool db1247', diff saved to https://phabricator.wikimedia.org/P75876 and previous config saved to /var/cache/conftool/dbconfig/20250507-153228-cdanis.json [15:32:32] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10800909 (10MatthewVernon) [15:33:41] PROBLEM - mysqld processes #page on db1247 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:33:42] PROBLEM - MariaDB Replica IO: s4 #page on db1247 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:33:48] PROBLEM - MariaDB read only s4 on db1247 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:33:58] !incidents [15:33:58] 6096 (ACKED) Host db1247 (paged) - PING - Packet loss = 100% [15:33:58] 6097 (UNACKED) db1247 (paged)/mysqld processes (paged) [15:33:59] 6098 (UNACKED) db1247 (paged)/MariaDB Replica IO: s4 (paged) [15:33:59] 6095 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [15:34:04] (03PS1) 10Ladsgroup: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) [15:34:06] !ack 6097 [15:34:07] 6097 (ACKED) db1247 (paged)/mysqld processes (paged) [15:34:07] PROBLEM - MariaDB Replica SQL: s4 #page on db1247 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:34:09] !ack 6098 [15:34:10] 6098 (ACKED) db1247 (paged)/MariaDB Replica IO: s4 (paged) [15:34:14] thanks swfrench-wmf [15:34:15] (03PS1) 10Ladsgroup: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143125 (https://phabricator.wikimedia.org/T393513) [15:34:20] !incidents [15:34:20] 6096 (ACKED) Host db1247 (paged) - PING - Packet loss = 100% [15:34:20] 6097 (ACKED) db1247 (paged)/mysqld processes (paged) [15:34:20] 6098 (ACKED) db1247 (paged)/MariaDB Replica IO: s4 (paged) [15:34:21] 6099 (UNACKED) db1247 (paged)/MariaDB Replica SQL: s4 (paged) [15:34:21] 6095 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [15:34:23] jouncebot: nowandnext [15:34:23] No deployments scheduled for the next 1 hour(s) 
and 25 minute(s) [15:34:23] In 1 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700) [15:34:28] !ack 6099 [15:34:29] 6099 (ACKED) db1247 (paged)/MariaDB Replica SQL: s4 (paged) [15:34:44] (03CR) 10Ladsgroup: [C:03+2] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:34:48] (03CR) 10Ladsgroup: [C:03+2] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143125 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:35:04] cdanis: do you have a task at which to point a silence, or shall I open one? [15:35:17] swfrench-wmf: please go ahead, I hadn't created one yet [15:36:16] !log mvernon@cumin1002 START - Cookbook sre.dns.netbox [15:36:20] PROBLEM - Host cirrussearch1119 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:28] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1121 to cirrussearch1121 - bking@cumin2002" [15:36:28] (03CR) 10JHathaway: "I needed to revert the last version of this patch, 1141952, because I failed to test on bullseye and earlier. 
This patch includes bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1120.eqiad.wmnet with reason: host reimage [15:37:18] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1121 to cirrussearch1121 - bking@cumin2002" [15:37:19] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:37:19] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1121 on all recursors [15:37:22] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1121 on all recursors [15:37:23] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1121 [15:37:28] RECOVERY - Host cirrussearch1119 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:38:49] !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:50] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ms-be1060.eqiad.wmnet [15:38:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10800960 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: `ms-be1060.eqiad.wmnet` -... 
[15:39:00] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1121 [15:39:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1121 to cirrussearch1121 [15:39:55] cdanis: T393612 for the restart. I'm going to put a downtime in place long enough for the DBAs to check things out and give it a clean bill of health. [15:39:55] T393612: db1247 crash - 15:29 on 2025-05-07 - https://phabricator.wikimedia.org/T393612 [15:40:02] thanks! [15:40:45] PROBLEM - MariaDB Replica Lag: s4 #page on db1247 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:40:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10800971 (10MatthewVernon) @RobH I think the above cookbook failure is expected given this host is too broken to boot reliably, but... [15:40:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1121.eqiad.wmnet with OS bullseye [15:40:56] !incidents [15:40:57] 6096 (ACKED) Host db1247 (paged) - PING - Packet loss = 100% [15:40:57] 6097 (ACKED) db1247 (paged)/mysqld processes (paged) [15:40:57] 6098 (ACKED) db1247 (paged)/MariaDB Replica IO: s4 (paged) [15:40:58] 6099 (ACKED) db1247 (paged)/MariaDB Replica SQL: s4 (paged) [15:40:58] 6100 (UNACKED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [15:40:58] PROBLEM - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1120 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [15:41:03] !ack 6100 [15:41:04] 6100 (ACKED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [15:42:09] ... waiting on the downtime ... 
[15:42:59] !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1247.eqiad.wmnet with reason: Host has crashed - T393612 [15:43:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1119.eqiad.wmnet with OS bullseye [15:43:30] PROBLEM - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1120 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [15:44:30] RECOVERY - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1120 is OK: SSL OK - Certificate cirrussearch1120.eqiad.wmnet valid until 2025-06-04 15:38:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [15:44:55] (03CR) 10CI reject: [V:04-1] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:46:27] (03Merged) 10jenkins-bot: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143125 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:47:53] (03CR) 10Jdlrobson: [C:04-1] "Per Nova Linguae" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [15:49:21] !log zabe@mwmaint1002:~$ mwscript findBadBlobs.php enwiki --revisions 276146284,819689534,1289169661 --mark "T393237" [15:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:24] T393237: Consistent error loading a specific enwiki page: Fatal exception of type "MediaWiki\Revision\RevisionAccessException" - https://phabricator.wikimedia.org/T393237 [15:53:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1121.eqiad.wmnet with reason: host reimage [15:53:09] 06SRE, 10SRE-Access-Requests: Requesting access to shell for 
Justman10000 - https://phabricator.wikimedia.org/T393587#10801047 (10Justman10000) >>! In T393587#10800527, @Aklapper wrote: >> Is that why one doesn't give someone a chance? > A chance to do what exactly? Reviewing code changes? You can... [15:53:40] !log uploaded a python-pynetbox 7.4.1-1~wmf12u1 to bookworm-wikimedia (needed for Cumin update) T389380 [15:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:43] T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380 [15:54:02] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801060 (10Justman10000) >>! In T393587#10800527, @Aklapper wrote: > Besides that, I do not know what makes you think that you need `ops` as you have not yet answered that questi... [15:54:17] (03PS1) 10Hnowlan: rest-gateway: route reading lists API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) [15:55:19] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801077 (10Justman10000) >>! In T393587#10800853, @Pppery wrote: > But agreed with Aklapper that Justman10000 is nowhere near qualified for it (or even the lesser `deployment` g... 
[15:55:30] (03CR) 10Ladsgroup: [C:03+2] "again" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:55:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [15:58:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1121.eqiad.wmnet with reason: host reimage [15:59:23] (03CR) 10Kamila Součková: [C:03+2] mw-cron/CampaignEvents: Migrate aggregateanswers-{meta,office}wiki [puppet] - 10https://gerrit.wikimedia.org/r/1143058 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [15:59:27] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10801093 (10cmooney) [15:59:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1120.eqiad.wmnet with OS bullseye [16:00:11] (03PS2) 10RLazarus: deployment_server: Add --env to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) [16:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:48] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10801099 (10RobH) >>! In T392796#10800971, @MatthewVernon wrote: > @RobH I think the above cookbook failure is expected given this h... 
[16:23:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1121.eqiad.wmnet with OS bullseye [16:24:55] (03PS4) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [16:25:03] (03CR) 10Bvibber: Charts phase 1 deployment (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [16:25:50] (03CR) 10CDanis: "pcc lgtm https://puppet-compiler.wmflabs.org/output/1143123/6243/" [puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis) [16:26:13] (03CR) 10Hnowlan: [C:03+1] Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [16:28:38] (03CR) 10Scott French: [C:03+1] deployment_server: Add --env to mwscript-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [16:29:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10801288 (10Jclark-ctr) Opened the server and verified it was connected. I reseated all the drives while it was turned off, and a bunch of drives showed up as failed: the entire top row of the backplane. Reseated drive... [16:30:05] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:30:27] (03CR) 10Fabfur: [C:03+1] "Absolutely +1 and thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis) [16:31:01] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:31:09] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [16:31:23] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:36:22] !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] (duration: 29m 10s) [16:36:25] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [16:36:54] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] [16:36:56] (03CR) 10Aleksandar Mastilovic: "File structure changed in the mean time - I did my best to track what went where and delete accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [16:38:09] (03PS1) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) [16:38:26] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801332 (10Aklapper) Welcome to the concept of code review. [16:40:35] (03CR) 10CI reject: [V:04-1] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans) [16:40:57] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801335 (10Justman10000) >>! 
In T393587#10801332, @Aklapper wrote: > Welcome to the concept of code review. Exactly! And for me, it's about being able to commit directly... [16:41:25] (03PS1) 10Aleksandar Mastilovic: Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) [16:42:44] (03CR) 10Aleksandar Mastilovic: "MR to set the WM Enterprise downloader to "absent": https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143134" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [16:43:01] !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] (duration: 06m 07s) [16:43:04] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." 
- https://phabricator.wikimedia.org/T393513 [16:45:20] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1121 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 66, number_of_data_nodes: 66, discovered_master: True, active_primary_shards: 1481, active_shards: 4444, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 31, delayed_unassigned_shards: 0, number_of_pendin [16:45:20] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.3072625698324 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:45:38] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch1121 is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 33, number_of_data_nodes: 33, discovered_master: True, active_primary_shards: 1680, active_shards: 5035, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of [16:45:38] _tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92061917047033 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:45:50] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555#10801366 (10Stevemunene) This seems to have been resolved on 2nd May 2025, apologies for the delay {F59749689} ` stevemunene@analytic... 
[16:49:32] (03PS2) 10JHathaway: postfix: add support for cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) [16:49:58] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [16:50:17] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#10801387 (10Dwisehaupt) It appears that this just happened again today starting ~1444 UTC. The check logs on our hosts show checks being run succe... [16:50:34] (03CR) 10Bking: "I'm doing my due diligence with Puppet catalog lookups, but also adding Reuven who has more experience with Envoy." [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson) [16:52:11] could we (fr-tech) bother someone to do a `systemctl restart nsca` on the active alert host? we are seeing all of our service alerts as coming in AWOL when they are online. Some history in this phab I just updated: https://phabricator.wikimedia.org/T196336#10801387 [16:53:03] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801408 (10Aklapper) I guess we don't let random folks push random commits without review to potentially bring down Wikimedia websites. I hope that does not come as a surprise. [16:53:11] we may also need to clear out the mail queues for the backlog of spurious mails from this destined for fr-tech@ and fr-tech-ops@ since we are at least 90 mins behind on the queue for these and we don't need to keep the mail bomb around. 
[16:54:39] (03CR) 10JHathaway: [C:03+2] postfix: add support for cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway) [16:55:07] (03PS2) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) [16:56:05] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1122 to cirrussearch1122 [16:56:12] (03CR) 10CI reject: [V:04-1] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans) [16:56:22] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1123 to cirrussearch1123 [16:56:26] swfrench-wmf: cdanis: not sure if i should ping you all as SRE on call for this ^^ but doing so. let me know if it's incorrect and i should do something else. [16:57:43] (03PS3) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) [16:58:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:58:41] !log per dwisehaupt T196336 💙cdanis@alert1002.wikimedia.org ~ 🕐☕ sudo systemctl restart nsca.service [16:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:47] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [16:59:15] dwisehaupt: cdanis: wonder if this might be related to the `FIRING: IcingaOverload: Checks are taking long to execute on alert1002:9245` in -observability? [16:59:21] cdanis: thanks for doing that! [16:59:30] swfrench-wmf: my suspicions are the same [16:59:30] thanks. hopefully that will help like in the past. [16:59:52] we have a plan to migrate from icinga, just got sidelined on other major projects for the last 6+ months. 
[17:00:03] (03PS4) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) [17:00:05] swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700). [17:00:10] I do think they're related. [17:00:31] denisse: https://grafana.wikimedia.org/goto/cLMd-UbNR?orgId=1 something has been adding a *lot* of new icinga checks [17:01:47] (that's a zoomed view of one of the mini timeseries on https://grafana.wikimedia.org/d/rsCfQfuZz/icinga?orgId=1 ) [17:02:17] (03CR) 10CI reject: [V:04-1] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans) [17:03:56] bking@cumin2002 rename (PID 1687172) is awaiting input [17:03:57] is that just wonky histogram buckets in the 'Check Latency' panel, or did something odd happen around 15:12? [17:04:09] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:04:45] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1122 to cirrussearch1122 - bking@cumin2002" [17:04:57] swfrench-wmf: today luca did make some changes to the perc/broadcom raid checks and there was some issue so it's possible that some checks were added before others were removed, but the net result in the end should be zero [17:05:53] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10801492 (10Papaul) @Jclark-ctr thank you for looking at this. I will rebuild it and re-image. 
[17:05:54] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10801494 (10Stevemunene) Theres an issue with `/var/lib/hadoop/data/k/hdfs` which seems to be inaccessible and probably related to... [17:06:01] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1122 to cirrussearch1122 - bking@cumin2002" [17:06:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:06:02] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1122 on all recursors [17:06:04] (03PS1) 10Volans: Upstream release v10.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1143136 [17:06:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1122 on all recursors [17:06:06] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1122 [17:07:20] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1122 [17:07:23] (03CR) 10Scott French: [C:03+2] hieradata: remove icu67 override on deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139910 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French) [17:08:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1122 to cirrussearch1122 [17:08:31] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1123 to cirrussearch1123 - bking@cumin2002" [17:08:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1123 to cirrussearch1123 - 
bking@cumin2002" [17:08:37] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:08:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1123 on all recursors [17:08:41] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1123 on all recursors [17:08:41] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1123 [17:08:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1122.eqiad.wmnet with OS bullseye [17:08:54] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1123 [17:09:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1123 to cirrussearch1123 [17:09:36] thanks, volans [17:09:40] 06SRE, 06serviceops-radar, 06SRE Observability, 10wikitech.wikimedia.org: Move meta monitoring off of wikitech-static - https://phabricator.wikimedia.org/T393625 (10andrea.denisse) 03NEW [17:11:21] (03CR) 10Volans: [C:03+2] Upstream release v10.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1143136 (owner: 10Volans) [17:11:21] just confirmed that 15:12 does not appear to correlate with any puppet run on alert1002. 
last run prior was just before 15:00 (cleaned up elastic1119) [17:11:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:12:21] (03CR) 10Scott French: [C:03+2] hieradata: switch deployment hosts to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1139914 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French) [17:12:21] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10801541 (10andrea.denisse) Hi @RobH @Andrew , we have Meta Monitoring enabled in the Wikitech static Rackspace host. Could you please provide the o... [17:13:15] dwisehaupt: delete all mail from /fr-tech.bnc.*@wikimedia.org/? [17:13:35] (03PS1) 10Sbisson: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) [17:13:49] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10801548 (10RobH) So I actually have no login rights (and don't need them) for the new AWS hosted wikitech static deployment. I just pay the AWS bi... 
[17:13:50] !log disable-puppet "In-place update to PHP 8.1 - T392938" on deploy1003 and deploy2002 [17:13:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_magru [17:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:54] T392938: Remove PHP 7.4 from deployment hosts - https://phabricator.wikimedia.org/T392938 [17:14:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:14:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:15:32] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host apus-fe1003 [17:15:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1123.eqiad.wmnet with OS bullseye [17:16:21] jhathaway: the mail is coming from nagios@alert1002.wikimedia.org to fr-tech@wikimedia.org and to fr-tech-ops@wikimedia.org [17:16:27] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-fe1003 [17:17:11] swfrench-wmf: I don't know if it's related or not but my scap basically gets stuck on building images (twice so far) [17:17:27] i'm starting to see recoveries and OK status in the icinga UI for the services. 
[17:17:32] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1124 to cirrussearch1124 [17:17:36] going for the third time [17:17:47] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] [17:17:50] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [17:17:57] nod thanks [17:18:33] Amir1: oh, sorry - are you running a backport during the infra window? [17:18:47] swfrench-wmf: it was broken before that [17:18:50] also, no - it should not be related to the PHP update I'm doing [17:19:09] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:19:13] or rather, which thing are you asking if it's related to :) [17:19:25] and where is your scap run getting stuck? 
[17:19:49] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:20:16] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1125 to cirrussearch1125 [17:20:46] swfrench-wmf: right now it's stuck on this: [17:20:49] https://www.irccloud.com/pastebin/Uh1rYKQd/ [17:21:00] but before that, the same thing [17:21:30] 06SRE, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626 (10SKivlehan-WMF) 03NEW [17:21:51] (03Merged) 10jenkins-bot: Upstream release v10.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1143136 (owner: 10Volans) [17:23:07] Amir1: looking at your `scap-image-build-and-push-log`, these are both incremental builds ... that's puzzling [17:23:23] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1124 to cirrussearch1124 - bking@cumin2002" [17:23:29] i.e., I would not expect the push (what's currently pending) to take very long [17:23:40] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1124 to cirrussearch1124 - bking@cumin2002" [17:23:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:41] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1124 on all recursors [17:23:43] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:23:44] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1124 on all recursors [17:23:45] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1124 [17:23:46] the first one took twenty minutes until I gave up [17:24:09] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1124 [17:24:32] !log brett@cumin2002 
START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_magru [17:24:37] (03CR) 10CI reject: [V:04-1] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:24:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1124 to cirrussearch1124 [17:26:42] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1122.eqiad.wmnet with reason: host reimage [17:27:53] swfrench-wmf: it's moving forward now, after 9 minutes. I'll take it for now, but this seems really broken [17:28:33] Amir1: good to hear it's moving. I'm working my way through logs to try to sort out what was happening. [17:28:59] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [17:29:10] (03CR) 10Sbisson: "recheck" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:29:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1125 to cirrussearch1125 - bking@cumin2002" [17:29:46] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1125 to cirrussearch1125 - bking@cumin2002" [17:29:47] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:47] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1125 on all recursors [17:29:50] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1125 on all recursors [17:29:51] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1125 [17:30:02] (03CR) 10Brouberol: [C:03+1] Pass
krb2002 to Kerberos clients again [puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [17:30:05] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1123.eqiad.wmnet with reason: host reimage [17:31:15] (03CR) 10Brouberol: Set the remaining Enterprise WM Downloader job to absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [17:31:37] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:44] (03PS1) 10Sbisson: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) [17:32:54] bking@cumin2002 rename (PID 1712372) is awaiting input [17:33:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:34:09] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:34:14] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:34:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cirrussearch1122.eqiad.wmnet with reason: host reimage [17:35:25] !log deploy1003 and deploy2002 updated to PHP 8.1 - T392938 [17:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:28] T392938: Remove PHP 7.4 from deployment hosts - https://phabricator.wikimedia.org/T392938 [17:37:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1123.eqiad.wmnet with reason: host reimage [17:37:42] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:37:45] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [17:38:24] PROBLEM - Elasticsearch HTTPS for production-search-psi-eqiad on cirrussearch1123 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [17:39:09] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:40:17] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:43:32] (03CR) 10CI reject: [V:04-1] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:43:35] (03PS1) 10Andrew 
Bogott: Initial checkin of files for openstack version 'Epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143144 (https://phabricator.wikimedia.org/T390914) [17:43:36] (03PS1) 10Andrew Bogott: Openstack Magnum: remove local hacks for log size limits [puppet] - 10https://gerrit.wikimedia.org/r/1143145 (https://phabricator.wikimedia.org/T390914) [17:43:44] RECOVERY - Disk space on analytics1073 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops [17:44:24] cdanis: looks like some alerts will clear and then flap back to awol. not sure what's needed from here. [17:45:24] RECOVERY - Disk space on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [17:45:25] (03CR) 10Aleksandar Mastilovic: Set the remaining Enterprise WM Downloader job to absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [17:45:34] (03CR) 10Ryan Kemper: [C:03+2] Fix wdqs-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1142796 (owner: 10Muehlenhoff) [17:46:04] (03PS2) 10Aleksandar Mastilovic: Adding suggested edits [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) [17:46:32] PROBLEM - Elasticsearch HTTPS for production-search-psi-eqiad on cirrussearch1123 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [17:47:04] (03CR) 10Sbisson: "recheck" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:47:34] RECOVERY - Elasticsearch HTTPS for production-search-psi-eqiad on 
cirrussearch1123 is OK: SSL OK - Certificate cirrussearch1123.eqiad.wmnet valid until 2025-06-04 17:41:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [17:51:37] (03PS2) 10Volans: DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 [17:52:37] vriley@cumin1002 provision (PID 3777157) is awaiting input [17:52:56] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:53:48] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] (duration: 36m 00s) [17:53:51] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [17:55:19] 10SRE-tools, 10Spicerack: Cookbook downtiming does not work, continues anyway - https://phabricator.wikimedia.org/T393630 (10BCornwall) 03NEW [17:55:40] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:55:48] RECOVERY - Disk space on analytics1072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [17:57:13] (03CR) 10Xcollazo: "Looks like the commit message needs fixing. Otherwise patchset 2 LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [17:57:59] (03CR) 10Sbisson: "recheck" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [17:58:06] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:24] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:45] Amir1: so, to follow up, all I can really tell from this point is that the 4GiB tmpfs staging area was being exhausted, repeatedly [17:58:45] that runs counter to what I said about these being incremental builds, *but* I now realize those were incremental relative to an image built earlier in your attempts ... 
[17:58:45] meaning, those might have in fact been "full" mediawiki-layer pushes - alas, since the run was interrupted, and the logs in your home directory overwritten, I can't say for certain [17:59:09] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:59:27] (03PS5) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [17:59:30] thanks. Even this one took 36 minutes [17:59:58] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [18:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1800) [18:00:14] (03CR) 10CI reject: [V:04-1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [18:01:33] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1125 [18:02:03] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1122.eqiad.wmnet with OS bullseye [18:02:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1125 to cirrussearch1125 [18:02:16] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch1123 is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1680, active_shards: 5035, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of [18:02:16] _tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92061917047033 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:02:50] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch1122 is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1680, active_shards: 5035, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of [18:02:50] _tasks: 0, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92061917047033 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:03:03] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801784 (10Justman10000) >>! In T393587#10801408, @Aklapper wrote: > I guess we don't let random folks push random commits without review to potentially bring down Wikimedia webs... [18:03:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1123.eqiad.wmnet with OS bullseye [18:05:33] jmm@cumin2002 drain-node (PID 1484195) is awaiting input [18:06:04] swfrench-wmf: https://logstash.wikimedia.org/goto/e66c16bb6b4a4511d9890acabeebb1ee for old build logs [18:06:55] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1124.eqiad.wmnet with OS bullseye [18:06:56] I started with the official scap logstash dashboard https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 and added the `labels.channel: scap.k8s.build` filter.
[18:06:59] dwisehaupt: I think from here you'll need to talk to the o11y sre team, sorry :( I would suspect that icinga is backlogged enough it can't process the nsca notifications [18:07:32] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 68, number_of_data_nodes: 68, discovered_master: True, active_primary_shards: 1481, active_shards: 4444, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 31, delayed_unassigned_shards: 0, number_of_pendin [18:07:32] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.3072625698324 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:07:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1125.eqiad.wmnet with OS bullseye [18:08:00] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 68, number_of_data_nodes: 68, discovered_master: True, active_primary_shards: 1481, active_shards: 4444, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 31, delayed_unassigned_shards: 0, number_of_pendin [18:08:00] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.3072625698324 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:08:17] dancy: indeed, thanks! so, the problem is that the ones I'm interested in were aborted, so the logs that report the full (paginated) output from the build are missing [18:08:40] like the one at 16:08 [18:08:48] i.e., you only get the `Running sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py` logs [18:09:06] cdanis: oh cool. thanks!
i'll find their channel since i'm not in it now. [18:09:48] dwisehaupt: #wikimedia-observability [18:09:52] thanks! [18:10:17] (03PS1) 10Ssingh: hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) [18:10:38] swfrench-wmf: Aww, bummer. [18:10:51] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10801829 (10Andrew) 05Open→03Resolved a:03Andrew Yes! the other three were repurposed in https://phabricator.wikimedia.org/T392539 [18:10:58] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5485/co" [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh) [18:11:16] Anyway, it's train window time and I'm taking over for Jeena today. [18:12:42] I'm going to run `scap build-images` to see what state things are in first. [18:12:49] (03PS6) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [18:12:52] (03PS5) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) [18:12:56] !log dancy@deploy1003 Started scap build-images: (no justification provided) [18:13:27] !log dancy@deploy1003 Finished scap build-images: (no justification provided) (duration: 00m 30s) [18:13:37] dancy: sounds good - I _think_ you should be in a good state, as A.mir1's run should have cleared the "apparently latent" large layer pushes [18:13:46] Yep. Fast run.
[18:13:57] (03PS2) 10Andrew Bogott: Initial checkin of files for openstack version 'Epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143144 (https://phabricator.wikimedia.org/T390914) [18:13:57] (03PS2) 10Andrew Bogott: Openstack Magnum: remove local hacks for log size limits [puppet] - 10https://gerrit.wikimedia.org/r/1143145 (https://phabricator.wikimedia.org/T390914) [18:13:58] (03PS1) 10Andrew Bogott: Openstack codw1dev: upgrade to release 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143154 (https://phabricator.wikimedia.org/T390914) [18:14:56] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm [18:15:01] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143155 (https://phabricator.wikimedia.org/T386223) [18:15:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10801844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm [18:15:03] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143155 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [18:15:48] (03CR) 10Andrew Bogott: [C:03+2] Initial checkin of files for openstack version 'Epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143144 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [18:15:52] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143155 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot) [18:15:55] (03CR) 10Andrew Bogott: [C:03+2] Openstack Magnum: remove local hacks for log size limits [puppet] - 10https://gerrit.wikimedia.org/r/1143145 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [18:18:07] FIRING: 
MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:18:19] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1125.eqiad.wmnet with reason: host reimage [18:18:21] (03CR) 10BCornwall: [C:03+1] hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh) [18:18:45] (03PS1) 10Ebernhardson: Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599) [18:19:01] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801870 (10Aklapper) You have been told before first to write patches to "modify group permissions". You have not yet. Feel free to start contributing instead of asking for more permission... 
[18:19:06] (03CR) 10AOkoth: [C:03+2] wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [18:19:39] !log aokoth@dns1004 START - running authdns-update [18:20:00] (03CR) 10AOkoth: [C:03+2] aphlict: revert eqiad host to active [puppet] - 10https://gerrit.wikimedia.org/r/1140217 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [18:20:22] RECOVERY - Disk space on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [18:20:31] (03CR) 10Andrew Bogott: [C:03+2] Openstack codw1dev: upgrade to release 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143154 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [18:20:49] !log aokoth@dns1004 END - running authdns-update [18:21:21] !log uploaded spicerack_10.2.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [18:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1125.eqiad.wmnet with reason: host reimage [18:21:45] arnaudb: want me to merge 'Arnold Okoth: aphlict: revert eqiad host to active' ? [18:21:54] oops, I mean arnoldokoth ^ [18:22:00] arnaudb, disregard [18:22:16] Yes please. [18:22:20] ok! doing [18:22:42] Thanks! [18:24:02] (03PS2) 10Volans: DataPers. 
cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 [18:28:32] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:28:50] (03PS2) 10HMonroy: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) [18:29:08] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.28 refs T386223 [18:29:10] T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223 [18:29:36] hmmmm ... `CalicoKubeControllersDown` is both worrisome and has a minimally helpful summary [18:29:53] TODO.. hehe [18:30:25] ah, alright - `site:eqiad prometheus:k8s-dse` [18:30:32] (03CR) 10Jdlrobson: [C:04-1] "This really should use a dblist to avoid unreadable configuration code. I can help set that up if that's useful?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [18:30:35] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:31:28] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:48] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:32:12] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:33:59] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:34:50] (03CR) 10Bvibber: "Yeah that's best :D I think I can figure it out i'll poke you if I get lost :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [18:35:12] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:35:28] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:35:30] (03PS3) 10HMonroy: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) [18:35:48] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801942 (10Justman10000) >>! In T393587#10801870, @Aklapper wrote: > You have been told before first to write patches to "modify group permissions". You have not yet. Feel free t...
[18:36:14] (03CR) 10MusikAnimal: [C:03+1] Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:36:52] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:37:06] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801946 (10Justman10000) What I mean to say is, I'd rather be able to do it directly than have to hope to be faster than those who can commit directly! [18:37:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:37:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:38:04] (03Merged) 10jenkins-bot: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:38:13] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10801948 (10Eevans) >>! In T390630#10793749, @Scott_French wrote: > After a bit of thought and some back-...
[18:38:27] !log hmonroy@deploy1003 Started scap sync-world: Backport for [[gerrit:1142642|Enable Codex and Multiblocks in Hebrew wiki (T377121)]] [18:38:29] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [18:38:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1125.eqiad.wmnet with OS bullseye [18:39:07] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10801953 (10Eevans) [18:39:12] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:39:28] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:41:48] RECOVERY - Disk space on analytics1073 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops [18:41:48] RECOVERY - Disk space on analytics1072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [18:42:12] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:42:28] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:06] jclark@cumin1002 provision (PID 3784978) is awaiting input [18:43:59] (03PS3) 10Aleksandar Mastilovic: Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 
(https://phabricator.wikimedia.org/T390556) [18:44:12] alright, following up on the CalicoKubeControllersDown alert, it appears that the calico-kube-controllers pod in dse-k8s-eqiad is OOMing [18:45:04] (03CR) 10Vgutierrez: [C:03+1] hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh) [18:45:06] !log hmonroy@deploy1003 hmonroy: Backport for [[gerrit:1142642|Enable Codex and Multiblocks in Hebrew wiki (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:45:09] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [18:45:25] (03PS7) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [18:45:54] (03CR) 10Bvibber: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [18:46:35] musikanimal ready in testserver [18:47:27] okay great! give me a few minutes [18:48:32] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:48:54] hmonroy: looks good! 
[18:49:06] !log hmonroy@deploy1003 hmonroy: Continuing with sync [18:50:32] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:53:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10801995 (10Jclark-ctr) @MatthewVernon Can you update the eqiad.yaml file for this one think some things where missed it will not image in eqiad for @VRiley-WMF [18:55:48] !log hmonroy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142642|Enable Codex and Multiblocks in Hebrew wiki (T377121)]] (duration: 17m 21s) [18:55:53] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [18:56:05] musikanimal: done! [18:56:13] \o/ [18:57:16] (03PS1) 10Cory Massaro: Change red to blue: blue=bad, green=good, yellow=yyyeah... [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 [19:00:07] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802011 (10RobH) a:03RobH I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, and idrac be updated before they... [19:03:36] (03CR) 10David Martin: [C:03+1] Change red to blue: blue=bad, green=good, yellow=yyyeah... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 (owner: 10Cory Massaro) [19:06:23] (03CR) 10Xcollazo: [C:03+1] Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [19:13:08] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802052 (10RobH) I can see it seems to have randomly fired a few times: ` Mon Mar 17 2025 13:32:01 A fatal error was detected on a component at bus 4 de... [19:16:14] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83589MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [19:18:49] (03CR) 10Ebernhardson: [C:03+1] "Looks reasonable. When switching everything will be pretty cold in the new datacenter, i added https://wikitech.wikimedia.org/wiki/Search/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [19:19:39] (03CR) 10Ladsgroup: [C:03+2] "try again" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [19:21:18] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802084 (10RobH) Support request confirmed as 'after hours english support' so I had to fill out my contact details a second time and request the upload u... 
[19:22:27] (03PS1) 10Cwhite: logstash: calculate w3c generated timestamp [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886) [19:22:59] (03PS1) 10Ladsgroup: Improve circuit breaking error message [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) [19:25:44] (03PS2) 10Ebernhardson: Update plugins for extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599) [19:25:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609#10802095 (10VRiley-WMF) [19:26:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609#10802097 (10VRiley-WMF) This is completed [19:28:46] (03PS1) 10Ladsgroup: Remove hard-coded timestamps in SpecialGlobalContributionsTest [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143172 (https://phabricator.wikimedia.org/T393531) [19:28:58] (03CR) 10Ladsgroup: [C:03+2] Remove hard-coded timestamps in SpecialGlobalContributionsTest [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143172 (https://phabricator.wikimedia.org/T393531) (owner: 10Ladsgroup) [19:29:23] bking@cumin2002 reimage (PID 1765198) is awaiting input [19:30:38] (03CR) 10CI reject: [V:04-1] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [19:34:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10802125 (10VRiley-WMF) 05Open→03Resolved This unit has been decommed. 
We will ensure these disks are certainly shredded. [19:35:14] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.eqiad.wmnet with OS bookworm [19:35:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10802129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors: - apus-fe... [19:38:25] (03CR) 10Jdlrobson: [C:03+1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [19:38:36] 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#10802136 (10Dwisehaupt) The nsca restart by cdanis helped temporarily but the awol condition quickly returned. It fully cleared up after an icinga... 
[19:38:59] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:39:09] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [19:42:02] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [19:43:20] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [19:44:52] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [19:45:14] (03PS1) 10Bking: dse-k8s-eqiad: raise calico requests and remove limits. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) [19:45:35] (03Merged) 10jenkins-bot: Remove hard-coded timestamps in SpecialGlobalContributionsTest [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143172 (https://phabricator.wikimedia.org/T393531) (owner: 10Ladsgroup) [19:50:44] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [19:52:12] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [19:52:27] (03CR) 10Ladsgroup: [C:03+2] Improve circuit breaking error message [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [19:52:47] (03CR) 10Ladsgroup: [C:03+2] "hit me baby one more time" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [19:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:55:47] (03CR) 10Scott French: dse-k8s-eqiad: raise calico requests and remove limits. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: 10Bking) [19:58:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [19:58:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T2000). [20:00:05] bvibber and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] o/ [20:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:02:41] (03PS2) 10Bking: dse-k8s-eqiad: raise calico requests and remove limits. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) [20:03:00] o/ [20:03:48] I can deploy [20:03:57] just need a couple minutes [20:04:02] (03Merged) 10jenkins-bot: Improve circuit breaking error message [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup) [20:04:03] cool :) [20:04:06] (03Merged) 10jenkins-bot: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup) [20:04:33] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143124|Remove whatlinkshere hook (T393513)]], [[gerrit:1143171|Improve circuit breaking error message (T360930)]], [[gerrit:1143172|Remove hard-coded timestamps in SpecialGlobalContributionsTest (T393531)]] [20:04:39] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [20:04:39] T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930 [20:04:40] T393531: SpecialGlobalContributionsTest::testExecuteTarget with data set "Valid IP" failure in other extensions' build - https://phabricator.wikimedia.org/T393531 [20:07:46] bvibber: can yours both go out together? [20:08:13] yes [20:09:11] I can do the deploy, mine will finish soon [20:09:27] (depending on how slow these things are) [20:09:36] oh i didn't realize you were deploying! 
[20:09:48] sorry [20:09:53] aiee :) [20:10:23] I stopped mine [20:10:32] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:10:47] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:11:27] sorry, my previous deploy broke so many times [20:11:44] and this one is taking way too long bleeding to this window [20:11:53] that's okay, I should have checked the backscroll more closely [20:12:30] scap is really slow today, one deploy I had took 36 minutes :/ [20:12:47] hmm strange [20:13:10] last time I backported it took a while but I thought it was because of the localization changes [20:13:50] (03CR) 10Scott French: [C:03+1] dse-k8s-eqiad: raise calico requests and remove limits. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: 10Bking) [20:18:26] (03PS3) 10Scott French: P:mw::maintenance::refreshlinks: rename and prepare for mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530) [20:18:26] (03CR) 10Scott French: "Alright, despite the *very* long commit message, I think this is the simplest option that gets us out of the business of using the ` (03PS3) 10Scott French: P:mw::maintenance::refreshlinks: migrate s8 to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530) [20:21:02] (03CR) 10Bking: dse-k8s-eqiad: raise calico requests and remove limits. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: 10Bking) [20:21:29] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh) [20:21:55] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: raise calico requests and remove limits. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: 10Bking) [20:22:18] > 20:21:58 Finished build-and-push-container-images (duration: 16m 37s) [20:22:21] This is not normal [20:22:54] The multiversion image was a full build [20:23:06] !log depooling lvs3009 for HW maint: T393616 [20:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:08] T393616: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616 [20:23:29] Amir1: Updated 536 CDB files(s) in /srv/mediawiki-staging/php-1.44.0-wmf.28/cache/l10n is the reason for the full image build. [20:24:02] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802334 (10ssingh) >>! In T393616#10802011, @RobH wrote: > I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, a... [20:24:09] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:24:17] dancy: aaaah, that makes sense now. 
Thanks [20:24:33] (03PS1) 10Andrew Bogott: wmf_sink: remove an arg when asking keystone for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1143183 (https://phabricator.wikimedia.org/T390914) [20:25:25] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [20:26:00] !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009*} and A:liberica (T393616) [20:26:07] !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009*} and A:liberica (T393616) [20:26:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609#10802343 (10VRiley-WMF) 05Open→03Resolved [20:26:30] FIRING: LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=esams&var-instance=lvs3009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [20:26:41] hmm that's not cool [20:26:58] probably a race condition given I just ran it [20:27:02] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:27:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:28:15] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[20:31:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=esams&var-instance=lvs3009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [20:32:59] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1143124|Remove whatlinkshere hook (T393513)]], [[gerrit:1143171|Improve circuit breaking error message (T360930)]], [[gerrit:1143172|Remove hard-coded timestamps in SpecialGlobalContributionsTest (T393531)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:33:02] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [20:33:04] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." 
- https://phabricator.wikimedia.org/T393513 [20:33:05] T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930 [20:33:05] T393531: SpecialGlobalContributionsTest::testExecuteTarget with data set "Valid IP" failure in other extensions' build - https://phabricator.wikimedia.org/T393531 [20:34:05] (03CR) 10Ladsgroup: [C:03+2] Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) (owner: 10Jdlrobson) [20:34:10] (03CR) 10Ladsgroup: [C:03+2] Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) (owner: 10Bvibber) [20:34:14] (03CR) 10Ladsgroup: [C:03+2] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:34:54] (03CR) 10Ladsgroup: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:35:17] \o/ [20:35:49] bvibber: :hype: [20:36:54] can we deploy now? [20:37:15] I think it's still syncing [20:37:41] (03CR) 10Ladsgroup: [C:04-1] "I think this patch is broken. You need to add the dblist to the DB_LISTS, otherwise it doesn't work. The tests should have caught this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:38:29] on it [20:38:43] yeah, it's still syncing [20:38:45] (03PS2) 10Andrew Bogott: wmf_sink: remove an arg when asking keystone for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1143183 (https://phabricator.wikimedia.org/T390914) [20:39:09] (03PS8) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [20:39:27] (03CR) 10Bvibber: "Whoops! 
Should be fixed now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:39:43] and now we wait for the tests again :D [20:39:55] (03CR) 10CI reject: [V:04-1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:39:56] (03CR) 10Ladsgroup: "The diff CI is broken:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:40:01] (03CR) 10Andrew Bogott: [C:03+2] wmf_sink: remove an arg when asking keystone for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1143183 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [20:41:11] ...what? [20:41:49] (03CR) 10Bvibber: "I have no idea what's wrong. Is there documentation I have failed to find and follow?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:42:45] does anybody know what's wrong and how to fix it? [20:43:03] (03Merged) 10jenkins-bot: Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) (owner: 10Jdlrobson) [20:43:04] (03Merged) 10jenkins-bot: Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) (owner: 10Bvibber) [20:43:50] bvibber: run "composer manage-dblist update" [20:44:19] thx [20:44:24] is this documented somewhere? 
[20:44:44] (03PS9) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) [20:45:53] (03CR) 10Bvibber: "was told to run composer manage-dblist update, hopefully that does it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:46:15] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143124|Remove whatlinkshere hook (T393513)]], [[gerrit:1143171|Improve circuit breaking error message (T360930)]], [[gerrit:1143172|Remove hard-coded timestamps in SpecialGlobalContributionsTest (T393531)]] (duration: 41m 41s) [20:46:20] T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513 [20:46:20] T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930 [20:46:21] T393531: SpecialGlobalContributionsTest::testExecuteTarget with data set "Valid IP" failure in other extensions' build - https://phabricator.wikimedia.org/T393531 [20:46:45] (03CR) 10Ladsgroup: [C:03+2] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:46:58] (03CR) 10RLazarus: [C:03+2] mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [20:47:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:47:22] \o/ [20:47:34] (03Merged) 10jenkins-bot: Charts phase 1 deployment 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber) [20:48:01] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1142701|Charts phase 1 deployment (T393517)]], [[gerrit:1142671|Clear floats to avoid tall charts (T393286)]], [[gerrit:1142698|Clear floats to avoid tall charts (T393286)]] [20:48:06] T393517: Enable Charts for Phase 1 wikis - https://phabricator.wikimedia.org/T393517 [20:48:06] T393286: Chart without any explicit height is very high on itwiki - https://phabricator.wikimedia.org/T393286 [20:49:10] (03CR) 10CI reject: [V:04-1] mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus) [20:49:42] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1124.eqiad.wmnet with OS bullseye [20:49:56] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802443 (10ssingh) The host has been depooled so you can reboot or shut it down without checking with us. Thanks for the quick response Rob! 
[20:50:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1124.eqiad.wmnet with OS bullseye [20:52:34] (03PS1) 10Bvibber: Stub README.md for dblists/ dir to remind people to use the tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143188 [20:54:53] bvibber: it's on the mw-debug [20:55:05] excellent [20:55:17] !log ladsgroup@deploy1003 jdlrobson, bvibber, ladsgroup: Backport for [[gerrit:1142701|Charts phase 1 deployment (T393517)]], [[gerrit:1142671|Clear floats to avoid tall charts (T393286)]], [[gerrit:1142698|Clear floats to avoid tall charts (T393286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:55:22] testing [20:55:28] T393517: Enable Charts for Phase 1 wikis - https://phabricator.wikimedia.org/T393517 [20:55:29] T393286: Chart without any explicit height is very high on itwiki - https://phabricator.wikimedia.org/T393286 [20:56:30] Amir1: we're good to go <3 [20:56:43] !log ladsgroup@deploy1003 jdlrobson, bvibber, ladsgroup: Continuing with sync [20:56:48] let's go then [20:57:11] i guess it's time to *flips down sunglasses* deploy the patches [20:59:47] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_magru [21:00:06] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T2100) [21:01:25] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1124.eqiad.wmnet with reason: host reimage [21:01:25] (03PS2) 10Bvibber: Stub README.md for dblists/ dir to remind people to use the tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143188 (https://phabricator.wikimedia.org/T393648) [21:04:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1124.eqiad.wmnet with reason: host reimage [21:05:18] (03CR) 10Ladsgroup: [C:03+2] ApiQueryPublishedTranslations: Make `from` 
and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [21:05:23] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142701|Charts phase 1 deployment (T393517)]], [[gerrit:1142671|Clear floats to avoid tall charts (T393286)]], [[gerrit:1142698|Clear floats to avoid tall charts (T393286)]] (duration: 17m 21s) [21:05:26] T393517: Enable Charts for Phase 1 wikis - https://phabricator.wikimedia.org/T393517 [21:05:26] T393286: Chart without any explicit height is very high on itwiki - https://phabricator.wikimedia.org/T393286 [21:05:27] (03CR) 10Ladsgroup: [C:03+2] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [21:05:38] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_magru [21:05:42] bvibber: deployed [21:05:47] Amir1 we can do my patches tomorrow as we're running out of time. And we can only do wmf.28 at that point. [21:05:55] Amir1: woohoo! thanks [21:06:03] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_esams [21:06:06] stephanebisson: nah, It's straightforward IMO [21:06:21] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_esams [21:06:33] it can go over a bit. 
I don't think anything is happening with the next window [21:06:56] deployment windows are a social construct anyway made to sell more clocks or something like that [21:08:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [21:08:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [21:09:01] Alright, go for it. There is nothing to test directly. It disables the possibility of calling `cxpublishedtransaltion` without to/from parameters. We don't believe any API caller is doing that but it they do, it's going to trigger and API validation error instead of a slow query, and we are fine with that. [21:09:43] yeah [21:11:33] I have to run but I'll search the logs in a few hours to see if there is anything related. [21:11:59] Amir1 thanks for deploying and sorry for the situation. [21:13:17] thanks. 
No worries [21:16:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10802579 (10VRiley-WMF) thanos-fe1005 A7 U26 CableID 4888 Port 24 thanos-fe1006 B4 U8 CableID 4778 Port35 thanos-fe1007 D4 U26 CableID 20220118 [21:16:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10802581 (10VRiley-WMF) [21:18:52] (03Merged) 10jenkins-bot: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [21:18:53] (03Merged) 10jenkins-bot: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson) [21:19:20] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143138|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]], [[gerrit:1143142|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]] [21:19:24] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [21:21:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1124.eqiad.wmnet with OS bullseye [21:26:54] !log ladsgroup@deploy1003 ladsgroup, sbisson: Backport for [[gerrit:1143138|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]], [[gerrit:1143142|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:57] T392839: CX tables miss a lot of important indexes causing partial outages - 
https://phabricator.wikimedia.org/T392839 [21:27:02] !log ladsgroup@deploy1003 ladsgroup, sbisson: Continuing with sync [21:28:08] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:00] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 759, active_shards: 1784, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [21:29:00] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:29:52] (03PS2) 10Muehlenhoff: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) [21:33:33] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143138|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]], [[gerrit:1143142|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]] (duration: 14m 12s) [21:33:36] T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839 [21:44:02] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [21:47:00] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [21:49:54] (03PS1) 10Ryan Kemper: wdqs-main: allow 
query.wikidata.org to hit main [puppet] - 10https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) [21:51:10] (03PS2) 10Ryan Kemper: wdqs-main: allow query.wikidata.org to hit main [puppet] - 10https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) [21:51:45] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [21:56:27] (03CR) 10Bking: [C:03+1] wdqs-main: allow query.wikidata.org to hit main [puppet] - 10https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T2200) [22:10:22] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802714 (10Aklapper) > Why should I submit patches when others can commit directly? Provide one specific example where someone "committed directly" instead of going via a patch. One. Thanks. 
[22:11:43] (03CR) 10Btullis: [C:03+1] wdqs-main: allow query.wikidata.org to hit main [puppet] - 10https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: 10Ryan Kemper) [22:18:07] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [22:36:55] FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:44:04] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [22:46:00] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [22:58:40] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802784 (10Justman10000) Everyone? Why do a patch when one can commit directly? And even then, the same question still remains... Only that I would then have to create a patch faster than a... 
[23:09:09] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:13:46] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on elastic1062:9290 - https://phabricator.wikimedia.org/T393657 (10phaultfinder) 03NEW [23:30:37] 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802849 (10Aklapper) Could you simply answer my question and link to one specific example? [23:38:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1143204 [23:38:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1143204 (owner: 10TrainBranchBot) [23:39:00] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:51:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1143204 (owner: 10TrainBranchBot) [23:55:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [23:57:56] 06SRE, 10SRE-Access-Requests: Requesting access to shell for 
Justman10000 - https://phabricator.wikimedia.org/T393587#10802874 (10Justman10000) >>! In T393587#10802849, @Aklapper wrote: > Could you simply answer my question and link to one specific example? From my answer, I think that one does...