[00:00:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1112-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P75844 and previous config saved to /var/cache/conftool/dbconfig/20250507-000354-ladsgroup.json
[00:08:36] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1142728
[00:08:36] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1142728 (owner: TrainBranchBot)
[00:10:02] <wikibugs>	 (PS2) Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539)
[00:10:11] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:12:02] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[00:12:24] <wikibugs>	 (PS3) Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539)
[00:15:44] <wikibugs>	 (CR) Andrew Bogott: "recheck" [dns] - https://gerrit.wikimedia.org/r/1142684 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[00:16:53] <logmsgbot>	 !log andrew@dns1004 START - running authdns-update
[00:19:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T382778)', diff saved to https://phabricator.wikimedia.org/P75845 and previous config saved to /var/cache/conftool/dbconfig/20250507-001901-ladsgroup.json
[00:19:04] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[00:19:17] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:19:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T382778)', diff saved to https://phabricator.wikimedia.org/P75846 and previous config saved to /var/cache/conftool/dbconfig/20250507-001924-ladsgroup.json
[00:19:33] <logmsgbot>	 !log andrew@dns1004 END - running authdns-update
[00:21:08] <logmsgbot>	 !log hmonroy@deploy1003 hmonroy, musikanimal: Backport for [[gerrit:1142714|Revert "JavaScript: ESLint 8.57.0" (T381577)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:21:11] <stashbot>	 T381577: Highlighting of syntax errors, warnings, infos for Wikitext editor - https://phabricator.wikimedia.org/T381577
[00:21:54] <hmonroy>	 musikanimal can you take a look at testservers?
[00:22:09] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10798579 (Papaul) @VRiley-WMF After the move, the server is not booting into the OS, it is stuck at "loading initial ramdisk" when you get back on site can you please power down the server, make sur...
[00:22:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T382778)', diff saved to https://phabricator.wikimedia.org/P75847 and previous config saved to /var/cache/conftool/dbconfig/20250507-002226-ladsgroup.json
[00:26:25] <logmsgbot>	 !log hmonroy@deploy1003 hmonroy, musikanimal: Continuing with sync
[00:28:51] <wikibugs>	 (PS1) Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames [dns] - https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539)
[00:29:28] <wikibugs>	 (PS2) Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames again [dns] - https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539)
[00:29:35] <wikibugs>	 (PS3) Andrew Bogott: wikimediacloud.org: move codfw1dev rabbitmq cnames again [dns] - https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539)
[00:30:32] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wikimediacloud.org: move codfw1dev rabbitmq cnames again [dns] - https://gerrit.wikimedia.org/r/1142731 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[00:30:37] <logmsgbot>	 !log andrew@dns1004 START - running authdns-update
[00:32:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:33:17] <logmsgbot>	 !log andrew@dns1004 END - running authdns-update
[00:37:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P75848 and previous config saved to /var/cache/conftool/dbconfig/20250507-003733-ladsgroup.json
[00:39:48] <logmsgbot>	 !log hmonroy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142714|Revert "JavaScript: ESLint 8.57.0" (T381577)]] (duration: 47m 14s)
[00:39:51] <stashbot>	 T381577: Highlighting of syntax errors, warnings, infos for Wikitext editor - https://phabricator.wikimedia.org/T381577
[00:43:26] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1142728 (owner: TrainBranchBot)
[00:44:50] <wikibugs>	 (CR) Andrew Bogott: [C:+2] codfw1dev rabbit config: remove a comment that is no longer true [puppet] - https://gerrit.wikimedia.org/r/1142687 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[00:44:52] <wikibugs>	 (CR) Andrew Bogott: [C:+2] codfw1dev: remove rabbitmq from cloudcontrol nodes [puppet] - https://gerrit.wikimedia.org/r/1142688 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[00:52:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P75849 and previous config saved to /var/cache/conftool/dbconfig/20250507-005240-ladsgroup.json
[00:57:04] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev rabbitmq: remove contactgroups: wmcs-team-email from role hiera [puppet] - https://gerrit.wikimedia.org/r/1142740 (https://phabricator.wikimedia.org/T392539)
[00:59:07] <wikibugs>	 ops-eqiad, SRE, Cassandra, DC-Ops: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035#10798610 (Eevans) >>! In T307035#10797869, @Jclark-ctr wrote: > @Eevans  is this still needed? or can it be resolved?  It's still needed... but, I wonder when they're due for a refresh?  They...
[01:00:45] <wikibugs>	 (CR) Andrew Bogott: [C:+2] codfw1dev rabbitmq: remove contactgroups: wmcs-team-email from role hiera [puppet] - https://gerrit.wikimedia.org/r/1142740 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott)
[01:06:55] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:07:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T382778)', diff saved to https://phabricator.wikimedia.org/P75850 and previous config saved to /var/cache/conftool/dbconfig/20250507-010748-ladsgroup.json
[01:07:52] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[01:08:05] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2176.codfw.wmnet with reason: Maintenance
[01:08:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T382778)', diff saved to https://phabricator.wikimedia.org/P75851 and previous config saved to /var/cache/conftool/dbconfig/20250507-010811-ladsgroup.json
[01:11:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T382778)', diff saved to https://phabricator.wikimedia.org/P75852 and previous config saved to /var/cache/conftool/dbconfig/20250507-011114-ladsgroup.json
[01:13:59] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P75853 and previous config saved to /var/cache/conftool/dbconfig/20250507-012621-ladsgroup.json
[01:26:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:36:37] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:39:46] <wikibugs>	 (PS1) MusikAnimal: Hooks: disable if content model is unset AND CodeMirror beta is set [extensions/CodeEditor] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1142754 (https://phabricator.wikimedia.org/T373711)
[01:41:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P75854 and previous config saved to /var/cache/conftool/dbconfig/20250507-014128-ladsgroup.json
[01:56:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T382778)', diff saved to https://phabricator.wikimedia.org/P75855 and previous config saved to /var/cache/conftool/dbconfig/20250507-015636-ladsgroup.json
[01:56:39] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[01:56:52] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2188.codfw.wmnet with reason: Maintenance
[01:56:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T382778)', diff saved to https://phabricator.wikimedia.org/P75856 and previous config saved to /var/cache/conftool/dbconfig/20250507-015658-ladsgroup.json
[01:59:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T382778)', diff saved to https://phabricator.wikimedia.org/P75857 and previous config saved to /var/cache/conftool/dbconfig/20250507-015955-ladsgroup.json
[02:07:13] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:15:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P75858 and previous config saved to /var/cache/conftool/dbconfig/20250507-021502-ladsgroup.json
[02:30:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P75859 and previous config saved to /var/cache/conftool/dbconfig/20250507-023009-ladsgroup.json
[02:32:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/CodeEditor] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1142754 (https://phabricator.wikimedia.org/T373711) (owner: MusikAnimal)
[02:34:23] <wikibugs>	 (Merged) jenkins-bot: Hooks: disable if content model is unset AND CodeMirror beta is set [extensions/CodeEditor] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1142754 (https://phabricator.wikimedia.org/T373711) (owner: MusikAnimal)
[02:34:52] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1142754|Hooks: disable if content model is unset AND CodeMirror beta is set (T373711)]]
[02:34:55] <stashbot>	 T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711
[02:41:36] <logmsgbot>	 !log tstarling@deploy1003 tstarling, musikanimal: Backport for [[gerrit:1142754|Hooks: disable if content model is unset AND CodeMirror beta is set (T373711)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[02:41:39] <stashbot>	 T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711
[02:45:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T382778)', diff saved to https://phabricator.wikimedia.org/P75860 and previous config saved to /var/cache/conftool/dbconfig/20250507-024518-ladsgroup.json
[02:45:21] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[02:45:34] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2202.codfw.wmnet with reason: Maintenance
[02:46:32] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2212.codfw.wmnet with reason: Maintenance
[02:46:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T382778)', diff saved to https://phabricator.wikimedia.org/P75861 and previous config saved to /var/cache/conftool/dbconfig/20250507-024638-ladsgroup.json
[02:49:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T382778)', diff saved to https://phabricator.wikimedia.org/P75862 and previous config saved to /var/cache/conftool/dbconfig/20250507-024933-ladsgroup.json
[02:50:42] <wikibugs>	 (PS4) Andrea Denisse: graphite: Allow x-grafana-device-id header in CORS config [puppet] - https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439)
[02:50:42] <wikibugs>	 (CR) Andrea Denisse: "Hi team, I tested this patch by editing the respective configuration file, making a request with Curl to see if the server replied with th" [puppet] - https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: Andrea Denisse)
[02:51:35] <wikibugs>	 (CR) Andrea Denisse: graphite: Allow x-grafana-device-id header in CORS config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: Andrea Denisse)
[02:58:33] <logmsgbot>	 !log tstarling@deploy1003 tstarling, musikanimal: Continuing with sync
[03:04:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P75863 and previous config saved to /var/cache/conftool/dbconfig/20250507-030440-ladsgroup.json
[03:06:59] <logmsgbot>	 !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142754|Hooks: disable if content model is unset AND CodeMirror beta is set (T373711)]] (duration: 32m 06s)
[03:07:02] <stashbot>	 T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6 - https://phabricator.wikimedia.org/T373711
[03:19:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P75864 and previous config saved to /var/cache/conftool/dbconfig/20250507-031947-ladsgroup.json
[03:34:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T382778)', diff saved to https://phabricator.wikimedia.org/P75865 and previous config saved to /var/cache/conftool/dbconfig/20250507-033455-ladsgroup.json
[03:34:59] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[03:35:11] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:35:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T382778)', diff saved to https://phabricator.wikimedia.org/P75866 and previous config saved to /var/cache/conftool/dbconfig/20250507-033518-ladsgroup.json
[03:38:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T382778)', diff saved to https://phabricator.wikimedia.org/P75867 and previous config saved to /var/cache/conftool/dbconfig/20250507-033812-ladsgroup.json
[03:53:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P75868 and previous config saved to /var/cache/conftool/dbconfig/20250507-035319-ladsgroup.json
[04:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:08:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P75869 and previous config saved to /var/cache/conftool/dbconfig/20250507-040826-ladsgroup.json
[04:23:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T382778)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250507-042334-ladsgroup.json
[04:23:45] <stashbot>	 T382778: Optimize text table - https://phabricator.wikimedia.org/T382778
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:06:55] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:16:55] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:26:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:28:03] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Extend package list to be installed from component/puppet7 on trixie [puppet] - https://gerrit.wikimedia.org/r/1140716 (https://phabricator.wikimedia.org/T392790) (owner: Muehlenhoff)
[05:29:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[05:34:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[05:40:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[05:41:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[05:41:54] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/spicerack] - https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: Volans)
[05:41:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1032:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:44:31] <wikibugs>	 (CR) Ayounsi: [C:+2] esams: remove Tele2 transit [homer/public] - https://gerrit.wikimedia.org/r/1142539 (https://phabricator.wikimedia.org/T393401) (owner: Ayounsi)
[05:45:03] <wikibugs>	 (Merged) jenkins-bot: esams: remove Tele2 transit [homer/public] - https://gerrit.wikimedia.org/r/1142539 (https://phabricator.wikimedia.org/T393401) (owner: Ayounsi)
[05:46:07] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, observability: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366#10798820 (MoritzMuehlenhoff) The cause of the regression is now identified; the backport to 6.1. missed a dependent patch: https...
[05:48:52] <XioNoX>	 !log decom Tele2 transit in esams - T393401
[05:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[05:57:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet
[05:58:46] <wikibugs>	 (PS1) Ayounsi: Remove Tele2 and Novacore from check_bgp [puppet] - https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T0600)
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:01:55] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet
[06:04:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet
[06:06:16] <wikibugs>	 (PS2) Ayounsi: Remove Tele2, Fiberring and Novacore from check_bgp [puppet] - https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401)
[06:06:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1033:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:06:59] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good. Nice idea to rely on the manufacturer fact to select storcli." [puppet] - https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: Elukey)
[06:08:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet
[06:13:47] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1021457) is awaiting input
[06:18:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet
[06:19:11] <wikibugs>	 (PS1) RLazarus: mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925)
[06:19:19] <wikibugs>	 (PS1) RLazarus: deployment_server: Add --env to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925)
[06:20:46] <wikibugs>	 (PS2) RLazarus: mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925)
[06:20:53] <wikibugs>	 (CR) CI reject: [V:-1] deployment_server: Add --env to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[06:24:04] <wikibugs>	 (CR) RLazarus: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[06:24:07] <wikibugs>	 (PS1) Muehlenhoff: Fix wdqs-all Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1142796
[06:24:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet
[06:25:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet
[06:26:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1034:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:32] <wikibugs>	 (PS1) Kosta Harlan: temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358)
[06:51:50] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 263569
[06:52:10] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263569
[06:52:15] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268517
[06:52:26] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268517
[06:52:30] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264595
[06:52:48] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264595
[06:52:51] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 35847
[06:53:10] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 35847
[06:53:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268097
[06:54:18] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268097
[06:54:45] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 24441
[06:55:16] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 24441
[06:55:30] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61588
[06:55:54] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61588
[06:59:37] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10798932 (ayounsi) Resolved→Open Unfortunately we're not out of the wood yet...  `cr3-ulsfo> show interfaces et-0/0/0 media` still shows lo...
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T393205#10798948 (JMeybohm) There were a bunch of IO errors on May 4th and 5th, so I would believe the disk needs replacement
[07:06:06] <wikibugs>	 (CR) JMeybohm: [C:+1] python3: add python3-venv to devel image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1138442 (owner: Hashar)
[07:07:49] <wikibugs>	 (CR) JMeybohm: [C:+2] python3: add python3-venv to devel image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1138442 (owner: Hashar)
[07:08:20] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] python3: add python3-venv to devel image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1138442 (owner: Hashar)
[07:11:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:13:48] <wikibugs>	 (PS1) Slyngshede: IDP-Test: Test installation of CAS 7.1 [dns] - https://gerrit.wikimedia.org/r/1142929
[07:13:55] <wikibugs>	 (CR) Elukey: [C:+2] profile::pyrra::filesystem::slo: enable alerts for Citoid [puppet] - https://gerrit.wikimedia.org/r/1142596 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[07:14:53] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP-Test: Test installation of CAS 7.1 [dns] - https://gerrit.wikimedia.org/r/1142929 (owner: Slyngshede)
[07:14:58] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[07:16:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:17:34] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[07:18:08] <wikibugs>	 (CR) Elukey: raid: update facter and get-raid-status to allow storcli (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: Elukey)
[07:18:12] <wikibugs>	 (PS3) Elukey: raid: update facter and get-raid-status to allow storcli [puppet] - https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146)
[07:20:07] <wikibugs>	 (CR) Elukey: raid: update facter and get-raid-status to allow storcli (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: Elukey)
[07:24:24] <wikibugs>	 (PS1) Arnaudb: gerrit: add abusers [puppet] - https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544)
[07:25:17] <wikibugs>	 (CR) Elukey: [C:+2] raid: update facter and get-raid-status to allow storcli [puppet] - https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: Elukey)
[07:26:43] <wikibugs>	 (PS2) Arnaudb: gerrit: add abusers [puppet] - https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544)
[07:29:19] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544) (owner: Arnaudb)
[07:32:35] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: add abusers [puppet] - https://gerrit.wikimedia.org/r/1142930 (https://phabricator.wikimedia.org/T393544) (owner: Arnaudb)
[07:34:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:37:40] <wikibugs>	 (03CR) 10Zabe: [C:03+2] SkinTemplate: Restore a string 'class' in tabAction() [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142715 (https://phabricator.wikimedia.org/T393504) (owner: 10Zabe)
[07:38:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:39:45] <jinxer-wm>	 FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:43:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:49:25] <wikibugs>	 (03Merged) 10jenkins-bot: SkinTemplate: Restore a string 'class' in tabAction() [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1142715 (https://phabricator.wikimedia.org/T393504) (owner: 10Zabe)
[07:50:11] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1142715|SkinTemplate: Restore a string 'class' in tabAction() (T393504)]]
[07:50:14] <stashbot>	 T393504: PHP Warning: Array to string conversion - https://phabricator.wikimedia.org/T393504
[07:51:53] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10799037 (10ayounsi) 05Open→03Resolved After chatting with Cathal, we decided to leave it as it is, as moving ports requires intrusive changes (P...
[07:56:51] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1142715|SkinTemplate: Restore a string 'class' in tabAction() (T393504)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:56:54] <stashbot>	 T393504: PHP Warning: Array to string conversion - https://phabricator.wikimedia.org/T393504
[08:00:04] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T0800)
[08:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:13] <wikibugs>	 (03CR) 10Hashar: "Thank you and that worked!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 (owner: 10Hashar)
[08:02:44] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[08:03:39] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 (10ayounsi) 03NEW
[08:03:49] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10799104 (10ayounsi)
[08:04:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10799106 (10ayounsi)
[08:05:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10799110 (10ayounsi)
[08:09:13] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142715|SkinTemplate: Restore a string 'class' in tabAction() (T393504)]] (duration: 19m 01s)
[08:09:16] <stashbot>	 T393504: PHP Warning: Array to string conversion - https://phabricator.wikimedia.org/T393504
[08:15:14] <wikibugs>	 (03PS1) 10Elukey: raid::broadcom: fix perccli package name [puppet] - 10https://gerrit.wikimedia.org/r/1142978 (https://phabricator.wikimedia.org/T393146)
[08:15:30] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] raid::broadcom: fix perccli package name [puppet] - 10https://gerrit.wikimedia.org/r/1142978 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[08:16:57] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: Apply admin-root sudo policy to all instances [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797)
[08:17:21] <wikibugs>	 (03CR) 10Majavah: P:toolforge: Apply admin-root sudo policy to all instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah)
[08:24:17] <wikibugs>	 (03PS3) 10AOkoth: wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128)
[08:26:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron)
[08:27:20] <wikibugs>	 (03CR) 10Harroyo-wmf: [C:03+1] temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: 10Kosta Harlan)
[08:28:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5473/console" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse)
[08:31:04] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, but being a large "syntax" change the best way to ensure it works on all cases is testing it :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[08:31:49] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM.  I guess we can start thinking of moving this to alertmanager and possibly using some of the additional metadata - like groups - to " [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi)
[08:33:17] <wikibugs>	 (03CR) 10Volans: [C:03+2] elasticsearch: temporarily remove it from bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans)
[08:34:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5474/console" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse)
[08:35:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, I'm puzzled as to why PCC is not detecting the change though" [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse)
[08:37:06] <wikibugs>	 (03CR) 10Muehlenhoff: raid: update facter and get-raid-status to allow storcli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142518 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[08:38:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet
[08:41:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet
[08:43:13] <wikibugs>	 (03Merged) 10jenkins-bot: elasticsearch: temporarily remove it from bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1142557 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans)
[08:44:45] <jinxer-wm>	 FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:45:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch krb2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1140143 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[08:46:32] <wikibugs>	 (03CR) 10Volans: "question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey)
[08:49:45] <jinxer-wm>	 RESOLVED: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:54:04] <XioNoX>	 !log update `host-inbound-traffic system-services` on pfw1-eqiad - T390052
[08:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:08] <stashbot>	 T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052
[08:55:11] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1035.eqiad.wmnet
[08:55:12] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1035.eqiad.wmnet
[08:56:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1035:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:57:21] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Thanks, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah)
[08:59:01] <wikibugs>	 (03CR) 10Brouberol: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[08:59:28] <wikibugs>	 (03PS3) 10Majavah: P:toolforge: Apply admin-root sudo policy to all instances [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797)
[09:01:13] <wikibugs>	 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10799352 (10Volans) Sure, but we need first a decision on what's the standardized and correct way to commit automatic dbctl changes from cookbo...
[09:02:25] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge: Apply admin-root sudo policy to all instances [puppet] - 10https://gerrit.wikimedia.org/r/1139416 (https://phabricator.wikimedia.org/T392797) (owner: 10Majavah)
[09:06:55] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:07:57] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Remove Tele2, Fiberring and Novacore from check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi)
[09:09:59] <wikibugs>	 (03PS1) 10AOkoth: gerrit: apache ratelimit test [puppet] - 10https://gerrit.wikimedia.org/r/1143019
[09:10:33] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] "We already have the alert up and running : https://github.com/wikimedia/operations-alerts/blob/master/team-netops/bgp.yaml#L3" [puppet] - 10https://gerrit.wikimedia.org/r/1142786 (https://phabricator.wikimedia.org/T393401) (owner: 10Ayounsi)
[09:13:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:31:56] <wikibugs>	 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565 (10MoritzMuehlenhoff) 03NEW
[09:32:00] <wikibugs>	 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565#10799422 (10MoritzMuehlenhoff) p:05Triage→03High
[09:34:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet
[09:36:04] <wikibugs>	 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565#10799428 (10hnowlan) a:03hnowlan I believe these are both safe to clean up, I'll handle it.
[09:36:43] <wikibugs>	 (03PS1) 10Elukey: raid: allow OK in general state for get-raid-status-broadcom.py [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146)
[09:38:46] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1229267) is awaiting input
[09:40:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1143025
[09:41:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet
[09:41:31] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1143025 (owner: 10Muehlenhoff)
[09:41:58] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[09:42:26] <wikibugs>	 (03CR) 10Elukey: [C:03+2] raid: allow OK in general state for get-raid-status-broadcom.py [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[09:43:59] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:46:55] <jinxer-wm>	 FIRING: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:47:25] <wikibugs>	 06SRE, 06serviceops: Cert expiry warning for zotero.discovery.wmnet and wikifeeds - https://phabricator.wikimedia.org/T393565#10799457 (10akosiaris) 05Open→03Resolved {{done}}
[09:48:59] <jinxer-wm>	 RESOLVED: [3x] PuppetCertificateAboutToExpire: Puppet CA certificate purged is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:51:33] <wikibugs>	 (03PS4) 10Hnowlan: trafficserver: route all but zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724)
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[09:52:52] <wikibugs>	 (03PS5) 10Hnowlan: trafficserver: route all but enwiki/zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724)
[09:54:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1036.eqiad.wmnet
[09:54:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1036.eqiad.wmnet
[09:55:43] <wikibugs>	 (03PS3) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782)
[09:55:48] <wikibugs>	 (03CR) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[09:55:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet
[09:56:31] <wikibugs>	 (03CR) 10David Caro: raid: allow OK in general state for get-raid-status-broadcom.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1036:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:56:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1143025 (owner: 10Muehlenhoff)
[09:57:04] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] trafficserver: route all but enwiki/zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) (owner: 10Hnowlan)
[09:59:33] <wikibugs>	 (03PS1) 10Elukey: raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146)
[09:59:46] <wikibugs>	 (03CR) 10Elukey: [C:03+2] raid: allow OK in general state for get-raid-status-broadcom.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143023 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1000)
[10:00:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[10:01:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[10:01:55] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:51] <wikibugs>	 (03PS2) 10Elukey: raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146)
[10:08:38] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[10:08:46] <wikibugs>	 (03CR) 10Elukey: [C:03+2] raid: fix get-raid-status-broadcom.py script [puppet] - 10https://gerrit.wikimedia.org/r/1143026 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[10:10:15] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: route all but enwiki/zhwiki PCS routes via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) (owner: 10Hnowlan)
[10:11:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10799493 (10MoritzMuehlenhoff)
[10:11:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10799494 (10MoritzMuehlenhoff)
[10:11:55] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:13] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1037.eqiad.wmnet
[10:14:14] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1037.eqiad.wmnet
[10:16:55] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:17:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet
[10:18:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[10:19:10] <moritzm>	 FYI, kubestagemaster1003 will briefly go down for a Ganeti reboot
[10:19:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet
[10:21:09] <icinga-wm>	 PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100%
[10:22:21] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on krb2002.codfw.wmnet with reason: update to Bookworm
[10:22:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10799527 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3de6b492-82de-43f4-8903-cb18d7303b18) set by jmm@cumin2002 for 3:00:00 on 1 host(s) and their services with reason: update t...
[10:25:43] <icinga-wm>	 RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[10:25:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:26:55] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service kubestagemaster1003:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet
[10:27:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet
[10:27:27] <moritzm>	 !log upgrading krb2002 to Bookworm T390863
[10:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:29] <stashbot>	 T390863: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863
[10:27:49] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] temp accounts: Remove AutopromoteOnce configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142858 (https://phabricator.wikimedia.org/T393358) (owner: 10Kosta Harlan)
[10:28:59] <jinxer-wm>	 FIRING: [21x] ProbeDown: Service ganeti1038:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:30:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:31:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet
[10:33:13] <wikibugs>	 (03PS2) 10Tchanders: Assign IP auto-reveal rights to certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142649 (https://phabricator.wikimedia.org/T386492)
[10:34:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[10:35:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:40:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[10:40:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:40:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet
[10:42:23] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] benthos/mw_accesslog_metrics: increase buffering [puppet] - 10https://gerrit.wikimedia.org/r/1142625 (owner: 10Kamila Součková)
[10:42:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:43:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet
[10:46:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet
[10:46:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393574 (10ops-monitoring-bot) 03NEW
[10:46:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393575 (10ops-monitoring-bot) 03NEW
[10:47:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393576 (10ops-monitoring-bot) 03NEW
[10:47:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393577 (10ops-monitoring-bot) 03NEW
[10:47:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10799666 (10MoritzMuehlenhoff)
[10:49:39] <wikibugs>	 (03PS1) 10Elukey: icinga: update raid_handler.py with 'broadcom' instead of 'perccli' [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146)
[10:51:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet
[10:51:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet
[10:56:55] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:57:00] <wikibugs>	 (03CR) 10Elukey: "The RAID_TYPES variable seems not used in the raid_handler.py script, but I'd proceed anyway for consistency." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[10:57:01] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[10:57:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet
[11:00:04] <jouncebot>	 mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1100).
[11:00:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet
[11:01:04] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T393578 (10Seddon) 03NEW
[11:01:21] <moritzm>	 FYI, kubestagemaster1004 and dse-k8s-etcd1002 will briefly go down for a Ganeti reboot
[11:01:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet
[11:01:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579 (10Seddon) 03NEW
[11:01:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T393578#10799730 (10Seddon) 05Open→03Invalid
[11:03:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:03:39] <icinga-wm>	 PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100%
[11:03:57] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:05:45] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[11:06:07] <icinga-wm>	 RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[11:06:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet
[11:06:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet
[11:07:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet
[11:08:16] <wikibugs>	 (03PS10) 10Ayounsi: netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437
[11:08:31] <jinxer-wm>	 RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:09:14] <wikibugs>	 (03PS13) 10Ayounsi: wmf-netbox use core Homer GraphQL based fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577)
[11:10:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:10:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:12:02] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[11:12:52] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2203.codfw.wmnet with reason: Maintenance
[11:15:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:17:46] <wikibugs>	 (03PS1) 10Kamila Součková: mw-cron/CampaignEvents: Migrate aggregateanswers-{meta,office}wiki [puppet] - 10https://gerrit.wikimedia.org/r/1143058 (https://phabricator.wikimedia.org/T385867)
[11:19:01] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: split Gerrit and Gitiles proxy pools [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[11:41:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1193 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet
[11:44:11] <wikibugs>	 (03PS5) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784)
[11:46:46] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1359506) is awaiting input
[11:47:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10799897 (10cmassaro) @tappof Thank you! I am not actually sure. I'm looking at https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/data/data.yaml,...
[11:49:50] <moritzm>	 FYI, ml-etcd1001 will briefly go down for a Ganeti reboot
[11:50:29] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:50:47] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:51:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] icinga: update raid_handler.py with 'broadcom' instead of 'perccli' [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[11:53:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10799978 (10MoritzMuehlenhoff) >>! In T393140#10799897, @cmassaro wrote: > @tappof Thank you! I am not actually sure. I'm looking at https://phabricator.wikimedia.org/source/operations-pu...
[11:55:12] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1359506) is awaiting input
[11:55:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet
[11:56:41] <wikibugs>	 (03PS3) 10Ayounsi: WMF-Plugin: Potential clean-up of b-end circuit finding logic [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney)
[11:56:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "It's used in parse_args() it seems." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[11:57:47] <icinga-wm>	 PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:58:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:58:52] <wikibugs>	 (03CR) 10Brouberol: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[11:58:57] <wikibugs>	 (03CR) 10Ayounsi: WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney)
[12:00:27] <icinga-wm>	 RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[12:00:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet
[12:00:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet
[12:01:48] <wikibugs>	 (03CR) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[12:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:02] <wikibugs>	 (03PS6) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784)
[12:05:32] <wikibugs>	 (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069
[12:05:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet
[12:07:36] <wikibugs>	 (03CR) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[12:11:44] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1382334) is awaiting input
[12:12:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet
[12:13:33] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587 (10Justman10000) 03NEW
[12:14:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800019 (10Justman10000) And how to submit the SSH key? As file? Via text?
[12:14:48] <wikibugs>	 (03CR) 10Cathal Mooney: WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney)
[12:15:40] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 (owner: 10Muehlenhoff)
[12:17:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet
[12:18:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet
[12:18:13] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM if you've tested it against all devices." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney)
[12:18:19] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] netbox: add fetch_device_interfaces using GraphQL [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi)
[12:18:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] wmf-netbox use core Homer GraphQL based fetch_device_interfaces() [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1094284 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[12:20:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:45] <wikibugs>	 (03PS1) 10Kamila Součková: mw-cron/WikimediaEvents: Migrate UpdatePeriodicMetrics jobs [puppet] - 10https://gerrit.wikimedia.org/r/1143073 (https://phabricator.wikimedia.org/T388542)
[12:22:00] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[12:25:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:27:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[12:32:18] <wikibugs>	 (03CR) 10Brouberol: "Small edit suggestion" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[12:35:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet
[12:38:01] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:38:28] <moritzm>	 !log installing imagemagick security updates
[12:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:36] <wikibugs>	 (03CR) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[12:41:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800154 (10Aklapper) 05Open→03Declined Per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups, `ops` is for SRE staff only.
[12:41:52] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1412672) is awaiting input
[12:41:55] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:43:01] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:43:11] <wikibugs>	 (03PS7) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784)
[12:43:49] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[12:43:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800161 (10Aklapper) Also per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access this is the wrong form. Which docs are you following and why?  Please also see T393499#1079...
[12:44:04] <wikibugs>	 (03PS2) 10Hashar: gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467)
[12:44:14] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-04-16-213143 to 2025-05-07-003410 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143074 (https://phabricator.wikimedia.org/T386239)
[12:44:25] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143075 (https://phabricator.wikimedia.org/T386239)
[12:44:44] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[12:45:46] <wikibugs>	 (03PS8) 10Btullis: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784)
[12:46:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet
[12:47:04] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[12:48:53] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:11] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:43] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:50:01] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet
[12:51:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet
[12:57:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting production SSH key update for Joseph Seddon - https://phabricator.wikimedia.org/T393579#10800203 (10NBaca-WMF) As Seddon’s manager I approve this request
[12:58:04] <Amir1>	 !log [wikishared]> CREATE INDEX translation_last_updated_timestamp ON cx_translations (translation_last_updated_timestamp); (T392839)
[12:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:08] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet
[13:00:12] <Lucas_WMDE>	 I can’t deploy today
[13:01:18] <wikibugs>	 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595 (10isarantopoulos) 03NEW
[13:02:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800232 (10isarantopoulos)
[13:04:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800237 (10isarantopoulos)
[13:05:14] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1437955) is awaiting input
[13:05:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800239 (10Justman10000) >>! In T393587#10800154, @Aklapper wrote: > Per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups, `ops` is for SRE staff only.  But I need `ops` for o...
[13:05:47] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[13:05:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 (owner: 10Muehlenhoff)
[13:06:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800240 (10Justman10000) >>! In T393587#10800161, @Aklapper wrote: > Also per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access this is the wrong form. Which doc...
[13:06:06] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143069 (owner: 10Muehlenhoff)
[13:06:43] <Daimona>	 Amir1: do you think this is compatible with the CX queries? T393513
[13:06:43] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[13:06:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[13:07:02] <Amir1>	 Daimona: I'm debugging
[13:07:06] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[13:07:08] <hashar>	 !log Restarted Apache httpd server on Gerrit server
[13:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:11] <Amir1>	 was there a newer one than yesterday?
[13:07:14] <logmsgbot>	 !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[13:07:15] <Daimona>	 Okay great! Let me know if there's anything I can help with
[13:07:22] <Daimona>	 No, this is the one from yesterday evening
[13:07:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet
[13:07:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz - https://phabricator.wikimedia.org/T393595#10800246 (10isarantopoulos) I approve adding Bartos...
[13:07:41] <Daimona>	 It's similar to a pattern I saw a few days ago with a spike in open connections
[13:07:43] <Amir1>	 yeah, that I'm looking at. There are still pieces in CX that are slow but I want to double check everything
[13:07:52] <moritzm>	 !log installing poppler security updates
[13:07:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:12:15] <wikibugs>	 (03PS1) 10Jelto: Revert "gerrit: lower connections to Gitiles from 25 to 4" [puppet] - 10https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467)
[13:12:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet
[13:12:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet
[13:13:05] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "thanks for the quick revert" [puppet] - 10https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[13:13:59] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Pass krb2002 to Kerberos clients again [puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[13:15:14] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[13:15:22] <wikibugs>	 (03CR) 10Elukey: "Oh right, it holds the 'choices', I missed it. Now I am wondering why it keeps working though." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[13:16:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800278 (10Aklapper) >>! In T393587#10800239, @Justman10000 wrote: > But I need `ops` for optimal working!  Working on what? And //what exactly// makes you think so? So far I have found on...
[13:16:54] <wikibugs>	 (03CR) 10Btullis: [C:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[13:18:19] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: remove ms-be1060 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1140130 (https://phabricator.wikimedia.org/T392796) (owner: 10MVernon)
[13:18:36] <wikibugs>	 (03Merged) 10jenkins-bot: Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143060 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[13:19:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet
[13:19:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393598 (10ops-monitoring-bot) 03NEW
[13:20:41] <wikibugs>	 (03CR) 10Ssingh: icinga: skip services in wait_for_optimal if needed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey)
[13:21:25] <logmsgbot>	 !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[13:21:43] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[13:22:57] <wikibugs>	 (03CR) 10Ayounsi: WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney)
[13:23:08] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10800315 (10MatthewVernon)
[13:23:27] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: add ms-fe101[5,6] as new proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1140752 (https://phabricator.wikimedia.org/T388886) (owner: 10MVernon)
[13:24:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet
[13:25:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800321 (10Aklapper) >>! In T393587#10800240, @Justman10000 wrote: > Which one should I follow?  My question was "Which docs are you following and why?". This has remained unanswered.
[13:25:21] <logmsgbot>	 !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[13:25:34] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] WMF-Plugin: Potential clean-up of b-end circuit finding logic (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1122524 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney)
[13:46:21] <wikibugs>	 (03CR) 10Herron: logs-api: add write/delete acl via htgroup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron)
[13:46:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet
[13:47:04] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[13:47:15] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5481/co" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[13:48:15] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "Lemme know if it works now! :)" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[13:48:25] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800405 (10Aklapper) >>! In T393587#10800392, @Justman10000 wrote: >> Please provide a link to non-trivial, merged code changes of yours. >  > I don't have one!  Then I do not think that y...
[13:49:57] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800416 (10Aklapper) > I just don't want to look stupid when I want to do something, but I can't because no permission!  Looking stupid is much much more acceptable than not following http...
[13:50:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1117 to cirrussearch1117 - bking@cumin2002"
[13:50:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1117 to cirrussearch1117 - bking@cumin2002"
[13:50:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:50:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1117 on all recursors
[13:50:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1117 on all recursors
[13:50:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1117
[13:50:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800429 (10Justman10000) >>! In T393587#10800405, @Aklapper wrote: >>>! In T393587#10800392, @Justman10000 wrote: >>> Please provide a link to non-trivial, merged code changes of...
[13:50:51] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[13:51:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:51:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1117
[13:52:10] <moritzm>	 !log installing nginx security updates
[13:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:31] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10800438 (10Gehel)
[13:52:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1117 to cirrussearch1117
[13:52:39] <wikibugs>	 (03CR) 10Herron: [C:03+1] "LGTM! please see comment before submitting" [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[13:53:50] <wikibugs>	 (03PS1) 10Brouberol: mediawiki-dumps-legacy: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143101 (https://phabricator.wikimedia.org/T389784)
[13:56:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:56:32] <wikibugs>	 (03PS1) 10Btullis: Fix typo in mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143103 (https://phabricator.wikimedia.org/T389784)
[13:56:51] <wikibugs>	 (03CR) 10Elukey: [V:03+1] profile::pyrra::filesystem::slos: add test for revertrisk LA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[13:56:53] <wikibugs>	 (03PS4) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350)
[13:57:02] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[13:57:02] <wikibugs>	 (03PS1) 10Jelto: gerrit: add more ranges to gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143104 (https://phabricator.wikimedia.org/T393498)
[13:57:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1116.eqiad.wmnet with OS bullseye
[13:58:27] <wikibugs>	 (03Abandoned) 10Btullis: Fix typo in mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143103 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis)
[13:58:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1117.eqiad.wmnet with OS bullseye
[13:58:33] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143101 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol)
[13:58:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1118 to cirrussearch1118
[13:59:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393574#10800460 (10elukey) 05Open→03Invalid My fault, related to T393146.
[13:59:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393575#10800465 (10elukey) 05Open→03Invalid My fault, related to T393146.
[13:59:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:59:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393576#10800470 (10elukey) 05Open→03Invalid My fault, related to T393146.
[13:59:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393577#10800475 (10elukey) 05Open→03Invalid My fault, related to T393146.
[13:59:41] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[13:59:42] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1143104 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto)
[14:00:02] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: add more ranges to gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1143104 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto)
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1400)
[14:00:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10800482 (10BTullis) 05Open→03Resolved This should be all working now @JVanderhoop-WMF - I'...
[14:00:30] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143101 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol)
[14:00:36] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add scampos to the analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/1142679 (https://phabricator.wikimedia.org/T393066) (owner: 10Btullis)
[14:01:12] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update evaluators from 2025-04-16-213143 to 2025-05-07-003410 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143074 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester)
[14:01:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:01:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1090.eqiad.wmnet - https://phabricator.wikimedia.org/T393598#10800488 (10elukey) 05Open→03Invalid My fault, related to T393146.
[14:02:25] <wikibugs>	 (03CR) 10Elukey: profile::pyrra::filesystem::slos: add test for revertrisk LA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[14:02:55] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-04-16-213143 to 2025-05-07-003410 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143074 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester)
[14:03:20] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[14:03:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1118 to cirrussearch1118 - bking@cumin2002"
[14:03:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1118 to cirrussearch1118 - bking@cumin2002"
[14:03:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:03:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1118 on all recursors
[14:03:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1118 on all recursors
[14:03:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1118
[14:04:32] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:05:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1118
[14:05:34] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:05:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[14:05:49] <wikibugs>	 (03CR) 10Hnowlan: "We should remove the `rerendered_pcs_wikis` entry in helmfile.d/services/changeprop/values.yaml also." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos)
[14:06:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1118 to cirrussearch1118
[14:06:57] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:07:17] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:07:17] <wikibugs>	 (03PS1) 10Brouberol: Fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143106 (https://phabricator.wikimedia.org/T389784)
[14:07:39] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:07:41] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "From the doc at https://httpd.apache.org/docs/2.4/mod/mod_proxy.html `max` applies on a per child process. So with 5 child processes that " [puppet] - 10https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[14:07:50] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:08:15] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Requesting access to <Superset> for <SCampos-WMF> - https://phabricator.wikimedia.org/T393066#10800522 (10BTullis) 05In progress→03Resolved This should be working now @SCampos-WMF - Please feel free to let me...
[14:08:52] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143106 (https://phabricator.wikimedia.org/T389784) (owner: 10Brouberol)
[14:08:52] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:08:59] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800527 (10Aklapper) Sure, please feel free to point to other meaningful technical contributions if there are no code contributions. > Is that why one don't give someone a chance? A chance...
[14:09:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1116.eqiad.wmnet with reason: host reimage
[14:09:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1118.eqiad.wmnet with OS bullseye
[14:09:47] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:09:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1117.eqiad.wmnet with reason: host reimage
[14:10:00] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143075 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester)
[14:10:36] <wikibugs>	 (03PS2) 10Jgiannelos: pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061
[14:10:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[14:11:05] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[14:11:34] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-04-23-134615 to 2025-05-06-142345 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143075 (https://phabricator.wikimedia.org/T386239) (owner: 10Jforrester)
[14:12:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1116.eqiad.wmnet with reason: host reimage
[14:12:03] <dcausse>	 jouncebot: nowandnext
[14:12:03] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1400)
[14:12:03] <jouncebot>	 In 2 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700)
[14:12:34] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:13:03] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:13:58] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:14:48] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:15:05] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:15:20] <wikibugs>	 (03CR) 10Elukey: [C:03+2] icinga: update raid_handler.py with 'broadcom' instead of 'perccli' [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[14:15:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1117.eqiad.wmnet with reason: host reimage
[14:15:52] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:16:23] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Self answering - it breaks before reaching any phabricator code, so it didn't create wrong/invalid tasks.." [puppet] - 10https://gerrit.wikimedia.org/r/1143052 (https://phabricator.wikimedia.org/T393146) (owner: 10Elukey)
[14:16:30] <wikibugs>	 (03PS6) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086)
[14:18:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[14:18:41] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw-cron/WikimediaEvents: Migrate UpdatePeriodicMetrics jobs [puppet] - 10https://gerrit.wikimedia.org/r/1143073 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková)
[14:19:21] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Provide a quick shell script for testing prod status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143109 (https://phabricator.wikimedia.org/T369741)
[14:19:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-For-Review: Add support for Broadcom RAID controllers using storcli - https://phabricator.wikimedia.org/T393146#10800566 (10elukey) 05Open→03Resolved a:03elukey Summary:  - Renamed the perccli nagios check to a more generic broadcom, tha...
[14:20:38] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw-cron/CampaignEvents: Migrate aggregateanswers-{meta,office}wiki [puppet] - 10https://gerrit.wikimedia.org/r/1143058 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[14:21:00] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Provide a quick shell script for testing prod status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143109 (https://phabricator.wikimedia.org/T369741) (owner: 10Jforrester)
[14:22:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Grant Access to analytics-privatedata-users for Jvanderhoop-WMF - https://phabricator.wikimedia.org/T393409#10800572 (10JVanderhoop-WMF) Thank you! Can confirm it works.
[14:22:36] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Provide a quick shell script for testing prod status [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143109 (https://phabricator.wikimedia.org/T369741) (owner: 10Jforrester)
[14:22:53] <wikibugs>	 (03CR) 10Scott French: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[14:23:02] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate all parsercache jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143110 (https://phabricator.wikimedia.org/T385800)
[14:24:38] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10800593 (10CDobbins) a:05CDobbins→03ssingh
[14:26:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1119 to cirrussearch1119
[14:26:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:27:39] <jinxer-wm>	 FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1081-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh
[14:27:51] <wikibugs>	 (03PS3) 10Neslihan Turan: Create feature flags for resolving Wikibase item labels on Watchlist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141852 (https://phabricator.wikimedia.org/T388685)
[14:27:55] <wikibugs>	 (03PS1) 10Jelto: gerrit: add more abuse IPs [puppet] - 10https://gerrit.wikimedia.org/r/1143111 (https://phabricator.wikimedia.org/T393498)
[14:29:28] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: logstash.service crashloop on elastic1087:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:29:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1116.eqiad.wmnet with OS bullseye
[14:29:41] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: add more abuse IPs [puppet] - 10https://gerrit.wikimedia.org/r/1143111 (https://phabricator.wikimedia.org/T393498) (owner: 10Jelto)
[14:31:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:33:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1119 to cirrussearch1119 - bking@cumin2002"
[14:34:28] <jinxer-wm>	 FIRING: [9x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:35:51] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] graphite: Allow x-grafana-device-id header in CORS config [puppet] - 10https://gerrit.wikimedia.org/r/1142680 (https://phabricator.wikimedia.org/T393439) (owner: 10Andrea Denisse)
[14:36:28] <logmsgbot>	 bking@cumin2002 rename (PID 1527059) is awaiting input
[14:37:03] <wikibugs>	 (03CR) 10Scott French: "This is quite similar to the what I'm having to deal with for the refreshlinks jobs, which never adopted `sharded_periodic_job`." [puppet] - 10https://gerrit.wikimedia.org/r/1142579 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[14:37:39] <wikibugs>	 (03PS4) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782)
[14:38:02] <wikibugs>	 (03CR) 10Hnowlan: mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[14:39:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1118.eqiad.wmnet with reason: host reimage
[14:39:10] <moritzm>	 !log installing openjdk-17 security updates
[14:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:28] <jinxer-wm>	 FIRING: [10x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:39:37] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate listTaskCounts to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142563 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[14:40:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1119 to cirrussearch1119 - bking@cumin2002"
[14:40:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:40:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1119 on all recursors
[14:40:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1119 on all recursors
[14:40:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1119
[14:41:39] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035#10800689 (10Eevans) 05Open→03Declined >>! In T307035#10800347, @MatthewVernon wrote: > @Eevans refresh due Q2 next year per the procurement spreadsheet.  Oh, thank you!  I'm never quite...
[14:41:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1117.eqiad.wmnet with OS bullseye
[14:42:03] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate all parsercache jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143110 (https://phabricator.wikimedia.org/T385800) (owner: 10Hnowlan)
[14:42:38] <jinxer-wm>	 RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1081-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh
[14:43:47] <logmsgbot>	 bking@cumin2002 rename (PID 1527059) is awaiting input
[14:43:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1118.eqiad.wmnet with reason: host reimage
[14:44:28] <jinxer-wm>	 RESOLVED: [10x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[14:44:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, neat!" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron)
[14:44:59] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mediawiki: Allow setting env variables in mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus)
[14:47:07] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v10.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1143112
[14:47:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1119
[14:48:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1119 to cirrussearch1119
[14:50:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1143098 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans)
[14:51:08] <wikibugs>	 (03Abandoned) 10Bking: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:54:35] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrus: disable completion indices in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1141903 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:55:01] <wikibugs>	 (03CR) 10Scott French: deployment_server: Add --env to mwscript-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus)
[14:56:29] <wikibugs>	 (03PS3) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963
[14:57:04] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[14:57:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1119.eqiad.wmnet with OS bullseye
[14:58:06] <jinxer-wm>	 FIRING: CirrusSearchThreadPoolRejectionsTooHigh: elastic1081-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh
[14:58:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1120 to cirrussearch1120
[14:59:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[14:59:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1081* for thread pool rejections - bking@cumin2002
[14:59:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1081* for thread pool rejections - bking@cumin2002
[14:59:58] <jinxer-wm>	 FIRING: [3x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1058:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:00:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:02:33] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[15:02:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1060-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh
[15:03:25] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1118 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[15:04:01] <Emperor>	 !log pool ms-fe1015 ms-fe1016 new frontends T388886 T391354
[15:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:04] <stashbot>	 T388886: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886
[15:04:05] <stashbot>	 T391354: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354
[15:04:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1060*,elastic1081* for thread pool rejections - bking@cumin2002
[15:04:19] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[15:04:19] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1118 is OK: SSL OK - Certificate cirrussearch1118.eqiad.wmnet valid until 2025-06-04 14:58:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search
[15:04:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1060*,elastic1081* for thread pool rejections - bking@cumin2002
[15:04:30] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1015.eqiad.wmnet
[15:04:30] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1015.eqiad.wmnet
[15:04:31] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1015.eqiad.wmnet
[15:04:31] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1015.eqiad.wmnet
[15:04:39] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1016.eqiad.wmnet
[15:04:47] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1016.eqiad.wmnet
[15:04:55] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1016.eqiad.wmnet
[15:04:58] <jinxer-wm>	 FIRING: [27x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:05:03] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1016.eqiad.wmnet
[15:05:49] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos)
[15:06:04] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10800814 (10MatthewVernon)
[15:06:11] <wikibugs>	 (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v10.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1143112 (owner: 10Volans)
[15:06:14] <sukhe>	 !log sudo cumin -b1 -s10 'A:dnsbox' 'sudo -u authdns git -C /srv/authdns/git maintenance run' T393602
[15:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:17] <stashbot>	 T393602: Improving the time it takes to run authdns-update - https://phabricator.wikimedia.org/T393602
[15:06:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10800816 (10Stevemunene) Hosts are in a decommissioned state with no under-replicated blocks {F59748220} {F59748234} Pro...
[15:06:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1120 to cirrussearch1120 - bking@cumin2002"
[15:06:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1120 to cirrussearch1120 - bking@cumin2002"
[15:06:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:06:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1120 on all recursors
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1120 on all recursors
[15:06:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1120
[15:07:25] <wikibugs>	 (03CR) 10Volans: [C:03+2] elasticsearch: do not fail on Python 3.10+ [cookbooks] - 10https://gerrit.wikimedia.org/r/1143098 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans)
[15:08:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1060*,elastic1081*,elastic1083* for thread pool rejections - bking@cumin2002
[15:08:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1060*,elastic1081*,elastic1083* for thread pool rejections - bking@cumin2002
[15:08:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1120
[15:09:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1120 to cirrussearch1120
[15:09:28] <wikibugs>	 (03PS1) 10MVernon: hiera: remove ms-be1060 from swift storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1143118 (https://phabricator.wikimedia.org/T391354)
[15:09:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1119.eqiad.wmnet with reason: host reimage
[15:09:53] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[15:09:58] <jinxer-wm>	 FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:10:04] <sukhe>	 !log timing authdns-update for T393602
[15:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:13] <jinxer-wm>	 FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:10:55] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[15:12:31] <logmsgbot>	 bking@cumin2002 reimage (PID 1574550) is awaiting input
[15:13:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1120.eqiad.wmnet with OS bullseye
[15:14:01] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos)
[15:14:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1119.eqiad.wmnet with reason: host reimage
[15:14:58] <jinxer-wm>	 RESOLVED: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:15:37] <wikibugs>	 (03Merged) 10jenkins-bot: pcs-restbase-sunset: Remove restriction on domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143061 (owner: 10Jgiannelos)
[15:17:10] <logmsgbot>	 bking@cumin2002 rename (PID 1578957) is awaiting input
[15:17:32] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v10.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1143112 (owner: 10Volans)
[15:17:32] <wikibugs>	 (03Merged) 10jenkins-bot: elasticsearch: do not fail on Python 3.10+ [cookbooks] - 10https://gerrit.wikimedia.org/r/1143098 (https://phabricator.wikimedia.org/T390860) (owner: 10Volans)
[15:17:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10800851 (10Stevemunene) ` stevemunene@an-worker1156:~$ sudo disable-puppet "T390170 - hard drive replacement in progres...
[15:18:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10800853 (10Pppery) > Per https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups, ops is for SRE staff only.  FYI this isn't quite true - there have, at various times, been volunteers with `op...
[15:20:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1118.eqiad.wmnet with OS bullseye
[15:20:32] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: Removing WM Enterprise downloader Puppet configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[15:21:14] <wikibugs>	 (03CR) 10Eevans: [C:03+1] hiera: remove ms-be1060 from swift storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1143118 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[15:22:10] <wikibugs>	 (03CR) 10Herron: [C:03+1] profile::pyrra::filesystem::slos: add test for revertrisk LA [puppet] - 10https://gerrit.wikimedia.org/r/1142613 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[15:24:09] <jinxer-wm>	 RESOLVED: [3x] CirrusSearchThreadPoolRejectionsTooHigh: elastic1060-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh
[15:24:34] <wikibugs>	 (03CR) 10MVernon: [C:03+2] hiera: remove ms-be1060 from swift storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1143118 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon)
[15:26:12] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.decommission for hosts ms-be1060.eqiad.wmnet
[15:27:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] sre: alert on Prometheus codfw/eqiad down [alerts] - 10https://gerrit.wikimedia.org/r/1142543 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi)
[15:27:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] sre: alert on webrequest-sampled not processed [alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi)
[15:27:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] sre: alert on webrequest-sampled not processed [alerts] - 10https://gerrit.wikimedia.org/r/1142621 (https://phabricator.wikimedia.org/T393365) (owner: 10Filippo Giunchedi)
[15:28:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1121 to cirrussearch1121
[15:28:51] <hnowlan>	 jouncebot: nowandnext
[15:28:51] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 31 minute(s)
[15:28:51] <jouncebot>	 In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700)
[15:29:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to eqiad, codfw, bast for apine - https://phabricator.wikimedia.org/T393140#10800873 (10cmassaro) deploy1003.eqiad.wmnet is the one! I was able to log in there, but I've switched computers and now need access with my new SSH key.
[15:29:13] <wikibugs>	 (03PS1) 10CDanis: move geoip to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/1143123
[15:29:19] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis)
[15:29:26] <wikibugs>	 (03PS4) 10JHathaway: systemd::sysuser: create the user synchronously in the define [puppet] - 10https://gerrit.wikimedia.org/r/1141963
[15:29:30] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[15:29:37] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[15:29:42] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[15:29:45] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[15:29:55] <icinga-wm>	 PROBLEM - Host db1247 #page is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:00] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[15:30:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: remove minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1142553 (owner: 10Filippo Giunchedi)
[15:30:04] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[15:30:08] <cdanis>	 !incidents
[15:30:08] <sirenbot>	 6096 (UNACKED)  Host db1247 (paged) - PING  - Packet loss = 100%
[15:30:08] <sirenbot>	 6095 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[15:30:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[15:30:13] <cdanis>	 !ack 6096
[15:30:14] <sirenbot>	 6096 (ACKED)  Host db1247 (paged) - PING  - Packet loss = 100%
[15:30:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[15:30:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609 (10MatthewVernon) 03NEW
[15:30:19] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[15:30:23] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[15:30:30] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[15:30:42] <swfrench-wmf>	 1247 really wants to be like 1246?
[15:30:56] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10800891 (10MatthewVernon) @RobH Decom task is T393609.
[15:31:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:31:13] <wikibugs>	 (03CR) 10Herron: "thanks for the help!" [puppet] - 10https://gerrit.wikimedia.org/r/1140723 (https://phabricator.wikimedia.org/T390194) (owner: 10Herron)
[15:31:15] <swfrench-wmf>	 cdanis: O
[15:31:19] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[15:31:21] <swfrench-wmf>	 I'm around with hands if you need
[15:31:36] <cdanis>	 swfrench-wmf: it's a s4 replica, so I think we just need to depool
[15:31:43] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[15:31:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1120.eqiad.wmnet with reason: host reimage
[15:32:00] <swfrench-wmf>	 cdanis: SGTM
[15:32:13] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[15:32:13] <icinga-wm>	 RECOVERY - Host db1247 #page is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[15:32:28] <logmsgbot>	 !log cdanis@cumin1002 dbctl commit (dc=all): 'depool db1247', diff saved to https://phabricator.wikimedia.org/P75876 and previous config saved to /var/cache/conftool/dbconfig/20250507-153228-cdanis.json
[15:32:32] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10800909 (10MatthewVernon)
[15:33:41] <icinga-wm>	 PROBLEM - mysqld processes #page on db1247 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:33:42] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s4 #page on db1247 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:33:48] <icinga-wm>	 PROBLEM - MariaDB read only s4 on db1247 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[15:33:58] <swfrench-wmf>	 !incidents
[15:33:58] <sirenbot>	 6096 (ACKED)  Host db1247 (paged) - PING  - Packet loss = 100%
[15:33:58] <sirenbot>	 6097 (UNACKED)  db1247 (paged)/mysqld processes (paged)
[15:33:59] <sirenbot>	 6098 (UNACKED)  db1247 (paged)/MariaDB Replica IO: s4 (paged)
[15:33:59] <sirenbot>	 6095 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[15:34:04] <wikibugs>	 (03PS1) 10Ladsgroup: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513)
[15:34:06] <swfrench-wmf>	 !ack 6097
[15:34:07] <sirenbot>	 6097 (ACKED)  db1247 (paged)/mysqld processes (paged)
[15:34:07] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s4 #page on db1247 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:34:09] <swfrench-wmf>	 !ack 6098
[15:34:10] <sirenbot>	 6098 (ACKED)  db1247 (paged)/MariaDB Replica IO: s4 (paged)
[15:34:14] <cdanis>	 thanks swfrench-wmf 
[15:34:15] <wikibugs>	 (03PS1) 10Ladsgroup: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143125 (https://phabricator.wikimedia.org/T393513)
[15:34:20] <swfrench-wmf>	 !incidents
[15:34:20] <sirenbot>	 6096 (ACKED)  Host db1247 (paged) - PING  - Packet loss = 100%
[15:34:20] <sirenbot>	 6097 (ACKED)  db1247 (paged)/mysqld processes (paged)
[15:34:20] <sirenbot>	 6098 (ACKED)  db1247 (paged)/MariaDB Replica IO: s4 (paged)
[15:34:21] <sirenbot>	 6099 (UNACKED)  db1247 (paged)/MariaDB Replica SQL: s4 (paged)
[15:34:21] <sirenbot>	 6095 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[15:34:23] <Amir1>	 jouncebot: nowandnext
[15:34:23] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 25 minute(s)
[15:34:23] <jouncebot>	 In 1 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700)
[15:34:28] <swfrench-wmf>	 !ack 6099
[15:34:29] <sirenbot>	 6099 (ACKED)  db1247 (paged)/MariaDB Replica SQL: s4 (paged)
[15:34:44] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[15:34:48] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143125 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[15:35:04] <swfrench-wmf>	 cdanis: do you have a task at which to point a silence, or shall I open one?
[15:35:17] <cdanis>	 swfrench-wmf: please go ahead, I hadn't created one yet
[15:36:16] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.dns.netbox
[15:36:20] <icinga-wm>	 PROBLEM - Host cirrussearch1119 is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1121 to cirrussearch1121 - bking@cumin2002"
[15:36:28] <wikibugs>	 (03CR) 10JHathaway: "I needed to revert the last version of this patch, 1141952, because I failed to test on bullseye and earlier. This patch includes bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/1141963 (owner: 10JHathaway)
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1120.eqiad.wmnet with reason: host reimage
[15:37:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1121 to cirrussearch1121 - bking@cumin2002"
[15:37:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:37:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1121 on all recursors
[15:37:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1121 on all recursors
[15:37:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1121
[15:37:28] <icinga-wm>	 RECOVERY - Host cirrussearch1119 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[15:38:49] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:50] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ms-be1060.eqiad.wmnet
[15:38:55] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10800960 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: `ms-be1060.eqiad.wmnet` -...
[15:39:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1121
[15:39:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1121 to cirrussearch1121
[15:39:55] <swfrench-wmf>	 cdanis: T393612 for the restart. I'm going to put a downtime in place long enough for the DBAs to check things out and give it a clean bill of health.
[15:39:55] <stashbot>	 T393612: db1247 crash - 15:29 on 2025-05-07 - https://phabricator.wikimedia.org/T393612
[15:40:02] <cdanis>	 thanks!
[15:40:45] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 #page on db1247 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:40:47] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10800971 (10MatthewVernon) @RobH I think the above cookbook failure is expected given this host is too broken to boot reliably, but...
[15:40:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1121.eqiad.wmnet with OS bullseye
[15:40:56] <swfrench-wmf>	 !incidents
[15:40:57] <sirenbot>	 6096 (ACKED)  Host db1247 (paged) - PING  - Packet loss = 100%
[15:40:57] <sirenbot>	 6097 (ACKED)  db1247 (paged)/mysqld processes (paged)
[15:40:57] <sirenbot>	 6098 (ACKED)  db1247 (paged)/MariaDB Replica IO: s4 (paged)
[15:40:58] <sirenbot>	 6099 (ACKED)  db1247 (paged)/MariaDB Replica SQL: s4 (paged)
[15:40:58] <sirenbot>	 6100 (UNACKED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[15:40:58] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1120 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[15:41:03] <swfrench-wmf>	 !ack 6100
[15:41:04] <sirenbot>	 6100 (ACKED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[15:42:09] <swfrench-wmf>	 ... waiting on the downtime ...
[15:42:59] <logmsgbot>	 !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1247.eqiad.wmnet with reason: Host has crashed - T393612
[15:43:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1119.eqiad.wmnet with OS bullseye
[15:43:30] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1120 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[15:44:30] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-eqiad on cirrussearch1120 is OK: SSL OK - Certificate cirrussearch1120.eqiad.wmnet valid until 2025-06-04 15:38:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search
[15:44:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[15:46:27] <wikibugs>	 (03Merged) 10jenkins-bot: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143125 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[15:47:53] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "Per Nova Linguae" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[15:49:21] <zabe>	 !log zabe@mwmaint1002:~$ mwscript findBadBlobs.php enwiki --revisions 276146284,819689534,1289169661 --mark "T393237"
[15:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:24] <stashbot>	 T393237: Consistent error loading a specific enwiki page: Fatal exception of type "MediaWiki\Revision\RevisionAccessException" - https://phabricator.wikimedia.org/T393237
[15:53:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1121.eqiad.wmnet with reason: host reimage
[15:53:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801047 (10Justman10000) >>! In T393587#10800527, @Aklapper hat geschrieben: >> Is that why one don't give someone a chance? > A chance to do what exactly? Reviewing code changes? You can...
[15:53:40] <moritzm>	 !log uploaded a python-pynetbox 7.4.1-1~wmf12u1 to bookworm-wikimedia (needed for Cumin update) T389380
[15:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:43] <stashbot>	 T389380: Upgrade Cumin hosts to Bookworm - https://phabricator.wikimedia.org/T389380
[15:54:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801060 (10Justman10000) >>! In T393587#10800527, @Aklapper hat geschrieben: > Besides that, I do not know what makes you think that you need `ops` as you have not yet answered that questi...
[15:54:17] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: route reading lists API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891)
[15:55:19] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801077 (10Justman10000) >>! In T393587#10800853, @Pppery hat geschrieben: > But agreed with Aklapper that Justman10000 is nowhere near qualified for it (or even the lesser  `deployment` g...
[15:55:30] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] "again" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[15:55:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[15:58:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1121.eqiad.wmnet with reason: host reimage
[15:59:23] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] mw-cron/CampaignEvents: Migrate aggregateanswers-{meta,office}wiki [puppet] - 10https://gerrit.wikimedia.org/r/1143058 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[15:59:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10801093 (10cmooney)
[15:59:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1120.eqiad.wmnet with OS bullseye
[16:00:11] <wikibugs>	 (03PS2) 10RLazarus: deployment_server: Add --env to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925)
[16:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:48] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10801099 (10RobH) >>! In T392796#10800971, @MatthewVernon wrote: > @RobH I think the above cookbook failure is expected given this h...
[16:23:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1121.eqiad.wmnet with OS bullseye
[16:24:55] <wikibugs>	 (03PS4) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517)
[16:25:03] <wikibugs>	 (03CR) 10Bvibber: Charts phase 1 deployment (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[16:25:50] <wikibugs>	 (03CR) 10CDanis: "pcc lgtm https://puppet-compiler.wmflabs.org/output/1143123/6243/" [puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis)
[16:26:13] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[16:28:38] <wikibugs>	 (03CR) 10Scott French: [C:03+1] deployment_server: Add --env to mwscript-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142795 (https://phabricator.wikimedia.org/T380925) (owner: 10RLazarus)
[16:29:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10801288 (10Jclark-ctr) opened server Verified was connected.  i reseated all the drives while it was turned off.  and had a bunch of drives show up failed entire top row of Backplane.   Reseated drive...
[16:30:05] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[16:30:27] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "Absolutely +1 and thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1143123 (owner: 10CDanis)
[16:31:01] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[16:31:09] <logmsgbot>	 !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[16:31:23] <logmsgbot>	 !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[16:36:22] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] (duration: 29m 10s)
[16:36:25] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[16:36:54] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]]
[16:36:56] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: "File structure changed in the mean time - I did my best to track what went where and delete accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[16:38:09] <wikibugs>	 (03PS1) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617)
[16:38:26] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801332 (10Aklapper) Welcome to the concept of code review.
[16:40:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans)
[16:40:57] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801335 (10Justman10000) >>! In T393587#10801332, @Aklapper hat geschrieben: > Welcome to the concept of code review.  Exactly! And for me, it's about being able to commit directly...
[16:41:25] <wikibugs>	 (03PS1) 10Aleksandar Mastilovic: Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556)
[16:42:44] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: "MR to set the WM Enterprise downloader to "absent": https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143134" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[16:43:01] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] (duration: 06m 07s)
[16:43:04] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[16:45:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1121 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 66, number_of_data_nodes: 66, discovered_master: True, active_primary_shards: 1481, active_shards: 4444, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 31, delayed_unassigned_shards: 0, number_of_pending_tasks:
[16:45:20] <icinga-wm>	  0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.3072625698324 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:45:38] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch1121 is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 33, number_of_data_nodes: 33, discovered_master: True, active_primary_shards: 1680, active_shards: 5035, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of
[16:45:38] <icinga-wm>	 _tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92061917047033 https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:45:50] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555#10801366 (10Stevemunene) This seems to have been resolved on 2nd May 2025, apologies for the delay  {F59749689}  ` stevemunene@analytic...
[16:49:32] <wikibugs>	 (03PS2) 10JHathaway: postfix: add support for cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715)
[16:49:58] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway)
[16:50:17] <wikibugs>	 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#10801387 (10Dwisehaupt) It appears that this just happened again today starting ~1444 UTC. The check logs on our hosts show checks being run succe...
[16:50:34] <wikibugs>	 (03CR) 10Bking: "I'm doing my due diligence with Puppet catalog lookups, but also adding Reuven who has more experience with Envoy." [puppet] - 10https://gerrit.wikimedia.org/r/1142693 (owner: 10Ebernhardson)
[16:52:11] <dwisehaupt>	 could we (fr-tech) bother someone to do a `systemctl restart nsca` on the active alert host? we are seeing all of our service alerts as coming in AWOL when they are online. Some history in this phab I just updated: https://phabricator.wikimedia.org/T196336#10801387
[16:53:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801408 (10Aklapper) I guess we don't let random folks push random commits without review to potentially bring down Wikimedia websites. I hope that does not come as a surprise.
[16:53:11] <dwisehaupt>	 we may also need to clear out the mail queues for the backlog of spurious mails from this destined for fr-tech@ and fr-tech-ops@ since we are at least 90 mins behind on the queue for these and we don't need to keep the mail bomb around.
[16:54:39] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: add support for cfssl certs [puppet] - 10https://gerrit.wikimedia.org/r/1140791 (https://phabricator.wikimedia.org/T383715) (owner: 10JHathaway)
[16:55:07] <wikibugs>	 (03PS2) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617)
[16:56:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1122 to cirrussearch1122
[16:56:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans)
[16:56:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1123 to cirrussearch1123
[16:56:26] <dwisehaupt>	 swfrench-wmf: cdanis: not sure if i should ping you all as SRE on call for this ^^ but doing so. let me know if it's incorrect and i should do something else.
[16:57:43] <wikibugs>	 (03PS3) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617)
[16:58:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:58:41] <cdanis>	 !log per dwisehaupt T196336  💙cdanis@alert1002.wikimedia.org ~ 🕐☕ sudo systemctl restart nsca.service
[16:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:47] <stashbot>	 T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336
[16:59:15] <swfrench-wmf>	 dwisehaupt: cdanis: wonder if this might be related to the `FIRING: IcingaOverload: Checks are taking long to execute on alert1002:9245` in -observability?
[16:59:21] <swfrench-wmf>	 cdanis: thanks for doing that!
[16:59:30] <cdanis>	 swfrench-wmf: my suspicions are the same
[16:59:30] <dwisehaupt>	 thanks. hopefully that will help like in the past.
[16:59:52] <dwisehaupt>	 we have a plan to migrate from icinga, just got sidelined on other major projects for the last 6+ months.
[17:00:03] <wikibugs>	 (03PS4) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617)
[17:00:05] <jouncebot>	 swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1700).
[17:00:10] <denisse>	 I do think they're related.
[17:00:31] <cdanis>	 denisse: https://grafana.wikimedia.org/goto/cLMd-UbNR?orgId=1 something has been adding a *lot* of new icinga checks
[17:01:47] <cdanis>	 (that's a zoomed view of one of the mini timeseries on https://grafana.wikimedia.org/d/rsCfQfuZz/icinga?orgId=1 )
[17:02:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617) (owner: 10Eevans)
[17:03:56] <logmsgbot>	 bking@cumin2002 rename (PID 1687172) is awaiting input
[17:03:57] <swfrench-wmf>	 is that just wonky histogram buckets in the 'Check Latency' panel, or did something odd happen around 15:12?
[17:04:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:04:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1122 to cirrussearch1122 - bking@cumin2002"
[17:04:57] <volans>	 swfrench-wmf: today luca did make some changes to the perc/broadcom raid checks and there was some issue so it's possible that some checks were added before others were removed, but the net result in the end should be zero
[17:05:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10801492 (10Papaul) @Jclark-ctr thank you for looking at this. I will rebuild it and re-image.
[17:05:54] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10801494 (10Stevemunene) There's an issue with `/var/lib/hadoop/data/k/hdfs` which seems to be inaccessible and probably related to...
[17:06:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1122 to cirrussearch1122 - bking@cumin2002"
[17:06:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:06:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1122 on all recursors
[17:06:04] <wikibugs>	 (03PS1) 10Volans: Upstream release v10.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1143136
[17:06:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1122 on all recursors
[17:06:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1122
[17:07:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1122
[17:07:23] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: remove icu67 override on deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1139910 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French)
[17:08:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1122 to cirrussearch1122
[17:08:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1123 to cirrussearch1123 - bking@cumin2002"
[17:08:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1123 to cirrussearch1123 - bking@cumin2002"
[17:08:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:08:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1123 on all recursors
[17:08:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1123 on all recursors
[17:08:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1123
[17:08:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1122.eqiad.wmnet with OS bullseye
[17:08:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1123
[17:09:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1123 to cirrussearch1123
[17:09:36] <swfrench-wmf>	 thanks, volans
[17:09:40] <wikibugs>	 06SRE, 06serviceops-radar, 06SRE Observability, 10wikitech.wikimedia.org: Move meta monitoring off of wikitech-static - https://phabricator.wikimedia.org/T393625 (10andrea.denisse) 03NEW
[17:11:21] <wikibugs>	 (03CR) 10Volans: [C:03+2] Upstream release v10.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1143136 (owner: 10Volans)
[17:11:21] <swfrench-wmf>	 just confirmed that 15:12 does not appear to correlate with any puppet run on alert1002. last run prior was just before 15:00 (cleaned up elastic1119)
[17:11:31] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:12:21] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: switch deployment hosts to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1139914 (https://phabricator.wikimedia.org/T392938) (owner: 10Scott French)
[17:12:21] <wikibugs>	 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10801541 (10andrea.denisse) Hi @RobH @Andrew , we have Meta Monitoring enabled in the Wikitech static Rackspace host. Could you please provide the o...
[17:13:15] <jhathaway>	 dwisehaupt: delete all mail from /fr-tech.bnc.*@wikimedia.org/?
[17:13:35] <wikibugs>	 (03PS1) 10Sbisson: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839)
[17:13:49] <wikibugs>	 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10801548 (10RobH) So I actually have no login rights (and don't need them) for the new AWS hosted wikitech static deployment.  I just pay the AWS bi...
[17:13:50] <swfrench-wmf>	 !log disable-puppet "In-place update to PHP 8.1 - T392938" on deploy1003 and deploy2002
[17:13:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_magru
[17:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:54] <stashbot>	 T392938: Remove PHP 7.4 from deployment hosts - https://phabricator.wikimedia.org/T392938
[17:14:00] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:14:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:15:32] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host apus-fe1003
[17:15:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1123.eqiad.wmnet with OS bullseye
[17:16:21] <dwisehaupt>	 jhathaway: the mail is coming from nagios@alert1002.wikimedia.org to fr-tech@wikimedia.org and to fr-tech-ops@wikimedia.org
[17:16:27] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-fe1003
[17:17:11] <Amir1>	 swfrench-wmf: I don't know if it's related or not but my scap basically gets stuck on building images (twice so far)
[17:17:27] <dwisehaupt>	 i'm starting to see recoveries and OK status in the icinga UI for the services. 
[17:17:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1124 to cirrussearch1124
[17:17:36] <Amir1>	 going for the third time
[17:17:47] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]]
[17:17:50] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[17:17:57] <jhathaway>	 nod thanks
[17:18:33] <swfrench-wmf>	 Amir1: oh, sorry - are you running a backport during the infra window?
[17:18:47] <Amir1>	 swfrench-wmf: it was broken before that
[17:18:50] <swfrench-wmf>	 also, no - it should not be related to the PHP update I'm doing
[17:19:09] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:19:13] <swfrench-wmf>	 or rather, which thing are you asking if it's related to :)
[17:19:25] <swfrench-wmf>	 and where is your scap run getting stuck?
[17:19:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:20:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1125 to cirrussearch1125
[17:20:46] <Amir1>	 swfrench-wmf: right now it's stuck on this:
[17:20:49] <Amir1>	 https://www.irccloud.com/pastebin/Uh1rYKQd/
[17:21:00] <Amir1>	 but before that, the same thing
[17:21:30] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626 (10SKivlehan-WMF) 03NEW
[17:21:51] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v10.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1143136 (owner: 10Volans)
[17:23:07] <swfrench-wmf>	 Amir1: looking at your `scap-image-build-and-push-log`, these are both incremental builds ... that's puzzling
[17:23:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1124 to cirrussearch1124 - bking@cumin2002"
[17:23:29] <swfrench-wmf>	 i.e., I would not expect the push (what's currently pending) to take very long
[17:23:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1124 to cirrussearch1124 - bking@cumin2002"
[17:23:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:23:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1124 on all recursors
[17:23:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[17:23:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1124 on all recursors
[17:23:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1124
[17:23:46] <Amir1>	 the first one took twenty minutes until I gave up
[17:24:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1124
[17:24:32] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_magru
[17:24:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:24:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1124 to cirrussearch1124
[17:26:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1122.eqiad.wmnet with reason: host reimage
[17:27:53] <Amir1>	 swfrench-wmf: it's moving forward now, after 9 minutes. I'll take it for now, but this seems really broken
[17:28:33] <swfrench-wmf>	 Amir1: good to hear it's moving. I'm working my way through logs to try to sort out what was happening.
[17:28:59] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[17:29:10] <wikibugs>	 (03CR) 10Sbisson: "recheck" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:29:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1125 to cirrussearch1125 - bking@cumin2002"
[17:29:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1125 to cirrussearch1125 - bking@cumin2002"
[17:29:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:29:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1125 on all recursors
[17:29:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1125 on all recursors
[17:29:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1125
[17:30:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Pass krb2002 to Kerberos clients again [puppet] - 10https://gerrit.wikimedia.org/r/1143063 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[17:30:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1123.eqiad.wmnet with reason: host reimage
[17:31:15] <wikibugs>	 (03CR) 10Brouberol: Set the remaining Enterprise WM Downloader job to absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[17:31:37] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:32:44] <wikibugs>	 (03PS1) 10Sbisson: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839)
[17:32:54] <logmsgbot>	 bking@cumin2002 rename (PID 1712372) is awaiting input
[17:33:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:34:09] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:34:14] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:34:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1122.eqiad.wmnet with reason: host reimage
[17:35:25] <swfrench-wmf>	 !log deploy1003 and deploy2002 updated to PHP 8.1 - T392938
[17:35:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:28] <stashbot>	 T392938: Remove PHP 7.4 from deployment hosts - https://phabricator.wikimedia.org/T392938
[17:37:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1123.eqiad.wmnet with reason: host reimage
[17:37:42] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:37:45] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[17:38:24] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-psi-eqiad on cirrussearch1123 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[17:39:09] <jinxer-wm>	 FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:40:17] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[17:43:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:43:35] <wikibugs>	 (03PS1) 10Andrew Bogott: Initial checkin of files for openstack version 'Epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143144 (https://phabricator.wikimedia.org/T390914)
[17:43:36] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack Magnum: remove local hacks for log size limits [puppet] - 10https://gerrit.wikimedia.org/r/1143145 (https://phabricator.wikimedia.org/T390914)
[17:43:44] <icinga-wm>	 RECOVERY - Disk space on analytics1073 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops
[17:44:24] <dwisehaupt>	 cdanis: looks like some alerts will clear and then flap back to awol. not sure what's needed from here.
[17:45:24] <icinga-wm>	 RECOVERY - Disk space on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops
[17:45:25] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: Set the remaining Enterprise WM Downloader job to absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[17:45:34] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] Fix wdqs-all Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1142796 (owner: 10Muehlenhoff)
[17:46:04] <wikibugs>	 (03PS2) 10Aleksandar Mastilovic: Adding suggested edits [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556)
[17:46:32] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for production-search-psi-eqiad on cirrussearch1123 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[17:47:04] <wikibugs>	 (03CR) 10Sbisson: "recheck" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:47:34] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for production-search-psi-eqiad on cirrussearch1123 is OK: SSL OK - Certificate cirrussearch1123.eqiad.wmnet valid until 2025-06-04 17:41:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search
[17:51:37] <wikibugs>	 (03PS2) 10Volans: DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842
[17:52:37] <logmsgbot>	 vriley@cumin1002 provision (PID 3777157) is awaiting input
[17:52:56] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:53:48] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143125|Remove whatlinkshere hook (T393513)]] (duration: 36m 00s)
[17:53:51] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[17:55:19] <wikibugs>	 10SRE-tools, 10Spicerack: Cookbook downtiming does not work, continues anyway - https://phabricator.wikimedia.org/T393630 (10BCornwall) 03NEW
[17:55:40] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:55:48] <icinga-wm>	 RECOVERY - Disk space on analytics1072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops
[17:57:13] <wikibugs>	 (03CR) 10Xcollazo: "Looks like the commit message needs fixing. Otherwise patchset 2 LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[17:57:59] <wikibugs>	 (03CR) 10Sbisson: "recheck" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: 10Sbisson)
[17:58:06] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:58:24] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:58:45] <swfrench-wmf>	 Amir1: so, to follow up, all I can really tell from this point is that the 4GiB tmpfs staging area was being exhausted, repeatedly
[17:58:45] <swfrench-wmf>	 that runs counter to what I said about these being incremental builds, *but* I now realize those were incremental relative to an image built earlier in your attempts ...
[17:58:45] <swfrench-wmf>	 meaning, those might have in fact been "full" mediawiki-layer pushes - alas, since the run was interrupted, and the logs in your home directory overwritten, I can't say for certain
[17:59:09] <jinxer-wm>	 FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:59:27] <wikibugs>	 (03PS5) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517)
[17:59:30] <Amir1>	 thanks. Even this one took 36 minutes
[17:59:58] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T1800)
[18:00:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[18:01:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1125
[18:02:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1122.eqiad.wmnet with OS bullseye
[18:02:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1125 to cirrussearch1125
[18:02:16] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch1123 is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1680, active_shards: 5035, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92061917047033 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:02:50] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch1122 is OK: OK - elasticsearch status production-search-psi-eqiad: cluster_name: production-search-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 35, number_of_data_nodes: 35, discovered_master: True, active_primary_shards: 1680, active_shards: 5035, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 4, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92061917047033 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:03:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801784 (10Justman10000) >>! In T393587#10801408, @Aklapper wrote: > I guess we don't let random folks push random commits without review to potentially bring down Wikimedia webs...
[18:03:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1123.eqiad.wmnet with OS bullseye
[18:05:33] <logmsgbot>	 jmm@cumin2002 drain-node (PID 1484195) is awaiting input
[18:06:04] <dancy>	 swfrench-wmf: https://logstash.wikimedia.org/goto/e66c16bb6b4a4511d9890acabeebb1ee for old build logs
[18:06:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1124.eqiad.wmnet with OS bullseye
[18:06:56] <dancy>	 I started with the official scap logstash dashboard https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 and added the `labels.channel: scap.k8s.build` filter.
[18:06:59] <cdanis>	 dwisehaupt: I think from here you'll need to talk to the o11y sre team, sorry :( I would suspect that icinga is backlogged enough it can't process the nsca notifications
[18:07:32] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 68, number_of_data_nodes: 68, discovered_master: True, active_primary_shards: 1481, active_shards: 4444, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 31, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.3072625698324 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:07:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1125.eqiad.wmnet with OS bullseye
[18:08:00] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 68, number_of_data_nodes: 68, discovered_master: True, active_primary_shards: 1481, active_shards: 4444, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 31, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.3072625698324 https://wikitech.wikimedia.org/wiki/Search%23Administration
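The active_shards_percent_as_number values in the health checks above can be reproduced from the shard counts reported in the same messages. A minimal Python sketch, assuming the percentage is active shards divided by active plus initializing plus unassigned shards; that assumption matches the figures logged here but is not taken from the OpenSearch source, and the function name is purely illustrative:

```python
def active_shards_percent(active: int, initializing: int, unassigned: int) -> float:
    """Share of shards that are active, as a percentage.

    Assumption: total = active + initializing + unassigned. This reproduces
    the numbers in the log but is not a documented OpenSearch formula.
    """
    total = active + initializing + unassigned
    if total == 0:
        return 100.0  # assumed convention when there are no shards at all
    return active / total * 100


# Shard counts copied from the health-check messages in the log.
psi = active_shards_percent(active=5035, initializing=0, unassigned=4)
main = active_shards_percent(active=4444, initializing=0, unassigned=31)

print(f"production-search-psi-eqiad: {psi:.4f}%")
print(f"production-search-eqiad:     {main:.4f}%")
```

Both clusters report status yellow because some replica shards are still unassigned while the reimaged hosts rejoin, so the percentage sits just under 100.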
[18:08:17] <swfrench-wmf>	 dancy: indeed, thanks! so, the problem is that the ones I'm interested in were aborted, so the logs that report the full (paginated) output from the build are missing
[18:08:40] <swfrench-wmf>	 like the one at 16:08
[18:08:48] <swfrench-wmf>	 i.e., you only get the `Running sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py` logs
[18:09:06] <dwisehaupt>	 cdanis: oh cool. thanks! i'll find their channel since i'm not in it now.
[18:09:48] <swfrench-wmf>	 dwisehaupt: #wikimedia-observability
[18:09:52] <dwisehaupt>	 thanks!
[18:10:17] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616)
[18:10:38] <dancy>	 swfrench-wmf: Aww, bummer.
[18:10:51] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Repurpose 5 config B servers - https://phabricator.wikimedia.org/T380805#10801829 (10Andrew) 05Open→03Resolved a:03Andrew Yes! the other three were repurposed in https://phabricator.wikimedia.org/T392539
[18:10:58] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5485/co" [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh)
[18:11:16] <dancy>	 Anyway, it's train window time and I'm taking over for Jeena today.
[18:12:42] <dancy>	 I'm going to run `scap build-images` to see what state things are in first.
[18:12:49] <wikibugs>	 (03PS6) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517)
[18:12:52] <wikibugs>	 (03PS5) 10Eevans: restbase: decommission restbase10[28-30].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1143133 (https://phabricator.wikimedia.org/T393617)
[18:12:56] <logmsgbot>	 !log dancy@deploy1003 Started scap build-images: (no justification provided)
[18:13:27] <logmsgbot>	 !log dancy@deploy1003 Finished scap build-images: (no justification provided) (duration: 00m 30s)
[18:13:37] <swfrench-wmf>	 dancy: sounds good - I _think_ you should be in a good state, as A.mir1's run should have cleared the "apparently latent" large layer pushes
[18:13:46] <dancy>	 Yep. Fast run.
[18:13:57] <wikibugs>	 (03PS2) 10Andrew Bogott: Initial checkin of files for openstack version 'Epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143144 (https://phabricator.wikimedia.org/T390914)
[18:13:57] <wikibugs>	 (03PS2) 10Andrew Bogott: Openstack Magnum: remove local hacks for log size limits [puppet] - 10https://gerrit.wikimedia.org/r/1143145 (https://phabricator.wikimedia.org/T390914)
[18:13:58] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack codw1dev: upgrade to release 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143154 (https://phabricator.wikimedia.org/T390914)
[18:14:56] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm
[18:15:01] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143155 (https://phabricator.wikimedia.org/T386223)
[18:15:02] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10801844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm
[18:15:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143155 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot)
[18:15:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Initial checkin of files for openstack version 'Epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143144 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott)
[18:15:52] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143155 (https://phabricator.wikimedia.org/T386223) (owner: 10TrainBranchBot)
[18:15:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack Magnum: remove local hacks for log size limits [puppet] - 10https://gerrit.wikimedia.org/r/1143145 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott)
[18:18:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[18:18:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1125.eqiad.wmnet with reason: host reimage
[18:18:21] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh)
[18:18:45] <wikibugs>	 (03PS1) 10Ebernhardson: Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599)
[18:19:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801870 (10Aklapper) You have been told before first to write patches to "modify group permissions". You have not yet. Feel free to start contributing instead of asking for more permission...
[18:19:06] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[18:19:39] <logmsgbot>	 !log aokoth@dns1004 START - running authdns-update
[18:20:00] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] aphlict: revert eqiad host to active [puppet] - 10https://gerrit.wikimedia.org/r/1140217 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth)
[18:20:22] <icinga-wm>	 RECOVERY - Disk space on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops
[18:20:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Openstack codw1dev: upgrade to release 'epoxy' [puppet] - 10https://gerrit.wikimedia.org/r/1143154 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott)
[18:20:49] <logmsgbot>	 !log aokoth@dns1004 END - running authdns-update
[18:21:21] <volans>	 !log uploaded spicerack_10.2.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia
[18:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1125.eqiad.wmnet with reason: host reimage
[18:21:45] <andrewbogott>	 arnaudb: want me to merge 'Arnold Okoth: aphlict: revert eqiad host to active' ?
[18:21:54] <andrewbogott>	 oops, I mean arnoldokoth ^
[18:22:00] <andrewbogott>	 arnaudb, disregard
[18:22:16] <arnoldokoth>	 Yes please.
[18:22:20] <andrewbogott>	 ok! doing
[18:22:42] <arnoldokoth>	 Thanks!
[18:24:02] <wikibugs>	 (03PS2) 10Volans: DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843
[18:28:32] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[18:28:50] <wikibugs>	 (03PS2) 10HMonroy: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121)
[18:29:08] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.28  refs T386223
[18:29:10] <stashbot>	 T386223: 1.44.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T386223
[18:29:36] <swfrench-wmf>	 hmmmm ... `CalicoKubeControllersDown` is both worrisome and has a minimally helpful summary
[18:29:53] <dancy>	 TODO.. hehe
[18:30:25] <swfrench-wmf>	 ah, alright - `site:eqiad prometheus:k8s-dse`
[18:30:32] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "This really should use a dblist to avoid unreadable configuration code. I can help set that up if that's useful?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[18:30:35] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[18:31:28] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:31:48] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[18:32:12] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:33:59] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:34:50] <wikibugs>	 (03CR) 10Bvibber: "Yeah that's best :D I think I can figure it out; I'll poke you if I get lost :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[18:35:12] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:35:28] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:35:30] <wikibugs>	 (03PS3) 10HMonroy: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121)
[18:35:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801942 (10Justman10000) >>! In T393587#10801870, @Aklapper wrote: > You have been told before first to write patches to "modify group permissions". You have not yet. Feel free t...
[18:36:14] <wikibugs>	 (03CR) 10MusikAnimal: [C:03+1] Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[18:36:52] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:37:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10801946 (10Justman10000) What I mean to say is, I'd rather be able to do it directly than have to hope to be faster than those who can commit directly!
[18:37:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[18:37:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host apus-fe1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:38:04] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Codex and Multiblocks in Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142642 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[18:38:13] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10801948 (10Eevans) >>! In T390630#10793749, @Scott_French wrote: > After a bit of thought and some back-...
[18:38:27] <logmsgbot>	 !log hmonroy@deploy1003 Started scap sync-world: Backport for [[gerrit:1142642|Enable Codex and Multiblocks in Hebrew wiki (T377121)]]
[18:38:29] <stashbot>	 T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[18:38:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1125.eqiad.wmnet with OS bullseye
[18:39:07] <wikibugs>	 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10801953 (10Eevans)
[18:39:12] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:39:28] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:41:48] <icinga-wm>	 RECOVERY - Disk space on analytics1073 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1073&var-datasource=eqiad+prometheus/ops
[18:41:48] <icinga-wm>	 RECOVERY - Disk space on analytics1072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops
[18:42:12] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:42:28] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:43:06] <logmsgbot>	 jclark@cumin1002 provision (PID 3784978) is awaiting input
[18:43:59] <wikibugs>	 (03PS3) 10Aleksandar Mastilovic: Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556)
[18:44:12] <swfrench-wmf>	 alright, following up on the CalicoKubeControllersDown alert, it appears that the calico-kube-controllers pod in dse-k8s-eqiad is OOMing
[18:45:04] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: lvs3009: set lower priority (depool) [puppet] - 10https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: 10Ssingh)
[18:45:06] <logmsgbot>	 !log hmonroy@deploy1003 hmonroy: Backport for [[gerrit:1142642|Enable Codex and Multiblocks in Hebrew wiki (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:45:09] <stashbot>	 T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[18:45:25] <wikibugs>	 (03PS7) 10Bvibber: Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517)
[18:45:54] <wikibugs>	 (03CR) 10Bvibber: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[18:46:35] <hmonroy>	 musikanimal ready in testserver
[18:47:27] <musikanimal>	 okay great! give me a few minutes
[18:48:32] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[18:48:54] <musikanimal>	 hmonroy: looks good!
[18:49:06] <logmsgbot>	 !log hmonroy@deploy1003 hmonroy: Continuing with sync
[18:50:32] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[18:53:25] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10801995 (10Jclark-ctr) @MatthewVernon Can you update the eqiad.yaml file for this one? I think some things were missed; it will not image in eqiad for @VRiley-WMF
[18:55:48] <logmsgbot>	 !log hmonroy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142642|Enable Codex and Multiblocks in Hebrew wiki (T377121)]] (duration: 17m 21s)
[18:55:53] <stashbot>	 T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[18:56:05] <hmonroy>	 musikanimal: done!
[18:56:13] <musikanimal>	 \o/
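The duration that scap reports for the sync-world run can be cross-checked against the channel timestamps of its Started and Finished messages. A small Python sketch, assuming both timestamps fall on the same day and that the delay between the event and the bot's message is negligible; the helper name is made up for illustration:

```python
from datetime import datetime


def elapsed(start: str, end: str) -> str:
    """Format the gap between two HH:MM:SS timestamps as 'MMm SSs'."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes:02d}m {seconds:02d}s"


# Channel timestamps of the "Started scap sync-world" and
# "Finished scap sync-world" messages for the Hebrew wiki backport.
print(elapsed("18:38:27", "18:55:48"))
```

The result agrees with the 17m 21s in the Finished message. The earlier build-images run shows the limits of the negligible-delay assumption: its two messages are 31 seconds apart, while scap reported 00m 30s.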
[18:57:16] <wikibugs>	 (03PS1) 10Cory Massaro: Change red to blue: blue=bad, green=good, yellow=yyyeah... [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165
[19:00:07] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802011 (10RobH) a:03RobH I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, and idrac be updated before they...
[19:03:36] <wikibugs>	 (03CR) 10David Martin: [C:03+1] Change red to blue: blue=bad, green=good, yellow=yyyeah... [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143165 (owner: 10Cory Massaro)
[19:06:23] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] Set the remaining Enterprise WM Downloader job to absent [puppet] - 10https://gerrit.wikimedia.org/r/1143134 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic)
[19:13:08] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802052 (10RobH) I can see it seems to have randomly fired a few times:  ` Mon Mar 17 2025 13:32:01  A fatal error was detected on a component at bus 4 de...
[19:16:14] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83589MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[19:18:49] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "Looks reasonable. When switching everything will be pretty cold in the new datacenter, i added https://wikitech.wikimedia.org/wiki/Search/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse)
[19:19:39] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] "try again" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[19:21:18] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802084 (10RobH) Support request confirmed as 'after hours english support' so I had to fill out my contact details a second time and request the upload u...
[19:22:27] <wikibugs>	 (03PS1) 10Cwhite: logstash: calculate w3c generated timestamp [puppet] - 10https://gerrit.wikimedia.org/r/1143169 (https://phabricator.wikimedia.org/T266886)
[19:22:59] <wikibugs>	 (03PS1) 10Ladsgroup: Improve circuit breaking error message [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930)
[19:25:44] <wikibugs>	 (03PS2) 10Ebernhardson: Update plugins for extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1143156 (https://phabricator.wikimedia.org/T317599)
[19:25:58] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609#10802095 (10VRiley-WMF)
[19:26:08] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609#10802097 (10VRiley-WMF) This is completed
[19:28:46] <wikibugs>	 (03PS1) 10Ladsgroup: Remove hard-coded timestamps in SpecialGlobalContributionsTest [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143172 (https://phabricator.wikimedia.org/T393531)
[19:28:58] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Remove hard-coded timestamps in SpecialGlobalContributionsTest [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143172 (https://phabricator.wikimedia.org/T393531) (owner: 10Ladsgroup)
[19:29:23] <logmsgbot>	 bking@cumin2002 reimage (PID 1765198) is awaiting input
[19:30:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[19:34:31] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10802125 (10VRiley-WMF) 05Open→03Resolved This unit has been decommed. We will ensure these disks are certainly shredded.
[19:35:14] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-fe1003.eqiad.wmnet with OS bookworm
[19:35:23] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10802129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host apus-fe1003.eqiad.wmnet with OS bookworm executed with errors: - apus-fe...
[19:38:25] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Charts phase 1 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: 10Bvibber)
[19:38:36] <wikibugs>	 06SRE, 10Icinga, 10observability, 10Observability-Alerting: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336#10802136 (10Dwisehaupt) The nsca restart by cdanis helped temporarily but the awol condition quickly returned. It fully cleared up after an icinga...
[19:38:59] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:39:09] <jinxer-wm>	 FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[19:42:02] <icinga-wm>	 PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100%
[19:43:20] <icinga-wm>	 RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[19:44:52] <icinga-wm>	 PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[19:45:14] <wikibugs>	 (03PS1) 10Bking: dse-k8s-eqiad: raise calico requests and remove limits. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636)
[19:45:35] <wikibugs>	 (03Merged) 10jenkins-bot: Remove hard-coded timestamps in SpecialGlobalContributionsTest [extensions/CheckUser] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143172 (https://phabricator.wikimedia.org/T393531) (owner: 10Ladsgroup)
[19:50:44] <icinga-wm>	 PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100%
[19:52:12] <icinga-wm>	 RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[19:52:27] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Improve circuit breaking error message [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[19:52:47] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] "hit me baby one more time" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[19:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[19:55:47] <wikibugs>	 (03CR) 10Scott French: dse-k8s-eqiad: raise calico requests and remove limits. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: 10Bking)
[19:58:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[19:58:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T2000).
[20:00:05] <jouncebot>	 bvibber and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:13] <bvibber>	 o/
[20:01:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:02:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:02:41] <wikibugs>	 (03PS2) 10Bking: dse-k8s-eqiad: raise calico requests and remove limits. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636)
[20:03:00] <stephanebisson>	 o/
[20:03:48] <jeena>	 I can deploy
[20:03:57] <jeena>	 just need a couple minutes
[20:04:02] <wikibugs>	 (03Merged) 10jenkins-bot: Improve circuit breaking error message [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143171 (https://phabricator.wikimedia.org/T360930) (owner: 10Ladsgroup)
[20:04:03] <bvibber>	 cool :)
[20:04:06] <wikibugs>	 (03Merged) 10jenkins-bot: Remove whatlinkshere hook [extensions/Flow] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1143124 (https://phabricator.wikimedia.org/T393513) (owner: 10Ladsgroup)
[20:04:33] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143124|Remove whatlinkshere hook (T393513)]], [[gerrit:1143171|Improve circuit breaking error message (T360930)]], [[gerrit:1143172|Remove hard-coded timestamps in SpecialGlobalContributionsTest (T393531)]]
[20:04:39] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[20:04:39] <stashbot>	 T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930
[20:04:40] <stashbot>	 T393531: SpecialGlobalContributionsTest::testExecuteTarget with data set "Valid IP" failure in other extensions' build - https://phabricator.wikimedia.org/T393531
[20:07:46] <jeena>	 bvibber: can yours both go out together?
[20:08:13] <bvibber>	 yes
[20:09:11] <Amir1>	 I can do the deploy, mine will finish soon
[20:09:27] <Amir1>	 (depending on how slow these things are)
[20:09:36] <jeena>	 oh i didn't realize you were deploying!
[20:09:48] <jeena>	 sorry
[20:09:53] <bvibber>	 aiee :)
[20:10:23] <jeena>	 I stopped mine
[20:10:32] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[20:10:47] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[20:11:27] <Amir1>	 sorry, my previous deploy broke so many times
[20:11:44] <Amir1>	 and this one is taking way too long, bleeding into this window
[20:11:53] <jeena>	 that's okay, I should have checked the backscroll more closely
[20:12:30] <Amir1>	 scap is really slow today, one deploy I had took 36 minutes :/
[20:12:47] <jeena>	 hmm strange
[20:13:10] <jeena>	 last time I backported it took a while but I thought it was because of the localization changes
[20:13:50] <wikibugs>	 (CR) Scott French: [C:+1] dse-k8s-eqiad: raise calico requests and remove limits. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: Bking)
[20:18:26] <wikibugs>	 (PS3) Scott French: P:mw::maintenance::refreshlinks: rename and prepare for mw-cron [puppet] - https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530)
[20:18:26] <wikibugs>	 (CR) Scott French: "Alright, despite the *very* long commit message, I think this is the simplest option that gets us out of the business of using the `<shard" [puppet] - https://gerrit.wikimedia.org/r/1143121 (https://phabricator.wikimedia.org/T388530) (owner: Scott French)
[20:18:29] <wikibugs>	 (PS3) Scott French: P:mw::maintenance::refreshlinks: migrate s8 to mw-cron [puppet] - https://gerrit.wikimedia.org/r/1143122 (https://phabricator.wikimedia.org/T388530)
[20:21:02] <wikibugs>	 (CR) Bking: dse-k8s-eqiad: raise calico requests and remove limits. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: Bking)
[20:21:29] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] hiera: lvs3009: set lower priority (depool) [puppet] - https://gerrit.wikimedia.org/r/1143153 (https://phabricator.wikimedia.org/T393616) (owner: Ssingh)
[20:21:55] <wikibugs>	 (CR) Bking: [C:+2] dse-k8s-eqiad: raise calico requests and remove limits. [deployment-charts] - https://gerrit.wikimedia.org/r/1143179 (https://phabricator.wikimedia.org/T393636) (owner: Bking)
[20:22:18] <Amir1>	 > 20:21:58 Finished build-and-push-container-images (duration: 16m 37s)
[20:22:21] <Amir1>	 This is not normal
[20:22:54] <dancy>	 The multiversion image was a full build 
[20:23:06] <sukhe>	 !log depooling lvs3009 for HW maint: T393616
[20:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:08] <stashbot>	 T393616: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616
[20:23:29] <dancy>	 Amir1: Updated 536 CDB files(s) in /srv/mediawiki-staging/php-1.44.0-wmf.28/cache/l10n is the reason for the full image build.
[20:24:02] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802334 (ssingh) >>! In T393616#10802011, @RobH wrote: > I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, a...
[20:24:09] <jinxer-wm>	 FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[20:24:17] <Amir1>	 dancy: aaaah, that makes sense now. Thanks
[20:24:33] <wikibugs>	 (PS1) Andrew Bogott: wmf_sink: remove an arg when asking keystone for endpoints [puppet] - https://gerrit.wikimedia.org/r/1143183 (https://phabricator.wikimedia.org/T390914)
[20:25:25] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[20:26:00] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009*} and A:liberica (T393616)
[20:26:07] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009*} and A:liberica (T393616)
[20:26:18] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, decommission-hardware: decommission ms-be1060.eqiad.wmnet - https://phabricator.wikimedia.org/T393609#10802343 (VRiley-WMF) Open→Resolved
[20:26:30] <jinxer-wm>	 FIRING: LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=esams&var-instance=lvs3009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[20:26:41] <sukhe>	 hmm that's not cool
[20:26:58] <sukhe>	 probably a race condition given I just ran it
[20:27:02] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[20:27:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1144:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1144 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:28:15] <logmsgbot>	 !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[20:31:30] <jinxer-wm>	 RESOLVED: LibericaStaleConfig: Liberica instance lvs3009 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=esams&var-instance=lvs3009 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[20:32:59] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1143124|Remove whatlinkshere hook (T393513)]], [[gerrit:1143171|Improve circuit breaking error message (T360930)]], [[gerrit:1143172|Remove hard-coded timestamps in SpecialGlobalContributionsTest (T393531)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:33:02] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[20:33:04] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[20:33:05] <stashbot>	 T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930
[20:33:05] <stashbot>	 T393531: SpecialGlobalContributionsTest::testExecuteTarget with data set "Valid IP" failure in other extensions' build - https://phabricator.wikimedia.org/T393531
[20:34:05] <wikibugs>	 (CR) Ladsgroup: [C:+2] Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) (owner: Jdlrobson)
[20:34:10] <wikibugs>	 (CR) Ladsgroup: [C:+2] Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) (owner: Bvibber)
[20:34:14] <wikibugs>	 (CR) Ladsgroup: [C:+2] Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:34:54] <wikibugs>	 (CR) Ladsgroup: Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:35:17] <bvibber>	 \o/
[20:35:49] <bd808>	 bvibber: :hype:
[20:36:54] <jeena>	 can we deploy now?
[20:37:15] <jeena>	 I think it's still syncing
[20:37:41] <wikibugs>	 (CR) Ladsgroup: [C:-1] "I think this patch is broken. You need to add the dblist to the DB_LISTS, otherwise it doesn't work. The tests should have caught this" [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:38:29] <bvibber>	 on it
[20:38:43] <Amir1>	 yeah, it's still syncing
[20:38:45] <wikibugs>	 (PS2) Andrew Bogott: wmf_sink: remove an arg when asking keystone for endpoints [puppet] - https://gerrit.wikimedia.org/r/1143183 (https://phabricator.wikimedia.org/T390914)
[20:39:09] <wikibugs>	 (PS8) Bvibber: Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517)
[20:39:27] <wikibugs>	 (CR) Bvibber: "Whoops! Should be fixed now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:39:43] <bvibber>	 and now we wait for the tests again :D
[20:39:55] <wikibugs>	 (CR) CI reject: [V:-1] Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:39:56] <wikibugs>	 (CR) Ladsgroup: "The diff CI is broken:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:40:01] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmf_sink: remove an arg when asking keystone for endpoints [puppet] - https://gerrit.wikimedia.org/r/1143183 (https://phabricator.wikimedia.org/T390914) (owner: Andrew Bogott)
[20:41:11] <bvibber>	 ...what?
[20:41:49] <wikibugs>	 (CR) Bvibber: "I have no idea what's wrong. Is there documentation I have failed to find and follow?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:42:45] <bvibber>	 does anybody know what's wrong and how to fix it?
[20:43:03] <wikibugs>	 (Merged) jenkins-bot: Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1142671 (https://phabricator.wikimedia.org/T393286) (owner: Jdlrobson)
[20:43:04] <wikibugs>	 (Merged) jenkins-bot: Clear floats to avoid tall charts [extensions/Chart] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1142698 (https://phabricator.wikimedia.org/T393286) (owner: Bvibber)
[20:43:50] <Amir1>	 bvibber: run "composer manage-dblist update"
[20:44:19] <bvibber>	 thx
[20:44:24] <bvibber>	 is this documented somewhere?
[20:44:44] <wikibugs>	 (PS9) Bvibber: Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517)
[20:45:53] <wikibugs>	 (CR) Bvibber: "was told to run composer manage-dblist update, hopefully that does it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:46:15] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143124|Remove whatlinkshere hook (T393513)]], [[gerrit:1143171|Improve circuit breaking error message (T360930)]], [[gerrit:1143172|Remove hard-coded timestamps in SpecialGlobalContributionsTest (T393531)]] (duration: 41m 41s)
[20:46:20] <stashbot>	 T393513: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds." - https://phabricator.wikimedia.org/T393513
[20:46:20] <stashbot>	 T360930: Section-wide circuit breaking - https://phabricator.wikimedia.org/T360930
[20:46:21] <stashbot>	 T393531: SpecialGlobalContributionsTest::testExecuteTarget with data set "Valid IP" failure in other extensions' build - https://phabricator.wikimedia.org/T393531
[20:46:45] <wikibugs>	 (CR) Ladsgroup: [C:+2] Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:46:58] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[20:47:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:47:22] <bvibber>	 \o/
[20:47:34] <wikibugs>	 (Merged) jenkins-bot: Charts phase 1 deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/1142701 (https://phabricator.wikimedia.org/T393517) (owner: Bvibber)
[20:48:01] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1142701|Charts phase 1 deployment (T393517)]], [[gerrit:1142671|Clear floats to avoid tall charts (T393286)]], [[gerrit:1142698|Clear floats to avoid tall charts (T393286)]]
[20:48:06] <stashbot>	 T393517: Enable Charts for Phase 1 wikis - https://phabricator.wikimedia.org/T393517
[20:48:06] <stashbot>	 T393286: Chart without any explicit height is very high on itwiki - https://phabricator.wikimedia.org/T393286
[20:49:10] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: Allow setting env variables in mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1142794 (https://phabricator.wikimedia.org/T380925) (owner: RLazarus)
[20:49:42] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1124.eqiad.wmnet with OS bullseye
[20:49:56] <wikibugs>	 ops-esams, SRE, DC-Ops, Traffic: lvs3009 NIC HW issue (Broadcom, eno12399np0) - https://phabricator.wikimedia.org/T393616#10802443 (ssingh) The host has been depooled so you can reboot or shut it down without checking with us. Thanks for the quick response Rob!
[20:50:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1124.eqiad.wmnet with OS bullseye
[20:52:34] <wikibugs>	 (PS1) Bvibber: Stub README.md for dblists/ dir to remind people to use the tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1143188
[20:54:53] <Amir1>	 bvibber: it's on the mw-debug
[20:55:05] <bvibber>	 excellent
[20:55:17] <logmsgbot>	 !log ladsgroup@deploy1003 jdlrobson, bvibber, ladsgroup: Backport for [[gerrit:1142701|Charts phase 1 deployment (T393517)]], [[gerrit:1142671|Clear floats to avoid tall charts (T393286)]], [[gerrit:1142698|Clear floats to avoid tall charts (T393286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:55:22] <bvibber>	 testing
[20:55:28] <stashbot>	 T393517: Enable Charts for Phase 1 wikis - https://phabricator.wikimedia.org/T393517
[20:55:29] <stashbot>	 T393286: Chart without any explicit height is very high on itwiki - https://phabricator.wikimedia.org/T393286
[20:56:30] <bvibber>	 Amir1: we're good to go <3
[20:56:43] <logmsgbot>	 !log ladsgroup@deploy1003 jdlrobson, bvibber, ladsgroup: Continuing with sync
[20:56:48] <Amir1>	 let's go then
[20:57:11] <bvibber>	 i guess it's time to  *flips down sunglasses*  deploy the patches
[20:59:47] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_magru
[21:00:06] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T2100)
[21:01:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1124.eqiad.wmnet with reason: host reimage
[21:01:25] <wikibugs>	 (PS2) Bvibber: Stub README.md for dblists/ dir to remind people to use the tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1143188 (https://phabricator.wikimedia.org/T393648)
[21:04:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1124.eqiad.wmnet with reason: host reimage
[21:05:18] <wikibugs>	 (CR) Ladsgroup: [C:+2] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: Sbisson)
[21:05:23] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1142701|Charts phase 1 deployment (T393517)]], [[gerrit:1142671|Clear floats to avoid tall charts (T393286)]], [[gerrit:1142698|Clear floats to avoid tall charts (T393286)]] (duration: 17m 21s)
[21:05:26] <stashbot>	 T393517: Enable Charts for Phase 1 wikis - https://phabricator.wikimedia.org/T393517
[21:05:26] <stashbot>	 T393286: Chart without any explicit height is very high on itwiki - https://phabricator.wikimedia.org/T393286
[21:05:27] <wikibugs>	 (CR) Ladsgroup: [C:+2] ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: Sbisson)
[21:05:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_magru
[21:05:42] <Amir1>	 bvibber: deployed
[21:05:47] <stephanebisson>	 Amir1 we can do my patches tomorrow as we're running out of time. And we can only do wmf.28 at that point.
[21:05:55] <bvibber>	 Amir1: woohoo! thanks
[21:06:03] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_esams
[21:06:06] <Amir1>	 stephanebisson: nah, It's straightforward IMO
[21:06:21] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_esams
[21:06:33] <Amir1>	 it can go over a bit. I don't think anything is happening with the next window
[21:06:56] <Amir1>	 deployment windows are a social construct anyway made to sell more clocks or something like that
[21:08:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: Sbisson)
[21:08:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: Sbisson)
[21:09:01] <stephanebisson>	 Alright, go for it. There is nothing to test directly. It disables the possibility of calling `cxpublishedtransaltion` without to/from parameters. We don't believe any API caller is doing that but if they do, it's going to trigger an API validation error instead of a slow query, and we are fine with that.
[21:09:43] <Amir1>	 yeah
[21:11:33] <stephanebisson>	 I have to run but I'll search the logs in a few hours to see if there is anything related.
[21:11:59] <stephanebisson>	 Amir1 thanks for deploying and sorry for the situation.
[21:13:17] <Amir1>	 thanks. No worries
[21:16:34] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10802579 (VRiley-WMF) thanos-fe1005 A7  U26 CableID 4888 Port 24  thanos-fe1006 B4  U8 CableID 4778 Port35  thanos-fe1007 D4 U26 CableID 20220118
[21:16:55] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10802581 (VRiley-WMF)
[21:18:52] <wikibugs>	 (Merged) jenkins-bot: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1143138 (https://phabricator.wikimedia.org/T392839) (owner: Sbisson)
[21:18:53] <wikibugs>	 (Merged) jenkins-bot: ApiQueryPublishedTranslations: Make `from` and `to` mandatory [extensions/ContentTranslation] (wmf/1.44.0-wmf.28) - https://gerrit.wikimedia.org/r/1143142 (https://phabricator.wikimedia.org/T392839) (owner: Sbisson)
[21:19:20] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1143138|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]], [[gerrit:1143142|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]]
[21:19:24] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[21:21:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1124.eqiad.wmnet with OS bullseye
[21:26:54] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, sbisson: Backport for [[gerrit:1143138|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]], [[gerrit:1143142|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:26:57] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[21:27:02] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, sbisson: Continuing with sync
[21:28:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:00] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 759, active_shards: 1784, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[21:29:00] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:29:52] <wikibugs>	 (PS2) Muehlenhoff: Make cumin1003 a Cumin node [puppet] - https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380)
[21:33:33] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143138|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]], [[gerrit:1143142|ApiQueryPublishedTranslations: Make `from` and `to` mandatory (T392839)]] (duration: 14m 12s)
[21:33:36] <stashbot>	 T392839: CX tables miss a lot of important indexes causing partial outages - https://phabricator.wikimedia.org/T392839
[21:44:02] <icinga-wm>	 PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[21:47:00] <icinga-wm>	 RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[21:49:54] <wikibugs>	 (PS1) Ryan Kemper: wdqs-main: allow query.wikidata.org to hit main [puppet] - https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134)
[21:51:10] <wikibugs>	 (PS2) Ryan Kemper: wdqs-main: allow query.wikidata.org to hit main [puppet] - https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134)
[21:51:45] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[21:56:27] <wikibugs>	 (CR) Bking: [C:+1] wdqs-main: allow query.wikidata.org to hit main [puppet] - https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250507T2200)
[22:10:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802714 (Aklapper) > Why should I submit patches when others can commit directly? Provide one specific example where someone "committed directly" instead of going via a patch. One. Thanks.
[22:11:43] <wikibugs>	 (CR) Btullis: [C:+1] wdqs-main: allow query.wikidata.org to hit main [puppet] - https://gerrit.wikimedia.org/r/1143194 (https://phabricator.wikimedia.org/T388134) (owner: Ryan Kemper)
[22:18:07] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of failed login attempts (unknown device and IP) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[22:36:55] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service puppetmaster1001:8140 has failed probes (http_puppetmaster1001_eqiad_wmnet_https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:44:04] <icinga-wm>	 PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[22:46:00] <icinga-wm>	 RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[22:58:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802784 (Justman10000) Everyone? Why do a patch when one can commit directly? And even then, the same question still remains... Only that I would then have to create a patch faster than a...
[23:09:09] <jinxer-wm>	 FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1119-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:13:46] <wikibugs>	 ops-eqiad, DC-Ops: Power Supply - Status - issue on elastic1062:9290 - https://phabricator.wikimedia.org/T393657 (phaultfinder) NEW
[23:30:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802849 (Aklapper) Could you simply answer my question and link to one specific example?
[23:38:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143204
[23:38:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143204 (owner: TrainBranchBot)
[23:39:00] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:51:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143204 (owner: TrainBranchBot)
[23:55:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[23:57:56] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell for Justman10000 - https://phabricator.wikimedia.org/T393587#10802874 (Justman10000) >>! In T393587#10802849, @Aklapper wrote: > Could you simply answer my question and link to one specific example?  From my answer, I think that one does...