[00:01:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P52441 and previous config saved to /var/cache/conftool/dbconfig/20230912-000148-arnaudb.json
[00:12:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[00:16:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52442 and previous config saved to /var/cache/conftool/dbconfig/20230912-001654-arnaudb.json
[00:16:56] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[00:16:58] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:17:09] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[00:17:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52443 and previous config saved to /var/cache/conftool/dbconfig/20230912-001715-arnaudb.json
[00:38:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956023
[00:38:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956023 (owner: TrainBranchBot)
[00:53:06] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956023 (owner: TrainBranchBot)
[01:03:52] <icinga-wm>	 PROBLEM - dump of s5 in eqiad on backupmon1001 is CRITICAL: Last dump for s5 at eqiad (db1216) taken on 2023-09-12 00:00:03 is 61 GiB, but the previous one was 73 GiB, a change of -17.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:05:49] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346112 (phaultfinder)
[01:06:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[01:06:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host titan1001.eqiad.wmnet with OS bookworm
[01:19:26] <icinga-wm>	 PROBLEM - dump of s5 in codfw on backupmon1001 is CRITICAL: Last dump for s5 at codfw (db2101) taken on 2023-09-12 00:00:05 is 61 GiB, but the previous one was 73 GiB, a change of -17.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:34:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on titan1001.eqiad.wmnet with reason: host reimage
[01:38:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan1001.eqiad.wmnet with reason: host reimage
[01:54:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:55:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:55:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan1001.eqiad.wmnet with OS bookworm
[01:56:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host titan1001.eqiad.wmnet with OS bookworm completed: - titan10...
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0200)
[02:06:53] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.26 [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956024 (https://phabricator.wikimedia.org/T343728)
[02:06:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host titan1002.eqiad.wmnet with OS bookworm
[02:06:59] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.26 [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956024 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[02:07:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host titan1002.eqiad.wmnet with OS bookworm
[02:07:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:50] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.26 [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956024 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[02:28:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on titan1002.eqiad.wmnet with reason: host reimage
[02:32:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan1002.eqiad.wmnet with reason: host reimage
[02:37:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:49:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:49:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan1002.eqiad.wmnet with OS bookworm
[02:49:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host titan1002.eqiad.wmnet with OS bookworm completed: - titan10...
[02:50:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (Jhancock.wm) Open→Resolved
[03:00:06] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0300)
[03:01:06] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:37] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/956521 (https://phabricator.wikimedia.org/T343728)
[03:01:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/956521 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[03:02:20] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/956521 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[03:02:50] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.26  refs T343728
[03:02:53] <stashbot>	 T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[03:56:08] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.26  refs T343728 (duration: 53m 18s)
[03:56:12] <stashbot>	 T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[03:58:41] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.23, 1.41.0-wmf.24 (duration: 02m 30s)
[04:10:01] <wikibugs>	 (CR) TTO: Enable PageNotice on enwiktionary beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[04:12:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[04:14:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52444 and previous config saved to /var/cache/conftool/dbconfig/20230912-041425-arnaudb.json
[04:14:29] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:29:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P52445 and previous config saved to /var/cache/conftool/dbconfig/20230912-042931-arnaudb.json
[04:44:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P52446 and previous config saved to /var/cache/conftool/dbconfig/20230912-044437-arnaudb.json
[04:56:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[04:59:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52447 and previous config saved to /var/cache/conftool/dbconfig/20230912-045944-arnaudb.json
[04:59:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[04:59:47] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:00:00] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:00:01] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:00:27] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:00:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52448 and previous config saved to /var/cache/conftool/dbconfig/20230912-050033-arnaudb.json
[05:00:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[05:04:58] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[05:05:24] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[05:06:00] <moritzm>	 ^ expected, I'm reimaging the node after a mainboard replacement
[05:06:10] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[05:08:15] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff)
[05:08:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2014.codfw.wmnet with OS bullseye
[05:09:03] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye
[05:11:14] <moritzm>	 !log installing aom security updates
[05:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P52449 and previous config saved to /var/cache/conftool/dbconfig/20230912-051725-root.json
[05:17:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1119 with Debian Bookworm in s1 with just 10% T339185', diff saved to https://phabricator.wikimedia.org/P52450 and previous config saved to /var/cache/conftool/dbconfig/20230912-051753-marostegui.json
[05:18:00] <stashbot>	 T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185
[05:18:28] <wikibugs>	 (PS1) Marostegui: db2158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/956524
[05:19:14] <wikibugs>	 (CR) Marostegui: [C: +2] db2158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/956524 (owner: Marostegui)
[05:27:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2014.codfw.wmnet with reason: host reimage
[05:32:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2014.codfw.wmnet with reason: host reimage
[05:38:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343198)', diff saved to https://phabricator.wikimedia.org/P52452 and previous config saved to /var/cache/conftool/dbconfig/20230912-053813-arnaudb.json
[05:38:17] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:50:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2014.codfw.wmnet with OS bullseye
[05:50:33] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye completed: - ganeti2014 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Di...
[05:53:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P52453 and previous config saved to /var/cache/conftool/dbconfig/20230912-055319-arnaudb.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0600).
[06:08:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P52454 and previous config saved to /var/cache/conftool/dbconfig/20230912-060825-arnaudb.json
[06:16:01] <wikibugs>	 (CR) Ayounsi: [C: +2] Routinator: tmpfs, bump the maximum number of inodes [puppet] - https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) (owner: Ayounsi)
[06:23:06] <wikibugs>	 (CR) Ayounsi: [C: +2] Add esams sandbox network prefixes [puppet] - https://gerrit.wikimedia.org/r/956454 (https://phabricator.wikimedia.org/T307021) (owner: Ayounsi)
[06:23:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343198)', diff saved to https://phabricator.wikimedia.org/P52455 and previous config saved to /var/cache/conftool/dbconfig/20230912-062332-arnaudb.json
[06:23:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[06:23:36] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[06:23:47] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[06:23:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T343198)', diff saved to https://phabricator.wikimedia.org/P52456 and previous config saved to /var/cache/conftool/dbconfig/20230912-062353-arnaudb.json
[06:29:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:26] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:33:36] <wikibugs>	 (PS1) Marostegui: Revert "db2158: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/956065
[06:34:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:34:42] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:36:29] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2158: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/956065 (owner: Marostegui)
[06:36:35] <wikibugs>	 (PS1) Elukey: services: remove mediawiki.revision-score from eventstreams [deployment-charts] - https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0700).
[07:00:06] <jouncebot>	 tto and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:14] * urbanecm waves
[07:00:16] <taavi>	 morning
[07:00:20] <tto>	 Hello!
[07:00:24] <urbanecm>	 hi tto!
[07:00:40] <tto>	 Sadly my patch has a -1 from Krinkle
[07:00:43] <urbanecm>	 taavi: hi, glad you're here. can you please have a look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668156 and add a 3rd opinion please?
[07:00:48] <tto>	 I am not sure that he actually looked at it?
[07:00:57] <taavi>	 the patch number in [[Deployments]] is wrong btw
[07:01:01] <taavi>	 urbanecm: looking
[07:01:02] <urbanecm>	 Timo gave a -1 and suggested to deploy the extension to beta first, which is...what this extension does :)
[07:01:09] <tto>	 Oh I see you have responded there
[07:01:24] <urbanecm>	 thanks taa.vi.
[07:01:52] <tto>	 And thank you urbanecm for fixing the situation regarding the Maintainers list on mw:
[07:01:57] <urbanecm>	 np
[07:03:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[07:03:30] <_joe_>	 uh sorry I missed the ping
[07:03:34] <taavi>	 urbanecm: usually the extension-list addition is done in a separate commit. but I can't immediately think of a reason why doing it in the same commit would not work
[07:04:07] <urbanecm>	 taavi: i think that doesn't matter for beta anyway, since its syncs happen whenever CI feels like it (unless deployer intervenes, which is generally not the case)
[07:04:13] <_joe_>	 should I start with my patches?
[07:04:26] <urbanecm>	 _joe_: feel free to and please ping me once done :)
[07:05:07] <_joe_>	 urbanecm: just a piece of advice: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/943036/ doesn't really need to be synched separately
[07:05:25] <_joe_>	 should I just merge it then run scap backport on the others?
[07:05:39] <urbanecm>	 i recommend passing multiple change IDs to `scap backport`
[07:05:49] <_joe_>	 oh TIL that works?
[07:05:51] <urbanecm>	 ie. `scap backport 943036 942675`
[07:05:53] <urbanecm>	 yup, it does.
[07:06:03] <_joe_>	 ack then, thanks
[07:06:18] <urbanecm>	 merging and then running scap backport on the next patch works too, but you'll get a warning about an unexpected patch.
[07:06:49] <taavi>	 urbanecm: right. only risk I see is that a revert of the technically-production-facing parts would take longer because you have i18n changes
[07:08:06] <urbanecm>	 taavi: we can split that if you feel like that. i asked because of the -1, which is probably caused by the patch changing production files without also having production impact (well, that's the goal at least), but it doesn't seem like it actually suggests some change/improvement.
[07:08:25] <urbanecm>	 (we can merge anyway, but i feel it would be helpful to get a second opinion before doing so)
[07:08:30] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:16] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[07:09:30] <taavi>	 urbanecm: tbh I'm fine with both approaches, and the patch looks fine to me
[07:09:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[07:09:46] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Backport for [[gerrit:943036|update noc README]], [[gerrit:951046|Use ClusterConfig]]
[07:09:56] <urbanecm>	 taavi: thanks. then let's leave it as-is i guess :). would you mind formally saying so on the patch please?
[07:11:22] <logmsgbot>	 !log oblivian@deploy1002 oblivian: Backport for [[gerrit:943036|update noc README]], [[gerrit:951046|Use ClusterConfig]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:12:02] <taavi>	 yes
[07:12:04] <urbanecm>	 ty
[07:12:17] <_joe_>	 ook
[07:13:31] <urbanecm>	 tto: fyi, with a second +1, i'm ok with deploying your beta-patch today. let's wait for joe to be done with his deployment now and then i think we can proceed.
[07:13:55] <tto>	 Thanks a bunch urbanecm!
[07:14:12] <icinga-wm>	 RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:58] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[07:17:27] <logmsgbot>	 !log oblivian@deploy1002 oblivian: Continuing with sync
[07:20:55] <urbanecm>	 _joe_: if it would be possible to "squeeze in" a beta patch, i'd appreciate that. just realized you have quite a few patches and we have only ~40 minutes left before train starts.
[07:21:15] <_joe_>	 urbanecm: I think so yes
[07:21:35] <_joe_>	 like my next patch is very fast and not risky (adds a function, still unused)
[07:21:40] <_joe_>	 we can merge them together maybe?
[07:21:44] <urbanecm>	 sounds good to me
[07:22:21] <urbanecm>	 (tto's patch will (re)build i18n, so it would be slower, but otherwise, why not)
[07:23:29] <_joe_>	 ugh so much slower
[07:23:33] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:943036|update noc README]], [[gerrit:951046|Use ClusterConfig]] (duration: 13m 46s)
[07:23:46] <_joe_>	 ok let's bite the bullet now
[07:23:51] <_joe_>	 jouncebot: nowandnext
[07:23:51] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0700)
[07:23:51] <jouncebot>	 In 0 hour(s) and 36 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0800)
[07:24:14] <_joe_>	 urbanecm: what's the change id?
[07:24:44] <urbanecm>	 _joe_: 668156
[07:24:52] <_joe_>	 ok merging now
[07:24:57] <_joe_>	 any particular thing to test?
[07:25:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951047 (owner: Giuseppe Lavagetto)
[07:25:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:25:12] <urbanecm>	 nope, it only impacts beta.
[07:25:14] <tto>	 I can test on enwiktionary beta if you tell me when?
[07:25:30] <urbanecm>	 tto: it will be deployed there in ~30 minutes after the patch merges.
[07:25:32] <_joe_>	 tto: no need, if beta breaks it's not great but that's what it's for
[07:25:35] <_joe_>	 :)
[07:25:38] <urbanecm>	 and that.
[07:25:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org
[07:25:50] <tto>	 Sure thing, thanks again for this!
[07:25:57] <_joe_>	 we break beta so we hopefully don't break production :)
[07:26:04] <_joe_>	 (emphasis on the hopefulness)
[07:26:31] <wikibugs>	 (PS7) Giuseppe Lavagetto: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:26:54] <wikibugs>	 (Merged) jenkins-bot: ClusterConfig: also allow to return hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/951047 (owner: Giuseppe Lavagetto)
[07:26:56] <wikibugs>	 (CR) Urbanecm: [C: +2] "escalating to a +2 to record that it was my decision to go ahead despite a -1, based on Taavi's review and the fact that the recommendatio" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:27:10] <wikibugs>	 (CR) TrainBranchBot: "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:27:33] <wikibugs>	 (Merged) jenkins-bot: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:27:47] <_joe_>	 urbanecm: thanks for recording the decision in a comment, but I can handle timo in case of need :)
[07:28:00] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Backport for [[gerrit:951047|ClusterConfig: also allow to return hostname]], [[gerrit:668156|Enable PageNotice on enwiktionary beta (T61245)]]
[07:28:03] <stashbot>	 T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245
[07:28:13] <_joe_>	 07:28:01 Running rebuildLocalisationCache.php as www-data
[07:28:32] <_joe_>	 well I guess the next patch might run shortly into the train
[07:29:04] <urbanecm>	 afaik i18n rebuild's faster than it used to be, esp. when it's only few messages like in this case, but we'll see.
[07:29:20] <urbanecm>	 and thank you for the offer!
[07:29:32] <tto>	 Do I need to stick around or are we all good?
[07:29:36] <urbanecm>	 tto: from your end of things, this is now done. beta'll pick up the new configuration in about half an hour.
[07:30:01] <tto>	 Thanks again urbanecm and _joe_ and taavi! Everyone here today has been so helpful, it's been a pleasure to interact.
[07:30:13] <urbanecm>	 glad we could help.
[07:30:32] <wikibugs>	 (CR) DCausse: [C: +1] rdf-streaming-updater-k8s: Add egress rules to values (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[07:30:33] <icinga-wm>	 PROBLEM - Host mw2444 is DOWN: PING CRITICAL - Packet loss = 100%
[07:30:47] <_joe_>	 siigh mw2444
[07:30:54] <_joe_>	 I *hope* it's depooled already
[07:31:27] <_joe_>	 yeah, thankfully it is
[07:35:35] <wikibugs>	 (CR) CI reject: [V: -1] Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:36:30] <wikibugs>	 (PS16) Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798)
[07:36:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netmon1003.wikimedia.org
[07:37:35] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:39] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:40:43] <_joe_>	 rebuilding the l10n cache is indeed horrible
[07:40:48] <urbanecm>	 :-(
[07:41:01] <_joe_>	 yeah that's one of the things we need to solve long-term
[07:41:03] <urbanecm>	 do we know where we're at with that?
[07:41:16] <_joe_>	 another being how slow it is to push large images to our registry
[07:41:20] <urbanecm>	 (with the rebuild, not with long term solution)
[07:41:30] <_joe_>	 ah yes, it's almost done
[07:41:49] <urbanecm>	 great.
[07:41:49] <_joe_>	 we're now pulling this new large image to the k8s nodes
[07:42:39] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:43:26] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1153.eqiad.wmnet
[07:45:04] <_joe_>	 urbanecm: but every step is terribly slow
[07:45:08] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey)
[07:45:09] <urbanecm>	 :(
[07:45:09] <_joe_>	 even the sync-masters is
[07:45:14] <_joe_>	 it's just too much data
[07:45:31] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1153.eqiad.wmnet
[07:45:35] <_joe_>	 I would question if we really need to rebuild all l10n every time the extension list changes
[07:45:53] <_joe_>	 I kinda suspect the whole process can be optimized hard and there's a lot of low-hanging fruit there
[07:46:24] <urbanecm>	 well, extension-list is the list of extensions we build i18n for. when adding a new one, one needs to build i18n for at least the extension that changed.
[07:49:16] <urbanecm>	 but yeah, i agree that there's a lot of things to improve about i18n building.
[07:49:20] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1154.eqiad.wmnet
[07:50:44] <_joe_>	 urbanecm: I should've told you no :/
[07:51:01] <_joe_>	 syncing to the fleet is gonna end up eating all the remaining time
[07:51:21] <_joe_>	 syncing to mwdebug is taking forever
[07:51:24] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1154.eqiad.wmnet
[07:51:39] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1155.eqiad.wmnet
[07:52:06] <urbanecm>	 :-( apologies for that.
[07:53:17] <wikibugs>	 (03PS4) 10Ayounsi: Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747)
[07:53:52] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1155.eqiad.wmnet
[07:54:05] <_joe_>	 urbanecm: not your fault
[07:54:21] <_joe_>	 but it's taking 15 minutes to sync this patch to the mwdebugs
[07:55:45] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:10] <logmsgbot>	 !log oblivian@deploy1002 tto and oblivian: Backport for [[gerrit:951047|ClusterConfig: also allow to return hostname]], [[gerrit:668156|Enable PageNotice on enwiktionary beta (T61245)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:56:17] <stashbot>	 T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245
[07:56:45] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1156.eqiad.wmnet
[07:57:04] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi)
[07:58:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
[07:58:35] <logmsgbot>	 !log oblivian@deploy1002 tto and oblivian: Continuing with sync
[07:58:44] <_joe_>	 ok, main sync starting
[07:58:52] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1156.eqiad.wmnet
[07:59:01] <_joe_>	 who's running the train today?
[08:00:05] <jouncebot>	 jnuche and hashar: How many deployers does it take to do MediaWiki train - Utc-0 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0800).
[08:00:22] <jnuche>	 morning
[08:00:31] <hashar>	 :-]
[08:00:35] <jnuche>	 _joe_: I'll be running the train today with andre
[08:00:59] <jnuche>	 take your time to finish the backports, there's no hurry :)
[08:01:09] <_joe_>	 jnuche: thanks
[08:01:25] * andre o/
[08:01:59] <_joe_>	 andre: running the train?
[08:02:05] <hashar>	 yes
[08:02:11] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi)
[08:02:42] <hashar>	 well I guess pairing on it is a better description :]
[08:03:32] <_joe_>	 jnuche: mostly it's out of my control; this patch will take the same time as the train to deploy
[08:03:41] <_joe_>	 we deploy mediawiki in all the wrong ways.
[08:03:59] <wikibugs>	 (03Merged) 10jenkins-bot: Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi)
[08:04:23] <_joe_>	 (and this is not to blame anyone, it's the result of sedimentation of tech debt and stuff)
[08:06:09] <_joe_>	 basically the problem is a 40-line patch causing this: rsync transfer: average 5,705,916,078 bytes/host
[08:06:20] <_joe_>	 6 GB per host worth of transfer
[08:06:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
[08:07:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Pass ensure->present to the nftables class if selected [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff)
[08:08:10] <hashar>	 I guess it is because the first backport on Tuesday ends up syncing the whole new wmf mw version
[08:09:16] <_joe_>	 hashar: this isn't the first backport
[08:09:23] <_joe_>	 it's that this change rebuilds l10
[08:09:27] <_joe_>	 *l10n
[08:09:41] <_joe_>	 and l10nupdate is "rebuild everything all the time"
[08:09:46] <hashar>	 ah yes
[08:09:59] <hashar>	 l10nupdate is indeed a bottleneck
[08:09:59] <_joe_>	 so 18 LOC => 6 GB
[08:10:12] <_joe_>	 yeah I suspect there's some quick wins there
[08:10:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: site: apply role to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999)
[08:10:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[08:12:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[08:12:22] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[08:13:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43208/console" [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[08:13:24] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:951047|ClusterConfig: also allow to return hostname]], [[gerrit:668156|Enable PageNotice on enwiktionary beta (T61245)]] (duration: 45m 23s)
[08:13:25] <_joe_>	 jnuche: almost done, can I squeeze in the next patch as well?
[08:13:27] <stashbot>	 T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245
[08:13:45] <jnuche>	 _joe_: yeah, np, go ahead
[08:13:46] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060)
[08:13:48] <_joe_>	 thanks 
[08:13:59] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048
[08:14:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 (owner: 10Giuseppe Lavagetto)
[08:14:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43209/console" [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[08:14:56] <wikibugs>	 (03Merged) 10jenkins-bot: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 (owner: 10Giuseppe Lavagetto)
[08:15:23] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Backport for [[gerrit:951048|Replace calls to wfHostname with clusterconfig ones]]
[08:15:27] <wikibugs>	 (03PS1) 10Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762)
[08:17:00] <logmsgbot>	 !log oblivian@deploy1002 oblivian: Backport for [[gerrit:951048|Replace calls to wfHostname with clusterconfig ones]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:18:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] site: apply role to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[08:18:48] <logmsgbot>	 !log oblivian@deploy1002 oblivian: Continuing with sync
[08:20:33] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) ganeti2014 was reimaged after the mainboard replacement and re-added to the cluster.
[08:22:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[08:24:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060) (owner: 10Jelto)
[08:24:39] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:951048|Replace calls to wfHostname with clusterconfig ones]] (duration: 09m 16s)
[08:25:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:25:41] <_joe_>	 jnuche: all yours, sorry :/
[08:26:12] <jnuche>	 _joe_: perfect, thanks :)
[08:26:19] <wikibugs>	 (03PS2) 10Jelto: gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060)
[08:27:32] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060) (owner: 10Jelto)
[08:29:12] <icinga-wm>	 PROBLEM - Check systemd state on titan2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:57] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956787 (https://phabricator.wikimedia.org/T343728)
[08:29:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956787 (https://phabricator.wikimedia.org/T343728) (owner: 10TrainBranchBot)
[08:30:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:30:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: add titan hosts to thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999)
[08:31:02] <wikibugs>	 (03CR) 10Jcrespo: "nitpicks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb)
[08:32:26] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956787 (https://phabricator.wikimedia.org/T343728) (owner: 10TrainBranchBot)
[08:32:52] <wikibugs>	 (03PS6) 10Arnaudb: admin: Change defaults on Arnaud's homedir [puppet] - 10https://gerrit.wikimedia.org/r/956025
[08:34:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43210/console" [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[08:36:15] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] admin: Change defaults on Arnaud's homedir [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb)
[08:38:10] <icinga-wm>	 PROBLEM - Check systemd state on titan1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:42] <godog>	 that's me ^ filing task for this failure mode
[08:38:43] <moritzm>	 !log rebalance Ganeti cluster in codfw/C following node replacement
[08:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:10] <wikibugs>	 (03CR) 10Vgutierrez: Release 9.2.1-1wm2 (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh)
[08:39:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm from a cookbook PoV, will need leave the opensearch specifics to the experts" [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[08:39:39] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.26  refs T343728
[08:39:41] <stashbot>	 T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[08:40:02] <wikibugs>	 (03PS1) 10Brouberol: Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762)
[08:41:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52459 and previous config saved to /var/cache/conftool/dbconfig/20230912-084059-arnaudb.json
[08:41:04] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[08:41:57] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb)
[08:43:14] <icinga-wm>	 RECOVERY - Check systemd state on titan2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:54] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] admin: Change defaults on Arnaud's homedir (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb)
[08:43:56] <icinga-wm>	 RECOVERY - Check systemd state on titan1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:00] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1028.eqiad.wmnet
[08:45:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Pass ensure->present to the nftables class if selected [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff)
[08:48:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[08:49:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43212/console" [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[08:49:20] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[08:49:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[08:49:54] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[08:50:08] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[08:50:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[08:50:33] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[08:50:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[08:50:51] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[08:51:03] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[08:51:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] envoyproxy: tox.ini: whitelist_externals -> allowlist_externals [puppet] - 10https://gerrit.wikimedia.org/r/955876 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar)
[08:51:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[08:51:27] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[08:51:28] <claime>	 !log mw-api-ext, mw-web: Raise total replicas to 14 - T341780
[08:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:31] <stashbot>	 T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780
[08:52:14] <wikibugs>	 (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid support for Kartographer on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871)
[08:52:39] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[08:53:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[08:53:34] <jynus>	 jayme: ok to merge ?
[08:53:57] <jayme>	 hm,...I've an open puppet merge as well :)
[08:54:08] <claime>	 huh I was about to merge something too lol
[08:54:15] <claime>	 I'm gonna wait :D
[08:55:36] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey)
[08:56:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P52460 and previous config saved to /var/cache/conftool/dbconfig/20230912-085606-arnaudb.json
[08:57:24] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:58:04] <claime>	 !log Sending 5% of global traffic to mw-on-k8s - T341780
[08:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:07] <stashbot>	 T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780
[08:58:26] <claime>	 !log Running puppet on cp-text P:trafficserver::backend - T341780
[08:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:37] <icinga-wm>	 PROBLEM - Check systemd state on titan1002 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[08:59:51] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] update_version: drop python 3.5, 3.6. Add 3.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955901 (owner: 10Hashar)
[09:00:33] <wikibugs>	 (03Merged) 10jenkins-bot: update_version: drop python 3.5, 3.6. Add 3.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955901 (owner: 10Hashar)
[09:00:35] <wikibugs>	 (03Merged) 10jenkins-bot: update_version: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - 10https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar)
[09:01:41] <wikibugs>	 (03PS2) 10Klausman: profile::k8s::deployment_server: Add config for readability isvc [puppet] - 10https://gerrit.wikimedia.org/r/951460 (https://phabricator.wikimedia.org/T334182)
[09:02:17] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[09:02:39] <icinga-wm>	 RECOVERY - Check systemd state on titan1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the new cookbook! A suggestion and a question inline, none is a blocker." [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[09:05:24] <wikibugs>	 10SRE: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634 (10ayounsi) FYI: * https://github.com/mwclient/mwclient/issues/302 * https://github.com/martin-majlis/Wikipedia-API/issues/63
[09:08:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Don't require dummy 'team' label for multi-owner alerts [alerts] - 10https://gerrit.wikimedia.org/r/956794
[09:11:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P52461 and previous config saved to /var/cache/conftool/dbconfig/20230912-091112-arnaudb.json
[09:11:38] <wikibugs>	 (03PS1) 10Muehlenhoff: nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497)
[09:12:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:12:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:12:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[09:13:29] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "Er, merge whenever you feel like it, I know you've got +2 in here :-D" [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis)
[09:15:09] <wikibugs>	 (03PS2) 10Muehlenhoff: nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497)
[09:15:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance
[09:15:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance
[09:17:01] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: route requests for geo-analytics via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan)
[09:17:23] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10thiemowmde) Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everything fails with 502 or 503. Prominent examp...
[09:18:19] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10RhinosF1) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everyt...
[09:18:39] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[09:20:02] <wikibugs>	 (03PS1) 10Jbond: wmnet: add puppetdb1002 to SRV record for eqiad [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067)
[09:20:05] <wikibugs>	 (03PS1) 10Arnaudb: Revert "admin: Change defaults on Arnaud's homedir" [puppet] - 10https://gerrit.wikimedia.org/r/956806
[09:20:25] <wikibugs>	 (03Abandoned) 10Arnaudb: Revert "admin: Change defaults on Arnaud's homedir" [puppet] - 10https://gerrit.wikimedia.org/r/956806 (owner: 10Arnaudb)
[09:20:33] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Vgutierrez) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, ever...
[09:24:49] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43214/console" [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall)
[09:26:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki
[09:26:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52463 and previous config saved to /var/cache/conftool/dbconfig/20230912-092618-arnaudb.json
[09:26:20] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:26:22] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[09:26:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:26:37] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.pki.restart-reboot (exit_code=99) rolling reboot on A:pki
[09:26:39] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] trafficserver: route requests for geo-analytics via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan)
[09:26:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52464 and previous config saved to /var/cache/conftool/dbconfig/20230912-092639-arnaudb.json
[09:27:25] <wikibugs>	 (03CR) 10Muehlenhoff: wmnet: add puppetdb1002 to SRV record for eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond)
[09:27:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey)
[09:32:02] <hnowlan>	 !log disabled puppet on A:cp
[09:32:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:47] <wikibugs>	 (03PS17) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798)
[09:39:02] <wikibugs>	 (03CR) 10Physikerwelt: [C: 03+1] [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) (owner: 10Jforrester)
[09:41:17] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] Enable Parsoid support for Kartographer on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin)
[09:44:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add titan hosts to thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[09:48:26] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[09:48:39] <wikibugs>	 (03CR) 10Brouberol: [V: 03+2 C: 03+2] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[09:48:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999)
[09:49:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:49:43] <wikibugs>	 (03PS1) 10Muehlenhoff: SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s [cookbooks] - 10https://gerrit.wikimedia.org/r/956801
[09:50:28] <wikibugs>	 (03PS2) 10Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762)
[09:50:44] <wikibugs>	 (03PS3) 10Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762)
[09:51:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: allow external management of blackbox modules [puppet] - 10https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah)
[09:52:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki
[09:52:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors
[09:53:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors
[09:53:10] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Yes 15 seems a bit too short and 30 should work. That said I wonder if we should poll instead to avoid timing issues." [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (owner: 10Muehlenhoff)
[09:54:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:54:21] <wikibugs>	 10SRE-tools, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) doesn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10jbond) p:05Triage→03Medium
[09:55:45] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) doesn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans)
[09:56:42] <wikibugs>	 (03CR) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes (0318 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[09:58:50] <wikibugs>	 (03PS14) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160
[09:59:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors
[09:59:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors
[10:00:07] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1000)
[10:01:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52465 and previous config saved to /var/cache/conftool/dbconfig/20230912-100138-root.json
[10:02:20] <hnowlan>	 !log enabling puppet on A:cp 
[10:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:53] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] prometheus: allow external management of blackbox modules [puppet] - 10https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah)
[10:04:45] <wikibugs>	 (03CR) 10Volans: "Thanks for all the fixes" [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[10:05:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres on old nodes to ensure nothing hits them anyway
[10:05:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres on old nodes to ensure nothing hits them anyway
[10:05:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90305a26-47b2-42a2-abe5-284f8035bf3b) set by jmm@cumin2002 for...
[10:08:29] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: add config to limit routes to a domain, limit aqs2 apis to wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/956826 (https://phabricator.wikimedia.org/T336400)
[10:08:49] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 389 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[10:08:51] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[10:09:06] <volans>	 jbond: is that you? ^^^
[10:09:18] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[10:09:25] <wikibugs>	 (03PS2) 10Jbond: wmnet: add puppetserver1002 to SRV record for eqiad [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067)
[10:09:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[10:09:38] <wikibugs>	 (03CR) 10Jbond: wmnet: add puppetserver1002 to SRV record for eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond)
[10:09:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway
[10:09:59] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2af641c9-48a3-42b7-8c75-56c12506718a) set by jmm@cumin2002 for...
[10:12:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:12:37] <moritzm>	 volans: that's me
[10:13:17] <moritzm>	 !log disabled nginx/puppetdb/postgresql/microservice on puppetdb1002/2002 to ensure nothing hits the old endpoints anymore
[10:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.pki.restart-reboot (exit_code=0) rolling reboot on A:pki
[10:13:31] <volans>	 ack thx
[10:14:59] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 389 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[10:16:23] <wikibugs>	 (03CR) 10Muehlenhoff: "And JFTR, I did a test-cookbook run with the 30s delay and that made the sre.pki.restart-reboot cookbook work fine, I'll merge this change" [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (owner: 10Muehlenhoff)
[10:16:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52466 and previous config saved to /var/cache/conftool/dbconfig/20230912-101643-root.json
[10:16:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet
[10:17:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:17:51] <wikibugs>	 (03PS3) 10Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - 10https://gerrit.wikimedia.org/r/956038 (https://phabricator.wikimedia.org/T288067)
[10:21:19] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[10:21:38] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[10:23:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet
[10:23:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (owner: 10Muehlenhoff)
[10:25:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[10:31:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52467 and previous config saved to /var/cache/conftool/dbconfig/20230912-103148-root.json
[10:32:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[10:34:34] <wikibugs>	 (03PS3) 10Majavah: conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463)
[10:35:24] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah)
[10:36:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: LVS servers using autoconf SLAAC IPv6 addresses - https://phabricator.wikimedia.org/T336505 (10cmooney) I'm gonna close this, I think we can probably deal with it under T102099.
[10:37:43] <logmsgbot>	 !log taavi@cumin1001 conftool action : set/pooled=yes:weight=10; selector: cluster=cloudweb
[10:38:59] <wikibugs>	 10SRE, 10Traffic, 10cloud-services-team, 10Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10taavi) `lang=shell-session taavi@cumin1001 ~ $ confctl select "cluster=(lab|cloud)web" get {"cloudweb1003.wikimedia.org": {"weight": 10, "pooled": "yes"}, "tag...
[10:39:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[10:39:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet
[10:39:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add snapshot101[4-7] to the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis)
[10:39:22] <wikibugs>	 (03PS3) 10Majavah: service: update labweb/cloudweb conftool pool name [puppet] - 10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463)
[10:40:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond)
[10:40:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmnet: add puppetserver1002 to SRV record for eqiad [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond)
[10:45:29] <moritzm>	 !log rebalance Ganeti cluster in eqiad/C following node reboots
[10:45:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52468 and previous config saved to /var/cache/conftool/dbconfig/20230912-104652-root.json
[10:53:12] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web: Raise apc size to 1536 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956830 (https://phabricator.wikimedia.org/T341780)
[10:53:45] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) doesn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans) a:03Volans Yes the issue is that the `set_tries` defined in spicerack doesn't check the function s...
[10:54:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10cmooney) >>! In T344259#9158656, @Eevans wrote: > This (actually) worked.  Forcing the switch port to 100mbit was enough to successful PXE boot and reimage the server.  On...
[10:54:27] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org
[10:55:57] <wikibugs>	 (03CR) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[10:59:08] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43216/console" [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[10:59:11] <wikibugs>	 (03PS18) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798)
[10:59:26] <wikibugs>	 (03CR) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[11:00:43] <wikibugs>	 (03CR) 10Brouberol: "The pcc run was successful https://puppet-compiler.wmflabs.org/output/956785/43216/" [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[11:01:04] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web: Raise apc size to 1536 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956830 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[11:01:57] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: Raise apc size to 1536 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956830 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert)
[11:01:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52470 and previous config saved to /var/cache/conftool/dbconfig/20230912-110157-root.json
[11:02:06] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:02:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:03:02] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:03:11] <wikibugs>	 (03PS1) 10FNegri: [openstack] bridge-utils workaround is bullseye-only [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810)
[11:03:12] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[11:03:57] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[11:04:05] <wikibugs>	 (03PS2) 10FNegri: [openstack] bridge-utils workaround is bullseye-only [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810)
[11:07:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[11:08:28] <wikibugs>	 (03CR) 10FNegri: "Not sure why this was only enabled in codfw and not in eqiad..." [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[11:09:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM cookbook-wise, leaving the opensearch-specific details to your team." [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[11:09:24] <wikibugs>	 (03CR) 10FNegri: "I found more info here: I74c57d49dc7790bee11cd6cb5d7de8c1d9e2b969" [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[11:09:37] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on snapshot1014 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:10:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] [openstack] bridge-utils workaround is bullseye-only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[11:11:57] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:12:10] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol)
[11:16:04] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[11:16:51] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Seems good to me. Will any of these end-points end up being swift-specific, and others thanos-specific in due course?" [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[11:16:53] <wikibugs>	 (03CR) 10Jbond: "This will need a parent change to update get_puppet_ca" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond)
[11:17:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52471 and previous config saved to /var/cache/conftool/dbconfig/20230912-111702-root.json
[11:17:29] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: emit cache-control header for 404s [deployment-charts] - 10https://gerrit.wikimedia.org/r/956833 (https://phabricator.wikimedia.org/T336400)
[11:18:59] <logmsgbot>	 !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudservices1004.wikimedia.org
[11:22:22] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1004: decommission [puppet] - 10https://gerrit.wikimedia.org/r/956834 (https://phabricator.wikimedia.org/T346033)
[11:24:43] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) @aborrero feel free to close this one if it's not being worked on, the status...
[11:30:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:32:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52472 and previous config saved to /var/cache/conftool/dbconfig/20230912-113207-root.json
[11:32:12] <wikibugs>	 (03PS2) 10Filippo Giunchedi: conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999)
[11:32:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: conftool-data: add titan hosts to thanos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[11:33:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[11:35:06] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan1001.eqiad.wmnet
[11:35:13] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan1002.eqiad.wmnet
[11:35:25] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan2001.codfw.wmnet
[11:35:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero) 05Open→03Declined OK, closing for now and hoping some more modern BGP-bas...
[11:35:35] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan2002.codfw.wmnet
[11:35:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:18] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan*
[11:36:36] <godog>	 that did nothing ^ but announced itself anyways
[11:36:43] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:14] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan1001.eqiad.wmnet
[11:37:15] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan1002.eqiad.wmnet
[11:37:16] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan2001.codfw.wmnet
[11:37:17] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan2002.codfw.wmnet
[11:38:25] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on snapshot1016 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:39:08] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1001.eqiad.wmnet,service=thanos-web
[11:40:14] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=titan1002.eqiad.wmnet,service=thanos-web
[11:40:45] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1002.eqiad.wmnet,service=thanos-web
[11:41:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:15] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 7 hosts with reason: Mute initial failures of hadoop-hdfs-datanode.service
[11:41:32] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 7 hosts with reason: Mute initial failures of hadoop-hdfs-datanode.service
[11:42:23] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1002.eqiad.wmnet,service=thanos-web
[11:42:28] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1001.eqiad.wmnet,service=thanos-web
[11:43:39] <godog>	 !log pool titan hosts alongside thanos-fe for thanos-query / thanos-web services - T341999
[11:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:43] <stashbot>	 T341999: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999
[11:45:25] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on snapshot1017 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:47:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52473 and previous config saved to /var/cache/conftool/dbconfig/20230912-114711-root.json
[11:47:31] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on snapshot1015 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:49:15] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:38] <wikibugs>	 (03PS2) 10Muehlenhoff: SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (https://phabricator.wikimedia.org/T346134)
[11:49:59] <wikibugs>	 (03CR) 10Muehlenhoff: SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (https://phabricator.wikimedia.org/T346134) (owner: 10Muehlenhoff)
[11:50:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:38] <wikibugs>	 (03PS1) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836
[11:53:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (https://phabricator.wikimedia.org/T346134) (owner: 10Muehlenhoff)
[11:56:59] <wikibugs>	 (03CR) 10Muehlenhoff: P:idm allow for installation via Debian packages. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede)
[11:57:27] <godog>	 !log pool thanos[12]001 for thanos.w.o - T341999
[11:57:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:30] <stashbot>	 T341999: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999
[11:58:42] <wikibugs>	 (03CR) 10Muehlenhoff: "Could you please also update the Cumin aliases? Likely A:thanos-fe should also include the titan role and not sure if we need a separate al" [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[11:59:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:59:48] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1004.wikimedia.org
[11:59:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1004: decommission [puppet] - 10https://gerrit.wikimedia.org/r/956834 (https://phabricator.wikimedia.org/T346033) (owner: 10Arturo Borrero Gonzalez)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1200)
[12:02:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: cumin: add titan aliases [puppet] - 10https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999)
[12:03:22] <wikibugs>	 (03PS1) 10Elukey: profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620)
[12:03:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add titan hosts to thanos frontends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[12:04:45] <wikibugs>	 (03PS2) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836
[12:04:50] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[12:04:52] <wikibugs>	 (03PS1) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826)
[12:04:54] <wikibugs>	 (03PS2) 10Elukey: profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620)
[12:05:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede)
[12:07:22] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[12:09:48] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001"
[12:11:46] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001"
[12:11:46] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:11:47] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudservices1004.wikimedia.org
[12:12:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[12:12:44] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[12:13:00] <wikibugs>	 (03PS3) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836
[12:14:14] <wikibugs>	 (03PS2) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826)
[12:14:16] <wikibugs>	 (03PS3) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[12:14:31] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:15:01] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:15:04] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "LGTM, phab_deploy_finalize.sh is in the sudo list of the deployment user, so we don't need to add a rule for the new restart command" [puppet] - 10https://gerrit.wikimedia.org/r/956486 (https://phabricator.wikimedia.org/T314460) (owner: 10Brennen Bearnes)
[12:15:15] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:15:43] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:15:59] <wikibugs>	 (PS3) JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826)
[12:16:02] <wikibugs>	 (PS4) JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[12:16:53] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:04] <wikibugs>	 (PS4) JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826)
[12:18:06] <wikibugs>	 (PS5) JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[12:18:34] <jinxer-wm>	 (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:18:36] <taavi>	 i'm getting "Original error: upstream connect error or disconnect/reset before headers. reset reason: overflow " on meta-wiki
[12:19:32] * taavi klaxons
[12:19:52] * volans here
[12:19:55] <taavi>	 hi
[12:20:07] <jinxer-wm>	 (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:20:12] <volans>	 checking
[12:20:26] <marostegui>	 here too and checking
[12:20:29] <marostegui>	 Thanks for paging taavi
[12:20:31] <jelto>	 I also get "Error
[12:20:31] <jelto>	 Our servers are currently under maintenance or experiencing a technical problem" on office wiki.
[12:20:44] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[12:20:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:20:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet are marked down but pooled: testlb6_443: Serve
[12:20:47] <icinga-wm>	 4.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:21:03] <jinxer-wm>	 (ProbeDown) firing: (8) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:21:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are ma
[12:21:17] <icinga-wm>	 n but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:21:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:21:39] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:22:16] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:22:51] <icinga-wm>	 PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - no socket TCP[103.102.166.8] Connection timed out https://wikitech.wikimedia.org/wiki/DNS
[12:23:01] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:23:09] <jinxer-wm>	 (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[12:23:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:23:13] <icinga-wm>	 PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:23:16] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[12:23:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:23:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:23:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:23:58] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[12:24:09] <icinga-wm>	 RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[12:24:10] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[12:24:29] <icinga-wm>	 RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:24:54] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (fgiunchedi) re: the last point, namely cleaning up thanos components off thanos-fe (therefore l...
[12:25:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops, IPv6, User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (cmooney) In the medium term I think we need to carefully consider how this operates, probably as part of a move away from using...
[12:25:07] <jinxer-wm>	 (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:26:03] <jinxer-wm>	 (ProbeDown) firing: (22) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:26:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:26:13] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add static network defs and DHCP config for new codfw subnets [puppet] - https://gerrit.wikimedia.org/r/954896 (https://phabricator.wikimedia.org/T327938) (owner: Cathal Mooney)
[12:27:16] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:27:33] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:28:09] <jinxer-wm>	 (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[12:28:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:28:16] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[12:28:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:28:22] <wikibugs>	 (CR) Abijeet Patro: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[12:28:58] <wikibugs>	 (CR) KartikMistry: Enable MinT translation service on MediaWiki - rollout #4 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[12:29:03] <wikibugs>	 (CR) Abijeet Patro: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[12:30:14] <wikibugs>	 (PS4) Slyngshede: P:idm allow for installation via Debian packages. [puppet] - https://gerrit.wikimedia.org/r/956836
[12:30:44] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[12:30:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[12:30:50] <wikibugs>	 (PS2) Filippo Giunchedi: cumin: add titan aliases [puppet] - https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999)
[12:30:54] <wikibugs>	 (PS1) Filippo Giunchedi: alerting_host: add titan to thanos-query hosts [puppet] - https://gerrit.wikimedia.org/r/956866 (https://phabricator.wikimedia.org/T341999)
[12:30:58] <wikibugs>	 (PS1) Filippo Giunchedi: titan: move pyrra off thanos role [puppet] - https://gerrit.wikimedia.org/r/956867 (https://phabricator.wikimedia.org/T341999)
[12:31:03] <jinxer-wm>	 (ProbeDown) resolved: (22) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:31:14] <wikibugs>	 (PS3) Abijeet Patro: Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445)
[12:31:22] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Homer YAML additions for new row A/B switches in Codfw [homer/public] - https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) (owner: Cathal Mooney)
[12:31:34] <wikibugs>	 (CR) Abijeet Patro: Enable MinT translation service on MediaWiki - rollout #4 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[12:31:35] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:32:07] <volans>	 btw, thanks taavi for the advance notice ;)
[12:33:20] <wikibugs>	 (CR) KartikMistry: [C: +1] Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[12:33:34] <jinxer-wm>	 (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[12:34:53] <wikibugs>	 (PS1) Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497)
[12:35:18] <wikibugs>	 (CR) CI reject: [V: -1] nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:36:51] <wikibugs>	 (PS2) Kevin Bazira: ml-services: increase the recommendation-api-ng memory limit [deployment-charts] - https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890)
[12:40:09] <wikibugs>	 (Merged) jenkins-bot: Homer YAML additions for new row A/B switches in Codfw [homer/public] - https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) (owner: Cathal Mooney)
[12:40:23] <logmsgbot>	 !log brouberol@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[12:41:55] <wikibugs>	 (PS3) Kevin Bazira: ml-services: increase the recommendation-api-ng memory limit [deployment-charts] - https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890)
[12:42:09] <wikibugs>	 (PS2) Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497)
[12:42:33] <wikibugs>	 (CR) CI reject: [V: -1] nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:42:59] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alerting_host: add titan to thanos-query hosts [puppet] - https://gerrit.wikimedia.org/r/956866 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[12:46:29] <wikibugs>	 (PS3) Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497)
[12:46:54] <wikibugs>	 (CR) CI reject: [V: -1] nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:49:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:51:21] <wikibugs>	 (PS4) Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497)
[12:54:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:54:15] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:54:43] <wikibugs>	 (PS2) Jelto: gitlab: Fix conditional end in gitlab.rb template [puppet] - https://gerrit.wikimedia.org/r/956515 (owner: Dduvall)
[12:55:07] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:45] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:31] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43224/console" [puppet] - https://gerrit.wikimedia.org/r/956515 (owner: Dduvall)
[12:56:54] <wikibugs>	 (PS5) JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826)
[12:56:56] <wikibugs>	 (PS6) JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[12:58:36] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43225/console" [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[12:59:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:59:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343198)', diff saved to https://phabricator.wikimedia.org/P52474 and previous config saved to /var/cache/conftool/dbconfig/20230912-125932-arnaudb.json
[12:59:43] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[13:00:06] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1300).
[13:00:06] <jouncebot>	 ihurbain and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:18] <ihurbain>	 o/
[13:00:19] <taavi>	 hello
[13:00:24] * TheresNoTime is here
[13:00:29] * ihurbain is here too
[13:00:48] <TheresNoTime>	 taavi: you're welcome to deploy, I'm mid-email
[13:00:52] <ihurbain>	 note: this is the first time i have stuff in the backport window ever, so I'M SCAAAARED :P
[13:01:09] <ihurbain>	 (and i may ask stupid questions.)
[13:01:19] <taavi>	 TheresNoTime: sure, will ping you when you're clear to deploy your patch
[13:01:31] <taavi>	 ihurbain: sure, no worries! do you have the x-wikimedia-debug browser extension installed?
[13:01:40] <ihurbain>	 taavi: i do
[13:01:42] <wikibugs>	 (PS1) Kamila Součková: benthos/mw_accesslog_metrics: better handler label [puppet] - https://gerrit.wikimedia.org/r/956875
[13:01:43] <TheresNoTime>	 ihurbain: cliché, but there are no stupid questions when it comes to deploying to production ^^
[13:01:50] <ihurbain>	 _fine_ :D
[13:02:19] <taavi>	 great, I'll ping you when your patch can be tested using it. will take a few minutes or so
[13:02:19] <wikibugs>	 (CR) Jelto: [V: +1] "SSL_CERT_DIR is needed only with object storage enabled. What do you think about adding a additional if to the SSL_CERT_DIR?" [puppet] - https://gerrit.wikimedia.org/r/956515 (owner: Dduvall)
[13:02:26] <ihurbain>	 ack!
[13:02:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:02:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) (owner: Isabelle Hurbain-Palatin)
[13:03:22] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/956875 (owner: Kamila Součková)
[13:04:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:04:31] <wikibugs>	 (Merged) jenkins-bot: Enable Parsoid support for Kartographer on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) (owner: Isabelle Hurbain-Palatin)
[13:05:03] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:956792|Enable Parsoid support for Kartographer on enwiki (T342871)]]
[13:05:07] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:06:34] <logmsgbot>	 !log taavi@deploy1002 ihurbain and taavi: Backport for [[gerrit:956792|Enable Parsoid support for Kartographer on enwiki (T342871)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:06:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[13:07:21] <taavi>	 ihurbain: your patch can now be tested. so open the extension pop-up and change the giant on-off switch to 'ON', all other settings should be fine as is
[13:07:27] <ihurbain>	 ok
[13:07:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:09:21] <taavi>	 and let me know when you've verified that your change works properly
[13:09:39] <ihurbain>	 yup, going through a few pages, shouldn't take more than a few minutes
[13:09:55] <taavi>	 sure
[13:10:56] <moritzm>	 !log installing grub2 updates from Bullseye point release
[13:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:10] <TheresNoTime>	 taavi: do you happen to know which grafana graph the images from T345414 are from? (JS content size)
[13:12:11] <stashbot>	 T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[13:12:15] <wikibugs>	 (PS1) Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878
[13:12:51] <taavi>	 TheresNoTime: uhh sorry, I do not
[13:13:04] <wikibugs>	 (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - https://gerrit.wikimedia.org/r/770464 (owner: Hashar)
[13:13:10] <wikibugs>	 (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - https://gerrit.wikimedia.org/r/770464
[13:14:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P52475 and previous config saved to /var/cache/conftool/dbconfig/20230912-131438-arnaudb.json
[13:14:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] cumin: add titan aliases [puppet] - https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[13:14:47] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: add config to limit routes to a domain, limit aqs2 apis to wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/956826 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[13:15:48] <wikibugs>	 (PS2) Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878
[13:15:50] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: add config to limit routes to a domain, limit aqs2 apis to wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/956826 (https://phabricator.wikimedia.org/T336400) (owner: Hnowlan)
[13:16:56] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (MatthewVernon) Yeah, if it's easier to just reimage them (//especially// if that can be done wi...
[13:17:41] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:43] <wikibugs>	 (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - https://gerrit.wikimedia.org/r/770464 (owner: Hashar)
[13:18:56] <ihurbain>	 taavi: we're good and happy
[13:19:06] <logmsgbot>	 !log taavi@deploy1002 ihurbain and taavi: Continuing with sync
[13:19:17] <taavi>	 awesome. now i'm syncing your changes to the entire cluster
[13:19:26] <ihurbain>	 woot!
[13:21:11] <wikibugs>	 (PS3) Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/956878
[13:21:51] <wikibugs>	 (PS1) Samtar: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956809 (https://phabricator.wikimedia.org/T345414)
[13:23:11] <TheresNoTime>	 (^ oops, forgot to cherry pick)
[13:24:29] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (fgiunchedi)
[13:25:23] <wikibugs>	 SRE-swift-storage, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (fgiunchedi) Open→Resolved a:fgiunchedi New hosts are in service, resolving
[13:25:46] <wikibugs>	 (CR) Herron: [C: +1] "thanks, I was thinking we should do the same thing!" [puppet] - https://gerrit.wikimedia.org/r/956867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[13:26:21] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (fgiunchedi)
[13:26:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] titan: move pyrra off thanos role [puppet] - https://gerrit.wikimedia.org/r/956867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[13:26:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:26:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff)
[13:27:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete
[13:27:19] <taavi>	 hmm sync-apaches is taking a while
[13:27:50] * ihurbain gently shakes apaches
[13:28:08] <TheresNoTime>	 things still settling after the outage maybe?
[13:28:20] <taavi>	 ah, new snapshot hosts added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/955931 that probably are somewhat outdated
[13:29:15] <taavi>	 there we go, now it's doing the stuff that's slow even normally (php-fpm-restart)
[13:29:43] <wikibugs>	 (03PS4) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144)
[13:29:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P52476 and previous config saved to /var/cache/conftool/dbconfig/20230912-132944-arnaudb.json
[13:31:08] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:956792|Enable Parsoid support for Kartographer on enwiki (T342871)]] (duration: 26m 05s)
[13:31:11] <stashbot>	 T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871
[13:31:12] <taavi>	 ihurbain: your patch is now live
[13:31:14] <taavi>	 TheresNoTime: your turn
[13:31:20] <ihurbain>	 yaaay! thanks taavi :)
[13:31:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:31:58] <TheresNoTime>	 taavi: are you okay to deploy or..?
[13:32:06] <wikibugs>	 (03CR) 10Gehel: "I know I'm late, but a few comments inline anyway." [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking)
[13:32:28] <taavi>	 oh I somehow thought you were going to self-deploy your patch?
[13:32:47] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[13:32:54] <TheresNoTime>	 taavi: I can do, just not SSH'd in atm etc
[13:33:18] <taavi>	 (I need to disappear into a meeting in a few, so would prefer not to deploy that)
[13:33:40] <TheresNoTime>	 taavi: ack, will do it :)
[13:34:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956809 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar)
[13:35:32] <wikibugs>	 (03Merged) 10jenkins-bot: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[13:35:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[13:36:04] <wikibugs>	 (03Merged) 10jenkins-bot: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956809 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar)
[13:36:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[13:36:35] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:956809|Reduce initial payload of Phonos styles (T345414)]]
[13:36:38] <stashbot>	 T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[13:38:06] <logmsgbot>	 !log samtar@deploy1002 samtar: Backport for [[gerrit:956809|Reduce initial payload of Phonos styles (T345414)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:38:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[13:38:21] * TheresNoTime testing
[13:38:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[13:39:04] <logmsgbot>	 !log samtar@deploy1002 samtar: Continuing with sync
[13:39:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[13:40:03] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[13:42:20] <wikibugs>	 (03CR) 10Gehel: "minor comments inline (yes, I'm late to the party!)" [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[13:42:26] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi)
[13:42:41] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi) 05Resolved→03Open I was a little too hasty here, I forgot we need raid0 on these hosts to be...
[13:42:56] <wikibugs>	 (03PS2) 10Ssingh: Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154)
[13:43:16] <wikibugs>	 (03CR) 10Ssingh: Release 9.2.1-1wm2 (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh)
[13:43:59] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: restore raid1 for thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/956886 (https://phabricator.wikimedia.org/T346143)
[13:44:01] <icinga-wm>	 PROBLEM - puppet last run on testreduce1001 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:44:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: use raid0 for titan [puppet] - 10https://gerrit.wikimedia.org/r/956887 (https://phabricator.wikimedia.org/T341999)
[13:44:03] <wikibugs>	 (03PS1) 10Filippo Giunchedi: conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143)
[13:44:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343198)', diff saved to https://phabricator.wikimedia.org/P52477 and previous config saved to /var/cache/conftool/dbconfig/20230912-134451-arnaudb.json
[13:44:54] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[13:45:35] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:956809|Reduce initial payload of Phonos styles (T345414)]] (duration: 08m 59s)
[13:45:38] <stashbot>	 T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[13:46:33] <TheresNoTime>	 !log UTC afternoon backport window closed
[13:46:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:12] <wikibugs>	 (03PS7) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[13:48:14] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: control-plane components should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826)
[13:48:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[13:49:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use raid0 for titan [puppet] - 10https://gerrit.wikimedia.org/r/956887 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi)
[13:49:14] <wikibugs>	 (03PS2) 10Filippo Giunchedi: install_server: use raid0 for titan [puppet] - 10https://gerrit.wikimedia.org/r/956887 (https://phabricator.wikimedia.org/T341999)
[13:50:17] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: remove references to nsa.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh)
[13:50:33] <sukhe>	 !log disable puppet on A:dns-rec
[13:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:08] <wikibugs>	 (03PS3) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810)
[13:51:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[13:53:05] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:46] <wikibugs>	 (03PS4) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810)
[13:54:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[13:55:38] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: control-plane components should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826)
[13:55:40] <wikibugs>	 (03PS8) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826)
[13:56:11] <wikibugs>	 (03PS5) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810)
[13:56:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:57:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:57:58] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route to wikifeeds via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/956895 (https://phabricator.wikimedia.org/T339119)
[13:58:16] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43228/console" [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[13:58:26] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[14:00:23] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[14:00:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[14:00:53] <wikibugs>	 (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43229/console" [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[14:01:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10Jclark-ctr) a:03Jclark-ctr
[14:02:02] <sukhe>	 !log enable puppet on doh6001 to test nsa removal
[14:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:15] <sukhe>	 !log [correction] enable puppet on dns6001 to test nsa removal
[14:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr)
[14:04:11] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF)
[14:07:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1027.eqiad.wmnet']
[14:07:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:45] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) I worked on some of the items for this week. I let @UOzurumba check on the done elements. Distribut...
[14:09:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1028.eqiad.wmnet']
[14:09:50] <wikibugs>	 (03PS6) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810)
[14:10:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet']
[14:10:04] <icinga-wm>	 RECOVERY - puppet last run on testreduce1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:10:05] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet']
[14:10:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1030.eqiad.wmnet']
[14:10:20] <sukhe>	 !log enable puppet on dns-rec to progressively roll out nsa->ns2 updates
[14:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (10cmooney) 05Open→03Resolved a:03cmooney Thanks all, config applied now.  @volans I left the timeout at 30 mins.  I think (esp. in an emergency situation) it's not unlikely yo...
[14:11:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1031.eqiad.wmnet']
[14:11:39] <wikibugs>	 (03CR) 10FNegri: [openstack] bridge-utils config in bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri)
[14:13:30] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:00] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba)
[14:15:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1032.eqiad.wmnet']
[14:15:26] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1027.eqiad.wmnet']
[14:15:56] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1033.eqiad.wmnet']
[14:16:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9159690, @cmooney wrote: >>>! In T344259#9158656, @Eevans wrote: >> This (actually) worked.  Forcing the switch port to 100mbit was enough to succes...
[14:17:33] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:00] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "This matches what we do with ms-fe, which feels like the right answer." [puppet] - 10https://gerrit.wikimedia.org/r/956886 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[14:18:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1030.eqiad.wmnet']
[14:18:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1028.eqiad.wmnet']
[14:18:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1034.eqiad.wmnet']
[14:19:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1035.eqiad.wmnet']
[14:21:29] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED
[14:21:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1031.eqiad.wmnet']
[14:22:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet']
[14:22:22] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba)
[14:22:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet']
[14:23:46] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1032.eqiad.wmnet']
[14:24:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet']
[14:24:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1033.eqiad.wmnet']
[14:24:31] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet']
[14:24:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1037.eqiad.wmnet']
[14:24:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: restore raid1 for thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/956886 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi)
[14:24:59] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah)
[14:25:24] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah)
[14:25:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1038.eqiad.wmnet']
[14:26:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:27:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1034.eqiad.wmnet']
[14:27:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1039.eqiad.wmnet']
[14:27:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1035.eqiad.wmnet']
[14:27:51] <wikibugs>	 (03PS2) 10Ssingh: wikimedia.org: remove nsa.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219)
[14:27:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1040.eqiad.wmnet']
[14:30:41] <moritzm>	 !log installing libssh2 security updates
[14:30:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet']
[14:30:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet']
[14:31:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1041.eqiad.wmnet']
[14:31:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1042.eqiad.wmnet']
[14:31:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:31:37] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] benthos/mw_accesslog_metrics: better handler label [puppet] - 10https://gerrit.wikimedia.org/r/956875 (owner: 10Kamila Součková)
[14:33:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1037.eqiad.wmnet']
[14:34:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1038.eqiad.wmnet']
[14:35:48] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1039.eqiad.wmnet']
[14:36:14] <wikibugs>	 (03PS4) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739
[14:36:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1040.eqiad.wmnet']
[14:37:49] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10MoritzMuehlenhoff) I've set the Netbox status back to Active.
[14:38:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: furud.codfw.wmnet
[14:38:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: furud.codfw.wmnet
[14:38:40] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:44] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:39:01] <wikibugs>	 (03PS1) 10JMeybohm: Add systemd dependencies to kube-apiserver [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826)
[14:39:36] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1041.eqiad.wmnet']
[14:39:42] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1042.eqiad.wmnet']
[14:39:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) Just to mention here, but the restriction described in T322937#8847201 no longer seems to be the case.  In codfw with devices on JunOS 22.2R3....
[14:39:55] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet
[14:40:13] <wikibugs>	 (03PS1) 10Herron: titan: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/956901
[14:40:15] <wikibugs>	 (03PS1) 10Herron: titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902
[14:40:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond)
[14:40:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1043.eqiad.wmnet']
[14:40:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1044.eqiad.wmnet']
[14:40:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1045.eqiad.wmnet']
[14:40:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1046.eqiad.wmnet']
[14:40:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902 (owner: 10Herron)
[14:40:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1047.eqiad.wmnet']
[14:40:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1048.eqiad.wmnet']
[14:41:01] <wikibugs>	 (03PS1) 10FNegri: Remove unused toolsdb roles [puppet] - 10https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929)
[14:42:55] <moritzm>	 !log installing Linux 6.1.52 on Bookworm hosts
[14:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:26] <wikibugs>	 (03PS2) 10FNegri: [toolsdb] Remove unused primary/secondary roles [puppet] - 10https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929)
[14:43:44] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr) @aborrero, I am available to move it this week, tomorrow or Thursday.
[14:46:13] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet
[14:48:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143)
[14:48:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: move rule evaluation to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143)
[14:48:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143)
[14:48:42] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143)
[14:49:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1045.eqiad.wmnet']
[14:49:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1048.eqiad.wmnet']
[14:49:31] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1043.eqiad.wmnet']
[14:49:37] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1047.eqiad.wmnet']
[14:49:38] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1046.eqiad.wmnet']
[14:49:38] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1049.eqiad.wmnet']
[14:49:42] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1044.eqiad.wmnet']
[14:50:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1054.eqiad.wmnet']
[14:50:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1053.eqiad.wmnet']
[14:50:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1052.eqiad.wmnet']
[14:50:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1050.eqiad.wmnet']
[14:50:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1051.eqiad.wmnet']
[14:54:29] <wikibugs>	 (PS2) Herron: titan: add pyrra/slo envoy/cfssl config [puppet] - https://gerrit.wikimedia.org/r/956902
[14:55:14] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.61.0" for 596 hosts
[14:55:30] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:26] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.61.0" for 595 hosts
[14:56:58] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:57:38] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.61.0" completed for 595 hosts
[14:57:53] <godog>	 !log add 30G to prometheus@services and 300G to prometheus@ops (codfw)
[14:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:45] <wikibugs>	 (PS1) Cathal Mooney: Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937)
[14:58:54] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1049.eqiad.wmnet']
[14:58:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1052.eqiad.wmnet']
[14:58:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1053.eqiad.wmnet']
[14:59:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1054.eqiad.wmnet']
[14:59:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1050.eqiad.wmnet']
[14:59:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1051.eqiad.wmnet']
[14:59:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:00:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1055.eqiad.wmnet']
[15:01:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1056.eqiad.wmnet']
[15:01:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (aborrero)
[15:02:50] <wikibugs>	 (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43232/console" [puppet] - https://gerrit.wikimedia.org/r/956901 (owner: Herron)
[15:03:25] <wikibugs>	 (PS1) Hnowlan: trafficserver: route requests to mediarequests service [puppet] - https://gerrit.wikimedia.org/r/956909 (https://phabricator.wikimedia.org/T336380)
[15:04:12] <wikibugs>	 (PS3) Filippo Giunchedi: conftool-data: split thanos-fe / titan hosts' services [puppet] - https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143)
[15:04:14] <wikibugs>	 (PS2) Filippo Giunchedi: thanos: move rule evaluation to titan hosts [puppet] - https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143)
[15:04:16] <wikibugs>	 (PS2) Filippo Giunchedi: thanos: move thanos-compact to titan host [puppet] - https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143)
[15:04:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:04:18] <wikibugs>	 (PS2) Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role [puppet] - https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143)
[15:05:03] <godog>	 herron: re: titan and cfssl please hold off, I'll let you know once I'm done with the transition
[15:05:20] <godog>	 the merging that is
[15:05:43] <herron>	 godog: will do, was about to add you to the patch, I'll wait on your +1
[15:05:46] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:08] <godog>	 cheers
[15:07:38] <wikibugs>	 (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43233/console" [puppet] - https://gerrit.wikimedia.org/r/956902 (owner: Herron)
[15:09:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:09:24] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:09:25] <wikibugs>	 (CR) Elukey: Add systemd dependencies to kube-apiserver (2 comments) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[15:09:28] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1055.eqiad.wmnet']
[15:10:49] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:12:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[15:12:53] <wikibugs>	 (CR) JMeybohm: Add systemd dependencies to kube-apiserver (1 comment) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[15:13:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1056.eqiad.wmnet']
[15:14:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:14:33] <wikibugs>	 SRE, Infrastructure-Foundations, Stewards-and-global-tools, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (LSobanski) @Urbanecm what ongoing support would you envision beyond setting up the VM and some sort of a deployment method and keeping up to date with...
[15:15:02] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:15:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (VRiley-WMF) db1236 - A 5. U 27. port 45 CableID 3160 db1237 - A 5. U 33. port 33 CableID 1962 db1238 - B 5. U 14. port 23 CableID 4048 db1239 - B 5. U 15. port 09 CableID 3793
[15:18:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[15:19:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[15:19:16] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:19:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[15:19:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[15:21:18] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:02] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:22:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[15:22:45] <wikibugs>	 (CR) Elukey: "Left a couple of comments to better understand this :)" [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[15:23:30] <wikibugs>	 (CR) Elukey: "Is there a rationale for this, or is it just a simplification of the workflow?" [puppet] - https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[15:24:38] <wikibugs>	 (CR) Gehel: Add: roll restart/reboot command for opensearch cluster nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[15:25:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[15:27:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[15:27:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[15:28:03] <wikibugs>	 (PS5) Jbond: puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739
[15:28:20] <wikibugs>	 (CR) JMeybohm: [V: +1] kubernetes::master: control-plane components should use the local api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[15:30:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[15:31:25] <wikibugs>	 (PS1) Ssingh: varnish:common: add bookworm version for Python [puppet] - https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154)
[15:32:04] <wikibugs>	 (CR) CI reject: [V: -1] puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[15:32:35] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43234/console" [puppet] - https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154) (owner: Ssingh)
[15:33:56] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154) (owner: Ssingh)
[15:39:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2003.codfw.wmnet
[15:39:34] <wikibugs>	 (CR) JMeybohm: [V: +1] k8s::apiserver: Use a separate systemd service for safe restarts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[15:41:03] <wikibugs>	 (PS1) Samtar: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414)
[15:41:49] <wikibugs>	 (PS2) Jdlrobson: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) (owner: Samtar)
[15:42:06] <wikibugs>	 (PS1) Brouberol: Fix typo in allowed cluster group aliases [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798)
[15:43:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2003.codfw.wmnet
[15:43:23] <TheresNoTime>	 jouncebot: nowandnext
[15:43:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 16 minute(s)
[15:43:23] <jouncebot>	 In 0 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1600)
[15:43:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1004.eqiad.wmnet
[15:44:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to shell/dcops for VRiley - https://phabricator.wikimedia.org/T346077 (RobH) Open→Resolved
[15:46:15] <wikibugs>	 (PS3) Dduvall: gitlab: Fix conditional end in gitlab.rb template [puppet] - https://gerrit.wikimedia.org/r/956515
[15:47:00] <wikibugs>	 (CR) Dduvall: gitlab: Fix conditional end in gitlab.rb template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956515 (owner: Dduvall)
[15:47:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (lmata) Thank you!
[15:47:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1004.eqiad.wmnet
[15:51:38] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[15:54:34] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:54:44] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[15:54:56] <TheresNoTime>	 hm
[15:55:56] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (aborrero) >>! In T346042#9160486, @Jclark-ctr wrote: > @aborrero. I am available to move it This week tomorrow or Thursday.     OK! Let's do it to...
[15:59:44] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[16:00:07] <jouncebot>	 jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1600).
[16:00:07] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:38] <jinxer-wm>	 (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:01:53] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:03:03] <_joe_>	 uh this alert seems serious
[16:03:09] <wikibugs>	 (CR) Elukey: k8s::apiserver: Use a separate systemd service for safe restarts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[16:03:36] <_joe_>	 Krinkle: ^^
[16:03:45] <wikibugs>	 (CR) Majavah: [C: +1] [toolsdb] Remove unused primary/secondary roles [puppet] - https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929) (owner: FNegri)
[16:04:08] <wikibugs>	 (CR) FNegri: [C: +2] [toolsdb] Remove unused primary/secondary roles [puppet] - https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929) (owner: FNegri)
[16:06:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:06:29] <wikibugs>	 (CR) Andrew Bogott: [C: +1] [openstack] bridge-utils config in bookworm [puppet] - https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: FNegri)
[16:06:38] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:06:53] <jinxer-wm>	 (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:07:14] <Krinkle>	 https://grafana.wikimedia.org/d/_L_3fQh4z/cross-dc-traffic-alerts?orgId=1&from=now-24h&to=now
[16:07:30] <wikibugs>	 (CR) BBlack: [C: +1] wikimedia.org: remove nsa.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[16:08:00] <wikibugs>	 (PS2) C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey)
[16:08:07] <jinxer-wm>	 (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:08:07] <jinxer-wm>	 (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:08:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:08:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1389.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1371.eqiad.wmnet, mw1397.eqiad.wmnet, mw1455.eqiad.wmnet, mw1475.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1349.eqiad.wmnet, mw1452.eqiad.wmnet, mw1387.eqiad.wmnet, mw1456.eqiad.wmnet, mw1430.eqiad.wmnet, mw1476.eqiad.wmnet, mw1480.eqiad.wmnet, mw
[16:08:30] <icinga-wm>	 ad.wmnet, mw1352.eqiad.wmnet, mw1413.eqiad.wmnet, mw1441.eqiad.wmnet, mw1368.eqiad.wmnet, mw1420.eqiad.wmnet, mw1366.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, mw1487.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1496.eqiad.wmnet, mw1401.eqiad.wmnet, mw1395.eqiad.wmnet, mw1403.eqiad.wmnet, mw1409.eqiad.wmnet, mw1411.eqiad.wmnet, mw1417.eqiad.wmnet, mw1473.eqiad
[16:08:30] <icinga-wm>	 mw1399.eqiad.wmnet, mw1479.eqiad.wmnet, mw1353.eqiad.wmnet, mw1477.eqiad.wmnet, mw1416.eqiad.wmnet, mw1472.eqiad.wmnet, mw1373.eqiad.wmnet, mw1350.eqiad.wmnet, mw1453.eqiad.wmnet, mw143 https://wikitech.wikimedia.org/wiki/PyBal
[16:08:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5022.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp501
[16:08:40] <icinga-wm>	 wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but 
[16:08:40] <icinga-wm>	 https://wikitech.wikimedia.org/wiki/PyBal
[16:08:45] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[16:08:46] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[16:08:53] <wikibugs>	 (CR) C. Scott Ananian: Re-enable Extension:ParserMigration on labs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey)
[16:09:03] <wikibugs>	 (CR) FNegri: [C: +2] [openstack] bridge-utils config in bookworm [puppet] - https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: FNegri)
[16:09:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp501
[16:09:19] <icinga-wm>	 wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5020.eqsin.wmnet, cp5022.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:09:23] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:09:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1389.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1397.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1475.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1456.eqiad.wmnet, mw1432.eqiad.wmnet, mw1478.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw
[16:09:24] <icinga-wm>	 ad.wmnet, mw1364.eqiad.wmnet, mw1407.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1476.eqiad.wmnet, mw1480.eqiad.wmnet, mw1351.eqiad.wmnet, mw1405.eqiad.wmnet, mw1352.eqiad.wmnet, mw1399.eqiad.wmnet, mw1368.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw1454.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1393.eqiad.wmnet, mw1488.eqiad.wmnet, mw1481.eqiad.wmnet, mw1366.eqiad.wmnet, mw1487.eqiad.wmnet, mw1372.eqiad
[16:09:24] <icinga-wm>	 mw1391.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1479.eqiad.wmnet, mw1418.eqiad.wmnet, mw1496.eqiad.wmnet, mw1473.eqiad.wmnet, mw1401.eqiad.wmnet, mw139 https://wikitech.wikimedia.org/wiki/PyBal
[16:09:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:10:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:10:16] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[16:10:44] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:10:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:11:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:11:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:11:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:11:38] <jinxer-wm>	 (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:11:53] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:12:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:12:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:12:32] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:13:07] <jinxer-wm>	 (ProbeDown) resolved: (17) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:07] <jinxer-wm>	 (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:13:45] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[16:14:16] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:15:09] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:15:16] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[16:16:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:16:28] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[16:16:52] <jinxer-wm>	 (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[16:17:04] <wikibugs>	 (PS1) BBlack: haproxy: limit to 20k conns to varnish globally [puppet] - https://gerrit.wikimedia.org/r/956919
[16:17:45] <wikibugs>	 (PS3) C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey)
[16:19:35] <wikibugs>	 (CR) Vgutierrez: [C: +1] "looks good, please add the bug line to the commit message" [puppet] - https://gerrit.wikimedia.org/r/956919 (owner: BBlack)
[16:19:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:19:54] <wikibugs>	 (CR) Fabfur: [C: +1] haproxy: limit to 20k conns to varnish globally [puppet] - https://gerrit.wikimedia.org/r/956919 (owner: BBlack)
[16:22:23] <wikibugs>	 (PS2) BBlack: haproxy: limit to 20k conns to varnish globally [puppet] - https://gerrit.wikimedia.org/r/956919 (https://phabricator.wikimedia.org/T310609)
[16:23:10] <wikibugs>	 (CR) BBlack: [C: +2] haproxy: limit to 20k conns to varnish globally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956919 (https://phabricator.wikimedia.org/T310609) (owner: BBlack)
[16:28:49] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bookworm
[16:45:03] <wikibugs>	 (PS4) Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[16:46:44] <wikibugs>	 (CR) Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[16:51:29] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (kamila)
[16:53:12] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (kamila) Thank you for the reminder @Trizek-WMF! I sent a Slack announcement a while ago and now sent a reminder...
[16:55:51] <wikibugs>	 (CR) Herron: [C: +1] netmon: Failover from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse)
[16:57:38] <wikibugs>	 SRE: All Wikimedia projects inaccessible on 2023-09-12 for several minutes - https://phabricator.wikimedia.org/T346172 (Aklapper)
[16:57:43] <wikibugs>	 (CR) Herron: [C: +1] Don't require dummy 'team' label for multi-owner alerts [alerts] - https://gerrit.wikimedia.org/r/956794 (owner: Filippo Giunchedi)
[16:58:17] <wikibugs>	 (CR) Herron: [C: +1] profile::thanos: add increase-based rec rules for Istio [puppet] - https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[16:58:35] <wikibugs>	 SRE: All Wikimedia projects inaccessible on 2023-09-12 for several minutes - https://phabricator.wikimedia.org/T346172 (Wellverywell)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1700)
[17:03:31] <wikibugs>	 SRE: All Wikimedia projects inaccessible on 2023-09-12 for several minutes - https://phabricator.wikimedia.org/T346172 (taavi) Open→Resolved We believe this is resolved now. In general we have monitoring to detect outages of that level and update IRC (`#wikimedia-tech` on libera.chat) and/or wikimedi...
[17:03:45] <wikibugs>	 (CR) Bking: [C: +2] wdqs.data-transfer: Keep downtime (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: Bking)
[17:21:56] <wikibugs>	 (PS1) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158)
[17:21:58] <wikibugs>	 (PS1) Andrew Bogott: Remove scripts for cross-region migration [puppet] - https://gerrit.wikimedia.org/r/956926
[17:22:00] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclients: add methods to correlate project id with name [puppet] - https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158)
[17:22:02] <wikibugs>	 (PS1) Andrew Bogott: wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158)
[17:22:04] <wikibugs>	 (PS1) Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158)
[17:22:06] <wikibugs>	 (PS1) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158)
[17:23:05] <wikibugs>	 (CR) CI reject: [V: -1] wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[17:24:09] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) I took a little look at the routed-mode docs from [[ https://github.com/grnet/gnt-networking/blob/develop/docs/routed.rst | here ]].  Overall the setup looks a...
[17:27:13] <wikibugs>	 (PS1) DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951)
[17:31:21] <wikibugs>	 (PS2) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158)
[17:32:11] <wikibugs>	 (PS8) Bking: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (https://phabricator.wikimedia.org/T346048) (owner: Ebernhardson)
[17:33:45] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove scripts for cross-region migration [puppet] - https://gerrit.wikimedia.org/r/956926 (owner: Andrew Bogott)
[17:35:49] <wikibugs>	 (CR) Bking: [C: +2] Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (https://phabricator.wikimedia.org/T346048) (owner: Ebernhardson)
[17:36:09] <wikibugs>	 (PS2) Andrew Bogott: Remove scripts for cross-region migration [puppet] - https://gerrit.wikimedia.org/r/956926
[17:36:11] <wikibugs>	 (PS2) Andrew Bogott: mwopenstackclients: add methods to correlate project id with name [puppet] - https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158)
[17:36:13] <wikibugs>	 (PS2) Andrew Bogott: wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158)
[17:36:15] <wikibugs>	 (PS2) Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158)
[17:36:17] <wikibugs>	 (PS3) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158)
[17:36:19] <wikibugs>	 (PS2) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158)
[17:37:38] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:45:54] <wikibugs>	 (Merged) jenkins-bot: Add a networkpolicy template for zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/955032 (https://phabricator.wikimedia.org/T346048) (owner: Ebernhardson)
[17:49:37] <wikibugs>	 (CR) Ssingh: [C: +2] wikimedia.org: remove nsa.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[17:50:03] <sukhe>	 !log run authdns-update to remove nsa.wikimedia.org
[17:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:32] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346112 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue with no impact
[18:02:16] <wikibugs>	 (PS10) AOkoth: vrts: apply role and setup hiera values [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[18:02:45] <wikibugs>	 (CR) CI reject: [V: -1] vrts: apply role and setup hiera values [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth)
[18:06:14] <wikibugs>	 SRE, Cloud-Services: Certain systems failing to resolve - https://phabricator.wikimedia.org/T346177 (cmooney) p:Triage→Medium The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a mo...
[18:07:49] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve - https://phabricator.wikimedia.org/T346177 (taavi) p:Medium→High
[18:08:28] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (cmooney)
[18:08:58] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (cmooney)
[18:09:55] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (cmooney)
[18:12:22] <wikibugs>	 (PS1) Sharvaniharan: Make the new stream name consistent with convention [mediawiki-config] - https://gerrit.wikimedia.org/r/956934
[18:12:35] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (Nux) `login.toolforge.org` is not working either (even after flushing local dns). So no way to ssh into TS.
[18:13:22] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (taavi)
[18:13:28] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (taavi)
[18:18:43] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org - https://phabricator.wikimedia.org/T346177 (taavi)
[18:19:38] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (taavi)
[18:23:39] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) @RobH if we can ask the registrar to change the IP they have for //ns1.openstack.eqiad1.wikimediacloud.org// to 185.15.56.163...
[18:24:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (VRiley-WMF) db1240 - B 6. U 03. port 01 CableID 1170 db1241 - B 6. U 04. port 05 CableID 1274 db1242 - C 3. U 12. port 25 CableID 5090 db1243 - C 3. U 13. port 04 CableID 2871
[18:25:14] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (RobH) >>! In T346177#9161332, @cmooney wrote: > @RobH if we can ask the registrar to change the IP they have for //ns1.openstack.eqiad1...
[18:27:45] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) >>! In T346177#9161360, @RobH wrote: >>>! In T346177#9161332, @cmooney wrote: >> @RobH if we can ask the registrar to change t...
[18:29:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:30:33] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (RobH) > Naoya, >  > We're juggling around some nameservers in our cloud environment over here, and need to update one of them: >  > ns1...
[18:33:05] <wikibugs>	 SRE, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (BCornwall) @ssingh do you think this is still an issue that's worth keeping open, and should it then be tagged to IF?
[18:34:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:35:08] <wikibugs>	 SRE, Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (ssingh) Thanks @BCornwall, I think we can close this one as we have done some other reimages in eqsin and not observed this issue.
[18:35:58] <wikibugs>	 SRE, Domains, Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (BCornwall) Hi, @CRoslof! Have you been able to look into this?  Thanks so much!
[18:37:25] <wikibugs>	 SRE-tools, DNS, Infrastructure-Foundations, Traffic, Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) @Vgutierrez and @BBlack friendly poke :)
[18:40:28] <wikibugs>	 SRE, Traffic, Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825 (BCornwall) p:Lowest→Low
[18:43:47] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (RobH) > Update, >  > It turns out the other nameserver was migrated previously with no IP update either, so its currently incorrect on...
[18:55:11] <wikibugs>	 (CR) Herron: [C: +1] "Happy to help monitor after deployment, morning Eastern TZ generally works for me" [puppet] - https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:56:08] <wikibugs>	 (CR) Herron: [C: +1] "Happy to monitor after deploying this as well, morning Eastern TZ generally works for me" [puppet] - https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:58:58] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:59:20] <wikibugs>	 (PS5) Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[19:03:01] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) I've not seen any change in what ORG is returning.  I have made some routing and host-level iptables changes on cloudservices1...
[19:07:45] <wikibugs>	 (PS6) Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[19:09:48] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) FWIW I took the following steps:  # Routed 208.80.154.11/32 to cloudservices2006 on cloudsw # Updated cloudsw and cr1-eqiad ro...
[19:11:16] <wikibugs>	 (PS7) Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048)
[19:12:24] <wikibugs>	 (CR) Ebernhardson: [C: +1] rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[19:12:29] <wikibugs>	 (CR) Ryan Kemper: [C: +1] rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[19:13:32] <wikibugs>	 (CR) Bking: [C: +2] rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[19:14:29] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[19:15:13] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[19:17:29] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[19:23:41] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[19:26:44] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) I've temporarily reserved 208.80.154.11 in Netbox so it doesn't get used - we should remove that once ORG has updated their re...
[19:28:13] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ssw1 old irb int dns - cmooney@cumin1001"
[19:28:28] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you. 😊" [alerts] - https://gerrit.wikimedia.org/r/956794 (owner: Filippo Giunchedi)
[19:31:22] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:32:19] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ssw1 old irb int dns - cmooney@cumin1001"
[19:32:19] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:34:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:38:09] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:39:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:40:53] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) I'd not noticed in my initial comment above, but *neither* IP that ORG is returning was working earlier on.  Seems at some sta...
[19:43:09] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:44:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:47:27] <wikibugs>	 SRE, Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) I investigated making the same routing change for 208.80.154.135/32 but it's assigned to gerrit1003 so not possible.
[19:49:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:54:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:55:39] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:59:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T2000)
[20:00:06] <jouncebot>	 Jdlrobson and  sharvani_: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:39] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:04:33] <Jdlrobson>	 here
[20:06:11] <brett>	 Jdlrobson: is there an incident?
[20:06:11] <wikibugs>	 (PS1) Volans: external clouds: allow to get prefixes from RIPE [puppet] - https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534)
[20:08:43] <Jdlrobson>	 brett: there's a backport window. I'm not aware if there is an incident.  I'm not sure how critical mw-api-int is
[20:09:02] <Jdlrobson>	 looks like an SRE team alert
[20:09:12] <brett>	 ah, misunderstood your "here". gotcha
[20:09:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:09:40] <sharvani_>	 Hi! Here for UTC late backport window patch deployment :)
[20:10:26] <cjming>	 hi - i can deploy if it's not already underway
[20:10:39] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:10:41] <Jdlrobson>	 cjming: thank you. Nobody has claimed the backport window yet :)
[20:11:01] <wikibugs>	 (CR) Volans: "Test run results:" [puppet] - https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: Volans)
[20:11:10] <sharvani_>	 cjming: Thank you :)
[20:11:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [extensions/Phonos] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) (owner: Samtar)
[20:11:46] <wikibugs>	 (PS2) Clare Ming: Remove zebra from beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) (owner: Kimberly Sarabia)
[20:12:47] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:13:20] <wikibugs>	 (Merged) jenkins-bot: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) (owner: Samtar)
[20:13:48] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:956811|Reduce initial payload of Phonos styles (T345414)]]
[20:13:51] <stashbot>	 T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[20:15:25] <logmsgbot>	 !log cjming@deploy1002 cjming and samtar: Backport for [[gerrit:956811|Reduce initial payload of Phonos styles (T345414)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:15:29] <cjming>	 Jdlrobson: do you want to test your first patch?
[20:15:51] <Jdlrobson>	 cjming: on it
[20:17:51] <Jdlrobson>	 cjming: is the change live on the debug servers?
[20:18:09] <Jdlrobson>	 (Reduce initial payload of Phonos styles [extensions/Phonos])
[20:18:11] <cjming>	 Jdlrobson: it should be - are you not seeing it?
[20:18:21] <Jdlrobson>	 it might not be possible to test
[20:18:23] <Jdlrobson>	 since it touches parser
[20:18:36] <cjming>	 oh - should i just go ahead and sync?
[20:19:15] <Jdlrobson>	 cjming: i think so
[20:19:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:19:27] <logmsgbot>	 !log cjming@deploy1002 cjming and samtar: Continuing with sync
[20:21:50] <wikibugs>	 (PS3) BCornwall: Release 9.2.1-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: Ssingh)
[20:24:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:25:54] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:956811|Reduce initial payload of Phonos styles (T345414)]] (duration: 12m 06s)
[20:25:57] <stashbot>	 T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[20:26:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52483 and previous config saved to /var/cache/conftool/dbconfig/20230912-202609-arnaudb.json
[20:26:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) (owner: Kimberly Sarabia)
[20:26:12] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[20:27:01] <wikibugs>	 (Merged) jenkins-bot: Remove zebra from beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) (owner: Kimberly Sarabia)
[20:27:30] <cjming>	 Jdlrobson: Phonos patch should be live - labs patch should be gtg
[20:27:55] <wikibugs>	 (PS2) Clare Ming: Make the new stream name consistent with convention [mediawiki-config] - https://gerrit.wikimedia.org/r/956934 (owner: Sharvaniharan)
[20:28:13] <wikibugs>	 (PS3) RhinosF1: wikistats: drop some updates [puppet] - https://gerrit.wikimedia.org/r/956813
[20:28:17] <wikibugs>	 (CR) CI reject: [V: -1] wikistats: drop some updates [puppet] - https://gerrit.wikimedia.org/r/956813 (owner: RhinosF1)
[20:28:35] <wikibugs>	 (PS11) AOkoth: vrts: apply role and setup hiera values [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027)
[20:29:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956934 (owner: Sharvaniharan)
[20:29:33] <wikibugs>	 (03CR) 10RhinosF1: "this can be merged blind. I will do the follow up unless someone wants to. Host is wikistats-bookworm.eqiad1.wikimedia.cloud though if yo" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1)
[20:29:40] <Jdlrobson>	 cjming: thanks checking now
[20:29:56] <wikibugs>	 (03Merged) 10jenkins-bot: Make the new stream name consistent with convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956934 (owner: 10Sharvaniharan)
[20:30:25] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:956934|Make the new stream name consistent with convention]]
[20:31:25] <wikibugs>	 (03PS4) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813
[20:31:33] <wikibugs>	 (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1)
[20:31:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1)
[20:31:53] <logmsgbot>	 !log cjming@deploy1002 sharvaniharan and cjming: Backport for [[gerrit:956934|Make the new stream name consistent with convention]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:32:25] <cjming>	 sharvani_:  i should just sync your patch?
[20:32:31] <wikibugs>	 (03PS5) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813
[20:32:38] <wikibugs>	 (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1)
[20:32:42] <sharvani_>	 cjming: 
[20:32:46] <sharvani_>	 I can test
[20:32:54] <cjming>	 please do
[20:32:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1)
[20:33:13] <sharvani_>	 looks good please sync. thank you.
[20:33:18] <logmsgbot>	 !log cjming@deploy1002 sharvaniharan and cjming: Continuing with sync
[20:33:26] <wikibugs>	 (03PS6) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813
[20:34:19] <wikibugs>	 (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1)
[20:37:23] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] alertmanager: add link to DatasourceError runbook [puppet] - 10https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite)
[20:39:21] <Jdlrobson>	 cjming: managed to verify everything is okay with the patch - it just didn't work as advertised :)
[20:39:49] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:956934|Make the new stream name consistent with convention]] (duration: 09m 24s)
[20:40:03] <cjming>	 Jdlrobson: bummer - is there anything to do?
[20:40:05] <cjming>	 sharvani_: should be live!
[20:40:21] <sharvani_>	 thank you!
[20:40:22] <Jdlrobson>	 cjming: nope nothing we can do here :)
[20:40:42] <cjming>	 sharvani_: yw - np!
[20:41:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P52484 and previous config saved to /var/cache/conftool/dbconfig/20230912-204115-arnaudb.json
[20:41:25] <cjming>	 Jdlrobson: sorry to hear -- lmk if you want to revert
[20:41:54] <Jdlrobson>	 nope no need to revert it's still a minor improvement
[20:41:58] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1030 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:42:22] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.235:7000 on restbase1030 is OK: SSL OK - Certificate restbase1030-b valid until 2024-08-30 21:39:18 +0000 (expires in 353 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:42:25] <inflatador>	 !log rebooting search-loader2001.codfw.wmnet T344671
[20:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:38] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.236:7000 on restbase1030 is OK: SSL OK - Certificate restbase1030-c valid until 2024-08-30 21:39:21 +0000 (expires in 353 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:42:45] <cjming>	 closing the window if there's nothing else
[20:42:50] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1030 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:43:00] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader2001.codfw.wmnet
[20:43:37] <cjming>	 !log end of UTC late backport window
[20:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:56] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2001.codfw.wmnet
[20:53:16] <icinga-wm>	 RECOVERY - Host mw2444 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[20:54:26] <wikibugs>	 10SRE, 10Movement-Insights, 10Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Hi, any update on this?
[20:54:32] <icinga-wm>	 PROBLEM - puppet last run on mw2444 is CRITICAL: CRITICAL: Puppet last ran 13 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:56:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P52485 and previous config saved to /var/cache/conftool/dbconfig/20230912-205621-arnaudb.json
[20:57:48] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Jhancock.wm) I've replaced the CPU. We should know by Friday if it has issues. I will leave the ticket open until then.   return tracking for the bad part (which I will also hold until Friday): 783629071254
[20:59:58] <icinga-wm>	 RECOVERY - puppet last run on mw2444 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:01:26] <wikibugs>	 10SRE, 10Security-Team, 10Traffic, 10SecTeam-Processed, 10Security: Denial of Service due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (10sbassett)
[21:01:47] <wikibugs>	 10SRE, 10Security-Team, 10Traffic, 10SecTeam-Processed, 10Security: Denial of Service due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (10sbassett)
[21:04:18] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader1001.eqiad.wmnet
[21:07:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh)
[21:08:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1001.eqiad.wmnet
[21:09:28] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:11:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52486 and previous config saved to /var/cache/conftool/dbconfig/20230912-211128-arnaudb.json
[21:11:30] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[21:11:34] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[21:11:43] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[21:12:37] <wikibugs>	 (03CR) 10Cwhite: [V: 03+1 C: 03+2] grafana: ensure prometheus/global datasources removed [puppet] - 10https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) (owner: 10Cwhite)
[21:14:04] <wikibugs>	 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) 05Open→03Resolved @bking  @RKemper All disks are present
[21:14:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:18:12] <wikibugs>	 10SRE, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) 05Stalled→03Resolved a:03BCornwall
[21:29:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:29:24] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:30:39] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:34:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:24] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:40:03] <wikibugs>	 (03CR) 10Cwhite: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol)
[21:44:43] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh)
[21:46:49] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] Don't require dummy 'team' label for multi-owner alerts [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi)
[21:59:58] <wikibugs>	 (03PS1) 10BCornwall: package_builder: add piuparts for >=bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956968
[22:11:43] <wikibugs>	 (03CR) 10Fabfur: "I found piuparts very useful but wait for someone with more experience than me with this" [puppet] - 10https://gerrit.wikimedia.org/r/956968 (owner: 10BCornwall)
[22:46:19] <wikibugs>	 (03PS1) 10Volans: decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134)
[23:07:52] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:42] <wikibugs>	 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I'd like to understand where the requirement for the "glue" A records that org are returning for these comes from.  As I under...
[23:09:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:11:58] <Amir1>	 !bash Krinkle: The quintessential pattern of classic MediaWiki is query() -> doQuery() -> reallyDoQuery(). And when really (ha) in doubt, just tack a number or "internal" to the caller, like guessMimeInternal, makeThumbLink2, or recordUpload3
[23:11:58] <stashbot>	 Amir1: Stored quip at https://bash.toolforge.org/quip/I4Wqi4oBGiVuUzOddIg1
[23:14:57] <brett>	 !log Upload trafficserver_9.2.1-1wm2_amd64 to bookworm-wikimedia
[23:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:00] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:44:09] <wikibugs>	 (03PS1) 10Jdlrobson: Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/956814 (https://phabricator.wikimedia.org/T345414)
[23:44:23] <wikibugs>	 (03PS1) 10Jdlrobson: Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956815 (https://phabricator.wikimedia.org/T345414)