[00:01:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P52441 and previous config saved to /var/cache/conftool/dbconfig/20230912-000148-arnaudb.json
[00:12:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[00:16:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52442 and previous config saved to /var/cache/conftool/dbconfig/20230912-001654-arnaudb.json
[00:16:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[00:16:58] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[00:17:09] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[00:17:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52443 and previous config saved to /var/cache/conftool/dbconfig/20230912-001715-arnaudb.json
[00:38:08] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956023
[00:38:14] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956023 (owner: TrainBranchBot)
[00:53:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956023 (owner: TrainBranchBot)
[01:03:52] PROBLEM - dump of s5 in eqiad on backupmon1001 is CRITICAL: Last dump for s5 at eqiad (db1216) taken on 2023-09-12 00:00:03 is 61 GiB, but the
previous one was 73 GiB, a change of -17.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:05:49] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346112 (phaultfinder)
[01:06:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[01:06:24] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host titan1001.eqiad.wmnet with OS bookworm
[01:19:26] PROBLEM - dump of s5 in codfw on backupmon1001 is CRITICAL: Last dump for s5 at codfw (db2101) taken on 2023-09-12 00:00:05 is 61 GiB, but the previous one was 73 GiB, a change of -17.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:34:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on titan1001.eqiad.wmnet with reason: host reimage
[01:38:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan1001.eqiad.wmnet with reason: host reimage
[01:54:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:55:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:55:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan1001.eqiad.wmnet with OS bookworm
[01:56:02] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host
titan1001.eqiad.wmnet with OS bookworm completed: - titan10...
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0200)
[02:06:53] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.26 [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956024 (https://phabricator.wikimedia.org/T343728)
[02:06:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host titan1002.eqiad.wmnet with OS bookworm
[02:06:59] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.26 [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956024 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[02:07:04] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host titan1002.eqiad.wmnet with OS bookworm
[02:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:50] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.26 [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956024 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[02:28:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on titan1002.eqiad.wmnet with reason: host reimage
[02:32:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan1002.eqiad.wmnet with reason: host reimage
[02:37:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:49:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:49:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan1002.eqiad.wmnet with OS bookworm
[02:49:27] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host titan1002.eqiad.wmnet with OS bookworm completed: - titan10...
[02:50:58] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (Jhancock.wm) Open→Resolved
[03:00:06] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0300)
[03:01:06] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:37] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/956521 (https://phabricator.wikimedia.org/T343728)
[03:01:39] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/956521 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[03:02:20] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/956521 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[03:02:50] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.26 refs T343728
[03:02:53] T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[03:56:08] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.26 refs T343728 (duration: 53m 18s)
[03:56:12] T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[03:58:41] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.23, 1.41.0-wmf.24 (duration: 02m 30s)
[04:10:01] (CR) TTO: Enable PageNotice on enwiktionary beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[04:12:54] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 -
https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[04:14:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52444 and previous config saved to /var/cache/conftool/dbconfig/20230912-041425-arnaudb.json
[04:14:29] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:29:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P52445 and previous config saved to /var/cache/conftool/dbconfig/20230912-042931-arnaudb.json
[04:44:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P52446 and previous config saved to /var/cache/conftool/dbconfig/20230912-044437-arnaudb.json
[04:56:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[04:59:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T343198)', diff saved to https://phabricator.wikimedia.org/P52447 and previous config saved to /var/cache/conftool/dbconfig/20230912-045944-arnaudb.json
[04:59:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[04:59:47] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:00:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[05:00:01] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:00:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:00:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52448 and previous config saved to /var/cache/conftool/dbconfig/20230912-050033-arnaudb.json
[05:00:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[05:04:58] PROBLEM - ganeti-noded running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[05:05:24] PROBLEM - ganeti-confd running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[05:06:00] ^ expected, I'm reimaging the node after a mainboard replacement
[05:06:10] PROBLEM - ganeti-mond running on ganeti2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[05:08:15] SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff)
[05:08:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2014.codfw.wmnet with OS bullseye
[05:09:03] SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye
[05:11:14] !log installing aom security updates
[05:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P52449 and previous config saved to /var/cache/conftool/dbconfig/20230912-051725-root.json
[05:17:53] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1119 with Debian Bookworm in s1 with just 10% T339185', diff saved to https://phabricator.wikimedia.org/P52450 and previous config saved to /var/cache/conftool/dbconfig/20230912-051753-marostegui.json
[05:18:00] T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185
[05:18:28] (PS1) Marostegui: db2158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/956524
[05:19:14] (CR) Marostegui: [C: +2] db2158: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/956524 (owner: Marostegui)
[05:27:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2014.codfw.wmnet with reason: host reimage
[05:32:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2014.codfw.wmnet with reason: host reimage
[05:38:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343198)', diff saved to https://phabricator.wikimedia.org/P52452 and previous config saved to /var/cache/conftool/dbconfig/20230912-053813-arnaudb.json
[05:38:17] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:50:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2014.codfw.wmnet with OS bullseye
[05:50:33] SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye completed: - ganeti2014 (**PASS**) - Downtimed on Icinga/Alertmanager - Di...
[05:53:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P52453 and previous config saved to /var/cache/conftool/dbconfig/20230912-055319-arnaudb.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0600)
[06:00:05] kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0600).
[06:08:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P52454 and previous config saved to /var/cache/conftool/dbconfig/20230912-060825-arnaudb.json
[06:16:01] (CR) Ayounsi: [C: +2] Routinator: tmpfs, bump the maximum number of inodes [puppet] - https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) (owner: Ayounsi)
[06:23:06] (CR) Ayounsi: [C: +2] Add esams sandbox network prefixes [puppet] - https://gerrit.wikimedia.org/r/956454 (https://phabricator.wikimedia.org/T307021) (owner: Ayounsi)
[06:23:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343198)', diff saved to https://phabricator.wikimedia.org/P52455 and previous config saved to /var/cache/conftool/dbconfig/20230912-062332-arnaudb.json
[06:23:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[06:23:36] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[06:23:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[06:23:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T343198)', diff saved to
https://phabricator.wikimedia.org/P52456 and previous config saved to /var/cache/conftool/dbconfig/20230912-062353-arnaudb.json
[06:29:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:33:36] (PS1) Marostegui: Revert "db2158: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/956065
[06:34:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:34:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:36:29] (CR) Marostegui: [C: +2] Revert "db2158: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/956065 (owner: Marostegui)
[06:36:35] (PS1) Elukey: services: remove mediawiki.revision-score from eventstreams [deployment-charts] - https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116)
[07:00:05] Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0700).
[07:00:06] tto and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:14] * urbanecm waves
[07:00:16] morning
[07:00:20] Hello!
[07:00:24] hi tto!
[07:00:40] Sadly my patch has a -1 from Krinkle
[07:00:43] taavi: hi, glad you're here. can you please have a look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668156 and add a 3rd opinion please?
[07:00:48] I am not sure that he actually looked at it?
[07:00:57] the patch number in [[Deployments]] is wrong btw
[07:01:01] urbanecm: looking
[07:01:02] Timo gave a -1 and suggested to deploy the extension to beta first, which is...what this extension does :)
[07:01:09] Oh I see you have responded there
[07:01:24] thanks taa.vi.
[07:01:52] And thank you urbanecm for fixing the situation regarding the Maintainers list on mw:
[07:01:57] np
[07:03:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[07:03:30] <_joe_> uh sorry I missed the ping
[07:03:34] urbanecm: usually the extension-list addition is done in a separate commit. but I can't immediately think of a reason why doing it in the same commit would not work
[07:04:07] taavi: i think that doesn't matter for beta anyway, since its syncs happen whenever CI feels like it (unless deployer intervenes, which is generally not the case)
[07:04:13] <_joe_> should I start with my patches?
[07:04:26] _joe_: feel free to and please ping me once done :)
[07:05:07] <_joe_> urbanecm: just a piece of advice: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/943036/ doesn't really need to be synched separately
[07:05:25] <_joe_> should I just merge it then run scap backport on the others?
[07:05:39] i recommend passing multiple change IDs to `scap backport`
[07:05:49] <_joe_> oh TIL that works?
[07:05:51] ie. `scap backport 943036 942675`
[07:05:53] yup, it does.
[07:06:03] <_joe_> ack then, thanks
[07:06:18] merging and then running scap backport on the next patch works too, but you'll get a warning about an unexpected patch.
[07:06:49] urbanecm: right. only risk I see is that a revert of the technically-production-facing parts would take longer because you have i18n changes
[07:08:06] taavi: we can split that if you feel like that. i asked because of the -1, which's probably caused by the patch changing production files without also having production impact (well, that's the goal at least, but it doesn't seem like it actually suggests some change/improvement).
[07:08:25] (we can merge anyway, but i feel it would be helpful to get a second opinion before doing so)
[07:08:30] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:16] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[07:09:30] urbanecm: tbh I'm fine with both approaches, and the patch looks fine to me
[07:09:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[07:09:46] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:943036|update noc README]], [[gerrit:951046|Use ClusterConfig]]
[07:09:56] taavi: thanks. then let's leave it as-is i guess :). would you mind formally saying so on the patch please?
[07:11:22] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:943036|update noc README]], [[gerrit:951046|Use ClusterConfig]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:12:02] yes
[07:12:04] ty
[07:12:17] <_joe_> ook
[07:13:31] tto: fyi, with a second +1, i'm ok with deploying your beta-patch today. let's wait for joe to be done with his deployment now and then i think we can proceed.
[07:13:55] Thanks a bunch urbanecm!
[07:14:12] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:58] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[07:17:27] !log oblivian@deploy1002 oblivian: Continuing with sync
[07:20:55] _joe_: if it would be possible to "squeeze in" a beta patch, i'd appreciate that. just realized you have quite a few patches and we have only ~40 minutes left before train starts.
[07:21:15] <_joe_> urbanecm: I think so yes
[07:21:35] <_joe_> like my next patch is very fast and not risky (adds a function, still unused)
[07:21:40] <_joe_> we can merge them together maybe?
[07:21:44] sounds good to me
[07:22:21] (tto's patch will (re)build i18n, so it would be slower, but otherwise, why not)
[07:23:29] <_joe_> ugh so much slower
[07:23:33] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:943036|update noc README]], [[gerrit:951046|Use ClusterConfig]] (duration: 13m 46s)
[07:23:46] <_joe_> ok let's bite the bullet now
[07:23:51] <_joe_> jouncebot: nowandnext
[07:23:51] For the next 0 hour(s) and 36 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0700)
[07:23:51] In 0 hour(s) and 36 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0800)
[07:24:14] <_joe_> urbanecm: what's the change id?
[07:24:44] _joe_: 668156
[07:24:52] <_joe_> ok merging now
[07:24:57] <_joe_> any particular thing to test?
[07:25:05] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951047 (owner: Giuseppe Lavagetto)
[07:25:07] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:25:12] nope, it only impacts beta.
[07:25:14] I can test on enwiktionary beta if you tell me when?
[07:25:30] tto: it will be deployed there in ~30 minutes after the patch merges.
[07:25:32] <_joe_> tto: no need, if beta breaks it's not great but that's what it's for
[07:25:35] <_joe_> :)
[07:25:38] and that.
[07:25:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org
[07:25:50] Sure thing, thanks again for this!
[07:25:57] <_joe_> we break beta so we hopefully don't break production :)
[07:26:04] <_joe_> (emphasis on the hopefulness)
[07:26:31] (PS7) Giuseppe Lavagetto: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:26:54] (Merged) jenkins-bot: ClusterConfig: also allow to return hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/951047 (owner: Giuseppe Lavagetto)
[07:26:56] (CR) Urbanecm: [C: +2] "escalating to a +2 to record that it was my decision to go ahead despite a -1, based on Taavi's review and the fact that the recommendatio" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:27:10] (CR) TrainBranchBot: "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:27:33] (Merged) jenkins-bot: Enable PageNotice on enwiktionary beta [mediawiki-config] - https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: TTO)
[07:27:47] <_joe_> urbanecm: thanks for recording the decision in a comment, but I can handle timo in case of need :)
[07:28:00] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:951047|ClusterConfig: also allow to return hostname]], [[gerrit:668156|Enable PageNotice on enwiktionary beta (T61245)]]
[07:28:03] T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245
[07:28:13] <_joe_> 07:28:01 Running rebuildLocalisationCache.php as www-data
[07:28:32] <_joe_> well I guess the next patch might run shortly into the train
[07:29:04] afaik i18n rebuild's faster than it used to be, esp. when it's only few messages like in this case, but we'll see.
[07:29:20] and thank you for the offer!
[07:29:32] Do I need to stick around or are we all good?
[07:29:36] tto: from your end of things, this is now done. beta'll pick up the new configuration in about half an hour.
[07:30:01] Thanks again urbanecm and _joe_ and taavi! Everyone here today has been so helpful, it's been a pleasure to interact.
[07:30:13] glad we could help.
[07:30:32] (CR) DCausse: [C: +1] rdf-streaming-updater-k8s: Add egress rules to values (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: Bking)
[07:30:33] PROBLEM - Host mw2444 is DOWN: PING CRITICAL - Packet loss = 100%
[07:30:47] <_joe_> siigh mw2444
[07:30:54] <_joe_> I *hope* it's depooled already
[07:31:27] <_joe_> yeah, thankfully it is
[07:35:35] (CR) CI reject: [V: -1] Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:36:30] (PS16) Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798)
[07:36:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netmon1003.wikimedia.org
[07:37:35] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:40:43] <_joe_> rebuilding the l10n cache is indeed horrible
[07:40:48] :-(
[07:41:01] <_joe_> yeah that's one of the things we need to solve long-term
[07:41:03] do we know where we're at with that?
[07:41:16] <_joe_> another being how slow it is to push large images to our registry
[07:41:20] (with the rebuild, not with long term solution)
[07:41:30] <_joe_> ah yes, it's almost done
[07:41:49] great.
[07:41:49] <_joe_> we're now pulling this new large image to the k8s nodes
[07:42:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:43:26] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1153.eqiad.wmnet
[07:45:04] <_joe_> urbanecm: but every passage is terribly slow
[07:45:08] (CR) AikoChou: [C: +1] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: Elukey)
[07:45:09] :(
[07:45:09] <_joe_> even the sync-masters is
[07:45:14] <_joe_> it's just too much data
[07:45:31] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1153.eqiad.wmnet
[07:45:35] <_joe_> I would question if we really need to rebuild all l10n every time the extension list changes
[07:45:53] <_joe_> I kinda suspect the whole process can be optimized hard and there's a lot of low-hanging fruits there
[07:46:24] well, extension-list is the list of extensions we build i18n for. when adding a new one, one needs to build i18n for at least the extension that changed.
[07:49:16] but yeah, i agree that there's a lot of things to improve about i18n building.
[07:49:20] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1154.eqiad.wmnet
[07:50:44] <_joe_> urbanecm: I should've told you no :/
[07:51:01] <_joe_> syncing to the fleet is gonna end up eating all the remaining time
[07:51:21] <_joe_> syncing to mwdebug is taking forever
[07:51:24] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1154.eqiad.wmnet
[07:51:39] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1155.eqiad.wmnet
[07:52:06] :-( apologies for that.
[07:53:17] (PS4) Ayounsi: Junos: Add more info on commit errors [software/homer] - https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747)
[07:53:52] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1155.eqiad.wmnet
[07:54:05] <_joe_> urbanecm: not your fault
[07:54:21] <_joe_> but it's taking 15 minutes to sync this patch to the mwdebugs
[07:55:45] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:10] !log oblivian@deploy1002 tto and oblivian: Backport for [[gerrit:951047|ClusterConfig: also allow to return hostname]], [[gerrit:668156|Enable PageNotice on enwiktionary beta (T61245)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:56:17] T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245
[07:56:45] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1156.eqiad.wmnet
[07:57:04] (CR) Volans: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/947352
(https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [07:58:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [07:58:35] !log oblivian@deploy1002 tto and oblivian: Continuing with sync [07:58:44] <_joe_> ok, main sync starting [07:58:52] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1156.eqiad.wmnet [07:59:01] <_joe_> who's running the train today? [08:00:05] jnuche and hashar: How many deployers does it take to do MediaWiki train - Utc-0 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T0800). [08:00:22] morning [08:00:31] :-] [08:00:35] _joe_: I'll be running the train today with andre [08:00:59] take your time to finish the backports, there's no hurry :) [08:01:09] <_joe_> jnuche thanks [08:01:25] * andre o/ [08:01:59] <_joe_> andre: running the train? [08:02:05] yes [08:02:11] (03CR) 10Ayounsi: [C: 03+2] Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [08:02:42] well I guess pairing on it is a better description :] [08:03:32] <_joe_> jnuche: mostly it's out of my control; this patch will take the same time as the train to deploy [08:03:41] <_joe_> we deploy mediawiki in all the wrong ways. 
[08:03:59] (03Merged) 10jenkins-bot: Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [08:04:23] <_joe_> (and this is not to blame anyone, it's the result of sedimentation of tech debt and stuff) [08:06:09] <_joe_> basically the problem is a 40-line patch causing this: rsync transfer: average 5,705,916,078 bytes/host [08:06:20] <_joe_> 6 GB per host worth of transfer [08:06:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [08:07:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Pass ensure->present to the nftables class if selected [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff) [08:08:10] I guess it is because the first backport on Tuesday ends up syncing the whole new wmf mw version [08:09:16] <_joe_> hashar: this isn't the first backport [08:09:23] <_joe_> it's that this change rebuilds l10 [08:09:27] <_joe_> *l10n [08:09:41] <_joe_> and l10nupdate is "rebuild everything all the time" [08:09:46] ah yes [08:09:59] l10nupdate is indeed a bottleneck [08:09:59] <_joe_> so 18 LOC => 6 GB [08:10:12] <_joe_> yeah I suspect there's some quick wins there [08:10:15] (03PS1) 10Filippo Giunchedi: site: apply role to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) [08:10:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C [08:12:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2014.codfw.wmnet to cluster codfw and group C [08:12:22] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [08:13:18] (03CR) 10Filippo Giunchedi: [V: 03+1]
"PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43208/console" [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [08:13:24] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:951047|ClusterConfig: also allow to return hostname]], [[gerrit:668156|Enable PageNotice on enwiktionary beta (T61245)]] (duration: 45m 23s) [08:13:25] <_joe_> jnuche: almost done, can I squeeze in the next patch as well? [08:13:27] T61245: Review the PageNotice extension for deployment - https://phabricator.wikimedia.org/T61245 [08:13:45] _joe_: yeah, np, go ahead [08:13:46] (03PS1) 10Jelto: gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060) [08:13:48] <_joe_> thanks [08:13:59] (03PS4) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 [08:14:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 (owner: 10Giuseppe Lavagetto) [08:14:47] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43209/console" [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [08:14:56] (03Merged) 10jenkins-bot: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 (owner: 10Giuseppe Lavagetto) [08:15:23] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:951048|Replace calls to wfHostname with clusterconfig ones]] [08:15:27] (03PS1) 10Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) 
[08:17:00] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:951048|Replace calls to wfHostname with clusterconfig ones]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:18:45] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] site: apply role to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956782 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [08:18:48] !log oblivian@deploy1002 oblivian: Continuing with sync [08:20:33] 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) ganeti2014 was reimaged after the mainboard replacement and re-added to the cluster. [08:22:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [08:24:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060) (owner: 10Jelto) [08:24:39] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:951048|Replace calls to wfHostname with clusterconfig ones]] (duration: 09m 16s) [08:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:25:41] <_joe_> jnuche: all yours, sorry :/ [08:26:12] _joe_: perfect, thanks :) [08:26:19] (03PS2) 10Jelto: gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 (https://phabricator.wikimedia.org/T346060) [08:27:32] (03CR) 10Jelto: [C: 03+2] gitlab_runner: change docker_subnet in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/956784 
(https://phabricator.wikimedia.org/T346060) (owner: 10Jelto) [08:29:12] PROBLEM - Check systemd state on titan2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:57] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956787 (https://phabricator.wikimedia.org/T343728) [08:29:59] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956787 (https://phabricator.wikimedia.org/T343728) (owner: 10TrainBranchBot) [08:30:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:30:53] (03PS1) 10Filippo Giunchedi: hieradata: add titan hosts to thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) [08:31:02] (03CR) 10Jcrespo: "nitpicks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb) [08:32:26] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956787 (https://phabricator.wikimedia.org/T343728) (owner: 10TrainBranchBot) [08:32:52] (03PS6) 10Arnaudb: admin: Change defaults on Arnaud's homedir [puppet] - 10https://gerrit.wikimedia.org/r/956025 [08:34:06] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43210/console" [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [08:36:15] (03CR) 10Jcrespo: [C: 03+1] admin: Change defaults on Arnaud's homedir [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb) 
[08:38:10] PROBLEM - Check systemd state on titan1001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:38:42] that's me ^ filing task for this failure mode [08:38:43] !log rebalance Ganeti cluster in codfw/C following node replacement [08:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:10] (03CR) 10Vgutierrez: Release 9.2.1-1wm2 (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [08:39:27] (03CR) 10Jbond: [C: 03+1] "lgtm from a cookbook PoV, will need leave the opensearch specifics to the experts" [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [08:39:39] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.26 refs T343728 [08:39:41] T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728 [08:40:02] (03PS1) 10Brouberol: Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) [08:41:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52459 and previous config saved to /var/cache/conftool/dbconfig/20230912-084059-arnaudb.json [08:41:04] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:41:57] (03CR) 10Jcrespo: [C: 03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb) [08:43:14] RECOVERY - Check systemd state on titan2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:54] (03CR) 10Arnaudb: [C: 03+2] admin: Change defaults 
on Arnaud's homedir (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956025 (owner: 10Arnaudb) [08:43:56] RECOVERY - Check systemd state on titan1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:44:00] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1028.eqiad.wmnet [08:45:17] (03CR) 10Muehlenhoff: [C: 03+2] Pass ensure->present to the nftables class if selected [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff) [08:48:22] (03CR) 10Clément Goubert: [C: 03+2] mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [08:49:08] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43212/console" [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [08:49:20] (03Merged) 10jenkins-bot: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [08:49:50] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [08:49:54] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [08:50:08] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [08:50:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [08:50:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [08:50:41] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [08:50:51] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply 
[08:51:03] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [08:51:09] (03CR) 10JMeybohm: [C: 03+2] envoyproxy: tox.ini: whitelist_externals -> allowlist_externals [puppet] - 10https://gerrit.wikimedia.org/r/955876 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [08:51:16] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [08:51:27] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [08:51:28] !log mw-api-ext, mw-web: Raise total replicas to 14 - T341780 [08:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:31] T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 [08:52:14] (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid support for Kartographer on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) [08:52:39] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:53:27] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [08:53:34] jayme: ok to merge ? 
[08:53:57] hm,...I've an open puppet merge as well :) [08:54:08] huh I was about to merge something too lol [08:54:15] I'm gonna wait :D [08:55:36] (03CR) 10Klausman: [C: 03+1] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [08:56:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P52460 and previous config saved to /var/cache/conftool/dbconfig/20230912-085606-arnaudb.json [08:57:24] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:04] !log Sending 5% of global traffic to mw-on-k8s - T341780 [08:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:07] T341780: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 [08:58:26] !log Running puppet on cp-text P:trafficserver::backend - T341780 [08:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:37] PROBLEM - Check systemd state on titan1002 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:48] (03CR) 10Btullis: [C: 03+1] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [08:59:51] (03CR) 10Hashar: [C: 03+2] update_version: drop python 3.5, 3.6. Add 3.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955901 (owner: 10Hashar) [09:00:33] (03Merged) 10jenkins-bot: update_version: drop python 3.5, 3.6. 
Add 3.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/955901 (owner: 10Hashar) [09:00:35] (03Merged) 10jenkins-bot: update_version: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - 10https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [09:01:41] (03PS2) 10Klausman: profile::k8s::deployment_server: Add config for readability isvc [puppet] - 10https://gerrit.wikimedia.org/r/951460 (https://phabricator.wikimedia.org/T334182) [09:02:17] (03CR) 10MVernon: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [09:02:39] RECOVERY - Check systemd state on titan1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:12] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the new cookbook! A suggestion and a question inline, none is a blocker." [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [09:05:24] 10SRE: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634 (10ayounsi) FYI: * https://github.com/mwclient/mwclient/issues/302 * https://github.com/martin-majlis/Wikipedia-API/issues/63 [09:08:33] (03PS1) 10Filippo Giunchedi: Don't require dummy 'team' label for multi-owner alerts [alerts] - 10https://gerrit.wikimedia.org/r/956794 [09:11:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P52461 and previous config saved to /var/cache/conftool/dbconfig/20230912-091112-arnaudb.json [09:11:38] (03PS1) 10Muehlenhoff: nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) [09:12:01] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:12:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:12:28] (03CR) 10CI reject: [V: 04-1] nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:13:29] (03CR) 10ArielGlenn: [C: 03+1] "Er, merge whenever you feel like it, I know you've got +2 in here :-D" [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis) [09:15:09] (03PS2) 10Muehlenhoff: nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) [09:15:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance [09:15:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2115.codfw.wmnet with reason: Maintenance [09:17:01] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: route requests for geo-analytics via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [09:17:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10thiemowmde) Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everything fails with 502 or 503. Prominent examp... [09:18:19] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10RhinosF1) >>! 
In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everyt... [09:18:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:20:02] (03PS1) 10Jbond: wmnet: add puppetdb1002 to SRV record for eqiad [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) [09:20:05] (03PS1) 10Arnaudb: Revert "admin: Change defaults on Arnaud's homedir" [puppet] - 10https://gerrit.wikimedia.org/r/956806 [09:20:25] (03Abandoned) 10Arnaudb: Revert "admin: Change defaults on Arnaud's homedir" [puppet] - 10https://gerrit.wikimedia.org/r/956806 (owner: 10Arnaudb) [09:20:33] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Vgutierrez) >>! In T341780#9159438, @thiemowmde wrote: > Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, ever... 
[09:24:49] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43214/console" [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [09:26:11] !log jmm@cumin2002 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki [09:26:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52463 and previous config saved to /var/cache/conftool/dbconfig/20230912-092618-arnaudb.json [09:26:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:26:22] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:26:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:26:37] !log jmm@cumin2002 END (FAIL) - Cookbook sre.pki.restart-reboot (exit_code=99) rolling reboot on A:pki [09:26:39] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route requests for geo-analytics via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/954890 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [09:26:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52464 and previous config saved to /var/cache/conftool/dbconfig/20230912-092639-arnaudb.json [09:27:25] (03CR) 10Muehlenhoff: wmnet: add puppetdb1002 to SRV record for eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [09:27:33] (03CR) 10Btullis: [C: 03+1] services: remove mediawiki.revision-score from eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/956775 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [09:32:02] !log disabled puppet 
on A:cp [09:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:47] (03PS17) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) [09:39:02] (03CR) 10Physikerwelt: [C: 03+1] [mathoid] Switch image to GitLab-published one [deployment-charts] - 10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) (owner: 10Jforrester) [09:41:17] (03CR) 10Jgiannelos: [C: 03+1] Enable Parsoid support for Kartographer on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [09:44:23] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add titan hosts to thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [09:48:26] (03CR) 10Brouberol: [C: 03+2] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [09:48:39] (03CR) 10Brouberol: [V: 03+2 C: 03+2] Add dummy secrets for all new an-worker11(49->56).eqiad.wmnet hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/956789 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [09:48:55] (03PS1) 10Filippo Giunchedi: conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) [09:49:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:49:43] (03PS1) 10Muehlenhoff: 
SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 [09:50:28] (03PS2) 10Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) [09:50:44] (03PS3) 10Brouberol: Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) [09:51:33] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: allow external management of blackbox modules [puppet] - 10https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah) [09:52:26] !log jmm@cumin2002 START - Cookbook sre.pki.restart-reboot rolling reboot on A:pki [09:52:59] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [09:53:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. on all recursors [09:53:10] (03CR) 10Volans: [C: 03+1] "Yes 15 seems a bit too short and 30 should work. That said I wonder if we should poll instead to avoid timing issues." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (owner: 10Muehlenhoff) [09:54:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:54:21] 10SRE-tools, 10Spicerack: spicerrack.decorators.retry: dynamic_params_callbacks=(set_tries,) dfosn;t seem to work as epected - https://phabricator.wikimedia.org/T346134 (10jbond) p:05Triage→03Medium [09:55:45] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans) [09:56:42] (03CR) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes (0318 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [09:58:50] (03PS14) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [09:59:18] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache pki.discovery.wmnet. on all recursors [09:59:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) pki.discovery.wmnet. 
on all recursors [10:00:07] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1000) [10:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52465 and previous config saved to /var/cache/conftool/dbconfig/20230912-100138-root.json [10:02:20] !log enabling puppet on A:cp [10:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:53] (03CR) 10Majavah: [C: 03+2] prometheus: allow external management of blackbox modules [puppet] - 10https://gerrit.wikimedia.org/r/956037 (https://phabricator.wikimedia.org/T288067) (owner: 10Majavah) [10:04:45] (03CR) 10Volans: "Thanks for all the fixes" [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [10:05:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres on old nodes to ensure nothing hits them anyway [10:05:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on puppetdb1002.eqiad.wmnet with reason: Disable puppetdb/postgres on old nodes to ensure nothing hits them anyway [10:05:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=90305a26-47b2-42a2-abe5-284f8035bf3b) set by jmm@cumin2002 for... 
[10:08:29] (03PS1) 10Hnowlan: rest-gateway: add config to limit routes to a domain, limit aqs2 apis to wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/956826 (https://phabricator.wikimedia.org/T336400) [10:08:49] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 389 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [10:08:51] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:09:06] jbond: is that you? ^^^ [10:09:18] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:09:25] (03PS2) 10Jbond: wmnet: add puppetserver1002 to SRV record for eqiad [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) [10:09:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [10:09:38] (03CR) 10Jbond: wmnet: add puppetserver1002 to SRV record for eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [10:09:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on puppetdb2002.codfw.wmnet with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway [10:09:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2af641c9-48a3-42b7-8c75-56c12506718a) set by jmm@cumin2002 for... 
[10:12:32] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:37] volans: that's me [10:13:17] !log disabled nginx/puppetdb/postgresql/microservice on puppetdb1002/2002 to ensure nothing hits the old endpoints anymore [10:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.pki.restart-reboot (exit_code=0) rolling reboot on A:pki [10:13:31] ack thx [10:14:59] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 389 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [10:16:23] (03CR) 10Muehlenhoff: "And JFTR, I did a test-cookbook run with the 30s delay and that made the sre.pki.restart-reboot cookbook work fine, I'll merge this change" [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (owner: 10Muehlenhoff) [10:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52466 and previous config saved to /var/cache/conftool/dbconfig/20230912-101643-root.json [10:16:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet [10:17:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:17:51] (03PS3) 10Majavah: P:wmcs::metricsinfra: add config for custom blackbox scraping [puppet] - 10https://gerrit.wikimedia.org/r/956038 
(https://phabricator.wikimedia.org/T288067) [10:21:19] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [10:21:38] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [10:23:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet [10:23:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (owner: 10Muehlenhoff) [10:25:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [10:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52467 and previous config saved to /var/cache/conftool/dbconfig/20230912-103148-root.json [10:32:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [10:34:34] (03PS3) 10Majavah: conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463) [10:35:24] (03CR) 10Majavah: [C: 03+2] conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [10:36:10] 10SRE, 10Infrastructure-Foundations, 10Traffic: LVS servers using autoconf SLAAC IPv6 addresses - https://phabricator.wikimedia.org/T336505 (10cmooney) I'm gonna close this, I think we can probably deal with it under T102099. 
[10:37:43] !log taavi@cumin1001 conftool action : set/pooled=yes:weight=10; selector: cluster=cloudweb [10:38:59] 10SRE, 10Traffic, 10cloud-services-team, 10Patch-For-Review: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10taavi) `lang=shell-session taavi@cumin1001 ~ $ confctl select "cluster=(lab|cloud)web" get {"cloudweb1003.wikimedia.org": {"weight": 10, "pooled": "yes"}, "tag... [10:39:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [10:39:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [10:39:13] (03CR) 10Btullis: [C: 03+2] Add snapshot101[4-7] to the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/955931 (https://phabricator.wikimedia.org/T345907) (owner: 10Btullis) [10:39:22] (03PS3) 10Majavah: service: update labweb/cloudweb conftool pool name [puppet] - 10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) [10:40:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [10:40:57] (03CR) 10Jbond: [C: 03+2] wmnet: add puppetserver1002 to SRV record for eqiad [dns] - 10https://gerrit.wikimedia.org/r/956799 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [10:45:29] !log rebalance Ganeti cluster in eqiad/C following node reboots [10:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52468 and previous config saved to /var/cache/conftool/dbconfig/20230912-104652-root.json [10:53:12] (03PS1) 10Clément Goubert: mw-web: Raise apc size to 1536 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956830 
(https://phabricator.wikimedia.org/T341780) [10:53:45] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.decorators.retry: dynamic_params_callbacks=(set_tries,) dosn't seem to work as expected - https://phabricator.wikimedia.org/T346134 (10Volans) a:03Volans Yes the issue is that the `set_tries` defined in spicerack doesn't check the function s... [10:54:16] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10cmooney) >>! In T344259#9158656, @Eevans wrote: > This (actually) worked. Forcing the switch port to 100mbit was enough to successful PXE boot and reimage the server. On... [10:54:27] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org [10:55:57] (03CR) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [10:59:08] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43216/console" [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [10:59:11] (03PS18) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) [10:59:26] (03CR) 10Brouberol: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [11:00:43] (03CR) 10Brouberol: "The pcc run was successful https://puppet-compiler.wmflabs.org/output/956785/43216/" [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [11:01:04] (03CR) 10Clément Goubert: [C: 
03+2] mw-web: Raise apc size to 1536 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956830 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [11:01:57] (03Merged) 10jenkins-bot: mw-web: Raise apc size to 1536 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956830 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [11:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52470 and previous config saved to /var/cache/conftool/dbconfig/20230912-110157-root.json [11:02:06] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:02:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:03:02] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:03:11] (03PS1) 10FNegri: [openstack] bridge-utils workaround is bullseye-only [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) [11:03:12] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:03:57] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:04:05] (03PS2) 10FNegri: [openstack] bridge-utils workaround is bullseye-only [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) [11:07:22] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [11:08:28] (03CR) 10FNegri: "Not sure why this was only enabled in codfw and not in eqiad..." 
[puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [11:09:22] (03CR) 10Volans: [C: 03+1] "LGTM cookbook-wise, leaving the opensearch-specific details to your team." [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [11:09:24] (03CR) 10FNegri: "I found more info here: I74c57d49dc7790bee11cd6cb5d7de8c1d9e2b969" [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [11:09:37] RECOVERY - mediawiki-installation DSH group on snapshot1014 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:10:49] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] [openstack] bridge-utils workaround is bullseye-only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [11:11:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:10] (03CR) 10Brouberol: [C: 03+2] Register hadoop workers an-worker-1149->1156.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/956785 (https://phabricator.wikimedia.org/T343762) (owner: 10Brouberol) [11:16:04] (03CR) 10MVernon: [C: 03+2] conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [11:16:51] (03CR) 10MVernon: [C: 03+1] "Seems good to me. Will any of these end-points end up being swift-specific, and others thanos-specific in due course?" 
[puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [11:16:53] (03CR) 10Jbond: "This will need a parent change to update get_puppet_ca" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [11:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52471 and previous config saved to /var/cache/conftool/dbconfig/20230912-111702-root.json [11:17:29] (03PS1) 10Hnowlan: api-gateway: emit cache-control header for 404s [deployment-charts] - 10https://gerrit.wikimedia.org/r/956833 (https://phabricator.wikimedia.org/T336400) [11:18:59] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudservices1004.wikimedia.org [11:22:22] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1004: decomission [puppet] - 10https://gerrit.wikimedia.org/r/956834 (https://phabricator.wikimedia.org/T346033) [11:24:43] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:27:59] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) @aborrero feel free to close this one if it's not being worked on, the status... 
[11:30:05] PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52472 and previous config saved to /var/cache/conftool/dbconfig/20230912-113207-root.json [11:32:12] (03PS2) 10Filippo Giunchedi: conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) [11:32:42] (03CR) 10Filippo Giunchedi: conftool-data: add titan hosts to thanos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [11:33:24] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: add titan hosts to thanos [puppet] - 10https://gerrit.wikimedia.org/r/956800 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [11:35:06] !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan1001.eqiad.wmnet [11:35:13] !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan1002.eqiad.wmnet [11:35:25] !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan2001.codfw.wmnet [11:35:26] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10aborrero) 05Open→03Declined OK, closing for now and hoping some more modern BGP-bas... 
[11:35:35] !log filippo@cumin1001 conftool action : set/weight=10; selector: name=titan2002.codfw.wmnet [11:35:37] PROBLEM - Check systemd state on an-worker1149 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:18] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan* [11:36:36] that did nothing ^ but announced itself anyways [11:36:43] RECOVERY - Check systemd state on an-worker1149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:14] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan1001.eqiad.wmnet [11:37:15] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan1002.eqiad.wmnet [11:37:16] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan2001.codfw.wmnet [11:37:17] !log filippo@cumin1001 conftool action : set/weight=100; selector: name=titan2002.codfw.wmnet [11:38:25] RECOVERY - mediawiki-installation DSH group on snapshot1016 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:39:08] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1001.eqiad.wmnet,service=thanos-web [11:40:14] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=titan1002.eqiad.wmnet,service=thanos-web [11:40:45] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1002.eqiad.wmnet,service=thanos-web [11:41:03] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-datanode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:15] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 7 hosts with reason: Mute initial failures of hadoop-hdfs-datanode.service [11:41:32] !log brouberol@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 7 hosts with reason: Mute initial failures of hadoop-hdfs-datanode.service [11:42:23] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1002.eqiad.wmnet,service=thanos-web [11:42:28] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1001.eqiad.wmnet,service=thanos-web [11:43:39] !log pool titan hosts alongside thanos-fe for thanos-query / thanos-web services - T341999 [11:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:43] T341999: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 [11:45:25] RECOVERY - mediawiki-installation DSH group on snapshot1017 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52473 and previous config saved to /var/cache/conftool/dbconfig/20230912-114711-root.json [11:47:31] RECOVERY - mediawiki-installation DSH group on snapshot1015 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:49:15] RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:38] (03PS2) 10Muehlenhoff: SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (https://phabricator.wikimedia.org/T346134) [11:49:59] (03CR) 10Muehlenhoff: SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (https://phabricator.wikimedia.org/T346134) (owner: 10Muehlenhoff) [11:50:53] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:38] (03PS1) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [11:53:38] (03CR) 10Muehlenhoff: [C: 03+2] SREDiscoveryNoLVSBatchRunnerBase: Bump waiting interval to 30s [cookbooks] - 10https://gerrit.wikimedia.org/r/956801 (https://phabricator.wikimedia.org/T346134) (owner: 10Muehlenhoff) [11:56:59] (03CR) 10Muehlenhoff: P:idm allow for installation via Debian packages. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [11:57:27] !log pool thanos[12]001 for thanos.w.o - T341999 [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:30] T341999: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 [11:58:42] (03CR) 10Muehlenhoff: "Could you please also update the Cumin aliases? Likely A:thanos-fe should also include the titan role and not sure if we need a separate al" [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [11:59:45] (03CR) 10Muehlenhoff: [C: 03+2] nftables::set: Align names with the resource title and nftables::input [puppet] - 10https://gerrit.wikimedia.org/r/956797 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:59:48] !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1004.wikimedia.org [11:59:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1004: decomission [puppet] - 10https://gerrit.wikimedia.org/r/956834 (https://phabricator.wikimedia.org/T346033) (owner: 10Arturo Borrero Gonzalez) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1200) [12:02:27] (03PS1) 10Filippo Giunchedi: cumin: add titan aliases [puppet] - 10https://gerrit.wikimedia.org/r/956840 
(https://phabricator.wikimedia.org/T341999) [12:03:22] (03PS1) 10Elukey: profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) [12:03:34] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: add titan hosts to thanos frontends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956788 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [12:04:45] (03PS2) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [12:04:50] (03PS2) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [12:04:52] (03PS1) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) [12:04:54] (03PS2) 10Elukey: profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) [12:05:10] (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [12:07:22] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [12:09:48] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001" [12:11:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - aborrero@cumin1001" [12:11:46] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:47] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudservices1004.wikimedia.org [12:12:05] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [12:12:44] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [12:13:00] (03PS3) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [12:14:14] (03PS2) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) [12:14:16] (03PS3) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [12:14:31] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:15:01] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:15:04] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM, phab_deploy_finalize.sh is in the sudo list of the deployment user, so we don't need to add a rule for the new restart command" [puppet] - 10https://gerrit.wikimedia.org/r/956486 (https://phabricator.wikimedia.org/T314460) (owner: 10Brennen Bearnes) [12:15:15] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:15:43] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:15:59] (03PS3) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) [12:16:02] (03PS4) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [12:16:53] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:04] (03PS4) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) [12:18:06] (03PS5) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA 
signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [12:18:34] (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:18:36] i'm getting "Original error: upstream connect error or disconnect/reset before headers. reset reason: overflow " on meta-wiki [12:19:32] * taavi klaxons [12:19:52] * volans here [12:19:55] hi [12:20:07] (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:12] checking [12:20:26] here too and checking [12:20:29] Thanks for paging taavi [12:20:31] I also get "Error [12:20:31] Our servers are currently under maintenance or experiencing a technical problem" on office wiki. 
[12:20:44] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [12:20:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:20:47] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet are marked down but pooled: testlb6_443: Serve [12:20:47] 4.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:21:03] (ProbeDown) firing: (8) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:21:17] 
PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are ma [12:21:17] n but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:21:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:21:39] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:22:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:22:51] PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - no socket TCP[103.102.166.8] Connection timed out https://wikitech.wikimedia.org/wiki/DNS [12:23:01] RECOVERY - 
Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:23:09] (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [12:23:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:13] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:23:16] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:23:16] (MediaWikiLatencyExceeded) firing: (2) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:27] 
RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:23:55] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:23:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [12:24:09] RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [12:24:10] !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [12:24:29] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:24:54] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) re: the last point, namely cleaning up thanos components off thanos-fe (therefore l... [12:25:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10IPv6, 10User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10cmooney) In the medium term I think we need to carefully consider how this operates, probably as part of a move away from using... 
[12:25:07] (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:26:03] (ProbeDown) firing: (22) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:26:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:26:13] (03CR) 10Cathal Mooney: [C: 03+2] Add static network defs and DHCP config for new codfw subnets [puppet] - 10https://gerrit.wikimedia.org/r/954896 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:27:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:27:33] (JobUnavailable) firing: (3) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:28:09] (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [12:28:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:28:16] (MediaWikiLatencyExceeded) resolved: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:22] (03CR) 10Abijeet Patro: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [12:28:58] (03CR) 10KartikMistry: Enable MinT translation service on MediaWiki - rollout #4 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [12:29:03] (03CR) 10Abijeet Patro: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [12:30:14] (03PS4) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [12:30:44] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [12:30:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:30:50] (03PS2) 10Filippo Giunchedi: cumin: add titan aliases [puppet] - 10https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) [12:30:54] (03PS1) 10Filippo Giunchedi: alerting_host: add titan to thanos-query hosts [puppet] - 10https://gerrit.wikimedia.org/r/956866 (https://phabricator.wikimedia.org/T341999) [12:30:58] (03PS1) 10Filippo Giunchedi: titan: move pyrra off thanos role [puppet] - 10https://gerrit.wikimedia.org/r/956867 (https://phabricator.wikimedia.org/T341999) [12:31:03] (ProbeDown) resolved: (22) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:14] (03PS3) 10Abijeet Patro: 
Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) [12:31:22] (03CR) 10Cathal Mooney: [C: 03+2] Homer YAML additions for new row A/B switches in Codfw [homer/public] - 10https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:31:34] (03CR) 10Abijeet Patro: Enable MinT translation service on MediaWiki - rollout #4 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [12:31:35] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:32:07] btw, thanks taavi for the advance notice ;) [12:33:20] (03CR) 10KartikMistry: [C: 03+1] Enable MinT translation service on MediaWiki - rollout #4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956807 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [12:33:34] (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:34:53] (03PS1) 10Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) [12:35:18] (03CR) 10CI reject: [V: 04-1] nftables::sets: Don't add elements if no addresses are 
passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:36:51] (03PS2) 10Kevin Bazira: ml-services: increase the recommendation-api-ng memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) [12:40:09] (03Merged) 10jenkins-bot: Homer YAML additions for new row A/B switches in Codfw [homer/public] - 10https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [12:40:23] !log brouberol@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [12:41:55] (03PS3) 10Kevin Bazira: ml-services: increase the recommendation-api-ng memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) [12:42:09] (03PS2) 10Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) [12:42:33] (03CR) 10CI reject: [V: 04-1] nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:42:59] (03CR) 10Filippo Giunchedi: [C: 03+2] alerting_host: add titan to thanos-query hosts [puppet] - 10https://gerrit.wikimedia.org/r/956866 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [12:46:29] (03PS3) 10Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) [12:46:54] (03CR) 10CI reject: [V: 04-1] nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:49:03] (ProbeDown) firing: 
(2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:21] (03PS4) 10Muehlenhoff: nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) [12:54:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:54:43] (03PS2) 10Jelto: gitlab: Fix conditional end in gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [12:55:07] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:45] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43224/console" [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [12:56:54] (03PS5) 10JMeybohm: k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) [12:56:56] (03PS6) 10JMeybohm: kubernetes::master: 
Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [12:58:36] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43225/console" [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:59:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343198)', diff saved to https://phabricator.wikimedia.org/P52474 and previous config saved to /var/cache/conftool/dbconfig/20230912-125932-arnaudb.json [12:59:43] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1300). [13:00:06] ihurbain and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] o/ [13:00:19] hello [13:00:24] * TheresNoTime is here [13:00:29] * ihurbain is here too [13:00:48] taavi: you're welcome to deploy, I'm mid-email [13:00:52] note: this is the first time i have stuff in the backport window ever, so I'M SCAAAARED :P [13:01:09] (and i may ask stupid questions.) 
[13:01:19] TheresNoTime: sure, will ping you when you're clear to deploy your patch [13:01:31] ihurbain: sure, no worries! do you have the x-wikimedia-debug browser extension installed? [13:01:40] taavi: i do [13:01:42] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: better handler label [puppet] - 10https://gerrit.wikimedia.org/r/956875 [13:01:43] ihurbain: cliché, but there are no stupid questions when it comes to deploying to production ^^ [13:01:50] _fine_ :D [13:02:19] great, I'll ping you when your patch can be tested using it. will take a few minutes or so [13:02:19] (03CR) 10Jelto: [V: 03+1] "SSL_CERT_DIR is needed only with object storage enabled. What do you think about adding an additional if to the SSL_CERT_DIR?" [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [13:02:26] ack! [13:02:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:02:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:03:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/956875 (owner: 10Kamila Součková) [13:04:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:04:31] (03Merged) 10jenkins-bot: Enable Parsoid support for Kartographer on enwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/956792 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [13:05:03] !log taavi@deploy1002 Started scap: Backport for [[gerrit:956792|Enable Parsoid support for Kartographer on enwiki (T342871)]] [13:05:07] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:06:34] !log taavi@deploy1002 ihurbain and taavi: Backport for [[gerrit:956792|Enable Parsoid support for Kartographer on enwiki (T342871)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:06:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [13:07:21] ihurbain: your patch can now be tested. so open the extension pop-up and change the giant on-off switch to 'ON', all other settings should be fine as is [13:07:27] ok [13:07:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:09:21] and let me know when you've verified that your change works properly [13:09:39] yup, going through a few pages, shouldn't take more than a few minutes [13:09:55] sure [13:10:56] !log installing grub2 updates from Bullseye point release [13:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:10] taavi: do you happen to know which grafana graph the images from T345414 are from? 
(JS content size) [13:12:11] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [13:12:15] (03PS1) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 [13:12:51] TheresNoTime: uhh sorry, I do not [13:13:04] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 (owner: 10Hashar) [13:13:10] (03PS2) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 [13:14:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P52475 and previous config saved to /var/cache/conftool/dbconfig/20230912-131438-arnaudb.json [13:14:39] (03CR) 10Filippo Giunchedi: [C: 03+2] cumin: add titan aliases [puppet] - 10https://gerrit.wikimedia.org/r/956840 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [13:14:47] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add config to limit routes to a domain, limit aqs2 apis to wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/956826 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [13:15:48] (03PS2) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 [13:15:50] (03Merged) 10jenkins-bot: rest-gateway: add config to limit routes to a domain, limit aqs2 apis to wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/956826 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [13:16:56] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10MatthewVernon) Yeah, if it's 
easier to just reimage them (//especially// if that can be done wi... [13:17:41] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:43] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/770464 (owner: 10Hashar) [13:18:56] taavi: we're good and happy [13:19:06] !log taavi@deploy1002 ihurbain and taavi: Continuing with sync [13:19:17] awesome. now i'm syncing your changes to the entire cluster [13:19:26] woot! [13:21:11] (03PS3) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 [13:21:51] (03PS1) 10Samtar: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956809 (https://phabricator.wikimedia.org/T345414) [13:23:11] (^ oops, forgot to cherry pick) [13:24:29] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [13:25:23] 10SRE-swift-storage, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi New hosts are in service, resolving [13:25:46] (03CR) 10Herron: [C: 03+1] "thanks, I was thinking we should do the same thing!" 
[puppet] - 10https://gerrit.wikimedia.org/r/956867 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [13:26:21] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi) [13:26:23] (03CR) 10Filippo Giunchedi: [C: 03+2] titan: move pyrra off thanos role [puppet] - 10https://gerrit.wikimedia.org/r/956867 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [13:26:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:44] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [13:27:13] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [13:27:19] hmm sync-apaches is taking a while [13:27:50] * ihurbain gently shakes apaches [13:28:08] things still settling after the outage maybe? 
[13:28:20] ah, new snapshot hosts added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/955931 that probably are somewhat outdated [13:29:15] there we go, now it's doing the even normally slow stuff (php-fpm-restart) [13:29:43] (03PS4) 10Elukey: slo_template: hardcode time window for SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/956878 (https://phabricator.wikimedia.org/T346144) [13:29:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P52476 and previous config saved to /var/cache/conftool/dbconfig/20230912-132944-arnaudb.json [13:31:08] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:956792|Enable Parsoid support for Kartographer on enwiki (T342871)]] (duration: 26m 05s) [13:31:11] T342871: Parsoid + Kartographer roll-out plan - https://phabricator.wikimedia.org/T342871 [13:31:12] ihurbain: your patch is now live [13:31:14] TheresNoTime: your turn [13:31:20] yaaay! thanks taavi :) [13:31:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:58] taavi: are you okay to deploy or..? [13:32:06] (03CR) 10Gehel: "I know I'm late, but a few comments inline anyway." [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [13:32:28] oh I somehow thought you were going to self-deploy your patch? 
[13:32:47] (03CR) 10Brouberol: [C: 03+2] Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [13:32:54] taavi: I can do, just not SSH'd in atm etc [13:33:18] (I need to disappear into a meeting in a few, so would prefer not to deploy that) [13:33:40] taavi: ack, will do it :) [13:34:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956809 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar) [13:35:32] (03Merged) 10jenkins-bot: Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [13:35:51] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:36:04] (03Merged) 10jenkins-bot: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956809 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar) [13:36:06] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:36:35] !log samtar@deploy1002 Started scap: Backport for [[gerrit:956809|Reduce initial payload of Phonos styles (T345414)]] [13:36:38] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [13:38:06] !log samtar@deploy1002 samtar: Backport for [[gerrit:956809|Reduce initial payload of Phonos styles (T345414)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:38:20] !log hnowlan@deploy1002 helmfile [codfw] START 
helmfile.d/services/rest-gateway: apply [13:38:21] * TheresNoTime testing [13:38:44] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:39:04] !log samtar@deploy1002 samtar: Continuing with sync [13:39:44] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:40:03] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:42:20] (03CR) 10Gehel: "minor comments inline (yes, I'm late to the party!)" [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [13:42:26] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [13:42:41] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi) 05Resolved→03Open I was a little too hasty here, I forgot we need raid0 on these hosts to be... 
[13:42:56] (03PS2) 10Ssingh: Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) [13:43:16] (03CR) 10Ssingh: Release 9.2.1-1wm2 (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [13:43:59] (03PS1) 10Filippo Giunchedi: install_server: restore raid1 for thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/956886 (https://phabricator.wikimedia.org/T346143) [13:44:01] PROBLEM - puppet last run on testreduce1001 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:44:01] (03PS1) 10Filippo Giunchedi: install_server: use raid0 for titan [puppet] - 10https://gerrit.wikimedia.org/r/956887 (https://phabricator.wikimedia.org/T341999) [13:44:03] (03PS1) 10Filippo Giunchedi: conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143) [13:44:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343198)', diff saved to https://phabricator.wikimedia.org/P52477 and previous config saved to /var/cache/conftool/dbconfig/20230912-134451-arnaudb.json [13:44:54] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:45:35] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:956809|Reduce initial payload of Phonos styles (T345414)]] (duration: 08m 59s) [13:45:38] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [13:46:33] !log UTC afternoon backport window closed [13:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:12] (03PS7) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 
10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [13:48:14] (03PS1) 10JMeybohm: kubernetes::master: control-plane components should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) [13:48:26] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [13:49:09] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use raid0 for titan [puppet] - 10https://gerrit.wikimedia.org/r/956887 (https://phabricator.wikimedia.org/T341999) (owner: 10Filippo Giunchedi) [13:49:14] (03PS2) 10Filippo Giunchedi: install_server: use raid0 for titan [puppet] - 10https://gerrit.wikimedia.org/r/956887 (https://phabricator.wikimedia.org/T341999) [13:50:17] (03CR) 10Ssingh: [C: 03+2] hiera: remove references to nsa.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:50:33] !log disable puppet on A:dns-rec [13:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:08] (03PS3) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) [13:51:34] (03CR) 10CI reject: [V: 04-1] [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [13:53:05] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:46] (03PS4) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) [13:54:10] (03CR) 10CI reject: [V: 04-1] [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 
(https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [13:55:38] (03PS2) 10JMeybohm: kubernetes::master: control-plane components should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) [13:55:40] (03PS8) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [13:56:11] (03PS5) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) [13:56:26] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:57:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:57:58] (03PS1) 10Hnowlan: trafficserver: route to wikifeeds via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/956895 (https://phabricator.wikimedia.org/T339119) [13:58:16] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43228/console" [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [13:58:26] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [14:00:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [14:00:46] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [14:00:53] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43229/console" [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) 
[14:01:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10Jclark-ctr) a:03Jclark-ctr [14:02:02] !log enable puppet on doh6001 to test nsa removal [14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:15] !log [correction] enable puppet on dns6001 to test nsa removal [14:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [14:04:11] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [14:07:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1027.eqiad.wmnet'] [14:07:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:45] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) I worked on some of the items for this week. I let @UOzurumba check on the done elements. Distribut... 
[14:09:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1028.eqiad.wmnet'] [14:09:50] (03PS6) 10FNegri: [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) [14:10:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet'] [14:10:04] RECOVERY - puppet last run on testreduce1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:10:05] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet'] [14:10:11] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1030.eqiad.wmnet'] [14:10:20] !log enable puppet on dns-rec to progressively roll out nsa->ns2 updates [14:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] 10SRE, 10Infrastructure-Foundations, 10netops: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (10cmooney) 05Open→03Resolved a:03cmooney Thanks all, config applied now. @volans I left the timeout at 30 mins. I think (esp. in an emergency situation) it's not unlikely yo... 
[14:11:11] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1031.eqiad.wmnet'] [14:11:39] (03CR) 10FNegri: [openstack] bridge-utils config in bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [14:13:30] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:00] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba) [14:15:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1032.eqiad.wmnet'] [14:15:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1027.eqiad.wmnet'] [14:15:56] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:00] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1033.eqiad.wmnet'] [14:16:41] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9159690, @cmooney wrote: >>>! In T344259#9158656, @Eevans wrote: >> This (actually) worked. Forcing the switch port to 100mbit was enough to succes... 
[14:17:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:00] (03CR) 10MVernon: [C: 03+1] "This matches what we do with ms-fe, which feels like the right answer." [puppet] - 10https://gerrit.wikimedia.org/r/956886 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [14:18:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1030.eqiad.wmnet'] [14:18:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1028.eqiad.wmnet'] [14:18:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1034.eqiad.wmnet'] [14:19:22] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1035.eqiad.wmnet'] [14:21:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1031.eqiad.wmnet'] [14:22:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet'] [14:22:22] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba) [14:22:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet'] [14:23:46] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1032.eqiad.wmnet'] [14:24:13] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet'] [14:24:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1033.eqiad.wmnet'] [14:24:31] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet'] [14:24:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1037.eqiad.wmnet'] [14:24:51] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: restore raid1 for thanos-fe [puppet] - 10https://gerrit.wikimedia.org/r/956886 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [14:24:59] (03CR) 10Majavah: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [14:25:24] (03CR) 10FNegri: [C: 03+2] P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 (https://phabricator.wikimedia.org/T334929) (owner: 10Majavah) [14:25:36] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1038.eqiad.wmnet'] [14:26:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST replicasets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:27:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1034.eqiad.wmnet'] [14:27:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['kubernetes1039.eqiad.wmnet'] [14:27:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1035.eqiad.wmnet'] [14:27:51] (03PS2) 10Ssingh: wikimedia.org: remove nsa.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219) [14:27:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1040.eqiad.wmnet'] [14:30:41] !log installing libssh2 security updates [14:30:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1029.eqiad.wmnet'] [14:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1036.eqiad.wmnet'] [14:31:05] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1041.eqiad.wmnet'] [14:31:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1042.eqiad.wmnet'] [14:31:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:31:37] (03CR) 10Kamila Součková: [C: 03+2] benthos/mw_accesslog_metrics: better handler label [puppet] - 10https://gerrit.wikimedia.org/r/956875 (owner: 10Kamila Součková) [14:33:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1037.eqiad.wmnet'] [14:34:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for 
hosts ['kubernetes1038.eqiad.wmnet'] [14:35:48] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1039.eqiad.wmnet'] [14:36:14] (03PS4) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [14:36:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1040.eqiad.wmnet'] [14:37:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10MoritzMuehlenhoff) I've set the Netbox status back to Active. [14:38:38] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: furud.codfw.wmnet [14:38:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: furud.codfw.wmnet [14:38:40] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:39:01] (03PS1) 10JMeybohm: Add systemd dependencies to kube-apiserver [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) [14:39:36] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1041.eqiad.wmnet'] [14:39:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) 
upgrade firmware for hosts ['kubernetes1042.eqiad.wmnet'] [14:39:46] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) Just to mention here, but the restriction described in T322937#8847201 no longer seems to be the case. In codfw with devices on JunOS 22.2R3.... [14:39:55] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet [14:40:13] (03PS1) 10Herron: titan: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/956901 [14:40:15] (03PS1) 10Herron: titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902 [14:40:17] (03CR) 10CI reject: [V: 04-1] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [14:40:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1043.eqiad.wmnet'] [14:40:25] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1044.eqiad.wmnet'] [14:40:32] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1045.eqiad.wmnet'] [14:40:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1046.eqiad.wmnet'] [14:40:40] (03CR) 10CI reject: [V: 04-1] titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902 (owner: 10Herron) [14:40:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1047.eqiad.wmnet'] [14:40:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1048.eqiad.wmnet'] [14:41:01] (03PS1) 10FNegri: Remove unused toolsdb roles [puppet] - 10https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929) 
[14:42:55] !log installing Linux 6.1.52 on Bookworm hosts [14:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:26] (03PS2) 10FNegri: [toolsdb] Remove unused primary/secondary roles [puppet] - 10https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929) [14:43:44] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr) @aborrero. I am available to move it This week tomorrow or Thursday. [14:46:13] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet [14:48:30] (03PS2) 10Filippo Giunchedi: conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143) [14:48:34] (03PS1) 10Filippo Giunchedi: thanos: move rule evaluation to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143) [14:48:38] (03PS1) 10Filippo Giunchedi: thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) [14:48:42] (03PS1) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) [14:49:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1045.eqiad.wmnet'] [14:49:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1048.eqiad.wmnet'] [14:49:31] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1043.eqiad.wmnet'] [14:49:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=0) upgrade firmware for hosts ['kubernetes1047.eqiad.wmnet'] [14:49:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1046.eqiad.wmnet'] [14:49:38] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1049.eqiad.wmnet'] [14:49:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1044.eqiad.wmnet'] [14:50:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1054.eqiad.wmnet'] [14:50:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1053.eqiad.wmnet'] [14:50:11] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1052.eqiad.wmnet'] [14:50:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1050.eqiad.wmnet'] [14:50:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1051.eqiad.wmnet'] [14:54:29] (03PS2) 10Herron: titan: add pyrra/slo envoy/cfssl config [puppet] - 10https://gerrit.wikimedia.org/r/956902 [14:55:14] !log dancy@deploy1002 Installing scap version "4.61.0" for 596 hosts [14:55:30] PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:26] !log dancy@deploy1002 Installing scap version "4.61.0" for 595 hosts [14:56:58] RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:38] !log dancy@deploy1002 Installation of scap version "4.61.0" completed for 595 hosts 
[14:57:53] !log add 30G to prometheus@services and 300G to prometheus@ops (codfw) [14:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] (03PS1) 10Cathal Mooney: Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - 10https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937) [14:58:54] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1049.eqiad.wmnet'] [14:58:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1052.eqiad.wmnet'] [14:58:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1053.eqiad.wmnet'] [14:59:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1054.eqiad.wmnet'] [14:59:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1050.eqiad.wmnet'] [14:59:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1051.eqiad.wmnet'] [14:59:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:00:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1055.eqiad.wmnet'] [15:01:03] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes1056.eqiad.wmnet'] [15:01:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): 
Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10aborrero) [15:02:50] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43232/console" [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron) [15:03:25] (03PS1) 10Hnowlan: trafficserver: route requests to mediarequests service [puppet] - 10https://gerrit.wikimedia.org/r/956909 (https://phabricator.wikimedia.org/T336380) [15:04:12] (03PS3) 10Filippo Giunchedi: conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143) [15:04:14] (03PS2) 10Filippo Giunchedi: thanos: move rule evaluation to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143) [15:04:16] (03PS2) 10Filippo Giunchedi: thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) [15:04:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:18] (03PS2) 10Filippo Giunchedi: thanos: remove thanos components from thanos::frontend role [puppet] - 10https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) [15:05:03] herron: re: titan and cfssl please hold off, I'll let you know once I'm done with the transition [15:05:20] the merging that is [15:05:43] godog: will do, was about to add you to the patch, I'll wait on your +1 [15:05:46] PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:08] cheers [15:07:38] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43233/console" [puppet] - 10https://gerrit.wikimedia.org/r/956902 (owner: 10Herron) [15:09:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:09:24] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:09:25] (03CR) 10Elukey: Add systemd dependencies to kube-apiserver (032 comments) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:09:28] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1055.eqiad.wmnet'] [15:10:49] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:12:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [15:12:53] (03CR) 10JMeybohm: Add systemd dependencies to kube-apiserver (031 comment) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:13:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes1056.eqiad.wmnet'] [15:14:18] (KubernetesAPILatency) resolved: (3) 
High Kubernetes API latency (LIST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:33] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10LSobanski) @Urbanecm what ongoing support would you envision beyond setting up the VM and some sort of a deployment method and keeping up to date with... [15:15:02] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10VRiley-WMF) db1236 - A 5. U 27. port 45 CableID 3160 db1237 - A 5. U 33. port 33 CableID 1962 db1238 - B 5. U 14. port 23 CableID 4048 db1239 - B 5. U 15. 
port 09 CableID 3793 [15:18:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [15:19:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [15:19:16] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:19:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [15:19:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [15:21:18] RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:22:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [15:22:45] (03CR) 10Elukey: "Left a couple of comments to better understand this :)" [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:23:30] (03CR) 10Elukey: "Is there a rationale for this, or is it just a simplification of the workflow?" 
[puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:24:38] (03CR) 10Gehel: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [15:25:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [15:27:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [15:27:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [15:28:03] (03PS5) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [15:28:20] (03CR) 10JMeybohm: [V: 03+1] kubernetes::master: control-plane components should use the local api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:30:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [15:31:25] (03PS1) 10Ssingh: varnish:common: add bookworm version for Python [puppet] - 10https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154) [15:32:04] (03CR) 10CI reject: [V: 04-1] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [15:32:35] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43234/console" [puppet] - 10https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [15:33:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [15:39:04] !log jmm@cumin2002 START 
- Cookbook sre.hosts.reboot-single for host people2003.codfw.wmnet [15:39:34] (03CR) 10JMeybohm: [V: 03+1] k8s::apiserver: Use a separate systemd service for safe restarts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:41:03] (03PS1) 10Samtar: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) [15:41:49] (03PS2) 10Jdlrobson: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar) [15:42:06] (03PS1) 10Brouberol: Fix typo in allowed cluster group aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) [15:43:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2003.codfw.wmnet [15:43:23] jouncebot: nowandnext [15:43:23] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [15:43:23] In 0 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1600) [15:43:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1004.eqiad.wmnet [15:44:17] 10SRE, 10SRE-Access-Requests: Requesting access to shell/dcops for VRiley - https://phabricator.wikimedia.org/T346077 (10RobH) 05Open→03Resolved [15:46:15] (03PS3) 10Dduvall: gitlab: Fix conditional end in gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/956515 [15:47:00] (03CR) 10Dduvall: gitlab: Fix conditional end in gitlab.rb template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [15:47:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10lmata) Thank you! 
[15:47:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1004.eqiad.wmnet [15:51:38] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [15:54:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:54:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:54:56] hm [15:55:56] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) >>! In T346042#9160486, @Jclark-ctr wrote: > @aborrero. I am available to move it This week tomorrow or Thursday. OK! Let's do it to... [15:59:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:00:07] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1600). [16:00:07] No Gerrit patches in the queue for this window AFAICS. 
[16:01:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:38] (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:01:53] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:03:03] <_joe_> uh this alert seems serious [16:03:09] (03CR) 10Elukey: k8s::apiserver: Use a separate systemd service for safe restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [16:03:36] <_joe_> Krinkle: ^^ [16:03:45] (03CR) 10Majavah: [C: 03+1] [toolsdb] Remove unused primary/secondary roles [puppet] - 10https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929) (owner: 10FNegri) [16:04:08] (03CR) 10FNegri: [C: 03+2] [toolsdb] Remove unused primary/secondary roles [puppet] - 10https://gerrit.wikimedia.org/r/956903 (https://phabricator.wikimedia.org/T334929) (owner: 10FNegri) [16:06:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:06:29] (03CR) 10Andrew Bogott: [C: 03+1] [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [16:06:38] (DatasourceError) firing: 
Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:06:53] (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:07:14] https://grafana.wikimedia.org/d/_L_3fQh4z/cross-dc-traffic-alerts?orgId=1&from=now-24h&to=now [16:07:30] (03CR) 10BBlack: [C: 03+1] wikimedia.org: remove nsa.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [16:08:00] (03PS2) 10C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [16:08:07] (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:07] (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:08:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1389.eqiad.wmnet, 
mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1371.eqiad.wmnet, mw1397.eqiad.wmnet, mw1455.eqiad.wmnet, mw1475.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1349.eqiad.wmnet, mw1452.eqiad.wmnet, mw1387.eqiad.wmnet, mw1456.eqiad.wmnet, mw1430.eqiad.wmnet, mw1476.eqiad.wmnet, mw1480.eqiad.wmnet, mw [16:08:30] ad.wmnet, mw1352.eqiad.wmnet, mw1413.eqiad.wmnet, mw1441.eqiad.wmnet, mw1368.eqiad.wmnet, mw1420.eqiad.wmnet, mw1366.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, mw1487.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1496.eqiad.wmnet, mw1401.eqiad.wmnet, mw1395.eqiad.wmnet, mw1403.eqiad.wmnet, mw1409.eqiad.wmnet, mw1411.eqiad.wmnet, mw1417.eqiad.wmnet, mw1473.eqiad [16:08:30] mw1399.eqiad.wmnet, mw1479.eqiad.wmnet, mw1353.eqiad.wmnet, mw1477.eqiad.wmnet, mw1416.eqiad.wmnet, mw1472.eqiad.wmnet, mw1373.eqiad.wmnet, mw1350.eqiad.wmnet, mw1453.eqiad.wmnet, mw143 https://wikitech.wikimedia.org/wiki/PyBal [16:08:40] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5022.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp501 [16:08:40] wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but [16:08:40] ttps://wikitech.wikimedia.org/wiki/PyBal [16:08:45] (VarnishUnavailable) firing: 
varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [16:08:46] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:08:53] (03CR) 10C. Scott Ananian: Re-enable Extension:ParserMigration on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [16:09:03] (03CR) 10FNegri: [C: 03+2] [openstack] bridge-utils config in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956832 (https://phabricator.wikimedia.org/T345810) (owner: 10FNegri) [16:09:18] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp501 [16:09:19] wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5020.eqsin.wmnet, cp5022.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:09:23] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:09:24] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1389.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1397.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1475.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1456.eqiad.wmnet, mw1432.eqiad.wmnet, mw1478.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw [16:09:24] ad.wmnet, mw1364.eqiad.wmnet, mw1407.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1476.eqiad.wmnet, mw1480.eqiad.wmnet, mw1351.eqiad.wmnet, mw1405.eqiad.wmnet, mw1352.eqiad.wmnet, mw1399.eqiad.wmnet, mw1368.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw1454.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1393.eqiad.wmnet, mw1488.eqiad.wmnet, mw1481.eqiad.wmnet, mw1366.eqiad.wmnet, mw1487.eqiad.wmnet, mw1372.eqiad [16:09:24] mw1391.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1479.eqiad.wmnet, mw1418.eqiad.wmnet, mw1496.eqiad.wmnet, mw1473.eqiad.wmnet, mw1401.eqiad.wmnet, mw139 https://wikitech.wikimedia.org/wiki/PyBal [16:09:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:10:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:10:16] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [16:10:44] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:10:46] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:11:16] (MediaWikiLatencyExceeded) firing: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:11:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:11:28] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:11:38] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:11:53] (DatasourceError) firing: Nonwrite HTTP requests with primary DB connections alert - 
https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:12:08] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:12:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:32] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:07] (ProbeDown) resolved: (17) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:07] (ProbeDown) resolved: (18) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:13:45] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [16:14:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:15:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:15:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [16:16:16] (MediaWikiLatencyExceeded) resolved: (3) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:16:28] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:16:52] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB 
connections alert - https://grafana.wikimedia.org/alerting/grafana/4tAKSjJVz/view - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [16:17:04] (03PS1) 10BBlack: haproxy: limit to 20k conns to varnish globally [puppet] - 10https://gerrit.wikimedia.org/r/956919 [16:17:45] (03PS3) 10C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [16:19:35] (03CR) 10Vgutierrez: [C: 03+1] "looks good, please add the bug line to the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/956919 (owner: 10BBlack) [16:19:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:19:54] (03CR) 10Fabfur: [C: 03+1] haproxy: limit to 20k conns to varnish globally [puppet] - 10https://gerrit.wikimedia.org/r/956919 (owner: 10BBlack) [16:22:23] (03PS2) 10BBlack: haproxy: limit to 20k conns to varnish globally [puppet] - 10https://gerrit.wikimedia.org/r/956919 (https://phabricator.wikimedia.org/T310609) [16:23:10] (03CR) 10BBlack: [C: 03+2] haproxy: limit to 20k conns to varnish globally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956919 (https://phabricator.wikimedia.org/T310609) (owner: 10BBlack) [16:28:49] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bookworm [16:45:03] (03PS4) 10Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [16:46:44] (03CR) 10Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: 10Bking) [16:51:29] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) [16:53:12] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) Thank you for the reminder @Trizek-WMF! I sent a Slack announcement a while ago and now sent a reminder... [16:55:51] (03CR) 10Herron: [C: 03+1] netmon: Failover from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse) [16:57:38] 10SRE: All Wikimedia projects inaccessible on 2023-09-12 for several minutes - https://phabricator.wikimedia.org/T346172 (10Aklapper) [16:57:43] (03CR) 10Herron: [C: 03+1] Don't require dummy 'team' label for multi-owner alerts [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi) [16:58:17] (03CR) 10Herron: [C: 03+1] profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [16:58:35] 10SRE: All Wikimedia projects inaccessible on 2023-09-12 for several minutes - https://phabricator.wikimedia.org/T346172 (10Wellverywell) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T1700) [17:03:31] 10SRE: All Wikimedia projects inaccessible on 2023-09-12 for several minutes - https://phabricator.wikimedia.org/T346172 (10taavi) 05Open→03Resolved We believe this is resolved now. In general we have monitoring to detect outages of that level and update IRC (`#wikimedia-tech` on libera.chat) and/or wikimedi... 
[17:03:45] (03CR) 10Bking: [C: 03+2] wdqs.data-transfer: Keep downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [17:21:56] (03PS1) 10Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) [17:21:58] (03PS1) 10Andrew Bogott: Remove scripts for cross-region migration [puppet] - 10https://gerrit.wikimedia.org/r/956926 [17:22:00] (03PS1) 10Andrew Bogott: mwopenstackclients: add methods to correlate project id with name [puppet] - 10https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158) [17:22:02] (03PS1) 10Andrew Bogott: wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - 10https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158) [17:22:04] (03PS1) 10Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - 10https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158) [17:22:06] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - 10https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) [17:23:05] (03CR) 10CI reject: [V: 04-1] wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - 10https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [17:24:09] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10cmooney) I took a little look at the routed-mode docs from [[ https://github.com/grnet/gnt-networking/blob/develop/docs/routed.rst | here ]]. Overall the setup looks a... 
[17:27:13] (03PS1) 10DDesouza: Deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956931 (https://phabricator.wikimedia.org/T345951) [17:31:21] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - 10https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) [17:32:11] (03PS8) 10Bking: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (https://phabricator.wikimedia.org/T346048) (owner: 10Ebernhardson) [17:33:45] (03CR) 10Andrew Bogott: [C: 03+2] Remove scripts for cross-region migration [puppet] - 10https://gerrit.wikimedia.org/r/956926 (owner: 10Andrew Bogott) [17:35:49] (03CR) 10Bking: [C: 03+2] Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (https://phabricator.wikimedia.org/T346048) (owner: 10Ebernhardson) [17:36:09] (03PS2) 10Andrew Bogott: Remove scripts for cross-region migration [puppet] - 10https://gerrit.wikimedia.org/r/956926 [17:36:11] (03PS2) 10Andrew Bogott: mwopenstackclients: add methods to correlate project id with name [puppet] - 10https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158) [17:36:13] (03PS2) 10Andrew Bogott: wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - 10https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158) [17:36:15] (03PS2) 10Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - 10https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158) [17:36:17] (03PS3) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - 10https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) [17:36:19] (03PS2) 10Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - 
10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) [17:37:38] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:45:54] (03Merged) 10jenkins-bot: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (https://phabricator.wikimedia.org/T346048) (owner: 10Ebernhardson) [17:49:37] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: remove nsa.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/955962 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [17:50:03] !log run authdns-update to remove nsa.wikimedia.org [17:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:32] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346112 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [18:02:16] (03PS10) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [18:02:45] (03CR) 10CI reject: [V: 04-1] vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [18:06:14] 10SRE, 10Cloud-Services: Certain systems failing to resolve - https://phabricator.wikimedia.org/T346177 (10cmooney) p:05Triage→03Medium The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a mo... 
[18:07:49] 10SRE, 10Cloud-VPS: Certain systems failing to resolve - https://phabricator.wikimedia.org/T346177 (10taavi) p:05Medium→03High [18:08:28] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (10cmooney) [18:08:58] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (10cmooney) [18:09:55] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (10cmooney) [18:12:22] (03PS1) 10Sharvaniharan: Make the new stream name consistent with convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956934 [18:12:35] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (10Nux) `login.toolforge.org` is not working too (even after flushing local dns). So no way to ssh into TS. [18:13:22] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10taavi) [18:13:28] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org - https://phabricator.wikimedia.org/T346177 (10taavi) [18:18:43] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org - https://phabricator.wikimedia.org/T346177 (10taavi) [18:19:38] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10taavi) [18:23:39] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) @RobH if we can ask the registrar to change the IP they have for 
//ns1.openstack.eqiad1.wikimediacloud.org// to 185.15.56.163... [18:24:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10VRiley-WMF) db1240 - B 6. U 03. port 01 CableID 1170 db1241 - B 6. U 04. port 05 CableID 1274 db1242 - C 3. U 12. port 25 CableID 5090 db1243 - C 3. U 13. port 04 CableID 2871 [18:25:14] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10RobH) >>! In T346177#9161332, @cmooney wrote: > @RobH if we can ask the registrar to change the IP they have for //ns1.openstack.eqiad1... [18:27:45] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) >>! In T346177#9161360, @RobH wrote: >>>! In T346177#9161332, @cmooney wrote: >> @RobH if we can ask the registrar to change t... [18:29:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:30:33] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10RobH) > Naoya, > > We're juggling around some nameservers in our cloud environment over here, and need to update one of them: > > ns1... [18:33:05] 10SRE, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) @ssingh do you think this is still an issue that's worth keeping open, and should it then be tagged to IF? 
[18:34:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:35:08] 10SRE, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10ssingh) Thanks @BCornwall, I think we can close this one as we have done some other reimages in eqsin and not observed this issue. [18:35:58] 10SRE, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10BCornwall) Hi, @CRoslof! Have you been able to look into this? Thanks so much! [18:37:25] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10BCornwall) @Vgutierrez and @BBlack friendly poke :) [18:40:28] 10SRE, 10Traffic, 10Patch-For-Review: Incorrect X-Cache-Status reported by deployment-prep caches - https://phabricator.wikimedia.org/T269825 (10BCornwall) p:05Lowest→03Low [18:43:47] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10RobH) > Update, > > It turns out the other nameserver was migrated previously with no IP update either, so its currently incorrect on... 
[18:55:11] (03CR) 10Herron: [C: 03+1] "Happy to help monitor after deployment, morning Eastern TZ generally works for me" [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [18:56:08] (03CR) 10Herron: [C: 03+1] "Happy to monitor after deploying this as well, morning Eastern TZ generally works for me" [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [18:58:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:59:20] (03PS5) 10Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [19:03:01] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I've not seen any change in what ORG is returning. I have made some routing and host-level iptables changes on cloudservices1... [19:07:45] (03PS6) 10Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [19:09:48] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) FWIW I took the following steps: # Routed 208.80.154.11/32 to cloudservices2006 on cloudsw # Updated cloudsw and cr1-eqiad ro... 
[19:11:16] (03PS7) 10Bking: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [19:12:24] (03CR) 10Ebernhardson: [C: 03+1] rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: 10Bking) [19:12:29] (03CR) 10Ryan Kemper: [C: 03+1] rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: 10Bking) [19:13:32] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: 10Bking) [19:14:29] (03Merged) 10jenkins-bot: rdf-streaming-updater-k8s: Add egress/proxy rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: 10Bking) [19:15:13] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:17:29] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:23:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:26:44] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I've temporarily reserved 208.80.154.11 in Netbox so it doesn't get used - we should remove that once ORG has updated their re... [19:28:13] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ssw1 old irb int dns - cmooney@cumin1001" [19:28:28] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you. 
😊" [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi) [19:31:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:32:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ssw1 old irb int dns - cmooney@cumin1001" [19:32:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:34:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:38:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:39:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:40:53] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I'd not noticed in my initial comment above, but *neither* IP that ORG is returning was working earlier on. Seems at some sta... 
[19:43:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:44:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:47:27] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I investigated making the same routing change for 208.80.154.135/32 but it's assigned to gerrit1003 so not possible. [19:49:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:54:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:55:39] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230912T2000) [20:00:06] Jdlrobson and sharvani_: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:39] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:04:33] here [20:06:11] Jdlrobson: is there an incident? [20:06:11] (03PS1) 10Volans: external clouds: allow to get prefixes from RIPE [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) [20:08:43] brett: there's a backport window. I'm not aware if there is an incident. I'm not sure how critical mw-api-int is [20:09:02] looks like an SRE team alert [20:09:12] ah, misunderstood your "here". gotcha [20:09:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:09:40] Hi! Here for UTC late backport window patch deployment :) [20:10:26] hi - i can deploy if it's not already underway [20:10:39] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:10:41] cjming: thank you. 
Nobody has claimed the backport window yet :) [20:11:01] (03CR) 10Volans: "Test run results:" [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans) [20:11:10] cjming: Thank you :) [20:11:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar) [20:11:46] (03PS2) 10Clare Ming: Remove zebra from beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) (owner: 10Kimberly Sarabia) [20:12:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:13:20] (03Merged) 10jenkins-bot: Reduce initial payload of Phonos styles [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/956811 (https://phabricator.wikimedia.org/T345414) (owner: 10Samtar) [20:13:48] !log cjming@deploy1002 Started scap: Backport for [[gerrit:956811|Reduce initial payload of Phonos styles (T345414)]] [20:13:51] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [20:15:25] !log cjming@deploy1002 cjming and samtar: Backport for [[gerrit:956811|Reduce initial payload of Phonos styles (T345414)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:15:29] Jdlrobson: do you want to test your first patch? [20:15:51] cjming: on it [20:17:51] cjming: is the change live on the debug servers? 
[20:18:09] (Reduce initial payload of Phonos styles [extensions/Phonos]) [20:18:11] Jdlrobson: it should be - are you not seeing it? [20:18:21] it might not be possible to test [20:18:23] since it touches parser [20:18:36] oh - should i just go ahead and sync? [20:19:15] cjming: i think so [20:19:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:19:27] !log cjming@deploy1002 cjming and samtar: Continuing with sync [20:21:50] (03PS3) 10BCornwall: Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [20:24:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:25:54] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:956811|Reduce initial payload of Phonos styles (T345414)]] (duration: 12m 06s) [20:25:57] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414 [20:26:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52483 and previous config saved to /var/cache/conftool/dbconfig/20230912-202609-arnaudb.json [20:26:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955980 (https://phabricator.wikimedia.org/T333180) (owner: 10Kimberly Sarabia) [20:26:12] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [20:27:01] (03Merged) 10jenkins-bot: Remove zebra from beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955980 
(https://phabricator.wikimedia.org/T333180) (owner: 10Kimberly Sarabia) [20:27:30] Jdlrobson: Phonos patch should be live - labs patch should be gtg [20:27:55] (03PS2) 10Clare Ming: Make the new stream name consistent with convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956934 (owner: 10Sharvaniharan) [20:28:13] (03PS3) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 [20:28:17] (03CR) 10CI reject: [V: 04-1] wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:28:35] (03PS11) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [20:29:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956934 (owner: 10Sharvaniharan) [20:29:33] (03CR) 10RhinosF1: "this can be merged blind. I will do the follow up unless someone wants too. 
Host is wikistats-bookworm.eqiad1.wikimedia.cloud though if yo" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:29:40] cjming: thanks checking now [20:29:56] (03Merged) 10jenkins-bot: Make the new stream name consistent with convention [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956934 (owner: 10Sharvaniharan) [20:30:25] !log cjming@deploy1002 Started scap: Backport for [[gerrit:956934|Make the new stream name consistent with convention]] [20:31:25] (03PS4) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 [20:31:33] (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:31:50] (03CR) 10CI reject: [V: 04-1] wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:31:53] !log cjming@deploy1002 sharvaniharan and cjming: Backport for [[gerrit:956934|Make the new stream name consistent with convention]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:32:25] sharvani_: i should just sync your patch? [20:32:31] (03PS5) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 [20:32:38] (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:32:42] cjming: [20:32:46] I can test [20:32:54] please do [20:32:55] (03CR) 10CI reject: [V: 04-1] wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:33:13] looks good please sync. thank you. 
[20:33:18] !log cjming@deploy1002 sharvaniharan and cjming: Continuing with sync [20:33:26] (03PS6) 10RhinosF1: wikistats: drop some updates [puppet] - 10https://gerrit.wikimedia.org/r/956813 [20:34:19] (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956813 (owner: 10RhinosF1) [20:37:23] (03CR) 10Cwhite: [C: 03+2] alertmanager: add link to DatasourceError runbook [puppet] - 10https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [20:39:21] cjming: managed to verify everything is okay with the patch - it just didn't work as advertised :) [20:39:49] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:956934|Make the new stream name consistent with convention]] (duration: 09m 24s) [20:40:03] Jdlrobson: bummer - is there anything to do? [20:40:05] sharvani_: should be live! [20:40:21] thank you! [20:40:22] cjming: nope nothing we can do here :) [20:40:42] sharvani_: yw - np! [20:41:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P52484 and previous config saved to /var/cache/conftool/dbconfig/20230912-204115-arnaudb.json [20:41:25] Jdlrobson: sorry to hear -- lmk if you want to revert [20:41:54] nope no need to revert it's still a minor improvement [20:41:58] RECOVERY - cassandra-c service on restbase1030 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:42:22] RECOVERY - cassandra-b SSL 10.64.48.235:7000 on restbase1030 is OK: SSL OK - Certificate restbase1030-b valid until 2024-08-30 21:39:18 +0000 (expires in 353 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:42:25] !log rebooting search-loader2001.codfw.wmnet T344671 [20:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:38] RECOVERY - cassandra-c SSL 10.64.48.236:7000 on restbase1030 is OK: SSL 
OK - Certificate restbase1030-c valid until 2024-08-30 21:39:21 +0000 (expires in 353 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:42:45] closing the window if there's nothing else [20:42:50] RECOVERY - cassandra-b service on restbase1030 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:43:00] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader2001.codfw.wmnet [20:43:37] !log end of UTC late backport window [20:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2001.codfw.wmnet [20:53:16] RECOVERY - Host mw2444 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [20:54:26] 10SRE, 10Movement-Insights, 10Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Hi, any update on this? [20:54:32] PROBLEM - puppet last run on mw2444 is CRITICAL: CRITICAL: Puppet last ran 13 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:56:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P52485 and previous config saved to /var/cache/conftool/dbconfig/20230912-205621-arnaudb.json [20:57:48] 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Jhancock.wm) I've replaced the CPU. We should know by Friday if it has issues. I will leave the ticket open until then. 
return tracking for the bad part (which I will also hold until Friday): 783629071254 [20:59:58] RECOVERY - puppet last run on mw2444 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:01:26] 10SRE, 10Security-Team, 10Traffic, 10SecTeam-Processed, 10Security: Denial of Service due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (10sbassett) [21:01:47] 10SRE, 10Security-Team, 10Traffic, 10SecTeam-Processed, 10Security: Denial of Service due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (10sbassett) [21:04:18] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader1001.eqiad.wmnet [21:07:51] (03CR) 10Ssingh: [C: 03+1] Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [21:08:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1001.eqiad.wmnet [21:09:28] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:11:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T343198)', diff saved to https://phabricator.wikimedia.org/P52486 and previous config saved to /var/cache/conftool/dbconfig/20230912-211128-arnaudb.json [21:11:30] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:11:34] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:11:43] !log arnaudb@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:12:37] (03CR) 10Cwhite: [V: 03+1 C: 03+2] grafana: ensure prometheus/global datasources removed [puppet] - 10https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) (owner: 10Cwhite) [21:14:04] 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) 05Open→03Resolved @bking @RKemper All disks are present [21:14:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:18:12] 10SRE, 10Traffic: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook - https://phabricator.wikimedia.org/T327812 (10BCornwall) 05Stalled→03Resolved a:03BCornwall [21:29:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:24] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:30:39] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:34:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:24] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:40:03] (03CR) 10Cwhite: Add: roll restart/reboot command for opensearch cluster nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [21:44:43] (03CR) 10BCornwall: [C: 03+2] Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [21:46:49] (03CR) 10Cwhite: [C: 03+1] Don't require dummy 'team' label for multi-owner alerts [alerts] - 10https://gerrit.wikimedia.org/r/956794 (owner: 10Filippo Giunchedi) [21:59:58] (03PS1) 10BCornwall: package_builder: add piuparts for >=bookworm [puppet] - 10https://gerrit.wikimedia.org/r/956968 [22:11:43] (03CR) 10Fabfur: "I found piuparts very useful but wait for someone with more experience than me with this" [puppet] - 10https://gerrit.wikimedia.org/r/956968 (owner: 10BCornwall) [22:46:19] (03PS1) 10Volans: decorators: fix set_tries [software/spicerack] - 10https://gerrit.wikimedia.org/r/956972 (https://phabricator.wikimedia.org/T346134) [23:07:52] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:42] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I'd like to understand where the requirement for the "glue" A records that org are returning for these comes from. As I under... 
[23:09:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:11:58] !bash Krinkle: The quintessential pattern of classic MediaWiki is query() -> doQuery() -> reallyDoQuery(). And when really (ha) in doubt, just tack a number or "internal" to the caller, like guessMimeInternal, makeThumbLink2, or recordUpload3 [23:11:58] Amir1: Stored quip at https://bash.toolforge.org/quip/I4Wqi4oBGiVuUzOddIg1 [23:14:57] !log Upload trafficserver_9.2.1-1wm2_amd64 to bookworm-wikimedia [23:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:00] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:44:09] (03PS1) 10Jdlrobson: Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.25) - 10https://gerrit.wikimedia.org/r/956814 (https://phabricator.wikimedia.org/T345414) [23:44:23] (03PS1) 10Jdlrobson: Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956815 (https://phabricator.wikimedia.org/T345414)