[00:01:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P66215 and previous config saved to /var/cache/conftool/dbconfig/20240711-000136-arnaudb.json
[00:03:11] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1053397 (owner: TrainBranchBot)
[00:09:37] (PS1) Dzahn: mailman3: add defined type to sync list members (WIP) [puppet] - https://gerrit.wikimedia.org/r/1053399
[00:10:00] (CR) CI reject: [V:-1] mailman3: add defined type to sync list members (WIP) [puppet] - https://gerrit.wikimedia.org/r/1053399 (owner: Dzahn)
[00:12:09] (CR) Ssingh: "Thanks for the reviews! Comments in-line:" [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[00:12:45] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 378.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:35] PROBLEM - dump of s6 in codfw on backupmon1001 is CRITICAL: dump for s6 at codfw (db2197) taken more than a week ago: Most recent backup 2024-07-02 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:16:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P66216 and previous config saved to /var/cache/conftool/dbconfig/20240711-001643-arnaudb.json
[00:20:45] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:20:51] (PS1) Dzahn: redirects.dat: change funnel target for sep11.wikipedia.org to meta wiki [puppet] - https://gerrit.wikimedia.org/r/1053400 (https://phabricator.wikimedia.org/T367014)
[00:23:45] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:24:40] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053400" [puppet] - https://gerrit.wikimedia.org/r/225043 (owner: Glaisher)
[00:28:39] (CR) Dzahn: "added profile::stewards::gitlab_clone_token in private repo" [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm)
[00:31:45] (PS4) Dzahn: stewards: clone user DB repo from GitLab [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm)
[00:31:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T367781)', diff saved to https://phabricator.wikimedia.org/P66217 and previous config saved to /var/cache/conftool/dbconfig/20240711-003150-arnaudb.json
[00:31:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2162.codfw.wmnet with reason: Maintenance
[00:32:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[00:32:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2162.codfw.wmnet with reason: Maintenance
[00:32:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66218 and previous config saved to /var/cache/conftool/dbconfig/20240711-003212-arnaudb.json
[00:34:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66219 and previous config saved to /var/cache/conftool/dbconfig/20240711-003423-arnaudb.json
[00:34:31] (CR) Dzahn: [C:+2] "https://puppet-compiler.wmflabs.org/output/1052384/3204/stewards1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm)
[00:37:10] (CR) Dzahn: [C:+2] "Notice: /Stage[main]/Profile::Stewards/Git::Clone[repos/stewards/users]/Exec[git_set_origin_repos/stewards/users]/returns: executed succes" [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm)
[00:38:45] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 38.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:39:54] (CR) Dzahn: [C:+2] "The .git/config in the repo shows the gitlab URL with the secret as [remote "origin"] now and there was no error." [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm)
[00:48:09] (CR) Dzahn: [C:+2] gerrit: switch gerrit-replica from iptables to nftables [puppet] - https://gerrit.wikimedia.org/r/1053068 (owner: Dzahn)
[00:48:50] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2002.wikimedia.org with reason: switch firewall provider
[00:49:03] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: switch firewall provider
[00:49:27] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit-replica.wikimedia.org with reason: switch firewall provider
[00:49:28] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on gerrit-replica.wikimedia.org with reason: switch firewall provider
[00:49:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P66220 and previous config saved to /var/cache/conftool/dbconfig/20240711-004930-arnaudb.json
[00:55:48] !log gerrit-replica.wikimedia.org (gerrit2002) - maintenance
[00:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:09] (CR) Scott French: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[01:01:44] (CR) Andrew Bogott: [C:+2] Set OS_CLOUD in wmcs-openstack.sh [puppet] - https://gerrit.wikimedia.org/r/923697 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:01:49] (CR) Andrew Bogott: [C:+2] wmcs-cold-migrate.py: remove [puppet] - https://gerrit.wikimedia.org/r/1053354 (owner: Andrew Bogott)
[01:01:53] (CR) Andrew Bogott: [C:+2] wmcs-pause-cloud: remove [puppet] - https://gerrit.wikimedia.org/r/1053355 (owner: Andrew Bogott)
[01:02:10] (CR) Andrew Bogott: [C:+2] wmcs-makedomain: use clouds.yaml openstack auth [puppet] - https://gerrit.wikimedia.org/r/1053356 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:02:34] (CR) Andrew Bogott: [C:+2] wmcs-drain-hypervisor: remove spurious args, use clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1053357 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:02:44] (CR) Andrew Bogott: [C:+2] Openstack cli: stamp out openstack auth via env settings [puppet] - https://gerrit.wikimedia.org/r/1053086 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:04:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P66221 and previous config saved to /var/cache/conftool/dbconfig/20240711-010437-arnaudb.json
[01:09:47] (PS1) Andrew Bogott: util/admin_scripts.pp: wmcs-pause-cloud ensure absent [puppet] - https://gerrit.wikimedia.org/r/1053401 (https://phabricator.wikimedia.org/T337577)
[01:10:55] (CR) Andrew Bogott: [C:+2] util/admin_scripts.pp: wmcs-pause-cloud ensure absent [puppet] - https://gerrit.wikimedia.org/r/1053401 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:18:53] (PS1) Andrew Bogott: openstack envscript.sh.erb: replace env settings with OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577)
[01:19:11] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:19:19] (CR) CI reject: [V:-1] openstack envscript.sh.erb: replace env settings with OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:19:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T367781)', diff saved to https://phabricator.wikimedia.org/P66222 and previous config saved to /var/cache/conftool/dbconfig/20240711-011944-arnaudb.json
[01:19:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance
[01:19:49] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[01:19:52] (PS2) Andrew Bogott: openstack envscript.sh.erb: replace env settings with OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577)
[01:19:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance
[01:20:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T367781)', diff saved to https://phabricator.wikimedia.org/P66223 and previous config saved to /var/cache/conftool/dbconfig/20240711-012006-arnaudb.json
[01:20:19] (CR) CI reject: [V:-1] openstack envscript.sh.erb: replace env settings with OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:20:46] (PS3) Andrew Bogott: openstack envscript.sh.erb: replace env settings with OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577)
[01:20:51] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:21:34] !log gerrit-replica.wikimedia.org (gerrit2002) - switched firewall provider from iptables to nftables - all seems fine to me but just in case: gerrit:1053068 can be reverted to go back
[01:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T367781)', diff saved to https://phabricator.wikimedia.org/P66224 and previous config saved to /var/cache/conftool/dbconfig/20240711-012216-arnaudb.json
[01:23:15] (CR) Andrew Bogott: [C:+2] openstack envscript.sh.erb: replace env settings with OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053402 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:27:03] (PS1) Andrew Bogott: cloudvirt1060 -> OVS [puppet] - https://gerrit.wikimedia.org/r/1053404 (https://phabricator.wikimedia.org/T364457)
[01:27:27] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS bookworm
[01:27:36] (CR) Andrew Bogott: [C:+2] cloudvirt1060 -> OVS [puppet] - https://gerrit.wikimedia.org/r/1053404 (https://phabricator.wikimedia.org/T364457) (owner: Andrew Bogott)
[01:37:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P66225 and previous config saved to /var/cache/conftool/dbconfig/20240711-013723-arnaudb.json
[01:39:02] (PS1) Andrew Bogott: wmcs-openstack: export OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053405 (https://phabricator.wikimedia.org/T337577)
[01:39:40] (CR) Andrew Bogott: [C:+2] wmcs-openstack: export OS_CLOUD [puppet] - https://gerrit.wikimedia.org/r/1053405 (https://phabricator.wikimedia.org/T337577) (owner: Andrew Bogott)
[01:40:40] (CR) Andrew Bogott: [C:+2] "Now that the g3 flavors are disabled this is safe again." [puppet] - https://gerrit.wikimedia.org/r/1043163 (owner: Andrew Bogott)
[01:40:52] (PS2) Andrew Bogott: Revert "nova policy: temporarily disable VM resizing" [puppet] - https://gerrit.wikimedia.org/r/1043163
[01:43:14] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[01:46:11] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[01:52:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P66226 and previous config saved to /var/cache/conftool/dbconfig/20240711-015231-arnaudb.json
[01:56:30] (CR) Krinkle: [C:-1] "per task, subdirs under /w/ adds compounding complexity for others that I'd rather avoid, and this would expose /w/beacon.php on all domai" [puppet] - https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[02:07:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T367781)', diff saved to https://phabricator.wikimedia.org/P66227 and previous config saved to /var/cache/conftool/dbconfig/20240711-020738-arnaudb.json
[02:07:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance
[02:07:42] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[02:07:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance
[02:07:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance
[02:07:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance
[02:08:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T367781)', diff saved to https://phabricator.wikimedia.org/P66228 and previous config saved to /var/cache/conftool/dbconfig/20240711-020805-arnaudb.json
[02:10:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T367781)', diff saved to https://phabricator.wikimedia.org/P66229 and previous config saved to /var/cache/conftool/dbconfig/20240711-021015-arnaudb.json
[02:12:07] (PS1) Andrew Bogott: wmcs-openstack.sh: set cwd so we get root's clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1053408
[02:12:41] (CR) Andrew Bogott: [C:+2] wmcs-openstack.sh: set cwd so we get root's clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1053408 (owner: Andrew Bogott)
[02:14:59] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS bookworm
[02:25:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P66230 and previous config saved to /var/cache/conftool/dbconfig/20240711-022522-arnaudb.json
[02:40:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P66231 and previous config saved to /var/cache/conftool/dbconfig/20240711-024030-arnaudb.json
[02:45:38] !log stewards2001 - sudo mv /srv/repos/users-db /root/ - run puppet and let it recreate the usersdb repo - this time pulling from gitlab - T369780 T369430
[02:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:44] T369780: PuppetFailure - stewards2001 - https://phabricator.wikimedia.org/T369780
[02:45:44] T369430: Ensure /srv/repos/users-db is loaded from GitLab - https://phabricator.wikimedia.org/T369430
[02:55:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T367781)', diff saved to https://phabricator.wikimedia.org/P66232 and previous config saved to /var/cache/conftool/dbconfig/20240711-025537-arnaudb.json
[02:55:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2165.codfw.wmnet with reason: Maintenance
[02:55:41] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[02:55:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2165.codfw.wmnet with reason: Maintenance
[02:55:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T367781)', diff saved to https://phabricator.wikimedia.org/P66233 and previous config saved to /var/cache/conftool/dbconfig/20240711-025558-arnaudb.json
[02:58:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T367781)', diff saved to https://phabricator.wikimedia.org/P66234 and previous config saved to /var/cache/conftool/dbconfig/20240711-025809-arnaudb.json
[02:59:40] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:53] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:13:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P66235 and previous config saved to /var/cache/conftool/dbconfig/20240711-031316-arnaudb.json
[03:26:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on relforge1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[03:28:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P66236 and previous config saved to /var/cache/conftool/dbconfig/20240711-032823-arnaudb.json
[03:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:43:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T367781)', diff saved to https://phabricator.wikimedia.org/P66237 and previous config saved to /var/cache/conftool/dbconfig/20240711-034330-arnaudb.json
[03:43:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance
[03:43:34] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[03:43:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance
[03:43:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T367781)', diff saved to https://phabricator.wikimedia.org/P66238 and previous config saved to /var/cache/conftool/dbconfig/20240711-034352-arnaudb.json
[03:46:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T367781)', diff saved to https://phabricator.wikimedia.org/P66239 and previous config saved to /var/cache/conftool/dbconfig/20240711-034603-arnaudb.json
[03:55:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:56:22] SRE-Access-Requests, Data-Engineering: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#9972112 (Pppery)
[03:56:41] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1053423
[03:56:44] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1053424
[03:56:47] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1053425
[03:57:05] (CR) CI reject: [V:-1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1053424 (owner: Ncmonitor)
[04:00:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:01:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P66240 and previous config saved to /var/cache/conftool/dbconfig/20240711-040110-arnaudb.json
[04:16:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P66241 and previous config saved to /var/cache/conftool/dbconfig/20240711-041617-arnaudb.json
[04:31:19] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1053424 (owner: Ncmonitor)
[04:31:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T367781)', diff saved to https://phabricator.wikimedia.org/P66242 and previous config saved to /var/cache/conftool/dbconfig/20240711-043124-arnaudb.json
[04:31:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance
[04:31:29] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[04:31:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance
[04:31:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T367781)', diff saved to https://phabricator.wikimedia.org/P66243 and previous config saved to /var/cache/conftool/dbconfig/20240711-043147-arnaudb.json
[04:32:05] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1053425 (owner: Ncmonitor)
[04:33:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T367781)', diff saved to https://phabricator.wikimedia.org/P66244 and previous config saved to /var/cache/conftool/dbconfig/20240711-043358-arnaudb.json
[04:34:01] (Abandoned) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1053423 (owner: Ncmonitor)
[04:37:15] (PS1) BCornwall: taskgen: Ignore ncredir domain typos [puppet] - https://gerrit.wikimedia.org/r/1053426
[04:49:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P66245 and previous config saved to /var/cache/conftool/dbconfig/20240711-044905-arnaudb.json
[04:58:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 36 hosts with reason: Primary switchover s1 T369514
[04:58:22] T369514: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T369514
[04:58:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1184 with weight 0 T369514', diff saved to https://phabricator.wikimedia.org/P66246 and previous config saved to /var/cache/conftool/dbconfig/20240711-045829-marostegui.json
[04:58:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 36 hosts with reason: Primary switchover s1 T369514
[04:58:53] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 21.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:59:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1184 from API/vslow/dump T369514', diff saved to https://phabricator.wikimedia.org/P66247 and previous config saved to /var/cache/conftool/dbconfig/20240711-045905-marostegui.json
[04:59:34] (PS2) Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1052748 (https://phabricator.wikimedia.org/T369514)
[04:59:59] (CR) Marostegui: [C:+2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1052748 (https://phabricator.wikimedia.org/T369514) (owner: Gerrit maintenance bot)
[05:00:00] (CR) Marostegui: [V:+2 C:+2] mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1052748 (https://phabricator.wikimedia.org/T369514) (owner: Gerrit maintenance bot)
[05:00:50] (PS1) Marostegui: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1053427 (https://phabricator.wikimedia.org/T369514)
[05:02:36] (Abandoned) Marostegui: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1052749 (https://phabricator.wikimedia.org/T369514) (owner: Gerrit maintenance bot)
[05:04:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P66248 and previous config saved to /var/cache/conftool/dbconfig/20240711-050413-arnaudb.json
[05:19:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T367781)', diff saved to https://phabricator.wikimedia.org/P66249 and previous config saved to /var/cache/conftool/dbconfig/20240711-051920-arnaudb.json
[05:19:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance
[05:19:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[05:19:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance
[05:19:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T367781)', diff saved to https://phabricator.wikimedia.org/P66250 and previous config saved to /var/cache/conftool/dbconfig/20240711-051941-arnaudb.json
[05:21:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T367781)', diff saved to https://phabricator.wikimedia.org/P66251 and previous config saved to /var/cache/conftool/dbconfig/20240711-052151-arnaudb.json
[05:24:53] !log Starting s1 eqiad failover from db1163 to db1184 - T369514
[05:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:56] T369514: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T369514
[05:25:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T369514', diff saved to https://phabricator.wikimedia.org/P66252 and previous config saved to /var/cache/conftool/dbconfig/20240711-052507-root.json
[05:25:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1184 to s1 primary and set section read-write T369514', diff saved to https://phabricator.wikimedia.org/P66253 and previous config saved to /var/cache/conftool/dbconfig/20240711-052540-root.json
[05:26:13] (CR) Marostegui: [C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1053427 (https://phabricator.wikimedia.org/T369514) (owner: Marostegui)
[05:27:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1163 T369514', diff saved to https://phabricator.wikimedia.org/P66254 and previous config saved to /var/cache/conftool/dbconfig/20240711-052702-root.json
[05:29:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P66255 and previous config saved to /var/cache/conftool/dbconfig/20240711-052931-root.json
[05:36:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P66256 and previous config saved to /var/cache/conftool/dbconfig/20240711-053659-arnaudb.json
[05:44:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P66257 and previous config saved to /var/cache/conftool/dbconfig/20240711-054436-root.json
[05:50:53] (PS1) KartikMistry: Enable MinT for Wikipedia readers MVP on a second group of pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1053429 (https://phabricator.wikimedia.org/T367067)
[05:52:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P66258 and previous config saved to /var/cache/conftool/dbconfig/20240711-055206-arnaudb.json
[05:52:19] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053429 (https://phabricator.wikimedia.org/T367067) (owner: KartikMistry)
[05:59:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P66259 and previous config saved to /var/cache/conftool/dbconfig/20240711-055942-root.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T0600)
[06:00:04] marostegui, Amir1, and arnaudb: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T367781)', diff saved to https://phabricator.wikimedia.org/P66260 and previous config saved to /var/cache/conftool/dbconfig/20240711-060714-arnaudb.json
[06:07:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance
[06:07:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:07:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance
[06:07:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T367781)', diff saved to https://phabricator.wikimedia.org/P66261 and previous config saved to /var/cache/conftool/dbconfig/20240711-060736-arnaudb.json
[06:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T367856)', diff saved to https://phabricator.wikimedia.org/P66262 and previous config saved to /var/cache/conftool/dbconfig/20240711-060910-marostegui.json
[06:09:15] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T367781)', diff saved to https://phabricator.wikimedia.org/P66263 and previous config saved to /var/cache/conftool/dbconfig/20240711-060947-arnaudb.json
[06:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P66264 and previous config saved to /var/cache/conftool/dbconfig/20240711-061447-root.json
[06:24:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P66265 and previous config saved to /var/cache/conftool/dbconfig/20240711-062417-marostegui.json
[06:24:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P66266 and previous config saved to /var/cache/conftool/dbconfig/20240711-062454-arnaudb.json
[06:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P66267 and previous config saved to /var/cache/conftool/dbconfig/20240711-062953-root.json
[06:38:50] (CR) Urbanecm: stewards: clone user DB repo from GitLab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm)
[06:39:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P66268 and previous config saved to /var/cache/conftool/dbconfig/20240711-063924-marostegui.json
[06:39:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053006 (https://phabricator.wikimedia.org/T361013) (owner: Daniel Kinzler)
[06:40:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P66269 and previous config saved to /var/cache/conftool/dbconfig/20240711-064001-arnaudb.json
[06:41:10] (PS1) DCausse: Fix pool counter metric [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053533
[06:42:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - https://gerrit.wikimedia.org/r/1053533 (owner: DCausse)
[06:44:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P66270 and previous config saved to /var/cache/conftool/dbconfig/20240711-064459-root.json
[06:51:07] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 61942
[06:51:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 61942
[06:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T367856)', diff saved to https://phabricator.wikimedia.org/P66271 and previous config saved to /var/cache/conftool/dbconfig/20240711-065432-marostegui.json
[06:54:36] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[06:55:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T367781)', diff saved to https://phabricator.wikimedia.org/P66272 and previous config saved to /var/cache/conftool/dbconfig/20240711-065508-arnaudb.json
[06:55:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[06:55:12] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[06:55:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance
[06:55:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[06:55:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance
[06:58:14] (PS1) Kevin Bazira: ml-services: article_descriptions from src dir [deployment-charts] - https://gerrit.wikimedia.org/r/1053536 (https://phabricator.wikimedia.org/T369344)
[06:59:40] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:04] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T0700).
[07:00:04] kart_, nemo-yiannis, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P66273 and previous config saved to /var/cache/conftool/dbconfig/20240711-070004-root.json
[07:00:16] o/
[07:00:25] o/
[07:00:33] \o
[07:00:42] I'll start my config patch..
[07:01:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053429 (https://phabricator.wikimedia.org/T367067) (owner: 10KartikMistry) [07:01:48] (03Merged) 10jenkins-bot: Enable MinT for Wikipedia readers MVP on a second group of pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053429 (https://phabricator.wikimedia.org/T367067) (owner: 10KartikMistry) [07:02:14] (03PS3) 10Daniel Kinzler: Linter: trigger parsoid parses on template changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053006 (https://phabricator.wikimedia.org/T361013) [07:02:37] !log kartik@deploy1002 Started scap sync-world: Backport for [[gerrit:1053429|Enable MinT for Wikipedia readers MVP on a second group of pilot wikis (T367067)]] [07:02:40] T367067: Enable MinT for Wikipedia readers MVP on a second group of pilot wikis - https://phabricator.wikimedia.org/T367067 [07:03:08] (03CR) 10Slyngshede: "Suggestion is to test this on Debmonitor, then move it to a general setting, in cas_settings to apply to the other CAS enabled hosts. 
Fina" [puppet] - 10https://gerrit.wikimedia.org/r/1053535 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [07:05:28] !log kartik@deploy1002 kartik: Backport for [[gerrit:1053429|Enable MinT for Wikipedia readers MVP on a second group of pilot wikis (T367067)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:07:07] !log kartik@deploy1002 kartik: Continuing with sync [07:07:49] (03CR) 10DCausse: [C:03+2] "backport" [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1053533 (owner: 10DCausse) [07:08:14] scheduling CI for my patch ^ [07:12:09] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1053429|Enable MinT for Wikipedia readers MVP on a second group of pilot wikis (T367067)]] (duration: 09m 32s) [07:12:13] T367067: Enable MinT for Wikipedia readers MVP on a second group of pilot wikis - https://phabricator.wikimedia.org/T367067 [07:12:26] Done with my patch. Go ahead nemo-yiannis [07:12:34] ok [07:13:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jgiannelos@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053006 (https://phabricator.wikimedia.org/T361013) (owner: 10Daniel Kinzler) [07:13:53] (03Merged) 10jenkins-bot: Linter: trigger parsoid parses on template changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053006 (https://phabricator.wikimedia.org/T361013) (owner: 10Daniel Kinzler) [07:14:21] !log jgiannelos@deploy1002 Started scap sync-world: Backport for [[gerrit:1053006|Linter: trigger parsoid parses on template changes (T361013)]] [07:14:24] T361013: Update lint tables independently of changeprop/restbase - https://phabricator.wikimedia.org/T361013 [07:17:02] !log jgiannelos@deploy1002 daniel, jgiannelos: Backport for [[gerrit:1053006|Linter: trigger parsoid parses on template changes (T361013)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:19:50] (03CR) 10Slyngshede: [C:03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/1053360 (https://phabricator.wikimedia.org/T368911) (owner: 10Dzahn) [07:20:26] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1053052 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [07:21:00] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1053352 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [07:22:16] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1053366 (https://phabricator.wikimedia.org/T366032) (owner: 10Dzahn) [07:23:52] !log jgiannelos@deploy1002 daniel, jgiannelos: Continuing with sync [07:26:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on relforge1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:27:31] (03Abandoned) 10Jgiannelos: Enable linter jobs on derived data update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053019 (https://phabricator.wikimedia.org/T367417) (owner: 10Jgiannelos) [07:28:47] !log jgiannelos@deploy1002 Finished scap: Backport for [[gerrit:1053006|Linter: trigger parsoid parses on template changes (T361013)]] (duration: 14m 25s) [07:28:51] T361013: Update lint tables independently of changeprop/restbase - https://phabricator.wikimedia.org/T361013 [07:29:37] i am done with my patch [07:30:14] ack, will ship mine once it's merged [07:30:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:30:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s3 T369691 [07:30:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance 
[07:30:55] T369691: Switchover s3 master (db2127 -> db2205) - https://phabricator.wikimedia.org/T369691 [07:31:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2205 with weight 0 T369691', diff saved to https://phabricator.wikimedia.org/P66274 and previous config saved to /var/cache/conftool/dbconfig/20240711-073101-root.json [07:31:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T369691 [07:32:11] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1053261 (https://phabricator.wikimedia.org/T369691) (owner: 10Gerrit maintenance bot) [07:36:35] (03Merged) 10jenkins-bot: Fix pool counter metric [extensions/CirrusSearch] (wmf/1.43.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1053533 (owner: 10DCausse) [07:36:45] (03PS2) 10Hashar: gerrit: remove absented clear_gerrit_logs timer job [puppet] - 10https://gerrit.wikimedia.org/r/1049091 (https://phabricator.wikimedia.org/T367505) [07:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:37:35] !log dcausse@deploy1002 Started scap sync-world: Backport for [[gerrit:1053533|Fix pool counter metric]] [07:37:55] (03CR) 10Hashar: [C:03+1] "I have confirmed Gerrit triggers the log rotation on a daily basis (at 11pm UTC) and logs are indeed rotated: T367505#9969012" [puppet] - 10https://gerrit.wikimedia.org/r/1049091 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [07:38:13] (03PS3) 10Hashar: gerrit: remove absented clear_gerrit_logs timer job [puppet] - 10https://gerrit.wikimedia.org/r/1049091 (https://phabricator.wikimedia.org/T367505) [07:38:36] (03CR) 10Hashar: "Rebased to clear the "merge conflict" since the parent change got rebased 
automatically upon submission." [puppet] - 10https://gerrit.wikimedia.org/r/1049091 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [07:41:00] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1053533|Fix pool counter metric]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:41:19] (03CR) 10Hashar: [C:03+1] "Ib5cbbe1ba020b103f91e878244c626bca056d7ae has been abandoned without a reason and as expected T367417#9972309 does not list the reason any" [puppet] - 10https://gerrit.wikimedia.org/r/1051134 (owner: 10Paladox) [07:41:29] (03PS1) 10Slyngshede: Docker: Update dependencies [software/bitu] - 10https://gerrit.wikimedia.org/r/1053613 [07:42:35] !log dcausse@deploy1002 dcausse: Continuing with sync [07:45:05] !log Starting s3 codfw failover from db2127 to db2205 - T369691 [07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:09] T369691: Switchover s3 master (db2127 -> db2205) - https://phabricator.wikimedia.org/T369691 [07:45:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2205 to s3 primary T369691', diff saved to https://phabricator.wikimedia.org/P66275 and previous config saved to /var/cache/conftool/dbconfig/20240711-074534-marostegui.json [07:46:18] (03CR) 10Jelto: [C:03+2] gerrit: remove absented clear_gerrit_logs timer job [puppet] - 10https://gerrit.wikimedia.org/r/1049091 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar) [07:46:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2127 T369691', diff saved to https://phabricator.wikimedia.org/P66276 and previous config saved to /var/cache/conftool/dbconfig/20240711-074629-marostegui.json [07:47:32] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1053533|Fix pool counter metric]] (duration: 09m 56s) [07:47:54] (03PS3) 10Slyngshede: R:idp New CAS 7 hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) [07:48:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Long schema change [07:48:41] (03PS1) 10Marostegui: db2127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1053614 (https://phabricator.wikimedia.org/T367856) [07:48:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Long schema change [07:48:43] !log closing the backport window [07:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:47] (03CR) 10Marostegui: [C:03+2] db2127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1053614 (https://phabricator.wikimedia.org/T367856) (owner: 10Marostegui) [07:50:23] !log Deploy schema change on s3 codfw db2127 dbmaint T367856 [07:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:50:32] (03CR) 10Slyngshede: [C:03+2] Docker: Update dependencies [software/bitu] - 10https://gerrit.wikimedia.org/r/1053613 (owner: 10Slyngshede) [07:55:45] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:56:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T367856)', diff saved to https://phabricator.wikimedia.org/P66277 and previous config saved to /var/cache/conftool/dbconfig/20240711-075630-marostegui.json [07:56:34] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:57:45] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[08:00:05] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T0800) [08:01:01] (03CR) 10Elukey: [C:03+1] Netbox 4: prepare Puppet for new prod servers [puppet] - 10https://gerrit.wikimedia.org/r/1053266 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:01:09] (03CR) 10Arnaudb: [C:03+1] mariadb: bugfixes mysql_legacy (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [08:05:34] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:05:52] (03CR) 10Arnaudb: [C:03+1] mariadb: bugfixes mysql_legacy (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [08:06:23] I now started promoting group2 wikis to 1.43.0-wmf.13 [08:06:42] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053615 (https://phabricator.wikimedia.org/T366958) [08:06:43] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053615 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [08:07:26] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053615 (https://phabricator.wikimedia.org/T366958) (owner: 10TrainBranchBot) [08:08:35] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [08:09:57] jouncebot: next [08:09:57] In 1 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1000) [08:11:33] (03CR) 10Ayounsi: [C:03+2] Netbox 4: prepare Puppet for new 
prod servers [puppet] - 10https://gerrit.wikimedia.org/r/1053266 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:11:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P66278 and previous config saved to /var/cache/conftool/dbconfig/20240711-081137-marostegui.json [08:15:04] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.13 refs T366958 [08:15:08] T366958: 1.43.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T366958 [08:19:04] (03PS1) 10Elukey: profile::tcpircbot: allow inbound conn from puppetserver nodes [puppet] - 10https://gerrit.wikimedia.org/r/1053616 (https://phabricator.wikimedia.org/T368023) [08:20:44] (03PS1) 10Slyngshede: Docker: Enable MediaWiki module [software/bitu] - 10https://gerrit.wikimedia.org/r/1053618 [08:22:44] (03CR) 10Slyngshede: [C:03+2] Docker: Enable MediaWiki module [software/bitu] - 10https://gerrit.wikimedia.org/r/1053618 (owner: 10Slyngshede) [08:24:56] Finished promoting group2 wikis to 1.43.0-wmf.13, looks fine so far [08:26:14] (03PS1) 10Elukey: profile::kerveros::kadminserver: allow more nodes in rsync [puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) [08:26:40] (03PS2) 10Elukey: profile::kerberos::kadminserver: allow more nodes in rsync [puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) [08:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P66279 and previous config saved to /var/cache/conftool/dbconfig/20240711-082644-marostegui.json [08:27:20] (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1053366 (https://phabricator.wikimedia.org/T366032) (owner: 10Dzahn) [08:27:51] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3205/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [08:30:51] !log Switched CI Quibble and Phan jobs based on PHP 8.1, 8.2 and 8.3 from Buster to Bullseye - T335766 T366799 T369146 [08:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:57] T335766: Migrate Quibble images from buster to bullseye - https://phabricator.wikimedia.org/T335766 [08:30:57] T366799: Quibble jobs fail on bullseye images in QUnit job: "Multiple targets are not supported" - https://phabricator.wikimedia.org/T366799 [08:30:58] T369146: Quibble CI images based on Buster fail to build due to sury.org dropping support - https://phabricator.wikimedia.org/T369146 [08:31:07] at some point I will need a dedicated project manager to keep track of all those tasks [08:32:55] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9972471 (10fgiunchedi) 05In progress→03Resolved a:03fgiunchedi Patch is merged and I've added `soda` to `nda` ldap group, tentatively resolving though please reopen if needed [08:34:41] (03PS2) 10Filippo Giunchedi: admin: add eup to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1052921 (https://phabricator.wikimedia.org/T369500) [08:36:30] (03CR) 10Filippo Giunchedi: [C:03+2] admin: add eup to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1052921 (https://phabricator.wikimedia.org/T369500) (owner: 10Filippo Giunchedi) [08:36:46] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - 
https://phabricator.wikimedia.org/T368023#9972525 (10elukey) First test didn't go well: ` root@puppetserver1001:/srv/git/private# git add README... [08:39:18] andre: may I use the rest of your window? [08:39:34] effie, yes please [08:39:40] cheers [08:40:53] (03CR) 10Effie Mouzeli: [C:03+2] memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [08:41:48] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9972534 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you @EUwandu-WMF ! Everything checks out, you are now part of `wmf` group and therefore have access. I'm ten... [08:41:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T367856)', diff saved to https://phabricator.wikimedia.org/P66280 and previous config saved to /var/cache/conftool/dbconfig/20240711-084151-marostegui.json [08:41:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [08:41:55] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:41:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [08:46:06] !log cd /srv/git/private; git reset --hard HEAD^ on puppetserver1001 to remove my last local commit (test before migration of the private repo to puppetserver1001) - T368023 [08:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:09] T368023: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023 [08:48:51] (03PS1) 10Hnowlan: shellbox-video: set log level to debug temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053621 (https://phabricator.wikimedia.org/T356241) 
[08:52:00] (03CR) 10Effie Mouzeli: [C:03+2] mw-*: remove mcrouter container from mw pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053281 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [08:53:39] (03Merged) 10jenkins-bot: mw-*: remove mcrouter container from mw pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053281 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [08:53:41] (03PS2) 10Hnowlan: shellbox-video: set log level to debug temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053621 (https://phabricator.wikimedia.org/T356241) [08:55:04] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:55:38] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:57:17] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:57:50] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:00:09] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:02:04] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [09:02:47] (03PS2) 10Elukey: profile::tcpircbot: allow inbound conn from puppetserver nodes [puppet] - 10https://gerrit.wikimedia.org/r/1053616 (https://phabricator.wikimedia.org/T368023) [09:02:47] (03PS3) 10Elukey: profile::kerberos::kadminserver: allow more nodes in rsync [puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) [09:02:47] (03PS1) 10Elukey: profile::puppetserver::gitprivate: fix post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) [09:04:13] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [09:05:27] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [09:05:43] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host 
netbox1003.eqiad.wmnet [09:05:46] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:07:48] (03PS1) 10Btullis: Temporarily disable gobblin timers to permit hive maintenance [puppet] - 10https://gerrit.wikimedia.org/r/1053624 (https://phabricator.wikimedia.org/T365503) [09:08:16] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox1003.eqiad.wmnet - ayounsi@cumin1002" [09:08:45] (03CR) 10Btullis: [C:03+2] Temporarily disable gobblin timers to permit hive maintenance [puppet] - 10https://gerrit.wikimedia.org/r/1053624 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis) [09:09:10] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9972698 (10mforns) @Scott_French thanks for all! Regarding T361835#9966404, th... 
[09:09:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox1003.eqiad.wmnet - ayounsi@cumin1002" [09:09:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:09:21] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache netbox1003.eqiad.wmnet on all recursors [09:09:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox1003.eqiad.wmnet on all recursors [09:09:48] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox1003.eqiad.wmnet - ayounsi@cumin1002" [09:10:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox1003.eqiad.wmnet - ayounsi@cumin1002" [09:11:32] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host netbox1003.eqiad.wmnet with OS bookworm [09:12:52] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [09:12:54] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [09:13:59] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [09:18:11] !log ayounsi@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2003.codfw.wmnet [09:18:14] !log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [09:19:17] !log jiji@deploy1002 Started scap sync-world: Remove mcrouter container and exporter from mediawiki pods [09:20:53] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox2003.codfw.wmnet - ayounsi@cumin2002" [09:21:22] (03PS1) 10DCausse: rdf-streaming-updater: bump staging image to flink-1.17.1-rdf-0.3.144 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053626 [09:22:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netbox2003.codfw.wmnet - ayounsi@cumin2002" [09:22:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:22:11] !log ayounsi@cumin2002 START - Cookbook sre.dns.wipe-cache netbox2003.codfw.wmnet on all recursors [09:22:14] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2003.codfw.wmnet on all recursors [09:22:40] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox1003.eqiad.wmnet with reason: host reimage [09:22:42] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netbox2003.codfw.wmnet - ayounsi@cumin2002" [09:23:43] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM netbox2003.codfw.wmnet - ayounsi@cumin2002" [09:23:50] !log jiji@deploy1002 Finished scap: Remove mcrouter container and exporter from mediawiki pods (duration: 04m 33s) [09:25:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netbox1003.eqiad.wmnet with reason: host reimage [09:25:17] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host netbox2003.codfw.wmnet with OS bookworm [09:28:13] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:31:02] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:32:58] (03PS2) 10Alexandros Kosiaris: mediawiki-image-download: Drop to 75% [puppet] - 10https://gerrit.wikimedia.org/r/1039621 (https://phabricator.wikimedia.org/T366778) [09:33:35] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:36:27] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[09:40:25] (03CR) 10Vgutierrez: [C:03+1] ats: Route /api/ to /w/rest.php on mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/1052745 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris) [09:42:59] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox2003.codfw.wmnet with reason: host reimage [09:45:18] (03PS3) 10Ssingh: geo-maps: send BR (Brazil) to magru [dns] - 10https://gerrit.wikimedia.org/r/1052144 (https://phabricator.wikimedia.org/T359054) [09:45:48] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netbox2003.codfw.wmnet with reason: host reimage [09:47:42] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Drop to 75% [puppet] - 10https://gerrit.wikimedia.org/r/1039621 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [09:53:32] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:53:59] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:54:15] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:54:17] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:21] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:56:17] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1052745 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris) [09:56:18] (03PS1) 10Btullis: Revert "Configure analytics_meta MariaDB clients to connect to an-mariadb1002" [puppet] - 10https://gerrit.wikimedia.org/r/1053630 [09:57:04] (03PS1) 10Btullis: Revert "Facilitate a role swap from an-mariadb1001 to an-mariadb1002" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053631 [09:59:53] (03CR) 10Ssingh: [V:03+2 C:03+2] geo-maps: send BR (Brazil) to magru [dns] - 10https://gerrit.wikimedia.org/r/1052144 (https://phabricator.wikimedia.org/T359054) (owner: 10Ssingh) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1000) [10:00:28] !log [start] authdns-update for sending BR to magru: T359054 [10:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:01] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki::sites: switch to use APACHE_RUN_PORT [puppet] - 10https://gerrit.wikimedia.org/r/1052128 (owner: 10Giuseppe Lavagetto) [10:01:42] !log [end] authdns-update for sending BR to magru: T359054 [10:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:58] (03CR) 10Clément Goubert: [C:03+1] redirects.dat: change funnel target for sep11.wikipedia.org to meta wiki [puppet] - 10https://gerrit.wikimedia.org/r/1053400 (https://phabricator.wikimedia.org/T367014) (owner: 10Dzahn) [10:09:21] (03CR) 10Volans: [C:03+2] mariadb: bugfixes mysql_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [10:12:19] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:12:44] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[10:15:25] (03Merged) 10jenkins-bot: mariadb: bugfixes mysql_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [10:16:55] (03CR) 10Btullis: [C:03+2] Revert "Configure analytics_meta MariaDB clients to connect to an-mariadb1002" [puppet] - 10https://gerrit.wikimedia.org/r/1053630 (owner: 10Btullis) [10:19:08] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9972893 (10cmooney) 05Open→03Resolved [10:21:59] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [10:24:44] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: allowing Action API access [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053633 (https://phabricator.wikimedia.org/T369804) [10:25:27] (03PS2) 10Santiago Faci: Metrics Platform Instrument Configuration: Enabling access to Action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053633 (https://phabricator.wikimedia.org/T369804) [10:27:47] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1053633 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [10:27:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [10:28:54] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Enabling access to Action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053633 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [10:28:55] (03PS1) 10Btullis: Revert "Temporarily disable gobblin timers to permit hive maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/1053634 [10:30:22] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Enabling access to Action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053633 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [10:34:25] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:34:47] (03CR) 10Btullis: [C:03+2] Revert "Facilitate a role swap from an-mariadb1001 to an-mariadb1002" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053631 (owner: 10Btullis) [10:34:52] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:35:40] (03Merged) 10jenkins-bot: Revert "Facilitate a role swap from an-mariadb1001 to an-mariadb1002" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053631 (owner: 10Btullis) [10:36:59] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [10:37:09] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [10:39:29] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [10:39:57] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [10:40:08] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [10:40:36] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [10:41:20] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [10:47:18] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [10:51:17] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [10:51:25] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [10:51:35] (03PS1) 10Ayounsi: Netbox 4: create parent directories [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) [10:52:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netbox1003.eqiad.wmnet with OS bookworm [10:52:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox1003.eqiad.wmnet [10:53:02] (03PS2) 10Ayounsi: Netbox 4: create parent directories [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) [10:53:25] (03PS3) 10Ayounsi: Netbox 4: create parent directories [puppet] - 
10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) [10:53:57] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host netboxdb1003.eqiad.wmnet [10:53:58] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:54:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [10:55:08] (03CR) 10Hnowlan: [C:03+2] Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:56:11] (03Merged) 10jenkins-bot: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:56:21] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:57:08] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:58:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:58:51] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache netboxdb1003.eqiad.wmnet on all recursors [10:58:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netboxdb1003.eqiad.wmnet on all recursors [10:58:57] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:59:40] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:41] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:00:48] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [11:01:26] PROBLEM - Disk space on an-druid1001 is CRITICAL: DISK CRITICAL - free space: /srv 35134 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1001&var-datasource=eqiad+prometheus/ops [11:02:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:02:33] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache netboxdb1003.eqiad.wmnet on all recursors [11:02:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netboxdb1003.eqiad.wmnet on all recursors [11:02:59] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netboxdb1003.eqiad.wmnet [11:12:58] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [11:12:58] (03PS1) 10Btullis: Revert "Fail over hive and presto services to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1053639 [11:13:13] (03PS2) 10Btullis: Revert "Fail over hive and presto services to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1053639 [11:13:14] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [11:14:19] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:14:46] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:15:51] (03PS1) 10Ayounsi: Cumin aliases: hardcode current Netbox prod servers [puppet] - 10https://gerrit.wikimedia.org/r/1053640 (https://phabricator.wikimedia.org/T336275) [11:19:43] (03CR) 10Ssingh: [C:03+1] Cumin aliases: hardcode current Netbox prod servers [puppet] - 10https://gerrit.wikimedia.org/r/1053640 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:20:23] (03CR) 10Ayounsi: [C:03+2] Cumin aliases: hardcode current Netbox prod servers [puppet] - 10https://gerrit.wikimedia.org/r/1053640 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:21:26] RECOVERY - Disk space on an-druid1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1001&var-datasource=eqiad+prometheus/ops [11:24:07] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host netboxdb1003.eqiad.wmnet [11:24:09] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [11:25:44] FIRING: [23x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:01] jouncebot: next [11:26:01] In 0 hour(s) and 33 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1200) [11:26:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on relforge1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [11:26:28] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netboxdb1003.eqiad.wmnet - ayounsi@cumin1002" [11:27:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1003 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:27:35] (03PS1) 10DCausse: Revert "rdf-streaming-updater: add split graph config for staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053643 [11:28:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netboxdb1003.eqiad.wmnet - ayounsi@cumin1002" [11:28:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:28:07] !log 
ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache netboxdb1003.eqiad.wmnet on all recursors [11:28:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netboxdb1003.eqiad.wmnet on all recursors [11:28:34] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netboxdb1003.eqiad.wmnet - ayounsi@cumin1002" [11:28:42] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [11:28:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [11:29:12] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work [11:29:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work [11:29:20] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:29:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netboxdb1003.eqiad.wmnet - ayounsi@cumin1002" [11:29:33] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:29:45] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host netboxdb1003.eqiad.wmnet with OS bookworm [11:36:27] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netbox2003.codfw.wmnet with OS bookworm [11:36:28] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox2003.codfw.wmnet [11:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - 
https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:42:32] (03CR) 10DCausse: [C:03+2] Revert "rdf-streaming-updater: add split graph config for staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053643 (owner: 10DCausse) [11:43:28] (03Merged) 10jenkins-bot: Revert "rdf-streaming-updater: add split graph config for staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053643 (owner: 10DCausse) [11:46:20] (03PS22) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [11:47:13] (03CR) 10Kamila Součková: [C:03+1] shellbox-video: set log level to debug temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053621 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:48:53] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [11:49:05] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [11:50:22] (03CR) 10Clément Goubert: "The pattern used here is unusual. 
Services are usually declared this way:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [11:50:23] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host netboxdb1003.eqiad.wmnet with OS bookworm [11:50:23] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host netboxdb1003.eqiad.wmnet [11:56:54] (03PS23) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1200) [12:01:42] (03CR) 10Gergő Tisza: [V:03+2] "Live-tested on beta, it mostly works. There are a number of bugs, but it's good enough to merge and iterate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [12:13:45] (03PS1) 10Clément Goubert: verp_boubce_post_url: Switch to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) [12:15:48] (03PS2) 10Clément Goubert: verp_bounce_post_url: Switch to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) [12:17:09] (03CR) 10Cwhite: [C:03+2] logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451) (owner: 10Cwhite) [12:18:41] (03PS1) 10Clément Goubert: parsoid testing: Switch api_proxy_uri [puppet] - 10https://gerrit.wikimedia.org/r/1053651 (https://phabricator.wikimedia.org/T367949) [12:18:54] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:19:01] (03PS1) 10Vgutierrez: hiera: Extend bwlim experiment to 
cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/1053652 (https://phabricator.wikimedia.org/T317799) [12:19:30] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:19:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053652 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [12:20:07] (03PS4) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451) [12:20:24] (03PS2) 10Clément Goubert: parsoid testing: Switch api_proxy_uri [puppet] - 10https://gerrit.wikimedia.org/r/1053651 (https://phabricator.wikimedia.org/T367949) [12:24:16] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053651 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:24:22] (03CR) 10Cwhite: [C:03+2] logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451) (owner: 10Cwhite) [12:24:45] !log ayounsi@cumin2002 START - Cookbook sre.ganeti.makevm for new host netboxdb2003.codfw.wmnet [12:24:47] !log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [12:25:27] (03PS2) 10DCausse: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053626 [12:27:07] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netboxdb2003.codfw.wmnet - ayounsi@cumin2002" [12:27:40] !log dcausse@deploy1002 Started deploy [airflow-dags/search@7bb895a]: search: stop using api-ro.discovery.wmnet [12:28:02] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@7bb895a]: search: stop using api-ro.discovery.wmnet (duration: 00m 
21s) [12:28:27] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netboxdb2003.codfw.wmnet - ayounsi@cumin2002" [12:28:27] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:28] !log ayounsi@cumin2002 START - Cookbook sre.dns.wipe-cache netboxdb2003.codfw.wmnet on all recursors [12:28:31] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netboxdb2003.codfw.wmnet on all recursors [12:29:00] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netboxdb2003.codfw.wmnet - ayounsi@cumin2002" [12:29:33] (03PS1) 10Clément Goubert: turnilo: Hit mw-api-int instead of legacy api-rw [puppet] - 10https://gerrit.wikimedia.org/r/1053657 (https://phabricator.wikimedia.org/T367949) [12:30:01] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netboxdb2003.codfw.wmnet - ayounsi@cumin2002" [12:30:43] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host netboxdb2003.codfw.wmnet with OS bookworm [12:31:05] (03PS1) 10Clément Goubert: service::configuration: Fix doc [puppet] - 10https://gerrit.wikimedia.org/r/1053670 (https://phabricator.wikimedia.org/T367949) [12:35:04] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053626 (owner: 10DCausse) [12:36:06] (03Merged) 10jenkins-bot: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053626 (owner: 10DCausse) [12:39:53] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netboxdb1003.eqiad.wmnet with reason: netbox upgrade prep work [12:39:54] 
!log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4 days, 0:00:00 on netboxdb1003.eqiad.wmnet with reason: netbox upgrade prep work [12:42:26] (03CR) 10Elukey: [C:03+1] turnilo: Hit mw-api-int instead of legacy api-rw [puppet] - 10https://gerrit.wikimedia.org/r/1053657 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:42:54] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:43:56] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:46:12] (03PS3) 10JMeybohm: Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) [12:46:53] (03CR) 10CI reject: [V:04-1] Add kyverno_policy_parser [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [12:47:01] (03PS1) 10DCausse: rdf-streaming-updater: fix prefixes definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053681 [12:47:40] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:47:50] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netboxdb2003.codfw.wmnet with reason: host reimage [12:48:10] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: fix prefixes definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053681 (owner: 10DCausse) [12:48:32] !log temp stop benthos@webrequest_live on centrallog2002 - T369737 [12:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:38] T369737: Site Issue: Delayed data in the `webrequest_sampled_live` Druid table - https://phabricator.wikimedia.org/T369737 [12:48:39] (03CR) 10Clément Goubert: [C:03+2] turnilo: 
Hit mw-api-int instead of legacy api-rw [puppet] - 10https://gerrit.wikimedia.org/r/1053657 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:48:46] (03CR) 10Clément Goubert: [C:03+2] service::configuration: Fix doc [puppet] - 10https://gerrit.wikimedia.org/r/1053670 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:49:07] (03Merged) 10jenkins-bot: rdf-streaming-updater: fix prefixes definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053681 (owner: 10DCausse) [12:50:05] !log running puppet on O:analytics_cluster::turnilo,O:analytics_cluster::turnilo::staging [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:50:17] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:50:32] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:51:06] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netboxdb1003.eqiad.wmnet with reason: netbox upgrade prep work [12:51:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netboxdb2003.codfw.wmnet with reason: host reimage [12:51:11] (03CR) 10Elukey: [C:03+1] "Left a comment but feel free to decide!" [puppet] - 10https://gerrit.wikimedia.org/r/1053636 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:51:14] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[12:51:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 24.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:51:20] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netboxdb1003.eqiad.wmnet with reason: netbox upgrade prep work [12:51:27] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:40] hm [12:51:42] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on netboxdb2003.codfw.wmnet with reason: netbox upgrade prep work [12:51:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on netboxdb2003.codfw.wmnet with reason: netbox upgrade prep work [12:52:28] (03PS1) 10Cwhite: opensearch: add watermarks to instance params [puppet] - 10https://gerrit.wikimedia.org/r/1053682 (https://phabricator.wikimedia.org/T368168) [12:53:55] it's running hot, but ok for now [12:55:18] !log reenable benthos@webrequest_live on centrallog2002 - T369737 [12:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:22] T369737: Site Issue: Delayed data in the `webrequest_sampled_live` Druid table - https://phabricator.wikimedia.org/T369737 [12:56:18] (03PS2) 10Reedy: mediawiki: Refactor and improve captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/993010 [12:57:29] (03CR) 10Hnowlan: [C:03+2] shellbox-video: set log level to debug temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053621 
(https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [12:57:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:58:42] (03Merged) 10jenkins-bot: shellbox-video: set log level to debug temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053621 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [12:59:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:59:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:59:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:59:42] (03CR) 10Playgirlkaybraz11: "1052580: Template: Fix missing success styling on logout. 
| https://gerrit.wikimedia.org/r/c/operations/software/cas-overlay-template/+/10" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1052580 (owner: 10Slyngshede) [12:59:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:59:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T367781)', diff saved to https://phabricator.wikimedia.org/P66284 and previous config saved to /var/cache/conftool/dbconfig/20240711-125949-arnaudb.json [12:59:53] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:00:01] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:57] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[13:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 21.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:01:27] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:58] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configurator: Action API basepath for k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053684 (https://phabricator.wikimedia.org/T369804) [13:02:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T367781)', diff saved to https://phabricator.wikimedia.org/P66285 and previous config saved to /var/cache/conftool/dbconfig/20240711-130214-arnaudb.json [13:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.46% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:03:50] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950 [13:03:54] T368950: Consider migrating our Elastic TLS termination from nginx to envoy - https://phabricator.wikimedia.org/T368950 [13:04:02] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[13:04:07] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950 [13:04:38] !log Cordoning and depooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet for T365996 [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:42] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [13:05:47] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:05:50] hmm, those mw-api-int spikes are concerning. They kinda line up with when the changeprop/parsoid change went out [13:06:24] (03CR) 10Elukey: [C:03+2] Hosts: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [13:06:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 24.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:07:02] hnowlan: that change shifted traffic from mw-parsoid to mw-api-int basically, no? 
[13:07:04] it seems unlikely though, if anything the change should have reduced impact [13:07:21] jouncebot: next [13:07:22] In 1 hour(s) and 52 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1500) [13:07:31] jouncebot: now [13:07:31] For the next 0 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1300) [13:08:16] !log cgoubert@cumin1002 conftool action : set/pooled=inactive; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc [13:08:24] (03Merged) 10jenkins-bot: Hosts: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [13:08:39] topranks: ^ serviceops nodes drained and depooled for T365996 [13:09:00] (03PS1) 10Cwhite: opensearch: tune watermark settings to node disktype [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) [13:09:01] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:09:22] (03CR) 10Btullis: [C:03+2] Revert "Fail over hive and presto services to the standby coordinator" [dns] - 10https://gerrit.wikimedia.org/r/1053639 (owner: 10Btullis) [13:09:26] (03CR) 10CI reject: [V:04-1] opensearch: tune watermark settings to node disktype [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) (owner: 10Cwhite) [13:09:28] looks like a doubling of requests from mobileapps/pcs to mw-api-int https://grafana.wikimedia.org/goto/X02XE-lIg?orgId=1 [13:09:40] following up with the team to see what's what [13:10:13] yeah, if we really need to, we can bump replicas up, capacity's there [13:10:47] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[13:11:23] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053687 [13:12:13] (03CR) 10Btullis: "The helm lint shows no change. Are you sure that this value is being used by the chart?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053684 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [13:12:16] (03PS16) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [13:12:40] (03CR) 10CI reject: [V:04-1] relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [13:12:59] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [13:13:37] (03PS17) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [13:13:37] argh, there was no hosts, why did I have them in my reminder [13:13:45] well, uncordoning and repooling then [13:13:48] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[13:14:00] (03CR) 10CI reject: [V:04-1] relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [13:14:02] !log Uncordoning and depooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet that were actually not concerned by T365996 [13:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:05] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [13:14:22] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc [13:16:51] (03PS18) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [13:16:53] (03PS3) 10Alexandros Kosiaris: deployment::rsync: Add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) [13:17:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P66286 and previous config saved to /var/cache/conftool/dbconfig/20240711-131721-arnaudb.json [13:17:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [13:19:09] (03PS4) 10Alexandros Kosiaris: deployment::rsync: Add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/1052111 
(https://phabricator.wikimedia.org/T364417) [13:20:27] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:20:35] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1090.eqiad.wmnet [13:22:01] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:22:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:47] (03PS1) 10DCausse: Revert "rdf-streaming-updater: add split graph config for staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053691 [13:24:13] (03CR) 10Giuseppe Lavagetto: [C:03+2] Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 (owner: 10Giuseppe Lavagetto) [13:24:22] (03PS2) 10Arnaudb: mysql: replication lag monitoring threshold and severity change [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) [13:24:22] (03CR) 10Arnaudb: "I've dropped https://gerrit.wikimedia.org/r/c/operations/alerts/+/1053689 that had merge issues that were too tedious to fix, this PS is t" [alerts] - 10https://gerrit.wikimedia.org/r/1053689 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:24:54] (03CR) 10DCausse: [C:03+2] Revert "rdf-streaming-updater: add split graph config for staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053691 (owner: 10DCausse) [13:25:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 795 (alerts on 35) - 
https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:25:40] (03CR) 10Giuseppe Lavagetto: "That is actually the goal, moving from templating out the content of a puppet file reading it from the indirection api to just serving the" [puppet] - 10https://gerrit.wikimedia.org/r/1052129 (owner: 10Giuseppe Lavagetto) [13:25:57] (03PS1) 10Slyngshede: Docker: Allow uwsgi to serve static content. [software/bitu] - 10https://gerrit.wikimedia.org/r/1053692 [13:26:34] (03PS5) 10Alexandros Kosiaris: deployment::rsync: Add support for PKI [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) [13:26:55] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:27:07] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.97% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:27:59] (03CR) 10Slyngshede: [C:03+2] Docker: Allow uwsgi to serve static content. [software/bitu] - 10https://gerrit.wikimedia.org/r/1053692 (owner: 10Slyngshede) [13:28:31] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1090.eqiad.wmnet [13:29:50] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[13:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 24.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:32:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P66287 and previous config saved to /var/cache/conftool/dbconfig/20240711-133229-arnaudb.json [13:32:47] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:34:24] (03Merged) 10jenkins-bot: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 (owner: 10Giuseppe Lavagetto) [13:34:46] (03Merged) 10jenkins-bot: Revert "rdf-streaming-updater: add split graph config for staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053691 (owner: 10DCausse) [13:35:33] (03CR) 10Ottomata: "Ya thank you. Sorry, will update this to use the rest endpoint once available." 
[puppet] - 10https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [13:37:43] (03CR) 10Alexandros Kosiaris: [C:03+2] "PCC has the changes I expected to see, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1052111 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [13:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 24.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:39:12] (03CR) 10Vgutierrez: [C:03+2] hiera: Extend bwlim experiment to cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/1053652 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [13:39:47] (03CR) 10Hashar: [C:04-1] "My guess is `update_version` should be removed from this repository and the Jenkins job + Zuul config should be removed. That is worth a t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [13:42:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:43:00] (03CR) 10Hashar: [C:04-1] "Fun twist, if one mentions someone with `@NameOfPerson` they are only added as `cc` and not as `reviewer`. 
That is why it took me two days" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [13:44:25] !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [13:46:27] FIRING: SystemdUnitFailed: stunnel4.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T367781)', diff saved to https://phabricator.wikimedia.org/P66288 and previous config saved to /var/cache/conftool/dbconfig/20240711-134737-arnaudb.json [13:47:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:47:41] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:47:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:47:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T367781)', diff saved to https://phabricator.wikimedia.org/P66289 and previous config saved to /var/cache/conftool/dbconfig/20240711-134759-arnaudb.json [13:50:07] !log depool ms-fe1014 and thanos-fe1004 before switch work T365996 [13:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:11] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [13:50:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T367781)', diff saved to https://phabricator.wikimedia.org/P66290 and previous config saved to 
/var/cache/conftool/dbconfig/20240711-135023-arnaudb.json [13:51:08] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973522 (10MatthewVernon) ms and thanos frontends depooled, you're good to go from a swift POV. [13:51:27] FIRING: [2x] SystemdUnitFailed: stunnel4.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:57] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:54:17] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 22.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:56:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[14:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 23.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:02:26] (03PS1) 10Hnowlan: mw-api-int: scale up by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053696 (https://phabricator.wikimedia.org/T367418) [14:05:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P66291 and previous config saved to /var/cache/conftool/dbconfig/20240711-140530-arnaudb.json [14:08:15] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-f1-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f1-eqiad [14:08:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-f1-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f1-eqiad [14:08:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973584 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9abb3472-bf69-45f5-8c93-e3c8cfbe9e4e) set by cmooney... 
[14:08:47] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f1-eqiad [14:09:04] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f1-eqiad [14:09:14] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973587 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d7f08b17-a319-4077-a271-a0ef15a438a3) set by cmooney... [14:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 21.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:12:26] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad [14:12:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad [14:12:48] !log depool titan1001 for switch work T365996 [14:12:58] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973590 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d5a6d4b-345e-4f18-8342-05572d6411e7) set by cmooney... 
[14:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:07] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [14:13:19] (03PS3) 10CDobbins: purged: set use_pki to true for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [14:15:06] !log rebooting lsw1-f1-eqiad to install updated JunOS version T365996 [14:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:12] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9973600 (10ssingh) Following up on this after some discussion with the Traffic folks: It seems like our preferred version for this is something like: ` confc... [14:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at eqiad: 21.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:18:01] (03CR) 10Clément Goubert: [C:03+1] mw-api-int: scale up by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053696 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [14:18:18] (03PS2) 10Ssingh: P:conftool: add schema for geodns [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) [14:18:24] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3206/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:18:24] (03PS1) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 
10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [14:18:24] (03CR) 10Arnaudb: "gave up on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053326 to avoid merge conflict resolution on a tiny ps. I took note of th" [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [14:19:20] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad [14:19:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad [14:19:51] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973616 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=de50ae5f-fec9-4347-b2ef-225a3af373f6) set by cmooney... [14:20:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P66292 and previous config saved to /var/cache/conftool/dbconfig/20240711-142037-arnaudb.json [14:23:55] (03PS2) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [14:24:17] FIRING: [2x] ProbeDown: Service ml-cache1003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:35] FIRING: [7x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:45] !log arnaudb@cumin1002 dbctl 
commit (dc=all): 'T365996 - depool db1193 - s8', diff saved to https://phabricator.wikimedia.org/P66293 and previous config saved to /var/cache/conftool/dbconfig/20240711-142544-arnaudb.json [14:25:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996 [14:25:48] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [14:25:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996 [14:26:21] (03PS1) 10Jgiannelos: Revert "Remove page html endpoints from changeprop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053699 [14:26:49] 10ops-eqiad, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825 (10fgiunchedi) 03NEW [14:26:49] (03CR) 10Ssingh: "1)" [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:27:12] (03PS1) 10Hnowlan: Revert "Remove page html endpoints from changeprop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053700 (https://phabricator.wikimedia.org/T367418) [14:27:57] 10ops-codfw, 06DC-Ops: 10gbit nic option for centrallog2002 - https://phabricator.wikimedia.org/T369826 (10fgiunchedi) 03NEW [14:28:45] (03CR) 10Giuseppe Lavagetto: [C:03+2] Revert "Remove page html endpoints from changeprop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053700 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [14:30:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:30:36] (03Merged) 10jenkins-bot: Revert "Remove page html endpoints from changeprop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053700 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [14:30:48] (03Abandoned) 10Jgiannelos: Revert "Remove page html endpoints from changeprop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053699 (owner: 10Jgiannelos) [14:32:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [14:34:15] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973728 (10cmooney) Switch upgrade complete, all looks good hosts are online and responding to ping again. Thanks for the assis... 
[14:34:17] FIRING: [7x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:34:54] (03PS1) 10Effie Mouzeli: WIP wf mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053702 [14:35:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:35:05] !log pool titan1001 for switch work T365996 [14:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:08] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [14:35:35] FIRING: [7x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:35:35] RESOLVED: [2x] ProbeDown: Service ml-cache1003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66294 and previous config saved to /var/cache/conftool/dbconfig/20240711-143541-arnaudb.json [14:35:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1185.eqiad.wmnet with 
reason: Maintenance [14:35:59] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:36:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T367781)', diff saved to https://phabricator.wikimedia.org/P66295 and previous config saved to /var/cache/conftool/dbconfig/20240711-143606-arnaudb.json [14:36:09] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:36:27] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:52] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on dumpsdata1007 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T369829 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [14:37:57] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829 (10ops-monitoring-bot) 03NEW [14:38:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367781)', diff saved to https://phabricator.wikimedia.org/P66296 and previous config saved to /var/cache/conftool/dbconfig/20240711-143829-arnaudb.json [14:38:57] (03PS2) 10Effie Mouzeli: mw-wf: remove unused routes from mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053702 [14:39:01] (03PS13) 10Hashar: git: remove umask from git::clone [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) [14:39:17] FIRING: [8x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [14:40:57] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [14:41:10] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [14:41:58] (03CR) 10Hashar: "The Puppet Catalogue Compilation is a bit noisy since the previous code would default umask to `0022`, but I think it is fine." [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [14:42:29] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [14:42:46] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [14:42:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [14:43:00] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [14:44:25] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [14:45:15] (03PS3) 10Effie Mouzeli: thumbor: switch to node local mw-mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053293 (https://phabricator.wikimedia.org/T346690) [14:46:04] (03PS4) 10Effie Mouzeli: thumbor: switch to node local mw-mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053293 (https://phabricator.wikimedia.org/T346690) [14:46:17] (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configurator: Action API basepath for k8s [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1053684 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [14:47:01] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#9973802 (10Marostegui) [14:47:04] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configurator: Action API basepath for k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053684 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [14:47:22] (03PS19) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [14:47:24] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973806 (10ABran-WMF) dbhost repooling dbproxy reloaded backuphost checked and looks green [14:48:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [14:50:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66297 and previous config saved to /var/cache/conftool/dbconfig/20240711-145047-arnaudb.json [14:50:51] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [14:53:30] (03PS20) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [14:53:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P66298 and previous config saved to /var/cache/conftool/dbconfig/20240711-145336-arnaudb.json [14:53:53] (03CR) 10CI reject: [V:04-1] relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 
(https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [14:55:44] (03CR) 10CDanis: [C:03+1] "let's gooooo" [puppet] - 10https://gerrit.wikimedia.org/r/1053652 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [14:55:56] !log repool ms-fe1014 and thanos-fe1004 before switch work T365996 [14:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:00] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [14:56:45] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: NOC API replacement to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053706 (https://phabricator.wikimedia.org/T369804) [14:57:12] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9973844 (10MatthewVernon) Swift and thanos frontends repooled, all seems OK. [14:57:25] (03CR) 10Alexandros Kosiaris: [C:04-1] "Indeed, but this:" [puppet] - 10https://gerrit.wikimedia.org/r/1052129 (owner: 10Giuseppe Lavagetto) [14:58:16] (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configuration: NOC API replacement to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053706 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [14:59:06] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[14:59:16] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: NOC API replacement to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053706 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [14:59:17] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:57] (03PS1) 10Jdlrobson: Vector theme should default to day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T368795) [15:00:01] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:00:05] andre and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1500) [15:00:10] (03CR) 10CI reject: [V:04-1] Vector theme should default to day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [15:01:17] (03PS21) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [15:01:33] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:01:44] (03CR) 10CI reject: [V:04-1] relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [15:03:30] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:04:05] (03CR) 10Volans: "The fact that the whole file indentation has changed doesn't simplify the review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [15:04:27] (03PS1) 10Alexandros Kosiaris: Revert "deployment::rsync: Add support for PKI" [puppet] - 10https://gerrit.wikimedia.org/r/1053709 [15:04:36] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Revert "deployment::rsync: Add support for PKI" [puppet] - 10https://gerrit.wikimedia.org/r/1053709 (owner: 10Alexandros Kosiaris) [15:05:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66299 and previous config saved to /var/cache/conftool/dbconfig/20240711-150553-arnaudb.json [15:05:57] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [15:06:45] (03CR) 10Hnowlan: [C:03+1] thumbor: switch to node local mw-mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053293 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:08:02] (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable gobblin timers to permit hive maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/1053634 (owner: 10Btullis) [15:08:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P66300 and previous config saved to /var/cache/conftool/dbconfig/20240711-150843-arnaudb.json [15:09:25] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [15:09:26] (03CR) 10Clément Goubert: [C:03+1] mw-wf: remove unused routes from mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053702 (owner: 10Effie Mouzeli) [15:09:45] (03PS5) 10Effie Mouzeli: thumbor: switch to node local mw-mcrouter ds 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1053293 (https://phabricator.wikimedia.org/T346690) [15:10:30] jouncebot: now [15:10:30] For the next 0 hour(s) and 49 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1500) [15:10:36] jouncebot: next [15:10:36] In 0 hour(s) and 49 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1600) [15:10:51] (03CR) 10Effie Mouzeli: [C:03+2] mw-wf: remove unused routes from mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053702 (owner: 10Effie Mouzeli) [15:11:01] (03PS1) 10Santiago Faci: MPIC chart: Adding support for a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053710 (https://phabricator.wikimedia.org/T369804) [15:11:39] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [15:11:41] (03Merged) 10jenkins-bot: mw-wf: remove unused routes from mcrouter config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053702 (owner: 10Effie Mouzeli) [15:11:44] (03CR) 10CI reject: [V:04-1] MPIC chart: Adding support for a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053710 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [15:11:44] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [15:11:56] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [15:12:00] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [15:12:15] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [15:12:41] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [15:13:13] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [15:13:32] !log jiji@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mw-wikifunctions: apply [15:14:42] (03PS2) 10Santiago Faci: MPIC chart: Adding support for a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053710 (https://phabricator.wikimedia.org/T369804) [15:14:51] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: switch to node local mw-mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053293 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:15:39] (03PS2) 10Cwhite: opensearch: tune watermark settings to node disktype [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) [15:15:44] (03Merged) 10jenkins-bot: thumbor: switch to node local mw-mcrouter ds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053293 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:16:24] (03CR) 10Clare Ming: [C:03+2] MPIC chart: Adding support for a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053710 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [15:16:58] (03PS3) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [15:17:15] (03Merged) 10jenkins-bot: MPIC chart: Adding support for a new config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053710 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [15:17:22] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:17:24] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053686 (https://phabricator.wikimedia.org/T368168) (owner: 10Cwhite) [15:17:44] (03CR) 10Arnaudb: "PS2 fixes this, sorry 😬" [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [15:20:28] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:20:48] 
!log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:20:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66301 and previous config saved to /var/cache/conftool/dbconfig/20240711-152058-arnaudb.json [15:21:02] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [15:21:32] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:22:11] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:22:21] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:22:48] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:23:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367781)', diff saved to https://phabricator.wikimedia.org/P66302 and previous config saved to /var/cache/conftool/dbconfig/20240711-152350-arnaudb.json [15:23:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:23:54] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:24:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:24:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T367781)', diff saved to https://phabricator.wikimedia.org/P66303 and previous config saved to /var/cache/conftool/dbconfig/20240711-152412-arnaudb.json [15:26:10] (03PS1) 10Alexandros Kosiaris: multirootca: Add an stunnel related intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1053717 [15:26:10] (03PS1) 10Alexandros Kosiaris: deploy1003: Undo the puppet 7 force [puppet] - 
10https://gerrit.wikimedia.org/r/1053718 (https://phabricator.wikimedia.org/T364417) [15:26:26] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:26:27] RESOLVED: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367781)', diff saved to https://phabricator.wikimedia.org/P66304 and previous config saved to /var/cache/conftool/dbconfig/20240711-152635-arnaudb.json [15:26:40] (03CR) 10CI reject: [V:04-1] multirootca: Add an stunnel related intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1053717 (owner: 10Alexandros Kosiaris) [15:28:07] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:29:16] (03PS2) 10Alexandros Kosiaris: deploy1003: Undo the puppet 7 force [puppet] - 10https://gerrit.wikimedia.org/r/1053718 (https://phabricator.wikimedia.org/T364417) [15:29:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:29:31] (03PS1) 10Santiago Faci: MPIC chart: Updating chart version to apply last changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053720 (https://phabricator.wikimedia.org/T369804) [15:29:35] (03Abandoned) 10Alexandros Kosiaris: multirootca: Add an stunnel related intermediate [puppet] - 10https://gerrit.wikimedia.org/r/1053717 (owner: 10Alexandros Kosiaris) [15:29:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:29:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T367856)', diff saved to 
https://phabricator.wikimedia.org/P66305 and previous config saved to /var/cache/conftool/dbconfig/20240711-152946-marostegui.json [15:29:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:30:53] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:31:06] (03CR) 10Alexandros Kosiaris: [C:03+2] deploy1003: Undo the puppet 7 force [puppet] - 10https://gerrit.wikimedia.org/r/1053718 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [15:31:26] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:32:34] (03CR) 10Clare Ming: [C:03+2] MPIC chart: Updating chart version to apply last changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053720 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [15:33:36] (03Merged) 10jenkins-bot: MPIC chart: Updating chart version to apply last changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053720 (https://phabricator.wikimedia.org/T369804) (owner: 10Santiago Faci) [15:36:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66306 and previous config saved to /var/cache/conftool/dbconfig/20240711-153604-arnaudb.json [15:36:08] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [15:36:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host deploy1003.eqiad.wmnet with OS bullseye [15:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:40:58] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:41:09] !log 
sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:41:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P66307 and previous config saved to /var/cache/conftool/dbconfig/20240711-154142-arnaudb.json [15:48:37] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [15:51:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66308 and previous config saved to /var/cache/conftool/dbconfig/20240711-155109-arnaudb.json [15:51:14] T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996 [15:51:52] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on deploy1003.eqiad.wmnet with reason: host reimage [15:52:07] (03PS1) 10Daniel Kinzler: Enable Special:RestSandbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053723 (https://phabricator.wikimedia.org/T362006) [15:52:59] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:53:12] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:56:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P66309 and previous config saved to /var/cache/conftool/dbconfig/20240711-155649-arnaudb.json [15:57:03] (03PS3) 10Ssingh: P:conftool: add schema for geodns [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) [15:58:40] (03PS1) 10Btullis: Add the thirdparty/yarn deb repository to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1053724 (https://phabricator.wikimedia.org/T365839) [16:00:05] jhathaway and 
rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:35] (03CR) 10Ssingh: "FYI I am also updating https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/4/commits based on whatever revisions are doing he" [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [16:03:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [16:03:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [16:05:35] (03PS22) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [16:05:39] (03Abandoned) 10Hnowlan: mw-api-int: scale up by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053696 (https://phabricator.wikimedia.org/T367418) (owner: 10Hnowlan) [16:05:59] (03CR) 10CI reject: [V:04-1] relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [16:11:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367781)', diff saved to https://phabricator.wikimedia.org/P66310 and previous config saved to /var/cache/conftool/dbconfig/20240711-161157-arnaudb.json [16:11:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:12:01] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:12:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:12:19] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T367781)', diff saved to https://phabricator.wikimedia.org/P66311 and previous config saved to /var/cache/conftool/dbconfig/20240711-161219-arnaudb.json [16:14:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367781)', diff saved to https://phabricator.wikimedia.org/P66312 and previous config saved to /var/cache/conftool/dbconfig/20240711-161446-arnaudb.json [16:16:17] (03CR) 10CDanis: [C:03+1] Add the thirdparty/yarn deb repository to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1053724 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:18:22] (03PS23) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [16:22:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [16:22:50] (03CR) 10Btullis: [C:03+2] Add the thirdparty/yarn deb repository to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1053724 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:27:31] (03PS24) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [16:27:34] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2024-07-11-122459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053732 [16:29:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P66313 and previous config saved to /var/cache/conftool/dbconfig/20240711-162953-arnaudb.json [16:30:57] (03PS25) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [16:31:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 
(https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [16:35:58] (03PS1) 10DCausse: rdf-streaming-updater: add split graph config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053734 [16:37:55] (03PS4) 10CDobbins: purged: set use_pki to true for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [16:40:30] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:40:58] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:41:02] (03CR) 10Dzahn: [C:03+2] "I see! No problem, I can rename it. The main reason I did it was that I had already merged the private secret and I wanted it to match the" [puppet] - 10https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: 10Urbanecm) [16:41:21] (03PS1) 10Btullis: Fix the thirdparty/yarn repo [puppet] - 10https://gerrit.wikimedia.org/r/1053745 (https://phabricator.wikimedia.org/T365839) [16:42:20] (03CR) 10Btullis: [C:03+2] Fix the thirdparty/yarn repo [puppet] - 10https://gerrit.wikimedia.org/r/1053745 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [16:42:57] (03PS26) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [16:43:27] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [16:45:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P66314 and previous config saved to /var/cache/conftool/dbconfig/20240711-164500-arnaudb.json [16:46:19] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:46:36] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply 
[16:53:42] (03PS2) 10Jdlrobson: Vector theme should default to day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) [16:53:59] (03CR) 10CI reject: [V:04-1] Vector theme should default to day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) (owner: 10Jdlrobson) [16:58:02] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [16:58:21] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [16:58:44] !log puppetmaster1001 - puppet cert clean phabricator.discovery.wmnet T369796 T360413 [16:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:48] T369796: Puppet CA certificate phabricator.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T369796 [16:58:49] T360413: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413 [17:00:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367781)', diff saved to https://phabricator.wikimedia.org/P66315 and previous config saved to /var/cache/conftool/dbconfig/20240711-170007-arnaudb.json [17:00:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:00:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:00:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:00:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T367781)', diff saved to https://phabricator.wikimedia.org/P66316 and previous config saved to /var/cache/conftool/dbconfig/20240711-170030-arnaudb.json [17:02:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367781)', 
diff saved to https://phabricator.wikimedia.org/P66317 and previous config saved to /var/cache/conftool/dbconfig/20240711-170258-arnaudb.json [17:05:05] bd808: How many deployers does it take to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1700). [17:05:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T1700) [17:05:36] o/ I have a developer portal build to push out via my window today [17:05:48] (03PS27) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [17:06:05] !log puppetmaster1001 - puppet cert clean aphlict..discovery.wmnet T369796 T360413 [17:06:08] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container version to 2024-07-11-122459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053732 (owner: 10BryanDavis) [17:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:10] T369796: Puppet CA certificate phabricator.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T369796 [17:06:10] T360413: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413 [17:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 3.277% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:06:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [17:07:01] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 
2024-07-11-122459-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053732 (owner: 10BryanDavis) [17:07:20] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:07:39] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:08:21] (03CR) 10Ahmon Dancy: "Nice change." [puppet] - 10https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [17:08:30] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:08:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053723 (https://phabricator.wikimedia.org/T362006) (owner: 10Daniel Kinzler) [17:09:15] (03PS2) 10Daniel Kinzler: Enable Special:RestSandbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053723 (https://phabricator.wikimedia.org/T362006) [17:09:22] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:09:24] (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053723 (https://phabricator.wikimedia.org/T362006) (owner: 10Daniel Kinzler) [17:09:31] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:09:44] (03CR) 10Elukey: [C:04-1] "Precautionary -1 since I just realized that the /etc/puppet/private repo on puppetserver should be the readonly repository on puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [17:10:00] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:10:05] (03Merged) 10jenkins-bot: Enable Special:RestSandbox on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053723 (https://phabricator.wikimedia.org/T362006) (owner: 10Daniel 
Kinzler) [17:10:21] !log daniel@deploy1002 Started scap sync-world: Backport for [[gerrit:1053723|Enable Special:RestSandbox on testwiki (T362006)]] [17:10:24] T362006: Provide a Swagger-UI for exploring the core REST API - https://phabricator.wikimedia.org/T362006 [17:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 2.57% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:12:32] (03CR) 10Elukey: [C:04-1] "Also:" [puppet] - 10https://gerrit.wikimedia.org/r/1053623 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [17:18:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P66318 and previous config saved to /var/cache/conftool/dbconfig/20240711-171806-arnaudb.json [17:23:54] (03CR) 10Ssingh: purged: set use_pki to true for all sites (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [17:23:57] (03PS5) 10CDobbins: purged: set use_pki to true for all sites [puppet] - 10https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [17:28:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host deploy1003.eqiad.wmnet with OS bullseye [17:29:39] (03PS28) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [17:30:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [17:33:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to 
https://phabricator.wikimedia.org/P66319 and previous config saved to /var/cache/conftool/dbconfig/20240711-173313-arnaudb.json [17:34:42] (03PS29) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [17:34:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [17:38:02] (03CR) 10Bking: [C:03+2] relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [17:38:21] (03CR) 10Bking: [C:03+2] "self-merging, as this only affects a test environment (relforge)" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [17:41:37] !log daniel@deploy1002 Started scap sync-world: Backport for [[gerrit:1053723|Enable Special:RestSandbox on testwiki (T362006)]] [17:41:41] T362006: Provide a Swagger-UI for exploring the core REST API - https://phabricator.wikimedia.org/T362006 [17:42:47] (03PS1) 10Ottomata: DO NOT MERGE - example of what ESC would look like if we remove wgEnableEventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053750 [17:46:08] (03CR) 10Dzahn: "it has been noticed during deployment today that the tests on the secure.wikimedia.org vhost are failing. 
this was confirmed on mwdebug*, " [puppet] - 10https://gerrit.wikimedia.org/r/1052128 (owner: 10Giuseppe Lavagetto) [17:46:27] !log daniel@deploy1002 daniel: Backport for [[gerrit:1053723|Enable Special:RestSandbox on testwiki (T362006)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:47:27] !log daniel@deploy1002 daniel: Continuing with sync [17:48:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367781)', diff saved to https://phabricator.wikimedia.org/P66321 and previous config saved to /var/cache/conftool/dbconfig/20240711-174820-arnaudb.json [17:48:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [17:48:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:48:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [17:48:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:49:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [17:49:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:49:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:49:57] PROBLEM - Host db1179 #page is DOWN: PING CRITICAL - Packet loss = 100% [17:50:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:50:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2128.codfw.wmnet with reason: Maintenance [17:50:19] !log arnaudb@cumin1002 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:50:29] looking [17:50:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:50:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T367781)', diff saved to https://phabricator.wikimedia.org/P66322 and previous config saved to /var/cache/conftool/dbconfig/20240711-175038-arnaudb.json [17:50:58] rzl: can you depool? [17:51:03] marostegui: on it [17:51:26] Thank you rzl [17:51:36] (PS1) Bking: relforge: test envoyproxy with multiple instances [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) [17:51:44] I can be on my laptop in 5 min [17:51:49] (CR) CI reject: [V:-1] relforge: test envoyproxy with multiple instances [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) (owner: Bking) [17:51:55] all those maintenance downtimes for other db hosts while this one goes down.. a bit suspicious [17:52:13] !log rzl@cumin2002 dbctl commit (dc=all): 'db1179 depooled', diff saved to https://phabricator.wikimedia.org/P66324 and previous config saved to /var/cache/conftool/dbconfig/20240711-175212-rzl.json [17:52:34] (PS6) CDobbins: purged: set use_pki to true for all sites [puppet] - https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [17:52:39] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1053723|Enable Special:RestSandbox on testwiki (T362006)]] (duration: 11m 01s) [17:52:42] T362006: Provide a Swagger-UI for exploring the core REST API - https://phabricator.wikimedia.org/T362006 [17:52:54] marostegui: want a phab task? [17:52:59] arnaudb: any chance db1179 is part of the maintenance? 
[17:53:04] rzl: that would be amazing [17:53:08] 👍 [17:53:28] mutante: it's not, we have schema changes running 24*7 nowadays so there's always going to be a SAL entry close by [17:53:38] ack [17:54:08] (PS1) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) [17:54:14] (CR) Ssingh: "Looks good! There are two other overrides I see:" [puppet] - https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) (owner: CDobbins) [17:54:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:54:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T367781)', diff saved to https://phabricator.wikimedia.org/P66325 and previous config saved to /var/cache/conftool/dbconfig/20240711-175424-arnaudb.json [17:54:28] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:54:46] (CR) CI reject: [V:-1] MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy) [17:55:10] marostegui: https://phabricator.wikimedia.org/T369855 [17:55:46] (PS2) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) [17:56:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector 
- https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:56:59] rzl: thank you so much [17:57:09] not at all <3 [17:59:13] thanks rzl! [17:59:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:59:46] resolving in VO as soon as I finish selecting all images with bridges [18:00:37] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:00:42] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=93) on VRTS host vrts1001.eqiad.wmnet [18:01:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [18:05:06] (PS1) AOkoth: vrts: fix version comparison [cookbooks] - https://gerrit.wikimedia.org/r/1053755 (https://phabricator.wikimedia.org/T366078) [18:08:23] (CR) CDanis: [C:+1] Fix the thirdparty/yarn repo [puppet] - https://gerrit.wikimedia.org/r/1053745 (https://phabricator.wikimedia.org/T365839) (owner: Btullis) [18:08:26] (PS2) AOkoth: vrts: fix version comparison [cookbooks] - https://gerrit.wikimedia.org/r/1053755 (https://phabricator.wikimedia.org/T366078) [18:09:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P66326 and previous config saved to /var/cache/conftool/dbconfig/20240711-180931-arnaudb.json [18:09:35] (PS3) AOkoth: vrts: fix version comparison [cookbooks] - https://gerrit.wikimedia.org/r/1053755 (https://phabricator.wikimedia.org/T366078) [18:10:33] (PS1) Ryan Kemper: 
wdqs graph split: routing for new services [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) [18:10:56] (CR) CI reject: [V:-1] wdqs graph split: routing for new services [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) (owner: Ryan Kemper) [18:11:26] (CR) Ryan Kemper: "Woops, bad rebase since I see a bunch of unrelated stuff. Cleaning that up now" [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) (owner: Ryan Kemper) [18:12:56] (PS2) Ryan Kemper: wdqs graph split: routing for new services [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) [18:12:57] (CR) Andrew Bogott: Revert "nova policy: temporarily disable VM resizing" [puppet] - https://gerrit.wikimedia.org/r/1043163 (owner: Andrew Bogott) [18:13:07] (PS3) Andrew Bogott: Revert "nova policy: temporarily disable VM resizing" [puppet] - https://gerrit.wikimedia.org/r/1043163 [18:13:38] (CR) AOkoth: [C:+2] vrts: fix version comparison [cookbooks] - https://gerrit.wikimedia.org/r/1053755 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth) [18:15:55] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:17:18] (CR) Andrew Bogott: [V:+2 C:+2] Revert "nova policy: temporarily disable VM resizing" [puppet] - https://gerrit.wikimedia.org/r/1043163 (owner: Andrew Bogott) [18:18:13] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=99) on VRTS host vrts1001.eqiad.wmnet [18:19:45] (CR) EoghanGaffney: [C:+1] vrts: fix version comparison [cookbooks] - https://gerrit.wikimedia.org/r/1053755 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth) [18:24:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P66327 and previous 
config saved to /var/cache/conftool/dbconfig/20240711-182438-arnaudb.json [18:24:42] (PS3) Ryan Kemper: wdqs graph split: routing for new services [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) [18:32:54] (PS7) Bking: relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) [18:34:58] (CR) Dzahn: "I confirm webserver-misc-sites.discovery.wmnet is nicer to use than webserver-misc-eqiad.discovery.wmnet and currently that's pointing to " [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) (owner: Ryan Kemper) [18:35:16] (PS7) CDobbins: purged: set use_pki to true for all sites [puppet] - https://gerrit.wikimedia.org/r/1050417 (https://phabricator.wikimedia.org/T360506) [18:37:27] (PS8) Bking: relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) [18:39:11] (PS1) AOkoth: vrts: fix proxy for download [cookbooks] - https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) [18:39:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T367781)', diff saved to https://phabricator.wikimedia.org/P66328 and previous config saved to /var/cache/conftool/dbconfig/20240711-183946-arnaudb.json [18:39:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:39:51] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:40:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:40:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] 
- https://gerrit.wikimedia.org/r/1051487 (owner: Arlolra) [18:40:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T367781)', diff saved to https://phabricator.wikimedia.org/P66329 and previous config saved to /var/cache/conftool/dbconfig/20240711-184009-arnaudb.json [18:42:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T367781)', diff saved to https://phabricator.wikimedia.org/P66330 and previous config saved to /var/cache/conftool/dbconfig/20240711-184258-arnaudb.json [18:43:16] (CR) Arlolra: Change Linter log level to info (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1051487 (owner: Arlolra) [18:43:30] (CR) CI reject: [V:-1] vrts: fix proxy for download [cookbooks] - https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth) [18:46:57] SRE-swift-storage, Commons, media-backups, MediaWiki-File-management, TimedMediaHandler: Consider increasing $wgTranscodeBackgroundSizeLimit to 5GB - https://phabricator.wikimedia.org/T357184#9974832 (Yuhong) Right now this limit is not even 4GB. 
[18:48:51] (PS2) AOkoth: vrts: fix proxy for download [cookbooks] - https://gerrit.wikimedia.org/r/1053761 (https://phabricator.wikimedia.org/T366078) [18:52:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbproxy2005.codfw.wmnet with OS bookworm [18:52:53] ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9974850 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm [18:57:08] (PS4) Ryan Kemper: wdqs graph split: route / to miscweb microsite [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) [18:57:08] (PS1) Ryan Kemper: wdqs graph split: routing for wdqs backends [puppet] - https://gerrit.wikimedia.org/r/1053765 (https://phabricator.wikimedia.org/T364367) [18:58:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P66331 and previous config saved to /var/cache/conftool/dbconfig/20240711-185805-arnaudb.json [18:58:37] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:59:17] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:46] (CR) Dzahn: [C:+1] "yes, lgtm. using the name that is not tied to a datacenter is nicest. 
currently "sites" and 'eqiad" is the same thing:" [puppet] - https://gerrit.wikimedia.org/r/1053756 (https://phabricator.wikimedia.org/T364367) (owner: Ryan Kemper) [19:03:24] (PS1) Scott French: commons-impact-analytics: bump image to v1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1053767 (https://phabricator.wikimedia.org/T369745) [19:03:39] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 124.53 ms [19:06:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage [19:08:10] (CR) Mforns: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1053767 (https://phabricator.wikimedia.org/T369745) (owner: Scott French) [19:08:44] (CR) Scott French: [C:+2] commons-impact-analytics: bump image to v1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1053767 (https://phabricator.wikimedia.org/T369745) (owner: Scott French) [19:09:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy2005.codfw.wmnet with reason: host reimage [19:09:33] (Merged) jenkins-bot: commons-impact-analytics: bump image to v1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1053767 (https://phabricator.wikimedia.org/T369745) (owner: Scott French) [19:11:52] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [19:12:04] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [19:12:44] (PS9) Bking: relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) [19:13:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P66332 and previous config saved to /var/cache/conftool/dbconfig/20240711-191313-arnaudb.json [19:16:33] (CR) Bking: "check 
experimental" [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) (owner: Bking) [19:21:27] (CR) BryanDavis: "Implementation LGTM. There is a trivial spelling error in the commit message." [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy) [19:22:17] (PS3) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) [19:22:26] (CR) Ahmon Dancy: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy) [19:23:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:24:22] (CR) BryanDavis: [C:+1] MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: Ahmon Dancy) [19:26:12] (PS1) Catrope: Graph: Fix JSON parse errors in Graph data source tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1053771 [19:26:44] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053771 (owner: Catrope) [19:28:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T367781)', diff saved to https://phabricator.wikimedia.org/P66333 and previous config saved to /var/cache/conftool/dbconfig/20240711-192820-arnaudb.json [19:28:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2171.codfw.wmnet 
with reason: Maintenance [19:28:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:28:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:28:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T367781)', diff saved to https://phabricator.wikimedia.org/P66334 and previous config saved to /var/cache/conftool/dbconfig/20240711-192842-arnaudb.json [19:31:12] (PS1) Dzahn: miscweb: add query-main and query-scholarly.wikidata.org to certs [puppet] - https://gerrit.wikimedia.org/r/1053773 (https://phabricator.wikimedia.org/T364367) [19:32:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T367781)', diff saved to https://phabricator.wikimedia.org/P66335 and previous config saved to /var/cache/conftool/dbconfig/20240711-193231-arnaudb.json [19:33:22] (PS1) Ahmon Dancy: Bump all buildkit image tags to wmf-v0.15.0-1 [puppet] - https://gerrit.wikimedia.org/r/1053774 (https://phabricator.wikimedia.org/T369862) [19:35:38] SRE, Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9974939 (RobH) Please add 'lshw' because I use it constantly (since i have root) for determining serials of any items installed in the host, and track hw failures. 
[19:35:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) (owner: Jdlrobson) [19:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:45:06] (PS10) Bking: relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) [19:47:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P66336 and previous config saved to /var/cache/conftool/dbconfig/20240711-194739-arnaudb.json [19:49:29] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) (owner: Bking) [19:59:27] (CR) Ryan Kemper: wdqs: add main and scholarly role assignments (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T2000). [20:00:05] arlolra, RoanKattouw, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:20] I can deploy [20:00:56] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053771 (owner: Catrope) [20:01:34] (Merged) jenkins-bot: Graph: Fix JSON parse errors in Graph data source tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1053771 (owner: Catrope) [20:01:52] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1053771|Graph: Fix JSON parse errors in Graph data source tracking]] [20:02:43] (Abandoned) Ottomata: DO NOT MERGE - example of what ESC would look like if we remove wgEnableEventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/1053750 (owner: Ottomata) [20:02:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P66337 and previous config saved to /var/cache/conftool/dbconfig/20240711-200246-arnaudb.json [20:03:48] * jan_drewniak o/ thanks RoanKattouw [20:06:54] I got some weird errors from the check_testservers step: [20:06:57] https://www.irccloud.com/pastebin/9sQ5DUCE/ [20:08:07] Hmm looks like all these errors are related to missing redirects from secure.wikimedia.org, and only 4 out of 132 requests failed, so I'm going to proceed [20:08:11] !log catrope@deploy1002 catrope: Backport for [[gerrit:1053771|Graph: Fix JSON parse errors in Graph data source tracking]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:20] (PS13) Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene) [20:10:35] !log catrope@deploy1002 catrope: Continuing with sync [20:11:25] (PS1) Ryan Kemper: wdqs restart envoy: support graph split aliases [cookbooks] - https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) [20:12:10] (PS2) Ryan Kemper: wdqs 
restart envoy: support graph split aliases [cookbooks] - https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) [20:12:48] (PS3) Ryan Kemper: wdqs restart envoy: support graph split aliases [cookbooks] - https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) [20:13:42] RoanKattouw: was spotted earlier too [20:13:52] Cc rzl ^ [20:15:25] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1053771|Graph: Fix JSON parse errors in Graph data source tracking]] (duration: 13m 32s) [20:16:37] (CR) CI reject: [V:-1] wdqs restart envoy: support graph split aliases [cookbooks] - https://gerrit.wikimedia.org/r/1053778 (https://phabricator.wikimedia.org/T364077) (owner: Ryan Kemper) [20:16:45] jan_drewniak: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1053708 is based on a bunch of other patches. Do you want all of those deployed, or just the one? [20:17:11] Specifically https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1050082/7 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1050083/4 [20:17:47] * jan_drewniak RoanKattouw: I guess they should be in reverse order, but just this one today [20:17:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T367781)', diff saved to https://phabricator.wikimedia.org/P66338 and previous config saved to /var/cache/conftool/dbconfig/20240711-201753-arnaudb.json [20:17:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [20:17:57] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [20:18:00] (PS3) Jdlrobson: Vector theme should default to day [mediawiki-config] - https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) [20:18:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2178.codfw.wmnet with reason: Maintenance [20:18:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T367781)', diff saved to https://phabricator.wikimedia.org/P66339 and previous config saved to /var/cache/conftool/dbconfig/20240711-201815-arnaudb.json [20:18:16] OK I've rebased it accordingly [20:18:28] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) (owner: Jdlrobson) [20:18:35] ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9975052 (wiki_willy) Hi @Eevans - I'll let @Jclark-ctr and @VRiley-WMF confirm your first two questions. From some of the feedback I've received though, it seems that the issue on both hosts s... [20:18:38] (CR) CI reject: [V:-1] Vector theme should default to day [mediawiki-config] - https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) (owner: Jdlrobson) [20:18:41] ... p.s. 
I keep hitting shift+enter because that's how I have Slack setup, it looks like I'm replying to the last post :P [20:19:06] (Merged) jenkins-bot: Vector theme should default to day [mediawiki-config] - https://gerrit.wikimedia.org/r/1053708 (https://phabricator.wikimedia.org/T369833) (owner: Jdlrobson) [20:19:53] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1053708|Vector theme should default to day (T369833)]] [20:19:57] T369833: [Bug] Dark mode turning on by default for logged-in users - https://phabricator.wikimedia.org/T369833 [20:20:21] (CR) Bking: [C:+2] relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) (owner: Bking) [20:20:42] (CR) Bking: [C:+2] "self-merging, as this only touches a test environment (Relforge)" [puppet] - https://gerrit.wikimedia.org/r/1053751 (https://phabricator.wikimedia.org/T368950) (owner: Bking) [20:22:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T367781)', diff saved to https://phabricator.wikimedia.org/P66340 and previous config saved to /var/cache/conftool/dbconfig/20240711-202204-arnaudb.json [20:28:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:29:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2005.codfw.wmnet with OS bookworm [20:29:09] ops-codfw, SRE, Data-Persistence, DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9975081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed: - dbproxy2005 (**P... 
[20:30:07] !log catrope@deploy1002 jdlrobson, catrope: Backport for [[gerrit:1053708|Vector theme should default to day (T369833)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:30:12] T369833: [Bug] Dark mode turning on by default for logged-in users - https://phabricator.wikimedia.org/T369833 [20:30:18] (PS1) JHathaway: postfix: remove unused hiera entry [puppet] - https://gerrit.wikimedia.org/r/1053781 (https://phabricator.wikimedia.org/T325407) [20:30:24] jan_drewniak: Ready for you to test on the debug servers [20:30:38] ok checking [20:31:31] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053781 (https://phabricator.wikimedia.org/T325407) (owner: JHathaway) [20:32:12] RoanKattouw: ok good to sync [20:32:17] !log catrope@deploy1002 jdlrobson, catrope: Continuing with sync [20:32:35] arlolra: Are you here for your backport deploy window? Your patch is next [20:32:40] yup [20:32:48] there isn't much to test [20:33:19] (CR) JHathaway: [C:+2] postfix: remove unused hiera entry [puppet] - https://gerrit.wikimedia.org/r/1053781 (https://phabricator.wikimedia.org/T325407) (owner: JHathaway) [20:37:02] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1053708|Vector theme should default to day (T369833)]] (duration: 17m 09s) [20:37:06] T369833: [Bug] Dark mode turning on by default for logged-in users - https://phabricator.wikimedia.org/T369833 [20:37:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P66341 and previous config saved to /var/cache/conftool/dbconfig/20240711-203711-arnaudb.json [20:50:55] (CR) Dzahn: [C:+2] admin: add Tyler Cipriani as group approver for contint-docker [puppet] - https://gerrit.wikimedia.org/r/1053052 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn) [20:52:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2178', diff saved to https://phabricator.wikimedia.org/P66342 and previous config saved to /var/cache/conftool/dbconfig/20240711-205218-arnaudb.json [20:53:04] RoanKattouw: let me know if you're finished deploying, and I'll start working on that httpbb error as discussed -- no rush though :) [20:53:14] (CR) Dzahn: [C:+2] admin: add Ariel to analytics-privatedata-users, add krb: present [puppet] - https://gerrit.wikimedia.org/r/1053360 (https://phabricator.wikimedia.org/T368911) (owner: Dzahn) [20:55:31] rzl: One more patch sorry [20:56:16] cool, no worries [20:57:45] (PS3) Arlolra: Change Linter log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1051487 [20:58:05] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051487 (owner: Arlolra) [20:58:45] (Merged) jenkins-bot: Change Linter log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/1051487 (owner: Arlolra) [20:59:00] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1051487|Change Linter log level to info]] [20:59:02] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Request for Kerb credentials for Ariel Glenn - https://phabricator.wikimedia.org/T368911#9975173 (Dzahn) @ArielGlenn You are now in the additional group and I created a Kerberos principal. You should have received an email with instru... 
[20:59:33] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Request for Kerb credentials for Ariel Glenn - https://phabricator.wikimedia.org/T368911#9975174 (Dzahn) a:ArielGlenn [21:03:23] (PS3) JHathaway: verp_bounce_post_url: Switch to mw-api-int [puppet] - https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [21:03:29] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [21:04:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1096-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:05:26] !log catrope@deploy1002 arlolra, catrope: Backport for [[gerrit:1051487|Change Linter log level to info]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:43] arlolra: Ready for testing on the debug servers [21:06:43] (PS2) Dbrant: Enable account vanishing in CentralAuth. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1053373 (https://phabricator.wikimedia.org/T369141) [21:07:23] (PS1) Dzahn: stewards: rename userdb_gitlab_token variable [puppet] - https://gerrit.wikimedia.org/r/1053783 (https://phabricator.wikimedia.org/T369430) [21:07:24] thanks, you can proceed [21:07:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T367781)', diff saved to https://phabricator.wikimedia.org/P66343 and previous config saved to /var/cache/conftool/dbconfig/20240711-210725-arnaudb.json [21:07:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance [21:07:30] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:07:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance [21:07:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T367781)', diff saved to https://phabricator.wikimedia.org/P66344 and previous config saved to /var/cache/conftool/dbconfig/20240711-210747-arnaudb.json [21:08:13] (CR) JHathaway: [C:+1] "pushed an updated patch after removing a bogus hiera entry in I5065621d4d542dc5d764010dfc87e2ad720bdedf" [puppet] - https://gerrit.wikimedia.org/r/1053650 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert) [21:08:13] (CR) Dzahn: [C:+2] stewards: rename userdb_gitlab_token variable [puppet] - https://gerrit.wikimedia.org/r/1053783 (https://phabricator.wikimedia.org/T369430) (owner: Dzahn) [21:08:31] (PS1) Dbrant: Enable account vanishing in CentralAuth (labs). 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) [21:08:35] RoanKattouw: ^ [21:08:45] !log catrope@deploy1002 arlolra, catrope: Continuing with sync [21:09:21] (03CR) 10Dzahn: [C:03+2] stewards: clone user DB repo from GitLab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: 10Urbanecm) [21:10:10] (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1053783 (https://phabricator.wikimedia.org/T369430) (owner: 10Dzahn) [21:11:38] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [21:11:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T367781)', diff saved to https://phabricator.wikimedia.org/P66345 and previous config saved to /var/cache/conftool/dbconfig/20240711-211138-arnaudb.json [21:11:41] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:41] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1051487|Change Linter log level to info]] (duration: 14m 40s) [21:14:16] thanks RoanKattouw [21:14:23] rzl: OK now I'm done [21:14:30] (03PS1) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:14:37] RoanKattouw: rad, thanks! [21:17:37] (03CR) 10JHathaway: [C:03+1] R:idp New CAS 7 hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/1049761 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [21:19:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [21:23:05] (03PS1) 10Dzahn: scap: remove scandium from dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1053791 (https://phabricator.wikimedia.org/T363402) [21:25:18] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9975238 (10Eevans) >>! In T368766#9962325, @VRiley-WMF wrote: > @Eevans It did. I was planning on swapping the unit back. Is there a good time to proceed with this? Sorry for taking so long to get back to you. We ca... [21:25:43] (03PS2) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:26:13] (03PS2) 10Dzahn: miscweb: add query-main and query-scholarly.wikidata.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1053773 (https://phabricator.wikimedia.org/T364367) [21:26:46] (03CR) 10Dzahn: [C:03+2] miscweb: add query-main and query-scholarly.wikidata.org to certs [puppet] - 10https://gerrit.wikimedia.org/r/1053773 (https://phabricator.wikimedia.org/T364367) (owner: 10Dzahn) [21:26:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P66346 and previous config saved to /var/cache/conftool/dbconfig/20240711-212646-arnaudb.json [21:26:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [21:27:58] (03CR) 10Dzahn: [C:03+2] "the nice part: we don't have to do all these steps anymore with manually running cergen, committing in 2 repos..etc..since 
cergen was repl" [puppet] - 10https://gerrit.wikimedia.org/r/1053773 (https://phabricator.wikimedia.org/T364367) (owner: 10Dzahn) [21:30:15] (03CR) 10Dzahn: [C:03+2] "new sites exist on the cert now:" [puppet] - 10https://gerrit.wikimedia.org/r/1053773 (https://phabricator.wikimedia.org/T364367) (owner: 10Dzahn) [21:32:53] (03PS3) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:34:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [21:34:39] RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1096-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:38:15] !log upgrading exim4 to 4.94.2-7+deb11u3 [21:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:26] (03PS4) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:41:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P66347 and previous config saved to /var/cache/conftool/dbconfig/20240711-214153-arnaudb.json [21:43:13] (03PS5) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:43:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [21:44:51] !log 
rzl@mwdebug1002:~$ sudo apache2ctl restart [21:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:18] (03PS6) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:50:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [21:50:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [21:52:58] (03CR) 10JHathaway: [C:03+1] profile::tcpircbot: allow inbound conn from puppetserver nodes [puppet] - 10https://gerrit.wikimedia.org/r/1053616 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [21:54:45] (03CR) 10JHathaway: [C:03+1] profile::kerberos::kadminserver: allow more nodes in rsync [puppet] - 10https://gerrit.wikimedia.org/r/1053619 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [21:55:20] (03PS7) 10Bking: relforge: Attempt to use envoyproxy instead of nginx for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1053789 (https://phabricator.wikimedia.org/T368950) [21:57:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T367781)', diff saved to https://phabricator.wikimedia.org/P66348 and previous config saved to /var/cache/conftool/dbconfig/20240711-215700-arnaudb.json [21:57:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance [21:57:04] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:57:06] !log systemctl restart apache2 on mwdebug1002, mwdebug2001, mwdebug2002 for https://gerrit.wikimedia.org/r/1052128 [21:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:15] !log arnaudb@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance [21:59:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance [21:59:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance [21:59:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T367781)', diff saved to https://phabricator.wikimedia.org/P66349 and previous config saved to /var/cache/conftool/dbconfig/20240711-215921-arnaudb.json [22:03:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T367781)', diff saved to https://phabricator.wikimedia.org/P66350 and previous config saved to /var/cache/conftool/dbconfig/20240711-220315-arnaudb.json [22:03:19] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:12:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9975389 (10Papaul) [22:18:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P66351 and previous config saved to /var/cache/conftool/dbconfig/20240711-221822-arnaudb.json [22:22:59] (03PS1) 10Scott French: mediawiki: update siteinfo URL to use mw-api-int [software/spicerack] - 10https://gerrit.wikimedia.org/r/1053801 (https://phabricator.wikimedia.org/T367949) [22:23:29] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:26:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPV6 for dbproxy200[5-8] - pt1979@cumin2002" [22:26:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9975437 (10Papaul) @Jhancock.wm i 
think you missed @Marostegui comment about not setting IPV6 for those hosts. I fixed it. [22:27:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove IPV6 for dbproxy200[5-8] - pt1979@cumin2002" [22:27:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:33:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P66352 and previous config saved to /var/cache/conftool/dbconfig/20240711-223329-arnaudb.json [22:33:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9975456 (10Papaul) @Marostegui like we discussed this morning, I was able to install dbproxy2005 using the workaround of using the 1G NIC for the install and switch to 10G aft... [22:43:00] (03PS1) 10Scott French: cxserver: update outdated comments on chart values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053805 (https://phabricator.wikimedia.org/T367949) [22:43:01] (03PS1) 10Scott French: mobileapps: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053806 (https://phabricator.wikimedia.org/T367949) [22:43:03] (03PS1) 10Scott French: push-notifications: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053807 (https://phabricator.wikimedia.org/T367949) [22:43:04] (03PS1) 10Scott French: wikifeeds: update references to deprecated services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053808 (https://phabricator.wikimedia.org/T367949) [22:43:06] (03PS1) 10Scott French: kserve-inference: update references to deprecated services in fixtures [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) [22:48:37] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2211 (T367781)', diff saved to https://phabricator.wikimedia.org/P66353 and previous config saved to /var/cache/conftool/dbconfig/20240711-224836-arnaudb.json [22:48:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance [22:48:40] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:48:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance [22:48:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T367781)', diff saved to https://phabricator.wikimedia.org/P66354 and previous config saved to /var/cache/conftool/dbconfig/20240711-224858-arnaudb.json [22:51:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T367781)', diff saved to https://phabricator.wikimedia.org/P66355 and previous config saved to /var/cache/conftool/dbconfig/20240711-225150-arnaudb.json [22:52:43] (03CR) 10Scott French: "This and the next two patches in the chain change the chart default values to the endpoints currently used in production (albeit via the s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053806 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [22:57:22] jouncebot: nowandnext [22:57:22] No deployments scheduled for the next 7 hour(s) and 2 minute(s) [22:57:22] In 7 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240712T0600) [22:57:25] (03CR) 10Scott French: "Hi Luca - Would you be able to review this?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1053809 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [22:57:57] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053810 [22:57:57] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053810 (owner: 10Zabe) [22:58:35] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053810 (owner: 10Zabe) [22:59:02] !log zabe@deploy1002 Started scap sync-world: update interwiki cache [22:59:17] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:00:19] (03PS2) 10Dbrant: Enable account vanishing in CentralAuth (labs). [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053784 (https://phabricator.wikimedia.org/T369141) [23:01:14] (03PS3) 10Dbrant: Enable account vanishing in CentralAuth. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053373 (https://phabricator.wikimedia.org/T369141) [23:06:39] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 07m 37s) [23:06:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P66356 and previous config saved to /var/cache/conftool/dbconfig/20240711-230657-arnaudb.json [23:14:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [23:15:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T367856)', diff saved to https://phabricator.wikimedia.org/P66357 and previous config saved to /var/cache/conftool/dbconfig/20240711-231547-marostegui.json [23:15:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:21:11] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2001 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:21:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:22:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P66358 and previous config saved to /var/cache/conftool/dbconfig/20240711-232205-arnaudb.json [23:22:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [23:22:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T367856)', diff saved to https://phabricator.wikimedia.org/P66359 and previous config saved to /var/cache/conftool/dbconfig/20240711-232218-marostegui.json [23:22:21] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:25:03] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd2001 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 658, active_shards: 1307, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 216, delayed_unassigned [23:25:03] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.81746552856205 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:26:49] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:29:39] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:30:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P66360 and previous config saved to /var/cache/conftool/dbconfig/20240711-233054-marostegui.json [23:33:49] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:37:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T367781)', diff saved to https://phabricator.wikimedia.org/P66361 and previous config saved to /var/cache/conftool/dbconfig/20240711-233712-arnaudb.json [23:37:16] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [23:38:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1053813 [23:38:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1053813 (owner: 10TrainBranchBot) [23:39:58] (03CR) 10Jeena Huneidi: "Update_version is used by pipelinelib to update the chart version if a repo's .pipeline/config file specifies `promote: chart: 'chartname'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [23:46:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P66362 and previous config saved to /var/cache/conftool/dbconfig/20240711-234602-marostegui.json [23:48:05] (03CR) 10Jeena Huneidi: "But it might be worth checking with the repo contributors whether they actually use or want to use the feature, because it doesn't 
seem li" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052964 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [23:48:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure