[00:16:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:51] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:09:45] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:45] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:35:11] (03PS2) 10KartikMistry: Update cxserver to 2023-03-09-061555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/895904 (https://phabricator.wikimedia.org/T331097) [03:44:14] (03PS1) 10KartikMistry: Enable Section Translation on 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897404 (https://phabricator.wikimedia.org/T327102) [03:46:17] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:48:01] (03PS2) 10KartikMistry: testwiki: Enable Section Translation on 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897404 (https://phabricator.wikimedia.org/T327102) [03:56:45] * kart_ updating cxserver; only minor DB updates [03:57:41] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-03-09-061555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/895904 
(https://phabricator.wikimedia.org/T331097) (owner: 10KartikMistry) [04:02:29] (03Merged) 10jenkins-bot: Update cxserver to 2023-03-09-061555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/895904 (https://phabricator.wikimedia.org/T331097) (owner: 10KartikMistry) [04:12:01] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [04:12:30] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [04:17:28] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [04:18:24] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [04:19:01] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [04:19:51] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [04:37:06] !log Updated cxserver to 2023-03-09-061555-production (T331097, T327102, T326541) [04:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:14] T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541 [04:37:14] T327102: Enable Content and Section translation on 10 Wikipedias - https://phabricator.wikimedia.org/T327102 [04:37:15] T331097: Update section title mapping database - https://phabricator.wikimedia.org/T331097 [05:29:03] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.339 second response time https://wikitech.wikimedia.org/wiki/Swift [05:30:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:13:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: 
Maintenance [06:15:33] 10ops-eqiad: mr1-eqiad down - https://phabricator.wikimedia.org/T331839 (10ayounsi) p:05Triage→03High [06:16:27] !log Deploy schema change on s3 codfw dbmaint T329684 [06:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:32] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [06:18:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:18:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:19:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 138886 [06:21:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 138886 [06:22:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [06:22:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [06:22:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T329260)', diff saved to https://phabricator.wikimedia.org/P45732 and previous config saved to /var/cache/conftool/dbconfig/20230313-062244-marostegui.json [06:22:50] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [06:24:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:25:01] (03PS1) 10Marostegui: site.pp: Add dbproxy10[22-27] insetup [puppet] - 10https://gerrit.wikimedia.org/r/897446 
(https://phabricator.wikimedia.org/T326346) [06:25:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 29357 [06:25:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 29357 [06:25:43] (03CR) 10Marostegui: [C: 03+2] site.pp: Add dbproxy10[22-27] insetup [puppet] - 10https://gerrit.wikimedia.org/r/897446 (https://phabricator.wikimedia.org/T326346) (owner: 10Marostegui) [06:27:33] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 34549 [06:27:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34549 [06:29:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8966 [06:29:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T329260)', diff saved to https://phabricator.wikimedia.org/P45733 and previous config saved to /var/cache/conftool/dbconfig/20230313-062942-marostegui.json [06:29:48] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [06:29:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8966 [06:31:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9902 [06:31:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9902 [06:33:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15830 [06:33:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15830 [06:34:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9507 [06:34:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9507 [06:35:11] !log 
ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9873 [06:35:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9873 [06:35:57] (03PS1) 10Marostegui: dbproxy10[22-27]: Add hosts [puppet] - 10https://gerrit.wikimedia.org/r/897454 (https://phabricator.wikimedia.org/T326346) [06:36:53] (03CR) 10Marostegui: [C: 03+2] dbproxy10[22-27]: Add hosts [puppet] - 10https://gerrit.wikimedia.org/r/897454 (https://phabricator.wikimedia.org/T326346) (owner: 10Marostegui) [06:40:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, and 2 others: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) All the puppet patches needed are done. [06:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P45734 and previous config saved to /var/cache/conftool/dbconfig/20230313-064448-marostegui.json [06:52:13] !log Remove pagetriage_log from testwiki and test2wiki T328309 [06:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:18] T328309: Remove pagetriage_log table - https://phabricator.wikimedia.org/T328309 [06:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P45735 and previous config saved to /var/cache/conftool/dbconfig/20230313-065954-marostegui.json [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230312T0800) [07:00:04] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:04:39] (03PS1) 10Marostegui: filtered_tables.txt: Remove table [puppet] - 10https://gerrit.wikimedia.org/r/897571 (https://phabricator.wikimedia.org/T328309) [07:12:04] Ah. DST confusion :) [07:13:03] I'll go ahead with simple config change. And, I'm only one in the window, so far. [07:14:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897404 (https://phabricator.wikimedia.org/T327102) (owner: 10KartikMistry) [07:14:55] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation on 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897404 (https://phabricator.wikimedia.org/T327102) (owner: 10KartikMistry) [07:15:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T329260)', diff saved to https://phabricator.wikimedia.org/P45736 and previous config saved to /var/cache/conftool/dbconfig/20230313-071501-marostegui.json [07:15:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [07:15:07] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:15:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [07:15:18] !log kartik@deploy2002 Started scap: Backport for [[gerrit:897404|testwiki: Enable Section Translation on 11 Wikipedias (T327102 T326541)]] [07:15:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T329260)', diff saved to https://phabricator.wikimedia.org/P45737 and previous config saved to /var/cache/conftool/dbconfig/20230313-071522-marostegui.json [07:15:26] T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541 
[07:15:26] T327102: Enable Content and Section translation on 10 Wikipedias - https://phabricator.wikimedia.org/T327102 [07:19:04] (03PS1) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [07:19:29] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [07:20:12] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [07:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T329260)', diff saved to https://phabricator.wikimedia.org/P45738 and previous config saved to /var/cache/conftool/dbconfig/20230313-072219-marostegui.json [07:22:25] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:25:04] !log kartik@deploy2002 kartik: Backport for [[gerrit:897404|testwiki: Enable Section Translation on 11 Wikipedias (T327102 T326541)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [07:25:10] T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541 [07:25:11] T327102: Enable Content and Section translation on 10 Wikipedias - https://phabricator.wikimedia.org/T327102 [07:32:23] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:897404|testwiki: Enable Section Translation on 11 Wikipedias (T327102 T326541)]] (duration: 17m 04s) [07:32:29] T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541 [07:32:30] T327102: Enable Content and Section translation on 10 Wikipedias - https://phabricator.wikimedia.org/T327102 [07:34:17] 
(CR) Santhosh: WIP: Add new self hosted machinetranslation service (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry)
[07:36:04] (CR) Zabe: [C: +2] use core Renameuser classes [extensions/LiquidThreads] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897188 (https://phabricator.wikimedia.org/T27482) (owner: Zabe)
[07:36:06] (CR) Zabe: [C: +2] UserRenameHandler: Use core RenameUser classes [extensions/AbuseFilter] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897187 (https://phabricator.wikimedia.org/T27482) (owner: Zabe)
[07:37:17] !log Remove pagetriage_log from enwiki T328309
[07:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:22] T328309: Remove pagetriage_log table - https://phabricator.wikimedia.org/T328309
[07:37:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P45739 and previous config saved to /var/cache/conftool/dbconfig/20230313-073725-marostegui.json
[07:37:46] (CR) Marostegui: [C: +2] filtered_tables.txt: Remove table [puppet] - https://gerrit.wikimedia.org/r/897571 (https://phabricator.wikimedia.org/T328309) (owner: Marostegui)
[07:38:06] (Merged) jenkins-bot: use core Renameuser classes [extensions/LiquidThreads] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897188 (https://phabricator.wikimedia.org/T27482) (owner: Zabe)
[07:39:09] I'm done with deployment.
[07:39:43] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:39:48] thanks
[07:40:57] (PS2) KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[07:41:32] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy2002 using scap backport" [extensions/AbuseFilter] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897187 (https://phabricator.wikimedia.org/T27482) (owner: Zabe)
[07:43:22] (CR) CI reject: [V: -1] WIP: Add new self hosted machinetranslation service [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry)
[07:46:55] (Abandoned) Slyngshede: R:idp_test create development service [puppet] - https://gerrit.wikimedia.org/r/896109 (owner: Slyngshede)
[07:46:57] (PS1) Marostegui: db2160: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/897768 (https://phabricator.wikimedia.org/T322294)
[07:48:47] (CR) Marostegui: [C: +2] db2160: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/897768 (https://phabricator.wikimedia.org/T322294) (owner: Marostegui)
[07:50:57] (PS2) Muehlenhoff: Apply url_downloader role to urldownloader2004 [puppet] - https://gerrit.wikimedia.org/r/896325 (https://phabricator.wikimedia.org/T329945)
[07:51:30] (Merged) jenkins-bot: UserRenameHandler: Use core RenameUser classes [extensions/AbuseFilter] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897187 (https://phabricator.wikimedia.org/T27482) (owner: Zabe)
[07:51:47] !log zabe@deploy2002 Started scap: Backport for [[gerrit:897188|use core Renameuser classes (T27482)]], [[gerrit:897187|UserRenameHandler: Use core RenameUser classes (T27482)]]
[07:51:53] T27482: Merge RenameUser into core - https://phabricator.wikimedia.org/T27482
[07:52:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P45740 and previous config saved to /var/cache/conftool/dbconfig/20230313-075232-marostegui.json
[07:53:01] (CR) Nicolas Fraison: [C: +2] spark: Authorize driver and executor pods to communicate [deployment-charts] - https://gerrit.wikimedia.org/r/896303 (https://phabricator.wikimedia.org/T318924) (owner: Nicolas Fraison)
[07:53:16] !log zabe@deploy2002 zabe: Backport for [[gerrit:897188|use core Renameuser classes (T27482)]], [[gerrit:897187|UserRenameHandler: Use core RenameUser classes (T27482)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:58:50] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:897188|use core Renameuser classes (T27482)]], [[gerrit:897187|UserRenameHandler: Use core RenameUser classes (T27482)]] (duration: 07m 02s)
[07:58:55] T27482: Merge RenameUser into core - https://phabricator.wikimedia.org/T27482
[08:00:59] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:03] !log installing curl security updates
[08:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:06] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:05:22] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:07:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T329260)', diff saved to https://phabricator.wikimedia.org/P45741 and previous config saved to /var/cache/conftool/dbconfig/20230313-080738-marostegui.json [08:07:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:07:44] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:07:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T329260)', diff saved to https://phabricator.wikimedia.org/P45742 and previous config saved to /var/cache/conftool/dbconfig/20230313-080759-marostegui.json [08:13:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T329260)', diff saved to https://phabricator.wikimedia.org/P45743 and previous config saved to /var/cache/conftool/dbconfig/20230313-081357-marostegui.json [08:14:03] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:29:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P45744 and previous config saved to /var/cache/conftool/dbconfig/20230313-082903-marostegui.json [08:34:01] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:38:54] (03CR) 10Stevemunene: [C: 03+1] hadoop-hdfs: Add alert on FSImage age [alerts] - 10https://gerrit.wikimedia.org/r/896049 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [08:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:40:52] .13
[08:40:53] uff
[08:41:12] (CR) Filippo Giunchedi: [C: +1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[08:44:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P45745 and previous config saved to /var/cache/conftool/dbconfig/20230313-084409-marostegui.json
[08:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:49:23] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Tried to stop all the consumers on centrallog nodes, delete the consumer group and restart all. Traffic changed and dro...
[08:52:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:57:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:58:01] (03CR) 10Ayounsi: "Post merge comment with a suggestion on making in more sustainable." [homer/public] - 10https://gerrit.wikimedia.org/r/896200 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [08:59:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T329260)', diff saved to https://phabricator.wikimedia.org/P45746 and previous config saved to /var/cache/conftool/dbconfig/20230313-085916-marostegui.json [08:59:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [08:59:22] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:59:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [08:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T329260)', diff saved to https://phabricator.wikimedia.org/P45747 and previous config saved to /var/cache/conftool/dbconfig/20230313-085937-marostegui.json [09:02:51] (03CR) 10Elukey: [C: 03+1] Move default kubernetes version to 1.23 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:03:20] (03CR) 10Muehlenhoff: [C: 03+2] Apply url_downloader role 
to urldownloader2004 [puppet] - 10https://gerrit.wikimedia.org/r/896325 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [09:03:51] (03CR) 10Volans: [C: 03+1] "LGTM. Do you plan to restore the old name if the test succeed?" [puppet] - 10https://gerrit.wikimedia.org/r/897063 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [09:05:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T329260)', diff saved to https://phabricator.wikimedia.org/P45748 and previous config saved to /var/cache/conftool/dbconfig/20230313-090539-marostegui.json [09:05:45] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:06:24] (03PS3) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [09:07:15] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [09:08:05] (03CR) 10Elukey: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:08:25] (03CR) 10KartikMistry: WIP: Add new self hosted machinetranslation service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [09:08:34] (03CR) 10Elukey: "Looks good, can you run pcc to see what is the result?" 
[puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:10:06] (03PS4) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [09:10:15] (03CR) 10Elukey: [C: 03+2] profile::benthos: change kafka consumer group name for webrequest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897063 (https://phabricator.wikimedia.org/T331801) (owner: 10Elukey) [09:11:16] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [09:16:53] !log installing python-werkzeug security updates [09:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:28] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment: fix helmfile-defaults... [deployment-charts] - 10https://gerrit.wikimedia.org/r/896365 (owner: 10DCausse) [09:18:32] (03CR) 10Ottomata: [C: 03+2] "TY" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896365 (owner: 10DCausse) [09:19:17] (03CR) 10Ottomata: [C: 03+2] page-content-change: set flink total memory size. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896423 (owner: 10Gmodena) [09:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P45749 and previous config saved to /var/cache/conftool/dbconfig/20230313-092045-marostegui.json [09:21:05] (03CR) 10JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:23:08] (03Merged) 10jenkins-bot: mediawiki-page-content-change-enrichment: fix helmfile-defaults... [deployment-charts] - 10https://gerrit.wikimedia.org/r/896365 (owner: 10DCausse) [09:24:27] (03Merged) 10jenkins-bot: page-content-change: set flink total memory size. [deployment-charts] - 10https://gerrit.wikimedia.org/r/896423 (owner: 10Gmodena) [09:24:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:27:40] 10SRE, 10SRE Observability, 10Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10elukey) To keep archives happy - in order to be able to delete the consumer group I had to add the following: ` kafka acls --a... 
[09:32:40] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.184 second response time https://wikitech.wikimedia.org/wiki/Swift [09:33:04] (03PS1) 10Vgutierrez: hiera: Enable haproxy hardening globally for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/897803 (https://phabricator.wikimedia.org/T323944) [09:33:36] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Swift [09:33:37] (03PS1) 10Zabe: Revert "Revert "Unload RenameUser, now part of core: Part I of II"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897804 (https://phabricator.wikimedia.org/T331685) [09:35:23] (03CR) 10Zabe: [C: 03+2] Revert "Revert "Unload RenameUser, now part of core: Part I of II"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897804 (https://phabricator.wikimedia.org/T331685) (owner: 10Zabe) [09:35:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P45750 and previous config saved to /var/cache/conftool/dbconfig/20230313-093552-marostegui.json [09:36:09] (03Merged) 10jenkins-bot: Revert "Revert "Unload RenameUser, now part of core: Part I of II"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897804 (https://phabricator.wikimedia.org/T331685) (owner: 10Zabe) [09:36:29] !log zabe@deploy2002 Started scap: Backport for [[gerrit:897804|Revert "Revert "Unload RenameUser, now part of core: Part I of II"" (T331685)]] [09:36:34] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [09:38:02] !log zabe@deploy2002 zabe: Backport for [[gerrit:897804|Revert "Revert "Unload RenameUser, now part of core: Part I of II"" (T331685)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2002.codfw.wmnet [09:40:43] !log pcc-worker1001:~# rm -r /srv/jenkins/puppet-compiler/40079 /srv/jenkins/puppet-compiler/38943 - / back to 68% usage [09:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:49] (03CR) 10Ottomata: rdf-streaming-updater: add a "wcqs" release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [09:43:21] (03PS3) 10Zabe: Drop loading of former extension Renameuser's i18n strings [Re-apply] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896037 (owner: 10Jforrester) [09:43:25] (03CR) 10Zabe: [C: 03+2] Drop loading of former extension Renameuser's i18n strings [Re-apply] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896037 (owner: 10Jforrester) [09:44:08] (03Merged) 10jenkins-bot: Drop loading of former extension Renameuser's i18n strings [Re-apply] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896037 (owner: 10Jforrester) [09:44:22] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:897804|Revert "Revert "Unload RenameUser, now part of core: Part I of II"" (T331685)]] (duration: 07m 52s) [09:44:26] T331685: Error: Interface 'MediaWiki\Extension\Renameuser\Hook\RenameUserSQLHook' not found - https://phabricator.wikimedia.org/T331685 [09:44:38] !log zabe@deploy2002 Started scap: Backport for [[gerrit:896037|Drop loading of former extension Renameuser's i18n strings [Re-apply]]] [09:45:37] !log pcc-worker1002:~# rm -r /srv/jenkins/puppet-compiler/40078 - / back to 47% usage [09:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:53] !log zabe@deploy2002 jforrester and zabe: Backport for [[gerrit:896037|Drop loading of former extension Renameuser's i18n strings [Re-apply]]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [09:47:25] (03CR) 10Ottomata: "I have a feeling there will be some issues with running two apps 
in the same namespace...but I even after looking I can't recall why. So" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [09:48:28] !log pcc-worker1003:~# rm -r /srv/jenkins/puppet-compiler/40076 - / back to 70% [09:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40085/console" [puppet] - 10https://gerrit.wikimedia.org/r/897803 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [09:49:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40086/console" [puppet] - 10https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T329260)', diff saved to https://phabricator.wikimedia.org/P45751 and previous config saved to /var/cache/conftool/dbconfig/20230313-095058-marostegui.json [09:51:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [09:51:03] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:51:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [09:51:17] (03CR) 10JMeybohm: [V: 03+1] cfssl/cert: Allow to absent cert resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T329260)', diff saved to https://phabricator.wikimedia.org/P45752 and previous config saved to /var/cache/conftool/dbconfig/20230313-095119-marostegui.json [09:52:18] !log zabe@deploy2002 
Finished scap: Backport for [[gerrit:896037|Drop loading of former extension Renameuser's i18n strings [Re-apply]]] (duration: 07m 40s) [09:53:07] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable haproxy hardening globally for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/897803 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [09:53:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40087/console" [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:53:32] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40088/console" [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:53:35] (03CR) 10Btullis: [C: 03+1] hadoop-hdfs: Add alert on FSImage age (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/896049 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [09:53:47] (03CR) 10Volans: [C: 03+2] Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [09:54:00] (03CR) 10Btullis: [C: 03+1] hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [09:54:17] (03CR) 10Btullis: [C: 03+1] hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [09:55:04] !log Enable haproxy hardening in cp hosts globally - T323944 [09:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:10] T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 [09:57:28] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T329260)', diff saved to https://phabricator.wikimedia.org/P45753 and previous config saved to /var/cache/conftool/dbconfig/20230313-095728-marostegui.json [09:57:33] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:57:41] (03Merged) 10jenkins-bot: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [09:58:49] (03PS1) 10Superpes15: Revert "Add a temporary logo to trwikiquote (Vector legacy + Vector 2022)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) [09:59:03] (03PS2) 10Superpes15: Revert "Add a temporary logo to trwikiquote (Vector legacy + Vector 2022)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) [09:59:09] (03PS1) 10Slavina Stefanova: chagelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [09:59:22] (03CR) 10Nicolas Fraison: hadoop-hdfs: Add alert on FSImage age (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/896049 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [09:59:26] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop-hdfs: Add alert on FSImage age [alerts] - 10https://gerrit.wikimedia.org/r/896049 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [09:59:48] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [09:59:52] (03CR) 10CI reject: [V: 04-1] chagelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 (owner: 10Slavina Stefanova) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1000) [10:00:12] (03PS2) 10Slavina Stefanova: d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [10:00:55] (03CR) 10CI reject: [V: 04-1] d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 (owner: 10Slavina Stefanova) [10:02:15] (03PS3) 10Slavina Stefanova: d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [10:02:20] !log imported dh-php 0.35+wmf1+buster1+icu67u1 T329491 [10:02:21] (03CR) 10Ayounsi: "One post merge comment." [homer/public] - 10https://gerrit.wikimedia.org/r/896331 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [10:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:24] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [10:03:08] (03PS4) 10Slavina Stefanova: d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [10:03:43] (03PS5) 10Slavina Stefanova: d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [10:04:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:03] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:05:07] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:05:13] (03PS5) 10Clément Goubert: Exclude traindev from tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 [10:05:46] (03PS6) 
10Clément Goubert: Exclude traindev from tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 [10:06:24] (03PS6) 10Slavina Stefanova: d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [10:06:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 668 [10:06:49] (03PS3) 10Superpes15: Revert "Add a temporary logo to trwikiquote (Vector legacy + Vector 2022)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) [10:07:19] (03PS4) 10Superpes15: [trwikiquote] Removing temporary logo (Vector legacy + Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) [10:07:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 668 [10:07:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38082 [10:08:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38082 [10:08:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45558 [10:08:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45558 [10:09:27] (03PS5) 10Superpes15: [trwikiquote] Reverting temporary logo (Vector legacy + Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) [10:09:34] PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: benthos@webrequest_live.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6663 [10:09:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6663 [10:10:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46632 [10:10:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46632 [10:10:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38193 [10:10:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38193 [10:10:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55701 [10:11:40] (03PS4) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [10:11:46] (03CR) 10Majavah: [C: 04-1] "Please use `gbp dch` instead of hand-crafting the message (and sign it as yourself) https://wikitech.wikimedia.org/wiki/Portal:Toolforge/A" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 (owner: 10Slavina Stefanova) [10:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P45754 and previous config saved to /var/cache/conftool/dbconfig/20230313-101234-marostegui.json [10:12:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55701 [10:12:41] (03PS2) 10Clément Goubert: switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [10:12:50] (03PS3) 10Clément Goubert: switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 
(https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [10:13:07] (03PS5) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [10:13:22] RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:33] (03PS1) 10Filippo Giunchedi: hieradata: remove authdns hosts from blackbox_smoke_hosts [puppet] - 10https://gerrit.wikimedia.org/r/897833 (https://phabricator.wikimedia.org/T330670) [10:14:45] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [10:15:47] (03CR) 10Slyngshede: "To avoid very large CR my suggestion is cutting the request for permissions of here. 
We then implement the approval process in a separate " [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 (owner: 10Slyngshede) [10:18:54] (03CR) 10Clément Goubert: [C: 03+1] switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [10:19:45] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:52] (03CR) 10Clément Goubert: [C: 03+1] mwdebug_deploy: clean up physical resources from target hosts [puppet] - 10https://gerrit.wikimedia.org/r/896355 (owner: 10Jaime Nuche) [10:20:03] (03CR) 10Clément Goubert: [C: 03+1] mwdebug_deploy: remove configuration [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [10:20:35] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/897833 (https://phabricator.wikimedia.org/T330670) (owner: 10Filippo Giunchedi) [10:20:53] (03CR) 10Slavina Stefanova: d/changelog: prepare for 0.92 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 (owner: 10Slavina Stefanova) [10:21:51] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove authdns hosts from blackbox_smoke_hosts [puppet] - 10https://gerrit.wikimedia.org/r/897833 (https://phabricator.wikimedia.org/T330670) (owner: 10Filippo Giunchedi) [10:22:06] (03PS6) 10Slyngshede: Read systems and approval rules from YAML file. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [10:26:43] !log imported php-defaults 7.4+76+wmf1~buster2+icu67u1 T329491 [10:26:45] (03PS1) 10Superpes15: [trwiki] Removing the temporary logo, previously added, and already reverted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897837 (https://phabricator.wikimedia.org/T329047) [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:49] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [10:27:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P45755 and previous config saved to /var/cache/conftool/dbconfig/20230313-102740-marostegui.json [10:30:22] (03CR) 10Hnowlan: [C: 03+1] changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [10:30:48] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:30:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:32:16] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:32:19] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [10:34:58] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:38:03] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10MatthewVernon) [10:38:13] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10MatthewVernon) p:05Triage→03High [10:38:25] !log imported php-pcov 1.0.6-4+wmf1~buster1+icu67u1 T329491 [10:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:30] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [10:42:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T329260)', diff saved to https://phabricator.wikimedia.org/P45756 and previous config saved to /var/cache/conftool/dbconfig/20230313-104246-marostegui.json [10:42:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [10:42:53] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:43:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [10:43:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [10:43:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [10:43:23] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T329260)', diff saved to https://phabricator.wikimedia.org/P45757 and previous config saved to /var/cache/conftool/dbconfig/20230313-104322-marostegui.json [10:44:20] (03CR) 10Ayounsi: [C: 03+1] "LGTM too, not familiar with the syntax, but should we log the matching packets instead of count them?" [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez) [10:49:49] (ProbeDown) firing: (34) Service rpki2002:443 has failed probes (http_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#rpki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:51] (03PS3) 10Clément Goubert: Assign mediawiki roles to mw2420-mw2451 [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) [10:55:19] !log imported php-imagick 3.4.4+php8.0+3.4.4-2+deb11u2+wmf1+buster1+icu67u1 T329491 [10:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:25] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [10:57:28] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:31] (03PS4) 10Elukey: api-gateway: allow to configure prefixes without JWT requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) [10:57:33] (03PS1) 10Elukey: services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) [10:57:58] (03CR) 10Elukey: 
api-gateway: allow to configure prefixes without JWT requirements (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [10:58:09] I'm investigating the rpki2002 probes alerts above, not sure what's going on yet [11:02:28] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:10:58] ok opening a task because I can't find an obvious reason [11:11:07] (03PS1) 10Btullis: Upgrade the airflow package in stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/897845 (https://phabricator.wikimedia.org/T326193) [11:11:15] !log imported php-msgpack 2.1.2+0.5.7-2+wmf1+buster1+icu67u1 T329491 [11:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:20] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [11:12:36] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40089/console" [puppet] - 10https://gerrit.wikimedia.org/r/897845 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [11:14:22] (03CR) 10Kosta Harlan: "Could someone +2 this change, please? Or should I self-merge?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [11:15:19] (03PS7) 10Slavina Stefanova: d/changelog: prepare for 0.92 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/897830 [11:15:23] (03Abandoned) 10Btullis: Upgrade the airflow package in stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/897845 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [11:19:52] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [11:21:13] (03CR) 10Volans: "I'm planning to merge this later today if noone has objections." [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [11:21:51] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [11:21:58] !log jnuche@deploy2002 Installing scap version "latest" for 553 hosts [11:22:51] !log jnuche@deploy2002 Installation of scap version "latest" completed for 553 hosts [11:25:14] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [11:26:38] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [11:26:53] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [11:31:14] !log imported php-apcu 5.1.19+4.0.11-3+wmf2+buster1+icu67u1 T329491 [11:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:19] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [11:38:41] (03PS1) 10Jbond: site.pp: ensure rule for pki does not match rpki [puppet] - 10https://gerrit.wikimedia.org/r/897851 (https://phabricator.wikimedia.org/T331867) [11:39:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] site.pp: ensure rule for pki does not match rpki [puppet] - 10https://gerrit.wikimedia.org/r/897851 (https://phabricator.wikimedia.org/T331867) (owner: 10Jbond) [11:39:48] (03PS1) 
10EoghanGaffney: Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) [11:40:09] (03CR) 10CI reject: [V: 04-1] Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [11:41:18] (03PS2) 10EoghanGaffney: Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) [11:41:39] (03CR) 10CI reject: [V: 04-1] Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [11:43:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T329260)', diff saved to https://phabricator.wikimedia.org/P45758 and previous config saved to /var/cache/conftool/dbconfig/20230313-114348-marostegui.json [11:43:54] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:45:42] (03CR) 10EoghanGaffney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [11:45:44] (03PS1) 10Jbond: pki::multirootca: Add PKI prefix to blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/897853 (https://phabricator.wikimedia.org/T331867) [11:46:33] !log imported php-igbinary 3.2.1+2.0.8-2+wmf1+buster1+icu67u1 T329491 [11:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:39] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [11:47:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40090/console" [puppet] - 10https://gerrit.wikimedia.org/r/897853 (https://phabricator.wikimedia.org/T331867) (owner: 10Jbond) [11:49:03] (03PS3) 
10EoghanGaffney: Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) [11:49:07] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki::multirootca: Add PKI prefix to blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/897853 (https://phabricator.wikimedia.org/T331867) (owner: 10Jbond) [11:49:28] (03CR) 10CI reject: [V: 04-1] Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [11:49:49] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:34] (03PS4) 10EoghanGaffney: Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) [11:53:21] (03PS33) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [11:53:46] (03CR) 10CI reject: [V: 04-1] Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [11:56:59] (03PS1) 10Volans: spicerack: add authdns_active_hosts property [software/spicerack] - 10https://gerrit.wikimedia.org/r/897858 [11:58:03] (03PS34) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [11:58:44] !log imported php-memcached 3.1.5+2.2.0-5+deb11u1+wmf1+buster1+icu67u1 T329491 [11:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:50] T329491: ICU 
transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [11:58:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P45759 and previous config saved to /var/cache/conftool/dbconfig/20230313-115854-marostegui.json [12:01:00] (03CR) 10Slyngshede: [C: 03+1] "Very nice, LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 (owner: 10Volans) [12:01:45] (03PS5) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:02:33] (03PS35) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:02:51] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [12:06:44] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40093/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:06:57] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-aux.service,cfssl-ocsprefresh-aux_front_proxy.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-dse.service,cfssl-ocsprefresh-dse_front_proxy.service,cfssl-ocsprefresh-etcd.serv [12:06:57] 
l-ocsprefresh-kafka.service,cfssl-ocsprefresh-mlserve.service,cfssl-ocsprefresh-mlserve_front_proxy.service,cfssl-ocsprefresh-mlserve_staging.service,cfssl-ocsprefresh-mlserve_staging_front_proxy.service,cfssl-ocsprefresh-wikikube.service,cfssl-ocsprefresh-wikikube_front_proxy.service,cfssl-ocsprefresh-wikikube_staging.service,cfssl-ocsprefresh-wikikube_staging_front_proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_syste [12:12:17] (03PS6) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:13:12] (03CR) 10CI reject: [V: 04-1] WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [12:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P45760 and previous config saved to /var/cache/conftool/dbconfig/20230313-121400-marostegui.json [12:14:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:16:01] RECOVERY - Host ps1-e4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [12:16:01] RECOVERY - Host ps1-e1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [12:16:01] RECOVERY - Host ps1-e3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [12:16:01] RECOVERY - Host ps1-f2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [12:16:01] RECOVERY - Host ps1-f4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [12:16:02] RECOVERY - Host ps1-f1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [12:16:02] RECOVERY - Host asw2-b-eqiad is UP: PING OK - 
Packet loss = 0%, RTA = 0.77 ms [12:16:03] RECOVERY - Host ps1-f3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [12:16:03] PROBLEM - ps1-b4-eqiad-infeed-load-tower-B-phase-Y on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:04] PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-Y on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:04] PROBLEM - ps1-c8-eqiad-infeed-load-tower-A-phase-X on ps1-c8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:05] PROBLEM - ps1-c1-eqiad-infeed-load-tower-A-phase-Y on ps1-c1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:05] PROBLEM - ps1-f4-eqiad-infeed-load-tower-A-phase-X on ps1-f4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:06] PROBLEM - ps1-d4-eqiad-infeed-load-tower-B-phase-Z on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:06] PROBLEM - ps1-e4-eqiad-infeed-load-tower-B-phase-Y on ps1-e4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:07] PROBLEM - ps1-e1-eqiad-infeed-load-tower-A-phase-Z on ps1-e1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:07] PROBLEM - 
ps1-a5-eqiad-infeed-load-tower-A-phase-Z on ps1-a5-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:08] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [12:16:11] RECOVERY - Host ps1-e2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [12:16:19] PROBLEM - Host an-worker1140 is DOWN: PING CRITICAL - Packet loss = 100% [12:16:19] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:16:37] RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-Y on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-Y 197 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:37] RECOVERY - ps1-c8-eqiad-infeed-load-tower-A-phase-X on ps1-c8-eqiad is OK: SNMP OK - ps1-c8-eqiad-infeed-load-tower-A-phase-X 394 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:37] RECOVERY - ps1-c1-eqiad-infeed-load-tower-A-phase-Y on ps1-c1-eqiad is OK: SNMP OK - ps1-c1-eqiad-infeed-load-tower-A-phase-Y 252 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:37] RECOVERY - ps1-f4-eqiad-infeed-load-tower-A-phase-X on ps1-f4-eqiad is OK: SNMP OK - ps1-f4-eqiad-infeed-load-tower-A-phase-X 278 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:37] RECOVERY - ps1-b4-eqiad-infeed-load-tower-B-phase-Y on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-B-phase-Y 232 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:38] RECOVERY - ps1-d4-eqiad-infeed-load-tower-B-phase-Z on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-B-phase-Z 433 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:38] RECOVERY - 
ps1-e4-eqiad-infeed-load-tower-B-phase-Y on ps1-e4-eqiad is OK: SNMP OK - ps1-e4-eqiad-infeed-load-tower-B-phase-Y 310 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:39] RECOVERY - ps1-e1-eqiad-infeed-load-tower-A-phase-Z on ps1-e1-eqiad is OK: SNMP OK - ps1-e1-eqiad-infeed-load-tower-A-phase-Z 391 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:39] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [12:16:40] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [12:16:41] RECOVERY - ps1-a5-eqiad-infeed-load-tower-A-phase-Z on ps1-a5-eqiad is OK: SNMP OK - ps1-a5-eqiad-infeed-load-tower-A-phase-Z 249 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:17:01] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [12:17:45] RECOVERY - Check systemd state on rpki2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:56] (03PS7) 10KartikMistry: WIP: Add new self hosted machinetranslation service [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:19:34] !log imported php-redis 5.3.2+4.3.0-2+deb11u1+wmf1+buster1+icu67u1 T329491 [12:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:40] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [12:20:05] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [12:21:20] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:21:21] 10SRE, 10ops-eqiad: mr1-eqiad down - 
https://phabricator.wikimedia.org/T331839 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Power-cycled device. mr1-eqiad has Recovered [12:21:29] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:21:35] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [12:21:53] (03PS36) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:21:59] (03PS1) 10Volans: Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:22:33] (03CR) 10CI reject: [V: 04-1] Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:22:42] (03PS37) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:23:44] (03PS2) 10Volans: Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:24:12] (03CR) 10CI reject: [V: 04-1] Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:25:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40094/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:26:43] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:27:47] (03PS3) 10Volans: Node 
regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:28:12] (03CR) 10CI reject: [V: 04-1] Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:29:01] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:29:04] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:29:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T329260)', diff saved to https://phabricator.wikimedia.org/P45761 and previous config saved to /var/cache/conftool/dbconfig/20230313-122906-marostegui.json [12:29:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:29:14] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:29:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:29:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T329260)', diff saved to https://phabricator.wikimedia.org/P45762 and previous config saved to /var/cache/conftool/dbconfig/20230313-122928-marostegui.json [12:29:46] (03PS38) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:30:15] (03PS4) 10Volans: Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:30:49] (03CR) 10CI reject: [V: 04-1] Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:31:44] (03CR) 10Jbond: 
calico/kubernetes: Replace istio_cni_token with client cert (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:32:26] (03CR) 10KartikMistry: "@Alex, You can go ahead with initial review. We need to check if cpu/memory usage is correct. I've put 32Gi for memory but unsure about cp" [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [12:34:10] (03PS5) 10Volans: Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:34:38] (03CR) 10CI reject: [V: 04-1] Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:34:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40095/console" [puppet] - 10https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:35:01] (03CR) 10Bartosz Dziewoński: [C: 03+1] "Good to go now?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (owner: 10Daniel Kinzler) [12:35:15] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40096/console" [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:35:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T329260)', diff saved to https://phabricator.wikimedia.org/P45763 and previous config saved to /var/cache/conftool/dbconfig/20230313-123543-marostegui.json [12:35:49] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:36:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:37:01] !log imported php-geoip 1.1.1-7+wmf2+buster1+icu67u1 T329491 [12:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:06] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [12:38:13] (03PS39) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:38:36] (03CR) 10CI reject: [V: 04-1] Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:38:41] (03PS6) 10Volans: Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:39:11] (03CR) 10CI reject: [V: 04-1] Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:39:29] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40097/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:41:35] (03PS7) 10Volans: Node regex should match the whole string [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 [12:41:45] (03CR) 10Jbond: [C: 03+1] "lgtm, minor optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:42:28] (03PS40) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [12:42:51] (03CR) 10Btullis: Configure the new ceph servers with mon and mgr daemons (7 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [12:43:05] (03CR) 10Volans: "Before merging we need to fix the current offenses in site.pp." [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:43:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:46:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor2005.codfw.wmnet [12:46:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor2005.codfw.wmnet [12:47:37] (03CR) 10Jbond: "LGTM open questions inline" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [12:47:48] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:47:51] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:48:06] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START
helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:48:07] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:48:10] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:48:12] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:48:48] !log restarting codfw thumbor instances to attempt to remedy 502 issues [12:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P45764 and previous config saved to /var/cache/conftool/dbconfig/20230313-125049-marostegui.json [12:52:33] (03CR) 10Volans: "reply inline" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1300). [13:00:04] TheresNoTime, koi, Lucas_WMDE, and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] !log imported php-wmerrors 2.0.0~git20190628.183ef7d-3+wmf1+buster1+icu67u1 T329491 [13:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:14] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [13:00:22] I’m having lunch but will be back to deploy my change towards the end of the window [13:00:35] I can deploy [13:00:37] Hello :) [13:01:30] hi [13:01:54] Superpes: hey, is there any specific reason to not delete the variant definition entirely? [13:02:20] koi: hi, starting with yours [13:02:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896321 (https://phabricator.wikimedia.org/T331691) (owner: 10Stang) [13:02:57] taavi urbanecm suggested me to delete them after about a week [13:03:13] Superpes: ah, sounds good [13:03:26] (03Merged) 10jenkins-bot: zhwiki: Add movefile to extendedconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896321 (https://phabricator.wikimedia.org/T331691) (owner: 10Stang) [13:03:41] !log taavi@deploy2002 Started scap: Backport for [[gerrit:896321|zhwiki: Add movefile to extendedconfirmed (T331691)]] [13:03:46] T331691: Add movefile right to extendedconfirmed group on zhwiki - https://phabricator.wikimedia.org/T331691 [13:05:11] taavi In fact I did the same with trwiki (and in the next window I'm deleting the variants)! 
If there is time, it can also be done in this window, because the patch is ready, but I saw that it was the seventh patch, so I didn't schedule it now :D [13:05:18] !log taavi@deploy2002 stang and taavi: Backport for [[gerrit:896321|zhwiki: Add movefile to extendedconfirmed (T331691)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:05:31] koi: please test [13:05:35] looking [13:05:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P45766 and previous config saved to /var/cache/conftool/dbconfig/20230313-130555-marostegui.json [13:05:56] Superpes: yeah we can do it now, it does not really add any extra time or effort needed [13:07:09] taavi Thanks :) [13:07:36] taavi, LGTM [13:07:51] thanks, syncing [13:08:01] TheresNoTime: hey, around? [13:11:51] !log imported php-luasandbox 4.0.2-3+wmf1+buster1+icu67u1 T329491 [13:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:56] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [13:13:11] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:896321|zhwiki: Add movefile to extendedconfirmed (T331691)]] (duration: 09m 29s) [13:13:16] T331691: Add movefile right to extendedconfirmed group on zhwiki - https://phabricator.wikimedia.org/T331691 [13:13:44] (03PS6) 10Majavah: [trwikiquote] Reverting temporary logo (Vector legacy + Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [13:13:50] (03PS2) 10Majavah: [trwiki] Removing the temporary logo, previously added, and already reverted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897837 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [13:13:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [13:14:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897837 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [13:14:52] (03Merged) 10jenkins-bot: [trwikiquote] Reverting temporary logo (Vector legacy + Vector 2022) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897195 (https://phabricator.wikimedia.org/T329399) (owner: 10Superpes15) [13:14:55] (03Merged) 10jenkins-bot: [trwiki] Removing the temporary logo, previously added, and already reverted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897837 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [13:15:07] !log taavi@deploy2002 Started scap: Backport for [[gerrit:897195|[trwikiquote] Reverting temporary logo (Vector legacy + Vector 2022) (T329399)]], [[gerrit:897837|[trwiki] Removing the temporary logo, previously added, and already reverted (T329047)]] [13:15:13] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [13:15:14] T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047 [13:15:16] (03CR) 10Ssingh: "Thank you for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/897833 (https://phabricator.wikimedia.org/T330670) (owner: 10Filippo Giunchedi) [13:15:51] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.316 second response time https://wikitech.wikimedia.org/wiki/Swift [13:16:32] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:16:35] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [13:16:39] !log taavi@deploy2002 taavi and superpes: Backport for [[gerrit:897195|[trwikiquote] Reverting temporary logo (Vector legacy + Vector 2022) (T329399)]], [[gerrit:897837|[trwiki] Removing the temporary logo, previously added, and already reverted (T329047)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:16:49] Superpes: please test [13:16:52] Checking [13:17:09] (03CR) 10Jbond: [C: 03+2] Node regex should match the whole string (1 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [13:17:37] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Swift [13:17:44] taavi Everything is fine [13:17:52] thanks, syncing [13:18:12] (03CR) 10Jbond: [C: 03+2] Node regex should match the whole string (1 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897871 (owner: 10Volans) [13:21:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T329260)', diff saved to https://phabricator.wikimedia.org/P45767 and previous config saved to /var/cache/conftool/dbconfig/20230313-132101-marostegui.json [13:21:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00
on db2169.codfw.wmnet with reason: Maintenance [13:21:08] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:21:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T329260)', diff saved to https://phabricator.wikimedia.org/P45768 and previous config saved to /var/cache/conftool/dbconfig/20230313-132123-marostegui.json [13:21:35] (03PS1) 10Jbond: README.release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897875 [13:21:37] (03PS1) 10Jbond: 1.1.1: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897876 [13:21:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:45] (03CR) 10Jbond: [C: 03+2] README.release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897875 (owner: 10Jbond) [13:22:49] (03CR) 10Jbond: [C: 03+2] 1.1.1: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897876 (owner: 10Jbond) [13:23:17] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:897195|[trwikiquote] Reverting temporary logo (Vector legacy + Vector 2022) (T329399)]], [[gerrit:897837|[trwiki] Removing the temporary logo, previously added, and already reverted (T329047)]] (duration: 08m 10s) [13:23:24] T329399: Temporary logo change for trwikiquote - https://phabricator.wikimedia.org/T329399 [13:23:24] T329047: Temporary logo change for trwiki - https://phabricator.wikimedia.org/T329047 [13:23:38] (03CR) 10CI reject: [V: 04-1] 1.1.1: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897876 
(owner: 10Jbond) [13:24:28] ok, deployed [13:24:52] anyone else have anything else to deploy? [13:25:01] !log imported php-excimer 1.0.2-1+wmf2+buster1+icu67u1T329491 [13:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] !log imported php-excimer 1.0.2-1+wmf2+buster1+icu67u1 T329491 [13:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:10] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [13:25:35] 10SRE, 10Traffic: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10Vgutierrez) 05In progress→03Resolved a:03ssingh [13:25:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] 1.1.1: prepare release [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897876 (owner: 10Jbond) [13:26:23] jbond: ok to merge your changes? re: pki in blackbox checks ? [13:28:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T329260)', diff saved to https://phabricator.wikimedia.org/P45769 and previous config saved to /var/cache/conftool/dbconfig/20230313-132829-marostegui.json [13:28:35] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:29:07] (03CR) 10Ssingh: "Thanks very much for the patch! 
Two quick questions inline:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/897858 (owner: 10Volans) [13:29:41] RECOVERY - Host an-worker1140 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [13:29:51] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:25] RECOVERY - SSH on an-worker1140 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:31:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:54] (03CR) 10Klausman: [C: 03+1] api-gateway: allow to configure prefixes without JWT requirements (1 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896313 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [13:33:09] (03CR) 10Klausman: [C: 03+1] services: allow anon traffic for liftwing's paths in API gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/897844 (https://phabricator.wikimedia.org/T331547) (owner: 10Elukey) [13:33:46] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/897858 (owner: 10Volans) [13:34:00] 10SRE, 10Traffic: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10ssingh) Thanks to @Vgutierrez for taking care of the rollout of this. For posterity, the final result for now before we do more enhancements: ` ===== NODE GROUP =====... [13:36:36] o/ [13:36:39] (03CR) 10Ssingh: [C: 03+1] spicerack: add authdns_active_hosts property (2 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/897858 (owner: 10Volans) [13:36:45] taavi: can I deploy or is something else going on?
[13:37:03] Lucas_WMDE: sure, go ahead [13:37:07] ok thanks [13:38:59] (03PS4) 10Lucas Werkmeister (WMDE): termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176) [13:39:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Rebased and updated the commit message; earlier +1 should still apply. Deploying." [deployment-charts] - 10https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176) (owner: 10Lucas Werkmeister (WMDE)) [13:40:00] !log imported wikidiff2 1.13.0-1+wmf1+buster1+icu67u1 T329491 [13:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:06] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [13:42:18] (03PS1) 10Marostegui: mariadb: Move db1106 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/897881 (https://phabricator.wikimedia.org/T331875) [13:43:00] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1106 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/897881 (https://phabricator.wikimedia.org/T331875) (owner: 10Marostegui) [13:43:10] zabe: You're awesome, BTW! [13:43:21] jbond: ok to merge your change? [13:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P45770 and previous config saved to /var/cache/conftool/dbconfig/20230313-134336-marostegui.json [13:44:15] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:44:57] (03Merged) 10jenkins-bot: termbox(prod): update to 2023-03-06-101138-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/894599 (https://phabricator.wikimedia.org/T309176) (owner: 10Lucas Werkmeister (WMDE)) [13:45:53] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [13:46:47] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [13:47:12] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [13:48:25] !log milimetric@deploy2002 Started deploy [airflow-dags/analytics@4696eff]: Deploying analytics dags from origin/main_airflow_2.5 [airflow-dags@4f393e6] [13:48:36] !log milimetric@deploy2002 Finished deploy [airflow-dags/analytics@4696eff]: Deploying analytics dags from origin/main_airflow_2.5 [airflow-dags@4f393e6] (duration: 00m 11s) [13:48:40] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [13:49:29] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [13:49:36] (03PS1) 10Marostegui: Revert "pki::multirootca: Add PKI prefix to blackbox checks" [puppet] - 10https://gerrit.wikimedia.org/r/897201 [13:50:42] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [13:50:45] (03Abandoned) 10Marostegui: Revert "pki::multirootca: Add PKI prefix to blackbox checks" [puppet] - 10https://gerrit.wikimedia.org/r/897201 (owner: 10Marostegui) [13:51:29] okay, I think I’m done too [13:51:49] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:55:47] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Lucas_Werkmeister_WMDE) [13:56:23] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:56:50] 10SRE, 10API Platform, 10Traffic: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (10daniel) >>! In T319423#8385567, @Joe wrote: > FWIW we're banning more generic UAs via dynamic requestctl rules; our rule of thumb is to start rate-limiting requ... [13:56:58] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Lucas_Werkmeister_WMDE) [13:57:11] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:57:23] dbproxy alerts are to be expected [13:58:30] 10SRE, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Elitre) [13:58:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P45772 and previous config saved to /var/cache/conftool/dbconfig/20230313-135842-marostegui.json [13:59:31] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [13:59:38] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:59:41] (03PS4) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 
(https://phabricator.wikimedia.org/T331310) [13:59:57] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:00:00] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:00:14] (03PS1) 10Jbond: site.pp: fix style violations [puppet] - 10https://gerrit.wikimedia.org/r/897886 [14:09:15] (03CR) 10Subramanya Sastry: "Yes, good to go if Daniel is ready for it and isn't tweaking other things that he wants to test first before deploying this. Removing my -" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (owner: 10Daniel Kinzler) [14:10:51] jouncebot: nowandnext [14:10:51] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [14:10:51] In 1 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1530) [14:11:17] (03PS1) 10BBlack: dns: add more reflection debugging records [dns] - 10https://gerrit.wikimedia.org/r/897887 [14:12:46] (03CR) 10Clément Goubert: [C: 03+2] mwdebug_deploy: clean up physical resources from target hosts [puppet] - 10https://gerrit.wikimedia.org/r/896355 (owner: 10Jaime Nuche) [14:13:13] (03CR) 10BBlack: [C: 03+2] dns: add more reflection debugging records [dns] - 10https://gerrit.wikimedia.org/r/897887 (owner: 10BBlack) [14:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T329260)', diff saved to https://phabricator.wikimedia.org/P45773 and previous config saved to /var/cache/conftool/dbconfig/20230313-141348-marostegui.json [14:13:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [14:13:54] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:14:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [14:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T329260)', diff saved to https://phabricator.wikimedia.org/P45774 and previous config saved to /var/cache/conftool/dbconfig/20230313-141409-marostegui.json [14:15:02] (03CR) 10D3r1ck01: Make VE on officewiki use Parsoid directly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (owner: 10Daniel Kinzler) [14:15:56] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:16:39] (03PS5) 10Clément Goubert: mwdebug_deploy: remove configuration [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [14:18:54] (03CR) 10Clément Goubert: [C: 03+2] mwdebug_deploy: remove configuration [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [14:20:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T329260)', diff saved to https://phabricator.wikimedia.org/P45776 and previous config saved to /var/cache/conftool/dbconfig/20230313-142004-marostegui.json [14:20:10] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:20:35] (03PS1) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [14:20:57] (03CR) 10CI reject: [V: 04-1] spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [14:21:11] (03PS4) 10Clément Goubert: switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [14:22:41] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 
others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:23:19] (03CR) 10Clément Goubert: [C: 03+2] switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [14:23:37] !log switch noc.wikimedia.org from eqiad to codfw - T331634 [14:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:42] T331634: switch noc.wikimedia.org from eqiad to codfw - https://phabricator.wikimedia.org/T331634 [14:24:50] (03PS2) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [14:25:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:26:37] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) >>! In T328872#8683319, @MatthewVernon wrote: > I am a bit concerned about the small rise... [14:26:40] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10KOfori) [14:26:50] (03PS3) 10Nicolas Fraison: jobhistory: add prometheus jmx javaagent on prod jobhistory [puppet] - 10https://gerrit.wikimedia.org/r/896305 [14:27:46] (03PS2) 10Muehlenhoff: Add Cumin aliases for IDM [puppet] - 10https://gerrit.wikimedia.org/r/896112 (https://phabricator.wikimedia.org/T320797) [14:28:51] (03CR) 10Muehlenhoff: [C: 03+2] slapd: correct module loading [puppet] - 10https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [14:30:36] (03CR) 10Herron: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [14:30:53] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/896112 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [14:31:45] (03PS10) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [14:31:58] (03PS4) 10Jbond: Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) [14:33:30] (03PS3) 10Muehlenhoff: Configure database size for MDB backend [puppet] - 10https://gerrit.wikimedia.org/r/896359 (https://phabricator.wikimedia.org/T292942) [14:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P45777 and previous config saved to /var/cache/conftool/dbconfig/20230313-143510-marostegui.json [14:35:11] (03CR) 10Volans: [C: 03+2] spicerack: add authdns_active_hosts property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/897858 (owner: 10Volans) [14:35:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) I am on the phone again with Dell now on this case, since 3-06-2023 no update on the case. 
[14:35:47] (03CR) 10Jbond: [C: 03+2] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [14:35:52] (03CR) 10Jbond: [C: 03+2] Ship custom /etc/logrotate.d/rsyslog on KDC hosts [puppet] - 10https://gerrit.wikimedia.org/r/894647 (https://phabricator.wikimedia.org/T331123) (owner: 10Jbond) [14:38:56] !log disable puppet fleet wide to debug strange issue [14:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:00] (03Merged) 10jenkins-bot: spicerack: add authdns_active_hosts property [software/spicerack] - 10https://gerrit.wikimedia.org/r/897858 (owner: 10Volans) [14:41:38] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.07771 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:41:46] * jbond fixing ^^ [14:43:49] (03PS1) 10Jbond: rsyslog: fix source file location [puppet] - 10https://gerrit.wikimedia.org/r/897902 [14:44:25] (03CR) 10Jbond: [C: 03+2] rsyslog: fix source file location [puppet] - 10https://gerrit.wikimedia.org/r/897902 (owner: 10Jbond) [14:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:50:14] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [14:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P45778 and previous config saved to /var/cache/conftool/dbconfig/20230313-145016-marostegui.json [14:50:57] (03CR) 10Tchanders: [C: 03+1] "Looks good - we can do this in the next convenient window!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533) (owner: 10TsepoThoabala) [14:51:40] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) Dell create a new case ` Work Order 439017270 has been submitted for 2 8tb hard drives parts only. [14:51:40] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [14:51:54] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [14:51:58] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [14:53:41] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: sync [14:54:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:55:16] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002933 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:58:43] (03PS4) 10DCausse: rdf-streaming-updater: add a "wcqs" release [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 [14:59:02] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Investigate methods to rate-limit/discard excessive log messages closer to the producer - https://phabricator.wikimedia.org/T331879 (10herron) [14:59:19] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Investigate methods to rate-limit/discard excessive log messages closer to the producer - https://phabricator.wikimedia.org/T331879 (10herron) a:05lmata→03None [14:59:21] (03PS3) 10Kimberly Sarabia: Add header at top of main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894765 
(https://phabricator.wikimedia.org/T325362) [14:59:29] (03CR) 10Tchanders: [C: 03+1] "This looks good to me, but I've added a couple more reviewers, because I'm not 100% sure - is this safe to do all at once?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: 10TsepoThoabala) [15:02:42] (03CR) 10DCausse: rdf-streaming-updater: add a "wcqs" release (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896362 (owner: 10DCausse) [15:02:59] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin aliases for IDM [puppet] - 10https://gerrit.wikimedia.org/r/896112 (https://phabricator.wikimedia.org/T320797) (owner: 10Muehlenhoff) [15:03:48] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:04:11] jouncebot nowandnext [15:04:11] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [15:04:11] In 0 hour(s) and 25 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1530) [15:05:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T329260)', diff saved to https://phabricator.wikimedia.org/P45779 and previous config saved to /var/cache/conftool/dbconfig/20230313-150523-marostegui.json [15:05:25] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:05:28] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:06:07] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:07:13] (03PS3) 10Volans: docstrings: automatically document type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/896323 [15:07:22] jbond: I am seeing 'Could not evaluate: Could not retrieve file metadata for puppet:///profile/rsyslog/logrotate.conf: Error 500 on SERVER: Server Error: Not authorized to call find on /file_metadata/profile/rsyslog/logrotate.conf' on cloud vps instances, 
maybe related to your recent rsyslog patches? [15:08:09] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) @ayounsi, @cmooney: Quick question about Junos OS: so we are planning to spread `ns0` over `dns100[123]` and `ns1` over `dns200[123]`, similar to how we are doing wit... [15:08:54] taavi: that should have been fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/897902 [15:09:06] (03PS3) 10SBassett: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [15:09:18] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) >>>! In T328872#8685266, @Yann wrote: >> This bug would not be so bad if it would be possible to s... [15:09:33] taavi: actually I forgot to update cloud.yaml, one sec [15:10:48] (03PS1) 10Jbond: rsyslog: fix source file location [puppet] - 10https://gerrit.wikimedia.org/r/897907 [15:11:02] (03CR) 10Jbond: [V: 03+2 C: 03+2] rsyslog: fix source file location [puppet] - 10https://gerrit.wikimedia.org/r/897907 (owner: 10Jbond) [15:11:19] taavi: ok, should be fixed now [15:11:33] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/897886 (owner: 10Jbond) [15:12:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!"
[puppet] - 10https://gerrit.wikimedia.org/r/897886 (owner: 10Jbond) [15:14:34] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.190 second response time https://wikitech.wikimedia.org/wiki/Swift [15:14:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:19] (03PS1) 10Elukey: Revert "Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list"" [puppet] - 10https://gerrit.wikimedia.org/r/897203 [15:16:26] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:17:23] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list"" [puppet] - 10https://gerrit.wikimedia.org/r/897203 (owner: 10Elukey) [15:18:22] (03CR) 10Elukey: [C: 03+2] Revert "Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list"" [puppet] - 10https://gerrit.wikimedia.org/r/897203 (owner: 10Elukey) [15:19:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:06] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Swift [15:21:19] !log dancy@deploy2002 Started scap: testing T329857 [15:21:25] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [15:24:12] (03PS2) 10Majavah: Remove l10nupdate support [puppet] - 
10https://gerrit.wikimedia.org/r/896318 [15:24:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [15:25:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [15:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:29:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:29:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:29:54] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [15:30:05] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1530). [15:31:28] !log dancy@deploy2002 Finished scap: testing T329857 (duration: 10m 08s) [15:31:33] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [15:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:33:59] (03CR) 10Elukey: "Didn't we discuss a while ago to avoid using calico specific policies in favor of more standard k8s ones? 
Just to understand the direction" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [15:35:01] (03CR) 10Elukey: [C: 03+1] Migrate away from deprecated typology annotations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: 10JMeybohm) [15:35:40] !log imported php-yaml 2.2.1+2.1.0+2.0.4+1.3.2-2+wmf1~buster1+icu67u1 T329491 [15:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:45] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [15:37:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [15:38:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [15:38:35] (03PS1) 10Esanders: Disable visual enhancements on newsectionlink pages initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897912 (https://phabricator.wikimedia.org/T331635) [15:42:28] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-03-13-121751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/897914 [15:43:45] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897915 (https://phabricator.wikimedia.org/T128546) [15:44:22] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10RobH) [15:45:10] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897915 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:45:54] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897915 
(https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:46:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [15:46:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [15:46:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T329260)', diff saved to https://phabricator.wikimedia.org/P45780 and previous config saved to /var/cache/conftool/dbconfig/20230313-154641-marostegui.json [15:46:48] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:47:05] (03PS1) 10Filippo Giunchedi: centrallog: restore benthos consumer group and read all partitions [puppet] - 10https://gerrit.wikimedia.org/r/897916 (https://phabricator.wikimedia.org/T331801) [15:47:52] 10SRE, 10SRE Observability, 10Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10elukey) Me and Filippo tried a ton of workarounds and solutions today, but none of them really worked. In the end we removed th... 
[15:48:17] (03CR) 10Elukey: [C: 03+1] centrallog: restore benthos consumer group and read all partitions [puppet] - 10https://gerrit.wikimedia.org/r/897916 (https://phabricator.wikimedia.org/T331801) (owner: 10Filippo Giunchedi) [15:49:49] (03CR) 10Filippo Giunchedi: [C: 03+2] centrallog: restore benthos consumer group and read all partitions [puppet] - 10https://gerrit.wikimedia.org/r/897916 (https://phabricator.wikimedia.org/T331801) (owner: 10Filippo Giunchedi) [15:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:50:27] (03PS1) 10EoghanGaffney: Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T32797) [15:50:48] (03CR) 10CI reject: [V: 04-1] Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T32797) (owner: 10EoghanGaffney) [15:51:09] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ayounsi) Yep that should be enough as both hosts are directly reachable by the router (they're in row A/B/D). We will need to look closely at them the day they're behind L3 s... [15:51:25] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Cmjohnson) @MatthewVernon the raid configuration states "\Partitioning/Raid: Same as existing ms-fe hosts" Can you be more specific, is this h/w rai... 
[15:51:46] (03PS2) 10EoghanGaffney: Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T32797) [15:53:24] (03CR) 10Jbond: [C: 03+2] site.pp: fix style violations [puppet] - 10https://gerrit.wikimedia.org/r/897886 (owner: 10Jbond) [15:53:49] (03CR) 10CI reject: [V: 04-1] Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T32797) (owner: 10EoghanGaffney) [15:54:39] (03PS3) 10EoghanGaffney: Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T327978) [15:54:44] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] cfssl/cert: Allow to absent cert resources [puppet] - 10https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [15:54:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:20] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10ayounsi) I noticed that it's running `19.1R3-S2.3` we should upgrade it to latest Junos recommended bef... 
[15:56:46] (03CR) 10Atieno: [V: 03+2 C: 03+2] Add the ability to specify the default DPI value for PDF files (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) (owner: 10Vlad.shapik) [15:57:46] (03PS4) 10JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) [15:57:48] (03PS6) 10JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) [15:58:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8688051, @ayounsi wrote: > I noticed that it's running `19.1R3-S2.3` we should... [15:58:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T329260)', diff saved to https://phabricator.wikimedia.org/P45781 and previous config saved to /var/cache/conftool/dbconfig/20230313-155830-marostegui.json [15:58:37] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:59:08] (03CR) 10JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [16:00:10] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:897915| Bumping portals to master (T128546)]] (duration: 06m 43s) [16:00:15] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:00:28] (03CR) 10Btullis: [C: 03+1] jobhistory: add prometheus jmx javaagent on prod jobhistory [puppet] - 
10https://gerrit.wikimedia.org/r/896305 (owner: 10Nicolas Fraison) [16:00:42] !log imported xdebug 3.0.3+2.9.8+2.8.1+2.5.5-0+deb11u1+wmf1+buster1+icu67u1 T329491 [16:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:47] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [16:00:55] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-03-13-121751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/897914 (owner: 10BryanDavis) [16:00:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:45] (03PS5) 10JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) [16:01:47] (03PS7) 10JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) [16:03:22] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40099/console" [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [16:04:37] (03Merged) 10jenkins-bot: Add the ability to specify the default DPI value for PDF files [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) (owner: 10Vlad.shapik) [16:05:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:06:26] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:897915| Bumping portals to master (T128546)]] (duration: 06m 15s) [16:06:32] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:06:33] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-03-13-121751-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/897914 (owner: 10BryanDavis) [16:06:46] (03CR) 10Jbond: [C: 03+1] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - 10https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [16:07:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40100/console" [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:08:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:08:35] (03PS2) 10JMeybohm: Migrate away from deprecated topology annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) [16:08:59] (03CR) 10JMeybohm: Migrate away from deprecated topology annotations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: 10JMeybohm) [16:10:55] (03CR) 10JMeybohm: [C: 03+2] custom_deploy.d: Make k8s 1.23 istio configs the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/896131 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:11:10] !log dancy@deploy2002 Started scap: testing 
[16:11:17] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Enable stable certificate request names in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/896111 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [16:12:04] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Move default kubernetes version to 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:13:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P45782 and previous config saved to /var/cache/conftool/dbconfig/20230313-161337-marostegui.json [16:14:57] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Add default-network-policy globally [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [16:15:49] (03Merged) 10jenkins-bot: custom_deploy.d: Make k8s 1.23 istio configs the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/896131 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [16:15:56] (03Merged) 10jenkins-bot: cert-manager: Enable stable certificate request names in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/896111 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [16:16:02] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [16:16:25] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [16:16:33] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [16:17:05] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [16:17:10] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [16:17:40] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [16:18:03] !log dancy@deploy2002 Finished scap: testing 
(duration: 06m 53s) [16:18:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:20:19] !log imported tideways 5.0.4-2+wmf1+buster1+icu67u1 T329491 [16:20:20] (03Merged) 10jenkins-bot: admin_ng: Add default-network-policy globally [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) (owner: 10JMeybohm) [16:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:24] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [16:22:04] 10SRE, 10serviceops: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10MoritzMuehlenhoff) These packages have been rebuilt against the ICU67-enabled PHP packages and imported to the component/icu67 component (some packages depend on others, e.g. igbinary on apcu and memcached on igbi... [16:22:26] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Jclark-ctr) I am in process of racking right now will have them finished being racked and cabled in the next day or so [16:23:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:24:09] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Marostegui) Amazing, thank you! 
[16:24:36] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Migrate Kerberos clients towards TCP - https://phabricator.wikimedia.org/T329839 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:26:08] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MPhamWMF) [16:27:54] (03CR) 10Nicolas Fraison: [C: 03+2] jobhistory: add prometheus jmx javaagent on prod jobhistory [puppet] - 10https://gerrit.wikimedia.org/r/896305 (owner: 10Nicolas Fraison) [16:28:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P45783 and previous config saved to /var/cache/conftool/dbconfig/20230313-162843-marostegui.json [16:29:45] (03CR) 10JMeybohm: [C: 03+2] Revert: cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/896128 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [16:35:19] (03Merged) 10jenkins-bot: Revert: cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/896128 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [16:35:46] (03PS1) 10Jbond: node_regex: add a fixer [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/897925 [16:36:46] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10RobH) [16:36:52] !log jayme@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[16:43:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T329260)', diff saved to https://phabricator.wikimedia.org/P45784 and previous config saved to /var/cache/conftool/dbconfig/20230313-164349-marostegui.json [16:43:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [16:43:55] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:44:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [16:44:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T329260)', diff saved to https://phabricator.wikimedia.org/P45785 and previous config saved to /var/cache/conftool/dbconfig/20230313-164410-marostegui.json [16:47:37] !log jayme@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:50:46] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-03-13-164047-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/897948 [16:54:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T329260)', diff saved to https://phabricator.wikimedia.org/P45787 and previous config saved to /var/cache/conftool/dbconfig/20230313-165449-marostegui.json [16:54:54] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:57:45] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Mike_Peel) Not sure if it's related, but I keep getting this error when uploading new version of files.... 
[16:58:46] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10ayounsi) Note that we will only have one mx480 (current cr3-esams), the other router will be the current cr3-knams (mx204). Future refresh will... [16:58:58] (03PS2) 10Daniel Kinzler: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) [16:59:06] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-03-13-164047-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/897948 (owner: 10BryanDavis) [16:59:12] (03CR) 10Daniel Kinzler: Make VE on officewiki use Parsoid directly (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [16:59:14] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2023-03-10-212005-production [puppet] - 10https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759) (owner: 10BryanDavis) [16:59:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10wiki_willy) Hi @MatthewVernon & @Jclark-ctr - if this sample drive looks good, let me know and we'll work on ordering a bunch more to keep them onsite as spares. Thanks,... [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1700) [17:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1700). 
[17:00:11] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10ItamarWMDE) [17:02:42] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10BBlack) Resilient hashing indeed sounds much better (it seems like that's their codeword for some internal "consistent hashing" implementation), but it doesn't look like our... [17:03:50] jouncebot nowandnext [17:03:50] For the next 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1700) [17:03:50] For the next 0 hour(s) and 26 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T1700) [17:03:50] In 2 hour(s) and 56 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T2000) [17:04:00] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-03-13-164047-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/897948 (owner: 10BryanDavis) [17:04:07] !log dancy@deploy2002 Installing scap version "4.46.0" for 553 hosts [17:04:54] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10RobH) [17:04:55] !log Ran cache.purge_openstack_users() for Striker following deploy of e1f7491 (T331674) [17:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:59] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [17:05:00] T331674: Some tool maintainers not showing in Striker UI following config change - https://phabricator.wikimedia.org/T331674 [17:05:38] PROBLEM - MariaDB Replica Lag: s8 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag 
Replication lag: 1176.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:05:46] (03PS1) 10JMeybohm: cert-manager: Actually run 1.10.1 with chart version 1.10.x [deployment-charts] - 10https://gerrit.wikimedia.org/r/897950 (https://phabricator.wikimedia.org/T325292) [17:07:41] !log dancy@deploy2002 Installing scap version "4.46.0" for 553 hosts [17:07:49] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10RobH) >>! In T331886#8688402, @ayounsi wrote: > Note that we will only have one mx480 (current cr3-esams), the other router will be the current... [17:08:37] !log dancy@deploy2002 Installation of scap version "4.46.0" completed for 553 hosts [17:09:50] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:09:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P45788 and previous config saved to /var/cache/conftool/dbconfig/20230313-170955-marostegui.json [17:10:15] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:10:24] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:10:53] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:10:55] !log roll-restart of codfw eqiad frontends [17:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:05] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:11:08] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [17:11:11] 10SRE, 10Traffic: Deprecating the dns::auth role and moving authdns[12]001 to 
dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ayounsi) Good point! Looks like it's only for switches, not common! Compatible with the routers, there is `load-balance consistent-hash` but only for BGP peers: > (BGP only)... [17:11:28] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:12:21] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:12:50] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:13:39] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:13:42] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:15:27] !log dancy@deploy2002 Started scap: testing T329857 [17:15:32] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [17:16:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [17:16:51] 10SRE, 10ops-knams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10ayounsi) > In reviewing the contract, its "Precabling/Patch Panels : Fiber – 6 Ports" so it's a bundle and likely has to terminate in the same... 
[17:22:22] !log dancy@deploy2002 Finished scap: testing T329857 (duration: 06m 54s) [17:22:28] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [17:25:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P45789 and previous config saved to /var/cache/conftool/dbconfig/20230313-172503-marostegui.json [17:32:50] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:33:00] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:33:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) No complaints from me, thanks, drive is now 25% loaded and behaving fine. [17:35:17] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:38:16] (03PS2) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [17:38:57] (03CR) 10CI reject: [V: 04-1] DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: 10Aaron Schulz) [17:39:33] (03PS3) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [17:40:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T329260)', diff saved to https://phabricator.wikimedia.org/P45790 and previous config saved to 
/var/cache/conftool/dbconfig/20230313-174009-marostegui.json [17:40:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [17:40:12] (03PS4) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [17:40:15] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:40:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [17:40:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T329260)', diff saved to https://phabricator.wikimedia.org/P45791 and previous config saved to /var/cache/conftool/dbconfig/20230313-174030-marostegui.json [17:42:58] (03PS1) 10Andrew Bogott: rbd2backy2.py: handle empty 'expire' stamps [puppet] - 10https://gerrit.wikimedia.org/r/897955 [17:43:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:44:17] !log dancy@deploy2002 Started scap: test cleanup [17:50:21] (03CR) 10Dzahn: [C: 03+1] Add blackbox http check for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/897917 (https://phabricator.wikimedia.org/T327978) (owner: 10EoghanGaffney) [17:50:58] !log dancy@deploy2002 Finished scap: test cleanup (duration: 06m 40s) [17:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T329260)', diff saved to https://phabricator.wikimedia.org/P45792 and previous config saved to /var/cache/conftool/dbconfig/20230313-175109-marostegui.json [17:51:15] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:51:40] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail 
handling - https://phabricator.wikimedia.org/T331138 (10Joe) >>! In T331138#8682260, @Krinkle wrote: > No mention of a production error in this task. Just to be sure I understood correctly: is fact we're not cleanin... [17:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:52:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10wiki_willy) Thanks for confirming @MatthewVernon. ++@robh to order spares for both eqiad and codfw >>! In T329305#8688732, @MatthewVernon wrote: > No complaints from me... [17:55:15] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1001.eqiad.wmnet [17:55:29] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Cmjohnson) [17:55:42] 10SRE, 10SRE-Access-Requests: Enroll Lucas Werkmeister’s YubiKey for production access - https://phabricator.wikimedia.org/T273193 (10Lucas_Werkmeister_WMDE) (FWIW, the same YubiKey now does require touch as expected – ever since I got a new work laptop, IIRC. Slightly weird but not the end of the world.) [17:56:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [17:56:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:59:13] (03CR) 10Dzahn: [C: 03+2] "after puppet on prometheus hosts was fixed (unrelated) this started to alert.. 
503 in codfw and 403 in eqiad, as reported by Jelto on the " [puppet] - 10https://gerrit.wikimedia.org/r/893828 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [18:04:08] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10xcollazo) Hal needs to deploy to the `platform-eng` Airflow instance. So he needs `platform-eng-deployers`. [18:06:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P45793 and previous config saved to /var/cache/conftool/dbconfig/20230313-180615-marostegui.json [18:07:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [18:07:41] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [18:07:46] RECOVERY - MariaDB Replica Lag: s8 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:11:02] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet [18:17:31] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet [18:21:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P45794 and previous config saved to /var/cache/conftool/dbconfig/20230313-182121-marostegui.json [18:28:24] (03CR) 10Dzahn: [C: 03+1] Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [18:31:28] (03CR) 10Gergő Tisza: [C: 03+2] changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 
(https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [18:32:58] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:33:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:36:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T329260)', diff saved to https://phabricator.wikimedia.org/P45795 and previous config saved to /var/cache/conftool/dbconfig/20230313-183628-marostegui.json [18:36:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:36:33] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:36:35] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@a8d066e]: Parameterize streaming updater reconcile start date [18:36:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:36:49] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@a8d066e]: Parameterize streaming updater reconcile start date (duration: 00m 14s) [18:37:56] (03Merged) 10jenkins-bot: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [18:38:54] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:43:15] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@196e10d]: allow spark3-submit as a valid spark exeutable [18:43:28] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@196e10d]: allow spark3-submit as a valid spark exeutable (duration: 00m 13s) [18:44:43] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [18:44:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [18:45:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T329260)', diff saved to https://phabricator.wikimedia.org/P45796 and previous config saved to /var/cache/conftool/dbconfig/20230313-184502-marostegui.json [18:45:07] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:47:43] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:48:02] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:48:05] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:48:24] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1002.eqiad.wmnet [18:49:03] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet [18:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T329260)', diff saved to https://phabricator.wikimedia.org/P45797 and previous config saved to /var/cache/conftool/dbconfig/20230313-185558-marostegui.json [18:56:04] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:59:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet [18:59:42] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1002.eqiad.wmnet [19:00:26] !log eevans@cumin1001 START - 
Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [19:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:05:50] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:07:15] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [19:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:11:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P45798 and previous config saved to /var/cache/conftool/dbconfig/20230313-191104-marostegui.json [19:19:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:24:13] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite) [19:24:43] (03CR) 10Nicolas Fraison: [C: 03+1] Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [19:26:11] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P45799 and previous config saved to /var/cache/conftool/dbconfig/20230313-192610-marostegui.json [19:30:10] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1003.eqiad.wmnet [19:38:04] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sessionstore1003.eqiad.wmnet [19:38:15] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1003.eqiad.wmnet [19:39:29] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1003.eqiad.wmnet [19:41:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T329260)', diff saved to https://phabricator.wikimedia.org/P45800 and previous config saved to /var/cache/conftool/dbconfig/20230313-194116-marostegui.json [19:41:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [19:41:23] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [19:41:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [19:41:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T329260)', diff saved to https://phabricator.wikimedia.org/P45801 and previous config saved to /var/cache/conftool/dbconfig/20230313-194148-marostegui.json [19:50:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1003.eqiad.wmnet [19:50:52] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1003.eqiad.wmnet [19:51:39] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware 
upgrade firmware for hosts sessionstore1001.eqiad.wmnet [19:51:41] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [19:51:54] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sessionstore1001.eqiad.wmnet [19:52:24] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [19:52:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T329260)', diff saved to https://phabricator.wikimedia.org/P45802 and previous config saved to /var/cache/conftool/dbconfig/20230313-195244-marostegui.json [19:52:49] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T2000). [20:00:05] kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:33] I can deploy [20:01:25] kimberly_sarabia: ready? [20:02:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [20:02:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sessionstore1001.eqiad.wmnet [20:03:06] kindrobot: Kim is having issues connecting to IRC, just one sec... [20:03:21] Sure, no problem. 
[20:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P45803 and previous config saved to /var/cache/conftool/dbconfig/20230313-200750-marostegui.json [20:08:13] kindrobot: Kim's having connection issues, I can monitor the backport for her [20:08:41] Sounds good. You have context on the change, eh? [20:08:52] kindrobot: yup :) [20:09:00] Great. [20:09:17] I'll start the backport. [20:13:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894765 (https://phabricator.wikimedia.org/T325362) (owner: 10Kimberly Sarabia) [20:14:30] (03Merged) 10jenkins-bot: Add header at top of main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894765 (https://phabricator.wikimedia.org/T325362) (owner: 10Kimberly Sarabia) [20:14:42] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:894765|Add header at top of main page (T325362)]] [20:14:47] T325362: Add a header at the top of the Main page of French, Kotava and Konkani projects - https://phabricator.wikimedia.org/T325362 [20:15:17] !log start UTC late backport window [20:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:20] !log kindrobot@deploy2002 kindrobot and ksarabia: Backport for [[gerrit:894765|Add header at top of main page (T325362)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:16:42] jan_drewniak: ready on debug, could you please confirm the changes? [20:20:35] kindrobot: checking... looks good! 
ready to sync :) [20:21:03] Syncing o7 [20:22:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P45804 and previous config saved to /var/cache/conftool/dbconfig/20230313-202256-marostegui.json [20:26:54] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:894765|Add header at top of main page (T325362)]] (duration: 12m 11s) [20:26:59] T325362: Add a header at the top of the Main page of French, Kotava and Konkani projects - https://phabricator.wikimedia.org/T325362 [20:27:28] Sync finished. Thanks jan_drewniak [20:27:35] !log close UTC late backport window [20:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:50] kindrobot: thanks! [20:38:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T329260)', diff saved to https://phabricator.wikimedia.org/P45805 and previous config saved to /var/cache/conftool/dbconfig/20230313-203802-marostegui.json [20:38:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [20:38:09] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:38:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [20:38:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T329260)', diff saved to https://phabricator.wikimedia.org/P45806 and previous config saved to /var/cache/conftool/dbconfig/20230313-203824-marostegui.json [20:43:01] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@8685c9e]: drop_dated_directories.py must run through skein [20:43:16] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@8685c9e]: drop_dated_directories.py must run through skein (duration: 00m 14s) [20:47:23] !log herron@cumin1001 START - 
Cookbook sre.hosts.reimage for host kafka-logging2001.codfw.wmnet with OS bullseye [20:47:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:50:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T329260)', diff saved to https://phabricator.wikimedia.org/P45807 and previous config saved to /var/cache/conftool/dbconfig/20230313-205004-marostegui.json [20:50:10] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:52:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:58:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [21:00:04] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230313T2100). [21:00:11] ^logging-codfw is me, it's expected. 
silenced already on the icinga side, will silence this as well [21:00:11] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [21:01:45] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: host reimage [21:04:59] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-logging2001.codfw.wmnet with reason: host reimage [21:05:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P45808 and previous config saved to /var/cache/conftool/dbconfig/20230313-210510-marostegui.json [21:06:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [21:09:48] (03PS4) 10SBassett: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [21:12:05] (03PS1) 10Zabe: dewiki: Allow 'crats to remove sysopship and manage importers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/897997 (https://phabricator.wikimedia.org/T331921) [21:13:49] Hey all - no sec patches for today’s deployment window - but I would like to do a quick config change if there are no objections: https://gerrit.wikimedia.org/r/896085 [21:14:43] (03CR) 10Effie Mouzeli: [C: 03+1] trafficserver: move testwikidata to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: 10Clément Goubert) [21:14:47] * herzog can test in mwdebug if needed sbassett [21:15:01] but I don't think there are interface changes that can be checked [21:19:57] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:20:03] herzog: only if we add you to the deny list somehow, which seems a bit much. In theory, logstash should be able to provide us with enough detail? [21:20:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P45809 and previous config saved to /var/cache/conftool/dbconfig/20230313-212017-marostegui.json [21:20:27] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:20:29] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:20:42] I think logstash should be enough [21:23:47] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-logging2001.codfw.wmnet with OS bullseye [21:23:51] (03PS1) 10Dzahn: releases: limit monitoring of releases-jenkins to active server [puppet] - 10https://gerrit.wikimedia.org/r/897999 (https://phabricator.wikimedia.org/T330960) [21:25:54] (03CR) 10Dzahn: [C: 03+2] releases: limit monitoring of releases-jenkins to active server [puppet] - 10https://gerrit.wikimedia.org/r/897999 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [21:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:35:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T329260)', diff saved to https://phabricator.wikimedia.org/P45810 and previous config saved to /var/cache/conftool/dbconfig/20230313-213523-marostegui.json [21:35:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:35:29] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [21:35:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T329260)', diff saved to https://phabricator.wikimedia.org/P45811 and previous config saved to /var/cache/conftool/dbconfig/20230313-213544-marostegui.json [21:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:42:27] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Krinkle) >>! In T331138#8688891, @Joe wrote: > not cleaning up stale thumbnails and we're still serving them to the public not a production error? Can you sugge... 
[21:47:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T329260)', diff saved to https://phabricator.wikimedia.org/P45812 and previous config saved to /var/cache/conftool/dbconfig/20230313-214751-marostegui.json [21:47:57] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [22:02:56] (03CR) 10Dzahn: [C: 03+2] "missed linking this to T327975" [puppet] - 10https://gerrit.wikimedia.org/r/897999 (https://phabricator.wikimedia.org/T330960) (owner: 10Dzahn) [22:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P45813 and previous config saved to /var/cache/conftool/dbconfig/20230313-220257-marostegui.json [22:03:22] (03PS1) 10Dzahn: releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) [22:03:32] (03CR) 10CI reject: [V: 04-1] releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) (owner: 10Dzahn) [22:03:51] (03PS2) 10Dzahn: releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) [22:04:12] (03CR) 10CI reject: [V: 04-1] releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) (owner: 10Dzahn) [22:05:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:09:51] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - 
https://phabricator.wikimedia.org/T331860 (10wiki_willy) a:03Jclark-ctr Hi @Jclark-ctr - just a heads up that this one is out of warranty, but @RobH is working on purchasing more spares after the testing in T329305... [22:10:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:12:43] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40101/console" [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [22:15:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (10wiki_willy) a:05Jclark-ctr→03Papaul [22:17:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:18:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P45814 and previous config saved to /var/cache/conftool/dbconfig/20230313-221803-marostegui.json [22:20:12] (03CR) 10SBassett: [C: 03+2] eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [22:20:20] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Add dummy 'config_deploy_vars' for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/897852 (https://phabricator.wikimedia.org/T322369) (owner: 
10EoghanGaffney) [22:20:57] (03Merged) 10jenkins-bot: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [22:22:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:30:31] !log sbassett@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Set ext:StopForumSpam to enforce on es.wikiversity (duration: 06m 59s) [22:30:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:31:55] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T329260)', diff saved to https://phabricator.wikimedia.org/P45815 and previous config saved to /var/cache/conftool/dbconfig/20230313-223309-marostegui.json [22:33:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:33:15] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [22:33:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [22:33:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T329260)', diff saved to 
https://phabricator.wikimedia.org/P45816 and previous config saved to /var/cache/conftool/dbconfig/20230313-223331-marostegui.json [22:34:27] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:35:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:36:49] (03PS1) 10Zabe: noc: Switch default selection from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898037 [22:38:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:38:07] (03PS3) 10Dzahn: releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) [22:38:20] (03CR) 10CI reject: [V: 04-1] releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) (owner: 10Dzahn) [22:38:52] (03PS2) 10Zabe: noc: Switch default selection on db.php from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898037 [22:39:04] (03CR) 10Zabe: [C: 03+2] noc: Switch default selection on db.php from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898037 (owner: 10Zabe) [22:39:45] (03Merged) 10jenkins-bot: noc: Switch default selection on db.php from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898037 (owner: 10Zabe) [22:40:46] (03PS4) 10Dzahn: releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 
(https://phabricator.wikimedia.org/T327975) [22:40:50] !log zabe@deploy2002 Started scap: [[gerrit:898037 [22:40:50] !log zabe@deploy2002 scap failed: BrokenPipeError [Errno 32] Broken pipe (duration: 00m 00s) [22:41:10] !log zabe@deploy2002 Started scap: [[gerrit:898037|noc: Switch default selection on db.php from eqiad to codfw]] [22:41:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:42:18] (03PS5) 10Dzahn: releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) [22:42:47] (03PS6) 10Dzahn: releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) [22:42:57] (03CR) 10Dzahn: [C: 03+2] releases-jenkins: expect status code 401 or 403 [puppet] - 10https://gerrit.wikimedia.org/r/898028 (https://phabricator.wikimedia.org/T327975) (owner: 10Dzahn) [22:44:28] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:45:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T329260)', diff saved to https://phabricator.wikimedia.org/P45817 and previous config saved to /var/cache/conftool/dbconfig/20230313-224532-marostegui.json [22:45:48] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [22:48:07] !log zabe@deploy2002 Finished scap: [[gerrit:898037|noc: Switch default selection on db.php from eqiad to codfw]] (duration: 06m 56s) [22:51:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:54:30] (03PS1) 10Dzahn: releases-jenkins: don't force TLS when monitoring port 8080 [puppet] - 10https://gerrit.wikimedia.org/r/898038 (https://phabricator.wikimedia.org/T327975) [22:55:31] (03CR) 10Dzahn: [C: 03+2] releases-jenkins: don't force TLS when monitoring port 8080 [puppet] - 10https://gerrit.wikimedia.org/r/898038 (https://phabricator.wikimedia.org/T327975) (owner: 10Dzahn) [23:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P45818 and previous config saved to /var/cache/conftool/dbconfig/20230313-230038-marostegui.json [23:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:15:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P45819 and previous config saved to /var/cache/conftool/dbconfig/20230313-231544-marostegui.json [23:19:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:24:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST 
services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:26:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:51] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:29:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:30:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T329260)', diff saved to https://phabricator.wikimedia.org/P45820 and previous config saved to /var/cache/conftool/dbconfig/20230313-233050-marostegui.json [23:30:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [23:31:00] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [23:31:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [23:31:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:31:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:31:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T329260)', diff saved to 
https://phabricator.wikimedia.org/P45821 and previous config saved to /var/cache/conftool/dbconfig/20230313-233127-marostegui.json [23:33:09] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1003.eqiad.wmnet [23:39:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1003.eqiad.wmnet [23:43:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T329260)', diff saved to https://phabricator.wikimedia.org/P45822 and previous config saved to /var/cache/conftool/dbconfig/20230313-234301-marostegui.json [23:43:06] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [23:51:17] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200): /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask [23:55:05] RECOVERY - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [23:58:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45823 and previous config saved to /var/cache/conftool/dbconfig/20230313-235807-marostegui.json