[00:20:49] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:29] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:30:41] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:17] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:35] (03PS2) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [01:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:37:01] (03PS1) 10AndyRussG: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810985 (https://phabricator.wikimedia.org/T311944) [01:40:49] (03PS1) 10AndyRussG: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810986 (https://phabricator.wikimedia.org/T311944) [01:41:39] (03Abandoned) 10AndyRussG: Only add 
tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810986 (https://phabricator.wikimedia.org/T311944) (owner: 10AndyRussG) [01:42:51] (03PS1) 10AndyRussG: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810987 (https://phabricator.wikimedia.org/T311944) [01:43:24] (03Abandoned) 10AndyRussG: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810987 (https://phabricator.wikimedia.org/T311944) (owner: 10AndyRussG) [01:43:54] (03Abandoned) 10AndyRussG: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810985 (https://phabricator.wikimedia.org/T311944) (owner: 10AndyRussG) [01:44:43] (03PS1) 10AndyRussG: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810988 (https://phabricator.wikimedia.org/T311944) [02:07:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.19 [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/810989 [02:07:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:07:29] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.19 [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/810989 (owner: 10TrainBranchBot) [02:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:08:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:11:02] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:23] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.19 [core] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/810989 (owner: 10TrainBranchBot) [02:29:29] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:31:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:31:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:47] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:34:29] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:36:28] (03PS1) 10Andrew Bogott: OpenStack Heat: configure creds for managing the 'heat' domain [puppet] - 10https://gerrit.wikimedia.org/r/810990 [03:38:47] (03PS2) 10Andrew Bogott: OpenStack Heat: configure creds for managing the 'heat' domain [puppet] - 10https://gerrit.wikimedia.org/r/810990 [03:41:27] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Heat: configure creds for managing the 'heat' domain 
[puppet] - 10https://gerrit.wikimedia.org/r/810990 (owner: 10Andrew Bogott) [04:29:07] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:35:49] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:05:57] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:21:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s6 T311522 [05:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:31] T311522: Switchover s6 master db1131 -> db1173 - https://phabricator.wikimedia.org/T311522 [05:21:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T311522 [05:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1173 with weight 0 T311522', diff saved to https://phabricator.wikimedia.org/P30813 and previous config saved to /var/cache/conftool/dbconfig/20220705-052219-marostegui.json [05:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:42] (03PS3) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 
(https://phabricator.wikimedia.org/T304040) [05:38:51] (03PS2) 10Marostegui: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/810878 (https://phabricator.wikimedia.org/T311522) [05:39:58] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/810878 (https://phabricator.wikimedia.org/T311522) (owner: 10Marostegui) [05:48:31] (03PS14) 10Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) [05:51:52] (03CR) 10Tim Starling: [C: 03+2] Implement MediaWiki multi-DC traffic component [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [05:58:32] !log deploying multi-DC support g 801621, manual puppet run on cp1080 [05:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1130.eqiad.wmnet with reason: Maintenance [05:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1130.eqiad.wmnet with reason: Maintenance [05:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:38] Amir1: ^ is that going to use dbctl? [05:59:49] marostegui: nope [05:59:52] ok [05:59:54] great! 
[06:00:13] done already [06:00:21] great [06:00:24] let's go for the switchover [06:00:39] sure [06:00:47] Not sure why the window message hasn't shown up yet [06:00:50] But anyways [06:00:51] !log Starting s6 eqiad failover from db1131 to db1173 - T311522 [06:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:55] T311522: Switchover s6 master db1131 -> db1173 - https://phabricator.wikimedia.org/T311522 [06:01:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T311522', diff saved to https://phabricator.wikimedia.org/P30814 and previous config saved to /var/cache/conftool/dbconfig/20220705-060111-marostegui.json [06:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1173 to s6 primary and set section read-write T311522', diff saved to https://phabricator.wikimedia.org/P30815 and previous config saved to /var/cache/conftool/dbconfig/20220705-060139-root.json [06:01:41] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[06:01:42] All done [06:01:58] I can edit wikitech [06:02:00] marostegui: ^ lol [06:02:07] haha expected :) [06:02:29] I can see changes in frwiki [06:02:39] I can edit in wikitech as well [06:02:43] It is done then [06:03:00] we have only one schema change on the old master T298557 [06:03:01] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:04:22] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/810881 (https://phabricator.wikimedia.org/T311522) (owner: 10Marostegui) [06:05:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 T311522', diff saved to https://phabricator.wikimedia.org/P30816 and previous config saved to /var/cache/conftool/dbconfig/20220705-060526-root.json [06:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:40] Amir1: I think so [06:09:37] !log dbmaint s6@eqiad T298557 [06:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:41] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:22:07] 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10ayounsi) [06:25:10] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [06:31:23] 10SRE, 10Infrastructure-Foundations, 10Traffic: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10ayounsi) [06:34:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30817 and previous config saved to /var/cache/conftool/dbconfig/20220705-063402-root.json [06:34:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P30818 and previous config saved to /var/cache/conftool/dbconfig/20220705-063531-root.json [06:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:30] 10SRE, 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10ayounsi) [06:38:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30819 and previous config saved to /var/cache/conftool/dbconfig/20220705-063848-root.json [06:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:58] (03PS1) 10Marostegui: mariadb: Enable notifications db2156,db2157 [puppet] - 10https://gerrit.wikimedia.org/r/811196 (https://phabricator.wikimedia.org/T311493) [06:43:02] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notifications db2156,db2157 [puppet] - 10https://gerrit.wikimedia.org/r/811196 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [06:45:58] (03PS1) 10Marostegui: mariadb: Decommission db2073 [puppet] - 10https://gerrit.wikimedia.org/r/811197 (https://phabricator.wikimedia.org/T311837) [06:46:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2073.codfw.wmnet [06:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:19] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P30820 and previous config saved to 
/var/cache/conftool/dbconfig/20220705-065035-root.json [06:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:30] (03PS4) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) [06:53:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30821 and previous config saved to /var/cache/conftool/dbconfig/20220705-065352-root.json [06:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2073.codfw.wmnet [06:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2073 [puppet] - 10https://gerrit.wikimedia.org/r/811197 (https://phabricator.wikimedia.org/T311837) (owner: 10Marostegui) [07:00:04] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T0700). [07:00:05] kostajh, matthiasmullie, koi, and AndyRussG: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:19] o/ [07:00:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Decommission db2073 T311837', diff saved to https://phabricator.wikimedia.org/P30822 and previous config saved to /var/cache/conftool/dbconfig/20220705-070019-marostegui.json [07:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:24] T311837: decommission db2073 - https://phabricator.wikimedia.org/T311837 [07:00:26] I can deploy today [07:00:27] o/ [07:00:34] here, but not back to my desk until 20 minutes from now [07:01:22] heyyy thx urbanecm :) [07:01:22] (03CR) 10Urbanecm: [C: 03+2] Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810988 (https://phabricator.wikimedia.org/T311944) (owner: 10AndyRussG) [07:01:48] kostajh: ack [07:01:55] (03CR) 10Urbanecm: [C: 03+2] SuggestedEdits: Adjust thumbnailSource logic [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810142 (https://phabricator.wikimedia.org/T311789) (owner: 10Kosta Harlan) [07:02:02] (03CR) 10Urbanecm: [C: 03+2] Retrieve pages-with-suggestion via Elastic scroll directly [extensions/ImageSuggestions] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810889 (https://phabricator.wikimedia.org/T311476) (owner: 10Matthias Mullie) [07:02:10] (03CR) 10Urbanecm: [C: 03+2] SuggestedEdits: Adjust thumbnailSource logic [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810142 (https://phabricator.wikimedia.org/T311789) (owner: 10Kosta Harlan) [07:02:14] 10ops-codfw, 10decommission-hardware: decommission db2073 - https://phabricator.wikimedia.org/T311837 (10Marostegui) a:03Papaul [07:02:29] o/ [07:02:40] Hi koi :) [07:03:21] (03PS2) 10Urbanecm: zh(wikiversity|wiktionary): Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810916 (https://phabricator.wikimedia.org/T312012) (owner: 10Stang) [07:03:25] (03CR) 10Urbanecm: [C: 03+2] 
zh(wikiversity|wiktionary): Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810916 (https://phabricator.wikimedia.org/T312012) (owner: 10Stang) [07:03:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 8 hosts with reason: codfw s3 sanitarium master switch [07:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: codfw s3 sanitarium master switch [07:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:18] (03Merged) 10jenkins-bot: zh(wikiversity|wiktionary): Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810916 (https://phabricator.wikimedia.org/T312012) (owner: 10Stang) [07:04:25] (03Merged) 10jenkins-bot: Only add tabs to special pages [extensions/CentralNotice] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810988 (https://phabricator.wikimedia.org/T311944) (owner: 10AndyRussG) [07:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P30823 and previous config saved to /var/cache/conftool/dbconfig/20220705-070539-root.json [07:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:11] koi: the disable local upload patch is at mwdebug1001, can you check? [07:08:23] AndyRussG: your CN patch is also at mwdebug1001, can you check? 
[07:08:45] urbanecm: checked and LGTM [07:08:45] urbanecm: oki :) [07:08:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30824 and previous config saved to /var/cache/conftool/dbconfig/20220705-070856-root.json [07:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:08] koi: ok, syncing [07:09:33] (03Merged) 10jenkins-bot: Retrieve pages-with-suggestion via Elastic scroll directly [extensions/ImageSuggestions] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810889 (https://phabricator.wikimedia.org/T311476) (owner: 10Matthias Mullie) [07:09:41] urbanecm: FYI mine needs no testing on mwdebug host (it's just a maint script fix, nothing to test) [07:09:48] matthiasmullie: ack [07:10:01] I'll just sync in that case [07:10:10] perfect, thanks [07:10:50] urbanecm: looks great, thx so much! [07:10:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:00] AndyRussG: great, syncing! 
[07:11:11] :) [07:11:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:11:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:25] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 14df0e25aabf21715b281a9dbb5893ae2ae7db9a: zh(wikiversity|wiktionary): Disable local upload (T312012) (duration: 03m 47s) [07:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:29] T312012: Disable local upload for Chinese (Wikiversity|Wiktionary) - https://phabricator.wikimedia.org/T312012 [07:14:09] 10ops-codfw, 10decommission-hardware: decommission db2073 - https://phabricator.wikimedia.org/T311837 (10Marostegui) @Papaul this is ready for you [07:16:50] (03PS1) 10Marostegui: mariadb: db2074 is no longer sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/811200 (https://phabricator.wikimedia.org/T311475) [07:17:02] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/CentralNotice/includes/specials/CentralNotice.php: 414b7b8a14b451f9bd0fb0c36d44fe6a9310102e: Only add tabs to special pages (T311944) (duration: 03m 30s) [07:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:06] T311944: Meta talkpages for Central Notice banners turned into banner overview pages from the admin interface - https://phabricator.wikimedia.org/T311944 [07:17:16] AndyRussG: koi: and both patches are live now [07:17:37] (03CR) 10Marostegui: [C: 03+2] mariadb: db2074 is no longer sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/811200 
(https://phabricator.wikimedia.org/T311475) (owner: 10Marostegui) [07:17:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:52] urbanecm: yeee works great, thx so much once again! [07:19:18] np [07:20:24] (03CR) 10Urbanecm: [C: 04-1] "the svg needs optimization with svgo (https://www.mediawiki.org/wiki/Manual:Assets#SVG_files)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810550 (https://phabricator.wikimedia.org/T311946) (owner: 10Stang) [07:20:32] koi: ^ can you look at this please? [07:20:40] looking [07:20:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P30826 and previous config saved to /var/cache/conftool/dbconfig/20220705-072043-root.json [07:20:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:38] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/ImageSuggestions/maintenance/SendNotificationsForUnillustratedWatchedTitles.php: d5050b773992aa6100aa14cd328836ff336ef8c1: Retrieve pages-with-suggestion via Elastic scroll directly (T311476) (duration: 03m 32s) [07:21:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:42] T311476: Unable to get list of more than 10k pages with recommendations - https://phabricator.wikimedia.org/T311476 [07:21:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:51] (03Merged) 10jenkins-bot: SuggestedEdits: Adjust thumbnailSource logic [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/810142 (https://phabricator.wikimedia.org/T311789) (owner: 10Kosta Harlan) [07:21:55] matthiasmullie: should be live. anything else? [07:22:09] (03PS1) 10Marostegui: mariadb: Productionize db2158 [puppet] - 10https://gerrit.wikimedia.org/r/811201 (https://phabricator.wikimedia.org/T311493) [07:22:15] urbanecm: that's all for me, thanks! [07:22:19] no problem [07:23:06] kostajh: for whenever you're ready, your patch is at mwdebug1001 [07:23:28] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging." [puppet] - 10https://gerrit.wikimedia.org/r/810957 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:23:35] 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Jelto) `gitlab1001` and `gitlab2001` will be decommissioned soon in T307142. So regarding GitLab this should be resolved... [07:24:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30827 and previous config saved to /var/cache/conftool/dbconfig/20220705-072400-root.json [07:24:02] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:24] urbanecm: thanks, having a look [07:26:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:27:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:15] urbanecm: I can no longer reproduce the issue, but I still think it's worth backporting. what do you think? [07:28:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:28:34] kostajh: no issues with doing so. so, let's sync? [07:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:52] urbanecm: oh wait [07:28:54] I can reproduce [07:28:58] okay [07:28:59] so let me check if the patch fixes it [07:28:59] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging." [puppet] - 10https://gerrit.wikimedia.org/r/810958 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:28:59] waiting [07:29:38] urbanecm: yeah go for it [07:29:46] okay, syncing [07:32:04] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/810959 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:33:40] (03CR) 10Slavina Stefanova: wmcs.openstack: move libs to it's own module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro) [07:33:43] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: ce64780fbd78a414c6ab08fc374186ae4dd58bac: SuggestedEdits: Adjust thumbnailSource logic (T311789) (duration: 03m 32s) [07:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:48] T311789: [wmf.18 - regression] Add image article card displays thumbnail image - https://phabricator.wikimedia.org/T311789 [07:33:50] kostajh: and, should be live [07:33:52] anything else? [07:34:34] urbanecm: not for me, no. thank you! [07:34:40] np [07:35:05] koi: hi, how's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810550 going? should i wait a couple more minutes? 
[07:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30828 and previous config saved to /var/cache/conftool/dbconfig/20220705-073546-root.json [07:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:57] (03PS2) 10Stang: trwiki: Change old and new vector logos for 500k articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810550 (https://phabricator.wikimedia.org/T311946) [07:36:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce the ClusterConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [07:36:15] done, installing npm [07:36:17] (03CR) 10CI reject: [V: 04-1] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [07:36:19] (03CR) 10Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [07:37:33] (03CR) 10Urbanecm: [C: 03+2] trwiki: Change old and new vector logos for 500k articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810550 (https://phabricator.wikimedia.org/T311946) (owner: 10Stang) [07:37:52] koi: okay, makes sense. let's go for it now :). i'll let you know when it's testable [07:37:55] <_joe_> urbanecm: sorry for the fat-fingering on the +2, I wanted to rebase that patch, not deploy it now [07:38:24] urbanecm: oh, maybe we should do the mentor dashboard patch for beta? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805490 [07:38:25] (03Merged) 10jenkins-bot: trwiki: Change old and new vector logos for 500k articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810550 (https://phabricator.wikimedia.org/T311946) (owner: 10Stang) [07:38:28] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/810960 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [07:38:28] <_joe_> anyways, lmk when you're done with the backport window, I want to deploy those changes soon [07:38:37] (03PS5) 10Kosta Harlan: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [07:38:37] _joe_: no problem. i actually overlooked the C+2 you hit. I'll let you know when done, sure [07:38:47] (03PS6) 10Kosta Harlan: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [07:39:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30829 and previous config saved to /var/cache/conftool/dbconfig/20220705-073904-root.json [07:39:07] kostajh: good point. i'll do it too! 
[07:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:15] (03CR) 10Urbanecm: [C: 03+2] MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [07:39:18] urbanecm: and one more (beta only) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808207 [07:39:21] I can add those to the calendar [07:39:26] please do [07:40:07] (03CR) 10Urbanecm: [C: 04-1] [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [07:40:10] (03Merged) 10jenkins-bot: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [07:40:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2158 [puppet] - 10https://gerrit.wikimedia.org/r/811201 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:40:20] kostajh: i know i +1'ed it before, but i missed this little detail. can you fix it please? 
[07:40:30] urbanecm: looking [07:40:47] koi: your patch is at mwdebug1001, please check [07:41:14] (03PS4) 10Kosta Harlan: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) [07:41:27] (03CR) 10Kosta Harlan: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [07:41:34] (03PS8) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [07:41:48] (03PS1) 10Marostegui: site.pp: Remove insetup role from db2158 [puppet] - 10https://gerrit.wikimedia.org/r/811203 (https://phabricator.wikimedia.org/T311493) [07:41:50] (03PS5) 10Urbanecm: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [07:42:08] urbanecm: done. Is there a page that documents 'wg', 'wmg', '-wg' etc? 
[07:42:24] urbanecm: LGTM [07:42:25] (03PS9) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) [07:42:30] koi: thanks, syncing [07:42:52] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db2158 [puppet] - 10https://gerrit.wikimedia.org/r/811203 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:43:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:44:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:06] kostajh: not aware of any docs specifically on that. there's https://wikitech.wikimedia.org/wiki/Configuration_files and https://wikitech.wikimedia.org/wiki/Wikimedia_site_requests, but neither seems like what you look for. wg stands for "wiki global", "wmg" should be "wikimedia global", and the - prefix (valid for both wg and wmg) means "ignore IS.php content, fully replace the variable". 
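The "-" prefix described in the message above can be illustrated with a small sketch. This is a Python stand-in, not MediaWiki's actual PHP configuration code. Only the "-" semantics ("ignore IS.php content, fully replace the variable") come from the conversation; the per-wiki dictionary layout, the merge behaviour for unprefixed keys, the function name apply_overrides and the setting name wmgExampleFlag are assumptions made for illustration.

```python
def apply_overrides(production, overrides):
    """Return a new settings dict with beta-style overrides applied.

    production: {setting name: {wiki name: value}}   (assumed layout)
    overrides:  same layout; a leading "-" on the setting name means
                "ignore the existing content, fully replace the variable".
    """
    # Copy one level deep so the caller's production dict is not mutated.
    result = {name: dict(per_wiki) for name, per_wiki in production.items()}
    for key, per_wiki in overrides.items():
        if key.startswith("-"):
            # "-" prefix: drop whatever production defined for this setting.
            result[key[1:]] = dict(per_wiki)
        else:
            # No prefix (assumed behaviour): layer the override on top of
            # the existing per-wiki values, keeping the ones not mentioned.
            result.setdefault(key, {}).update(per_wiki)
    return result


# Hypothetical setting name and values, for illustration only.
production = {"wmgExampleFlag": {"default": False, "trwiki": True}}

replaced = apply_overrides(production, {"-wmgExampleFlag": {"default": True}})
merged = apply_overrides(production, {"wmgExampleFlag": {"default": True}})

print(replaced)
print(merged)
```

With the "-" prefix the trwiki entry from the production side disappears, because the whole variable is replaced. Without it, under the merge assumption stated above, the trwiki entry survives alongside the new default.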
[07:46:25] urbanecm: ty [07:46:31] np [07:46:53] !log urbanecm@deploy1002 Synchronized static/: c8c092a4133d119bf9aaece6f934ca7744ea6951: trwiki: Change old and new vector logos for 500k articles (T311946; 1/3) (duration: 03m 17s) [07:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:57] T311946: Change the logo of Turkish Wikipedia to celebrate 500,000 articles - https://phabricator.wikimedia.org/T311946 [07:47:15] (03CR) 10Urbanecm: [C: 03+2] [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [07:47:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:01] (03Merged) 10jenkins-bot: [betalabs] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808207 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan) [07:50:29] !log urbanecm@deploy1002 Synchronized wmf-config/: c8c092a4133d119bf9aaece6f934ca7744ea6951: trwiki: Change old and new vector logos for 500k articles (T311946; 2/3) (duration: 03m 36s) [07:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:35] (03PS1) 10Muehlenhoff: Remove references [puppet] - 10https://gerrit.wikimedia.org/r/811206 [07:50:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30830 and previous config saved to /var/cache/conftool/dbconfig/20220705-075050-root.json [07:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:53:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:53:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:04] !log urbanecm@deploy1002 Synchronized logos/config.yaml: c8c092a4133d119bf9aaece6f934ca7744ea6951: trwiki: Change old and new vector logos for 500k articles (T311946; 3/3) (duration: 03m 34s) [07:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:08] T311946: Change the logo of Turkish Wikipedia to celebrate 500,000 articles - https://phabricator.wikimedia.org/T311946 [07:54:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30831 and previous config saved to /var/cache/conftool/dbconfig/20220705-075408-root.json [07:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:13] koi: should be all live! [07:55:10] kostajh: the two beta patches should be live at beta any second. [07:55:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:55] (03PS1) 10Filippo Giunchedi: netops: add DNS probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) [07:56:37] ack [07:56:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10Volans) 05Resolved→03Open As part of a recent refactor of the [[ https://netbox.wikimedia.org/extras/reports/network.Network/ |... 
[07:57:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove references [puppet] - 10https://gerrit.wikimedia.org/r/811206 (owner: 10Muehlenhoff) [07:57:24] (03PS2) 10Muehlenhoff: Remove references [puppet] - 10https://gerrit.wikimedia.org/r/811206 [07:58:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 89aef540e22aaded6c279d9d11c769507e497b6a: MentorDashboard: enable the Vue version of the dashboard in beta (T300532) (duration: 03m 18s) [07:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:17] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532 [07:58:30] _joe_: i'm done. floor is yours! [07:58:47] <_joe_> urbanecm: thanks, but I found stuff to update/modify now :/ [07:59:06] :) [07:59:34] urbanecm: just checking, you synced the IS.php file for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805490 ? [08:00:05] jnuche and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T0800). [08:00:17] kostajh: yeah. why? [08:00:32] urbanecm: just double-checking. thanks! [08:00:40] okay. no problem :) [08:02:57] urbanecm: hi, I would like to start the train, are you done with the backports? :) [08:03:06] jnuche: go ahead! [08:03:10] (hi) [08:03:11] thx! 
[08:04:11] PROBLEM - Check systemd state on mw2378 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:32] (03PS9) 10Giuseppe Lavagetto: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 [08:04:34] (03PS2) 10Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756016 [08:05:28] (03CR) 10CI reject: [V: 04-1] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [08:05:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30832 and previous config saved to /var/cache/conftool/dbconfig/20220705-080554-root.json [08:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30833 and previous config saved to /var/cache/conftool/dbconfig/20220705-080911-root.json [08:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:17] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:11:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:39] (03CR) 10Slavina Stefanova: "not reviewing, just asking: how are you running black and isort? Where are they configured?" [puppet] - 10https://gerrit.wikimedia.org/r/810949 (owner: 10David Caro) [08:16:37] (03CR) 10David Caro: wmcs.openstack: move libs to it's own module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro) [08:17:27] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=DELETE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:18:38] (03PS2) 10David Caro: novafullstack: black and isort [puppet] - 10https://gerrit.wikimedia.org/r/810949 [08:18:58] (03CR) 10David Caro: novafullstack: black and isort (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810949 (owner: 10David Caro) [08:20:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30834 and previous config saved to /var/cache/conftool/dbconfig/20220705-082058-root.json [08:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:15] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810032 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [08:22:19] 10SRE: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10Legoktm) p:05Triage→03High [08:22:54] (03PS1) 10Volans: sre.ganeti.*: remove row_ prefix from groups [cookbooks] - 10https://gerrit.wikimedia.org/r/811210 [08:24:12] (03CR) 10Ayounsi: [C: 04-1] "Great idea! One comment inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [08:24:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30835 and previous config saved to /var/cache/conftool/dbconfig/20220705-082415-root.json [08:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) @nskaggs I updated the instructions on the wikitech page [[https://wikitech.wikimedia.org/wiki/Network_desig... [08:27:04] (03Abandoned) 10Ayounsi: Netbox Ganeti sync: add groups support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802046 (https://phabricator.wikimedia.org/T262446) (owner: 10Ayounsi) [08:30:41] !log uploaded 7.4.30-3+0~20220627.69+debian10~1.gbpf2b381+wmf1+buster3 to component/php74 (pulling php-common with the socket helper) T311386 [08:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:45] T311386: Install php 7.4 in production - https://phabricator.wikimedia.org/T311386 [08:31:16] (03CR) 10Ayounsi: [C: 03+2] Delete requirements.txt [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806908 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [08:31:45] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) Filippo applied the following rule and everything now works: ` swift post wmf-ml-models+segments -r 'mlserve:ro' ` I had pre... 
[08:34:14] (03CR) 10Ayounsi: [C: 03+2] Netbox stats, set scrape interval to 2m (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806422 (owner: 10Ayounsi) [08:34:33] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) Tried to upload a model with the new read only account and I got access denied (good): ` elukey@stat1004:~$ sudo s3cmd -c tes... [08:35:46] (03PS23) 10Ayounsi: sre.network.configure-switch-interfaces: new [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [08:41:39] (03CR) 10Ayounsi: [C: 03+2] sre.network.configure-switch-interfaces: new [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [08:43:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:43:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:48] (03Abandoned) 10Ayounsi: Anycast: introduce new "deterministic" variable [puppet] - 10https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: 10Ayounsi) [08:45:56] (03Merged) 10jenkins-bot: sre.network.configure-switch-interfaces: new [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [08:47:31] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) 05Open→03Resolved a:03elukey All working! Added the new account to the ML staging cluster, and it worked nicely. We'll m... 
[08:48:15] (03PS2) 10Volans: sre.ganeti.*: remove row_ prefix from groups [cookbooks] - 10https://gerrit.wikimedia.org/r/811210 [08:48:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:12] (03CR) 10Volans: [C: 03+2] "Rename done on ganeti and netbox" [cookbooks] - 10https://gerrit.wikimedia.org/r/811210 (owner: 10Volans) [08:50:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/811210 (owner: 10Volans) [08:52:51] (03Merged) 10jenkins-bot: sre.ganeti.*: remove row_ prefix from groups [cookbooks] - 10https://gerrit.wikimedia.org/r/811210 (owner: 10Volans) [08:52:59] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces [08:53:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) [08:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:04] (03PS2) 10Zabe: utils: chmod +x setup_rake.sh and vcl_ec2_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/810973 [08:56:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10netops: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (10ayounsi) Step 1 is done, `sre.network.configure-switch-interfaces` cookbook is ready for prime time. 
[08:58:43] 10SRE, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10TheresNoTime) [08:58:48] (03PS1) 10Jaime Nuche: testwikis wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811214 (https://phabricator.wikimedia.org/T308072) [08:58:50] (03CR) 10Jaime Nuche: [C: 03+2] testwikis wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811214 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [08:59:24] PROBLEM - Host gitlab1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:00:39] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:41] RECOVERY - Host gitlab1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:01:34] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811214 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [09:02:29] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.19 refs T308072 [09:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:34] T308072: 1.39.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T308072 [09:03:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) 05In progress→03Resolved Closing this task. 
Setup in general needs to be considered under T297355 [09:04:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:04:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (10Volans) [09:07:21] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:10:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:11:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:48] (03CR) 10Jelto: "I'm going to revert this because it's blocking puppet runs on all GitLab hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn) [09:13:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) >>! In T304888#8038344, @Cmjohnson wrote: > @cmooney are you requesting cloudnets to be moved to a different switch or this an open disc... [09:17:05] (03PS1) 10Jelto: Revert "gitlab: add prometheus blackbox http monitor" [puppet] - 10https://gerrit.wikimedia.org/r/810899 [09:19:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:19:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10Volans) a:05elukey→03MoritzMuehlenhoff [09:19:41] (03CR) 10CI reject: [V: 04-1] Revert "gitlab: add prometheus blackbox http monitor" [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [09:20:16] (03PS1) 10Ayounsi: configure-switch-interfaces: log which hosts is being worked on [cookbooks] - 10https://gerrit.wikimedia.org/r/811219 [09:20:36] (03PS3) 10Giuseppe Lavagetto: mediawiki: allow forcing the backend for blank page on wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/810312 (https://phabricator.wikimedia.org/T311386) [09:20:38] (03PS2) 10Giuseppe Lavagetto: jobrunner: allow selecting explicitly the backend when performing health checks. 
[puppet] - 10https://gerrit.wikimedia.org/r/810348 (https://phabricator.wikimedia.org/T311386) [09:20:40] (03PS3) 10Giuseppe Lavagetto: lvs: check php 7.4 too on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/810313 (https://phabricator.wikimedia.org/T311386) [09:20:42] (03PS1) 10Giuseppe Lavagetto: mediawiki/php: forcibly upgrade php-common if installing php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/811220 [09:20:44] (03PS2) 10Jelto: Revert "gitlab: add prometheus blackbox http monitor" [puppet] - 10https://gerrit.wikimedia.org/r/810899 [09:21:12] (03PS2) 10Giuseppe Lavagetto: mediawiki/php: forcibly upgrade php-common if installing php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/811220 [09:23:47] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/811219 (owner: 10Ayounsi) [09:24:54] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36185/console" [puppet] - 10https://gerrit.wikimedia.org/r/811220 (owner: 10Giuseppe Lavagetto) [09:24:57] (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [09:25:07] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/801389 (owner: 10Volans) [09:25:23] (03CR) 10Ayounsi: [C: 03+2] configure-switch-interfaces: log which hosts is being worked on [cookbooks] - 10https://gerrit.wikimedia.org/r/811219 (owner: 10Ayounsi) [09:25:44] <_joe_> is CI stuck? 
[09:25:45] (03CR) 10CI reject: [V: 04-1] Revert "gitlab: add prometheus blackbox http monitor" [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [09:26:24] (03CR) 10Muehlenhoff: [C: 03+1] "One nit inline, looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/811220 (owner: 10Giuseppe Lavagetto) [09:26:39] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki/php: forcibly upgrade php-common if installing php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/811220 (owner: 10Giuseppe Lavagetto) [09:27:27] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki/php: forcibly upgrade php-common if installing php7.4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811220 (owner: 10Giuseppe Lavagetto) [09:28:05] jelto: FYI I don't think you need to revert, you can roll-forward, afaics body_regex_matches wants a list of strings [09:28:12] (03PS3) 10Jelto: Revert "gitlab: add prometheus blackbox http monitor" [puppet] - 10https://gerrit.wikimedia.org/r/810899 [09:28:25] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging." [puppet] - 10https://gerrit.wikimedia.org/r/810032 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [09:32:13] godog: I added this information to the original change, thanks! And yes I assume it's a small fix but I want dzahn/mutante to troubleshoot this first. Also because there is some confusion on my side regarding the matched string. I'd like to have puppet running again because it's blocking updates in GitLab [09:33:42] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest1002 [09:33:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1002 [09:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:57] volans: ^ yay [09:34:10] yay!
[09:35:31] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:35:42] jelto: fair enough! thanks for the context [09:36:50] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.19 refs T308072 (duration: 34m 21s) [09:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:55] T308072: 1.39.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T308072 [09:37:04] (03PS1) 10Volans: Release v0.5.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/811224 [09:37:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:19] (03CR) 10Jelto: [C: 03+2] Revert "gitlab: add prometheus blackbox http monitor" [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [09:44:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:18] (03PS1) 10Muehlenhoff: tilerator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811225 (https://phabricator.wikimedia.org/T308013) [09:47:20] (03PS1) 10Muehlenhoff: sysfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811226 (https://phabricator.wikimedia.org/T308013) [09:47:22] (03PS1) 10Muehlenhoff: nutcracker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811227 (https://phabricator.wikimedia.org/T308013) [09:47:24] (03PS1) 10Muehlenhoff: ircecho: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/811228 (https://phabricator.wikimedia.org/T308013) [09:47:26] (03PS1) 10Muehlenhoff: smart: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811229 (https://phabricator.wikimedia.org/T308013) [09:47:28] (03PS1) 10Muehlenhoff: network: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811230 (https://phabricator.wikimedia.org/T308013) [09:47:32] (03PS1) 10Muehlenhoff: tcpircbot: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811231 (https://phabricator.wikimedia.org/T308013) [09:47:34] (03PS1) 10Muehlenhoff: osm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811232 (https://phabricator.wikimedia.org/T308013) [09:47:36] (03PS1) 10Muehlenhoff: swift: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811233 (https://phabricator.wikimedia.org/T308013) [09:49:05] (03PS1) 10Jaime Nuche: group0 wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811234 (https://phabricator.wikimedia.org/T308072) [09:49:09] (03CR) 10Jaime Nuche: [C: 03+2] group0 wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811234 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [09:50:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:01] (03CR) 10Ayounsi: [C: 03+1] "We can also delete frozen-requirements-buster.txt as I don't think it runs on any buster host anymore." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/811224 (owner: 10Volans) [09:55:56] (03PS2) 10Volans: Release v0.5.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/811224 [09:56:02] (03CR) 10Volans: "addressed comment" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/811224 (owner: 10Volans) [09:56:45] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811234 (https://phabricator.wikimedia.org/T308072) (owner: 10Jaime Nuche) [09:58:21] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:56] (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.5.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/811224 (owner: 10Volans) [10:00:15] (03PS15) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [10:00:56] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.19 refs T308072 [10:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:00] T308072: 1.39.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T308072 [10:01:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:25] (03PS1) 10Giuseppe Lavagetto: mediawiki/php: rationalize and order php-fpm installation [puppet] - 10https://gerrit.wikimedia.org/r/811236 [10:01:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:01:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:02:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36186/console" [puppet] - 10https://gerrit.wikimedia.org/r/811236 (owner: 10Giuseppe Lavagetto) [10:04:17] !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.5.0 - volans@cumin1001 [10:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:55] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.5.0 - volans@cumin1001 [10:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:51] (03PS2) 10Muehlenhoff: ircecho: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811228 (https://phabricator.wikimedia.org/T308013) [10:18:55] (03CR) 10Jcrespo: [C: 04-1] "I would like to do it slightly differently, will send an edit." [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [10:19:01] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:19:46] (03CR) 10Muehlenhoff: bacula::storage: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [10:20:04] (03PS2) 10Giuseppe Lavagetto: mediawiki/php: rationalize and order php-fpm installation [puppet] - 10https://gerrit.wikimedia.org/r/811236 [10:21:39] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36187/console" [puppet] - 10https://gerrit.wikimedia.org/r/811236 (owner: 10Giuseppe Lavagetto) [10:24:25] (03PS6) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [10:24:29] (03PS1) 10Filippo Giunchedi: prometheus: deploy custom probedown alerts [puppet] - 10https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) [10:24:33] (03PS1) 10Filippo Giunchedi: prometheus: deploy alerts as yml not yaml [puppet] - 10https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) [10:25:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki/php: rationalize and order php-fpm installation [puppet] - 10https://gerrit.wikimedia.org/r/811236 (owner: 10Giuseppe Lavagetto) [10:29:02] !log sudo gnt-cluster upgrade --to 3.0 for ganeti/codfw T311686 [10:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:07] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:30:21] <_joe_> !log running benchmarks in codfw for php7.2/7.4 comparison. 
[10:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:01] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:36:15] (03CR) 10Klausman: [C: 03+1] Upgrade kserve images to upstream release 0.8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/810841 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey) [10:40:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:40:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:19] PROBLEM - puppet last run on mw1448 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:40:33] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:40:41] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/811243 [10:40:41] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:21] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:35] PROBLEM - puppet last run on mw1450 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[10:42:49] PROBLEM - puppet last run on mw1447 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:55] PROBLEM - puppet last run on mw1418 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:55] PROBLEM - puppet last run on mw1417 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:57] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:43:03] PROBLEM - puppet last run on mw1415 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:43:29] PROBLEM - puppet last run on mw1449 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:43:47] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:43:49] (03PS3) 10Volans: devices: override default timeout for mgmt routers [homer/public] - 10https://gerrit.wikimedia.org/r/799381 [10:43:53] PROBLEM - puppet last run on mw2376 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:44:23] PROBLEM - puppet last run on parse2001 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:45:49] PROBLEM - puppet last run on parse2002 is CRITICAL: CRITICAL: Puppet last ran 20 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:46:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:59] RECOVERY - puppet last run on wtp1025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:47:17] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/810308 (owner: 10Muehlenhoff) [10:48:25] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:48:43] (03CR) 10Volans: [C: 03+2] devices: override default timeout for mgmt routers [homer/public] - 10https://gerrit.wikimedia.org/r/799381 (owner: 10Volans) [10:48:45] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:49:29] RECOVERY - puppet last run on mw1415 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:49:44] (03Merged) 10jenkins-bot: devices: override default timeout for mgmt routers [homer/public] - 10https://gerrit.wikimedia.org/r/799381 (owner: 10Volans) [10:52:15] RECOVERY - puppet last run on parse2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:55:00] 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10Ladsgroup) Regarding mailman, yes, Kunal and I didn't touch those settings [1] (I couldn't as I didn't have access to net... 
[10:55:25] RECOVERY - puppet last run on mw1450 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:55:39] RECOVERY - puppet last run on mw1447 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:55:47] RECOVERY - puppet last run on mw1417 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:56:25] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:00:45] RECOVERY - puppet last run on mw1418 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:01:19] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[11:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:53] RECOVERY - puppet last run on mw2376 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:12] (03PS1) 10Marostegui: instances.yaml: Add db2158 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/811247 (https://phabricator.wikimedia.org/T311493) [11:02:13] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:21] RECOVERY - puppet last run on parse2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:02:54] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2158 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/811247 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:04:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2158 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P30848 and previous config saved to /var/cache/conftool/dbconfig/20220705-110432-marostegui.json [11:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:37] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [11:05:13] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:06:03] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:06:16] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10DBA, 10GlobalBlocking, 
10Wikimedia-Incident: 2022-05-05 Wikimedia full site outage - https://phabricator.wikimedia.org/T307647 (10Ladsgroup) [11:06:43] RECOVERY - puppet last run on mw1449 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:08:55] (03PS1) 10Marostegui: mariadb: Productionize db2159 [puppet] - 10https://gerrit.wikimedia.org/r/811250 (https://phabricator.wikimedia.org/T311493) [11:09:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2159 [puppet] - 10https://gerrit.wikimedia.org/r/811250 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:14:53] (03PS1) 10Marostegui: site.pp: Remove insetup role from db2159 [puppet] - 10https://gerrit.wikimedia.org/r/811255 (https://phabricator.wikimedia.org/T311493) [11:15:51] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup role from db2159 [puppet] - 10https://gerrit.wikimedia.org/r/811255 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [11:16:21] (03CR) 10Hnowlan: Upgrade thumbor to Thumbor 7 and python3 (039 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [11:19:20] (03Abandoned) 10Hnowlan: Port Dockerfile to use buster [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/806333 (owner: 10Hnowlan) [11:21:11] (03PS2) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) [11:21:17] RECOVERY - puppet last run on mw1448 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:30:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MarcoAurelio) [11:31:26] (03PS1) 10Vlad.shapik: WIP: Adjust the online tests to new 
changes in the thumbor functionality [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/811257 [11:33:31] (03PS3) 10Muehlenhoff: Remove obsolete profile::base::linux419 [puppet] - 10https://gerrit.wikimedia.org/r/810308 [11:37:22] GitLab needs a short maintenance break of around 10 minutes at 12:30 UTC (in one hour) [11:37:49] (03PS2) 10JMeybohm: Alert on helm releases in bad state [alerts] - 10https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) [11:42:16] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete profile::base::linux419 [puppet] - 10https://gerrit.wikimedia.org/r/810308 (owner: 10Muehlenhoff) [11:43:47] (03CR) 10Muehlenhoff: [C: 03+2] profile::mariadb::packages_wmf: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810846 (owner: 10Muehlenhoff) [11:46:07] (03CR) 10Muehlenhoff: [C: 03+2] profile::mariadb::packages_client: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810847 (owner: 10Muehlenhoff) [11:47:01] (03PS3) 10Ssingh: bird: add validate_cmd for bird.conf [puppet] - 10https://gerrit.wikimedia.org/r/809968 [11:47:08] (03CR) 10Ssingh: bird: add validate_cmd for bird.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [11:48:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36188/console" [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [11:48:56] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [11:48:58] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 
(protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:49:21] (03CR) 10Muehlenhoff: [C: 03+2] uwsgi: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810321 (owner: 10Muehlenhoff) [11:56:14] (03CR) 10Slavina Stefanova: wmcs.openstack: move libs to it's own module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro) [11:57:08] (03CR) 10Slavina Stefanova: [C: 03+1] novafullstack: black and isort [puppet] - 10https://gerrit.wikimedia.org/r/810949 (owner: 10David Caro) [11:58:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [11:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:52] (03PS1) 10Marostegui: install_server: Do not reimage db2153-2159 [puppet] - 10https://gerrit.wikimedia.org/r/811289 (https://phabricator.wikimedia.org/T311493) [12:00:56] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2153-2159 [puppet] - 10https://gerrit.wikimedia.org/r/811289 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [12:03:40] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [12:07:44] (03PS1) 10Volans: profile::homer::diff_timer_interval: change time [puppet] - 10https://gerrit.wikimedia.org/r/811291 [12:07:46] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:10:52] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [12:11:56] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: add validate_cmd for bird.conf [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [12:14:41] (03CR) 10Jaime Nuche: scap: Make scap3 provider packages depend on /usr/bin/scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [12:16:06] (03PS1) 10Muehlenhoff: Add MarcoAurelio to contributors [puppet] - 10https://gerrit.wikimedia.org/r/811292 (https://phabricator.wikimedia.org/T308013) [12:17:13] (03CR) 10Muehlenhoff: [C: 03+2] Add MarcoAurelio to contributors [puppet] - 10https://gerrit.wikimedia.org/r/811292 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:19:34] (03PS2) 10Filippo Giunchedi: prometheus: deploy custom probedown alerts [puppet] - 10https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) [12:19:36] (03PS2) 10Filippo Giunchedi: prometheus: deploy alerts as yml not yaml [puppet] - 10https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) [12:19:38] (03PS7) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [12:19:40] (03PS1) 10Filippo Giunchedi: prometheus: move blackbox http check to prometheus::rule [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) [12:19:42] (03PS1) 10Filippo Giunchedi: prometheus: introduce blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811295 (https://phabricator.wikimedia.org/T305847) [12:19:44] (03PS1) 10Filippo Giunchedi: prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 
(https://phabricator.wikimedia.org/T305847) [12:20:42] 10SRE, 10Icinga, 10Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10Volans) p:05Triage→03High FYI this is still ongoing. We got 9 emails in the last 24 hours. [12:22:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/811233 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:22:48] (03CR) 10CI reject: [V: 04-1] prometheus: introduce blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811295 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:22:57] (03CR) 10Ayounsi: [C: 03+1] "LGTM! At some point we should add the row (location) or rack data to the virtual hosts as well, so that kind of data is more interoperable" [puppet] - 10https://gerrit.wikimedia.org/r/810956 (owner: 10Volans) [12:23:40] (03CR) 10CI reject: [V: 04-1] WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 (owner: 10Filippo Giunchedi) [12:24:04] (03CR) 10Slavina Stefanova: [C: 03+1] "lgtm, nice little refactoring!" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810915 (owner: 10David Caro) [12:24:30] (03PS2) 10Filippo Giunchedi: prometheus: introduce blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811295 (https://phabricator.wikimedia.org/T305847) [12:24:32] (03PS2) 10Filippo Giunchedi: prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) [12:24:34] (03PS8) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [12:24:59] (03CR) 10CI reject: [V: 04-1] prometheus: move blackbox http check to prometheus::rule [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:25:12] (03CR) 10Ayounsi: [C: 03+1] profile::homer::diff_timer_interval: change time [puppet] - 10https://gerrit.wikimedia.org/r/811291 (owner: 10Volans) [12:26:27] (03CR) 10Ayounsi: [C: 03+1] "LGTM! as mentioned in I4a857d6c14c227a810233ff1259d5b01635005b0 it would be useful for users to have the rack/row location for VMs as well" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810955 (owner: 10Volans) [12:27:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:01] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:29:20] (03CR) 10CI reject: [V: 04-1] prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:29:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'T311106', diff saved to 
https://phabricator.wikimedia.org/P30859 and previous config saved to /var/cache/conftool/dbconfig/20220705-122941-ladsgroup.json [12:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:45] (03CR) 10CI reject: [V: 04-1] WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 (owner: 10Filippo Giunchedi) [12:29:46] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 [12:31:45] !log draining ganeti2023 for eventual reimage T311686 [12:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:49] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [12:32:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:38] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:35:27] (03CR) 10David Caro: Add mypy, black and isort tests (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [12:35:36] (03CR) 10David Caro: [C: 03+2] Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [12:35:45] (03CR) 10CI reject: [V: 04-1] Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [12:36:48] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[12:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:34] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hadoop.roll-restart-workers (exit_code=97) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [12:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:39] (03CR) 10David Caro: [C: 03+2] novafullstack: black and isort [puppet] - 10https://gerrit.wikimedia.org/r/810949 (owner: 10David Caro) [12:38:18] moritzm: hey, can I merge your puppet patch? [12:41:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30861 and previous config saved to /var/cache/conftool/dbconfig/20220705-124101-ladsgroup.json [12:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:04] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [12:42:05] dcaro: oh sorry, please go ahead, yes [12:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:18] 👍 [12:47:32] (03PS4) 10David Caro: Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 [12:48:34] (03CR) 10David Caro: "Just rebased and reapplied (and fixed an error in the tox.ini specifying an env that does not exist)" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [12:49:10] RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently disabled (puppetdb maintenance), not alerting. Last run 22 hours ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:49:42] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently disabled (puppetdb maintenance), not alerting. 
Last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:49:56] RECOVERY - puppet last run on ms-be2033 is OK: OK: Puppet is currently disabled (puppetdb maintenance), not alerting. Last run 22 hours ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:50:04] (03PS5) 10David Caro: Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 [12:50:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:01] (03PS6) 10David Caro: Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 [12:52:18] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently disabled (puppetdb maintenance), not alerting. Last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:52:42] RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently disabled (puppetdb maintenance), not alerting. Last run 1 day ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:52:46] RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently disabled (puppetdb maintenance), not alerting. Last run 20 hours ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:59:18] (03CR) 10Volans: [C: 03+2] profile::homer::diff_timer_interval: change time [puppet] - 10https://gerrit.wikimedia.org/r/811291 (owner: 10Volans) [12:59:54] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) With the same weight and both having p_s enabled: https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/normal-wi... 
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T1300). [13:00:05] awight and matej_suchanek: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T1300) [13:00:14] I can deploy today! [13:00:26] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:00:35] I'm also happy to :-) [13:00:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) @LSobanski hello any update on this? [13:00:43] o/ [13:00:56] I can test my patches, at least. 
[13:00:58] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:01:10] PROBLEM - puppet last run on ms-be2033 is CRITICAL: CRITICAL: Puppet last ran 22 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:01:36] (03PS3) 10Urbanecm: Drop dependent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808803 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [13:01:39] (03CR) 10Urbanecm: [C: 03+2] Drop dependent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808803 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [13:01:41] (03CR) 10Muehlenhoff: [C: 03+2] apt::repository: use signed-by instead of apt-key [puppet] - 10https://gerrit.wikimedia.org/r/795380 (owner: 10Majavah) [13:02:06] great to see unneeded flags going away : [13:02:07] :) [13:02:15] (03PS9) 10Urbanecm: Drop deprecated feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [13:02:26] (03Merged) 10jenkins-bot: Drop dependent feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808803 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [13:02:28] (03CR) 10Urbanecm: [C: 03+2] Drop deprecated feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [13:02:51] awight: your first patch is at mwdebug1001, can you check? 
[13:02:57] ack [13:03:29] (03Merged) 10jenkins-bot: Drop deprecated feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [13:03:32] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:03:54] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:03:58] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Puppet last ran 21 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:04:07] urbanecm: Looks good. [13:04:15] syncing [13:06:58] (03CR) 10David Caro: [C: 03+2] Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [13:07:11] (03PS1) 10Muehlenhoff: Revert "apt::repository: use signed-by instead of apt-key" [puppet] - 10https://gerrit.wikimedia.org/r/811303 [13:07:54] (03CR) 10Ottomata: [WIP] Build spark assembly for Spark3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu) [13:07:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:03] (03PS1) 10Klausman: ml-services: add single draftquality inference service to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811304 (https://phabricator.wikimedia.org/T302195) [13:08:27] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 891057f6ba555b2ece0424e3364d853eb20555da: Drop dependent feature flags (T310684) (duration: 03m 37s) [13:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:32] T310684: Delete deprecated config from settings files - 
https://phabricator.wikimedia.org/T310684 [13:09:10] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) from looking at the graphs (and grafana graphs: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-... [13:09:28] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) @Ladsgroup let's do the same but with performance schema disabled and see what's the difference between all the graphs... [13:09:51] (03CR) 10CI reject: [V: 04-1] Revert "apt::repository: use signed-by instead of apt-key" [puppet] - 10https://gerrit.wikimedia.org/r/811303 (owner: 10Muehlenhoff) [13:10:08] awight: your second patch is at mwdebug1001 now too, can you check? [13:10:14] urbanecm: Great, checking now [13:10:16] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) Sure! [13:10:50] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Revert "apt::repository: use signed-by instead of apt-key" [puppet] - 10https://gerrit.wikimedia.org/r/811303 (owner: 10Muehlenhoff) [13:11:41] urbanecm: Still works! [13:11:46] great, syncing! [13:11:52] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Disabled: ` root@db1132:~# mysql -e "show global variables like 'performance_schema'" +--------------------+-------+ |... [13:12:21] matej_suchanek: hi, your patch will be next -- are you around? [13:12:32] yes I am [13:12:40] great! 
[13:12:49] (03PS6) 10Urbanecm: static.php: Update call to deprecated IContextSource::getStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810890 (owner: 10Matěj Suchánek) [13:12:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] scap: Make scap3 provider packages depend on /usr/bin/scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [13:12:54] (03CR) 10Urbanecm: [C: 03+2] static.php: Update call to deprecated IContextSource::getStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810890 (owner: 10Matěj Suchánek) [13:13:39] (03Merged) 10jenkins-bot: static.php: Update call to deprecated IContextSource::getStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810890 (owner: 10Matěj Suchánek) [13:14:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:14:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:06] (03Merged) 10jenkins-bot: Add mypy, black and isort tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810295 (owner: 10David Caro) [13:15:08] (03PS3) 10Filippo Giunchedi: prometheus: deploy custom probedown alerts [puppet] - 10https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) [13:15:10] (03PS3) 10Filippo Giunchedi: prometheus: deploy alerts as yml not yaml [puppet] - 10https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) [13:15:12] (03PS2) 10Filippo Giunchedi: prometheus: move blackbox http check to prometheus::rule [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) [13:15:14] (03PS3) 10Filippo Giunchedi: prometheus: introduce blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811295 
(https://phabricator.wikimedia.org/T305847) [13:15:16] (03PS3) 10Filippo Giunchedi: prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) [13:15:18] (03PS9) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [13:15:39] !log urbanecm@deploy1002 Synchronized wmf-config/: 1287b969fc42aee6efae5ff1f1943394ba35e326: Drop deprecated feature flags (T310684) (duration: 03m 32s) [13:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:46] T310684: Delete deprecated config from settings files - https://phabricator.wikimedia.org/T310684 [13:15:47] awight: both patches should be live now! [13:16:08] matej_suchanek: your patch is at mwdebug1001. can you test it please? [13:16:26] urbanecm: thanks :-D [13:16:54] urbanecm: what is the best way to test the /w/static.php entry point? [13:17:03] (03PS1) 10Muehlenhoff: Revert "uwsgi: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/811307 [13:17:16] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10jcrespo) > with connection management I agree. If you have the resources/time, I also suggest an additional testing point (there a... [13:17:32] matej_suchanek: request a resource served through static.php. see https://gerrit.wikimedia.org/g/operations/puppet/+/619a797802955ae9ba38d421e07f755961dd9693/modules/mediawiki/templates/apache/mediawiki-vhost.conf.erb#64 for relevant rewrite rules. 
[13:18:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811307 (owner: 10Muehlenhoff) [13:18:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Revert "uwsgi: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/811307 (owner: 10Muehlenhoff) [13:18:59] (03CR) 10CI reject: [V: 04-1] prometheus: move blackbox http check to prometheus::rule [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:19:37] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:20:27] trying with https://en.wikipedia.org/w/extensions/GrowthExperiments/images/end-of-queue.svg on my end, seems to work fine. [13:20:29] 10SRE-tools, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10akosiaris) Didn't need a reboot after all. I fixed /etc/network/interfaces configuration and issued a `systemctl restart... 
[13:20:39] (03CR) 10CI reject: [V: 04-1] prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:20:50] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:20:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:55] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:20:57] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "check" [puppet] - 10https://gerrit.wikimedia.org/r/811303 (owner: 10Muehlenhoff) [13:21:09] urbanecm: requested some icons loaded on a page load and seems fine as well [13:21:14] (03CR) 10CI reject: [V: 04-1] WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 (owner: 10Filippo Giunchedi) [13:21:18] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/811303 (owner: 10Muehlenhoff) [13:21:27] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Yeah agreed, let's test disabling P_S first and then if needed, start playing with thread pool disablement and/or tunin... [13:22:35] matej_suchanek: great. 
in that case, syncing the patch [13:25:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:40] !log urbanecm@deploy1002 Synchronized w/static.php: 300ef4a5ee6f0c35de831e88eb2f8169e7f66e97: static.php: Update call to deprecated IContextSource::getStats (duration: 03m 41s) [13:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:21] and it's deployed [13:27:24] anything else, anyone? [13:28:17] (03PS9) 10Cathal Mooney: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) [13:29:26] (03PS1) 10Filippo Giunchedi: icinga: pass in ip4/ip6 addresses for commons blackbox [puppet] - 10https://gerrit.wikimedia.org/r/811310 (https://phabricator.wikimedia.org/T305847) [13:30:09] (03CR) 10Klausman: [C: 03+2] ml-services: add single draftquality inference service to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811304 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:31:58] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: pass in ip4/ip6 addresses for commons blackbox [puppet] - 10https://gerrit.wikimedia.org/r/811310 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:32:22] moritzm: merged your change too [13:32:39] (03CR) 10Muehlenhoff: [C: 03+2] "This caused an error in apt-get update on the hosts with openstack packages, so I had to revert:" [puppet] - 10https://gerrit.wikimedia.org/r/795380 (owner: 10Majavah) [13:32:48] ack, thx [13:32:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:32:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:56] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:04] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/805815 (owner: 10Filippo Giunchedi) [13:33:20] !log UTC afternoon B&C window done [13:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:34] (03Merged) 10jenkins-bot: ml-services: add single draftquality inference service to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811304 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:33:51] (03CR) 10Cathal Mooney: [C: 03+2] Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [13:34:13] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:39] (03Merged) 10jenkins-bot: Add check to network report to ensure IPs match connected Vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806389 (https://phabricator.wikimedia.org/T310299) (owner: 10Cathal Mooney) [13:36:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:14] (03PS1) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) [13:37:40] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Volker_E) Same, thanks @Dzahn for making the OOUIPHP demos work again! [13:38:28] (03CR) 10Matthias Mullie: [C: 04-1] "Do not merge yet. 
We'll do the first couple of runs manually to check output." [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [13:39:57] (03CR) 10CI reject: [V: 04-1] Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [13:42:15] (03PS2) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [13:43:04] (03CR) 10CI reject: [V: 04-1] novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [13:44:43] (03PS3) 10Jaime Nuche: scap: make scap::target require the scap class [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [13:46:24] (03PS1) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) [13:48:03] (03PS1) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [13:48:46] (03CR) 10CI reject: [V: 04-1] Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [13:53:36] (03PS1) 10JMeybohm: kubernetes::master: Increase the apiserver latency thresholds [puppet] - 10https://gerrit.wikimedia.org/r/811315 (https://phabricator.wikimedia.org/T310714) [13:53:39] (03PS2) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) [13:54:00] (03CR) 10Matthias Mullie: [C: 04-1] "Do not merge yet. 
We'll do the first couple of runs manually to check output." [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [13:57:05] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Increase the apiserver latency thresholds [puppet] - 10https://gerrit.wikimedia.org/r/811315 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [13:59:07] (03PS2) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) [14:00:28] 10SRE, 10Icinga, 10Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10herron) Looks like recoveries are consistently arriving 2m after the alert. I've increased the threshold from 3 checks with 10 second pause to 5 checks with 60s pa... [14:00:43] (03CR) 10Jaime Nuche: scap: make scap::target require the scap class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy) [14:01:03] (03PS3) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [14:01:05] (03PS1) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [14:02:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2164.mgmt.codfw.wmnet with reboot policy FORCED [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:41] (03CR) 10David Caro: "Untested for now, but feel free to comment on the refactor itself." 
[puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro) [14:04:30] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops: Migrate thumbor Dockerfile to blubber - https://phabricator.wikimedia.org/T312104 (10hnowlan) [14:05:02] (03CR) 10Sergio Gimeno: [C: 03+1] GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [14:09:09] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:01] (03CR) 10JMeybohm: safe-service-restart.py: Ensure 'status' always has a value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) (owner: 10Ahmon Dancy) [14:17:34] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10Vgutierrez) >>! In T310303#8042271, @BCornwall wrote: > @Vgutierrez Indeed, do you have any reason to keep these *specific* instances around, or are you... 
[14:17:40] (03PS4) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [14:17:44] (03CR) 10David Caro: novafullstack: add types and some names refactor (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [14:17:48] (03PS2) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [14:18:26] (03PS1) 10David Caro: Revert "profile::mariadb::packages_wmf: Remove support for stretch" [puppet] - 10https://gerrit.wikimedia.org/r/811270 [14:18:51] (03CR) 10David Caro: "Sorry people, we will have to wait a bit more" [puppet] - 10https://gerrit.wikimedia.org/r/811270 (owner: 10David Caro) [14:20:43] (03CR) 10David Caro: "Some other things:" [puppet] - 10https://gerrit.wikimedia.org/r/810950 (owner: 10David Caro) [14:22:21] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:54] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw [14:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:03] (03CR) 10Giuseppe Lavagetto: safe-service-restart.py: Ensure 'status' always has a value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) (owner: 10Ahmon Dancy) [14:28:17] 10SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10MatthewVernon) LGTM :) [14:29:20] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:30:23] (03CR) 10Herron: [C: 03+1] "Nice idea, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811295 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:31:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:32:21] (03CR) 10Herron: [C: 03+1] "LGTM, probably worth a PCC smoke test" [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:33:01] (03CR) 10Herron: [C: 03+1] prometheus: deploy alerts as yml not yaml [puppet] - 10https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:34:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:45] (03CR) 10BPirkle: [C: 03+1] "Discussed this in a synchronous meeting, LGTM regardless of whether naming changes are made or not." [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [14:35:46] (03CR) 10Herron: [C: 03+1] prometheus: deploy custom probedown alerts [puppet] - 10https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:36:33] (03PS5) 10David Caro: novafullstack: add types and some names refactor [puppet] - 10https://gerrit.wikimedia.org/r/810950 [14:36:35] (03PS3) 10David Caro: novafullstack: Refactor and minor fix [puppet] - 10https://gerrit.wikimedia.org/r/811316 [14:36:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? 
- https://phabricator.wikimedia.org/T310923 (10LSobanski) @Papaul The moss-be2001 and moss-be2002 are not currently in production so they can be powered off at any time. The thanos host will need more... [14:36:39] (03PS1) 10David Caro: novafullstack: allow running on codfw [puppet] - 10https://gerrit.wikimedia.org/r/811318 [14:36:53] (03CR) 10Herron: [C: 03+1] prometheus: move blackbox http check to prometheus::rule [puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [14:37:15] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10LSobanski) The moss-be1001 and moss-be1002 are not currently in production so they can be powered off at any time. The thanos host will need more time (r... [14:42:19] (03PS3) 10Jcrespo: Add new user for dbbackups database for django dashboard [puppet] - 10https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) [14:42:21] (03PS2) 10Jcrespo: bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [14:43:30] (03CR) 10CI reject: [V: 04-1] bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [14:44:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: Switch disk type to DRBD [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: Switch disk type to DRBD [14:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:10] (03PS1) 10Cathal Mooney: Fix Network report to deal with IP address on host with no Vlan pfx [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/811319 [14:48:15] (03CR) 10Jcrespo: "Puppet compilation looks fine: https://puppet-compiler.wmflabs.org/pcc-worker1002/36190/" [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [14:48:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2164.mgmt.codfw.wmnet with reboot policy FORCED [14:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:14] (03CR) 10Elukey: "Looks good, maybe we could reduce the number of pods to max 2/3 for each kind as starter? We should be fine with two workers anyway but I'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:52:19] (03CR) 10Klausman: ml-services: add some more revscoring services to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:53:09] (03PS1) 10Phuedx: DNM: Add web.ui_actions_tracking event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811322 [14:53:19] (03CR) 10Ahmon Dancy: "Looking for a +2 from someone" [deployment-charts] - 10https://gerrit.wikimedia.org/r/803597 (owner: 10Ahmon Dancy) [14:53:25] (03PS3) 10Elukey: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) [14:53:48] dancy: dangerous words ^ [14:53:51] /j [14:54:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline. I have no idea why CI is complaining." [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [14:54:25] TheresNoTime: Gotta do what I gotta do! 
[14:54:43] (that being to beg) [14:55:14] (03CR) 10Muehlenhoff: [C: 03+1] bacula::storage: Remove support for stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [14:57:17] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811319 (owner: 10Cathal Mooney) [14:58:09] (03CR) 10Ottomata: [C: 03+1] Add a new Eventgate stream for revision-score events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [14:59:02] (03CR) 10Cathal Mooney: [C: 03+2] Fix Network report to deal with IP address on host with no Vlan pfx [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811319 (owner: 10Cathal Mooney) [15:00:38] (03CR) 10Elukey: Add a new Eventgate stream for revision-score events (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [15:00:40] (03Merged) 10jenkins-bot: Fix Network report to deal with IP address on host with no Vlan pfx [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811319 (owner: 10Cathal Mooney) [15:00:52] !log draining ganeti2024 for eventual reimage T311686 [15:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:55] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [15:01:57] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Create a cookbook to switch an instance to DRBD/plain disk storage - https://phabricator.wikimedia.org/T312116 (10MoritzMuehlenhoff) [15:02:25] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [15:02:38] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:30] !log pt1979@cumin2002 
START - Cookbook sre.hosts.reimage for host db2164.codfw.wmnet with OS bullseye [15:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2164.codfw.wmnet with OS bullseye [15:05:24] !log installing firejail updates on stretch [15:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:28] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db2169 [15:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:38] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db2169 [15:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:20] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:07:08] 10SRE-swift-storage: Reconcile Puppet profiles between clusters and thanos / ms - https://phabricator.wikimedia.org/T312118 (10LSobanski) [15:07:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:46] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2169 [15:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2169 [15:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:14] (03PS2) 10Phuedx: beta: Add web.ui_actions_tracking event stream 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/811322 (https://phabricator.wikimedia.org/T311268) [15:13:00] (03PS2) 10Ahmon Dancy: safe-service-restart.py: Avoid uninitialized access to 'status' [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) [15:13:13] (03CR) 10Ahmon Dancy: safe-service-restart.py: Avoid uninitialized access to 'status' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) (owner: 10Ahmon Dancy) [15:15:42] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:17:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2169.mgmt.codfw.wmnet with reboot policy FORCED [15:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney) [15:18:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) 05Resolved→03Open Hi @Cmjohnson I think there was a mix-up for cloudvirt1050 in Netbox for the cable details. Look... [15:24:19] 10SRE, 10Icinga, 10Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10Volans) >>! In T311926#8051744, @herron wrote: > Looks like recoveries are consistently arriving 2m after the alert. I've increased the threshold from 3 checks wit... 
[15:24:21] (03PS1) 10Muehlenhoff: Switch image reports over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811324 (https://phabricator.wikimedia.org/T298463) [15:27:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2164.codfw.wmnet with OS bullseye [15:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2164.codfw.wmnet with OS bullseye executed with er... [15:34:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2164.codfw.wmnet with OS bullseye [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:12] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2164.codfw.wmnet with OS bullseye [15:38:30] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:39:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] safe-service-restart.py: Avoid uninitialized access to 'status' [puppet] - 10https://gerrit.wikimedia.org/r/807624 (https://phabricator.wikimedia.org/T311182) (owner: 10Ahmon Dancy) [15:42:55] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10JArguello-WMF) 05Open→03Resolved [15:42:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? 
- https://phabricator.wikimedia.org/T310923 (10Papaul) @LSobanski Thank you for the update will proceed with the moss nodes for now. [15:43:07] (03PS3) 10Phuedx: beta: Add mediawiki.web_ui.interactions event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811322 (https://phabricator.wikimedia.org/T311268) [15:44:43] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) [15:44:48] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) kafka-jumbo1010: E1 U17 Port 17 CableID 20220240 kafka-jumbo1011: E2 U19 Port 19 CableID 20220239 kafka-jumbo1012: E3 U19 Port 19... [15:45:28] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:50:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2169.mgmt.codfw.wmnet with reboot policy FORCED [15:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:08] PROBLEM - Host moss-be1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:54:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2164.codfw.wmnet with reason: host reimage [15:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:46] 10SRE, 10SRE-OnFire, 10Shellbox, 10serviceops, 10Sustainability (Incident Followup): Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10matmarex) 
[15:57:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2170.mgmt.codfw.wmnet with reboot policy FORCED [15:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2164.codfw.wmnet with reason: host reimage [15:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:01] 10SRE, 10SRE-OnFire, 10Shellbox, 10serviceops, 10Sustainability (Incident Followup): Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10matmarex) Cross-referencing: https://wikitech.wikimedia.org/wiki/Incidents/2022-07-03_shellbox_request_spike (I think the tasks I merged are re... [15:59:10] RECOVERY - Host moss-be1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [15:59:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2171.mgmt.codfw.wmnet with reboot policy FORCED [15:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:03:04] PROBLEM - Host moss-be1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:05:13] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? 
- https://phabricator.wikimedia.org/T310922 (10Jclark-ctr) [16:06:28] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:08:04] (03PS1) 10JMeybohm: k8s: Retry checks for expected pods on drain [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) [16:08:24] RECOVERY - Host moss-be1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [16:09:20] (03PS1) 10Majavah: hieradata: cloudweb-dev: route striker to the docker port [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) [16:10:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2164.codfw.wmnet with OS bullseye [16:10:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36191/console" [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [16:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:36] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2164.codfw.wmnet with OS bullseye completed: - db2... 
[16:11:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2169.codfw.wmnet with OS bullseye [16:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:45] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2169.codfw.wmnet with OS bullseye [16:13:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [16:18:56] (03PS1) 10BryanDavis: Revert "striker: Open firewall for Docker-managed service" [puppet] - 10https://gerrit.wikimedia.org/r/811274 (https://phabricator.wikimedia.org/T306469) [16:19:39] (03CR) 10Dzahn: [C: 03+2] "I don't get it. Maybe something changed "upstream" in the prometheus class meanwhile." [puppet] - 10https://gerrit.wikimedia.org/r/806476 (owner: 10Dzahn) [16:24:31] (03PS1) 10DDesouza: QuickSurveys: Increase coverage of 'research-incentive' survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811335 (https://phabricator.wikimedia.org/T311015) [16:24:39] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [16:26:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2170.mgmt.codfw.wmnet with reboot policy FORCED [16:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2171.mgmt.codfw.wmnet with reboot policy FORCED [16:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:33] (03PS1) 10JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 
(https://phabricator.wikimedia.org/T260661) [16:28:37] (03CR) 10Jcrespo: [C: 03+1] "As you know we have different philosophies there :-P, in order not to argue, let's deploy this and we can keep arguing about performance v" [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [16:28:50] (03PS3) 10Jcrespo: bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [16:29:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2172.mgmt.codfw.wmnet with reboot policy FORCED [16:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:38] (03CR) 10CI reject: [V: 04-1] bacula::storage: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [16:29:43] (03CR) 10Jcrespo: [C: 04-1] "Actually, this needs more work on tests only, I think." [puppet] - 10https://gerrit.wikimedia.org/r/810961 (owner: 10Muehlenhoff) [16:29:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [16:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [16:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:21] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [16:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [16:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:51] (03CR) 10CI reject: [V: 04-1] k8s: Add KubernetesNode.taints propertry [software/spicerack] - 
10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [16:44:17] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw [16:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:45] (03PS1) 10BryanDavis: labweb: point tlsproxy envoy at port 8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/811337 (https://phabricator.wikimedia.org/T306469) [16:47:40] (03PS1) 10Jdlrobson: Enable title above tabs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811339 (https://phabricator.wikimedia.org/T255319) [16:48:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2169.codfw.wmnet with OS bullseye [16:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2169.codfw.wmnet with OS bullseye completed: - db2... 
[16:49:45] (03PS2) 10Jdlrobson: Enable title above tabs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811339 (https://phabricator.wikimedia.org/T311773) [16:50:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2170.codfw.wmnet with OS bullseye [16:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:34] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2170.codfw.wmnet with OS bullseye [16:52:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2172.mgmt.codfw.wmnet with reboot policy FORCED [16:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:42] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36193/console" [puppet] - 10https://gerrit.wikimedia.org/r/811337 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [16:55:49] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/pcc-worker1002/36192/" [puppet] - 10https://gerrit.wikimedia.org/r/811337 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [16:57:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [16:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:58] (03PS2) 10JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) [16:59:36] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [17:00:33] (03CR) 10Andrea Denisse: [C: 03+1] prometheus: deploy custom probedown alerts [puppet] - 10https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [17:00:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2174.mgmt.codfw.wmnet with reboot policy FORCED [17:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:12] (03CR) 10Andrea Denisse: [C: 03+1] prometheus: deploy alerts as yml not yaml [puppet] - 10https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [17:03:14] (03CR) 10Andrea Denisse: [C: 03+1] prometheus: switch to blackbox::module [puppet] - 10https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [17:09:54] (03CR) 10CI reject: [V: 04-1] k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [17:10:00] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:15:00] (03CR) 10Ayounsi: "Nice! Some comments." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [17:15:08] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:20] (03PS1) 10Andrew Bogott: Revert "Revert "magnum.conf: change the domain admin name"" [puppet] - 10https://gerrit.wikimedia.org/r/811275 [17:18:55] (03PS3) 10JMeybohm: k8s: Add KubernetesNode.taints propertry [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) [17:19:16] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "magnum.conf: change the domain admin name"" [puppet] - 10https://gerrit.wikimedia.org/r/811275 (owner: 10Andrew Bogott) [17:19:37] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:31:47] (03PS1) 10Muehlenhoff: Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463) [17:33:40] !log installing haproxy security updates on stretch [17:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:25] (03PS1) 10Andrew Bogott: OpenStack Magnum: magnumdomainadm is in ldap, so in the default domain [puppet] - 10https://gerrit.wikimedia.org/r/811346 [17:35:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2174.mgmt.codfw.wmnet with reboot policy FORCED [17:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host db2170.codfw.wmnet with OS bullseye [17:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:34] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2170.codfw.wmnet with OS bullseye executed with er... [17:39:07] (03PS1) 10Dzahn: Revert "Revert "gitlab: add prometheus blackbox http monitor"" [puppet] - 10https://gerrit.wikimedia.org/r/811276 [17:39:58] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:42:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2170.codfw.wmnet with OS bullseye [17:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:40] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2170.codfw.wmnet with OS bullseye [17:50:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:54:49] !log disabling puppet on gitlab* - debugging gerrit:811276 [17:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:30] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2170 [17:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:55] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "gitlab: add prometheus blackbox http monitor"" [puppet] - 10https://gerrit.wikimedia.org/r/811276 (owner: 10Dzahn) 
[17:57:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2170 [17:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:12] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2171 [17:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2171 [17:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:12] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2172 [17:59:15] (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (036 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [17:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:23] (03CR) 10Dzahn: "I disabled puppet on gitlab*, reverted the revert and could NOT confirm any issue. puppet was just a noop on gitlab2002, no error there." [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [17:59:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2172 [17:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] (03CR) 10Dzahn: "So looks like it must have been broken and fixed again on the prometheus side or elsewhere." [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [18:00:05] jnuche and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T1800). 
[18:00:24] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2173 [18:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2173 [18:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:07] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2174 [18:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2174 [18:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:01] (03PS1) 10Dzahn: gitlab/prometheus: 'body_regex_matches' expects an Array value, got String [puppet] - 10https://gerrit.wikimedia.org/r/811349 [18:06:30] (03CR) 10Dzahn: [C: 03+2] gitlab/prometheus: 'body_regex_matches' expects an Array value, got String [puppet] - 10https://gerrit.wikimedia.org/r/811349 (owner: 10Dzahn) [18:08:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2171.codfw.wmnet with OS bullseye [18:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:03] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2171.codfw.wmnet with OS bullseye [18:11:00] (03PS1) 10Ebernhardson: cirrus: Disable commonswiki writes to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811350 (https://phabricator.wikimedia.org/T309648) [18:12:22] jnuche: dduvall: Can you let me know once you are done with train deployment? 
I need to ship a small config patch (^, https://gerrit.wikimedia.org/r/811350) to unblock some operational tasks. [18:19:28] (03PS1) 10Dzahn: gitlab: switch gitlab2001 back to "insetup" role [puppet] - 10https://gerrit.wikimedia.org/r/811351 (https://phabricator.wikimedia.org/T307142) [18:19:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2170.codfw.wmnet with reason: host reimage [18:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:05] (03CR) 10Dzahn: "submitting https://gerrit.wikimedia.org/r/c/operations/puppet/+/811351 to avoid puppet error until the cookbook issue is resolved and has " [puppet] - 10https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [18:21:09] (03CR) 10Dzahn: [C: 03+2] gitlab: switch gitlab2001 back to "insetup" role [puppet] - 10https://gerrit.wikimedia.org/r/811351 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [18:23:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: host reimage [18:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [18:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:41] (03PS2) 10Andrew Bogott: OpenStack Magnum: magnumdomainadm is in ldap, so in the default domain [puppet] - 10https://gerrit.wikimedia.org/r/811346 [18:31:43] (03PS1) 10Andrew Bogott: OpenStack magnum: change domain admin username again [puppet] - 10https://gerrit.wikimedia.org/r/811353 [18:32:18] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab2001.codfw.wmnet [18:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [18:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:54] !log power down moss-be2001 for NVMe installation [18:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:58] (03PS3) 10Andrew Bogott: OpenStack Magnum: magnum_domain_admin is in ldap, so in the default domain [puppet] - 10https://gerrit.wikimedia.org/r/811346 [18:34:09] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack magnum: change domain admin username again [puppet] - 10https://gerrit.wikimedia.org/r/811353 (owner: 10Andrew Bogott) [18:34:15] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Magnum: magnum_domain_admin is in ldap, so in the default domain [puppet] - 10https://gerrit.wikimedia.org/r/811346 (owner: 10Andrew Bogott) [18:34:42] (03PS2) 10Andrew Bogott: OpenStack magnum: change domain admin username again [puppet] - 10https://gerrit.wikimedia.org/r/811353 [18:35:00] 10SRE: errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook) - https://phabricator.wikimedia.org/T311446 (10Dzahn) [18:35:42] volans: thanks [18:36:06] 10SRE: errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook) - https://phabricator.wikimedia.org/T311446 (10Dzahn) 05Resolved→03Open Since Arnold is out I tried to run the cookbook again on gitlab2001. It failed to remove the VM from ganeti: ` Downtimed host on Icinga/Alertmanager Found Ganet... [18:36:10] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:34] 10SRE: errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook) - https://phabricator.wikimedia.org/T311446 (10Dzahn) So while it does "Found Ganeti VM" it _also_ says "Selection filter does not match any instances". 
[18:37:33] PROBLEM - Host moss-be2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:38:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2170.codfw.wmnet with OS bullseye [18:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:37] (03PS1) 10Ebernhardson: cloudelastic: Increase primary cluster heap from 45G to 55G [puppet] - 10https://gerrit.wikimedia.org/r/811355 (https://phabricator.wikimedia.org/T309648) [18:38:38] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2170.codfw.wmnet with OS bullseye completed: - db2... [18:39:17] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:39:17] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab2001.codfw.wmnet [18:39:18] 10SRE: errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook) - https://phabricator.wikimedia.org/T311446 (10Dzahn) 05Open→03In progress a:05Volans→03Dzahn [18:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:04] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab2001.wikimedia.org [18:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:11] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:29] (03PS5) 10BCornwall: prometheus: Add custom vm.max_map_count metric [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) [18:44:48] (03CR) 10BCornwall: prometheus: Add custom vm.max_map_count metric (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: 10BCornwall) [18:46:45] (JobUnavailable) firing: (4) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:47:15] (03CR) 10BCornwall: [C: 03+2] prometheus: Add custom vm.max_map_count metric [puppet] - 10https://gerrit.wikimedia.org/r/809038 (https://phabricator.wikimedia.org/T311445) (owner: 10BCornwall) [18:47:47] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8805.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:48] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db2171.codfw.wmnet with OS bullseye [18:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:52] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2171.codfw.wmnet with OS bullseye completed: - db2... [18:47:56] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2171.codfw.wmnet with OS bullseye executed with er... 
[18:48:15] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:09] (03CR) 10Volans: "question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [18:49:13] RECOVERY - Host moss-be2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms [18:51:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:52:14] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:15] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab2001.wikimedia.org [18:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:26] !log power down moss-be2002 for NVMe installation [18:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2172.codfw.wmnet with OS bullseye [18:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:05] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2172.codfw.wmnet with OS bullseye [18:54:59] 10SRE: errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook) - https://phabricator.wikimedia.org/T311446 (10Dzahn) 05In progress→03Resolved using the .wikimedia.org name 
correctly it is now: ` Found Ganeti VM Shutting down VM gitlab2001.wikimedia.org in cluster codfw ... VM removed ` After t... [18:55:58] 10SRE: errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook) - https://phabricator.wikimedia.org/T311446 (10Dzahn) [18:56:31] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:09] (03PS1) 10Andrew Bogott: OpenStack Magnum: clarify difference between trustee_domain_name and trustee_domain_admin_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/811361 [18:58:12] (03CR) 10CI reject: [V: 04-1] OpenStack Magnum: clarify difference between trustee_domain_name and trustee_domain_admin_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/811361 (owner: 10Andrew Bogott) [19:01:05] (03PS1) 10Dzahn: site/hiera: remove gitlab2001 after decom [puppet] - 10https://gerrit.wikimedia.org/r/811362 (https://phabricator.wikimedia.org/T307142) [19:01:06] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:01:58] (03CR) 10Dzahn: [C: 03+2] "VM is fully decom'ed now" [puppet] - 10https://gerrit.wikimedia.org/r/811351 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [19:02:44] (03CR) 10Dzahn: "there should be another follow-up. to remove "replica-old" DNS name" [puppet] - 10https://gerrit.wikimedia.org/r/811362 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [19:04:34] (03CR) 10Andrew Bogott: [C: 03+2] labweb: point tlsproxy envoy at port 8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/811337 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:05:03] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? 
- https://phabricator.wikimedia.org/T310923 (10Papaul) [19:05:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) ` pt1979@moss-be2001:~$ sudo fdisk -l Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors Disk model: Dell Ent NVMe AGN MU AIC 1.... [19:13:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2172.codfw.wmnet with reason: host reimage [19:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:16] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:02] 10SRE, 10Icinga, 10Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10herron) In addition to the dates above I see occasional flapping from this check in my inbox going all the way back to 2019. The sync job (which stops icinga for... 
[19:17:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2172.codfw.wmnet with reason: host reimage [19:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:53] (03PS2) 10Andrew Bogott: OpenStack Magnum: clarify trustee_domain_name vs trustee_domain_admin_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/811361 [19:24:55] (03PS1) 10Andrew Bogott: OpenStack keystone: add 'heat' service domain config [puppet] - 10https://gerrit.wikimedia.org/r/811364 [19:25:08] (03PS2) 10Ryan Kemper: cloudelastic: Increase primary cluster heap from 45G to 55G [puppet] - 10https://gerrit.wikimedia.org/r/811355 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [19:26:19] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Magnum: clarify trustee_domain_name vs trustee_domain_admin_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/811361 (owner: 10Andrew Bogott) [19:26:30] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack keystone: add 'heat' service domain config [puppet] - 10https://gerrit.wikimedia.org/r/811364 (owner: 10Andrew Bogott) [19:28:22] (03PS3) 10Ryan Kemper: cloudelastic: Increase primary cluster heap from 45G to 55G [puppet] - 10https://gerrit.wikimedia.org/r/811355 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [19:29:46] (03CR) 10Dzahn: "nevermind my last comment. gitlab2002/2003 are still on "insetup" that's why. 
actual issue fixed with https://gerrit.wikimedia.org/r/c/ope" [puppet] - 10https://gerrit.wikimedia.org/r/810899 (owner: 10Jelto) [19:29:53] (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Increase primary cluster heap from 45G to 55G [puppet] - 10https://gerrit.wikimedia.org/r/811355 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [19:30:52] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: Increase primary cluster heap from 45G to 55G [puppet] - 10https://gerrit.wikimedia.org/r/811355 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [19:31:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2172.codfw.wmnet with OS bullseye [19:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:20] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2172.codfw.wmnet with OS bullseye completed: - db2... [19:33:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10nskaggs) @Cmjohnson you should continue to use public VLAN for this. Decisions around changing the architecture of the service shouldn't delay... 
[19:39:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:19] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:57] here [19:40:07] I have ACKed [19:40:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:40:17] getting to laptop [19:40:20] ACK ack [19:40:20] sukhe: thanks, here as well [19:40:23] it's wikitech [19:40:47] well, or one of the former backends of it [19:41:11] mutante: how did you determine that? [19:41:28] because it says "labweb" [19:41:40] and I know that used to be the wikitech machines [19:41:49] but also up there there is a ticket about "cloudweb" [19:41:54] former? 
[19:41:57] which sounds like it replaces that, because lab -> cloud [19:42:12] https://phabricator.wikimedia.org/T305414 [19:42:40] seemed like this could be related, might be wrong [19:42:41] PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8803.service,thumbor@8805.service,thumbor@8809.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:51] https://gerrit.wikimedia.org/r/811337 is the likely cause of any breakage, cc andrewbogott bd808 [19:43:05] (03PS2) 10DDesouza: QuickSurveys: Increase coverage of 'research-incentive' survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811335 (https://phabricator.wikimedia.org/T311015) [19:43:06] yea, let's ping andrew [19:43:13] the new cloudweb boxes aren't in service yet [19:43:23] yeah seems to be tjat [19:43:24] lemme get my laptop and see what's going on [19:43:24] that [19:43:51] (03CR) 10Dzahn: "we got paged: <+icinga-wm> PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Serv" [puppet] - 10https://gerrit.wikimedia.org/r/811337 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:44:08] wikitech wiki is up, fwiw [19:44:10] thanks mutante, was about to do [19:44:31] mutante: I'm in a meeting, but a revert is ok with me if y'all think that would help [19:45:07] so far nothing seems actually down [19:45:20] at least I can browse wikitech [19:45:32] I am trying to read through what might be the issue with the CR [19:46:24] https://toolsadmin.wikimedia.org/ is returning an error that I think is from envoy [19:46:47] maybe the ferm_srange change just disallows monitoring? [19:46:54] or that [19:47:09] what exact endpoint are we monitoring for this service? 
[19:47:23] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:49:25] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:50:02] * sukhe still grepping [19:50:11] taavi: that would make sense, unless $CACHES includes the monitoring source [19:50:22] mutante: let's revert since this broke toolsadmin, but I think we have a separate issue since I don't think the service actually monitors toolsadmin [19:50:23] what box does the blackbox-based monitoring? [19:51:18] we have a +1 from bd808 and taavi. I am fine with reverting it ("I" being the person on on-call) [19:51:26] mutante: jhathaway: what do you think? [19:51:47] jhathaway: probably prometheus*. I see an allow-all rule for the eqiad prometheus boxes on labweb, but not for the codfw ones [19:51:48] I think a revert makes sense [19:52:03] taavi: nod, thanks [19:52:05] (03PS1) 10Majavah: Revert "labweb: point tlsproxy envoy at port 8080 for striker" [puppet] - 10https://gerrit.wikimedia.org/r/811277 [19:52:13] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8804.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:15] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:20] nor are there automatic rules for lvs/pybal monitoring as far as I can see [19:52:22] oh hi thumbor [19:52:26] I made a revert, can someone merge it? [19:52:29] taavi: on it [19:52:44] I would not be against a revert but I am also a bit skeptical if it will revert cleanly [19:53:01] why would it not?
[19:53:08] mutante: why would that be the case? I mean, what part? [19:54:18] it would have to have puppet code to remove the config from /etc/envoy/listeners.d [19:54:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2173.codfw.wmnet with OS bullseye [19:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:26] but just do it anyways [19:54:27] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2173.codfw.wmnet with OS bullseye [19:54:30] doing it for now [19:54:32] (03CR) 10Ssingh: [C: 03+2] Revert "labweb: point tlsproxy envoy at port 8080 for striker" [puppet] - 10https://gerrit.wikimedia.org/r/811277 (owner: 10Majavah) [19:54:56] envoyproxy/init.pp has that dir as purge => true [19:55:09] taavi: great! [19:55:09] ok, cool [19:55:41] running agent manually [19:55:55] already doing [19:55:57] cool [19:56:59] Sorry I missed the ping earlier. Was anything actually broken, or just alarms on not-yet-installed hardware? [19:57:06] striker broke for a second [19:57:19] ok. 
Sorry about the blind merge, it looked like a no-op [19:57:31] I think I have everything except the lvs monitoring fails figured out [19:57:41] prometheus was due to a cert mismatch: 'x509: certificate is valid for horizon.wikimedia.org, toolsadmin.wikimedia.org, wikitech.wikimedia.org, labweb.discovery.wmnet, not labweb.svc.eqiad.wmnet' [19:58:15] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:58:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:58:17] lvs might be either that or a firewall issue, but not 100% sure and I don't think that's in logstash [19:58:51] striker being down is probably because the new deployment only listens on ipv4 [19:59:16] could I ask a prod root to look on the pybal logs to see if there's anything useful in there? [19:59:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:28] bd808: I'm about to step into a meeting but the backscroll here might be enlightening re: your recent lvs patch [19:59:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2171.codfw.wmnet with OS bullseye [19:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install 
db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2171.codfw.wmnet with OS bullseye [20:00:05] RoanKattouw, Urbanecm, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220705T2000) [20:00:05] sergi0, danisztls, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] o/ [20:00:29] hi yall [20:00:35] hey! [20:00:37] i can deploy today [20:00:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [20:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:31] (03PS1) 10Andrew Bogott: Revert "OpenStack keystone: add 'heat' service domain config" [puppet] - 10https://gerrit.wikimedia.org/r/811278 [20:01:42] taavi: I don't see anything in the logs other than the alerts here [20:01:49] (03CR) 10Urbanecm: "Removing Kosta's -2, per T307985#8034017 and Growth planning meeting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [20:01:54] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [20:02:01] :( thanks anyways [20:02:06] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [20:02:11] okay... 
[20:02:21] (03PS2) 10Urbanecm: GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [20:02:27] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [20:03:11] (03Merged) 10jenkins-bot: GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [20:03:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [20:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:10] sergi0: ftr, in the future, it'd be great to get the -2 removed in advance. usually, a negative CR (even a -1, but especially a -2) is a reason to deny deployment. in this case, i'm comfortable going ahead because i attended the planning meeting when we agreed to do it today, but other deployers likely won't have that confidence. [20:04:35] sergi0: anyway, pulled to mwdebug1001. can you check please? [20:05:02] urbanecm: alright, thanks for mentioning that. I'll make sure for the future [20:05:05] testing now [20:05:20] urbanecm: i have a patch too, adding it now. Turns out i had an edit conflict and never noticed (so i just have a tab open waiting to save) [20:05:42] ebernhardson: ack, thanks for the ping. do you want me to ping when done, or should i deploy it for you? 
[20:06:12] urbanecm: you can ping me when done, works [20:06:17] okay, will do [20:06:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:24] urbanecm: testing done. Looks fine to me. [20:08:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:31] sergi0: thanks, syncing [20:09:35] (03PS3) 10Urbanecm: QuickSurveys: Increase coverage of 'research-incentive' survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811335 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:09:53] (03CR) 10Urbanecm: [C: 03+2] QuickSurveys: Increase coverage of 'research-incentive' survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811335 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:10:45] (03Merged) 10jenkins-bot: QuickSurveys: Increase coverage of 'research-incentive' survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811335 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:12:12] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b1c217103753d886ab5b18b88f112ec26931bff2: GrowthExperiments: End mailing list campaign on eswiki (T307985) (duration: 03m 39s) [20:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:16] T307985: End GrowthExperiments welcome email campaign - https://phabricator.wikimedia.org/T307985 [20:12:24] 
sergi0: should be live! [20:12:44] danisztls: your patch is at mwdebug1001 in case you want to test it (but it probably isn't testable) [20:13:18] urbanecm: it isn't but tested anyways, lgtm [20:13:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2173.codfw.wmnet with reason: host reimage [20:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:50] syncing [20:14:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2171.codfw.wmnet with OS bullseye [20:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:39] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2171.codfw.wmnet with OS bullseye completed: - db2... [20:17:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: host reimage [20:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 66c973087b7736b22ce7edb5b830e50e31710e4a: QuickSurveys: Increase coverage of research-incentive survey (T311015) (duration: 03m 28s) [20:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:50] T311015: Deploy QuickSurvey on Japanese Wikipedia - https://phabricator.wikimedia.org/T311015 [20:17:51] danisztls: should be live! 
[20:18:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:18:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2174.codfw.wmnet with OS bullseye [20:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:19] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2174.codfw.wmnet with OS bullseye [20:18:35] ebernhardson: go ahead (but please ping me once you're done; I'm waiting for Jdl.robson to appear) [20:19:23] urbanecm: thanks! [20:19:32] (03PS2) 10Ebernhardson: cirrus: Disable commonswiki writes to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811350 (https://phabricator.wikimedia.org/T309648) [20:19:37] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Disable commonswiki writes to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811350 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [20:20:22] (03Merged) 10jenkins-bot: cirrus: Disable commonswiki writes to cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811350 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [20:21:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:12] urbanecm: thanks! and sry, I got a brief disconnect [20:22:25] danisztls: no problem at all. 
good luck with the survey :) [20:24:54] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:811350|cirrus: Disable commonswiki writes to cloudelastic (T309648)]] (duration: 03m 23s) [20:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:57] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [20:25:54] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 149 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:26:38] ebernhardson: fyi ^^. not sure if alert is related, but MW alerts after a MW deploy are generally worth checking :) [20:26:46] taavi: thanks for saying those bits about your investigation on my puppet patch for striker breaking things. The x509 cert missing names and the lack of an ipv6 bind for port 8080 both make sense as to how they would break expectations. [20:26:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:11] ebernhardson: and, based on logstash, it looks related. a lot of errors like `[72a7dc62-fbd8-47d8-80cc-b20088137dc9] /rpc/RunSingleJob.php RuntimeException: Received cirrusSearchElasticaWrite job for an unwritable cluster cloudelastic.`. should we revert? or is it temporary? [20:29:01] (03PS16) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [20:29:41] hey urbanecm around shortly for mine [20:29:47] Jdlrobson: ack [20:29:51] (and hello) [20:30:34] ebernhardson: ping. the patch you just deployed appears to cause a lot of fatals.
[20:30:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:30:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2173.codfw.wmnet with OS bullseye [20:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:03] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2173.codfw.wmnet with OS bullseye completed: - db2... [20:31:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:51] urbanecm: that's expected, it will abandon the jobs already in the queue but it shouldn't create new ones [20:32:16] urbanecm: checking logstash, sec [20:32:23] thanks [20:32:34] (03CR) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 (039 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [20:33:18] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:45] not sure how large that queue was though, hmm [20:34:29] exceptions still flowing, unfortunately :/ [20:34:38] err, hmm. There was a 3 day backlog of jobs :S [20:34:53] that's not good. I'd revert. what do you think?
[20:34:56] turning the writes on isn't going to help much though...we are trying to fix the thing now that caused it to backlog [20:35:26] well, it's not safe to do MW deployments when the error log is cluttered with this [20:35:28] i can make a quick patch to squelch the errors, but otherwise they are expected and ok (there is another process we are already expecting to use to catch up the updates) [20:35:33] so it'll at least help with that [20:36:43] urbanecm: it's too late to restore the patch, the error will instead say that no index exists. I'll work up something to make these quieter, should only take a few [20:36:52] meh [20:36:56] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:06] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:37:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2174.codfw.wmnet with reason: host reimage [20:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:35] in that case, waiting on the make-it-quieter patch with the next deployment.
[20:39:54] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 158 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:41:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: host reimage [20:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:51] (03CR) 10Chad: [C: 03+2] mediawiki chart 0.2.3: Add before-hook-creation hook-delete-policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/803597 (owner: 10Ahmon Dancy) [20:45:07] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 124 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:46:37] (03Merged) 10jenkins-bot: mediawiki chart 0.2.3: Add before-hook-creation hook-delete-policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/803597 (owner: 10Ahmon Dancy) [20:47:44] (03PS1) 10Ebernhardson: job queue: Squelch errors related to unwritable cloudelastic [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811279 (https://phabricator.wikimedia.org/T309648) [20:48:18] (03PS1) 10Ebernhardson: job queue: Squelch errors related to unwritable cloudelastic [extensions/CirrusSearch] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811280 (https://phabricator.wikimedia.org/T309648) [20:51:32] ebernhardson: is that the patch you mentioned? [20:51:55] urbanecm: yea, that should squelch this set of errors. 
I was just looking over the job queue to make sure i'm not missing something else [20:52:04] okay okay [20:53:34] (03CR) 10Ebernhardson: [C: 03+2] "resolving high-volume prod log messages" [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811279 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [20:53:41] (03CR) 10Ebernhardson: [C: 03+2] "resolving high-volume prod log messages" [extensions/CirrusSearch] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811280 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [20:55:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2174.codfw.wmnet with OS bullseye [20:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:51] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2174.codfw.wmnet with OS bullseye completed: - db2... [20:56:07] urbanecm: will there be time for https://gerrit.wikimedia.org/r/c/811339/ ? [20:56:30] Jdlrobson: not in the window, but we can run over, there's nothing scheduled after it [20:56:33] is that fine? [20:56:41] that's fine for me. Is it fine for you? Do you have the time to help? [20:57:16] urbanecm: if you need to do something else i can ship Jdlrobson's patch once mine clears jenkins [20:57:28] thank you <3 [20:57:40] that'd be great, thanks. 
[20:59:49] (03PS2) 10Andrew Bogott: Revert "OpenStack keystone: add 'heat' service domain config" [puppet] - 10https://gerrit.wikimedia.org/r/811278 [21:00:36] (03CR) 10Andrew Bogott: [C: 03+2] Revert "OpenStack keystone: add 'heat' service domain config" [puppet] - 10https://gerrit.wikimedia.org/r/811278 (owner: 10Andrew Bogott) [21:02:00] (03CR) 10BCornwall: [C: 03+1] netops: add DNS probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [21:07:45] (03PS1) 10Ryan Kemper: elastic: disable saneitizer for perf reasons [puppet] - 10https://gerrit.wikimedia.org/r/811374 (https://phabricator.wikimedia.org/T309648) [21:08:15] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:14:17] (03Merged) 10jenkins-bot: job queue: Squelch errors related to unwritable cloudelastic [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/811279 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [21:14:21] (03Merged) 10jenkins-bot: job queue: Squelch errors related to unwritable cloudelastic [extensions/CirrusSearch] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/811280 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [21:14:39] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/811374 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [21:17:03] (03CR) 10Ebernhardson: [C: 03+1] "matches how we've done this in the past." [puppet] - 10https://gerrit.wikimedia.org/r/811374 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [21:17:39] ebernhardson: any idea on rough timing? i need to grab a coffee and not sure if I have time. 
[21:18:08] Jdlrobson: both patches are through jenkins, i'm deploying the wmf.19 one, and then a few minutes later (to make sure nothing silly happens) will ship wmf.18. Maybe 10 minutes or so? [21:18:18] okay great! thanks for the update [21:19:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:19:19] (03CR) 10Dzahn: [C: 03+2] "jenkins saying +2 means the string does not already exist in the repo" [puppet] - 10https://gerrit.wikimedia.org/r/810403 (owner: 10Dzahn) [21:19:47] !log ebernhardson@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Backport: [[gerrit:811280|job queue: Squelch errors related to unwritable cloudelastic (T309648)]] (duration: 03m 43s) [21:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:50] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [21:22:13] (03CR) 10Dzahn: "has there been an outcome from yesterday?" 
[puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [21:22:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:23:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:36] !log ebernhardson@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Backport: [[gerrit:811279|job queue: Squelch errors related to unwritable cloudelastic (T309648)]] (duration: 03m 37s) [21:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:40] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [21:27:45] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:29:12] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [21:29:51] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) 05Open→03Resolved @Marostegui All yours [21:31:17] 10SRE, 
10ops-codfw, 10decommission-hardware: decommission db2073 - https://phabricator.wikimedia.org/T311837 (10Papaul) [21:31:37] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2073 - https://phabricator.wikimedia.org/T311837 (10Papaul) 05Open→03Resolved complete [21:32:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:33:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:05] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2091 - https://phabricator.wikimedia.org/T311803 (10Papaul) [21:33:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:21] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2091 - https://phabricator.wikimedia.org/T311803 (10Papaul) 05Open→03Resolved complete [21:34:57] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:35:20] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:811350|cirrus: Disable commonswiki writes to cloudelastic (T309648)]] (duration: 03m 42s) [21:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:25] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [21:35:25] (03CR) 10Dzahn: doc: remove support for stretch, add support for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [21:35:26] 10SRE, 10Infrastructure-Foundations, 10LuaSandbox: Build and deploy php-luasandbox 3.0.1 to Wikimedia wikis - https://phabricator.wikimedia.org/T187673 (10Krinkle) 05Open→03Resolved a:03MoritzMuehlenhoff We're on luasandbox 3.0.3 according to . [21:35:32] (03PS3) 10Dzahn: doc: remove support for stretch, add support for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/810401 (https://phabricator.wikimedia.org/T247653) [21:37:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:42] (03PS1) 10Ebernhardson: Revert "cirrus: Disable commonswiki writes to cloudelastic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811376 [21:42:10] (03CR) 10Ebernhardson: [C: 03+2] "created and pushed from deploy1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811376 (owner: 10Ebernhardson) [21:43:01] (03Merged) 10jenkins-bot: Revert "cirrus: Disable commonswiki writes to cloudelastic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811376 (owner: 10Ebernhardson) [21:43:06] Jdlrobson: ok i can merge yours when ready. 
Sorry that took longer than expected :( [21:43:11] great [21:43:34] (03CR) 10Ebernhardson: [C: 03+2] Enable title above tabs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811339 (https://phabricator.wikimedia.org/T311773) (owner: 10Jdlrobson) [21:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:44] (03PS3) 10Ebernhardson: Enable title above tabs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811339 (https://phabricator.wikimedia.org/T311773) (owner: 10Jdlrobson) [21:47:54] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2092 - https://phabricator.wikimedia.org/T311802 (10Papaul) [21:47:57] (03CR) 10Ebernhardson: [C: 03+2] Enable title above tabs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811339 (https://phabricator.wikimedia.org/T311773) (owner: 10Jdlrobson) [21:48:08] never sure when it wants a rebase or not... 
[21:48:20] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2092 - https://phabricator.wikimedia.org/T311802 (10Papaul) 05Open→03Resolved complete [21:48:47] (03Merged) 10jenkins-bot: Enable title above tabs everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811339 (https://phabricator.wikimedia.org/T311773) (owner: 10Jdlrobson) [21:48:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:31] Jdlrobson: live on mwdebug1002 [21:49:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:49:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:49] ebernhardson: checked. Please sync [21:53:35] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:54:53] (03CR) 10JHathaway: spdx: Add csr files to the list of files to ignore. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [21:55:25] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:811339|Enable title above tabs everywhere (T311773)]] (duration: 03m 23s) [21:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:28] T311773: [Layout] Deploy title/tab order everywhere (Make it so that the title should always be above tabs) - https://phabricator.wikimedia.org/T311773 [21:55:29] Jdlrobson: should be all synced out now [21:55:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:56:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:11] thanks ebernhardson [21:57:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:26] (03PS1) 10BryanDavis: labweb: point tlsproxy envoy at port 8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/811381 (https://phabricator.wikimedia.org/T306469) [22:11:46] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) ps1-e[1-4] are now in Librenms [22:12:42] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/36194/" [puppet] - 10https://gerrit.wikimedia.org/r/811381 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [22:18:37] (03CR) 10Thcipriani: [C: 03+1] admin: add gitlab-roots group to gitlab_runner role [puppet] - 
10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn) [22:19:39] (03PS1) 10Papaul: Add new PDU in row E Eqiad [puppet] - 10https://gerrit.wikimedia.org/r/811382 (https://phabricator.wikimedia.org/T290899) [22:27:51] (03PS2) 10BryanDavis: hieradata: cloudweb-dev: route striker to the docker port [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [22:28:11] !log T309648 Manually restarting `cloudelastic1006` before proceeding to a normal rolling restart of cloudelasti [22:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:16] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [22:28:22] !log T309648 Manually restarting `cloudelastic1006` before proceeding to a normal rolling restart of cloudelastic [22:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:22] (03CR) 10Papaul: [C: 03+2] Add new PDU in row E Eqiad [puppet] - 10https://gerrit.wikimedia.org/r/811382 (https://phabricator.wikimedia.org/T290899) (owner: 10Papaul) [22:30:29] (03CR) 10BryanDavis: "PCC for PS2: https://puppet-compiler.wmflabs.org/pcc-worker1001/36195/" [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [22:41:52] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) ps1-e[1-4] are now in Icinga [22:43:01] (03CR) 10Mary Yang: "Thank you so much for your help!" 
[puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [22:48:44] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - ryankemper@cumin1001 - T309648 [22:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:48] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [22:53:03] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:58:18] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) ps1-e[1-4] are now in Grafana https://grafana.wikimedia.org/d/OBD1jy1Zk/filippo-pdu [22:58:43] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:04:21] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10wiki_willy) Thanks @Papaul! >>! In T290899#8053980, @Papaul wrote: > ps1-e[1-4] are now in Grafana > https://grafana.wikimedia.org/d/OBD1jy1Zk/filippo-pdu [23:11:27] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [23:15:38] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - ryankemper@cumin1001 - T309648 [23:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:43] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [23:19:17] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:22:57] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:30:34] !log start restore of commonswiki_file from thanos-swift to cloudelastic [23:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:52] (03CR) 10Dzahn: Add puppet profile and role files for wikifunctions. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [23:54:23] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook