[01:35:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1058.eqiad.wmnet}' (T419948)
[01:38:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:38:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:38:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:38:52] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:38:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:39:05] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:39:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:39:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:39:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:39:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:39:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:39:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:40:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:40:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99)
[01:40:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:40:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:40:36] !log
andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:40:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:40:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:40:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:41:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[01:41:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[01:44:04] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1058.eqiad.wmnet}' (T419948)
[01:45:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1049.eqiad.wmnet}' (T419948)
[01:46:48] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1049.eqiad.wmnet}' (T419948)
[01:47:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,nova
[01:54:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,nova
[01:54:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1049.eqiad.wmnet}' (T419948)
[02:06:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1049.eqiad.wmnet}' (T419948)
[02:06:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1048.eqiad.wmnet}' (T419948)
[02:30:07] !log
andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1048.eqiad.wmnet}' (T419948)
[02:30:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1047.eqiad.wmnet}' (T419948)
[02:55:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1047.eqiad.wmnet}' (T419948)
[02:56:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1046.eqiad.wmnet}' (T419948)
[02:56:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1047 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[03:01:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1047 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[03:16:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1046.eqiad.wmnet}' (T419948)
[03:17:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1045.eqiad.wmnet}' (T419948)
[03:42:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1045.eqiad.wmnet}' (T419948)
[03:42:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1044.eqiad.wmnet}' (T419948)
[03:54:16]
Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857#11712004 (Hawkeye7) Massviews was working on 7 December 2023. I am sure it was working in 2024.
[04:09:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1044.eqiad.wmnet}' (T419948)
[04:09:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1043.eqiad.wmnet}' (T419948)
[04:31:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1043.eqiad.wmnet}' (T419948)
[04:31:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1042.eqiad.wmnet}' (T419948)
[04:32:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1043 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[04:37:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1043 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[04:52:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1042.eqiad.wmnet}' (T419948)
[04:52:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}' (T419948)
[04:52:49] FIRING: NeutronAgentDown: Neutron
neutron-openvswitch-agent on cloudvirt1042 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[04:57:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1042 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[05:14:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1041.eqiad.wmnet}' (T419948)
[05:14:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1040.eqiad.wmnet}' (T419948)
[05:36:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1040.eqiad.wmnet}' (T419948)
[05:36:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1040 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[05:41:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1040 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[08:11:22] (open) r4356th: Preserve newlines [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/3
[08:20:52] (update) r4356th: Preserve newlines [toolforge-repos/delintbot] -
https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/3
[08:22:05] (update) r4356th: Preserve whitespace around newlines [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/3
[08:22:51] (update) r4356th: Preserve whitespace around newlines [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/3
[08:23:33] (merge) r4356th: Preserve whitespace around newlines [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/3
[08:48:32] FIRING: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[08:53:12] cloud-services-team, Cloud-VPS: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore - https://phabricator.wikimedia.org/T419996#11712337 (fgiunchedi) @taavi mentioned that https://gerrit.wikimedia.org/r/c/operations/homer/public/+/970275 might have broken this commun...
[09:03:32] RESOLVED: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:06:32] FIRING: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:06:32] FIRING: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:18:49] Tool-delintbot: Fix cases of tags not being closed correctly - https://phabricator.wikimedia.org/T417483#11712398 (Kavaljeet_Singh) I have cloned the repository and explored the codebase.
It looks like page text processing is handled through the performreplacements() function in str_replacements.py which is...
[09:21:32] RESOLVED: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:21:32] RESOLVED: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:45:40] cloud-services-team, Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177 (fnegri) NEW
[09:46:33] cloud-services-team, Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11712598 (fnegri)
[09:53:34] cloud-services-team, Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11712670 (fnegri) The automatic restart did not restart replication, so the host is currently lagging behind. @taavi depooled the host today, so...
[09:57:07] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11712706 (fnegri) p:Triage→High a:fnegri
[09:58:19] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11712711 (FCeratto-WMF) For context, the upgrade included also: ` 2026-03-13 09:46:26 status installed linux-image-6.1.0-44-a...
[10:05:00] (open) r4356th: Only fix closing tags if the tag itself is known [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/4
[10:08:11] (update) r4356th: Only fix closing tags if the tag itself is known [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/4
[10:12:42] (update) r4356th: Only fix closing tags if the tag itself is known [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/4
[10:13:15] (merge) r4356th: Only fix closing tags if the tag itself is known [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/4
[10:49:36] (open) r4356th: Strip the opening quote if it does not have a corresponding end quote in an attribute's value [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/5
[10:49:52] (update) r4356th: Strip the opening quote if it does not have a corresponding end quote in an attribute's value [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/5
[10:50:02] (merge) r4356th: Strip the opening quote if it does not have a corresponding end quote in an attribute's value [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/5
[10:51:31] Tool-campwiz-nxt, Google-Summer-of-Code (2026): GSoC 2026: CampWiz NxT Redesign - https://phabricator.wikimedia.org/T414269#11712914 (Only-Vikas) Hi @Nokib_Sarkar and @Tiven2240! Now that the contribution period has officially opened, I am excited to share my progress on the CampWiz NxT Redesign. I have...
[11:10:50] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Permanently set 'noout' for cloudceph - https://phabricator.wikimedia.org/T419877#11713011 (Volans) I brought this up to the ceph working group and the only comments I got were from Ben that was a bit surprised about our strategy and wanted to know more...
[11:23:37] Tool-delintbot: Fix cases of tags not being closed correctly - https://phabricator.wikimedia.org/T417483#11713039 (Redmin) Oh, sorry, I have moved the code for lint replacements to https://gitlab.wikimedia.org/toolforge-repos/delintbot in the meantime. The code should go in `delinter.py` (either inside the `...
[11:29:28] cloud-services-team, Data-Services: [wikireplicas] Add an option to cookbooks to specify which hosts should be targeted - https://phabricator.wikimedia.org/T393387#11713091 (fnegri) Related: {T273199}
[11:30:10] cloud-services-team, Data-Services: Give the wmcs.wikireplicas.add_wiki cookbook a way to exclude a host - https://phabricator.wikimedia.org/T273199#11713095 (fnegri) →Duplicate dup:T393387
[11:30:12] cloud-services-team, Data-Services: [wikireplicas] Add an option to cookbooks to specify which hosts should be targeted - https://phabricator.wikimedia.org/T393387#11713097 (fnegri)
[11:34:53] Tool-delintbot: Fix multi-colon-escape errors for soft redirect template users - https://phabricator.wikimedia.org/T420197 (Redmin) NEW
[11:35:07] Tool-delintbot: Fix multi-colon-escape errors for soft redirect template users - https://phabricator.wikimedia.org/T420197#11713122 (Redmin) p:Triage→High a:Redmin
[11:44:38] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11713149 (Ladsgroup) >>! In T420177#11712669, @fnegri wrote: > The automatic restart did not restart replication, so the host...
[11:49:04] (open) r4356th: Fix multi-colon-escape errors for soft redirect template users [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/6 (https://phabricator.wikimedia.org/T420197)
[11:49:14] (update) r4356th: Fix multi-colon-escape errors for soft redirect template users [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/6 (https://phabricator.wikimedia.org/T420197)
[11:50:02] (merge) r4356th: Fix multi-colon-escape errors for soft redirect template users [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/6 (https://phabricator.wikimedia.org/T420197)
[12:04:52] Tool-delintbot, Patch-For-Review: Fix multi-colon-escape errors for soft redirect template users - https://phabricator.wikimedia.org/T420197#11713209 (Redmin) Open→Resolved Done with that MR, https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/commit/8d6c47c4eae0ba13cbfc9dfc518a94bd4e72a...
[12:08:49] Tool-delintbot, Patch-For-Review: Fix multi-colon-escape errors for soft redirect template users - https://phabricator.wikimedia.org/T420197#11713233 (Redmin) Resolved→Open The query needs to be changed so the `lint_template = ''` check is not added for lint category 11 (multi-colon-escape).
[12:25:53] cloud-services-team, Data-Services, Data-Persistence: Extend sre.mysql.upgrade to work with multiinstance hosts - https://phabricator.wikimedia.org/T420203 (fnegri) NEW
[12:26:14] cloud-services-team, Data-Services, Data-Persistence: Extend sre.mysql.upgrade to work with multiinstance hosts - https://phabricator.wikimedia.org/T420203#11713319 (fnegri)
[12:26:33] (open) l10n-bot: Localisation updates from https://translatewiki.net.
[toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/62
[12:26:33] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/34
[12:28:04] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1253474 (owner: L10n-bot)
[12:30:24] Tool-delintbot, Patch-For-Review: Fix multi-colon-escape errors for soft redirect template users - https://phabricator.wikimedia.org/T420197#11713323 (Redmin) Open→Resolved Done with https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/commit/50ba71eaea146499c89e41460e82c5732398d685.
[12:32:38] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11713333 (fnegri) > I think the default behavior is not to start replication. Just issue "start slave" and it should be fineT...
[12:33:40] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Persistence: clouddb1013 crashed after the upgrade to mariadb 10.11.16 - https://phabricator.wikimedia.org/T420177#11713345 (fnegri) Open→In progress
[13:06:27] PROBLEM - mysqld processes on an-redacteddb1001 is CRITICAL: PROCS CRITICAL: 9 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:07:53] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Engineering, Data-Platform-SRE (2026-03-06 - 2026-03-27): Drop support for cl_to, cl_collation and il_to from wikireplicas - https://phabricator.wikimedia.org/T417492#11713534 (BTullis) >>! In T417492#11703318, @fnegri wrote: > I ran th...
[13:08:53] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Engineering, Data-Platform-SRE (2026-03-06 - 2026-03-27): Drop support for cl_to, cl_collation and il_to from wikireplicas - https://phabricator.wikimedia.org/T417492#11713539 (fnegri) In progress→Resolved a:fnegri
[13:10:02] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, Data-Engineering, Data-Platform-SRE (2026-03-06 - 2026-03-27): Drop support for cl_to, cl_collation and il_to from wikireplicas - https://phabricator.wikimedia.org/T417492#11713550 (fnegri) a:fnegri→Zabe
[13:19:10] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/62 (owner: l10n-bot)
[13:19:13] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/62 (owner: l10n-bot)
[13:23:42] (update) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/34 (owner: l10n-bot)
[13:25:26] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/34 (owner: l10n-bot)
[13:25:29] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net.
[toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/34 (owner: l10n-bot)
[13:37:27] RECOVERY - mysqld processes on an-redacteddb1001 is OK: PROCS OK: 8 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:45:37] cloud-services-team, Cloud-VPS: Reimage cloudgw hosts to Trixie - https://phabricator.wikimedia.org/T401899#11713673 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudgw1003.eqiad.wmnet with OS trixie
[13:52:16] cloud-services-team, Cloud-VPS: Deprecate and remove 'bastion-restricted' hosts - https://phabricator.wikimedia.org/T420213 (Andrew) NEW
[13:55:28] (open) r4356th: Preserve content inside code tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/7
[13:55:56] (merge) r4356th: Preserve content inside code tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/7
[14:00:42] cloud-services-team, Cloud-VPS: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore - https://phabricator.wikimedia.org/T419996#11713763 (taavi) Since that firewall change is "correct" in terms of the administrative policy we want to do, and the cloudcumin hosts live...
[14:17:56] cloud-services-team, Cloud-VPS: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore - https://phabricator.wikimedia.org/T419996#11713865 (fgiunchedi) I agree cloudcumin talking via prod http proxy like any other client is the right fix here. @Volans what do you think...
[14:29:05] cloud-services-team, Cloud-VPS: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore - https://phabricator.wikimedia.org/T419996#11713923 (Volans) Conceptually that could work for me, but I fear that we might need to patch cumin for that. Given that keystoneauth1 uses...
[14:34:33] cloud-services-team, Cloud-VPS: Reimage cloudgw hosts to Trixie - https://phabricator.wikimedia.org/T401899#11713954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudgw1003.eqiad.wmnet with OS trixie completed: - cloudgw1003 (**PASS**) - Downtimed on I...
[14:45:52] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Increased openstack latency and rabbitmq rolling restarts on certificate update - https://phabricator.wikimedia.org/T418444#11713984 (fgiunchedi) Confirmed that rabbitmq reloads certs without a restart: ` cloudrabbit1001:~$ sudo systemctl status rabbit...
[15:21:35] PROBLEM - mysqld processes on clouddb1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:24:05] PROBLEM - Host clouddb1022 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:43] RECOVERY - Host clouddb1022 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[15:26:35] PROBLEM - mysqld processes on clouddb1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:29:35] RECOVERY - mysqld processes on clouddb1022 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:40:06] VPS-project-Phabricator, collaboration-services: phabricator.wmcloud.org account verification request: Pppery - https://phabricator.wikimedia.org/T420149#11714255 (ABran-WMF) Open→In progress a:Dzahn
[15:40:18] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: 429 Too Many
Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857#11714261 (daniel) >>! In T219857#11704747, @MusikAnimal wrote: > It looks like the rate limiting policy might have cha...
[15:57:32] PROBLEM - Host clouddb1024 is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:42] RECOVERY - Host clouddb1024 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[16:00:34] PROBLEM - mysqld processes on clouddb1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[16:03:34] RECOVERY - mysqld processes on clouddb1024 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[16:03:43] (PS1) Btullis: Add dummy analytics-wikidata keytabs [labs/private] - https://gerrit.wikimedia.org/r/1253550 (https://phabricator.wikimedia.org/T404073)
[16:03:48] VPS-project-Phabricator, collaboration-services: phabricator.wmcloud.org account verification request: Pppery - https://phabricator.wikimedia.org/T420149#11714396 (Dzahn) Hello @Pppery I tried to verify you but the email address that is associated in LDAP with the user called pppery does not exist in th...
[16:03:57] (CR) Btullis: [V:+2 C:+2] Add dummy analytics-wikidata keytabs [labs/private] - https://gerrit.wikimedia.org/r/1253550 (https://phabricator.wikimedia.org/T404073) (owner: Btullis)
[16:05:02] cloud-services-team, Cloud-VPS (Quota-requests): Add floating IP and vanity domain for azwikimedia project - https://phabricator.wikimedia.org/T419582#11714405 (fnegri) @Nemoralis your plan looks fine. For the PTR record, can you please create a sub-task? We should be able to configure it for you. Re:...
[16:38:40] Toolforge (Toolforge iteration 26): [harbor,tools] Harbor object usage in S3 is steadily increasing - https://phabricator.wikimedia.org/T418528#11714589 (Raymond_Ndibe) I digged deeper into this.
https://github.com/goharbor/harbor/issues/22111 is one of our problems, but is not the major one. Below are other...
[16:44:02] VPS-project-Phabricator, collaboration-services: phabricator.wmcloud.org account verification request: Pppery - https://phabricator.wikimedia.org/T420149#11714616 (Pppery) I apparently used perry@olum.org on the test instance and mapreader@olum.org on LDAP. I created the account a while ago, don't rememb...
[16:55:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T406516)
[16:55:40] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516
[16:55:55] (open) r4356th: Correctly preserve nested nowiki, code, syntaxhighlight tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/8
[17:01:18] (update) r4356th: Correctly preserve nested nowiki, code, syntaxhighlight tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/8
[17:02:02] (merge) r4356th: Correctly preserve nested nowiki, code, syntaxhighlight tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/8
[17:02:43] (update) r4356th: Correctly preserve nested nowiki, code, syntaxhighlight tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/8
[17:03:15] (update) r4356th: Correctly preserve nested nowiki, code, syntaxhighlight tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/8
[17:03:17] FIRING: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:03:36] PROBLEM - Host cloudservices1005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:04:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1005.eqiad.wmnet' (T406516)
[17:04:25] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516
[17:04:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T406516)
[17:05:09] (open) bd808: ci: Update pre-commit dependencies and fix new lint errors [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/68
[17:05:35] VPS-project-Phabricator, collaboration-services: phabricator.wmcloud.org account verification request: Pppery - https://phabricator.wikimedia.org/T420149#11714728 (Dzahn) Gotcha! Done. ` dzahn@phabricator-bullseye:/srv/phab/phabricator/bin$ sudo ./auth verify perry@olum.org Done.
`
[17:06:05] FIRING: [2x] HostBGPDown: BGP session for cloudservices1005 (172.20.2.4) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown
[17:06:06] RECOVERY - Host cloudservices1005 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[17:06:14] PROBLEM - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:06:14] PROBLEM - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:06:14] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:07:04] RECOVERY - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.026 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:07:04] RECOVERY - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.027 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud.
60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:07:04] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.027 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:08:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:08:47] VPS-project-Phabricator, collaboration-services: phabricator.wmcloud.org account verification request: Pppery - https://phabricator.wikimedia.org/T420149#11714735 (Dzahn) In progress→Resolved [17:09:35] (approved) jforrester: ci: Update pre-commit dependencies and fix new lint errors [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/68 (owner: bd808) [17:09:59] (merge) jforrester: ci: Update pre-commit dependencies and fix new lint errors [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/68 (owner: bd808) [17:10:13] (update) jforrester: channels: Remove wikimedia-collaboration [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/67 (owner: taavi) [17:11:05] RESOLVED: [2x] HostBGPDown: BGP session for cloudservices1005 (172.20.2.4) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown [17:11:52] cloud-services-team, Cloud-VPS, Patch-For-Review: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore - https://phabricator.wikimedia.org/T419996#11714769 (fgiunchedi) We discussed 
this in the team meeting today: to restore functionality I have https://gerrit.wik... [17:13:08] (approved) jforrester: channels: Remove wikimedia-collaboration [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/67 (owner: taavi) [17:13:11] (merge) jforrester: channels: Remove wikimedia-collaboration [toolforge-repos/wikibugs2] - https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/67 (owner: taavi) [17:13:17] FIRING: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:13:21] VPS-project-Phabricator, collaboration-services, Phabricator: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11714778 (Dzahn) [17:14:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1006.eqiad.wmnet' (T406516) [17:14:46] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [17:17:35] FIRING: [4x] HostBGPDown: BGP session for cloudservices1005 (172.20.2.4) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown [17:18:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdns in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:20:18] Tool-phab-ban, Phabricator: Temporary ban feature from phab-ban to quickly response to Phabricator vandalism - https://phabricator.wikimedia.org/T420136#11714804 (bd808) Declined→Invalid 
[17:22:35] RESOLVED: [4x] HostBGPDown: BGP session for cloudservices1005 (172.20.2.4) is down - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DHostBGPDown [17:24:10] Tool-inteGraality: Function "" is currently not supported by QLever. - https://phabricator.wikimedia.org/T420247 (JeanFred) NEW [17:24:23] VPS-project-Codesearch, VerySmallGLAM, Wikibase Suite Team: Indexing of wikibase related repos - https://phabricator.wikimedia.org/T420067#11714828 (Dzahn) It would be great if WMDE could prioritize T374926 before we add Github repos to our search. [17:40:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.set_maintenance (T406516) [17:40:24] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [17:42:20] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudweb.set_maintenance (exit_code=99) (T406516) [17:42:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1006.eqiad.wmnet' (T406516) [17:48:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [17:48:22] FIRING: HAProxyBackendUnavailable: HAProxy service mysql backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:53:10] RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - 
https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [17:53:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service mysql backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:59:52] FIRING: [13x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:00:10] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:01:30] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1006.eqiad.wmnet' (T406516) [18:01:36] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [18:04:52] FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:05:10] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:05:35] !log 
andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1007.eqiad.wmnet' (T406516) [18:09:52] RESOLVED: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:10:40] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:10:52] FIRING: [4x] HAProxyBackendUnavailable: HAProxy service heat-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:11:07] FIRING: [5x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:11:22] FIRING: [13x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:15:25] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:15:52] RESOLVED: [16x] HAProxyBackendUnavailable: 
HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:17:41] VPS-project-Phabricator, collaboration-services, Phabricator: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11715141 (Aklapper) @Dzahn: Hmmm how does this affect the production instance? Or what did you have in mind by adding the... [18:22:17] FIRING: JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:22] FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:25:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1007.eqiad.wmnet' (T406516) [18:25:34] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [18:26:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1011.eqiad.wmnet' (T406516) [18:27:17] FIRING: [2x] JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:22] FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend 
cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:32:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:33:22] RESOLVED: [16x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:39:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudcontrol1011.eqiad.wmnet' (T406516) [18:40:04] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [18:41:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1011.eqiad.wmnet' (T406516) [18:49:25] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: 429 Too Many Requests hit despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715341 (MusikAnimal) [18:50:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:51:52] FIRING: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:51:55] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:52:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1011.eqiad.wmnet' (T406516) [18:52:09] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [18:53:55] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715367 (daniel) [18:55:10] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:56:27] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715374 (daniel) @MusikAnimal how many requests does this tool need to make to provide a useful respons... 
[18:56:52] RESOLVED: [15x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:59:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1006.eqiad.wmnet' (T406516) [18:59:45] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [19:02:17] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715391 (MusikAnimal) >>! In T219857#11715374, @daniel wrote: > @MusikAnimal how many requests does thi... [19:03:13] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715405 (MusikAnimal) And heck, for Massviews specifically, maybe it's not too much to ask for users to... [19:06:28] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715406 (Hawkeye7) Can we roll back this 429 change? I know I only use it once a year or so, but I rea... 
[19:08:24] PROBLEM - Host cloudnet1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:27] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet1006.eqiad.wmnet' (T406516) [19:10:34] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [19:11:06] RECOVERY - Host cloudnet1006 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:13:19] cloud-services-team, SRE Observability, Traffic, Patch-For-Review: Move wikimediastatus.net 301 to ncredir - https://phabricator.wikimedia.org/T419887#11715415 (ssingh) Thanks for the task and the patch @colewhite. We will discuss this in Traffic and follow up here or on the CR itself. [19:13:49] FIRING: [4x] NeutronAgentDown: Neutron neutron-metadata-agent on cloudnet1006 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:15:44] VPS-project-Phabricator, collaboration-services, Phabricator: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11715419 (Dzahn) @Aklapper The last question above was about a configuration change. Configuration changes affect all ins... [19:15:56] FIRING: [4x] SystemdUnitDown: The service unit neutron-dhcp-agent.service is in failed status on host cloudnet1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudnet1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:18:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1005.eqiad.wmnet' (T406516) [19:18:20] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [19:20:56] RESOLVED: [4x] SystemdUnitDown: The service unit neutron-dhcp-agent.service is in failed status on host cloudnet1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudnet1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:23:49] RESOLVED: [4x] NeutronAgentDown: Neutron neutron-metadata-agent on cloudnet1006 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:42:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1075.eqiad.wmnet}' (T419948) [19:43:33] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1075.eqiad.wmnet}' (T419948) [19:43:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1074.eqiad.wmnet' (T406516) [19:48:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:51:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1074.eqiad.wmnet' (T406516) [19:51:31] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [19:51:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1073.eqiad.wmnet' (T406516) [19:57:25] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715558 (MusikAnimal) >>! In T219857#11715391, @MusikAnimal wrote: > … we could simply let the server a... [19:59:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1073.eqiad.wmnet' (T406516) [19:59:13] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [19:59:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1072.eqiad.wmnet' (T406516) [20:00:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:02:17] FIRING: JobUnavailable: Reduced availability for job openstack in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[20:03:56] RESOLVED: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:05:22] RESOLVED: [5x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:07:04] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1072.eqiad.wmnet' (T406516) [20:07:11] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:07:17] RESOLVED: JobUnavailable: Reduced availability for job openstack in cloud@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:07:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1071.eqiad.wmnet' (T406516) [20:14:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1071.eqiad.wmnet' (T406516) [20:14:28] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:14:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1070.eqiad.wmnet' (T406516) [20:16:48] Tool-phab-ban: Consider enabling permanent sessions for clients with poor session scoped cookie handling - https://phabricator.wikimedia.org/T420147#11715657 
(bd808) [20:21:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1070.eqiad.wmnet' (T406516) [20:21:47] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:22:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1060.eqiad.wmnet' (T406516) [20:30:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1060.eqiad.wmnet' (T406516) [20:30:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1061.eqiad.wmnet' (T406516) [20:30:16] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:31:26] Tool-phab-ban: Consider enabling permanent sessions for clients with poor session scoped cookie handling - https://phabricator.wikimedia.org/T420147#11715697 (bd808) p:Triage→Low I'm pretty sure the behavior described in the use case is a broken user-agent or a user-agent that believes it has been as... 
[20:37:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1061.eqiad.wmnet' (T406516) [20:37:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1062.eqiad.wmnet' (T406516) [20:37:16] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:44:23] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1062.eqiad.wmnet' (T406516) [20:44:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1063.eqiad.wmnet' (T406516) [20:44:30] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:51:04] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1063.eqiad.wmnet' (T406516) [20:51:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1064.eqiad.wmnet' (T406516) [20:51:11] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [20:58:11] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1064.eqiad.wmnet' (T406516) [20:58:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1065.eqiad.wmnet' (T406516) [20:58:19] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [21:05:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1065.eqiad.wmnet' (T406516) [21:05:16] !log andrew@cloudcumin1001 admin START - Cookbook 
wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1066.eqiad.wmnet' (T406516) [21:05:27] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [21:12:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1066.eqiad.wmnet' (T406516) [21:12:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1067.eqiad.wmnet' (T406516) [21:12:31] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [21:19:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1067.eqiad.wmnet' (T406516) [21:19:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1068.eqiad.wmnet' (T406516) [21:19:28] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [21:20:37] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715954 (daniel) >>! In T219857#11715405, @MusikAnimal wrote: > And heck, for Massviews specifically, m... 
[21:26:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1068.eqiad.wmnet' (T406516) [21:26:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1069.eqiad.wmnet' (T406516) [21:26:24] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [21:28:18] Tool-Pageviews, Data-Engineering, Data-Engineering-Icebox, Pageviews-API: massviews hits 429 Too Many Requests despite making requests synchronously - https://phabricator.wikimedia.org/T219857#11715985 (MusikAnimal) Thanks so much for the help! >> And heck, for Massviews specifically, maybe it's... [21:33:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1069.eqiad.wmnet' (T406516) [21:33:44] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [21:58:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1050.eqiad.wmnet' (T406516) [21:58:37] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:02:12] cloud-services-team, DC-Ops, ops-codfw, SRE: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282 (Andrew) NEW [22:05:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1050.eqiad.wmnet' (T406516) [22:05:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1051.eqiad.wmnet' (T406516) [22:05:38] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:06:22] FIRING: [3x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend 
cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [22:07:22] FIRING: [2x] HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [22:11:22] RESOLVED: [5x] HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [22:12:22] RESOLVED: [2x] HAProxyServiceUnavailable: HAProxy service designate-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [22:12:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1051.eqiad.wmnet' (T406516) [22:12:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1052.eqiad.wmnet' (T406516) [22:12:28] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:18:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1052.eqiad.wmnet' (T406516) [22:18:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1053.eqiad.wmnet' (T406516) [22:19:03] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:25:43] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 
'cloudvirt1053.eqiad.wmnet' (T406516) [22:25:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1054.eqiad.wmnet' (T406516) [22:25:50] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:32:43] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1054.eqiad.wmnet' (T406516) [22:32:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1055.eqiad.wmnet' (T406516) [22:32:49] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:39:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1055.eqiad.wmnet' (T406516) [22:39:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1056.eqiad.wmnet' (T406516) [22:39:40] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:46:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1056.eqiad.wmnet' (T406516) [22:46:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1057.eqiad.wmnet' (T406516) [22:46:35] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [22:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:53:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1057.eqiad.wmnet' (T406516) [22:53:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1058.eqiad.wmnet' (T406516) [22:53:44] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [23:00:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1058.eqiad.wmnet' (T406516) [23:00:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1059.eqiad.wmnet' (T406516) [23:00:38] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516 [23:07:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1059.eqiad.wmnet' (T406516) [23:07:45] T406516: Upgrade openstack to version 'Flamingo' - https://phabricator.wikimedia.org/T406516