[00:10:15] FIRING: [3x] ProbeDown: Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:10:24] cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T382290 (phaultfinder) NEW
[00:15:15] RESOLVED: [3x] ProbeDown: Service virt.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:15:31] cloud-services-team, Cloud-VPS: 'backy2 cleanup' fails on cloudbackup1004 - https://phabricator.wikimedia.org/T381548#10407914 (Andrew) indeed, that same postgres crash is happening again
[04:10:22] (open) ibrahemqasim: Search in select list [toolforge-repos/whattodo] - https://gitlab.wikimedia.org/toolforge-repos/whattodo/-/merge_requests/1
[05:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:07:34] (update) ibrahemqasim: Search in select list [toolforge-repos/whattodo] - https://gitlab.wikimedia.org/toolforge-repos/whattodo/-/merge_requests/1
[07:17:03] (update) ibrahemqasim: Search in select list [toolforge-repos/whattodo] - https://gitlab.wikimedia.org/toolforge-repos/whattodo/-/merge_requests/1
[07:18:23] (update) ibrahemqasim: Search in select list [toolforge-repos/whattodo] - https://gitlab.wikimedia.org/toolforge-repos/whattodo/-/merge_requests/1
[08:32:07] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:37:06] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:42:06] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:20:21] cloud-services-team (FY2024/2025-Q1-Q2): ProbeDown Service wan.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://phabricator.wikimedia.org/T382222#10408635 (cmooney) FWIW I ran a little test from here with similar results, 0.1% loss to the cloudg...
[10:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:31:25] cloud-services-team, Cloud-VPS, SRE Observability (FY2024/2025-Q2): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10408775 (cmooney) @fgiunchedi thanks for bringing this one up. In Netops we make little use of the LibreNMS stats exported to Gr...
[11:52:19] (PS1) FNegri: Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258)
[11:53:53] (CR) Volans: [C:+1] "Thanks! LGTM" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258) (owner: FNegri)
[11:55:54] (PS2) FNegri: Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258)
[11:55:57] (CR) CI reject: [V:-1] Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258) (owner: FNegri)
[11:59:22] (CR) CI reject: [V:-1] Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258) (owner: FNegri)
[12:04:22] (PS3) FNegri: Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258)
[12:08:07] (CR) Volans: [C:+1] "LGTM" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258) (owner: FNegri)
[12:35:05] (merge) ladsgroup: Search in select list [toolforge-repos/whattodo] - https://gitlab.wikimedia.org/toolforge-repos/whattodo/-/merge_requests/1 (owner: ibrahemqasim)
[13:09:46] wikitech.wikimedia.org, serviceops-radar, SRE, SRE-Unowned: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10409212 (Andrew) a:Andrew Can't promise that I'll finish this task but I'm currently working on making an httrack copy of wikitech
[13:17:34] tool-wscontest: WS Contest has stopped updating its score - https://phabricator.wikimedia.org/T382336 (Dibya) NEW
[13:19:01] tool-wscontest: WS Contest has stopped updating its score - https://phabricator.wikimedia.org/T382336#10409248 (Dibya)
[13:46:11] (CR) FNegri: [C:+2] Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258) (owner: FNegri)
[13:50:18] cloud-services-team (FY2024/2025-Q1-Q2): ProbeDown Service wan.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://phabricator.wikimedia.org/T382222#10409357 (fnegri) Open→Resolved a:fnegri My theory is that there are some events th...
[13:51:04] (Merged) jenkins-bot: Add default owner_team for wmcs-cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1104983 (https://phabricator.wikimedia.org/T379258) (owner: FNegri)
[13:55:55] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10409366 (fnegri) More errors were logged last night, again around 00:00 UTC (a bit earlier than the previous night though): ` fnegri@cloudgw1002:~$ sudo journalctl -k -...
[13:56:58] cloud-services-team, Cloud-VPS, SRE Observability (FY2024/2025-Q2): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10409367 (fgiunchedi) Thank you for the extensive explanation @cmooney ! Yes definitely let's go over the issues you outlined in t...
[14:05:08] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10409380 (cmooney) I don't really know what is the cause of all the messages here. Not unlikely this is happening due to brief moments of CPU pressure. https://blogs.or...
[14:27:51] cloud-services-team, Toolforge, Phabricator, GitLab (Auth & Access): Look for ways to consolidate "we trust this human" access lists - https://phabricator.wikimedia.org/T364516#10409459 (Aklapper)
[14:40:49] cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T382290#10409546 (fnegri)
[14:40:50] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10409547 (fnegri)
[14:40:53] cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T382290#10409548 (fnegri) Open→Resolved a:fnegri
[14:41:55] cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T382290#10409564 (fnegri)
[14:41:56] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10409563 (fnegri)
[14:41:58] cloud-services-team (FY2024/2025-Q1-Q2): ProbeDown Service wan.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://phabricator.wikimedia.org/T382222#10409565 (fnegri)
[14:42:05] cloud-services-team: ProbeDown - https://phabricator.wikimedia.org/T382290#10409567 (fnegri)
[14:42:07] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10409566 (fnegri)
[14:42:08] cloud-services-team (FY2024/2025-Q1-Q2): ProbeDown Service wan.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://phabricator.wikimedia.org/T382222#10409568 (fnegri)
[15:05:01] cloud-services-team, Cloud-VPS: Trove volume resize doesnt always (ever?) work - https://phabricator.wikimedia.org/T381959#10409638 (fnegri)
[15:05:25] cloud-services-team, DC-Ops, ops-codfw, SRE: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10409640 (Jhancock.wm) Open→Resolved a:Jhancock.wm we will get another ticket if this gets triggered again. we can link...
[15:16:58] cloud-services-team, Cloud-VPS: Trove DB for glamtools/baglama2 - https://phabricator.wikimedia.org/T382058#10409685 (fnegri) p:Triage→Medium
[15:17:03] cloud-services-team, Cloud-VPS: 'backy2 cleanup' fails on cloudbackup1004 - https://phabricator.wikimedia.org/T381548#10409686 (Andrew) Just setting ` shared_buffers = 4GB ` seems to work around the issue.
[15:18:32] cloud-services-team, Toolforge: Toolforge Build Service does not support .python-version - https://phabricator.wikimedia.org/T381923#10409695 (fnegri)
[15:18:34] cloud-services-team, Toolforge: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble - https://phabricator.wikimedia.org/T380127#10409696 (fnegri)
[15:18:40] cloud-services-team, Toolforge: Upgrade python buildpack to v0.17.0 or newer for Poetry support - https://phabricator.wikimedia.org/T374056#10409697 (fnegri)
[15:19:14] cloud-services-team, Toolforge: Toolforge Build Service does not support .python-version - https://phabricator.wikimedia.org/T381923#10409701 (fnegri) Open→Stalled p:Triage→Medium Setting to "Stalled" until the subtasks are completed.
[15:23:07] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T381453#10409744 (fnegri) p:Triage→Medium a:fnegri
[15:27:13] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10409776 (fnegri) a:fnegri
[15:43:28] FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-legacy-redirector-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:55:02] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10409862 (Jhancock.wm)
[15:58:28] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-legacy-redirector-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:59:09] cloud-services-team, Cloud-VPS: Trove DB for glamtools/baglama2 - https://phabricator.wikimedia.org/T382058#10409878 (Magnus) Open→Resolved a:Magnus I have dropped the indices for the table, it works now.
[16:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:21:30] (PS1) Urbanecm: Switch to pymysql [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1105031
[16:21:30] (PS1) Urbanecm: jobs: Switch to Python 3.11 [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1105032
[16:22:19] (CR) Urbanecm: [C:+2] jobs: Switch to Python 3.11 [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1105032 (owner: Urbanecm)
[16:22:19] (CR) Urbanecm: [C:+2] Switch to pymysql [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1105031 (owner: Urbanecm)
[16:22:46] (Merged) jenkins-bot: Switch to pymysql [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1105031 (owner: Urbanecm)
[16:22:53] (Merged) jenkins-bot: jobs: Switch to Python 3.11 [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1105032 (owner: Urbanecm)
[16:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:31:40] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10410029 (aborrero)
[16:37:20] cloud-services-team: PuppetFailure Puppet has failed on cloudbackup2003:9100 - https://phabricator.wikimedia.org/T381600#10410068 (fnegri) Open→Resolved a:fnegri This is not firing anymore. The logs don't go back far enough, so I'm not sure what caused this.
[16:42:31] cloud-services-team (FY2024/2025-Q1-Q2): KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220#10410080 (fnegri) @aborrero mentioned on IRC that this server had similar issues a couple months ago: {T376589}
[16:54:00] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, PAWS, Patch-For-Review: Restrict outbound connectivity from PAWS hosts - https://phabricator.wikimedia.org/T381373#10410121 (fnegri) > I would if possible like to keep the specific NAT rule in place for now, so that maybe in a week's time we can...
[17:01:32] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10410142 (Jhancock.wm) a:Jhancock.wm
[18:00:24] Toolforge-standards-committee: Adoption request for bullseye - https://phabricator.wikimedia.org/T380537#10410305 (AntiCompositeNumber) - `IPCHECK_KEY` is https://ipcheck.toolforge.org/, which is still half-alive but not actively maintained by @SQL and @MusikAnimal. - `SPUR_KEY` was a WMF grant funded key th...
[18:11:21] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T381453#10410340 (LibUp-bot) A new upstream version of Pywikibot is now available: 9.6.1. * https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Pywikibot_image * https://gerrit.wikimedia.org/g/py...
[18:11:50] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T381453#10410344 (taavi) (Please ignore the dupe comment, testing for {T381647})
[18:46:14] cloud-services-team: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356 (Andrew) NEW
[18:47:42] cloud-services-team: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356#10410456 (Andrew)
[18:47:43] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455#10410457 (Andrew)
[19:21:25] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10410689 (Jclark-ctr) So that was my mistake i have found out from dell that it only supports 6x dimms for cpu2. 10x dimm for cpu1. Al...
[19:21:30] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10410690 (Jclark-ctr) In progress→Resolved
[21:23:43] Toolforge-standards-committee: Adoption request for bullseye - https://phabricator.wikimedia.org/T380537#10411087 (LucasWerkmeister) Alright, suggested course of action: - keep `SECRET_KEY` and `replica.my.cnf` as relatively harmless - keep `IPCHECK_KEY` to reduce disruption – if the IPCheck owners disagree...
[22:19:23] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team (Seen): Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup... - https://phabricator.wikimedia.org/T374830#10411260
[23:55:14] cloud-services-team, Cloud-VPS, Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team (Seen): Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup... - https://phabricator.wikimedia.org/T374830#10411367