[00:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:21:51] <wmcs-alerts>	 FIRING: TfInfraTestApplyFailed: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:33:50] <wmcs-alerts>	 FIRING: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[01:10:29] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:25:29] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:04:12] <jinxer-wm>	 FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown  - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:09:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:19:28] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:23:27] <wmcs-alerts>	 FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[03:23:27] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down:  - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[03:24:28] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:28:27] <wmcs-alerts>	 RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[03:28:27] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down:  - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[03:35:16] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[03:35:16] <wmcs-alerts>	 FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[03:35:16] <wmcs-alerts>	 FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[03:40:16] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[03:40:16] <wmcs-alerts>	 RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[03:40:16] <wmcs-alerts>	 RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[03:47:44] <wikibugs>	 (PS3) Pppery: ISA: New footer not included in translation message [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1053071 (https://phabricator.wikimedia.org/T340538) (owner: Aditya0545)
[03:49:10] <wmcs-alerts>	 FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[03:49:10] <wmcs-alerts>	 FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[03:49:44] <wmcs-alerts>	 FIRING: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[03:54:10] <wmcs-alerts>	 RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[03:54:11] <wmcs-alerts>	 RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[03:54:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:54:44] <wmcs-alerts>	 RESOLVED: Toolforge Kyverno unknown state: Toolforge Kyverno has unknown state. Kyverno might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_unknown_state - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+unknown+state
[03:56:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:26:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:37:28] <wmcs-alerts>	 FIRING: PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-harbor-2 on project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[05:55:28] <wmcs-alerts>	 FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[07:04:12] <jinxer-wm>	 FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown  - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:08:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:13:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:24:07] <wikibugs>	 (CR) Jean-Frédéric: "Yes, if you hover above the file it says “file mode changed from regular (100644) to executable (100755)”. Had to be done for toolforge-jo" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124 (owner: Jean-Frédéric)
[07:25:18] <wikibugs>	 (PS3) Jean-Frédéric: Use toolforge-jobs to install requirements during deployment [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124
[07:25:18] <wikibugs>	 (PS3) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[07:26:07] <wikibugs>	 (PS4) Jean-Frédéric: Use toolforge-jobs to install requirements during deployment [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065124
[07:26:07] <wikibugs>	 (PS4) Jean-Frédéric: Remove `composer update` step from build-php script [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1065125
[07:32:19] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[07:32:19] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down:  - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[07:37:19] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[07:37:19] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down:  - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[09:08:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:04:12] <jinxer-wm>	 FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown  - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:08:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:39:05] <wikibugs>	 Toolforge: failure in name resolution and Uncaught Error in stalktoy on toolforge - https://phabricator.wikimedia.org/T373266 (Jeff_G) NEW
[14:09:39] <wikibugs>	 Tool-techcontribs: Tech Contribs does not support parentheses in user names - https://phabricator.wikimedia.org/T373269 (LucasWerkmeister) NEW
[14:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:04:12] <jinxer-wm>	 FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown  - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:08:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:13:28] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:16:56] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089926 (Curb_Safe_Charmer) a:Curb_Safe_Charmer
[15:23:25] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089942 (Curb_Safe_Charmer) tools.refill-api@tools-bastion-13:~$ kubectl get pod refill-api-6c65bdbf7f-9zd7s --output=yaml  apiVersion: v1 kind: Pod metadata:   annotations:     cni.projectcalico.org...
[15:28:31] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089943 (Curb_Safe_Charmer) running webservice restart made no difference  running ./restart.sh made no difference
[15:29:25] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089947 (Curb_Safe_Charmer) Open→Stalled
[15:29:31] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089948 (Curb_Safe_Charmer) p:Triage→Medium
[15:30:41] <wikibugs>	 Tool-refill, Wikimedia-production-error: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089944 (Curb_Safe_Charmer) a:Curb_Safe_Charmer→TheresNoTime Are you able to take a look TNT?
[15:31:18] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089952 (Curb_Safe_Charmer)
[15:31:41] <wikibugs>	 Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T373233#10089950 (Curb_Safe_Charmer)
[16:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:56:26] <wikibugs>	 toolforge_i18n, Tools, I18n, Wikimania-Hackathon-2024: Extract Python library for Wikimedia tool i18n from Wikidata Lexeme Forms tool - https://phabricator.wikimedia.org/T283376#10089989 (LucasWerkmeister) I just released toolforge_i18n 0.1.0, the first version without a “please don’t use this ye...
[17:18:57] <jinxer-wm>	 FIRING: [2x] SystemdUnitDown: The systemd unit backup_vms.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown  - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:23:16] <wikibugs>	 Tool-techcontribs: Tech Contribs can't show user Quiddity - https://phabricator.wikimedia.org/T373272 (Andrybak) NEW
[17:26:42] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[17:28:36] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:29:36] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:33:55] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[17:50:39] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090028 (Don-vip) Same for my tool (pod spacemedia-6fdcc8d798-8sncn). Started to fail at 2024-08-25T17:38:18.469Z with error message "java.net.UnknownHostException: tools....
[17:50:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:51:52] <wikibugs>	 (CR) Multichill: [C:+1] "Thanks for picking this up." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1064471 (https://phabricator.wikimedia.org/T174633) (owner: Lokal Profil)
[18:00:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:04:41] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090067 (Yann) Failed on first try:  `Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.wikimedia.cloud failed: Temporary...
[18:08:07] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090071 (AntiCompositeNumber) getting this for AntiCompositeBot's nolicense task as well: ` 2024-08-25 18:06:37 nolicense ERROR: (2003, "Can't connect to MySQL server on '...
[18:15:19] <wikibugs>	 Toolforge: failure in name resolution and Uncaught Error in stalktoy on toolforge - https://phabricator.wikimedia.org/T373266#10090074 (Don-vip) Probable duplicate of T373243
[18:25:43] <wikibugs>	 Toolforge: failure in name resolution and Uncaught Error in stalktoy on toolforge - https://phabricator.wikimedia.org/T373266#10090091 (JJMC89) →Duplicate dup:T373243
[18:26:04] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090093 (JJMC89)
[18:28:28] <wmcs-alerts>	 FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:04:58] <jinxer-wm>	 RESOLVED: MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[19:08:28] <wmcs-alerts>	 RESOLVED: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:43:38] <wikibugs>	 Tool-techcontribs: Tech Contribs can't show user Quiddity - https://phabricator.wikimedia.org/T373272#10090186 (Quiddity) Hah! Thanks for filing!  I'm glad I checked here before sending a bug-report via mastodon, or filing a duplicate. Let me know if I can do anything to help test/resolve it.
[19:52:55] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090192 (mdaniels5757) I think this is related: ` ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes. ERROR: Please report this issue to...
[20:01:57] <wikibugs>	 Tool-techcontribs: Tech Contribs can't show user Quiddity - https://phabricator.wikimedia.org/T373272#10090196 (Quiddity) One more broken test-case (it's not just me, huzzah!) https://techcontribs.toolforge.org/cn/LMata
[20:05:29] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090198 (Krinkle) tools.krinklebot is facing `Could not resolve host: commons.wikimedia.org` for production hostnames as well. This runs as scheduled toolforge job:  `line...
[20:22:15] <wikibugs>	 cloud-services-team, Cloud-VPS, Beta-Cluster-Infrastructure, Patch-For-Review: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade - https://phabricator.wikimedia.org/T373227#10090201 (Andrew) The attached bug delays, but does not resolve the problem. It...
[20:43:07] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090234 (Stuartyeates) Just got a different message from https://author-disambiguator.toolforge.org/names_oauth.php?... . This may be a result of a DNS failure not being c...
[21:02:53] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090235 (mdaniels5757) p:Triage→Unbreak!
[21:06:22] <jinxer-wm>	 FIRING: HAProxyBackendUnavailable: HAProxy service magnum-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[21:19:12] <jinxer-wm>	 FIRING: SystemdUnitDown: The systemd unit neutron-openvswitch-agent.service on node cloudvirt1062 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:49:16] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[21:52:14] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[21:57:14] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[21:58:21] <wmcs-alerts>	 FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[21:59:14] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:04:14] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:10:14] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:15:14] <wmcs-alerts>	 RESOLVED: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:16:14] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:19:22] <jinxer-wm>	 RESOLVED: HAProxyBackendUnavailable: HAProxy service magnum-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[22:20:29] <wmcs-alerts>	 RESOLVED: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:20:41] <jinxer-wm>	 FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:21:14] <wmcs-alerts>	 FIRING: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:21:35] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[22:22:47] <stashbot>	 andrew@cloudcumin1001: Failed to log message to wiki. Somebody should check the error logs.
[22:25:29] <wmcs-alerts>	 RESOLVED: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:27:14] <wmcs-alerts>	 FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:28:21] <wmcs-alerts>	 RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[22:28:32] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[22:28:51] <wmcs-alerts>	 FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:30:41] <jinxer-wm>	 RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:32:14] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[22:33:51] <wmcs-alerts>	 RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[23:03:58] <wikibugs>	 Toolforge: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least) - https://phabricator.wikimedia.org/T373243#10090266 (Chlod) Noting here that I'm unable to use Build Service, probably due to the same issue. Related log line: ` [step-clone] 2024-08-25T22:59:56.754700588Z {"level":...
[23:06:31] <wikibugs>	 Tool-techcontribs: Tech Contribs does not support parentheses in user names - https://phabricator.wikimedia.org/T373269#10090269 (Chlod) Looks like it was a bad character escaping issue, and the fact that parentheses weren't part of the escaped characters list. I've replaced the mediocre escaping method with...
[23:08:16] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[23:08:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:13:16] <wmcs-alerts>	 RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[23:16:17] <wmcs-alerts>	 FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-17 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[23:16:17] <wmcs-alerts>	 FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[23:16:17] <wmcs-alerts>	 FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[23:18:28] <wmcs-alerts>	 RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:21:17] <wmcs-alerts>	 RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[23:21:17] <wmcs-alerts>	 RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[23:24:28] <wmcs-alerts>	 FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown