[00:16:28] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:21:28] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:47:43] Tool-yearinreview, Indic MediaWiki Developers UG, Indic-TechCom: Fix padding between two rows - https://phabricator.wikimedia.org/T365843#10108016 (KKsurendran06) @KCVelaga, This issue was initially closed after I implemented the requested design changes, but it was reopened by Vasanth Gopa due to a...
[06:20:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:30:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:33:22] Tool-techcontribs: tech-contribs: Show names and not IDs of new Cloud VPS projects - https://phabricator.wikimedia.org/T373730 (taavi) NEW
[07:01:59] (PS2) Jean-Frédéric: Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288
[07:03:56] (CR) CI reject: [V:-1] Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288 (owner: Jean-Frédéric)
[07:45:08] (PS3) Jean-Frédéric: Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288
[07:46:53] (CR) CI reject: [V:-1] Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288 (owner: Jean-Frédéric)
[07:54:49] (PS4) Jean-Frédéric: Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288
[08:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:39:29] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance cloudinfra-idp-1 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:00:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:01:43] PROBLEM - Host cloudvirt1048 is DOWN: PING CRITICAL - Packet loss = 100%
[13:02:37] PROBLEM - toolschecker: Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/redis - 236 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:03:51] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:06:28] FIRING: InstanceDown: Project tools instance tools-redis-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:06:29] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-worker-nfs-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:07:02] FIRING: NodeDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1048 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:07:10] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T373740 (phaultfinder) NEW
[13:11:28] FIRING: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-24 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:12:23] FIRING: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-24 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[13:12:24] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-worker-nfs-2 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[13:20:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:30:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:37:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.toolsbeta.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[13:43:07] RECOVERY - Host cloudvirt1048 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:44:37] RECOVERY - toolschecker: Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[13:48:43] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T373740#10108284 (Andrew) I rebooted the host via mgmt and it came back up and seems fine.
[13:49:20] cloud-services-team: 2024-08-31 cloudvirt1048 NodeDown - https://phabricator.wikimedia.org/T373740#10108285 (Andrew) ` 2024-08-31T13:00:08.210868+00:00 cloudvirt1048 kernel: [6222226.936880] Memory failure: 0x4d47b27: already hardware poisoned 2024-08-31T13:00:08.210880+00:00 cloudvirt1048 kernel: [6222226.9...
[13:49:22] cloud-services-team: 2024-08-31 cloudvirt1048 NodeDown - https://phabricator.wikimedia.org/T373740#10108286 (dcaro)
[13:50:32] RESOLVED: NodeDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1048 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[13:50:54] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-worker-nfs-2 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[13:51:08] RESOLVED: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-24 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[13:51:58] RESOLVED: [3x] InstanceDown: Project tools instance tools-k8s-worker-nfs-24 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:56:25] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.toolsbeta.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[13:57:58] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-worker-nfs-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:00:36] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:17:50] Tool-openstack-browser: openstack-browser: 500 when accessing /api/projects.json - https://phabricator.wikimedia.org/T373742 (Chlod) NEW
[14:19:37] Tool-openstack-browser: openstack-browser: HTTP 500 when accessing /api/projects.json - https://phabricator.wikimedia.org/T373742#10108317 (Chlod)
[14:44:55] Tool-openstack-browser: openstack-browser: HTTP 500 when accessing /api/projects.json - https://phabricator.wikimedia.org/T373742#10108324 (taavi) ` 2024-08-31T14:43:05+00:00 [openstack-browser-d697c9f9-5b7xz] [2024-08-31 14:43:05,164] ERROR in app: Exception on /api/projects.json [GET] 2024-08-31T14:43:05+0...
[14:48:27] Tool-openstack-browser: openstack-browser: HTTP 500 when accessing /api/projects.json - https://phabricator.wikimedia.org/T373742#10108327 (taavi) Fixed with https://gitlab.wikimedia.org/toolforge-repos/openstack-browser/-/commit/4656287adfebcde8884218c324289d64d9b682c9, although that doesn't give you the da...
[14:59:43] Tool-openstack-browser: openstack-browser: HTTP 500 when accessing /api/projects.json - https://phabricator.wikimedia.org/T373742#10108332 (taavi) Open→Resolved a:taavi https://openstack-browser.toolforge.org/api/project-names.json
[15:02:59] Tool-openstack-browser: openstack-browser: HTTP 500 when accessing /api/projects.json - https://phabricator.wikimedia.org/T373742#10108337 (Chlod) Nice! Thank you, Taavi! :D
[18:50:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[19:30:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:01:49] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.954% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:30:31] (CR) Lokal Profil: [C:+1] "Looked at it and it looks sane. Difficult to say more since I couldn't find the spec for the toolforge api explaining the flag usage and c" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787) (owner: Jean-Frédéric)
[21:32:39] (CR) Lokal Profil: [C:+2] Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288 (owner: Jean-Frédéric)
[21:34:21] (Merged) jenkins-bot: Open a database connection for each dataset during harvesting [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069288 (owner: Jean-Frédéric)
[21:37:17] (CR) Lokal Profil: [C:+1] Rewrite jobs shell scripts for Toolforge Jobs framework (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787) (owner: Jean-Frédéric)
[21:49:49] (CR) Lokal Profil: [C:+1] "comparing this to the old update_monuments.sh the new harvester doesn't load any configuration changes into the source tables (line 16-35" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787) (owner: Jean-Frédéric)
[21:53:07] (PS2) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787)
[21:55:42] (PS3) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787)
[22:03:53] (CR) Jean-Frédéric: "Thanks for looking into it!" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787) (owner: Jean-Frédéric)
[22:04:38] (PS4) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787)
[22:05:51] (PS5) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787)
[22:06:30] (PS6) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787)
[22:07:24] (CR) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787) (owner: Jean-Frédéric)
[22:10:00] (PS7) Jean-Frédéric: Rewrite jobs shell scripts for Toolforge Jobs framework [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1069284 (https://phabricator.wikimedia.org/T319787)
[22:20:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:30:56] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:55:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-22 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[23:11:49] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.899% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:29:49] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.915% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:34:49] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.932% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace