[00:43:23] Data-Services: Replicate DiscussionTools items tables to cloud - https://phabricator.wikimedia.org/T374584 (Bugreporter) NEW
[00:56:57] Toolforge (Toolforge iteration 14): [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585 (Raymond_Ndibe) NEW
[01:01:45] Toolforge (Toolforge iteration 14): [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585#10139492 (Raymond_Ndibe)
[01:05:26] (open) raymond-ndibe: [lima-kilo] configure high-availability [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/189 (https://phabricator.wikimedia.org/T374585)
[01:37:07] (open) raymond-ndibe: [toolforge-deploy] upgrade metrics-server [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/520 (https://phabricator.wikimedia.org/T359641)
[01:55:21] (update) raymond-ndibe: Draft: [lima-kilo] configure high-availability [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/189 (https://phabricator.wikimedia.org/T374585)
[02:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:06:26] (open) chandrapratap25: Draft: Resolve T373919 "Center Align Username Placeholder for Consistency" [toolforge-repos/yearinreview] - https://gitlab.wikimedia.org/toolforge-repos/yearinreview/-/merge_requests/6
[03:10:14] (close) chandrapratap25: Draft: Resolve T373919 "Center Align Username Placeholder for Consistency" [toolforge-repos/yearinreview] - https://gitlab.wikimedia.org/toolforge-repos/yearinreview/-/merge_requests/6
[03:10:22] (reopen) chandrapratap25: Draft: Resolve T373919 "Center Align Username Placeholder for Consistency" [toolforge-repos/yearinreview] - https://gitlab.wikimedia.org/toolforge-repos/yearinreview/-/merge_requests/6
[03:22:11] Tool-yearinreview, good first task: Center Align Username Placeholder for Consistency - https://phabricator.wikimedia.org/T373919#10139555 (ChandraPratap25) Open→Resolved Resolve T373919 "Center Align Username Placeholder for Consistency" toolforge-repos/yearinreview!6
[03:52:52] (open) raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066)
[04:01:36] (update) raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066)
[04:03:08] (update) raymond-ndibe: [toolforge-deploy] test multi-replica support for continuous jobs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/521 (https://phabricator.wikimedia.org/T341066)
[06:59:40] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown
[06:59:48] cloud-services-team: CephClusterInUnknown - https://phabricator.wikimedia.org/T374593 (phaultfinder) NEW
[07:00:30] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:10:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[07:10:23] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10139781 (phaultfinder)
[07:22:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:32:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:43:06] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Wikimedia-Slow-DB-Query: [wikireplicas] Log very slow queries - https://phabricator.wikimedia.org/T372859#10139859 (fnegri) a:fnegri
[09:12:09] cloud-services-team (FY2024/2025-Q1-Q2): Lint problems for NeutronAgentDownForLong and NeutronAgentDown - https://phabricator.wikimedia.org/T374513#10139915 (aborrero) →Duplicate dup:T373878
[09:13:52] cloud-services-team, Cloud-VPS: openstack: fix missing prometheus metrics - https://phabricator.wikimedia.org/T373878#10139917 (aborrero)
[09:35:09] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:35:39] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599 (aborrero) NEW
[09:35:55] cloud-services-team, Cloud-VPS: openstack: fix missing prometheus metrics - https://phabricator.wikimedia.org/T373878#10140004 (aborrero)
[09:35:58] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140005 (aborrero)
[09:36:35] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10140007 (aborrero)
[09:36:36] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140008 (aborrero)
[09:36:44] cloud-services-team: CephClusterInUnknown - https://phabricator.wikimedia.org/T374593#10140009 (aborrero)
[09:36:46] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140010 (aborrero)
[09:37:47] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140015 (aborrero) p:Triage→High
[09:42:06] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140040 (aborrero)
[09:42:54] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140043 (aborrero)
[09:55:09] FIRING: CephSlowOps: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:55:18] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10140078 (phaultfinder)
[09:55:30] RESOLVED: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:56:50] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:57:02] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[10:00:31] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of -1 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[10:01:21] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140080 (fgiunchedi) From my investigation so far on IRC: ` 09:51 so far I have more questions than answers :( both prometheus1005...
[10:01:28] FIRING: InstanceDown: Project tools instance tools-k8s-ingress-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:01:51] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:02:29] FIRING: InstanceDown: Project gitlab-runners instance runner-1026 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:08:00] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 35.464 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[10:08:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-6 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:11:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-ingress-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:11:51] RESOLVED: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:12:29] RESOLVED: InstanceDown: Project gitlab-runners instance runner-1026 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:13:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-6 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[10:15:31] RESOLVED: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[10:24:01] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10140111 (aborrero)
[10:24:02] cloud-services-team, observability: cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10140112 (aborrero)
[10:28:39] RESOLVED: CephSlowOps: Ceph cluster in eqiad has 104 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:30:02] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140145 (fnegri) In progress→Resolved
[10:31:27] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140140 (fnegri) > the procedure looks similar to what is documented in MariaDB#Manipulating_the_Replication_Tree Re-r...
[10:40:15] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140197 (jcrespo) I wouldn't be responsible if I didn't tell you that GTID has been very error prone to us, and that is...
[10:45:36] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140243 (fnegri) @jcrespo thanks, does it mean that `repl.pl` is still used? Are the docs at [MariaDB#Manipulating_the_...
[10:52:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:56:21] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140266 (Ladsgroup) I think we use a script in wmfmariadbpy called `move_replica`
[10:57:00] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140272 (jcrespo) This method is used: https://wikitech.wikimedia.org/wiki/Primary_database_switchover but it only work...
[10:57:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:01:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-27 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:06:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-23 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:11:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-16 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:16:03] FIRING: [9x] ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-12 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:17:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:21:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-27 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:22:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:34:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:36:04] cloud-services-team: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612 (aborrero) NEW
[11:37:26] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-28 (T374612)
[11:37:30] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[11:39:08] cloud-services-team: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612#10140418 (aborrero)
[11:39:09] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T373632#10140419 (aborrero)
[11:39:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:42:57] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-28 (T374612)
[11:43:01] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[11:48:47] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-23, tools-k8s-worker-16, tools-k8s-worker-nfs-33 (T374612)
[11:48:51] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[11:51:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:52:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:54:13] !log aborrero@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-23, tools-k8s-worker-16, tools-k8s-worker-nfs-33 (T374612)
[11:54:17] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[11:57:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:59:42] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-33 (T374612)
[11:59:46] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[12:01:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:06:00] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-33 (T374612)
[12:06:04] T374612: toolforge: workers with many procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612
[12:16:18] cloud-services-team: toolforge: workers with many D procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612#10140482 (aborrero)
[12:17:04] cloud-services-team: toolforge: workers with many D procs (2024-09-12 edition) - https://phabricator.wikimedia.org/T374612#10140479 (aborrero) Open→Resolved a:aborrero
[12:23:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:24:23] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1072530 (owner: L10n-bot)
[12:29:25] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140497 (fnegri) @jcrespo I have added your comment above to [MariaDB#Manipulating_the_Replication_Tree](https://wikite...
[12:32:13] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140507 (jcrespo) >>! In T365717#10140497, @fnegri wrote: > @jcrespo I have added your comment above to [MariaDB#Manipu...
[12:33:52] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:45:02] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140589 (fnegri) I wasn't sure if "cannot apply to wikireplicas" included Sanitariums or only clouddbs. If it also incl...
[12:47:53] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140604 (jcrespo) >>! In T365717#10140589, @fnegri wrote: > I wasn't sure if "cannot apply to wikireplicas" included Sa...
[12:52:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:57:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:00:54] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Data-Persistence, Data-Persistence-SRE: [wikireplicas] Update Admin docs - https://phabricator.wikimedia.org/T365717#10140655 (fnegri) Thanks, I've updated again [MariaDB/Sanitarium_and_clouddb_instances](https://wikitech.wikimedia.org/w...
[13:53:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:58:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:00:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:05:52] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:10:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:13:36] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:16:21] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:20:55] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:21:06] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:29:32] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1072530 (owner: L10n-bot)
[14:31:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:36:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:50:49] cloud-services-team (FY2024/2025-Q1-Q2), Quarry, User-notice: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158#10141169 (fnegri) @UOzurumba apologies for the delay. Yes, it's a good idea to add it to the next Tech News! Here's a summary, feel f...
[14:50:59] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 24.11 ms
[15:09:45] Toolforge (Toolforge iteration 14), Patch-For-Review: [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585#10141230 (dcaro) +1 for being able to test upgrades in lima-kilo, though we would need to be able to replicate the upgrade procedu...
[15:11:03] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: openstack: instrument VXLAN-based flat network - https://phabricator.wikimedia.org/T374020#10141235 (aborrero) The network is in better shape now, VMs have now connectivity by default. I have recreated the VMs. To test: * `ssh arturo-test-vm3.cloudinf...
[15:13:13] Toolforge (Toolforge iteration 14), Patch-For-Review: [lima-kilo] allow for the creation of a multi-node high availability cluster - https://phabricator.wikimedia.org/T374585#10141239 (dcaro) This might be interesting: https://pkg.go.dev/k8s.io/kubeadm/kinder#section-readme
[15:15:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:20:51] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:29:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:34:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:47:33] Cloud Services Proposals, cloud-services-team, Toolforge: Decision Request: To strictly enforce semantic versioning rules for toolforge services' APIs or not - https://phabricator.wikimedia.org/T373072#10141412 (fnegri) I think Option 2 is a good trade
off, it's true it is prone to errors, but I like...
[16:43:52] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10141703 (wiki_willy) It looks like it'll be 3 drives minimum from the latest email today, and @Jclark-ctr - you c...
[16:44:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:49:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:00:56] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack -
https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:01:11] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:01:26] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:02:11] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:02:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:12:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:15:52] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:20:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:36:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:41:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:54:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:59:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:28:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:33:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:46:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:51:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:06:52] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:11:51] RESOLVED: ProbeDown: Service
tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:25:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-5 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:35:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-5 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:35:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:40:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:50:41] FIRING: CloudVPSDesignateLeaks: Detected
1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[20:13:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:18:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:42:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:47:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:16:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:21:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:38:29] VPS-project-Wikistats: Add moswiki to wikistats - https://phabricator.wikimedia.org/T374648#10142558 (Dzahn) a:Dzahn
[21:39:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:44:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:35:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-55 has many processes stuck on IO (probably NFS) -
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses