[00:03:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-1 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[00:05:55] FIRING: MaxConntrack: Max conntrack at 84.34% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:48:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-1 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[00:55:56] RESOLVED: MaxConntrack: Max conntrack at 82.82% on cloudvirt1067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[03:13:05] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:19:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[03:24:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling
[05:22:50] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10973902 (ayounsi) There is currently only one switch per rack, so I suggest we only use one uplink for now, and revisit it the day we have more.
[06:08:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:46:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance runner-1033 in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[07:01:31] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 3 deleted instances on gitlab-runners-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[07:11:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance runner-1033 in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[07:24:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudgw1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:28:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:29:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudgw1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:57:26] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166054 (https://phabricator.wikimedia.org/T390397) (owner: Bovimacoco)
[08:15:33] cloud-services-team, Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651#10974128 (Addshore) @dcaro want to schedule a call to walk through it all in more detail?
[08:19:14] cloud-services-team, Toolforge: [toolforge-cli-gen] review the https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-gen-cli client as potential consolidation - https://phabricator.wikimedia.org/T398651#10974130 (dcaro) >>! In T398651#10974128, @Addshore wrote: > @dcaro want to schedule a call to...
[08:28:06] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:17:41] (open) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:21:37] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:21:45] (PS1) Essa237: Refined the landing page [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1166348
[09:23:21] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:25:02] (Abandoned) Essa237: [Fix] added a landing page [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157586 (owner: Essa237)
[09:28:15] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686 (Tobi_WMDE_SW) NEW
[09:29:57] (open) taavi: logs: Move multi-pod fix from jobs-api to here [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/82 (https://phabricator.wikimedia.org/T398647)
[09:33:17] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974387 (Clement_Goubert)
[09:33:29] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974388 (Clement_Goubert)
[09:33:34] (update) dcaro: Draft: DONOTMERGE: always auth as tf-test [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/72
[09:35:50] (open) taavi: Draft: Use logging multi-pod fix moved to toolforge-weld [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/179 (https://phabricator.wikimedia.org/T398647)
[09:37:00] wikitech.wikimedia.org, SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10974404 (Clement_Goubert) Open→In progress p:Triage→Medium @Tobi_WMDE_SW Can you or @sowmya.guru fill out the first part of the...