[08:13:04] !incidents
[08:13:04] 5693 (RESOLVED) [2x] ProbeDown sre (dse-k8s-ctrl1002:6443 probes/custom eqiad)
[08:13:04] 5692 (RESOLVED) [2x] ProbeDown sre (dse-k8s-ctrl1001:6443 probes/custom eqiad)
[09:24:09] Thanks elukey. stevemunene, would you mind having a look, as you were working on the control plane recently? Thanks!
[10:03:11] There is also high latency for several dse workers https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DKubeletOperationalLatency
[10:03:15] started 3 days ago
[10:10:02] I've seen these flap recently. How do we usually investigate these? Is it usually the control plane being loaded? The hosts themselves?
[10:19:48] in this case it seems that run_podsandbox is taking ages and/or erroring out on various kubelets, I'd start checking the logs on the host to see if anything is happening
[10:27:10] I'm seeing 2 errors like these from yesterday
[10:27:10] Feb 23 11:34:21 dse-k8s-worker1002 kubelet[1624]: E0223 11:34:21.938179 1624 kuberuntime_gc.go:176] "Failed to stop sandbox before removing" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"339194cb9405e59ea2d22a5721605716ec50e22a9a62ae6a8c9f26d24c1c2e0c\": plugin type=\"calico\" failed (delete): error getting
[10:27:10] ClusterInformation: Get \"https://dse-k8s-ctrl.svc.eqiad.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.2.2.73:6443: connect: connection refused" sandboxID="339194cb9405e59ea2d22a5721605716ec50e22a9a62ae6a8c9f26d24c1c2e0c"
[10:27:20] (on dse-k8s-worker1002)
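The run_podsandbox latency and the calico "connection refused" error above both point at the dse-k8s control plane endpoint (dse-k8s-ctrl.svc.eqiad.wmnet:6443), which matches the earlier ProbeDown incidents on dse-k8s-ctrl1001/1002. A minimal triage sketch along the lines of the advice above ("check the logs on the host"): it assumes it runs on an affected worker such as dse-k8s-worker1002, that the kubelet runs as the systemd unit "kubelet", and that journalctl is available. The hostname and port come from the log quoted above; everything else is illustrative, not an established runbook.

    #!/usr/bin/env python3
    """Rough triage sketch for the dse-k8s kubelet latency alerts (illustrative only)."""
    import socket
    import subprocess

    # Control-plane endpoint taken from the calico error quoted at 10:27.
    CTRL_ENDPOINT = ("dse-k8s-ctrl.svc.eqiad.wmnet", 6443)

    def recent_kubelet_errors(since: str = "3 days ago") -> str:
        # Pull error-level kubelet messages; sandbox GC failures like the one
        # quoted above should show up here.
        out = subprocess.run(
            ["journalctl", "-u", "kubelet", "--since", since, "-p", "err", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
        return out.stdout

    def control_plane_reachable(timeout: float = 3.0) -> bool:
        # A plain TCP connect mirrors what the calico plugin failed to do
        # ("dial tcp 10.2.2.73:6443: connect: connection refused").
        try:
            with socket.create_connection(CTRL_ENDPOINT, timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        print("control plane reachable:", control_plane_reachable())
        sandbox_lines = [l for l in recent_kubelet_errors().splitlines() if "sandbox" in l.lower()]
        print(f"{len(sandbox_lines)} sandbox-related kubelet errors in the last 3 days")
        for line in sandbox_lines[:5]:
            print(line)

If the TCP connect succeeds now but the kubelet errors cluster around the time of the ProbeDown incidents, the latency alerts were most likely fallout from the control-plane blip rather than a problem on the workers themselves.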
[11:25:54] as promised I published https://wikitech.wikimedia.org/wiki/Incidents/2025-02-17_maps (the outage happened last Monday)
[11:28:01] (I need to write another one for Wednesday, sigh)
[11:47:33] ty elukey <3
[12:40:55] https://wikitech.wikimedia.org/wiki/Incidents/2025-02-19_maps
[12:40:59] and the second one is published
[13:04:06] elukey: can I s/clearly visible/visible/ in the Timeline? It sounds a bit blame-elukey, and I think the notes below show that the issue wasn't clear, even if it was visible once you knew what you were looking for
[13:06:03] Emperor: done! That wasn't the intent, my point is that the dashboard was there and I should've checked it :)
[13:06:18] (also thanks for reviewing!)
[17:07:06] hey on-callers, as an FYI I pooled all the wikikube workers again for kartotherian.discovery.wmnet (the backend for maps.wikimedia.org)
[17:07:25] they are serving traffic with half the pybal weight (compared to the bare metals)
[17:07:32] All info/state/rollback in https://phabricator.wikimedia.org/T386926
[17:07:51] if anything goes south and I am not around, just depool as indicated in the task's description
[17:28:00] thanks, elukey!
[17:54:28] ok
[18:18:06] Hello! In anticipation of the upcoming DC switchover (T385155), we will be running a live test on Thursday February 27th, starting around 1700 UTC. Please let us know in #wikimedia-serviceops if this conflicts with any ongoing work or if you have any concerns - all going well, this will be a non-disruptive test
[18:18:07] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155
[23:09:14] First Wikipedia 3D colored models prototype
[23:09:14] You can see it here:
[23:09:15]     http://wikipedia3d.serreriabelga.es/index.php/Special:ListFiles
[23:09:15]     http://wikipedia3d.serreriabelga.es/index.php/File:DamagedHelmet.glb
[23:09:16]     http://wikipedia3d.serreriabelga.es/index.php/File:Wikipedia.glb
[23:09:16] A group of community members is working to improve 3D support in Wikipedia.
[23:09:17] We would love to talk with members of the SRE team.
[23:09:17] https://meta.wikimedia.org/wiki/Telegram -> https://t.me/+tMgoJxx8D7I5NTQ8