[11:36:35] <_joe_> XioNoX: did we change something re: esams-eqiad routing today?
[11:36:48] _joe_: not afaik, let me check
[11:37:12] <_joe_> I see this jump in kafka RTT in esams https://grafana.wikimedia.org/d/RvscY1CZk/purged?viewPanel=36&orgId=1&from=1615935497437&to=1615981017642&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp3058
[11:37:53] <_joe_> but not in any other DC
[11:38:02] _joe_: I see it there too https://smokeping.wikimedia.org/?target=esams.Hosts.bast3005
[11:38:28] <_joe_> that also means an additional 20 ms of rtt on every uncached request from esams
[11:41:04] yeah, brief cut there too https://librenms.wikimedia.org/graphs/to=1615980900/id=6835/type=port_bits/from=1615894500/ maybe they re-routed our wavelength, let me check the maintenance calendar
[11:42:14] _joe_: there was a maintenance, but it ended, let me check more
[11:44:01] "Click here to open a case for assistance on this scheduled maintenance via the Lumen Customer Portal. "
[11:44:05] perfect
[11:51:10] Your ticket #20890103 has been successfully created.
[11:51:47] _joe_: is it causing an issue? there is the option of failing over to the backup circuit, but it can become expensive if used for a long period of time
[11:58:49] <_joe_> XioNoX: just a perf degradation, nothing more
[12:00:28] ok! will open a task
[12:00:54] _joe_: did something alert, or did you find out about it randomly?
[12:01:22] <_joe_> XioNoX: I was looking at that dashboard, but I swear I had a good reason to do so
[12:01:24] <_joe_> :P
[12:01:29] :)
[12:02:20] Lumen says it's a set of multiple maintenances, so my guess is that they diverted our wavelength while they fixed something, but let's see
[12:10:46] https://phabricator.wikimedia.org/T277654
[14:52:29] o/ can someone merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/669477/? thanks!
[14:57:11] Majavah: will do
[14:59:47] thanks moritzm!
[22:23:34] effie: is a daemonset more or less the same as a "sidecar"?
[22:26:43] Hm.. seems they are similar, but the difference is that when a k8s worker runs multiple pods for the same app, a daemonset effectively deduplicates the sidecar demand
[22:26:56] I assumed sidecars already worked that way, so cool, better :)
[22:27:20] a daemonset guarantees that each node will run one copy of a pod, in this case the pod where mcrouter will live
[22:28:57] and if I understand your breakdown correctly, unlike a sidecar, a daemonset would make mcrouter its own pod, and thus they can die separately
[22:29:05] and pods running on that node can potentially access this pod
[22:29:22] a sidecar lives within the pod
[22:29:51] I'm not sure I understand the distinction between e.g. the mcrouter process dying within a given MW pod, vs the separate mcrouter daemonset pod being able to die.
[22:29:52] yes
[22:30:02] is it more likely to die as a daemonset?
[22:30:49] if it is running as a daemonset, all pods running on this node will lose access to a working mcrouter
[22:31:37] while in the other case, the specific pod will lose access to mcrouter
[22:32:12] ah, I see. so it's not per se that we're worried about the container's own failure likelihood
[22:32:13] on the other hand, mcrouter has caused us little trouble when it comes to daily operation
[22:32:17] but just the impact of failure in general
[22:32:50] yeah, it would be amplified in that case.
[22:33:09] if it's just one, I suppose that would quickly cause that pod to be killed or restarted as a whole or otherwise recover, and thus affect fewer ongoing requests.
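For reference, a minimal sketch of the two shapes discussed above; all names, images, and ports here are illustrative placeholders, not the actual Wikimedia manifests. A DaemonSet schedules exactly one mcrouter pod per node, which co-located app pods reach via the node, while a sidecar is an extra container declared inside each app pod's own template.

# Sketch A (assumed layout): mcrouter as a DaemonSet -- one pod per node,
# shared by every app pod scheduled on that node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mcrouter                           # hypothetical name
spec:
  selector:
    matchLabels:
      app: mcrouter
  template:
    metadata:
      labels:
        app: mcrouter
    spec:
      containers:
      - name: mcrouter
        image: example.org/mcrouter:latest # placeholder image
        ports:
        - containerPort: 11211             # illustrative port
          hostPort: 11211                  # app pods on the node connect via the node address
---
# Sketch B (assumed layout): mcrouter as a sidecar -- an extra container
# inside each app pod, living and dying with that pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mediawiki                          # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mediawiki
  template:
    metadata:
      labels:
        app: mediawiki
    spec:
      containers:
      - name: mediawiki
        image: example.org/mediawiki:latest # placeholder image
      - name: mcrouter                      # sidecar copy, one per app pod
        image: example.org/mcrouter:latest  # placeholder image

The tradeoff in the conversation falls directly out of these shapes: in sketch A a failed mcrouter pod affects every app pod on that node, while in sketch B it affects only the one pod that contains it.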
[22:33:24] I do not recall an mcrouter on an mw* host being unavailable and in need of a restart unless there was a bad config deployment
[22:33:53] for things that are as light as mcrouter appears to be though, maybe it's okay to duplicate it as a sidecar alongside each pod. The on-host memcached, with its memory consumption and improved cache reuse, seems to benefit more from the daemonset approach.
[22:34:02] (and has a better failure scenario anyway, just a cache miss basically)
[22:34:23] yes, an unavailable on-host memcached is not a problem
[22:35:39] does k8s health-track each process in a pod (e.g. the app and the sidecar) separately, such that it gets no traffic if either of them dies as a process?
[22:35:53] or does that only apply to the "main" process
[22:36:01] e.g. the nginx or apache process I guess
[22:37:15] k8s uses the readiness probes to know when a container can accept traffic
[22:38:52] and the liveness probes to check whether a container needs to be restarted or not
[22:40:21] we do not make changes to mcrouter frequently, well, apart from these days, with the upcoming TLS work and server refresh
[22:40:45] once those are done, I don't know when we will change the mcrouter config again
[22:41:26] * effie off
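A note on the probe question above: Kubernetes probes are defined per container, not per process, so a sidecar container gets its own readiness and liveness checks independently of the app container. A hedged sketch of what that could look like for an mcrouter container; the port, probe type, and timings are illustrative, not production values.

      # Assumed container spec fragment; readiness gates Service traffic,
      # liveness triggers a restart of just this container by the kubelet
      containers:
      - name: mcrouter
        image: example.org/mcrouter:latest   # placeholder image
        ports:
        - containerPort: 11211               # illustrative port
        readinessProbe:                      # failing: pod is removed from Service endpoints
          tcpSocket:
            port: 11211
          periodSeconds: 5
        livenessProbe:                       # failing repeatedly: kubelet restarts the container
          tcpSocket:
            port: 11211
          initialDelaySeconds: 10
          failureThreshold: 3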