[06:37:28] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [06:44:10] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [07:27:32] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [08:09:45] fyi, jayme and yours truly are going to start the reinitializing of the eqiad kube cluster [08:12:22] akosiaris: I was looking at yout home on argon and chlorine, no calicoctl.cfg there :) [08:13:43] there should not be 1 there anyway. that's on the nodes [08:14:19] the "live" on you mean? [08:14:38] I got a tmux session on deploy1001 "eqiad-etcd" [08:14:58] let me also create 1 on kubernetes1001 [08:15:52] jayme: kubernetes1001 tmux session "calico" [08:18:09] might if I jump in read-only ? [08:18:30] I will not even ask [08:20:28] sure [08:20:43] jayme: got a link to the etherpad with the process handy? [08:20:52] https://etherpad.wikimedia.org/p/migrate-k8s-etcd [08:21:52] thanks! [08:22:04] * akosiaris depooling the few not depooled yet services [08:23:28] akosiaris: you have a list for the record? [08:23:58] _jo.e_ depooled restbase-async and eventgate-main and I just depooled blubberoid. The rest is all the stuff that was depooled during the switchover [08:24:23] <_joe_> blubberoid was pooled? ooops [08:24:46] I don't think it was in switchover scope anyway [08:24:56] due to it being part of CI [08:24:59] so, not oops [08:25:41] Is there a nice way of knowing which services are actually on kubernetes without having to look it up in services.yaml or alike? [08:27:10] :-( [08:27:11] nope [08:27:14] <_joe_> jayme: k8sh; use eqiad; ls [08:27:18] lol [08:27:19] <_joe_> :P [08:27:31] that falls under "alike" :-) [08:27:36] <_joe_> hey I use k8sh every day [08:27:55] <_joe_> now I just need a weekend when I'm not exhausted to make it better :P [08:28:15] jayme: calico is still gonna talk to etcdv2 (etcdv3 still reports the old etcd v2 protocol) [08:28:27] and our calico version doesn't yet know etcdv3 anyway :-( [08:28:44] Ah, sure. We talked about that... [08:31:48] hmmm [08:32:17] * akosiaris doublechecking calico ippools in codfw [08:32:56] this is btw a very nice step where a mistake will consume hours to find [08:33:11] I managed at some point to put eqiad ips in codfw... it wasn't nice [08:33:38] effie: is the smaller width terminal yours ? [08:34:05] I don't think so [08:34:25] mine is bigger as well :-) [08:34:30] I got a small one from you [08:34:35] dito [08:34:48] (size 121x52 from a smaller client) [08:34:50] ?? [08:34:58] *ditto [08:35:02] me too [08:35:04] someone is watching :-) [08:35:08] hahaha [08:35:14] lol [08:35:30] and trolling... 
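Since each of these services runs in its own namespace on this cluster, the quickest answer to the question above about which services are actually on kubernetes is to list the namespaces rather than reading services.yaml or going through k8sh. A minimal sketch, assuming an admin kubeconfig on one of the masters:

    kubectl get ns            # one namespace per deployed service, plus the system ones (kube-system, default, ...)
    kubectl get ns -o name    # the same list in machine-readable form, e.g. for scripting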
[08:35:50] _joe_: :P [08:36:02] I de-split my terminals, could be that [08:36:04] <_joe_> nope [08:36:16] I still get a small one [08:37:06] if it was me, then it should be fine [08:37:11] <_joe_> I'm trying to figure out why I can't reach restbase on port 7443 from mobileapp's pods [08:37:50] ok, I 've scheduled downtime for services, let me also do so for apiserver/scheduler and the rest [08:40:59] ok we are good to go I think [08:41:13] calico-node did not have after all icinga checks, but I 've scheduled the rest [08:41:38] I 'd say we are ready for step 5 [08:41:44] jayme: do you agree ? [08:42:17] akosiaris: ack [08:42:38] wanna have the honors of deleting the namespaces? [08:43:19] do we have a script for that or do you do that "by hand"? [08:44:10] by hand [08:44:24] I think just kubect get ns ; kubectl delete ns a b c d [08:44:41] ah, thought some helmfile magic [08:44:56] and let's see if anything breaks [08:45:05] nothing should, but you never know [08:46:50] okay then [08:50:40] <_joe_> why is it taking so long? [08:51:35] good q [08:51:41] * akosiaris looking [08:51:58] termination grace period of poods not reacting to sigterm? [08:52:11] <_joe_> possibly [08:52:35] icinga complaining [08:53:33] <_joe_> about what [08:54:27] <_joe_> icinga took exactly 55 seconds to reload for me :/ [08:59:55] and echostore is taking it's time [09:00:28] oh wait.. it's not that. It's issueing all the requests and then waits out for all of them to finish [09:00:45] We should probably note that to get it fixed (if it's not desired) [09:01:07] yes, yes. They should all be issued in parallel and then joined() [09:01:21] I am not sure we have much of control of it. It's part of the NamespaceLifeCycle I think [09:02:44] Depends. If the service is sensible to sigterm, it "should" shutdown pretty fast. It it's not, we're probably running into default gracetermination which means waiting 30s befor sending sigkill IIRC [09:03:47] per https://grafana.wikimedia.org/d/000000436/kubernetes-kubelets?viewPanel=25&orgId=1 it's indeed remove and stop container [09:05:51] we should figure out why we don't see events. I'm not 100% but I recall that terminations etc. should be logged there [09:06:00] <_joe_> I see all the pods in eventgate-main in termination status [09:06:08] <_joe_> Terminating sorry [09:06:56] <_joe_> same for all the namespaces we're deleting [09:07:03] <_joe_> now what is not terminating properly [09:07:53] when you do a describe on one of the pods that are 0/X running, you'll see that the workload containers are already shut down [09:08:06] or at least kubernetes thinks they are... [09:08:14] so maybe it's the PAUSE container [09:08:15] <_joe_> they are [09:08:54] <_joe_> so now in eventgate-analytics [09:09:08] <_joe_> all pods are Terminating with 0/3 active [09:09:19] <_joe_> so it should just shut down at this point [09:09:24] thats what I was looking at [09:09:39] done now :-) [09:10:50] <_joe_> uhm why can't I get events [09:11:06] there don't seem to be any [09:11:13] nobody can _joe_ [09:11:15] <_joe_> in any namespace? [09:11:25] <_joe_> that's pretty strange [09:11:27] <_joe_> anyways [09:11:28] just in some it seems, or only some events [09:11:35] try --all-namespaces [09:11:47] it seems like container stops aren't being logged as events [09:12:59] <_joe_> $ sudo -E kubectl get events --all-namespaces [09:13:00] <_joe_> No resources found. 
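For the record, the digging above boils down to a few kubectl calls; the pod name is a placeholder, the namespace is one of those being deleted:

    kubectl -n eventgate-main get pods                     # stuck pods show 0/X ready with STATUS Terminating
    kubectl -n eventgate-main describe pod <pod-name>      # confirms the workload containers already exited
    # how long the kubelet waits before escalating SIGTERM to SIGKILL (30s unless the pod spec overrides it)
    kubectl -n eventgate-main get pod <pod-name> -o jsonpath='{.spec.terminationGracePeriodSeconds}'
    kubectl get events --all-namespaces --sort-by=.lastTimestamp   # events, if anything is left to record them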
[09:13:08] <_joe_> that's pretty strange [09:13:28] it is [09:14:11] ah I know why [09:14:19] or at least I think I do [09:14:50] I doubt the NamespaceLifeCycle controller is going around and correctly following the dependencies between the resources [09:15:04] and it's the controllers of e.g. ReplicaSet that send out the events [09:15:11] but they can't if they are deleted [09:15:15] <_joe_> can I suggest we remove the namespaces one at a time [09:15:25] <_joe_> with a for bash cycle? [09:15:25] I 'll have to double check that this is the reason [09:15:35] <_joe_> maybe it will be faster :) [09:15:42] note btw, that the only reason we do this is to make sure nothing breaks [09:15:52] <_joe_> I know [09:16:20] we can try with one [09:16:27] and see if its faster... [09:16:38] just do all of them I 'd say [09:16:50] * jayme would vote for that as well [09:17:00] there isn't much point in optimizing this specific process. It's not like we do that often [09:17:02] <_joe_> cool [09:17:06] <_joe_> just go [09:18:05] nothing seems to have broken anyway up to now [09:19:01] akosiaris: hmm...do we see events in other clusters? Like when deleting a pod manually or so [09:19:05] I 'll do step 6 in the meantime [09:19:21] akosiaris: i've a cumin ready for that [09:19:32] jayme: ok go for it [09:19:33] akosiaris: sudo cumin 'A:eqiad and (A:kubernetes-masters or A:kubernetes-workers)' "disable-puppet 'Reinitialize eqiad k8s cluster with new etcd - T239835'" [09:19:42] +1 [09:20:49] done [09:22:48] jayme: if we manually we delete a pod we will definitely get the new pod being scheduled and created and we get indeed a "Killing" reason as well [09:23:15] okay. Glad that works :-) [09:24:23] <_joe_> akosiaris / jayme I'm looking at pybal rn [09:24:32] <_joe_> there is a new alert I don't understand [09:24:52] CRITICAL: Hosts in IPVS but unknown to PyBal: set(['kubernetes1008.eqiad.wmnet', 'kubernetes1001.eqiad.wmnet', 'kubernetes1010.eqiad.wmnet', 'kubernetes1014.eqiad.wmnet', 'kubernetes1009.eqiad.wmnet', 'kubernetes1007.eqiad.wmnet', 'kubernetes1006.eqiad.wmnet', 'kubernetes1003.eqiad.wmnet', 'kubernetes1005.eqiad.wmnet', 'kubernetes1012.eqiad.wmnet', 'kubernetes1013.eqiad.wmnet', 'kubernetes1002.eqiad.wmnet', 'kubernetes1004. [09:24:52] eqiad.wmnet', 'kubernetes1011.eqiad.wmnet', 'kubernetes1015.eqiad.wmnet', 'kubernetes1016.eqiad.wmnet']) [09:24:53] ? [09:25:26] <_joe_> yes [09:25:31] <_joe_> I think the check is just wrong [09:30:11] still waiting on some mobileapps pods to terminate it seems [09:31:35] _joe_: will downtime all eqiad workers and masters [09:31:56] weird ... it's clear that the containers are dead [09:31:59] <_joe_> akosiaris: it's 160 pods [09:32:03] why isn't the pod dead too? [09:32:17] _joe_: it's like another 3 left now [09:33:20] 10serviceops, 10Patch-For-Review: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by jayme@cumin1001 on 18 host(s) and their services with reason: Reinitialize eqiad k8s cluster w... 
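The one-namespace-at-a-time variant suggested above would look roughly like this; which namespaces count as system ones and have to be skipped is an assumption:

    for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
      case "$ns" in kube-system|kube-public|default) continue ;; esac
      kubectl delete ns "$ns"    # one at a time, so a namespace that hangs is easy to spot
    done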
[09:33:45] I mean I can see on the node that the containers aren't there anymore [09:33:56] but it still is in the API [09:34:16] are we waiting for some GC to gc them instead of a controller reacting to them I think [09:36:33] 1 pod left [09:37:17] Hmm...could it be that devicemapper takes the time on "docker rm" [09:37:39] I remember reading that it's horribly slow compared to aufs/overlayfs [09:37:50] (on removing I mean) [09:38:57] no pods left [09:39:15] kubectl returned [09:39:28] finally [09:39:34] ok let's proceed then [09:40:17] I started taking noted in the etherpad. Feel free to add stuff to catch up on there as well [09:40:31] ok, I 'll stop calico-node and apiservers [09:43:52] done. Let's merge changes? [09:45:21] Yep. Just checked again. I can merge them [09:45:41] go for it [09:47:33] jayme: btw, we might very well have to revisit the lvm approach when we upgrade to buster. It has a sufficiently new enough docker version and things have changed considerably [09:47:46] <_joe_> yes [09:47:59] <_joe_> still not sure overlay is the best choice in prod [09:48:05] <_joe_> but we can experiment at least [09:48:16] <_joe_> we can have a few nodes on different setups [09:48:33] ack, added to the "next steps in general" list :) [09:48:37] patches merged [09:49:29] ok, running puppet on kubernetes1001 and checking that calico-node is happy [09:49:42] kubernetes1014* [09:51:39] looks ok [09:51:50] I 'll run puppet on all nodes [09:51:54] ack [09:53:24] jayme: wanna pick one of chlorine/argon and run puppet there? [09:53:26] akosiaris: would you be so kind to launch another tmux in one of the k8s masters for us to join? [09:54:15] 10serviceops, 10Operations, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [09:56:12] 10serviceops, 10Operations, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) We let the first chunk of about 700k GET requests for about a week, but nothing stood up much. [09:56:16] sure [09:56:20] akosiaris: never mind. I launched tmux eqiad-etcd on chlorine [09:56:27] * akosiaris joining [10:00:17] akosiaris: you know from head how to disable PodSecurityPolicy? :) [10:01:19] jayme: it's the PodSecurityPolicy controller [10:01:48] akosiaris: yeah, I know. I expected it to be in admission controllers list tbh [10:02:01] it's removed indeed [10:02:04] figuring out why [10:02:38] 10serviceops, 10Operations, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [10:03:35] jayme: I am perplexed... PSP controller is enabled for staging but not eqiad/codfw [10:03:47] :o [10:04:41] ok found it [10:04:45] 24ea5414cd5cc230121faad9918bfe9b61de524d [10:04:54] it was only enabled on the staging cluster it seems [10:05:20] adding to the list [10:05:34] https://phabricator.wikimedia.org/T228967 [10:05:39] yeah, task was never completed [10:06:17] I 'd say then re-enable puppet and let's move to step 14 [10:07:21] Ack. We should enable/run it on both maters than I guess [10:08:21] yes [10:08:24] doing so now [10:08:28] ah, ok [10:09:12] double checked kubectl get ns [10:09:19] and it's returning an empty cluster as expected [10:09:35] ack [10:09:44] should we take a 10m break? [10:09:54] okay to do so [10:10:03] cool, bb in 10' [10:10:43] mw1381 is currently running a 4.19 kernel (for T260329), ok to reboot back to 4.9 to bring it in line with the rest of the fleet? 
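For the follow-up task noted above: turning the PodSecurityPolicy controller back on means adding it to the apiserver's admission plugin list (the flag spelling depends on the Kubernetes version), plus shipping at least one PodSecurityPolicy object and the RBAC that lets service accounts use it, otherwise no pod gets admitted. A sketch of the flag only; in practice this is wired through puppet, as in the commit referenced above:

    kube-apiserver ... --enable-admission-plugins=...,PodSecurityPolicy   # newer releases
    kube-apiserver ... --admission-control=...,PodSecurityPolicy          # older releases use the aggregate flag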
or is it still needed for some experiments? [10:11:15] 10serviceops, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) [10:11:47] <_joe_> moritzm: I would prefer to leave it on 4.19, unless it's a problem [10:12:11] <_joe_> I want to reboot it without the kernel "fix" and see how it behaves over a couple of weeks [10:12:42] 10serviceops, 10Parsoid, 10User-jijiki: Create per cluster error rate alerts on Mediawiki servers - https://phabricator.wikimedia.org/T262078 (10jijiki) [10:13:00] ok, fair enough [10:13:23] 10serviceops, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [10:13:36] 10serviceops, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Followup), 10User-jijiki: Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (10jijiki) [10:16:23] 10serviceops, 10Operations, 10Kubernetes, 10User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10jijiki) [10:17:33] good luck :) hope you move to calico 3.x where [10:17:44] X >= 10 (because it has a lot of eBPF goodies) [10:20:21] 10serviceops, 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10Performance-Team (Radar): Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) [10:21:30] 10serviceops, 10Operations, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) [10:22:00] akosiaris: I'm back [10:24:54] 10serviceops, 10Patch-For-Review: Improve Scap2 testing - https://phabricator.wikimedia.org/T216518 (10jijiki) [10:32:52] jayme: back as well [10:33:47] ok [10:34:00] <_joe_> so the cluster is now up on etcd3 and clean? [10:34:28] <_joe_> the half-migrated situation will make scripting the revival of the cluster a bit slower [10:34:52] Yeah, to both [10:35:01] indeed [10:36:44] hmmm [10:36:49] stuck in pending? [10:36:54] yea [10:37:06] 59s Warning FailedScheduling Pod no nodes available to schedule pods [10:37:08] awesome... [10:37:13] missing nodes :) [10:37:19] clusterrolebinding [10:37:27] ah indeed [10:37:36] <_joe_> yep [10:38:09] <_joe_> uhh why editing by hand? [10:38:41] because docs say so and I did not prepare a command... [10:39:21] <_joe_> ahah ok [10:39:31] but we can craft one now to be added to initialiize-cluster [10:39:40] need to add subjects: [10:39:40] - apiGroup: rbac.authorization.k8s.io [10:39:40] kind: Group [10:39:40] name: system:nodes [10:39:51] <_joe_> nodeS [10:39:53] <_joe_> ? [10:40:14] _joe_: yeah, system:node now uses the node admission plugin [10:40:26] whereas system:nodes has all the nodes [10:40:38] the node admission plugin has a prereq of a usable and easy to manage PKI [10:40:41] which we don't have [10:40:43] <_joe_> yes [10:40:51] <_joe_> we will in a few :) [10:40:59] few years? [10:41:00] :P [10:41:11] <_joe_> nope, john is working on it :) [10:42:36] jayme: will it work that way? 
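Spelled out, the by-hand edit above amounts to a ClusterRoleBinding like the following: the system:node ClusterRole is granted to the system:nodes group as a whole, since the per-node alternative (the Node authorizer and its admission plugin) needs per-node certificates that the current PKI can't easily provide. Treat the exact object as a sketch:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: system:node
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:node
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:nodes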
[10:42:53] akosiaris: that is the question, isn't it :-) [10:42:57] :_ [10:42:59] :) [10:43:28] 10serviceops, 10Operations, 10Scap, 10Wikimedia-General-or-Unknown, 10Release-Engineering-Team (Deployment services): "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10jijiki) [10:48:07] we are ready for step 17 I think [10:49:05] <_joe_> akosiaris: which are the sessionstore-reserved nodes? [10:49:50] _joe_: kubernetes100{5,6} and kubernetes101{5,6} [10:49:51] why? [10:50:12] <_joe_> out of curiousity, I didn't remember [10:50:43] --show-labels and it's the dedicated=kask label [10:51:00] there is a similar toleration but for what you wanted the label was enough [10:51:12] <_joe_> which is still not there, right? [10:51:24] it is, it's in hiera [10:51:31] <_joe_> oh it is [10:51:36] <_joe_> ack [10:51:45] <_joe_> I didn't remember how it was added [10:52:06] we are getting closer and closer to having everything in git [10:52:15] the only thing not in git is calico config (for now) [10:52:28] so, should I try cluster-helmfile ? [10:52:34] it's the next step [10:52:48] yeah, go ahead :) [10:53:34] dammit [10:53:47] what's going so wrong? [10:54:04] dns foo? [10:54:10] lookup kubemaster.svc.eqiad.wmnet on 10.3.0.1:53: read udp 10.64.64.192:41116->10.3.0.1:53: i/o timeout [10:54:13] hmm [10:54:31] kube-system tiller not being able to talk to DNS ? [10:56:45] 10serviceops, 10Operations, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10jijiki) [11:00:59] <_joe_> ok, I'm going to lunch while you debug this [11:03:29] akosiaris: anything in container context? [11:04:11] jayme: no, not yet [11:04:27] this looks like a firewall [11:04:38] so... calico on the node? [11:05:10] 10serviceops, 10Scap, 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10User-jijiki: Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (10jijiki) [11:07:01] akosiaris: this feels like some catch 22 [11:08:23] yeah but I can't remember it [11:08:56] 10serviceops, 10Operations, 10Thumbor, 10User-jijiki: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) [11:09:08] 10serviceops, 10Operations, 10User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (10jijiki) [11:09:29] 10serviceops, 10Operations, 10Performance-Team, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) [11:10:39] akosiaris: but 10.3.0.1 is an external dns right? [11:10:47] yes [11:10:58] initialize_cluster sais we should use the internal one for tiller [11:11:11] it's the one that is on all hosts [11:11:37] Make sure we don't rely on the internal DNS service <- [11:11:40] 10serviceops, 10Operations, 10Thumbor, 10Patch-For-Review, and 2 others: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10jijiki) [11:11:57] Oh. Missread. 
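The label lookup discussed above, as commands; label value and node name are taken from the messages:

    kubectl get nodes -l dedicated=kask     # just the sessionstore-reserved workers
    kubectl get nodes --show-labels         # everything, as suggested above
    kubectl describe node kubernetes1005.eqiad.wmnet | grep -A2 -i taints   # the taint the kask toleration matches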
Sorry [11:12:51] weird, cali-failsafe-out has a rule [11:12:54] 0 0 ACCEPT udp -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:nbWBvu4OtudVY60Q */ multiport dports 53 [11:12:54] 10serviceops, 10Performance-Team, 10Platform Engineering, 10Wikimedia-Rdbms, and 2 others: Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10jijiki) [11:15:24] ah found it [11:15:35] well, no more correctly found out it's not about firewall [11:15:42] it's about networking being awry somehow [11:16:38] what do you mean? [11:16:55] the pod ip is not pingeable, nor can it ping anything itself [11:17:23] uh [11:18:28] 10serviceops, 10User-jijiki: Roll out proxy gutter pool - https://phabricator.wikimedia.org/T258779 (10jijiki) [11:25:34] hmmm [11:25:53] ok back to the firewalling thing. Seems like I sent myself down the wrong path [11:26:04] the calico backing store is missing both profiles and policies [11:26:25] it is indeed a chicken and egg problem jayme, I think we need to calico policy controller to get those [11:26:32] but I think I can shortcircuit it [11:27:00] <_joe_> i kinda remember a setup-calico script [11:27:06] was about to ask where those values should come from as we did only add the basic config to calico etcd [11:30:32] * effie later [11:31:13] got it [11:31:23] I created a calico kube-system profile [11:31:40] let me continue and we need to revisit this [11:31:48] it's a bad catch 22 and 1 I don't remember at all [11:32:10] ok it's now proceeding fine, I 'll start applying [11:32:24] * akosiaris makes notes to eat more omega 3 oils [11:32:42] added to notes [11:32:59] the profile, not your diet stuff :D [11:34:10] * _joe_ bbl [11:41:37] akosiaris: did you just run again? [11:43:35] yeah, it doesn't feel particularly successful [11:43:41] I mean I get errors every now and then [11:43:47] and on the next run they succeed [11:43:58] I 've run it a couple of times already [11:44:47] therer is 1 consistent error [11:44:52] about system:node RBAC rule [11:44:53] maybe we should make it exit on first error so it's easier to investigate [11:45:12] yeah, I've removed the rule in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/625869 [11:45:13] it looks like most errors are transient [11:45:19] helm diff fails for some reason [11:45:24] and then succeeds just fine [11:47:41] jayme: I think we are ok now [11:47:46] looks loke [11:47:48] *like [11:48:07] jayme: have you had lunch? [11:48:21] downtimes are scheduled until 19:00 UTC, we aren't in any rush [11:49:59] akosiaris: no, I had not. Downtimes for master and nodes will run out earlier but they sould be okay not [11:50:11] those are ok already [11:50:15] it's just services now [11:50:20] yep [11:50:24] let's reconvene in say 1h ? [11:51:09] sounds good [12:48:24] * akosiaris back [12:51:41] <_joe_> akosiaris: I can't understand why a pod in staging can't speak with restbase via https, but it can via http [12:51:58] <_joe_> I thought it had to do with the default network policy, but mobileapps has egress disabled [12:52:31] it probably [12:52:35] it probably does [12:52:53] <_joe_> and even when I enabled it and I explicitly allow reaching restbase in the networkpolicy, it's still unreachable [12:52:56] the egress thing in values.yaml is in preparation for the calico upgrade [12:53:03] <_joe_> (only the https port) [12:53:05] it's not yet doing anything [12:53:21] <_joe_> so what is actually doing anything? 
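One way to reproduce the restbase symptom described above from inside one of the mobileapps pods on staging; the pod name, restbase hostname and the plain-HTTP port are placeholders, 7443 is the TLS port mentioned earlier, and this assumes curl exists in the image:

    kubectl -n mobileapps exec -it <pod-name> -- curl -sv  http://<restbase-host>:<http-port>/
    kubectl -n mobileapps exec -it <pod-name> -- curl -skv https://<restbase-host>:7443/

If the first call answers and the second times out, that points at the egress/network policy rather than at TLS itself, which is where the discussion picks up below.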
[12:54:07] deployment-charts/helmfile.d/admin/common/calico/default-kubernetes-policy.yaml [12:54:32] <_joe_> ok, so there is no distinction between different services? [12:56:10] there is [12:56:15] look at the selector matches [12:56:49] <_joe_> heh not in the restbase section :P [12:57:16] <_joe_> also we'll need to point stuff in staging to talk to restbase-dev I guess, but that's another story [12:57:59] or better get rid of restbase, but anyway [12:58:06] so, which pod can't talk to restbase? [12:58:07] <_joe_> lol [12:58:22] <_joe_> akosiaris: should we re-add the services in eqiad? [12:58:41] I am waiting for janis to join and we 'll start instantiating them [12:58:49] hmpf, network-foo ... I'm back (if my last message had not made it [12:58:52] in the meantime I am trying to get the feel of the new hierarchy [12:58:57] I see it had not :) [12:59:02] :-) [12:59:14] ok, let's go then [12:59:34] 1sec for ssh [12:59:51] helmfile -e eqiad diff [12:59:57] so, for e.g blubberoid [13:00:11] that should only target eqiad, albeit with 3 different releases, right? [13:00:28] ah no 1 release "production" [13:00:28] <_joe_> akosiaris: yes [13:00:46] <_joe_> but you will get release=staging,installed=false [13:00:50] well, 2 but only 1 instantiated [13:00:51] ok [13:01:01] kinda confusing in the output I 'll say [13:01:04] <_joe_> so if you run "sync" it will apply "ensure no staging is present" [13:01:41] <_joe_> if you just run "apply" it will only install production [13:02:12] this apply vs sync is killing me at times [13:02:14] anyway [13:02:22] <_joe_> at times? :P [13:02:53] should I with the 7 services in the old format? [13:02:59] joes sentens topps that a bit :-) [13:03:00] or go for the new ones? [13:03:04] [08.09.20 15:01] <_joe_> so if you run "sync" it will apply "ensure no staging is present" [13:03:09] eheh [13:03:33] I say lets do the new ones and then the old ones [13:03:42] ah, in the new ones I don't need to source anything, right? [13:03:47] correct [13:03:54] ok lemme try blubberoid then [13:04:01] ok [13:05:13] that was fast [13:05:27] akosiaris: so you prefere sync over apply? [13:05:34] :P [13:05:41] nope. I have no real preference [13:05:49] <_joe_> he wants to have more sal log lines [13:06:04] Ah, I see. Thats an ok reason ofc [13:06:08] I tend to ask myself "Do I need to ensure that everything is like specified or do I know it already is and just want to apply a change?" [13:06:27] and depending on the answer I sync or apply [13:06:41] not sure it's worth it to even go down that rabbithole every time [13:07:59] <_joe_> akosiaris: proceed :) [13:08:18] teamwork terminal :) [13:10:13] akosiaris: you want to continue the new ones? I would start with the old ones in window-1 then [13:10:29] <_joe_> I'll look at pybal [13:10:36] sure [13:11:02] <_joe_> citoid and cxserver are coming back [13:15:19] eheh, was about to wonder why I don't log to SAL... :D [13:16:14] ;) [13:18:19] akosiaris: just curious, what are the eventgate deploys for? [13:18:32] (we have an alert right now about missing data from eg-analytics-external, need to investigate) [13:18:41] ottomata: https://phabricator.wikimedia.org/T239835 [13:19:05] we are in the last steps of reinitializing the entire k8s cluster [13:19:16] ...and it already pays off to impersonate akosiaris :P [13:19:17] <_joe_> ottomata: no need to investigate [13:19:24] the services are all pointing to codfw [13:19:29] <_joe_> as explained to elukey this morning [13:19:29] wow long task, night [13:19:30] so... why is there missing data? 
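For reference, the distinction being joked about: diff only prints pending changes, apply runs the diff and then syncs just the releases that differ, and sync unconditionally reconciles every declared release, including ones marked installed: false, which is where the "ensure no staging is present" effect comes from. Roughly, from the service's directory on the deployment host:

    helmfile -e eqiad diff     # show what would change
    helmfile -e eqiad apply    # diff first, then act only on releases with changes
    helmfile -e eqiad sync     # force every declared release to its desired state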
[13:19:32] nice* [13:19:35] akosiaris: probably unlreated [13:19:41] ah ok [13:19:42] i haven't started looking into it yet [13:19:51] <_joe_> ottomata: you mean the mirrormaker alerts? [13:19:54] just saw the deploys and was curious [13:19:56] <_joe_> they should be going away now [13:19:56] no its in hadoop [13:20:01] <_joe_> oh uhm [13:20:11] and its just an eqiad topic, but it is the eqiad ttest topic [13:20:21] that should always have data, beacuset eh readinessProbe creates it [13:20:45] i'll look into it [13:20:58] <_joe_> ottomata: no it should not [13:21:02] ottomata: there was no readiness topic for a while [13:21:10] oh? [13:21:10] <_joe_> as said above, we recreated the k8s cluster in eqiad [13:21:11] readinessProbe [13:21:19] OH [13:21:21] because downtime [13:21:22] <_joe_> the mirrormaker alerts were also related [13:21:28] so you were able to totally take it offline? [13:21:36] <_joe_> what do you mean? [13:21:38] yes, for multiple hours [13:21:40] so there really was no eqiad eg-analytics-external for a while? [13:21:42] ok perfect [13:21:44] that explains it then [13:21:45] <_joe_> yes [13:21:47] very good [13:21:53] from ~9:00 UTC to just about now [13:21:55] great, noĀ further investigation needed! [13:21:58] that sounds aboutu right [13:22:27] <_joe_> ottomata: also you'll be delighted to know [13:22:36] <_joe_> we moved all of the purges and resource_change to codfw [13:22:41] <_joe_> and that caused no crisis [13:24:02] cool, so the partition balancing was all that was needed? [13:24:11] seems strange...the traffic wasn't that much... [13:24:22] something really seemed different with kafka-main2003... [13:25:06] * akosiaris done [13:25:38] I've got a weird error from helm, see line 57 in https://etherpad.wikimedia.org/p/migrate-k8s-etcd ... [13:25:51] ofc it works on re-run... [13:26:10] maybe a race again *shrugging* [13:26:53] maybe downloading a new chart version? [13:27:29] <_joe_> afaict it's eventgate-main, eventstreams, and api-gateway that are still down [13:27:38] still doing those [13:32:07] akosiaris: I guess that happens when helm fetches a chart tgz in parallel in another deploy or so...as as instances use the same cache [13:32:16] yup [13:32:21] * jayme done [13:32:46] we should be all good now _joe_ [13:33:30] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [13:33:36] \o/ [13:33:54] * akosiaris waiting for the lvs alerts to also turn green but this looks like pretty well done [13:34:04] 49 deployments active in both clusters [13:35:30] we're loosing nodes :-o [13:36:17] yes, power cable issues [13:36:20] chris mentioned it [13:36:29] <_joe_> yeah eqiad is a mess rn [13:36:31] asw-d3 lost [13:36:53] okay. Happy if it was not us :) [13:37:07] it could have happened 30 mins ago [13:37:14] and we would be crying now [13:37:43] good that we took a lunch break [14:23:43] <_joe_> ok, let's coordinate fixing our stuff here [14:24:20] <_joe_> we need to check all those hosts with icinga criticals [14:26:26] šŸ‘‹ [14:31:01] _joe_: when you can, say more about what needs checking? 
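A sketch of the cross-check behind "49 deployments active in both clusters"; the kubeconfig paths are an assumption, any admin context on the masters works the same way:

    for c in eqiad codfw; do
      printf '%s: ' "$c"
      kubectl --kubeconfig "/etc/kubernetes/admin-${c}.config" get deployments --all-namespaces --no-headers | wc -l
    done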
I can get started on it [14:31:33] <_joe_> as I said elsewhere, we should check no proxy (scap, mcrouter) is down in eqiad [14:31:47] <_joe_> and then I think depool those hosts if this doesn't get resolved soon [14:31:52] 10:26:03 AM mw1320 and mw1287 are mcrouter proxies and are down [14:31:53] effie is on checking that AFAIK [14:32:20] those are just showing "check systemd state" now though, let me take a look [14:34:59] ^ both those hosts are healthy [14:35:02] rechecking the others just inc ase [14:35:42] (_joe_ ^) [14:36:17] <_joe_> the systemd state should recover soon [14:36:24] <_joe_> it was just ferm failing to resolve prometheus [14:36:48] yeah, recovered while I was looking [14:38:18] In what intervall do they retry? [14:38:38] can't seem to see this in the service file [14:39:41] they don't. that ferm unit needs to be restarted manually if failed [14:39:49] a big question is why did it even fail? [14:39:55] ferm only reloads on puppet changes [14:39:58] or reboots [14:40:14] but icinga retries every 1m IIRC for those [14:42:36] <_joe_> akosiaris: ferm tries to determine if it should run when puppet runs [14:42:44] <_joe_> first thing, it tries to resolve hostnames [14:43:00] <_joe_> that failed [14:43:11] that is something new [14:43:15] <_joe_> hence tha systemd failure [14:43:18] <_joe_> no it's not [14:43:56] you mean the status command failed? [14:46:40] <_joe_> yes [14:47:08] hmm there doesn't seem to be a status command ofc ... I 'll have to dig a bit into this [14:47:16] but overall ferm is created exactly so those issues don't exist [14:47:29] I 'll need to dig into it a bit more [14:48:38] <_joe_> akosiaris: what issues? We are the one who ask ferm to resolve hostnames [14:51:19] on which hosts did ferm fail? [14:51:55] i've restarted it on kubernetes1005 at least [14:52:54] <_joe_> apergos: on any host that lost connectivity with the rest of prod and had a puppet run scheduled during it [14:53:42] got it, thanks [14:54:11] restarted it on ~ 30ish [14:56:54] <_joe_> jayme: can you depool the k8s nodes that are down from pybal at least? [14:57:10] <_joe_> set them pooled=inactive [14:59:46] _joe_: sure. Might be a dump question but: How do I know what PyBal thinks is down? I don't see any complains in icinga [14:59:52] akosiaris: I think it's another occurrence of https://phabricator.wikimedia.org/T254477, symptoms match on e.g. mwdebug1002 [15:00:07] <_joe_> jayme: oh I thought we still had some down hosts [15:00:09] apart from 1004 and 1013 which alex depooled [15:00:14] hmmm, could we put rack metadata in confctl attributes, making it possible to do something like: confctl select dc=eqiad,rack=d[1-4] set/pooled=no ? [15:00:35] <_joe_> bblack: not really [15:00:37] oh, aprdon. he cordoned them. I'll double check [15:00:47] <_joe_> oh alex depooled them or cordoned them? [15:00:51] <_joe_> that's different [15:00:59] sure...checking [15:01:11] cordoned [15:01:25] <_joe_> jayme: ok so we need to depool them from any service :) [15:01:37] yep, on it [15:01:41] just kubesvc [15:04:44] they're only in service kubesvc AFAICT ... and ofc switched from down to unknown after depool :) [15:04:56] <_joe_> lol [15:05:09] <_joe_> just came back [15:05:11] <_joe_> ahahah [15:05:29] eheh [15:11:02] I'll wait for icinga to green and then repool and uncordon them [15:12:30] <_joe_> <3 thanks jayme [15:23:25] 10serviceops, 10observability, 10User-jijiki: Should we create a separate 'mwdebug' cluster? 
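The depool done above, spelled out; host and service names come from the messages, and the conftool selector syntax should be checked against the actual schema:

    # take the workers in the affected rack out of the kubesvc pool
    sudo confctl select 'name=kubernetes1004.eqiad.wmnet,service=kubesvc' set/pooled=inactive
    sudo confctl select 'name=kubernetes1013.eqiad.wmnet,service=kubesvc' set/pooled=inactive
    # keep the scheduler away from them while the rack is down
    kubectl cordon kubernetes1004.eqiad.wmnet
    # and the reverse once icinga is green again
    sudo confctl select 'name=kubernetes1004.eqiad.wmnet,service=kubesvc' set/pooled=yes
    kubectl uncordon kubernetes1004.eqiad.wmnet

The rack-based one-liner floated above would additionally need the rack stored as a conftool attribute first, which is the point of that suggestion.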
- https://phabricator.wikimedia.org/T262202 (10lmata) Adding to our backlog for prioritization [15:30:33] akosiaris: if you have a min would appreciate review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/624092, need that for a monitoring task i'm working on [15:39:51] <_joe_> ottomata: done [15:40:29] interesting, was scrolling through catalog looking for ways to do get the svc name [15:40:48] <_joe_> from monitoring if present [15:40:49] _joe_: what declares the actual svc name, just DNS? [15:40:56] <_joe_> yes [15:40:59] aye [15:41:09] you thikn pulling it out of monitoring is better? [15:41:16] also using that for the param check? [15:41:17] <_joe_> you can get it from monitoring.sites.$site [15:41:18] <_joe_> yes [15:41:24] ok [15:48:30] <_joe_> jayme, akosiaris pybal seems happy re: k8s in eqiad [15:48:51] _joe_: yeah. nodes already repooled and uncordoned [15:49:18] <_joe_> I was also referring to services [15:49:23] <_joe_> they all seem up and running [15:49:32] <_joe_> can I get a review of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/625900 ? [15:49:47] Ah, that other thing we did today... :P [15:50:11] sure thing, I'll have a look [15:52:58] _joe_: updated patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/624092 [15:54:54] <_joe_> ottomata: I don't remember if I wrote a spec for that function [15:54:56] <_joe_> lemme check [15:55:20] nope i looked [15:55:26] just ran PCC on restbase nodes (only place I can see this used) [15:55:34] <_joe_> no I didn't :P [15:55:41] <_joe_> ottomata: cool [15:57:09] never written a puppet spec, [15:57:25] if i add someĀ fake service::catalog hiera to wmflib/spec/fixtures/hieradata [15:57:29] perhaps I could? [15:57:31] is that how that works? [15:58:17] meh dunno :p [15:58:32] ok, thanks for +1s, merging, will run pupet on a restbase node to double check [16:00:45] cool, noop [16:00:46] ty [16:01:05] <_joe_> yw! [16:01:12] <_joe_> and thanks for improving that function [16:01:31] <_joe_> akosiaris: how do I apply that egress policy change? kube-system? [16:02:30] cluster-helmfile sync [16:02:34] but I am doing that now [16:02:41] <_joe_> start from staging pls [16:03:20] too late [16:03:29] staging is happening in about 2 s [16:04:23] _joe_: try it out [16:04:29] <_joe_> yes it worked [16:04:39] <_joe_> mobileapps says "all endpoints are healthy" [16:04:49] <_joe_> and hear this [16:05:01] <_joe_> it's not calling meta via the public url anymore [16:07:05] https://www.youtube.com/watch?v=bsfdxHLD3Lk [16:07:36] it wasn't btw since the k8s migration [16:07:48] michael worked hard on it [16:10:13] <_joe_> you are wrong :) [16:10:38] <_joe_> mobile_html_rest_api_base_uri and mobile_html_local_rest_api_base_uri_template are both still used. [16:16:42] <_joe_> akosiaris: see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/625936 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/625937/1 [17:23:00] 10serviceops, 10Operations, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [17:29:00] hey, i took 2 days of vacation. 
i will be working again Thursday [17:36:32] 10serviceops, 10Operations, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [17:58:32] 10serviceops, 10Scap, 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10User-jijiki: Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (10jijiki) Due to some issues when rebuilding the package, it is not certain if this will be rolled out this week. [18:01:43] 10serviceops, 10Scap, 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10User-jijiki: Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 (10thcipriani) >>! In T261234#6443739, @jijiki wrote: > Due to some issues when rebuilding the package, it is not certain if this wi...