[07:12:31] <_joe_> https://phabricator.wikimedia.org/T226048 has been reopened
[07:12:39] <_joe_> someone needs to look into it this morning
[07:15:41] ema and myself will pick this up, all the varnishes are now rebooted to a fixed kernel, so the next step is to a/b test whether re-enabling TCP SACKs makes a difference here
[07:16:06] <_joe_> moritzm: I think there are two issues superimposing
[07:16:24] probably even more than that...
[07:16:47] <_joe_> https://phabricator.wikimedia.org/T226318 the title is misleading
[07:17:01] <_joe_> trying to render some svg thumbnails results in a 429
[07:17:14] <_joe_> from the backend
[07:17:23] ack
[07:17:27] <_joe_> so i suspect some thumbor rate-limiting edge case
[07:50:53] jynus: o/ - if you are still ok for db1107/8 I'll start the prep work to stop traffic/replication
[07:51:06] yes
[07:51:19] let me know when things are green to proceed
[07:51:25] super
[07:54:52] jynus: free to go (already downtimed db1107/8 for an hour)
[07:58:06] I am going to leave it some time while I prepare
[07:58:16] so e_sync is in sync
[08:02:35] ack
[08:11:44] elukey: going for restart then, starting with the master
[08:14:09] +1
[08:21:54] elukey, marostegui: advice https://phabricator.wikimedia.org/P8645
[08:22:55] jynus: Last time what I did with chris was to wait for him and exchange it with some other module and then hit F1
[08:23:09] So given it is the master, probably you want to F1 and create a ticket with chris to do that later
[08:23:46] if it allows you to boot..
[08:23:49] +1
[08:24:10] and that is why I like to reboot servers from time to time
[08:24:42] now that I think about it...https://phabricator.wikimedia.org/T222050
[08:24:51] jynus: you might want to re-open that one
[08:25:10] it is A3, the same one that your paste shows
[08:25:12] it booted, but then got stuck on "Loading initial ramdisk ..."
[08:25:47] I love how the logs says: It has been corrected by h/w and requires no further action
[08:25:57] And then when it boots...it is no longer recoverable :(
[08:33:49] jynus: db1107 is still under warranty, I just checked
[08:37:54] going with db1108 next
[08:39:33] jynus: did it boot at the end or is it still in a weird state?
[08:39:43] it booted
[08:39:50] after 2nd try
[08:39:50] ah nice
[08:39:53] thanks a lot!
[08:40:15] you should probably insist on https://phabricator.wikimedia.org/T222050
[08:40:27] as probably it is not the last time we hear from this
[08:41:28] yep I agree
[08:41:56] normally I would suggest to switch master and replica
[08:42:07] but I think in this case the replica is more important for the service
[08:47:08] I'll ping chris today if he is around
[08:47:20] great
[08:47:33] hopefully we'll be able to fix this issue and return the hosts soon to the spare pool :)
[08:47:43] <3 <3
[08:48:01] the other booted without issues
[08:49:04] elukey: in order to finish the upgrade, I need to run puppet once
[08:49:09] but I need to sync with you
[08:49:54] I guess I can start mysql first
[08:51:27] jynus: if db1107 is up and running fine, then it is ok to re-enable puppet on 1108 anytime
[08:51:36] (if mariadb is up is better yes)
[08:52:04] ok, I will let you do that, but check if something else changes on puppet run
[08:52:14] from the mysql perspective everything is done
[08:52:35] https://tendril.wikimedia.org/host/view/db1107.eqiad.wmnet/3306
[08:52:40] https://tendril.wikimedia.org/host/view/db1108.eqiad.wmnet/3306
[08:52:47] super thanks a lot!
[08:52:48] doing it now
[08:53:05] the problem is that after upgrade
[08:53:12] some puppet code overrides packages
[08:53:28] so I try to run puppet after upgrade to avoid issues
[08:53:46] in theory puppet should not interact with dpkg, but in practice it does
[08:54:32] the only thing that changed was nagios-nrpe-server.service
[08:54:43] ok
[08:55:03] then from our side nothing else is due unless some alert happens
[08:55:36] thanks a lot for the work!
[08:56:01] you also have now a safe kernel in addition to an upgraded mysql
[08:56:10] 2x1!
[08:57:08] \o/
[13:26:45] hey akosiaris how do you populate `/etc/kubernetes/tokenauth` in the current puppet code?
[13:33:48] arturo: from the puppet private repo. It's the k8s_infrastructure_users hiera var
[13:34:10] which I've been wanting very very much to remove, but ... toolforge
[13:34:42] it's an ugly global hiera var that's used confusingly in >1 places
[13:37:00] arturo: why do you ask though? Some toolforge issues?
[15:06:17] akosiaris: I'm writing the puppet code for the new k8s version in toolforge, and I don't know what to do with that file (or even, if I need it at all)
[15:13:28] arturo: you probably need it, unless you have plans to move to some either type of authentication
[15:13:35] s/either/other/
[15:23:30] akosiaris: is that for ABAC based auth? We would like to use now RBAC
[15:24:17] arturo: RBAC/ABAC are authorization, not authentication. Completely orthogonal
[15:24:38] ok I see
[15:24:54] that part has been well designed by the kubernetes devs
[15:25:02] many pluggable authentication mechanisms
[15:25:09] I'm seeing the actual tokenauth file in our currently-running k8s cluster in toolforge and I understand it now
[15:25:16] that's probably what maintain-kubeusers is doing
[15:25:32] yup, in toolforge maintain-kubeusers should be the code handling it
[15:25:55] ok, so for authentication, we would like to use now x509 certs, but that is an open question right now
[15:26:06] open question: how to handle x509 certs for each tool
[15:26:14] probably extending maintain-kubeusers
[15:39:58] arturo: you probably want to read https://phabricator.wikimedia.org/T177393
[15:40:30] * arturo reading
[15:41:45] really interesting akosiaris
[17:00:58] halftime score is 1–1
[17:15:04] jaufrecht: spain-USA isn't it? who are you supporting? :-P
[17:15:35] I'm just cheering for the beauty of sport and the integrity of FIFA
[17:47:42] jijiki: sorry I didn't read the ticket when we were in the meeting. i think that one does need some kind of SRE approval because it is a new shell account request.
[17:48:06] I think the request wasn't filed particularly well, tho, linked to the procedure.
[17:48:08] I think we are delegating that
[17:49:15] jaufrecht: two penalties....is that the only way USA can score?
[17:49:38] it is also asking for deployment hosts too
[17:57:16] ottomata: I just remembered, you are right
[17:57:24] I pinged them for it, and they never replied
[17:57:44] this is totally my bad
[17:57:53] I will update the sre doc
[17:59:31] * jijiki is confused
[17:59:50] ottomata: ok it is late here so I am not thinking properly
[18:00:08] jijiki: no worries i am confused too! :)
[18:00:18] my final answer is I don't know what we should do in that case
[18:03:36] i think the request was just filed badly, this is a new production shell account request
[18:24:39] marostegui: it may be if the refs won't call persistent infringement
[18:26:11] arturo: as we cosmopolitan elits say, 🇪🇸 v. 🇺🇸
[18:26:19] * jaufrecht is very elit
[18:49:48] where this elit mit to kick bal with fit