[04:32:13] _joe_: updated on your behalf at https://phabricator.wikimedia.org/T266055#6782828, as Aaron was wondering. hope I got that right, more or less.
[08:19:11] In 40 minutes I will restart the m5 (wikitech) database primary master; wikitech will be unavailable for around 1 or 2 minutes
[08:54:37] In 5 minutes I will restart the m5 (wikitech) database primary master; wikitech will be unavailable for around 1 or 2 minutes
[09:10:39] <_joe_> vgutierrez: I'll upgrade pybal on the ulsfo backup lvs now
[09:10:50] cool, go ahead
[09:44:27] FYI, I'm enabling CAS for debmonitor in ~10 minutes
[09:45:22] ack, thx
[09:46:56] <_joe_> will that affect docker-report?
[09:47:10] <_joe_> I'm guessing it won't, given it doesn't use password auth
[09:47:53] yeah, this only affects the web UI; the internal endpoint for image submission will remain unchanged
[10:09:22] <_joe_> hnowlan: I can't find a DNS PTR for the IPs of the sockpuppet service, nor the A record, for what it's worth
[10:09:50] <_joe_> which means, given the service is in state:production, that I'm not sure how icinga might monitor it
[10:10:02] <_joe_> oh you have the discovery record, I see
[10:25:38] elukey: I commented on the archiva/nginx config. Thanks for jumping into it! I am now wondering whether the slowness could be caused by Nginx itself, since it seems to be buffering proxied requests. I have commented on https://gerrit.wikimedia.org/r/c/operations/puppet/+/608812 :]
[10:36:10] hashar: thanks for the review, I commented as well :)
[10:36:31] it feels more like Jetty/Archiva is the culprit, and send_file in nginx should help a lot
[10:36:55] but I am open to trying different settings
[10:54:15] elukey: yeah, your change (bypassing jetty/archiva) kind of makes sense
[10:54:27] but that left me wondering how the perf could be THAT crippled :]
[10:55:16] last night I looked at the Archiva doc a bit; it seems to indicate it has an in-memory buffer for artifacts. So assuming that cache is enabled on our setup, I would expect the files to be served straight from memory after a second maven run
[10:55:19] _joe_: true, I have a CR open for that: https://gerrit.wikimedia.org/r/c/operations/dns/+/658976
[10:55:56] <_joe_> hashar: it's java(TM)
[10:56:06] <_joe_> hnowlan: ack :)
[10:56:33] hashar: yes, I agree that default settings + jvm + jetty could be the main culprit, but also remember that archiva1002 is a relatively tiny ganeti vm, so we cannot cache a ton of things in memory (but we can expand it if needed)
[11:09:31] elukey: yeah, duly noted :]
[13:15:02] elukey: disabling nginx buffering made it faster (from 25 seconds to 17 seconds) (maven central is 7 seconds)
[13:15:11] but yeah, that got improved
[13:15:31] it still spends 1.1 seconds idling before the transfer starts though
[14:36:33] This is pretty cool: https://blog.seekwell.io/gpt3
[14:50:04] hashar: thanks, great!
[14:52:44] can we get a GPT-3 that writes OKRs for us? :)
[15:10:52] <_joe_> bblack: who told you someone doesn't already have it?
[16:02:22] can someone tell me if this homer diff is OK to merge?
[16:02:25] https://www.irccloud.com/pastebin/M0cJza1d/
[16:02:53] nevermind, that wasn't the device I was looking for
[16:03:27] volans and I have been puzzling over that diff for a day or two now
[16:03:53] arturo: please let me know when your homer run is completed
[16:03:59] cdanis: done
[16:04:03] ty
[16:04:10] https://www.irccloud.com/pastebin/S9XrZB3F/
[16:04:13] for the record
[17:18:47] Hi SRE folks. Can someone +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/659335 ?
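
[Editor's aside] The archiva slowness thread above comes down to two nginx knobs: serving on-disk artifacts directly (sendfile) instead of proxying everything through Jetty/Archiva, and not buffering the proxied responses. Below is a minimal sketch of what such a site config could look like; the server name, repository path, and backend address are made-up placeholders, and this is not the actual puppet-managed config behind the Gerrit changes discussed above.

    # Illustrative sketch only, not the production archiva nginx config.
    # archiva.example.org, /srv/archiva and 127.0.0.1:8080 are assumptions.
    server {
        listen 80;
        server_name archiva.example.org;

        # let the kernel copy static artifact files straight from disk
        sendfile on;
        tcp_nopush on;

        location /repository/ {
            root /srv/archiva;        # hypothetical on-disk artifact root
            try_files $uri @archiva;  # fall through to the app if not on disk
        }

        location @archiva {
            proxy_pass http://127.0.0.1:8080;   # assumed Jetty/Archiva backend
            proxy_set_header Host $host;
            # stream responses instead of buffering them in nginx first,
            # the change that took the test download from ~25s to ~17s above
            proxy_buffering off;
        }
    }

How much each knob matters depends on whether the artifacts are already on local disk; the remaining 1.1 seconds of idle time noted at 13:15:31 would sit on the Jetty/Archiva side either way.
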
[17:20:40] I can
[17:23:20] Thanks effie!
[17:24:49] :D
[18:07:40] <_joe_> dancy: btw, I did make some progress en route to restarting php-fpm at every deployment
[18:07:55] I saw that!
[18:08:03] Happy to see the progress. Looks promising
[18:08:05] <_joe_> there is one detail - how scap will handle failures from the restart script
[18:08:22] Nod. I believe Tyler answered that question well.
[18:08:23] <_joe_> I need to understand that, basically, then I think we can start testing
[18:08:28] (the answer is, it keeps going)
[18:08:31] <_joe_> oh ok, I missed it :)
[18:09:16] <_joe_> heh yeah, I might want to do some more tweaks to the way we restart the service, then
[18:09:48] <_joe_> right now the script is "defensive": if something goes wrong, it exits with a non-zero exit code, expecting someone with experience to take a look
[18:10:08] <_joe_> we might want to instead just restart unconditionally if something goes wrong with depooling/repooling
[18:10:57] Sounds reasonable to me.
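
[Editor's aside] The php-fpm restart discussion above boils down to a depool -> restart -> repool sequence and the question of what to do when a step fails. Below is a hedged bash sketch of that flow; the depool/pool wrapper names and the service unit are assumptions for illustration, not the actual restart script _joe_ refers to.

    #!/bin/bash
    # Sketch of the depool -> restart -> repool flow discussed above.
    # Assumptions: `depool`/`pool` wrappers exist on the host and the
    # php-fpm unit is called php7.2-fpm; this is not the production script.
    set -u

    SERVICE="php7.2-fpm"
    DEFENSIVE=true   # current behaviour: stop and let a human investigate

    if ! depool; then
        if "$DEFENSIVE"; then
            echo "depool failed; leaving the host for someone to inspect" >&2
            exit 1   # non-zero exit: scap records the failure and keeps going
        fi
        # alternative floated above: restart unconditionally anyway
        echo "depool failed; restarting php-fpm regardless" >&2
    fi

    if ! systemctl restart "$SERVICE"; then
        echo "restart of $SERVICE failed" >&2
        exit 1
    fi

    if ! pool; then
        echo "repool failed; host stays depooled" >&2
        exit 1
    fi

Either way, scap treating a non-zero exit as "log it and keep going" (as noted at 18:08:28) means a failed restart on one host does not abort the whole deployment.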