[06:55:46] vgutierrez: just seen the curl -i vs -v issue, sigh, sorry [06:55:53] no problem :) [06:55:57] and good morning/afternoon :) [06:56:05] yey! 3pm here already [06:56:19] the funny part is that I didn't make a mistake in the Dockerfile but the curl was wrong [06:56:22] ahahhaa [06:56:34] anyway, going to check again, useful to have a repro [06:56:40] yup :) [06:56:49] I attached a Dockerfile as well [06:56:54] using the dev stretch image [06:57:18] but you should be able to trigger it in your setup as well [06:57:18] if last upstream still works with -v (need to retest to be sure) then we might use git bisect to find the commit that fixes it, or simply see if the buster version fixes it [06:59:04] while thinking about git bisect, I remembered this https://github.com/apache/httpd/blob/2.4.x/CHANGES#L443-L444 [06:59:38] that is a generic change in the http output filter (so after mod_proxy_*) [07:00:15] httpd on buster is 2.4.38, so in theory we shouldn't repro on it [07:02:54] interesting [07:04:07] as long as it fixes it.... :) [07:11:55] if this is the case, we'd need to figure out what to do right now [07:12:20] we could think about adding a patch to the current httpd version while waiting for buster [07:13:07] yep [07:23:09] vgutierrez: with buster's version I don't see the "Excess found in a non pipelined read" [07:23:26] just tested with docker and fpm 7.2 [07:30:22] the fix that I was talking about is http://svn.apache.org/viewvc?view=revision&revision=1837056 [07:34:11] yep [07:34:19] I got the same result here [07:34:27] debian:stretch KO, debian:buster OK [07:34:29] ok added the info to the task [07:34:31] nice catch [07:34:31] nice :) [07:35:02] _joe_: ^^ so apache2 shipped in buster doesn't have the 304 body issue [07:35:18] _joe_: elukey is suggesting to backport the patch, what do you think? [07:35:35] <_joe_> vgutierrez: we do that all the time [07:35:42] <_joe_> but we should also fix the code [07:35:49] wikibase code?
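curl's "Excess found in a non pipelined read" warning boils down to: bytes showed up after the headers of a response that isn't allowed to carry a body. A self-contained sketch of that check (illustrative only, not curl's actual implementation; the sample responses are made up):

```python
def has_forbidden_body(raw_response: bytes) -> bool:
    """True if a 204/304 response carries payload bytes after its headers.

    Per RFC 7230/7232, 204 and 304 responses must not have a message body,
    so anything after the blank line that ends the header block is "excess".
    """
    head, sep, rest = raw_response.partition(b"\r\n\r\n")
    if not sep:
        return False  # headers not even complete; nothing to judge yet
    status_line = head.split(b"\r\n", 1)[0]  # e.g. b"HTTP/1.1 304 Not Modified"
    status = int(status_line.split()[1].decode())
    return status in (204, 304) and len(rest) > 0

# the stretch mod_proxy_fcgi symptom: a 304 with a leftover body
buggy_304 = b"HTTP/1.1 304 Not Modified\r\nContent-Type: text/html\r\n\r\n<html>stale</html>"
fixed_304 = b"HTTP/1.1 304 Not Modified\r\nContent-Type: text/html\r\n\r\n"
```

Running a conditional GET against both images and feeding the raw responses through a check like this is essentially what the docker repro boils down to.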
[07:35:53] <_joe_> yes [07:36:02] yep I agree [07:36:37] <_joe_> elukey: interestingly nothing of that patch has to do with mod_proxy_fcgi [07:36:58] could you reply to WMDE folks on the task _joe_? https://phabricator.wikimedia.org/T237319#5681384 [07:37:14] probably your answer would be some orders of magnitude more accurate than mine [07:39:50] _joe_ yes basically the fix is done at the http output filter level, just before the response is returned to the client (or better, while it is streamed to the client) [07:39:59] so it catches all corner cases [07:40:43] vgutierrez: ahahah I just replied with a super simple answer :P [07:40:54] <3 [07:40:59] thanks elukey [07:44:56] <_joe_> so, is there a way to log when we get a 304 with a body in apache? [07:45:03] <_joe_> I mean with the backported patch? [07:45:42] not that I know [07:45:59] we should instruct the code to do it [07:46:04] shouldn't be too difficult [07:46:08] <_joe_> yeah I'm on the verge there [07:46:18] <_joe_> on one hand I want to be liberal in what we accept [07:46:30] <_joe_> on the other hand I don't want to hide programming errors for years [07:46:40] makes sense [07:50:40] the main problem is that we'd need to keep a patch on top of buster and even later (if we don't convince upstream to log) [07:51:44] which apache release has the patch? or no release yet? [07:51:53] moritzm: 2.4.35 [07:52:22] but then buster has the patch and no need to keep the patch rebased? [07:52:28] indeed [07:52:30] buster is not affected [07:53:07] <_joe_> bbiab [07:53:48] moritzm: we were discussing adding extra logging, that would need a patch even for buster, but probably we are not going to do it :) [07:54:19] <_joe_> without logging, I'm against patching apache rn [07:54:37] <_joe_> but I'll expand in a bit [08:12:05] good morning [08:12:23] so we do know more about the 304+body issue now it seems :) [08:14:02] morning [08:14:45] ema: that's the way to go with smokeping?
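What the r1837056 change does at the output-filter level, plus the extra logging being weighed above, can be modeled in a few lines. This is a toy sketch of the behavior, not httpd code; the function name and log format are invented:

```python
import logging

log = logging.getLogger("toy-output-filter")

def filter_body(status: int, body: bytes) -> bytes:
    """Mimic the httpd 2.4.35 http output filter rule: 204/304 responses
    must not carry a body, so any body handed up by the application (here,
    via mod_proxy_fcgi) gets dropped before the response is streamed out.

    The warning is the part stock httpd does NOT do, which is the whole
    "hide programming errors for years" concern above.
    """
    if status in (204, 304) and body:
        log.warning("dropped %d stray body bytes on a %d response", len(body), status)
        return b""
    return body
```

With logging in place, a buggy origin (the Wikibase 304+body case) stays visible in the error log instead of being silently papered over.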
[08:15:05] usually we generate PuppetCA certs for those [08:15:28] but I'm confused by librenms setup there [08:15:54] hmmm I see... [08:15:59] vgutierrez: usually, yes. There are cases in which we use apache+acme_chief though, see 9c2492284739504940d995bd37402706428bd110 [08:16:10] smokeping.wikimedia.org is behind the caching layer [08:16:17] but librenms handles their own TLS termination [08:17:23] and those are served from the same origin, netmon1002 [08:17:48] right, which only has librenms.wm.org in SAN [08:18:19] actually, now that I think about it [08:18:36] vgutierrez: shouldn't we add smokeping.wm.org as a SNI to the librenms entry? [08:18:45] in hieradata/role/common/acme_chief.yaml that is [08:18:55] ema: why there and not to the netbox one? [08:18:56] ;P [08:19:19] I guess we need some clarification from volans, XioNoX and friends [08:19:27] yes there's some confusion :) [08:19:55] also I don't see where they use the netbox cert on netmon boxes [08:20:04] so maybe that auth. regex is deprecated and should be removed [08:50:12] <_joe_> so going back to the 304 with body issue [08:50:36] <_joe_> I think we might apply the patch now, while WMDE fixes the bug [08:50:42] <_joe_> and unapply it afterwards [08:50:49] what does the patch do again? [08:51:07] <_joe_> suppress the response body if status is 204 or 304 [08:51:15] right :) [08:51:18] <_joe_> now I just read the fcgi spec (boring) [08:51:30] <_joe_> and it says nothing about any of it, given it's not tied to HTTP [08:51:44] <_joe_> so php-fpm is correctly not fixing what you do [08:52:07] <_joe_> it's your responsibility to respect the HTTP protocol in your responses [08:52:42] fcgi is not tied to http? [08:52:46] trusting humans... 
sigh [09:11:58] <_joe_> ema: yes, but no [09:12:19] <_joe_> ema: it is tied to http requests, but nothing about respecting the protocol is said anywhere in the spec [09:12:26] <_joe_> unless I'm missing something [09:14:01] oh I see [09:20:48] _joe_ +1 [09:21:19] <_joe_> I'll add a test to httpbb [09:21:28] <_joe_> btw I was thinking [09:26:24] yes? :) [09:29:21] ThoughtTrainTimeout [09:29:24] * vgutierrez hides [09:29:40] he only said he was thinking, not that he was about to say what he was thinking :-) [09:45:04] <_joe_> yeah sorry [09:45:13] <_joe_> I got distracted by someone asking things :P [09:45:21] <_joe_> no stashbot here? [09:45:28] <_joe_> or wikibugs whatever is the name [09:46:35] <_joe_> so, I was thinking, we have to debug the same http request over and over, all across our copious stack [09:46:46] <_joe_> it would be great to have a program that automates it [09:48:05] <_joe_> something like stack-curl https://en.wikipedia.org/wiki/Main_Page --compress [09:48:34] <_joe_> and munges the url appropriately to get the response at each place in the stack [09:48:56] <_joe_> tls termination, fe cache, be cache, applayer [09:49:20] <_joe_> and somehow visualize the differences [09:49:35] <_joe_> there is one issue with this. The urls might vary too [09:49:54] <_joe_> but we could start small and go with MediaWiki at least [09:51:49] the urls shouldn't vary, we can just file bugs for those cases :) [09:54:57] vgutierrez: I was staring at the browsersec warning stuff again, thinking "oh, this doesn't actually catch API reqs, nor anything on cache_upload", which is part of the reason for all our different filters for this over the past several iterations [09:56:04] that's on purpose AFAIK [09:56:11] or am I missing something?
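The stack-curl idea is mostly URL munging: expand one public URL into the per-layer URLs to fetch, then diff the responses. A sketch of just the expansion step; the layer names and internal hostnames below are placeholders, not our real endpoints:

```python
from urllib.parse import urlsplit

# hypothetical per-layer endpoints; real ones would come from site config
LAYERS = [
    ("tls-termination", "https://{host}{path}"),
    ("fe-cache",        "http://{host}.frontend.cache.local{path}"),
    ("be-cache",        "http://{host}.backend.cache.local{path}"),
    ("applayer",        "http://appserver.local{path}"),
]

def stack_targets(url: str):
    """Expand a public URL into (layer, url) pairs a stack-curl would fetch;
    each request would still send the original Host header so per-layer
    routing keeps working."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return [(name, tmpl.format(host=parts.hostname, path=path))
            for name, tmpl in LAYERS]
```

Fetching each target and diffing status/headers/body would then cover the "visualize the differences" half.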
[09:56:33] I think since we never fully documented past thinking, we can only guess [09:56:45] (at what were our own motivations at the time) [09:56:56] the one we have now is, I think, what we used at 100% last time around [09:57:20] <_joe_> bblack: like restbase I mean [09:57:20] "the one" meaning the URL/method filtering part: [09:57:22] && req.url ~ "^/wiki/" && req.url !~ ":" && req.method == "GET" [09:57:31] _joe_: yes, exactly, me too [09:57:52] I actually have a phab ticket about that, which has been open for years [09:58:08] <_joe_> bblack: I was about to write a task trying to summarize the ideas we floated around at the meeting with services about API routes [09:58:49] <_joe_> the more I think about your proposal (have the same middleware used service-to-service and edge-to-service) the more I like it [09:58:52] or maybe it's buried in an unrelated phab ticket [09:59:19] <_joe_> but that's probably more of a design doc than a task, dunno [09:59:49] while I'm dredging up old tasks, cxserver was supposed to migrate into restbase some year, but I guess that's no longer in the cards heh [09:59:54] https://phabricator.wikimedia.org/T133001 [10:00:14] ah here's the one: https://phabricator.wikimedia.org/T167972 [10:00:26] (about not mangling URIs to be different at public and restbase service layers) [10:00:33] <_joe_> bblack: if we go with the middleware idea, it would basically do what you liked about restbase [10:00:44] <_joe_> the api routing and abstraction from the edge router [10:00:51] <_joe_> s/router/cache/ [10:01:44] so there's a few different angles to take on the API-layer/router thing [10:02:01] <_joe_> or well, any api router really [10:02:12] I think I've reached a point where I understand most of them independently, but I haven't yet re-integrated all of them into a shared view in my mind that's clear enough for specific recommendations anymore [10:02:35] re: the lower-level tech topics like "could a declarative envoy config be enough?" 
[10:03:24] because I'd really like a world in which whatever this "layer" is, it's a virtual layer and not yet another real service which is its own cluster and layer and point of failure and all of those things (and real code) [10:03:38] but I'm not 100% sure that's going to meet all inter-related needs anymore [10:04:36] maybe it will though, with some service beneath it for certain more-complex translation cases [10:05:13] <_joe_> so, I wholeheartedly agree [10:05:24] <_joe_> and whatever is the implementation we decide to go with [10:05:52] and by "translation" cases... part of this thing's job, I think, will be to own the canonical, enduring, versioned public API for all things wikimedia. [10:06:15] <_joe_> it should be a middleware, meaning we'd run it on every cache node [10:06:18] and be the thing that insulates everything above it from the shifting sands of different projects/services/programming-languages/eras of various backend services implementing parts of it [10:06:33] <_joe_> that's a tall order heh [10:06:41] it is! [10:06:42] <_joe_> if you don't want "real code" there [10:06:51] <_joe_> I agree, don't get me wrong [10:06:53] but 90% of that job will be at most URI-rewriting [10:06:58] <_joe_> yes [10:07:01] there will be some edge cases though... [10:07:21] where you have to have some real "translation" code to meet some need (e.g. transform the actual content) [10:07:38] <_joe_> rewritoid [10:07:54] but I think those might be limited-enough cases, they don't have to be a core capability [10:08:08] <_joe_> or well [10:08:16] there can be some separate real-code actual-service that this thing uses to reach through to other services for those edge cases.
[10:08:17] <_joe_> one could argue that if you need to translate content [10:08:27] <_joe_> you deploy a lambda on kubernetes [10:08:42] <_joe_> and stop polluting the routing layer with code [10:08:56] by translate content, what I mean is that maybe during a transitional period while cobbling some API spec out of the legacy bits we have, we have to take the content output from service X and re-format it from xml to json or something dumb like that. [10:09:35] <_joe_> sure, and that can be a small lambda running separately, the router will just need to know something like [10:09:51] <_joe_> "If accept is application/xml, then route here" [10:11:24] I suspect the first time this canonical API comes together, it will be starkly different than the underlying APIs it pulls together underneath, and so most URIs will be rewritten between the public view and the per-service internal request view. So it will be a regression on that front, in that sense. [10:12:03] but over time we'd want them to re-converge, where ideally even if the underlying services are swapped out regularly and functionality shifted between them, etc... they're cognizant enough of the upper-layer API to avoid the need for most rewrites.
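The "mostly URI-rewriting, plus an occasional content-negotiation branch" router sketched in this exchange fits in a small table. All routes, service names, and the internal path layout here are invented for illustration:

```python
def route(public_path: str, accept: str = "*/*"):
    """Map a public API path to an (internal_service, internal_path) pair.

    The xml branch models the "If accept is application/xml, then route
    here" case: same public route, but dispatched to a small standalone
    translator instead of polluting the router with transform code.
    """
    prefix = "/api/v1/page/"  # hypothetical canonical public route
    if public_path.startswith(prefix):
        if "application/xml" in accept:
            return ("xml-translator", public_path)
        # the common case: a pure rewrite between public and internal views
        return ("mediawiki", "/w/rest.php/page/" + public_path[len(prefix):])
    return ("mediawiki", public_path)  # default passthrough
```

As the underlying services re-converge on the public API, entries in the rewrite branch collapse into the passthrough case.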
[10:14:44] vgutierrez: back on that other topic, I think we should probably be intercepting API reqs, and maybe cache_upload too (they won't see the 200 output probably, but at least they might see some breakage and dig), but we still have to be careful to avoid loops (don't hit sec-warning or the static image it loads), and avoid Special:Foo pages, which we had filters for before [10:15:06] so something more like: [10:15:09] && req.method == "GET" && (req.url !~ "^/wiki/" || req.url ~ ":") [10:15:12] && req.url != "/sec-warning" && req.url !~ "^/static/images/" [10:15:25] will make a patch and rebase the X% stuff onto it, in a sec [10:16:06] although I hate the negation-heavy pattern in chained booleans, it's always so hard to follow mentally :P [10:16:30] but in this case, I don't know if it can be made clearer without making the rest way more complicated [10:17:15] <_joe_> it's pretty straightforward [10:18:05] <_joe_> "request is GET, is not a wiki page or has ":" in the url, the url is not /sec-warning or a static image" [10:18:16] <_joe_> don't we also need to allow css?
[10:18:26] yeah that's probably part of the missing thinking before [10:18:46] we've iterated through several versions of this "filter what we'll sec-warning-redirect" during the past few such runs [10:18:56] I don't think we've ever landed on a perfect version of it heh [10:19:18] what's running right now is a very restrictive one that's very safe [10:19:47] it's only "request is a GET, and matches ^/wiki/, but has no : in the URL" [10:19:58] which avoids all the css/images/sec-warning/etc problems [10:20:14] but it also fails to alert API users, or sites fetching/proxying just cache_upload images independently, etc [10:22:41] maybe a better way to focus on the non-human direct traffic is to look at the UA [10:23:10] in recent data when looking at bad-tls cases anyways, all the bot cases matched one of two conditions: empty UA, or a UA that matches /[Bb]ot/ [10:23:25] and none of the legit human UAs happen to match /[Bb]ot/ either [10:24:18] && ((req.url ~ "^/wiki/" && req.url !~ ":") || (!req.http.user-agent || req.http.user-agent ~ "[Bb]ot")) [10:24:39] so we could do that, and avoid impacting other stuff for human UAs (like css, image loads, etc) [10:37:58] so we should provide a valid format for api requests [10:38:16] returning a HTML payload doesn't seem right [10:38:34] good luck with that though [10:38:48] the important thing is it will be a broken request and someone will have to look at it, IMHO [10:39:08] (and it does carry a content-type at least) [10:39:38] most of the bots probably won't even follow the 302 [10:39:50] so the operator is going to see the 302 in logs and the /sec-warning URI [10:39:54] (we hope) [10:40:21] there's no universal output format for our API reqs, it differs across services [10:40:29] some of the services vary output content-type on the Accept header too [10:42:00] right [10:59:02] vgutierrez: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552488/ ?
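As a sanity check on the combined condition from this exchange, here it is restated as a plain predicate (a Python rendering for reasoning about cases only, not the deployed VCL; "empty UA" maps to a missing or blank User-Agent header):

```python
import re
from typing import Optional

BOT_UA = re.compile(r"[Bb]ot")

def should_sec_warn(method: str, url: str, ua: Optional[str]) -> bool:
    """Redirect to /sec-warning when the request is a GET for a plain wiki
    page (human browsers), or when the UA looks bot-like (empty, or matching
    /[Bb]ot/), while never touching /sec-warning itself or the static assets
    it loads (that would loop)."""
    if method != "GET":
        return False
    if url == "/sec-warning" or url.startswith("/static/images/"):
        return False  # avoid redirect loops
    human_page = url.startswith("/wiki/") and ":" not in url
    bot_like = not ua or BOT_UA.search(ua) is not None
    return human_page or bot_like
```

Note how the earlier restrictive filter survives as the `human_page` term, while the UA term is what extends coverage to API and cache_upload fetchers.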
[10:59:17] (I rebased the others on it too) [10:59:57] hmm creative code review [11:00:01] let's see [11:01:18] looks good to me [11:01:31] take into account that's 7pm of a Friday here [11:01:39] some beer has been involved in my assessment [11:10:26] some beer probably helps with all reviews :) [11:15:53] hmm can I quote you to get a wellness refund on my beer consumption? [11:16:09] 🍻 [11:51:59] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10WMDE-leszek) Thanks @elukey and @Joe for translating from leet speak! I've filed T238901 about the problem in Wikibase, and we'll be looking into fixing the b... [12:10:06] <_joe_> 🍺 vgutierrez [12:59:22] 10Traffic, 10Operations, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) >>! In T237687#5679746, @Krinkle wrote: > The issue - When `X-Wikimedia-Debug` is enabled (e.g. via the WikimediaDebug browser extension), I am no longer able to brow... [12:59:30] 10Traffic, 10Operations, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) p:05High→03Normal [13:05:50] ema: re that bug above... I suspect in the move from 'text' to 'common', the relative ordering also changed a lot (into recv_early)... [13:07:12] as in, that return (pass) may now have moved above a bunch of other logic which it used to be beneath [13:16:09] heh [13:16:38] if only choosing to skip the cache wouldn't mean returning from the current subroutine... 
[13:17:07] well now you're just being a crazy dreamer [13:17:18] next you'll want real subroutines and local lexical variables [13:30:58] bblack: I think we should move the "if XWD" conditional back to text_common_recv and have a similar one in misc_recv_pass [13:31:39] working on the patch [13:42:29] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552504/ [13:49:06] or actually, we should probably get rid of it altogether considering that the only misc website that needs to support XWD is noc.wm.org I think? [13:49:52] we just add 'caching: pass' to the noc director and call it a day [13:50:05] yeah that might be simpler [14:38:50] lots of really interesting scrollback in here :) [14:52:41] 10Traffic, 10Operations, 10RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) Confirmed the upgrade fixes the Server: header output: ` restbase2018:~$ curl -k https://restbase2018:7443/de.wikipedia.org/v1/page/references/Der_Junge_mit_dem_gro%C3%9Fen_schwarzen_H... [14:53:31] 10Traffic, 10Operations, 10RESTBase: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) @Vgutierrez I think you can just upgrade envoy across the fleet when you feel confident enough. [17:07:45] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) Please note this has had all the RAM/riser/cards reseated and continues to pass all Dell ePSA tests. @bblack: With the reseating of everything, shall we reimage and try using this sys... [17:17:28] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911221... 
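The X-Wikimedia-Debug resolution discussed above boils down to one rule: requests carrying that header must bypass cache lookup so they always reach the applayer. As a tiny predicate (illustrative only; header-name matching is case-insensitive, as in Varnish/ATS):

```python
def must_pass(headers: dict) -> bool:
    """True when a request should skip cache lookup and go straight to the
    applayer, i.e. when any X-Wikimedia-Debug header is present (the
    WikimediaDebug browser extension sets it)."""
    return any(k.lower() == "x-wikimedia-debug" for k in headers)
```

The per-site alternative floated here ('caching: pass' on the noc director) would make the whole predicate unnecessary for that site, which is why it reads as simpler.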
[17:17:40] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3056.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3056.esams.wmnet'] ` [17:18:01] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911221... [17:19:23] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) a:05RobH→03BBlack Attempting reimage (see above). If it fails like before, it won't get very far (certainly not into production use). [17:49:20] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams, 10Patch-For-Review: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3056.esams.wmnet'] ` and were **ALL** successful. [17:53:48] 10Traffic, 10CX-cxserver, 10Citoid, 10Operations, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10Pchelolo) Nothing to do here for the core platform team anymore. [17:54:40] 10Traffic, 10DC-Ops, 10Operations, 10ops-esams, 10Patch-For-Review: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) So far so good - it has completed all the initial puppetization stuff, which is much further than it got before. Given it's Friday and this node has a fishy hi... 
[17:57:32] 10Traffic, 10Operations, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) [18:12:59] bblack: mehhhh, push it into service, it's server russian roulette? [18:13:04] (cp3056) [18:13:06] ;D [18:13:15] Oh, you wanted to enjoy your weekend? [18:13:23] :P [18:16:29] so it reimaged before though right? [18:16:35] and failed when doing os stuff [18:16:59] just wondering how optimistic i should be heh [18:27:17] robh: before it would make it through the actual OS installer, but crashed out during the first puppet run, and I rebooted it like 10 different times (trying different levels of "reboot" aggressiveness from the drac pov) and each time I couldn't get it to complete a single puppet run before it would crash out again. Sometimes it wouldn't even make it long enough to launch the puppet run, if I was too slow. [18:27:31] huh [18:27:43] so it's promising, i can be ever so mildly optimistic ;D [18:27:47] but I tried racreset, and racadm's hard poweroff too [18:27:59] yeah, this is already probably 20x longer than it lasted before :) [18:29:04] with the ssd pcie card needing reseating it makes it seem like whoever assembled it was drunk that day heh [18:29:18] it cannot flex that much to unseat anything without breaking stuff [18:29:39] yeah quite possibly [18:29:49] I mean, the dell assembly line peeps, they have to drink sometime [18:30:35] and what better time than as you're pushing in your 3,754th memory module of the day? [18:30:46] put in a dimm, take a shot, repeat!
[18:46:08] 10Traffic, 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Dzahn) [19:30:13] 10Traffic, 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10CDanis) At ~18:36 there was another spike in long-tail latency, but then, latency seemed to return to 'normal': https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red... [19:40:17] 10HTTPS, 10Traffic, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486 (10bd808) >>! In T120486#5680210, @Krenair wrote: > done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/482142 ? My guess is that @D... [19:41:19] 10Traffic, 10Operations, 10Phabricator, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10ayounsi) [19:44:07] 10HTTPS, 10Traffic, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486 (10Dzahn) Yea, that's true. It's been a long time since i wrote that and i had a per-proxy feature in mind. I am ok with closing this ticket i... [22:10:06] 10Traffic, 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10jijiki) [22:10:48] 10Traffic, 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10jijiki)