[07:12:32] hello everybody, as an FYI druid datasources (like wmf_netflow, webrequest_128, etc..) are now available in https://superset.wikimedia.org/superset/sqllab (select database Druid Analytics SQL)
[07:12:52] so it will be easy to check data in there with SQL-like queries
[07:13:07] Cc: cdanis (probably interested) --^
[07:13:49] that's cool, thx!
[07:14:34] will also create some docs
[07:14:39] <_joe_> general q: why aren't we using CAS for more things?
[07:15:22] moritzm, jbond42, maybe ^ ?
[07:16:21] _joe_ I think that there is a plan for doing it, I am already following up with Moritz for all the Analytics dashboards
[07:16:40] <_joe_> yeah the question was for them more than for anyone else
[07:19:23] mid-term all our web services will only use CAS, but there's still plenty of lower level integration work happening, so initially it's mostly used by SRE-facing services, but we're adding more
[07:25:08] e.g. I'm currently adding a staging IDP server to be able to test changes without impact to prod etc.
[07:34:31] <_joe_> moritzm: no testing in production? I'm disappointed
[07:42:16] _joe_: I would like to start digging around a bit on https://phabricator.wikimedia.org/T249218. Is it okay/desired that I assign the task to myself or is there any special process to follow?
[07:42:37] <_joe_> jayme: just assign it to yourself
[07:43:09] <_joe_> I'm very happy to give you a tour of things around our docker images whenever you want
[07:45:27] Thanks! I had an introduction to how the base images are built, where to find the puppet stuff for that and how to possibly test changes via jenkins from Alex yesterday. Will see how much of that I still know today and get back to you whenever I have questions :)
[07:45:54] <_joe_> aha! ok great
[07:50:13] I've accepted an invitation for the "Cultural orientation" modules for today and tomorrow (starting at 16:00 UTC). That overlaps the discussion meeting tomorrow by 15min - hope that's okay
[07:51:16] <_joe_> jayme: sure
[08:38:08] I'm updating grafana in production shortly, no impact expected
[08:46:00] _joe_: definitely the plan to use it everywhere as moritz said; but if you had specific ones in mind that you'd like to prioritize over others, happy to take that feedback in :)
[08:46:41] <_joe_> paravoid: no I am just sad every time I see an http basic auth dialog, and impatient about getting sso everywhere :D
[08:47:20] <_joe_> but I guess icinga/grafana/kibana are the three things I would love to see under sso first
[08:48:28] icinga is at https://cas-icinga.wikimedia.org/ and Kibana was planned for Q3 but had some complications (cf. T246998)
[08:48:28] T246998: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998
[08:58:53] <_joe_> paravoid: I'm using cas-icinga
[09:53:12] not sure if known or who would be the best person to point this to, but the kubernetes firewalls are dropping a ton of traffic towards eventgate-analytics.svc.eqiad:4592
[09:55:43] <_joe_> XioNoX: please explain a bit better - you mean iptables?
[09:56:35] _joe_: yeah iptables, I'm still digging through the dashboard, does this work: https://logstash.wikimedia.org/app/kibana#/dashboard/AW5v7YTUarkxubcmAwPB?_g=h@5f39482&_a=h@2b2c0b7
[09:56:56] <_joe_> no it does not
[09:57:40] _joe_: maybe https://logstash.wikimedia.org/goto/67e0646c8d1c3f51d403ee8d1f8d9508
[09:57:49] <_joe_> XioNoX: that's the public nodeport so it seems like a false positive
[09:58:22] <_joe_> XioNoX: is it all kubernetes nodes or just specific ones?
[09:58:34] the source IPs of the iptables drop logs are from mw hosts
[09:58:45] looks like all
[09:58:53] <_joe_> XioNoX: that date is when we turned on TLS on eventgate-analytics
[09:59:53] started for real on March 18th, with a big spike around the 11th
[10:03:25] and about 1/3 fewer hits towards eventgate-main.svc
[10:03:53] _joe_: let me know if I should open a task
[10:04:35] unrelated there is restbase->restbase 3050/udp traffic being dropped too
[10:05:11] started on April 2nd
[10:05:48] <_joe_> that part I dunno about, but the eventgate things are most probably just false positives, but yes open a task
[10:06:22] ok!
[10:21:38] <_joe_> nice find anyways. I see similar DROPs for other services but those two are definitely the outliers
[10:25:00] I ended up on this forgotten dashboard by browser autocomplete error :)
[10:27:31] <_joe_> lol
[10:31:29] _joe_: opened https://phabricator.wikimedia.org/T249700 and https://phabricator.wikimedia.org/T249699
[10:42:44] now that I'm at it... https://phabricator.wikimedia.org/T249701
[12:06:48] trying to find out how we can upgrade mongodb from 2.4 on jessie and actually keep the data. sigh.. at least 3 steps are needed, 2.4 -> 2.6, 2.6 -> 3.0, 3.0 -> 3.x. it's for xhgui, the tool used by performance
[12:07:33] mongo support says to use their apt repos
[12:08:08] "12:05 < kali> as much as i like debian, i would recommend you to try and obtain the exception to use mongodb apt repos"
[12:08:29] also they do _not_ recommend using dump/export
[12:37:40] these versions are now non-free
[12:37:56] so we cannot use them in production I'm afraid
[12:38:10] oooh.. i see. ugh :/
[12:38:14] is that for webperf?
[12:38:17] yes, it is
[12:38:23] it's to get off of tungsten
[12:38:26] IIRC perf team has said they wanted to move away from mongo anyway?
[12:38:46] have you talked to dpifke? I think he was the one assigned to the task to move away from tungsten
[12:39:48] paravoid: i just pinged him on the ticket today to do that. i was not aware about moving away from mongo. my latest status was from: https://phabricator.wikimedia.org/T180761#5759870
[12:40:32] so let's see soon again
[12:41:34] MongoDB switched to a new homegrown license that is not open source and was subsequently dropped from Debian/RHEL/Fedora/etc.
[12:42:06] meh.. ack, i kind of remember now
[13:35:34] _joe_: Envoy is the TLS terminator for apiservers now, right?
[13:35:45] asking because of https://phabricator.wikimedia.org/T249680
[13:37:55] cdanis: that's been reported a few times, I think the canonical task is https://phabricator.wikimedia.org/T249526
[13:38:33] I think there's some disagreement about the purpose of that task
[13:38:48] yeah I'm coming to agree with you
[13:39:10] anyway both these things were reported April 6 or later, which is when we switched to use Envoy for the apiservers
[13:39:38] nod
[13:40:28] I don't know that envoy touches the header casing on the way through, but I don't know that it doesn't either, it's certainly allowed to -- the client is unambiguously in the wrong here but we should consider fixing anyway
[13:40:47] from anomie: >My vaguely-informed guess would be that something has recently started to use HTTP/2 in communication between the front and back ends, and HTTP/2 of course requires header field names be transmitted in lowercase (RFC 7540 § 8.1.2).
[13:40:58] ahhh that'll be why, yes
[13:41:28] I think we should ask clients to fix themselves, but also, we might want to put some overrides in varnishfe as well?
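(A minimal sketch of the client-side pitfall under discussion: an exact-match lookup for the camel-cased "Set-Cookie" spelling silently misses the header once a proxy emits it lowercase, while case-insensitive handling keeps working. The header values below are made up for illustration.)

    # Headers as a lowercasing proxy might deliver them (illustrative values)
    received = {"set-cookie": "session=abc123", "content-type": "application/json"}

    # Fragile: exact match on the historical camel-cased spelling
    cookie = received.get("Set-Cookie")       # -> None once the proxy lowercases

    # Robust: HTTP/1.x header names are case-insensitive (and HTTP/2 mandates
    # lowercase on the wire), so normalize before looking anything up
    normalized = {k.lower(): v for k, v in received.items()}
    cookie = normalized.get("set-cookie")     # -> "session=abc123"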
[13:41:44] iirc envoy primarily uses HTTP/2 semantics internally, it speaks HTTP/1.1 via what's basically a compatibility layer
[13:42:04] I'd forgotten HTTP/2 is lower-case headers only but yeah of course that's why
[13:42:37] headers in http/1.0 or /1.1 aren't supposed to be case-sensitive anyway
[13:42:58] yeah, that's what I mean about the client being in the wrong, we're def on the same page about that
[13:43:42] I think we should put in a temporary varnishfe fix, but we should communicate with the author of this library and make sure they know it's temporary
[13:44:09] just so we don't find ourselves either having the same conversation in 30 days or whatever, or translating that header until HTTP/1.1 finally dies in 2038
[13:44:46] as always I appreciate your optimism
[13:45:11] what, about HTTP/1.1 dying in 2038? it will, along with everything else made of transistors, but we can talk about my pessimism later
[13:46:59] wrt libraries adapting to the spec though, yeah, I know
[13:47:21] maybe we do just end up translating "set-cookie" to "Set-Cookie" forever and warning our grandchildren not to touch it? I don't know :/
[13:48:27] <_joe_> cdanis: I refuse to consider that a bug
[13:48:37] ^ see, THAT'S optimism
[13:48:45] <_joe_> :)
[13:49:26] <_joe_> also, I think envoy /can/ normalize header casing
[13:49:36] ok well I cc'd you on the not-a-bug
[14:25:32] mutante: o/ - one qs - I was trying to live hack a partman recipe on install1003 (for new kafka jumbo hosts), and after a while I realized that the recipe seems pulled from apt.w.o (hence apt1001). Is it expected?
[14:31:23] elukey: ehmm.. no. that should be the profile::installserver::preseed which is on install1003
[14:31:45] looking more
[14:32:05] mutante: yeah looks strange, I noticed apt1001 from squid logs
[14:32:13] on install1003
[14:32:54] I see things like GET http://apt.wikimedia.org/autoinstall/partman/custom/kafka-jumbo.cfg
[14:34:43] I am not sure if I am not getting things right, but after live hacking the recipe on apt1001 I was able to finally see some results
[14:35:02] *if am or not getting
[14:39:38] i see where the wget command is but not why apt1001 has the files in /srv/autoinstall yet
[14:39:51] well, i hope you are not blocked and i can keep figuring this out
[14:40:58] mutante: nono I am testing now, I just wanted to raise it in here since it seemed strange
[14:41:13] let me know if I can help!
[14:42:05] thanks for raising it. i'll get back to you
[15:15:52] <_joe_> jokes aside, I'm not inclined to "fix" the header casing for the api
[15:20:27] re: the header-case stuff above: the real issue is basically that some clients are ignoring headers like set-cookie, because they're matching case-sensitively for the camel-cased way that used to be common?
[15:20:40] yes
[15:20:40] correct
[15:20:52] that's annoying
[15:20:56] yes :)
[15:21:13] does the protocol itself (HTTP/2) enforce/fix everything back to lowercase?
[15:21:25] <_joe_> yes IIRC
[15:21:25] HTTP/2 insists on lowercase yeah
[15:21:29] ok
[15:21:31] sorry, http/2 🙃
[15:21:39] <_joe_> but http/1.1 is supposed to be case-insensitive
[15:21:54] http/2 requires lowercase, yeah -- envoy will use lowercase by default even when speaking http/1.1, because the spec allows it
[15:22:02] yes, but software engineers are, as a general rule, not good at things like this :)
[15:22:31] so what we could do, is at the varnish-fe layer on the way out, we could have some generic code that camelcases all header titles.
[15:22:44] and for actual http/2 clients it will still become lowercase
[15:22:51] and for this oddball case that's breaking people, it will fix them
[15:23:09] fwiw, envoy can also camelcase all headers for http/1.1 clients -- what it *can't* do is leave all headers as-is
[15:23:37] well that's pretty much forced if the proxy in front of envoy is using H/2 anyways
[15:23:40] (since it lower-cases everything internally, it's only offering to re-mangle it for you on the way out)
[15:23:48] does ats-be -> envoy use H/2?
[15:24:02] oh, does it? I assumed it didn't, but I don't know why I assumed that
[15:25:12] <_joe_> i would hope it does
[15:25:20] bblack: actually it must not -- if ats-be->envoy used http/2, the envoy header case change couldn't break this wikidata client
[15:25:24] right?
[15:25:36] <_joe_> right
[15:25:50] if ats-be->envoy used h/2, it would mean all mediawiki-emitted headers would arrive at ats-be lowercase-normalized
[15:25:51] oh, unless ats-be->envoy is h2 but ats-be->envoy was h1.1
[15:25:58] *unless ats-be->envoy is h2 but ats-be->nginx was h1.1
[15:26:09] <_joe_> you know we can check it
[15:26:18] * _joe_ checks
[15:26:27] and the paste in that ticket, I think it has all the MW ones lowercased, and the ones we set from varnish in camel
[15:26:44] <_joe_> possibly, yes
[15:26:53] ohh okay I'd buy that
[15:28:52] <_joe_> envoy_http_downstream_rq_http2_total{envoy_http_conn_manager_prefix="ingress_http"} 0
[15:29:03] <_joe_> so yeah, we don't use http2
[15:30:00] envoy downstream means ats-be-facing or mediawiki-facing?
[15:30:10] envoy downstream is ats-be
[15:30:16] <_joe_> ^^^
[15:30:33] "Downstream: A downstream host connects to Envoy, sends requests, and receives responses."
[15:30:40] <_joe_> sorry I gtg as I have an interview in 30, but I think this is a matter of making a call. If we're using camelcase everywhere in mediawiki (this can be verified by looking at what the headers emitted from apache are)
[15:30:40] downstream-towards-the-internet I guess
[15:31:15] as far as standards are concerned, by re-camel-casing things we're not breaking anything (they're still supposed to be case-insensitive)
[15:31:24] <_joe_> yes
[15:31:42] yeah the standard says we can do whatever we want, I'd only be worried about the same kind of practical breakage we had with lower-casing
[15:32:08] <_joe_> rlazarus: you can verify it by making a few requests to apache instead of envoy
[15:32:16] <_joe_> on the backend
[15:36:42] so, where is the H/2 happening in the stack, which is lowercase-normalizing these headers before they reach the outside client?
[15:37:01] (it's not the connection from ats-tls to the client obviously, that must be http/1 in the broken case)
[15:38:22] I don't know that there's any actual h2 on the wire in our stack -- Envoy just downcases header names everywhere because it uses h2 semantics internally
[15:39:30] when the client connects via h1.1 it does some processing on the request and then pretends it was h2 from then on, then does some processing on the h2 response at the last second to send it via h1.1
[15:40:21] ok
[15:40:32] and likewise when it has to speak h1.1 to the upstream server
[15:40:35] so one of the things that gets lost in that process is header name casing
[15:41:00] so the *best* thing we could do to fix this and not break more things, would be to patch envoy somehow to preserve header case, in such a case, somehow.
[15:41:09] but that sounds risky and/or invasive
[15:41:39] otherwise the information is lost, and we can either live with the breakage already documented, or normalize everything to camelcase somewhere (e.g. v-fe) and hope that breaks fewer things than it fixes.
[15:42:32] yeah agreed on all counts
[15:43:00] I've worked on envoy's codebase before, I wouldn't mind patching it, but I think this change wouldn't be practical
[15:48:19] worth making a varnish (or ats) patch? maybe ideally we limit it to the known sub-case here (as in, do it in ats-be on receipt of headers from the mediawiki backends, but not other services)
[15:55:29] if we decided to camelcase everything, envoy can do that as a config option, which is probably the right place to do it
[15:55:57] I'm just not sure if that's better or worse than leaving it as-is -- especially since that one wikidata client is already patched, it looks like
[15:56:16] (cdanis also suggested camelcasing just Set-Cookie, which we'd have to do at varnish or ats)
[15:56:45] any of those three options is spec-compliant obviously, but I also think they're all reasonable ways to go here
[15:57:52] ok
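(A rough sketch of the check suggested at 15:32, comparing the header-name casing emitted by an apache backend against the envoy listener in front of it; Python's http.client keeps response header names exactly as received. The hostname and ports below are placeholders, not the real appserver addresses.)

    import http.client

    def raw_header_names(host, port, path="/"):
        # Return response header names exactly as the backend sent them,
        # e.g. "Set-Cookie" from apache vs "set-cookie" via envoy.
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        conn.close()
        return [name for name, _ in resp.getheaders()]

    # e.g. compare the apache port and the envoy port on the same appserver:
    # print(raw_header_names("mw-appserver.example", 80))
    # print(raw_header_names("mw-appserver.example", 8080))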