[02:33:32] ok, then let's repool tomorrow AM [08:01:22] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10fgiunchedi) Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheu... [10:21:25] 10Traffic, 10Operations, 10ops-ulsfo: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10MoritzMuehlenhoff) Note that this host has also been emitting SMART errors for the past two days; not worth investigating further as it's going to be decommed. [10:26:48] 10Traffic, 10Operations: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) p:05Triage>03Normal [10:34:32] 10Traffic, 10Operations, 10vm-requests: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Krenair) [10:36:32] 10Traffic, 10Operations, 10vm-requests: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) a:03Vgutierrez [10:37:26] vgutierrez, ^ I filled out the form for it :) [10:37:50] yeah.. not really needed in this case anyways :) [10:39:23] but it's still good to have the requirements in the task :D [12:08:35] vgutierrez, so, puppet review next [12:08:35] ? [12:10:48] yup [12:24:51] 10Traffic, 10Operations, 10vm-requests, 10Patch-For-Review: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) certcentral1001 created with the following cmd: ``` sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=netwo... [12:27:34] vgutierrez, so what happens if you don't set the pxelinux path? [12:28:08] it defaults to stretch? [12:28:31] indeed [12:33:49] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10fgiunchedi) >>! 
In T179050#4644666, @fgiunchedi wrote: > Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Pr... [13:55:42] Krenair: on modules/certcentral/manifests/central.pp we should deploy the private key of the prod ACME account using secret(), right? [13:56:52] vgutierrez, the problem is that to generate that stuff you need certcentral to already be installed, right? [13:58:47] I assume in the above you mean a fixed long-term account key? [13:58:50] it can be done locally [13:58:56] indeed bblack [13:59:02] if so, we can just use the openssl CLI to generate that into puppet-private, yeah [13:59:15] (like we would for any manual private key really) [13:59:58] ok [14:01:51] I haven't looked, but re: ACME providers (e.g. LE), I guess that's configurable when configuring a cert? [14:02:40] indeed [14:02:43] because we might want to do our first exercise-tests of the deployed certcentral using LE's alternate integration testing endpoint to avoid wasting ratelimiters and such if we have to sort some things out [14:03:00] yes [14:03:07] you can set the directory per certificate [14:03:08] we support pointing at any acme directory [14:03:15] cool [14:03:36] this is what I mean: https://letsencrypt.org/docs/staging-environment/ [14:03:43] so you can get certs from the LE staging API and others from the real API [14:03:56] oh, yeah, technically I think it has to be the v2 one [14:03:59] but you get the point [14:04:06] yep [14:04:20] you can see an example pointing to the staging endpoint here bblack: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/39/hieradata/labs/deployment-prep/common.yaml [14:05:01] nice! 
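For reference, generating such a fixed long-term account key locally with the openssl CLI might look like the following; the file name and key size here are illustrative assumptions, not what was actually committed (the real key would go into puppet-private):

```shell
# Sketch only: generate an RSA ACME account key locally. "acme_account.key"
# and the 2048-bit size are assumptions for illustration.
openssl genrsa -out acme_account.key 2048

# Sanity-check the key before committing it anywhere
openssl rsa -in acme_account.key -check -noout
```

The same approach works for any manually managed private key, as noted above.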
[14:05:26] then you can set a default account or just map every certificate to its account [14:16:53] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Gilles) 05stalled>03Open [14:22:59] Krenair: nginx::site - is it already handling the ferm rules for us? [14:24:30] no don't think so vgutierrez ? [14:26:21] vgutierrez, I guess we should set ferm to permit ports 80 and 8140 in from domain_networks ? [14:28:55] vgutierrez, oh, yeah what do we do about regr.json? [14:29:53] Krenair: I can generate both of them as soon as we decide on the email contact addresses for the production account :) [14:30:07] bblack: ^^ noc@w.o? [14:30:15] was gonna say noc@ [14:31:58] which address is certspotter using for reporting? [14:32:30] noc@wm.o [14:32:49] modules/profile/manifests/certspotter.pp: address => 'noc@wikimedia.org', [14:32:58] so I guess that's the sane pick [14:35:23] I think I sent an email there once [14:35:33] Bugzilla gave me a weird error and pointed me there [14:35:38] don't think I ever got a response [14:48:45] yeah [14:48:49] also, 80? :) [14:49:17] (but yeah I guess it's simpler than having to then also define static local certs for the service itself too) [14:49:24] but it's kind of ironic :) [14:55:06] taking into account that it's going to be consumed by puppet, maybe we can issue a puppet CA certificate for that [15:34:00] bblack, that's the one that can be used to support http-01 challenges [15:34:04] which wikimedia probably won't use [15:35:06] by the way, the checking of client TLS certs will only be looking at CNs. hopefully that's not a problem [16:11:14] Krenair: which "checking of client TLS certs" do you mean? [16:11:30] the port 8140 fileserver stuff? 
[16:11:34] yeah [16:11:48] yeah we should probably fix that [16:12:17] I've checked it works with the standard puppet certs [16:12:54] otherwise, anyone on any host in a production network that can reach certcentral:8140, can fetch private keys for issues created for others, by just connecting with a fake fresh client cert with a correct CN [16:13:07] s/issues/certs/ [16:13:10] .... no [16:13:20] it has to be signed by the puppetmaster [16:13:29] the problem is it won't be checking SANs [16:13:54] ssl_client_certificate /var/lib/puppet/ssl/certs/ca.pem; [16:13:55] ssl_verify_client on; [16:14:09] uwsgi_param HTTP_X_CLIENT_DN $ssl_client_s_dn; [16:14:22] then certcentral uses the CN part of the DN [16:15:52] oh ok, that's probably fine [16:16:12] when you said "will only be looking at CNs", I thought you meant "not verifying the signer" [16:16:16] no [16:16:19] that's crazy [16:16:36] I don't actually know if nginx stores a client cert's SANs anywhere [16:18:12] <_joe_> https://phabricator.wikimedia.org/T206339 might need the attention/thoughts of you all [16:18:14] looks like you might be able to grab the full client cert PEM using $ssl_client_cert/$ssl_client_escaped_cert, pass that through uwsgi_param and extract the SAN :D [16:19:07] _joe_: I assume by "cookie" you mean "value of X-Powered-By?" [16:19:18] <_joe_> from the client? [16:19:25] <_joe_> no :) [16:19:27] I have no idea [16:20:17] it doesn't make a ton of sense to me. how do we tell appservers.svc which engine we want for a request? [16:20:21] (we being varnish) [16:20:27] <_joe_> the client will store a cookie, sent from MediaWiki, based on either the enabling of a beta feature (first), or on a weighted random extraction (later) [16:20:37] <_joe_> bblack: no, the cookie from the request will do [16:20:45] an actual Cookie: ? [16:20:51] have a different svc host for the different engine? 
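Collected in one place, the client-verification setup quoted above amounts to something like the following nginx server block. This is a sketch: only the three quoted directives and port 8140 come from the discussion; the server name, certificate paths, and uwsgi socket are invented placeholders.

```nginx
# Sketch only: require a client cert signed by the puppet CA and hand
# the client DN to the uwsgi backend, which then uses the CN part.
# server_name, cert paths, and the socket path are assumptions.
server {
    listen 8140 ssl;
    server_name certcentral.example.wmnet;         # assumption

    ssl_certificate     /etc/ssl/certcentral.crt;  # assumption
    ssl_certificate_key /etc/ssl/certcentral.key;  # assumption

    # Only certs signed by the puppetmaster CA pass verification
    ssl_client_certificate /var/lib/puppet/ssl/certs/ca.pem;
    ssl_verify_client on;

    location / {
        include uwsgi_params;
        uwsgi_pass unix:/run/certcentral.sock;     # assumption
        # certcentral extracts the CN from this DN
        uwsgi_param HTTP_X_CLIENT_DN $ssl_client_s_dn;
    }
}
```

As noted later in the log, a fake fresh client cert with a matching CN is rejected because it is not signed by the puppetmaster; the open question is only that SANs are not checked.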
[16:21:09] <_joe_> it's the same thing we did last time; only difference was we had to direct to different backends from varnish last time [16:21:18] <_joe_> no, same svc host this time [16:21:29] <_joe_> apache is able to direct to the correct backend locally [16:21:40] is cookie a Cookie header, or is it just the generic meaning of a cookie? [16:22:02] <_joe_> I was thinking of a Cookie header [16:22:18] <_joe_> if we need to remember if a user "chose" php or hhvm [16:22:31] ok [16:22:54] probably XBP won't really factor into the decision, although we can maybe check it for verification or something [16:23:08] what's annoying is if it were anything other than Cookie, we could easily Vary [16:23:10] <_joe_> X-Powered-By is in the response from the backend [16:23:22] <_joe_> and it's what you should base Vary on [16:23:35] <_joe_> as that's what will tell you what engine produced the result [16:23:50] Vary works on request headers, not response headers :) [16:24:00] <_joe_> oh, heh, right [16:24:15] but I guess we can hack something up to translate things around [16:24:29] <_joe_> the issue is we can't tell browsers to send us a header [16:24:35] <_joe_> apart from Cookie: :P [16:25:02] <_joe_> bblack: that's what we did with hhvm/php IIRC [16:25:19] strip fake usephp7 header if real client sent it; if Cookie ~ php7magic: set fake reqheader usephp7: true, otherwise set fake reqheader usephp7: false [16:25:40] <_joe_> see? done! 
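A rough VCL rendering of the scheme bblack sketches above; "usephp7" and "php7magic" are the placeholder names from the discussion, not real header or cookie names:

```vcl
# Sketch only: translate a client Cookie into a synthetic request header
# that the cache can Vary on without varying on the whole Cookie header.
sub vcl_recv {
    # Strip the fake header if a real client tried to send it
    unset req.http.usephp7;
    if (req.http.Cookie ~ "php7magic") {
        set req.http.usephp7 = "true";
    } else {
        set req.http.usephp7 = "false";
    }
}

sub vcl_backend_response {
    # Inject a Vary on the synthetic header so cached objects split
    # per engine
    if (beresp.http.Vary) {
        set beresp.http.Vary = beresp.http.Vary + ", usephp7";
    } else {
        set beresp.http.Vary = "usephp7";
    }
}
```

Because every request carries either "true" or "false" in the synthetic header, Vary stays clean even though the real signal arrives blended into Cookie.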
[16:25:41] and then have all the MW backend traffic effectively vary on the client reqheader php7magic (we can inject the vary) [16:26:03] <_joe_> well apache could use the cookie itself [16:26:26] I guess for your purposes, yes, but not for Vary [16:26:34] <_joe_> oh, yes, sure [16:26:40] (because it will be blended with other meaningful Cookie values, polluting Vary) [16:26:46] <_joe_> apache doesn't care about Vary [16:26:55] for you :) [16:26:59] <_joe_> or well, it could [16:27:13] <_joe_> yes, sorry, in the context of routing the request [16:27:22] http://httpd.apache.org/docs/2.4/mod/mod_cache.html ! :) [16:27:38] who needs ATS when you have that from the same foundation? :) [16:27:51] <_joe_> bblack: we could put restbase in front of apache on each appserver! [16:28:41] so long as restbase routes the request through aqs which queries wdqs and then checks in with parsoid and the pdf renderer before it loops the traffic back to MW, I think it could work [16:29:20] <_joe_> that's MCS [16:29:27] <_joe_> almost literally [16:29:51] <_joe_> we just need to plug the pdf renderer into it [16:30:24] :/ [16:31:53] lol [16:34:49] <_joe_> I discovered the other day it's now plugged into aqs too [16:36:23] don't we put firewalls up around these things _joe_? [16:38:48] 10Traffic, 10Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10BBlack) [16:38:54] ^ updated [16:43:34] <_joe_> Krenair: no, we don't do that kind of firewalling. 
What we should do is whitelist outgoing connections [16:43:39] <_joe_> from applications [16:43:52] <_joe_> and kubernetes will allow that [16:45:05] I guess what I had in mind was MCS -> AQS directly [16:45:14] you could say AQS only accepts connections from x, y, and z [16:45:25] but if it's going via varnish I guess that's more tricky [16:56:30] not sure if already discussed, but there are some emails to root@ about etcd and varnish-backend-restarts [16:57:20] ERROR:conftool:Failure writing to the kvstore: Backend error: This cluster is in read-only mode : Cluster configured to be read-only [17:00:06] so profile::etcd::tlsproxy::read_only is indeed true for conf2*, but it shouldn't be IIUC what we did today [17:00:16] there's a whole tricky debate to be had about how inter-service routing should ideally work in general [17:00:36] whitelist outbound traffic and make people justify it [17:00:58] because the current notion of how services connect to other services doesn't scale well in terms of explaining it and writing client code for it over N services [17:02:00] for a contrived example, if wdqs needs to fetch some readonly data from enwiki, what do you tell the wdqs developer about how they should do that? [17:02:53] they need to know that even if what they want is related to enwiki/wiki/Foo, that to reach enwiki on the inside they should translate the enwiki hostname to one of: appservers-ro.discovery.wmnet, or appservers-rw, or api-ro, or api-rw, depending on the URI and the nature of the request. [17:02:56] make them talk to varnish vs. internal LVS vs. pick a random appserver ? [17:03:33] appservers being distinct from the api? [17:03:43] so there's this "route canonical public URL X to internal service hostname Y (possibly with slight URL mangling in some cases!)" layer, which currently exists only at the back edge of Varnish, in VCL. 
[17:04:21] and right now we're basically telling app service writers to infer/duplicate that logic themselves in whatever framework they're using. Or they just don't ask and use the public varnish entrypoint and then we chide them for looping our stack. [17:05:16] does chiding usually do the trick? [17:05:28] so, clearly, we should separate that pure-routing layer out somewhere as an inter-service routing layer of some kind, and the internal view of the public DNS hostnames would map internal requests to that thing, and then the backside of varnish can just hit it too [17:06:19] (although I'm leaving a lot of things vague there, because you could argue it shouldn't be a centralized service, perhaps just a centralized configuration of a little router that sits out on all the service hosts and just helps them make their outbound connections, avoiding the concentration of interconnections on a new separate routing service) [17:06:23] do we do any split-brain DNS with prod at the moment? [17:06:31] we can, but we don't [17:06:33] <_joe_> bblack: you're basically describing envoy [17:06:50] <_joe_> used as a forward proxy and not just a reverse proxy [17:07:07] only place I've come across it in wikimedia has been the old labs floating IP recursor hack [17:07:18] right, something like "deploy envoy on every service host of every service and send all your outbound conns proxied through it with the standard configuration of all our routing" would work [17:07:39] <_joe_> bblack: s/host/kubernetes pod/ [17:07:44] <_joe_> and that's my long-term plan [17:07:53] sure, but I don't think you'll achieve /g for that [17:08:07] <_joe_> ahah [17:08:10] <_joe_> hopefully yes [17:08:19] <_joe_> mediawiki is the big scary beast [17:08:22] maybe within some limited scope, but there will always probably be a need for this from other things that aren't inside the k8s stuff [17:08:48] if nothing else, we'll want to save all the code/logic duplication and have our ATS cache hosts also 
deploy the same thing for routing from edge-caches -> services. [17:09:23] and then also be sure to have the thing hooked into etcd so it can manage the transitions we do with discovery-dns today [17:10:03] <_joe_> well envoy has a pluggable model for service discovery, basically [17:10:37] <_joe_> well envoy on the edge caches could do TLS-TLS for varnish too :) [17:11:34] oh good, then we can stay on this buggy open-core software forever! [17:11:39] <_joe_> this is all way in the future, I'm stuck fixing ugly apache vhosts for now [17:12:40] shame no one has forked and added TLS support [17:13:19] <_joe_> Krenair: oh they have TLS to the backend support in the commercial version [17:13:30] yes [17:13:31] bu [17:13:32] t [17:13:37] not the open source one [17:13:42] o O ( https://phabricator.wikimedia.org/T104755 ) [17:14:42] <_joe_> paravoid: if we did that long time ago, I wouldn't have spent a quarter uniforming our peculiarly strange apache configs [17:14:46] <_joe_> :) [17:14:51] yeah that too, but it's a separate problem [17:14:55] why didn't you do that instead :P [17:15:16] <_joe_> because that's more than a quarter of work :P [17:15:18] if we had that MW-internal URL routing layer, we could also have it sanely handle things like the m-dot subdomains for itself, instead of the VCL hacks for that. [17:15:24] speaking of quarters [17:15:32] are the goals up somewhere I can view them? 
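The "route canonical public URL X to internal service hostname Y" layer described above could, in a little local router such as envoy, reduce to a shared routing table along these lines. This is a sketch: only the discovery hostnames (appservers-ro/rw, api-ro/rw) come from the discussion; the match rules and file shape are invented for illustration.

```yaml
# Hypothetical per-host routing table for a local forward proxy.
# Rules are evaluated top to bottom; first match wins.
routes:
  - match: { host: en.wikipedia.org, path_prefix: /w/api.php, method: GET }
    upstream: api-ro.discovery.wmnet
  - match: { host: en.wikipedia.org, path_prefix: /w/api.php }
    upstream: api-rw.discovery.wmnet
  - match: { host: en.wikipedia.org, method: GET }
    upstream: appservers-ro.discovery.wmnet
  - match: { host: en.wikipedia.org }
    upstream: appservers-rw.discovery.wmnet
```

The point of centralizing a table like this is that a wdqs developer can just request the canonical enwiki URL; the local router picks the right internal endpoint, and the back edge of varnish (or ATS) can consume the same configuration instead of duplicating it in VCL.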
[17:15:43] <_joe_> Krenair: mediawiki.org, lemme find the url [17:16:26] Krenair: the traffic team ones for this Q and last Q live officially at: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC1:_Reliability,_Performance,_and_Maintenance/Goals [17:16:31] <_joe_> Krenair: start from here https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2018-19_Q2 [17:16:46] (although it's more or less good luck that they all land in the same TEC program right now) [17:16:52] <_joe_> bblack: this quarter you also have TEC4 [17:16:58] <_joe_> ;) [17:17:10] ty [17:17:12] as a supporting team :) [17:17:32] grumble grumble something about program-vs-teams misalignments and all the mess of how resourcing works with that [17:17:34] <_joe_> eheh good try :P [17:18:30] <_joe_> ok, now I really gotta go back to $life [17:19:24] enjoy it while it lasts! [17:19:47] bblack: if you have time can you check the varnish-backend-restart emails? I think that we are ok (now etcd shouldn't error anymore) but better safe than sorry [17:26:51] elukey: yeah I'll take a look. most likely we just failed to do some restarts [17:28:27] super [17:37:33] btw I tried out facebook's static analyzer https://fbinfer.com/ [17:38:01] all it netted me was having to dig through something like 60 false-positive reports, most of which were clearly senseless, and no actual new bugs found :P [18:04:10] for some reason I'm receiving coverity reports for gdnsd :P [18:04:36] I presume you've already seen them right? one today, one 2 days ago? 
[18:14:52] yeah I push those up as I release betas [18:15:00] the last one is yet again a false positive [18:15:39] but overall, coverity is a pretty high quality tool, it's totally worth sorting through their reports [18:15:54] some things are just difficult for analyzers to guess about [18:16:37] (which in some cases, I'm willing to concede and refactor the code to make it make more sense to the analyzer, because it probably helps with human clarity too, but this wasn't one of those so I just flagged it to ignore) [18:17:01] bblack: I sent an email to the previous Juniper tech who assisted me with that v6 ND issue and CCed you [18:17:09] ok [18:17:24] XioNoX: afaik it's still unresolved, and we're still sitting in a semi-unstable state with pybal disabled on lvs1002 [18:17:27] right? [18:17:37] bblack: correct [18:18:11] yeah maybe I'll send an email to ops or something, just in case anyone's annoyed by the puppet disable, and change the puppet disable's actual wording a bit [18:19:02] still borked, just checked [18:19:08] if only switches would just switch traffic :P [18:19:26] bblack: that reminds me of https://phabricator.wikimedia.org/T133387#2835986 actually, but we're running a more recent code version [18:19:45] er, "No updates possible to this Service Request as this service request has has been closed for more than 7 days." I'll open a new one [18:39:24] https://phabricator.wikimedia.org/P7644 [18:39:30] ^ so much magic goes into that heh [18:40:12] daemon binary version upgraded, systemd still thinks the service has been up for ~2 weeks even though it has a new PID, and the stats output that's being sent to prometheus->grafana is unperturbed (and correct) [19:10:56] 10netops, 10Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Opened Juniper case 2018-1005-0549 about the ND issue. 
[20:16:29] https://rt.cpan.org/Ticket/Display.html?id=127182 - so this discussion is kind of irritating [20:18:17] at the end I think he means 1.18 rather than 1.07, the version of the package we get from stretch [20:19:57] it's tempting to just add a trailing full stop in our puppet repo to bypass the first issue, and put a ticket in to ferm and ask them to handle the second issue [20:20:31] /107/110 [20:22:02] hi Platonides [20:22:18] hi Krenair [21:45:10] 10Traffic, 10Gerrit, 10Operations, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dzahn) Yea, it's not ferm, it's the wrong backend IP per change above. [22:54:16] Krenair: yeah.... gdnsd's testsuite used to use upstream Net::DNS long ago too. Then they made too many changes that weren't back-compatible. [22:54:34] Krenair: ended up forking a local ancient copy and stripping out all the bits we didn't need
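The trailing-full-stop workaround mentioned above would apply to hostnames passed to ferm's @resolve, which goes through Net::DNS. A hedged sketch of what such a rule might look like; the chain, port, and hostname here are illustrative, not an actual puppet change:

```
# Sketch only (hostname/port are examples): the trailing "." marks the
# name as fully qualified, so Net::DNS resolves it as-is and the
# behaviour discussed in the RT ticket is sidestepped.
domain ip table filter chain INPUT {
    proto tcp dport 8140 saddr @resolve((bast4002.wikimedia.org.)) ACCEPT;
}
```

The downside noted in the log is that this papers over the issue in our repo rather than fixing it; hence the suggestion to also file a ticket with ferm for the second problem.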