[08:42:31] 10netops, 10Operations, 10ops-codfw: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10Peachey88)
[08:59:29] so.. our "old" LE puppetization handles dhparam deploy as well, and the current certcentral::cert fails to do that
[08:59:38] we need something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475968/
[08:59:51] but pleasing our lovely WMF linter
[09:01:47] and probably we should take care of deploying the LE intermediate CA certs as well
[09:53:41] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles)
[09:55:16] vgutierrez, the chain should contain the LE intermediate CA cert?
[09:55:28] it's included there yes
[09:56:14] vgutierrez, so why do we need to manually include it separately?
[09:56:51] currently they're being deployed by letsencrypt::cert::integrated
[09:58:16] what's the current status of ATS for text and upload? are there already test servers up?
[09:59:01] I'm currently verifying some things about HTTP/2 priorities and might as well test our future setup and not just the current nginx/varnish state of affairs
[09:59:06] vgutierrez, sure but our new puppetisation should take care of it automatically unless I'm missing something?
[09:59:13] without being LE-specific?
[09:59:28] the previous setup also deployed the chains
[09:59:45] gilles: yes we have two test clusters, one in eqiad and one in codfw
[09:59:48] okay
[09:59:52] it did it manually
[09:59:55] it hardcoded LE stuff
[10:00:02] now we do it automatically don't we?
[10:00:09] ema: with different config for text and upload "matching" varnish?
[10:00:19] or close to what it's intended to launch as, anyway
[10:00:30] gilles: nope, we're gonna have one single ATS config for both upload and text
[10:00:37] ok, great
[10:00:58] ema: can I hit those clusters publicly?
[10:01:02] Krenair: nope, acme-tiny generates the same .chained.crt and .chain.crt files as certcentral
[10:01:28] so the chain is missing the intermediate cert?
[10:01:32] is it a different URL, a header or something, to have my requests go to ATS?
[10:01:38] surely ACME provides a way to get the right intermediate cert?
[10:01:47] gilles: not really, you can from within wmnet though
[10:02:13] eg: curl -sv -H "Host: upload.wikimedia.org" 'http://cp1071.eqiad.wmnet:3129/wikipedia/commons/7/75/Salvator_Rosa_%28Italian%29_-_Allegory_of_Fortune_-_Google_Art_Project.jpg' > /dev/null
[10:02:23] thanks, I'll make it work
[10:04:01] Krenair: no it's not missing it
[10:04:09] gilles: let me know your findings :)
[10:04:15] vgutierrez, so why do we need to do anything extra?
[10:04:22] ema: https://phabricator.wikimedia.org/T210141 if you want to track it
[10:05:43] Krenair: basically our old LE puppetization adds the intermediate certificates to the system SSL cert store, I guess we did that to avoid issues validating certs issued by LE: https://github.com/wikimedia/puppet/blob/production/modules/letsencrypt/manifests/init.pp#L91-L97
[10:06:26] didn't it just install those so we could manually bundle the intermediate with the signed cert?
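A quick way to double-check the point above — whether the .chained.crt that certcentral/acme-tiny write actually carries the LE intermediate — is to dump every certificate in the bundle. This is only a sketch; the path below is an illustrative guess, not the real deployment location:

```sh
# Print subject/issuer for every cert in a PEM bundle; the LE intermediate
# should show up after the leaf if the chained file is complete.
# /etc/acmecerts/example.chained.crt is a hypothetical path.
openssl crl2pkcs7 -nocrl -certfile /etc/acmecerts/example.chained.crt \
  | openssl pkcs7 -print_certs -noout
```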
[10:07:12] hm
[10:07:16] maybe not
[10:07:23] should figure out why it was doing this and whether we still need it
[10:08:16] yep, taking into account that we already moved some prod systems to certcentral, let's restore the functionality
[10:08:23] we can get rid of it later if needed
[10:09:06] I guess technically on those current systems this stuff is present unpuppetised?
[10:09:15] and we don't really know why it's there?
[10:21:44] right now yes, it's unpuppetised due to our changes
[10:31:00] let's put the resource back in for the time being, then propose a change to ensure => absent it and discuss why it was needed in the first place?
[10:32:46] gilles: I've modified the nginx configuration on pinkunicorn by hand to point to ATS if you set the request header 'X-Use-Ats: true'
[10:33:26] gilles: it should be enough to do that with a browser extension and add `208.80.154.42 en.wikipedia.org` to your /etc/hosts (or equivalent)
[10:33:45] ema: how transparent will nginx be in between, though?
[10:34:24] I can try both anyway, I think that our internal webpagetest should be able to hit the ATS url
[10:34:43] nice
[10:34:51] if they both end up being the same, the header-based routing will definitely make testing easier
[10:34:57] what do you mean by 'transparent'?
[10:35:11] that it won't mess with the HTTP/2 streams in any way
[10:35:40] well so far it's nginx that speaks HTTP/2, we haven't configured h2/tls on ATS at the moment
[10:35:42] I meant that in the sense of a transparent proxy that won't touch anything it ships back and forth
[10:35:49] aaaah
[10:36:00] ok, good to know
[10:36:19] so yes, the header should be the same, right? no extra layer
[10:37:00] (h2/tls still not configured because varnish-frontend is here to stay for a while longer, and it does not speak tls)
[10:37:51] ok, so in practice varnish will still be responding in front of ATS for a while
[10:37:56] even if it's a passthrough request
[10:38:18] hitting ATS the way you've set it up now, is it without varnish in the mix?
[10:38:24] correct
[10:38:57] will varnish frontend get rid of the upload/text distinction as well?
[10:40:05] nope
[10:41:06] so I imagine that for your tests it would be more useful to test nginx -> varnish -> ats
[10:41:20] yes
[10:42:06] but it shouldn't be *that* different from the current prod setup
[10:42:26] I also need to ensure that some of the images in the test are large enough to hit the backend, in which case ATS/varnish might differ
[11:12:00] Krenair: perfect :)
[11:42:33] gilles: I've just pointed cp1008 to an ATS host to make everyone's life easier :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475992/
[11:43:49] so now you can perform whatever test you like against cp1008 and you will hit: nginx on cp1008 -> varnish-fe on cp1008 -> ATS on cp1071
[11:44:18] nice!
[11:44:34] meanwhile, /me reads enwiki via ATS
[13:11:04] cool
[13:15:26] re: the system install of the chain certs, I think there are some other scripts that need to be able to find them, e.g. possibly the OCSP stapler (which we'll eventually add to puppetization of LE certs as well)
[13:20:46] bblack, that doesn't pull them from the chain.pem ?
[13:20:56] or was it chain.crt
[13:21:07] crt yes
[13:21:28] I'm not sure!
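Pulling together the two test paths described above (the hand-edited 'X-Use-Ats: true' routing on pinkunicorn and the /etc/hosts pointer at 208.80.154.42), something along these lines exercises it from a shell without touching /etc/hosts. The IP and header come from the messages above; this is a sketch, not a tested recipe:

```sh
# Pin en.wikipedia.org to cp1008/pinkunicorn and ask its nginx to route to ATS,
# then show a few response headers to see which layer answered.
curl -sv -o /dev/null \
  --resolve en.wikipedia.org:443:208.80.154.42 \
  -H 'X-Use-Ats: true' \
  'https://en.wikipedia.org/wiki/Main_Page' 2>&1 | grep -Ei '^< (server|via|x-cache)'
```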
[13:23:12] no, it doesn't use the chain/chained files
[13:23:17] ack
[13:23:37] it uses the singular leaf cert file, and looks up the intermediate in the system CA dirs
[13:24:15] so now we know why it's there :)
[13:25:09] there are other ways to attack the problem I guess, but the OCSP stapling script has to work with the non-LE certs and probably at least originally some legacy cases too
[13:25:49] ultimately "update-ocsp" (the script that does the work) shells out to openssl, and the openssl CLI wants the leaf cert and the intermediate as separate file arguments.
[13:26:30] we could structure that around .chain.crt or whatever, if we're sure we're consistent that all things that deploy certs that might be stapled have those variants available, etc
[13:26:46] (probably!)
[13:27:06] but for now it's probably easier to skip that refactoring rabbithole and just deploy the intermediates :)
[13:27:29] yep, it's a little bit out of scope to tackle that right now IMHO
[13:30:26] I think we're slowly approaching a state where the new LE puppetization will be responsible for almost all certs anyways, at which point we can really refactor/simplify a lot of related bits around the puppet tree.
[13:31:14] the only cases I know of where we don't have a long-term answer and will continue manual issuance are the big unified with multiple vendors for redundancy.
[13:31:34] (and arguably, fundraising's cert should be similarly redundant based on importance, but currently isn't)
[13:32:08] but that's like, ~2-4 manually-managed certs, and the rest automated. We can make them conform to whatever schema makes more sense for the automated solution.
[13:34:20] Krenair: do I still need to update the commit message? :)
[13:35:37] no
[13:36:46] please update that -1.. it would be weird to merge that with a -1 standing there :)
[13:56:09] thx :D
[14:24:56] vgutierrez, I'm not sure https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476025/2/modules/role/manifests/netmon.pp https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475981/4/modules/profile/manifests/archiva.pp and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475978/2/modules/role/manifests/tendril.pp make sense?
[14:25:09] what do you mean?
[14:25:15] shouldn't you include the sslcert::dhparam require where the ssl_ciphersuite call is made?
[14:25:30] then we wouldn't please the WMF linter
[14:25:34] why not?
[14:25:35] I discussed this approach with _joe_
[14:25:50] cause it would be a class included from another class (that's not a profile)
[14:25:55] god dammit
[14:26:03] I know
[14:26:06] sanity -= 10
[14:26:09] right now
[14:26:27] this is the rule that's supposedly there to decrease dependencies between modules
[14:26:46] instead we just move it around a bit and hide all the dependencies in profiles
[14:27:12] IMHO this is better than it was before
[14:27:29] having the dhparam inside the LE puppetization when it is actually a TLS configuration requirement
[14:29:20] 10Traffic, 10Operations, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez)
[14:30:40] vgutierrez, why is the apt cert dated 25th nov?
[14:31:16] no wait
[14:31:42] ignore me
[14:34:05] looking good here
[14:35:40] yes I was wrong
[14:35:47] reading failure
[14:47:36] dhparam is a temporary problem anyways, it will go away sooner than many other related problems.
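For context on why update-ocsp wants the intermediate available as its own file: the underlying openssl invocation looks roughly like the sketch below, with the issuer (intermediate) passed separately from the leaf. File names here are illustrative, not the script's actual arguments:

```sh
# Fetch an OCSP response for a leaf cert. openssl needs the issuer cert as a
# separate -issuer file, which is why a system-installed copy of the LE
# intermediate is handy when the stapler only has the singular leaf file.
OCSP_URI="$(openssl x509 -in leaf.crt -noout -ocsp_uri)"
openssl ocsp -issuer intermediate.crt -cert leaf.crt \
  -url "$OCSP_URI" -respout leaf.ocsp -noverify
```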
[14:48:30] (we could arguably already dump at least DHE support for most endpoints other than the big public cache termination, and there's also no reason to support it if TLSv1.0 isn't supported either, which again we should probably already have dropped from most other entrypoints, and will eventually for the big public ones too)
[14:49:14] we just haven't done all the requisite refactoring of ssl_ciphersuite and its callers, etc. There are some patches I made a few months ago that are related, but I never finished working on them.
[14:49:47] but for now, odds are decent we may deploy CC-based certs to the cache clusters before we dump TLSv1.0 and/or DHE from them, so it's easier to leave it all in place for now until we can remove it completely.
[14:50:22] <_joe_> if you're about to do any refactoring of functions, I'd suggest you also port them to the new puppet function api
[14:51:02] <_joe_> (we have one such example, wmflib/lib/puppet/functions/role.rb
[14:51:16] <_joe_> it's a cleaner interface at the very least
[14:51:29] I think eventually ssl_ciphersuite will get simple enough that we'll just dump it, in favor of just two different fixed strings and fairly trivial settings
[14:51:32] <_joe_> and it's also going to be supported in the future
[14:51:44] but that might be several months out yet!
[14:51:44] <_joe_> bblack: hopefully!
[14:52:30] it's all about the TLSv1.0 removal for the big cache entrypoints. Once we're over that hurdle, a lot of other things clean up quickly.
[14:53:31] (because at that point or very shortly after, we can probably dump public support even in the worst case for both DHE-based forward secrecy, and the need for dual certs to support RSA as well)
[14:54:21] it will clean up our nginx patching a bunch as well, to not have to support reliable OCSP + dual certs
[14:58:15] for reference: if today we suddenly dumped support for: DHE fs, RSA public keys, and TLSv1.[01], we'd lose ~2% of client requests (which isn't a great number, but it's much better than it has been historically, and getting better all the time).
[14:59:10] all of those factors have some overlaps with each other. Whichever order we drop them in, each one reduces the remaining percentage needing the rest as we go.
[15:00:41] vgutierrez, gerrit-slave and gerrit next?
[15:01:52] yup
[15:02:18] then dumps, mirrors, lists - in that order?
[15:04:45] yeah.. maybe icinga at the end
[15:10:03] oh icinga
[15:10:07] could do that next
[15:10:09] it's behind LDAP login so very few people will be hitting it
[15:21:23] mail servers last
[16:06:23] bblack: the codfw-eqord maintenance is done, but codfw-ulsfo might still happen again tomorrow (no completion notice received), I do think it's safe to repool ulsfo though
[16:12:48] XioNoX: there were some odd events after the maint window was done today, too
[16:13:20] bblack: ah?
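If someone wants to poke at the DHE/TLSv1.0 situation bblack describes from the client side, handshake probes like these give a quick read on what an endpoint still negotiates. They are purely illustrative; the ~2% figure above comes from request sampling, not from probes like this:

```sh
# Does the endpoint still accept a TLSv1.0 handshake?
openssl s_client -connect en.wikipedia.org:443 -tls1 </dev/null 2>&1 \
  | grep -E 'Protocol|Cipher'

# Will it pick a DHE ciphersuite if that's all the client offers (TLSv1.2 forced)?
openssl s_client -connect en.wikipedia.org:443 -tls1_2 -cipher 'DHE' </dev/null 2>&1 \
  | grep -E 'Protocol|Cipher'
```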
[16:13:32] XioNoX: I happened to observe a few icinga criticals for upload@ulsfo reliability on #-ops, let me see if I can find timestamps or other history
[16:13:47] thx
[16:14:08] good thing we depooled
[16:15:27] 13:17 <+icinga-wm> PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:15:31] 13:18 <+icinga-wm> PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[16:15:35] 13:22 <+icinga-wm> PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[16:15:39] 13:32 <+icinga-wm> RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[16:15:43] 13:33 <+icinga-wm> RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:16:00] ^ this was the set I observed (times UTC, which is a bit after both windows were done). I'm not even sure if circuit issues were the cause, and I haven't trawled logs to see if there were other issues before/after yet
[16:16:50] it could also be something else, e.g. something swift/thumbor-related was temporarily unreliable, and it only showed up in ulsfo reliability percentages because the background traffic level was so small while depooled.
[16:24:08] bblack: so that alert means a spike of 5xx, right?
[16:24:19] something like that, I think! :)
[16:24:49] it's odd that it was just upload and not text, and also odd that it was just ulsfo
[16:24:59] so yeah, multiple possible causes
[16:25:27] sort of, 500 / (200+500); I think, like bblack said, just a few errors would cause availability to drop when there isn't a lot of traffic
[16:30:06] https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=2&refresh=1m&fullscreen&orgId=1&from=1543324088500&to=1543326276010&var-site=ulsfo&var-cache_type=upload&var-status_type=5
[16:30:42] seems like an incredibly small amount of 500/s
[16:38:31] ok
[16:40:30] bblack: also interesting, BFD on the ulsfo-codfw hasn't flapped in 6 days, so whatever work they did didn't cause that link to go down
[16:42:16] 10netops, 10Operations, 10ops-codfw: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10akosiaris) > ores2001, 2*ganeti, 15*mw > cc @akosiaris to know what specific actions need to be taken for Ores and Ganeti for ores2001, nothing is really required aside from some downtime in i...
[16:43:22] 10netops, 10Operations, 10ops-codfw: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10akosiaris) > @akosiaris for ores2008 Just schedule downtime in icinga and do whatever actions are required. The service will happily keep chugging along on the other 8 hosts in eqiad.
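To make the availability point above concrete: the metric is roughly successes over total responses, so the same trickle of 5xx weighs far more against a depooled site's tiny background traffic than against normal load. The request and error rates below are made up purely for illustration:

```sh
# Hypothetical numbers: ~5 errors/s against 200 req/s (depooled) vs. 4000 req/s (pooled).
awk 'BEGIN { printf "depooled: %.2f%%\n", 100 * 200  / (200  + 5) }'   # ~97.56% -> trips the alert
awk 'BEGIN { printf "pooled:   %.2f%%\n", 100 * 4000 / (4000 + 5) }'   # ~99.88% -> within thresholds
```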
[16:49:39] bblack: so yeah, I still think it's safe to repool, but I can understand if we want to be extra safe
[17:11:33] XioNoX: right, that's the one with no completion notice that hasn't flapped, so I'd guess they've deferred to their second window (or at least, our part of it shifted out to there, if there's multiple subparts to the work)
[17:16:47] indeed
[17:22:37] XioNoX: so yeah, I guess repool today
[17:23:07] ok
[17:23:18] we didn't lose any purges, and if we get it done Soon we'll be under the 24h mark and still have not quite lost all our cache contents :)
[17:24:15] actually they say it shouldn't go down, but they're sending the notification as their work is close to our fi
[17:24:24] actually, "** The work has been listed as Service Affecting due to the complexity of the work taking place in close proximity to lit services. **"
[17:24:30] so it might not even go down at all
[17:25:02] ok
[17:30:55] repooled
[18:46:57] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-production-error: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier)
[19:23:17] 10Traffic, 10Operations, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10greg) Just checking: this task is in the "active situation" column of the #wikimedia-incident project and has been open for a while. I see there are sub-tasks that look like follow-ups. Shoul...
[19:27:38] 10Traffic, 10Operations, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10BBlack) Seems reasonable to close this; the event itself is long over. There are still risks present for a followup event, but if we close up all the actionables that goes away eventually....
[19:28:25] 10Traffic, 10Operations, 10Wikimedia-Incident: Puppet doesn't restart ferm on failure - https://phabricator.wikimedia.org/T206951 (10greg)
[19:29:15] 10Traffic, 10Operations, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10greg) 05Open>03Resolved a:03ayounsi Done, thanks!
[19:36:24] bblack: if you have time for a quick chat: there is someone asking on the maps ML about using maps and using referrer or not
[19:40:51] he would like to not use a referrer, for privacy concerns. This would violate the maps policy (https://foundation.wikimedia.org/wiki/Maps_Terms_of_Use)
[19:41:40] as I understand it, requiring that 3rd-party users send a referrer is more general than just maps
[19:43:30] any opinion on the subject?
[19:48:53] gehel: there's a way to do a referrer that includes only the site name and not the specific URI, which is what we do outbound to others as well, IIRC...
[19:49:20] e.g. when we link from en.wp to a third-party site, they only get "Referer: en.wikipedia.org" basically, and not the article name
[19:49:50] * gehel knows mostly nothing about how that works browser-side
[19:50:08] there's a policy thing for it, I'm just having a hard time remembering it or digging for it
[19:51:11] Oh, looks like Gergo already replied with similar info
[19:51:51] https://www.w3.org/TR/referrer-policy/
[19:51:57] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Krinkle) See also T194814, which this task could resolve. > x-analytics As I understand this, this field mainly exists to transmit data...
[19:51:59] we use "origin-when-cross-origin" I think
[19:52:17] 10Traffic, 10Operations, 10media-storage, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10Krinkle)
[19:52:46] looking at current live enwiki outputs, we actually emit these 3 meta tags:
[19:52:49] (meta tag stripped from the log)
[19:52:52] (meta tag stripped from the log)
[19:52:54] (meta tag stripped from the log)
[19:53:09] I'm not sure about that first one, but the latter two I know were about some broken spelling of the option in standards/browsers, to work around older Safaris or something
[19:54:20] obviously, they *can* choose to send us no referrer, and then I guess it's up to us whether we reject requests without referrer (which we don't currently choose to do)
[19:54:45] it's a nice thing to do, and o-w-c-o above doesn't leak much that can be used to correlate against a user
[19:55:18] yeah, we don't enforce the referrer, but since someone asked and is trying to follow our terms of service, I'll try to help them do the right thing :)
[19:57:19] Ok, I'll try to summarise all that. Thanks!
[19:57:46] from a pragmatist/defensive POV if I were in their shoes: if they choose to provide no referrer data at all while most nice users do, and then we get hit by some overwhelming spam of traffic without referrer from , we might choose to defensively block un-referred requests quite quickly, and then they'll suffer.
[19:59:11] yeah, they are proposing to send a pseudo-referrer as a query parameter, but that does not sound like a good idea
[20:00:01] they probably were just like me and did not know about sending just the site
[20:01:03] assuming they're https-only/mostly on their end, I think by default with a modern browser going cross-origin from one HTTPS site to another provides no referrer data at all, if you don't know to set that meta tag.
[20:01:41] so when we added that (when we learned the hard way after our HTTPS enforcement some time back!), we added it to restore basic origin-only referrer, whereas initially we were sending nothing-by-default.
[20:02:06] because we do link to e.g. GLAM sites and they also like to know the traffic is coming from us, etc
[20:56:47] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10TheDJ) what about ?debug=true ? We already vary on that right ? might as well vary which set of headers is let true...
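As a rough illustration of the origin-only referrer bblack describes: a third-party page embedding Wikimedia map tiles under origin-when-cross-origin would end up making requests that carry only its site's origin, not the full page URL. The example below just simulates such a request with curl; the tile path and the third-party origin are made up for illustration:

```sh
# Simulate the tile request a browser under origin-when-cross-origin would make:
# only the embedding site's origin is sent in the Referer header.
curl -sv -o /dev/null \
  -H 'Referer: https://example-maps-user.org/' \
  'https://maps.wikimedia.org/osm-intl/3/4/2.png' 2>&1 | grep -i '^> referer'
```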
[19:51:59] we use "origin-when-cross-origin" I think [19:52:17] 10Traffic, 10Operations, 10media-storage, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10Krinkle) [19:52:46] looking at current live enwiki outputs, we actually emit these 3 meta tags: [19:52:49] [19:52:52] [19:52:54] [19:53:09] I'm not sure about that first one, but the latter two I know where about some broken spelling of the option in standards/browsers, to workaround older Safaris or something [19:54:20] obviously, they *can* choose to send us no referrer, and then I guess it's up to us whether we reject requests without referrer (which we don't currently chose to do) [19:54:45] it's a nice thing to do, and o-w-c-o above doesn't leak much that can be used to correlate against a user [19:55:18] yeah, we don't enforce the referrer, but since someone asked and is trying to follow our terms of service, I'll try to help them do the right thing :) [19:57:19] Ok, I'll try summarise all that. Thanks! [19:57:46] from a pragmatist/defensive POV if I were in their shoes: if they choose to provide no referrer data at all while most nice users do, and then we get hit by some overwhelming spam of traffic without referrer from , we might choose to defensively block un-referred requests quite quickly, and then they'll suffer. [19:59:11] yeah, they are proposing to send a pseudo-referrer as a query parameter, but that does not sound like a good idea [20:00:01] they probably were just like me and did not know about sending just the site [20:01:03] assuming they're https-only/mostly on their end, I think by default with a modern browser going cross-origin from one HTTPS site to another provides no referrer data at all, if you're unaware to set that meta tag. [20:01:41] so when we added that (when we learned the hard way after our HTTPS enforcement some time back!), we added it to restore basic origin-only referrer, whereas initially we were sending nothing-by-default. [20:02:06] because we do link to e.g. GLAM sites and that also like to know the traffic is coming from us, etc [20:56:47] 10Traffic, 10Analytics, 10Operations, 10Performance-Team: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10TheDJ) what about ?debug=true ? We already vary on that right ? might as well vary which set of headers is let true... [23:15:23] 10netops, 10Operations, 10ops-codfw: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10Gehel) > @Gehel for wdqs2006 Depooling and downtime in Icinga should be good enough. There should be no user traffic on this server and updater will catch up on lag once connectivity is restored.