[07:03:35] 10Domains, 10Traffic, 10QRpedia-General: qrpedia.org and qrwp.org are down - https://phabricator.wikimedia.org/T209019 (10Samwilson) [07:58:57] 10Traffic, 10Operations, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` ['cp2006.codfw.wmnet', 'cp2012.codfw.wmnet'] ``` The log can be found in `/va... [08:13:38] 10Traffic, 10Operations, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2006.codfw.wmnet', 'cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2006.codfw.wmnet', 'cp2012.codfw.wm... [08:15:57] 10Traffic, 10Operations: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) [08:16:04] 10Traffic, 10Operations: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) p:05Triage>03Normal [08:16:17] 10Traffic, 10Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [08:16:50] 10Traffic, 10Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [08:16:51] 10Traffic, 10Operations: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) [08:17:02] 10Traffic, 10Operations: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (10ema) [08:43:27] interesting, I've tried reimaging cp2006/cp2012 and the first puppet run failed with: [08:43:30] Evaluation Error: Error while evaluating a Function Call, cron_splay(): this host not in set at /etc/puppet/modules/cacheproxy/manifests/cron_restart.pp:15:14 on node cp2012.codfw.wmnet [08:43:58] that's because the hosts are indeed not yet in cache::text::nodes, usual trick to avoid icinga spam [08:44:43] we could call 
cron_splay only if the node is in $all_nodes (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472384/), but I wonder how this has ever worked in the past? [08:47:25] 10Traffic, 10Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [08:49:17] usually when I jump straight to the 6th stage of debugging I'm wrong though [09:16:15] bblack: do you want to lower the webp threshold further today? or wait until next week? [09:29:37] ema: is the list passed to cron_splay coming from a puppetdb query? [09:32:41] volans: nope, it comes from hiera [09:34:56] ah ok, because the trick I've used there is to merge the array with the current FQDN (if it is not there) [09:35:18] (for the puppetdb query, as for the first run the host will not be in the list) [09:46:49] ema: in case you need it, something like: [09:46:50] unique(concat(query_nodes('Class[Role::Cumin::Master]'), [$::fqdn])) [09:48:10] * volans wondering if we should add this functionality to cron_splay directly [09:52:45] volans: you mean adding the host to the set instead of raising a ParseError?
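volans' Puppet trick above — `unique(concat(query_nodes(...), [$::fqdn]))` — can be sketched in Python like this (hypothetical helper name; a simplification for illustration, not the actual Puppet function):

```python
def nodes_with_self(queried_nodes, fqdn):
    """Return the node list with the local FQDN appended if missing.

    Mirrors unique(concat(query_nodes(...), [$::fqdn])): on a host's
    very first puppet run it won't yet appear in the query result, so
    we merge it in rather than let cron_splay() fail.
    """
    if fqdn in queried_nodes:
        return list(queried_nodes)
    return list(queried_nodes) + [fqdn]
```

As volans notes later in the log, the merging host then computes its splay from a different list than its peers, so its slot is only "correctly" splayed once it actually lands in the shared hiera/puppetdb list.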
[09:53:39] yep, I'm wondering whether there are cases in which we call or could call cron_splay legitimately from a host that is not in the list [09:53:55] doesn't make much sense AFAIK and the function already fails [09:54:09] but it might hide other underlying issues [09:56:08] * volans brb in ~20 [10:03:13] also, there is another side effect that I should mention [10:03:44] doing that basically alters the crons, given that the host whose FQDN gets appended will have a different list from the others [10:04:26] hence it will be "incorrectly" splayed, until this host is actually part of the hiera/puppetdb list and all the other hosts have run puppet so that they all have the new splayed cron [10:32:35] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Gilles) 05Open>03Resolved The basic functionality is there. If we want to iterate on that, it should be the subject of a new task. [12:06:55] ema: it's a known issue when adding new nodes to a cluster I think. The reason we generally avoid putting them in the hieradata node list right away is that it starts the clock on ipsec alerts too. [12:07:27] ema: e.g. for slow installs in esams, I usually let them get through most of the mess up through their first failed puppet run for lack of the hieradata I think, then add them and re-run puppet and try to beat the ipsec alerts :) [12:08:46] besides, I think even if you did have the hieradata there from the start, I don't think current cp puppetization succeeds on the first run anyways, it takes a few to get through various dependency issues, and something about our old hacks that prevented the initial (unpuppetized) starts of nginx/varnishd from interfering no longer works either.
[12:11:01] so I think the last few I've done, it was more like: 1-2 puppet runs that do a lot and end in failure -> stop varnishd, stop nginx -> run puppet a time or two again, possibly stop varnishd and/or nginx again -> run puppet 2-3 more times... [12:11:05] it's kind of a mess right now :/ [12:12:52] on the specific topic of cron_splay(): it's an almost-perfect solution to an ultimately unsolvable problem heh. But pulling the list from puppetdb would probably make things worse than they are now. [12:14:44] before cron_splay, we used randomization for the cron timing (which is also tricky to get right over e.g. a whole week with even distribution, but tractable). [12:15:34] the problem is that with this kind of node count (currently ~80?) and roughly weekly timing, it's statistically likely that "random" timing will do Bad Things, even if you retry with a bunch of different seeds. [12:16:19] Bad Things being things like: execute the cron on two nodes from the same DC+cluster very close together, and other such anomalies. [12:19:03] so cron_splay() attacks that by being deterministic. For a given nodelist and seed value, it puts the whole global list in a deterministic order. Spacing between nodes in one DC is maximized, the order of nodes within any given DC is deterministically shuffled, and the time intervals are exactly the average (no very short gaps or very long ones between global crons). [12:19:54] but then that creates a new problem. Even if you leave the seed unchanged, the exact timestamp of a given node's cron entry depends on the contents of the global node list. [12:20:19] if you add or remove a node from the list, everyone's timestamp moves a little bit to accommodate the new spacing. [12:21:01] so if you're operationally relying on "this cron must execute once per week or something's going to fall over and die"....
[12:22:04] live nodeA might have its weekly cron stamp coming up 37 minutes from now, then you go add nodeB to the global list because it's a new install, and nodeA runs puppet and its new time is now 32 minutes in the past, so it has effectively skipped over its weekly run. [12:22:19] and then something breaks [12:22:48] that is the extended story of "this is why cron_splay is nifty, and also why it only trades one problem for another and still doesn't resolve this hard problem" [12:23:32] I suspect making cron_splay's input nodelist be something more dynamic (e.g. puppetdb input over static hieradata) would increase the gyrations of existing nodes' timestamps and cause that skip effect above to become worse in practice. [12:45:37] (or at least, make them less noticed by humans vs editing the list) [12:51:03] totally agree, I was not advocating for that in this case btw, just pointing out how I solved the issue for a similar case, but it's a much smaller set and honestly I don't care about the timing, it's a low-importance one [12:57:44] bblack: the thing is, for both cp2006 and cp2012 puppetization failed with the error triggered by cron_splay() very early, at catalog fetch phase [12:58:01] yes [12:58:07] then you add them and run again :) [12:58:49] add them to cache::text::nodes? [12:58:59] yes [12:59:24] they have to get there eventually for things to work (ipsec and inter-cache fetches and cron splay, etc) [12:59:41] the debate is about when/how they get there [13:00:18] if you do manifests/site.pp for the install and put them in the hieradata nodelist in the same commit, then other nodes will puppetize their ipsec to talk to that node and start failing at it, while you're still waiting on reboots/installer stuff. [13:00:24] and they'll probably reach critical and spam. [13:00:50] what do you mean by "run again"?
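bblack's description of cron_splay() and the timestamp-shift hazard can be illustrated with a toy model (Python; hypothetical and much simpler than the real Puppet function, which also maximizes spacing between same-DC nodes):

```python
import hashlib

def splay_sketch(nodes, seed, period_minutes=7 * 24 * 60):
    """Toy deterministic splay: order nodes by a seeded hash, then
    space them at exactly the average interval across the period."""
    ordered = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{seed}:{n}".encode()).hexdigest(),
    )
    gap = period_minutes / len(ordered)
    return {node: int(i * gap) for i, node in enumerate(ordered)}

# Adding a node changes the average spacing, so existing nodes' slots
# move -- the "skipped over its weekly run" hazard described above.
before = splay_sketch(["cp2001", "cp2002", "cp2003"], seed="weekly")
after = splay_sketch(["cp2001", "cp2002", "cp2003", "cp2004"], seed="weekly")
moved = [n for n in before if before[n] != after[n]]
```

With three nodes the slots sit 3360 minutes apart; with four they sit 2520 apart, so at least some pre-existing nodes necessarily get new timestamps even though the seed never changed.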
[13:00:53] if you wait, you have this initial puppetfail and then you add them and probably beat the race [13:00:57] run puppet again [13:01:00] I cannot even ssh into the nodes :) [13:01:05] it seems you need two different lists, staged and active (like the new hardware lifecycle) ;) [13:01:08] you can, using the new_install key [13:01:26] ah! [13:01:43] so some steps use staged+active and some just active [13:01:49] but yes, this means you can't rely on lazy auto-reimage, you have to step in manually [13:02:27] we used to have different lists at one point I think: one for the intercache and cron_splay stuff, and another for ipsec. [13:02:49] I don't recall when/why that changed, but it did [13:16:05] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) [13:20:05] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) 05Open>03Resolved This has been completed successfully. Everything went as expected, nothing other than C4 went offline. Maintenance took 45min longer than... [13:20:55] alright, now there's a new bootstrapping issue :) [13:21:08] puppet fails at installing nginx, and that's I think because of update-ocsp-all [13:21:40] which in turn fails because /etc/update-ocsp.d/*.conf isn't there yet [13:23:15] let's see if /usr/sbin/policy-rc.d helps there [13:26:58] yes, by telling dpkg not to start daemons upon package installation, I've managed to go past that [14:03:04] ema: sorry, I just made that new one yesterday! [15:08:21] bblack: so, next webp threshold lowering, today or next week? [15:09:47] gilles: next week if you don't mind, there's a bunch of other risks/timelines/priorities playing out through the end of this one. [15:09:59] sure thing [15:11:22] well maybe "a bunch" isn't very accurate.
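For reference, the /usr/sbin/policy-rc.d mechanism ema used above works because invoke-rc.d consults that script before starting a service, and exit code 101 means "action forbidden by policy", so dpkg-triggered daemon starts are suppressed. A minimal sketch (the Python helper is hypothetical; the exit-101 contract is the standard Debian policy-rc.d interface):

```python
import os
import stat

# Exit code 101 tells invoke-rc.d "action forbidden by policy", so
# daemons are not started when packages are installed/configured.
POLICY_RC_D = "#!/bin/sh\nexit 101\n"

def install_policy_rc_d(path="/usr/sbin/policy-rc.d"):
    """Write the deny-all policy script and mark it executable."""
    with open(path, "w") as f:
        f.write(POLICY_RC_D)
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

The file has to be removed again once puppet has converged, otherwise package upgrades will never (re)start their services.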
one very critical one for sure though, and I can't afford multiple emergent problems :) [16:01:12] <_joe_> bblack: you are saying I shouldn't convert the wikis to php 7.2 tomorrow evening? [16:01:28] ema, bblack, XioNoX I'm feeling alone in the "meeting room" [16:05:30] sorry! [16:05:43] _joe_: just make sure you do it after midnight california time on friday [16:06:12] <_joe_> to keep up with old traditions? [16:06:20] <_joe_> will do! [17:01:05] Krenair: so, https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/472188/ checks order status after creation, if the new order already has the corresponding authz marked as valid, the order status will be set to ready and not to pending and then certcentral can skip challenge validation, this will happen at least once for every certificate that we issue [17:01:32] let's say that the first one handled is the rsa-2048 one, the order for the ECDSA one is going to take advantage of this optimization [17:06:01] 10Traffic, 10Operations, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ema) cp2006/cp2012 reimaged and added to cache_text. The nodes are currently depooled but ready to be put back into service. [17:25:21] https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/472487/ gets rid of poll_and_finalize(). That allows certcentral to gain fine control over the finalize process.
In some cases certcentral was misreading the timeout error raised by poll_and_finalize() because the same exception was used for 2 things, the timeout during the poll authorizations phase means that the [17:25:27] challenges have not been checked yet and during the finalize phase means that the finalize request has been sent but the certificate has not been issued yet [17:25:46] that + the short deadline we were using (2 seconds), caused the valid state error [17:26:01] on some occasions certcentral was finalizing the same order twice [17:50:02] vgutierrez, are we going to fix upstream code to raise different exceptions? [17:51:50] not at the moment [17:52:58] but yeah we can open an issue in GitHub and follow up with them [17:53:36] it's also pretty useful to split their finalize method in two [17:53:46] finalize + fetch certificate [17:54:24] otherwise you cannot retry fetching the certificate if a timeout has been raised after the order had been finalized [17:54:53] I want to see how all of this works for us before suggesting it to upstream [17:55:12] ok [17:55:24] pebble is happy with the changes but let's see what Boulder thinks [17:59:15] vgutierrez, so under what circumstances do we get to skip the validation process? [18:00:46] when we create a new order but a valid authz is already there [18:00:56] the validation isn't specific to the privkey used to make the CSR (e.g.
ECDSA vs RSA) - it's generic for the SANs used in the order and persists for days [18:01:02] that makes the newly created order move automatically from pending to ready [18:01:09] so once you've already challenged a given SAN for a given account#, you don't have to do it again [18:01:21] indeed [18:01:29] the authz expires after 11-12 days [18:01:37] so during that timespan you don't need to prove again that you control that SAN [18:01:54] something to be sure we think about, since current one-off testing may not hit this: [18:02:04] it's also quite likely we'll have SAN overlaps wrt authz [18:02:28] so cert1 does initial authz for foo.com + bar.com, then 3 days later cert2 does a SAN with bar.com + baz.com, and bar.com is pre-authd and baz.com needs new authz [18:02:50] just to be sure the whole "already auth'd" thing is noted per-SAN, not per Order/CSR. [18:03:07] hmmm I need to test that, that would trigger a new re-auth for bar.com right now [18:04:08] in its current state it's already beneficial for certcentral because of the ECDSA/RSA dual issuance [18:05:56] ah so they do use the 'account' system for something useful beyond alerting for expiries :) [18:06:12] and easy revocation [18:13:25] vgutierrez, not sure I fully understand https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/472188/2/tests/test_certcentral.py [18:14:24] oh right [18:14:24] I'll get back to you later on this [18:14:28] get_acme_session_mock.return_value.push_csr.return_value = {} [18:14:29] got it [18:14:31] never mind [18:14:36] ack [18:14:53] looks good, just left an unused param [18:16:23] nice thx [18:21:06] 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10ayounsi) [19:40:58] vgutierrez, shall we make a new release?
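The two certcentral changes discussed above can be sketched together (Python; class and function names are hypothetical, not the actual certcentral API): distinct exceptions let a poll timeout ("challenges not checked yet") be told apart from a post-finalize timeout ("CSR accepted, certificate not issued yet" — safe to retry the fetch, not the finalize), and an order that comes back "ready" right after creation (its authzs reused from a prior order within their ~11-12 day lifetime) can skip challenge validation entirely, following the RFC 8555 order states:

```python
class ChallengePollTimeout(Exception):
    """Timed out polling authorizations: challenges not validated yet."""

class IssuanceTimeout(Exception):
    """Timed out after finalize: the CSR was accepted but the cert
    isn't issued yet. Retry fetching the certificate, NOT finalize."""

def next_step(order_status):
    """Map an RFC 8555 order status to the next client action."""
    steps = {
        "pending": "validate-challenges",  # authzs still need solving
        "ready": "finalize",               # authzs valid (possibly reused)
        "processing": "poll",              # finalize sent, CA still working
        "valid": "fetch-certificate",      # certificate issued
    }
    if order_status not in steps:
        raise ValueError(f"unexpected order status: {order_status}")
    return steps[order_status]
```

In the dual-issuance case from the log: the rsa-2048 order validates foo.com's challenges, and the ECDSA order for the same SANs is then created already "ready", so `next_step` goes straight to finalize.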
[19:41:11] I'm finished for the day [19:41:20] go ahead if you want [19:41:33] otherwise I'll do it tomorrow first thing in the morning [19:41:44] I'll leave it to you :) [19:44:00] ook [19:44:07] have a nice evening [22:01:07] 10Traffic, 10Operations: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) `Must-Staple` didn't turn out to be a realistic option for GlobalSign, we'll look at it again later/elsewhere! [22:06:31] 10Traffic, 10Operations, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) The dual RSA+ECDSA certs above have: ` Not Before: Nov 8 21:37:02 2018 GMT Not Before: Nov 8 21:21:04 2018 GMT ` Which leaves us plenty of room for clock skew on the deploy...