[07:32:55] netops, Operations, ops-codfw, ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4241965 (ayounsi)
[08:01:35] XioNoX: If I remember correctly, now that row c migration has been finished, we can continue with lvs1015 installation? O:)
[08:01:53] (morning!)
[08:02:18] correct! I need to identify which switch ports we can use, do you have the task handy?
[08:02:48] https://phabricator.wikimedia.org/T184293
[08:03:25] long task is long
[08:04:55] yup
[09:05:09] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4242095 (ayounsi) For lvs1015, @Cmjohnson can you cable the following? |host|hostport|switch:switchport| |---|---|---|---| |lvs1015|enp4s0f0 (primary)|asw2-c-eqiad:xe-7/0/...
[09:05:15] vgutierrez: https://phabricator.wikimedia.org/T184293#4242095
[09:05:19] <3
[11:32:26] bblack: hey, it would be awesome to give some love to (y)our update-ocsp script and release it in its own repo... it's a problem that's not well-solved yet
[11:47:06] indeed
[11:51:23] * mark needs more reviews! :)
[11:55:19] mark: it's my fault... I give you some and you ask for more :(
[11:57:12] we can ask ema to do some more, he's been slacking! ;)
[11:58:44] * vgutierrez hides
[12:03:17] of course, if I can't even get away with a lazy "Small fixes" commit message, that means you get more work to do...
[12:03:35] hahahah
[12:20:18] i guess be thankful i only do coding 2 days a week ;)
[12:41:23] * ema reviewed the easy one :)
[12:41:28] haha
[12:42:01] that's cheating :-P
[12:42:02] * volans hides
[12:42:06] but I'm the newbie here!
[12:42:07] mark: any reason for not merging https://gerrit.wikimedia.org/r/#/c/428901/ yet?
[12:42:10] that's unfair
[12:43:50] ema: there was a little thing that bugged me about it
[12:44:02] and i pondered about it here on the channel
[12:44:06] ...and then didn't actually put it in gerrit
[12:44:09] and now i forgot what it was
[12:44:12] but i know it was small
[12:44:15] so I guess I will merge it now!
[12:44:19] see, time fixes everything
[12:45:11] ha
[14:26:02] ema: are you playing with cp1046?
[14:29:01] vgutierrez: I was checking a few things on the host, yes, I'm done now
[14:29:08] ack
[15:06:34] mark: I'm missing some integration tests on the BGP side
[15:07:27] these are unit tests, not integration tests
[15:08:07] what specifically are you thinking of?
[15:08:24] I know, but for our sanity I think that some integration tests on the BGP side would be awesome
[15:08:55] you're going to kill me
[15:09:01] but I think that we could leverage gobgp for that
[15:09:06] *gobgpd
[15:10:08] yeah i'd like to add some eventually
[15:10:08] https://osrg.github.io/gobgp/
[15:10:43] but i don't think i'll get to that before i migrate to python3
[15:11:00] that's why i'm completing unit test coverage; to test simple syntax errors etc
[15:11:06] small binary, easy to run standalone or in a container.. and with a CLI / API that can let us change the behaviour within the test cases
[15:11:26] mark: is the plan to migrate to py3 completely or just add support for it?
[15:11:31] completely I think
[15:11:40] i don't really see the point to maintain backwards compatibility
[15:11:47] mark: hmmm maybe it would be easier if we could help you coding all of that :P
[15:11:53] depends if/how much pybal is used elsewhere
[15:11:57] volans: not much at all
[15:12:03] it has never been advertised
[15:12:16] eventually we want to do that, after we clean it up a bit and get rid of all the tech debt
[15:12:31] vgutierrez: what do you mean?
[15:13:08] there are a bunch of tools pretty useful to ease the migration, feel free to ping me if you need any advice once you start tackling it ;)
[15:13:16] volans: anyway, even for the few pybal users out there, i don't think it's a big burden to tell them that from version X.0 onwards, they have to use python3
[15:13:39] +1 for me, py2 is close to end of like (maybe for real this time?)
[15:13:45] *life
[15:13:50] yeah, so now it's finally time to make the switch hehe
[15:14:23] volans: i think you did a complete switch for cumin as well, correct?
[15:15:11] indeed
[15:15:23] from version 3.0.0 is py3 only
[15:15:28] I made also the versions align :D
[15:15:33] hehe
[15:15:40] *coff*OCD*coff*
[15:16:01] vgutierrez: you have no idea how much I blamed (and still blame) myself for skipping a version
[15:16:12] yeah I'm not sure I can do that for pybal
[15:16:20] and then we need another major version to include the FSM work
[15:16:32] vgutierrez: 1.2.0 is missing
[15:17:46] anyway
[15:17:54] for pybal itself, the bgp code is not even all that critical
[15:17:59] it's pretty easy to test
[15:18:14] it does a single prefix announcement on startup, never retracts it (that's broken atm actually ;)
[15:18:26] so as long as the connection handling works, pybal is happy
[15:18:55] bgp collision detection was broken too, didn't really matter
[15:19:48] i have a bunch of NaiveBGPPeering test case expansions/bug fixes that I haven't pushed yet
[15:20:07] "small fixes"? :-P
[15:20:08] * volans hides
[15:20:57] -1!
[15:20:57] if you guys get so irked by that i might push some of those through ;p
[15:21:32] always fun to tease OCD people
[15:49:40] haha, complains about many reviews but automatically has himself added to every one? ;)
[15:51:37] wut? that must be a bug
[15:51:46] :P
[16:24:07] bblack: when you get a few minutes, I would appreciate your thoughts on the patch I have at . It is an attempt at safe HTTP->HTTPS redirection for tools.wmflabs.org
[16:36:01] bd808: it's probably the right first step. eventually you have to deal with POST and such as well, but this contains the initial breakage to a smaller subset, then can tackle that.
[16:37:03] I remember the pain of closing the POST loophole all too well ;)
[16:38:55] the TL;DR for baby steps is: (1) Redirect just GET/HEAD + start sending HSTS -> (2) 403 all the non-GET/HEAD traffic. It seems like there are better ideas in the middle, but they don't work out in practice, so you just take these two breaking steps (the first will affect clients that fail to implement HTTPS, or implement only ancient HTTPS, or fail to follow redirects)
[16:39:41] morning bblack :D
[16:39:54] it's almost lunchtime here :)
[16:41:23] vgutierrez: re earlier comments about update-ocsp: yeah it's not a well-understood problem-space. The trouble with update-ocsp is I think it's somewhat specific still to how we lay out our installed certs and/or the tools/manifests of the rest of our modules/sslcert . I'm not sure what it would take to divorce it as generically-useful.
[16:41:50] most implementors won't even try this approach because it's hard to get right, and hey nginx/apache/etc have auto-stapling as a feature.
[16:42:47] best-effort thingie, it doesn't work if you need to use a proxy...
[16:43:03] except that all the auto-stapling implementations suck: on server start/restart and similar situations, they necessarily either stall initial requests while fetching staples, or they send unstapled responses until things are working. last I saw all of them also fail to refresh staples early, so there's another stall/unstapled-window at TTL=0 when refreshing.
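The two-step rollout from the 16:38:55 message can be sketched as a small decision function. This is only an illustration and not the patch under discussion: the names plan_response, extra_headers and HSTS_VALUE, the stage argument, and the max-age value are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

SAFE_METHODS = {"GET", "HEAD"}
# Hypothetical policy value; choose a max-age that suits the rollout.
HSTS_VALUE = "max-age=86400"


def plan_response(method: str, scheme: str, host: str, path: str,
                  stage: int) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) to short-circuit a request, or None to serve it.

    stage 1: redirect plain-HTTP GET/HEAD to HTTPS, let other methods through.
    stage 2: additionally answer 403 to plain-HTTP non-GET/HEAD requests.
    """
    if scheme == "https":
        return None  # already secure, serve normally
    if method.upper() in SAFE_METHODS:
        return 301, {"Location": f"https://{host}{path}"}
    if stage >= 2:
        return 403, {}
    return None  # stage 1 still tolerates e.g. POST over plain HTTP


def extra_headers(scheme: str) -> Dict[str, str]:
    """HSTS is only honored by browsers when it arrives over HTTPS."""
    if scheme == "https":
        return {"Strict-Transport-Security": HSTS_VALUE}
    return {}
```

Keeping the stage as an explicit parameter makes the second breaking step a one-line configuration change once the remaining plain-HTTP POST traffic has been dealt with.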
[16:43:30] so they're more like "auto-staple-most-of-the-time-kinda" implementations
[16:44:14] but firefox will freak out if the staples go missing and the upstream is unreachable, so for a reliable high volume site that matters, you really need the offline-style stapling like this, that refreshes early.
[16:44:58] and of course we're now beginning to enter the era where certs having the Must-Staple property are feasible, which makes that even more hardcore a requirement.
[16:46:06] but yeah, aside from un-WMF-ifying the script a bit, it really should be a daemon rather than a one-shot script.
[16:46:24] mostly so it can manage expiry->refresh more-intelligently. there's a ticket somewhere about that which is long-stale.
[16:47:32] (TL;DR - we don't want to pointlessly be refreshing super-early all the time... but once we decide to refresh a staple later in the staple's life, if we start seeing failures we want to retry increasingly-faster as the deadline approaches. So one-shot script with fixed timing from cron can't really do this stuff optimally)
[16:48:19] right now we opt for just having the cron timing aggressive enough that it gives us decent reaction time if repeatedly-failing (which will alert icinga)
[16:49:27] relatedly, I think as browser sec improvements + must-staple take hold and people start looking at it from this angle and/or desiring some update-ocsp like tooling... there will probably be more pressure to upstream something like our stapling-multi-file patch to nginx.
[16:49:37] yup.. I've seen all of that checking T163541 between pybal reviews
[16:49:37] T163541: cache hosts should auto-repool iff OCSP files are sane - https://phabricator.wikimedia.org/T163541
[16:50:40] I don't think I even tried to upstream that particular patch, because the previous 1-2 similar SSL-related patches we tried to upstream were met with apathy and never upstreamed, until they finally just re-did it themselves in a completely different way a year or two later.
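The refresh behaviour wished for at 16:47:32 (no pointless early refreshes, then retries that speed up as the staple's deadline approaches) could look roughly like the sketch below. It is not how update-ocsp works today: the function name next_attempt_delay, the refresh_fraction and min_delay parameters, and the "half of the remaining lifetime" rule are all assumptions chosen for illustration.

```python
def next_attempt_delay(now: float, issued: float, expires: float,
                       refresh_fraction: float = 0.5,
                       min_delay: float = 60.0) -> float:
    """Seconds to wait before the next fetch attempt for one OCSP staple.

    All times are in seconds on the same clock (e.g. Unix timestamps).
    Before the refresh point, wait until the refresh point. After it
    (i.e. when earlier attempts failed), wait half of the time left until
    expiry, so retries get more frequent as the deadline approaches,
    but never more often than once per min_delay seconds.
    """
    lifetime = expires - issued
    refresh_at = issued + lifetime * refresh_fraction
    if now < refresh_at:
        return refresh_at - now
    remaining = max(expires - now, 0.0)
    return max(remaining / 2.0, min_delay)
```

A long-running daemon would call this after every failed attempt, whereas a cron job with fixed timing can only approximate it by running at the most aggressive interval all the time.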
[16:51:25] basically if upstream is going to be annoying/slow at feedback or inclusion process, it's not worth my time to go out of my way to force them to look at patches :P
[16:52:21] but perhaps the landscape has gotten better, it's worth another shot! :)
[16:55:52] upstream needs some love sometimes
[16:56:17] but yeah :)
[16:56:22] I get your point
[17:00:44] saga starts in august 2015, where we basically refreshed/amended patches that had been floating around a couple years and tried to upstream them again after we're already using them: http://mailman.nginx.org/pipermail/nginx-devel/2015-August/007225.html
[17:02:01] bumped again in May 2016, to which the response was that they're working on their own separate patch to be released Soon: https://forum.nginx.org/read.php?29,261089,266905#msg-266905
[17:02:18] so my recollection was only vaguely correct, but anyways it was annoying
[17:07:33] anyways, they finally did release their official version in that same month with 1.11.0
[23:20:35] i ran into this again, can't run apache-fast-test on anything that is on cache_misc if we restrict firewall to cache_misc
[23:20:38] https://gerrit.wikimedia.org/r/423557
[23:20:49] suggesting to add that testscripts to cache_misc itself for that
[23:45:11] Traffic, netops, Operations, ops-ulsfo: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4244733 (RobH) Ok, so no good news: I went ahead and did the following things, in the following order. Each step was followed by checking the optics diagnostics to see the send/rcv powe...