[03:29:37] 10Traffic, 10Operations: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (10Krenair) p:05Triage>03Normal [03:41:11] 10Traffic, 10Operations: certcentral: Provide script for certificate revocation - https://phabricator.wikimedia.org/T203423 (10Krenair) p:05Triage>03Normal [13:14:17] 10Traffic, 10Operations: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (10Vgutierrez) I've been working under the assumption that basically our client was too aggressive and some times pebbles wasn't quick enough, but apparently it's getting stuck during... [13:35:04] 10Traffic, 10Operations: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (10Vgutierrez) Same happens with http-01 validation, but in this case pebble output is more helpful cause it's more verbose: ```expected pebble output during http-01 validation Pebble... [13:56:26] vgutierrez, around? [13:57:44] yip [13:57:57] debugging the phantom tests issue [14:01:47] vgutierrez, so tox will pull dependencies from setup.py tests_require? [14:02:46] yup, by running python setup.py install or python setup.py tests [14:03:09] BTW, replacing == with >= it's kinda tricky [14:03:21] maybe we should run the tests twice [14:03:42] once against the expected versions in stretch, and another one against *at least* the stretch versions [14:06:16] hmmm I already know why our tests fail sometimes [14:06:30] the fake dns server inside test_pebble.py is not resilient enough [14:07:15] I'm pretty tempted to rewrite the DNS server with dnspython [14:07:36] and drop that dnslib stuff [14:12:01] .. sigh! never trust random code snippets [14:12:15] running strip() on a bunch of bytes is a pretty bad idea :) [14:13:21] :) [14:13:41] volans: regarding testing against different versions of the deps [14:13:55] volans: IIRC you do that somewhere, right? [14:14:05] vgutierrez: yes, but I'm entering a meeting now [14:14:21] volans: ttyl <3 [14:14:27] see https://github.com/wikimedia/cumin/blob/master/setup.py#L50 [14:14:31] and the related tox.ini [14:15:02] thx [14:15:12] Krenair: ^^ we can use that as an inspiration :) [14:16:02] it's not perfect, you could argue that you should test all possible combinations [14:16:51] the assumption is that all install deps are >= or ==, never free version [14:18:15] free versions? we're not savages! [14:18:16] ;P [14:30:46] 10Traffic, 10Operations, 10Patch-For-Review: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (10Vgutierrez) As mentioned in https://gerrit.wikimedia.org/r/457915, performing a string strip() operation on a bunch of bytes that are actually a DNS query it'... [14:34:15] lol @ "I'm pretty tempted to rewrite the DNS server" <- is probably always a bad idea in the general case, the rabbithole is infinite [14:34:46] although for a test harness piece, I'm sure it can be managed, the statement just stood out in general :) [14:34:47] bblack: well... it's a tiny fake DNS server used in the integration tests [14:35:11] for the real thing I let it you handle it ;P [14:36:02] exactly :P If only someone had said that to me 10 years ago! [14:38:06] bblack, was gdnsd supposed to be tiny? [14:38:55] I was a couple months into a new job, and we were contemplating buying some exorbitantly expensive F5 Global Load Balancers to do GeoDNS, and I said something like "I'm pretty tempted to rewrite the DNS server [in perl]" [14:39:03] and then I did, and then the rabbithole just kept going [14:39:18] haha [14:39:54] Krenair: BTW, it will be also pretty interesting to test our codebase against python 3.5-3.7, to see how it will behave after upgrading to buster [14:40:08] but of course that could be included in another commit and not in this one [14:40:23] https://github.com/wikimedia/cumin/blob/master/tox.ini <-- something like this [14:40:40] I've been using python 3.5 [14:41:03] it also works in python 3.7 in OSX [14:41:04] O:) [14:41:05] the very first version was a perl daemon, which only had the bare minimum bits to do GeoDNS for A-records, and you had to delegate a "zone" for e.g. www.example.com to the perl DNS server just for that one IP, because it couldn't do all the other general-case parts of the authserver [14:42:41] looks like buster has 3.6 [14:42:46] yep [14:43:02] so at least covering 3.5 and 3.6 will be useful IMHO [14:46:18] vgutierrez, [14:46:22] 14:45:05 ERROR: InvocationError: could not find executable 'pylint' [14:46:25] 14:45:10 ERROR: InvocationError: could not find executable 'coverage' [14:47:32] right, you cannot ask for pylint or coverage without tellint tox that they should be installed before [14:47:41] https://github.com/wikimedia/cumin/blob/master/tox.ini#L39-L43 [14:47:48] I think you're missing something like that :) [14:49:05] and the proper adjustment in setup.py ofc [14:49:37] (too late, seeing PS19) [14:50:00] oh right [14:51:57] Krenair: I'm wondering if I should do a rebase and inject the test fix before the packaging commit [14:52:14] could do in a minute [14:52:24] it would make our lifes easier merging the 3 other commits [14:52:53] 10Traffic, 10Operations, 10TechCom-RFC, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Ottomata) > But I'm unsure whether we should also standardise the value format of the header (as UUID). I'd be fine with not re... [14:57:14] Krenair: I'm already checking the error in PS20 [14:58:52] Krenair: right, the dependency name is not openssl, is pyOpenSSL [15:00:16] and uh [15:00:17] pyyaml [15:00:54] :) [15:02:43] need to move some stuff from tests_require into extras_require? [15:03:00] yup [15:03:25] requests_mock and dnslib are test deps [15:05:08] hmm [15:05:19] wrong version of acme? [15:05:59] let me check [15:07:03] yeah.. too newer I'm afraid [15:07:06] acme==0.26.1 [15:07:39] that's what I was saying that testing against >= and == are different things :) [15:08:27] yeah [15:09:06] of course it would be nice to be able to detect that kind of issue ASAP, so IMHO both tests are needed [15:09:23] but of course I'd break the build only if it fails against debian versions [15:13:00] bblack: ATS returns 404 to PURGE requests if the object is not cached [15:13:14] 15:11:28 ************* Module certcentral.certcentral_api [15:13:14] 15:11:28 I: 61, 0: Locally disabling unused-variable (W0612) (locally-disabled) [15:13:14] 15:11:28 E: 95,19: Method 'logger' has no 'info' member (no-member) [15:13:14] 15:11:28 E:115,23: Method 'logger' has no 'info' member (no-member) [15:13:25] bblack: which is probably a feature, we can now tell precisely how many purges were useless [15:13:41] flask change maybe? [15:14:05] we say >= 0.12.1, it's got 1.0.2 [15:14:12] yup [15:14:14] huge change [15:14:42] could we set all the install deps as == for this commit? [15:14:47] bblack: that being said, vhtcpd logs a message every time the purge returned status code > 400. We probably want to change that, maybe log a message if status code > 400 and != 404? [15:15:00] so we get the same versions that we get in debian stretch packages [15:15:12] looks like they went 0.12.1 through 0.12.4, then 1.0, 1.0.1, 1.0.2 [15:15:19] you can set >=0.12.1,<1.0.0 [15:15:20] and then we'll sort out a strategy to give compatibility against new versions [15:15:21] ema: yeah seems like the sane thing to do [15:15:32] what volans suggests makes sense as well [15:15:59] see https://github.com/wikimedia/debmonitor/blob/master/setup.py#L11 for example [15:16:19] bblack: >= 400 that is. Thanks! :) [15:16:44] well, there's another subtlety in there [15:17:04] there's also this gem: pr->status_ok = p->status_code < 400; [15:17:23] oh it's all part of the same check, right [15:17:43] yeah [15:18:02] yeah so just update that conditional to exclude 404 and change the warning output text, should be good! :) [15:19:28] alright well tests are succeeding again finally [15:19:39] nice :D [15:20:07] the following are the ones without >= [15:20:08] 'pylint < 1.7.0', [15:20:08] 'acme == 0.25.1', [15:20:08] 'flask == 0.12.1', [15:20:20] Krenair: btw, why are you limitting pylint? [15:20:29] I don't remember [15:20:38] also get rid of that empty line in setup.py:11 :) [15:20:42] newer ones usually complain about more things, but is good :) [15:21:29] Krenair: hmm do we need to have dup. dependencies between install_requires and extras_requires? [15:21:45] I'm not sure [15:21:51] no [15:22:10] when you use .[foo] that installs install_requires + extras_requires[foo] [15:22:34] I think I've put that in a comment in cumin's tox.ini IIRC [15:22:48] nice, so lets get rid of the dups as well :D [15:23:05] but overall this is looking pretty well \o/ [15:23:38] so I can drop the stuff from extra_requires that are in install_requires? [15:23:59] indeed [15:26:43] let's return to fixing pylint, acme, flask etc. for the future in another commit [15:26:58] for now I'm trying volans' >=0.12.1,<1.0.0 etc. [15:27:15] and acme >= 0.25.1, < 0.26.0 [15:27:26] (I assume it was 0.26.0 that broke things and not 0.26.1) [15:28:35] ack [15:29:01] (PS27 fail is the phantom test bug) [15:29:07] yup [15:30:23] third time lucky... [15:36:34] hmm let's apply https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/457915/ ASAP :) [15:37:16] yeah [15:37:20] will do [15:38:07] Krenair: BTW, it would be nice to include in the README.md the steps required to generate the .deb package itself [15:40:22] bblack: something like this? https://gerrit.wikimedia.org/r/#/c/operations/software/varnish/vhtcpd/+/457927/ [15:40:49] vgutierrez, I'm beginning to loose track of all the things we've said we should do over IRC [15:41:02] with all the things I haven't addressed from previous comments [15:41:08] ack [15:42:10] bblack: jenkins failing because it tries to run debian-glue stuff but this is the master branch, trying to figure out how to avoid that :) [15:43:00] I've got [15:43:09] * fix new versions of pylint/acme/flask dependencies [15:43:21] * README for .deb creation [15:43:28] * https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/17/cli/make_account [15:43:31] * https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/17/debian/install [15:43:52] * https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/7/debian/source/format [15:43:58] vgutierrez, does that list sound complete? [15:44:24] https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/17/debian/service from moritzm [15:44:35] also https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/17/debian/postinst [15:44:38] vgutierrez, thought I did that one? [15:44:54] yeah look, that got done: https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/27/debian/service [15:45:07] right.. 28 change sets.. [15:45:07] :) [15:45:15] as did your second one: https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/27/debian/postinst [15:45:40] hmmm [15:45:49] if we're failing cause www-data doesn't exist [15:46:04] it should be sane to require either nginx or apache as a dependency, right? [15:47:04] other web servers might work [15:47:15] I've written the nginx config for it but I imagine apache could do it [15:48:17] ema: yeah looks fine. we never had a separate debian branch since it was only really for making a local deb. maybe should, but whatever works easiest [15:49:06] Krenair: I guess that any package providing "httpd" [15:49:23] looks like both apache2 and nginx packages provide it [15:53:01] Krenair: https://packages.debian.org/stretch/httpd [15:53:21] Krenair: listing httpd as a dependency will require any of those installed, but not one in particular [15:53:44] do we know if they all provide www-data? [15:54:33] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519 (10ayounsi) 05Open>03Resolved Yep, all good! Thanks! Alert created with this runbook: https://wikitech.wikimedia.org/wiki/Network_monitoring#Port_with_no_description_on_... [16:00:39] bblack: we do have a debian branch actually :) [16:01:45] Krenair: sadly nope :( [16:01:53] Krenair: yaws doesn't do it for instance [16:02:22] 10Traffic, 10Operations, 10Patch-For-Review: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (10Krenair) 05Open>03Resolved [16:02:25] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Krenair) [16:02:37] ema: oh hmm [16:02:55] ema: maybe jenkins shouldn't check master? [16:03:54] bblack: it should only check the "debian" branch. Unfortunately you can call the debian branch whatever you want (master is a common choice) [16:05:10] it's even gbp-buildpackage's default for --git-debian-branch [16:05:27] maybe it should skip if there's no debian/ subdir? [16:06:03] yeah that seems like a good idea [16:06:34] Krenair: let's keep it like that unless we find something better then [16:07:05] vgutierrez, like it currently is? [16:07:19] we could depend on http and do the www-data user check [16:07:20] Krenair: yup, in PS28 [16:07:29] httpd* [16:07:35] right [16:07:46] we could echo a warning if it doesn't exists [16:07:58] like.. fix the systemd unit file before running certcentral [16:10:37] Krenair: BTW, run_backend looks like it should be a main() function in certcentral.certcentral [16:10:55] and just list that main function as an console entry_point in setup.py [16:11:09] (adding a comment..) [16:13:42] vgutierrez, yeah you already did a comment about that [16:13:44] haven't got to it yet [16:13:54] ack [16:13:58] next on the list though [16:14:01] :D [16:15:47] bblack: ok to release? tests are green on the debian branch https://gerrit.wikimedia.org/r/#/c/operations/software/varnish/vhtcpd/+/457934/ [16:16:42] vgutierrez, how about PS30? [16:17:12] (I know I need to do the same thing on the new make_account commit) [16:17:45] bblack: not urgent of course, just looking forward to seeing ATS dealing with all the purges :) [16:18:05] Krenair: looking good :D [16:20:25] apparently jenkins disagrees [16:20:53] oh because I made a new function without a docstring [16:22:39] Krenair: yup.. linters are crying [16:24:59] okay so [16:25:02] after dinner [16:25:03] * fix new versions of pylint/acme/flask dependencies [16:25:03] * README for .deb creation [16:25:03] * https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/7/debian/source/format [16:25:15] ema: yes! [16:33:42] also: [16:33:43] W: python3-certcentral: binary-without-manpage usr/bin/certcentral-backend [16:33:43] W: python3-certcentral: binary-without-manpage usr/bin/certcentral-make-account [16:37:12] just as a double check always do a dpkg -c of the generated deb to check if everything is there [16:37:19] and if there aren't things supposed to be there ;) [16:37:36] (in addition to lintian of course, but I see you're already doing that) [17:05:28] volans, yeah I'm doing that [17:05:32] alex@alex-laptop:~/Development/Wikimedia/Operations-Software-Certcentral (review/alex_monk/T203422)$ cat build.sh [17:05:35] gbp buildpackage --git-export-dir=../build-area/python3-certcentral -b --git-ignore-new [17:05:37] dpkg -c ../build-area/python3-certcentral/python3-certcentral_0.1-1_all.deb [17:57:41] bblack: are there any limit in place on the gdnsd side for the TTL of the discovery records? [18:05:02] * volans afk, will read later [18:23:59] 10netops, 10Operations: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (10ayounsi) a:03ayounsi Opened JTAC case 2018-0904-0650. [19:36:19] 10netops, 10Operations: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (10ayounsi) Confirmed by JTAC, it's PR1315128. [19:47:15] volans: do you mean like a minimum floor value? [19:49:14] volans: the default gdnsd config has a "min_ttl" setting of 5s, and it will raise anything to at least that (even static records). We can explicitly configure that down as low as 1, but not 0. [19:51:03] volans: and in the zonefile template (wmnet) they're configured with a TTL of 300/10 (max 300, min 10) [19:51:23] bblack: yeah that 300/10 I was looking for [19:51:31] but didn't find at first look [19:52:20] the intent of the ranging there (max=300, min=10) is to let healthchecks vary the TTL within the range, but I'm not honestly sure what currently happens with that range and our lack of healthchecks, just manual admnistrative states... [19:52:31] well "manual" via confd I mean [19:53:48] but you can always skip that complexity and set a fixed value there with a slash [19:53:55] e.g. 5/5 for a fixed 5s TTL [19:54:14] ok, but yeah that answers completely my question, I was looking in the discovery config files and didn't thought of checking the zone file [19:55:28] the whole automagic TTL mangling was meant to be dropped in 3.x anyways, it's confusing and tries too hard to be smarter than it is [19:55:37] it may have to wait for 4.x though, to get killed! [19:56:50] lol, so what's the plan you have in mind ? [19:57:09] volans: looking closer a bit myself, it looks like our confd stuff for discovery-dns has a TTL setting in confd too, the 300/10 just caps the range it can express (so does the global config's min_ttl=5) [19:57:09] it's actually used for the switchdc, we lower the TTL before it and re-raise it after [19:57:34] yes, that I was aware (the confd part) [19:58:00] I don't have a fully-formed plan for how to replace it, that's part of why it's probably not making 3.x [19:58:13] lol, fair enough [19:58:17] but to be clear, I don't think being able to administratively tweak TTLs is a bad idea. [19:58:34] the Bad part of the current stuff is how the healthchecks automagically affect them, in cases where healthchecks are used [19:58:46] got it [20:00:03] the gdnsd healthcheck plugins have anti-flap counters to smooth raw health results into stable states. e.g. "Starting from perfect, it will take X isolated probe failures without an intervening run of Y sequential successes to trigger state transition to unhealthy", but a bit more complex than that. [20:00:45] so because of this, even though the true state of the resource is unpredictable, it can predict the minimum time it will take for a persistent failure to show through the anti-flap and be exposed as a state change. [20:01:10] so it would automatically gradually turn down TTLs based on this anticipated time-to-declared-failure. [20:02:01] which is kind of neat, but also really complicated and poorly understood, and probably not as useful as it seems at first :) [20:05:07] right, it's ahead of it's time ;) [20:07:04] I may bring back something similar, but without it being TTLs. Maybe have it adjust weights on a weighted pool of alternate answers (slowly drop the weight of a failing resource before removing it completely) [20:07:08] I donno [20:07:33] there's a lot of possibilities for feedback loops in either case though [20:08:15] (e.g. dynamic weight drop stabilizes a failing service, reaching some unintended intermediate state of barely-functional-but-kinda-failing) [20:09:35] but anyways, if that magic is killed, we wouldn't need these min/max TTL concepts on dynamic records. there's no real reason to put range caps on something that can only be explicitly configured (e.g. by the mechanism confd uses, or the admin_state file) [20:10:10] as of now it actually helps me to know there is a cap [20:10:36] how so? [20:11:01] and in general might be useful to avoid mistakes, but you might argue that this check should be elesewhere, in the confd template or conftool itself [20:11:09] right [20:11:26] it's just quicker, after lowering the TTL to say 10, we want to wait old_ttl before doing anything [20:11:40] ah I see [20:11:42] and right know I know that waiting 300s cover all cases [20:11:56] if it was open I should get max(TTL) from conftool [20:11:59] not a big deal [20:12:09] yeah, we should probably have some rules in the etcd/confd part of it for capping it, I donno [20:12:18] so don't let this corner use case drive any logic change [20:12:36] you could also make the argument that the zonefile TTL should still be honored as max, just no min-cap setting there [20:13:08] it seems like an un-useful case to support, using dynamic administrative input to raise a TTL larger than it's specified in the zonefile [20:14:23] anyways, 3.x already has a net reduction of ~3600 lines of code, in spite of being better-documented and enforcing a new code style that added at least a line per function :P [20:14:29] more complexity reduction can wait for 4.x! [20:14:43] ahahah [20:14:45] nice! [20:15:11] I'm not sure there aren't useful cases in which you could legitimate ask to set a record TTL higher than the zone file [20:15:14] 's one [20:15:44] well there are definitely useful times to say "hey this TTL should be higher, let's change it", but that seems more like a persistent change you'd make to a zonefile. [20:16:02] temporary changes, seem like they would always be reductions against the default/max [20:16:08] (in operational practice, I mean) [20:16:19] for example as the zone TTL is the default, you could for sake of DRYness and reduce verbosity avoid to repeat your low default everywhere [20:16:30] but then having 5 records that you want with higher TTL, dunno, just rambling [20:17:44] you could still put explicitl TTLs on just those 5 in the zonefile [20:18:18] and change them dynamically up tp that value? [20:18:26] someone should make a patch against ops/dns to remove all the pointless 'IN' on all the lines too heh [20:18:29] yeah [20:18:47] the 'IN' part of the RRs is optional and IN is the default [20:18:59] and also the only allowed value in gdnsd's case [20:19:19] eheheh right [20:19:41] if/when you do it tell me that I might have to update the zone validator script ;) [20:19:45] maybe wait for after the Great Netbox Transition, when much of our data becomes un-manual [20:20:08] makes sense [20:21:07] 251 files changed, 12999 insertions(+), 16606 deletions(-) [20:21:20] ^ current master..3.x net diff :) [20:21:33] wow [20:21:42] I'm almost there, just trying to squeeze in a few last minute things [20:21:44] quite some work! [20:23:12] I'm hoping/planning to push back to the main repo's master sometime Soon, and then cut some kind of alpha release we can try out here before trying to make a real 3.0 release [20:23:21] maybe later this week, if no annoying setbacks in QA stuff [20:23:46] (but I said that last week heh) [20:24:40] ehehe [21:25:04] hola [21:25:28] wondering if this wiki is frontend by varnish: fixcopyright.wikimedia.org/wiki/Main_Page [21:27:08] cc bblack [21:27:15] nuria: it is [21:27:20] bblack: k [21:27:41] nuria: need something patched up in transit? :) [21:28:36] bblack: jaja, no, was wondering if we will see pageviews in kafka just like anything else [21:28:43] *everything else [21:30:48] ok :) [22:43:27] bblack: hmm, I'm seeing weird stuff re: A-L & fixcopyrightwiki [22:43:33] km@km-pt ~> curl -I "https://fixcopyright.wikimedia.org/wiki/Main_Page" | grep vary [22:43:33] vary: Accept-Encoding,Cookie,Accept-Language,Accept-Language,Accept-Language [22:43:40] is it supposed to be there 3 times? [22:43:54] note that MW isn't even outputting vary: A-L yet, so I guess that would add a fourth [22:51:21] I'll comment on the ticket [22:51:53] 10Traffic, 10Operations, 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10Legoktm) [15:43:27] bblack: hmm, I'm seeing weird stuff re: A-L & fixcopyrightwiki [15:43:33] km@km-pt ~> curl -I "https://fixcopyright.wikimedia.... [23:22:59] legoktm: ack, looking [23:41:44] 10Traffic, 10Operations, 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10BBlack) Should be fixed now, pending caches clearing out old results. I don't think it actually harms anything in the meantime. [23:43:47] bblack: thanks :) [23:55:38] 10Traffic, 10Operations, 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10Legoktm) Great :) And the special page TTL was fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/457097 to be 24h. [23:57:27] np!