[05:49:38] 10Traffic, 10netops, 10Operations, 10IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10faidon) Makes sense, +1, go for it! A lot has happened since this task was filled in 2015 (e.g. not having precise anymore, T163196 etc.) and including `int... [08:04:39] ema: hello, as usual thank you for taking care of adding debian-glue on debian repositories :)) [10:05:08] 10Traffic, 10Operations, 10Performance-Team, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) Thanks @Gilles for kickstarting this! For context these are the notes I took when we did the first round of cleanup a couple of years ba... [12:19:28] 10netops, 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) p:05Low>03Normal My team agreed on following up with eqiad1. The only requirement is we have a clear rollback pla... [12:58:57] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) @BBlack that refactoring is awesome! As for why the task got stuck, I did a first a... [13:23:29] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) >>! In T205439#4816740, @hashar wrote: > Out of curiosity: how do you ship the GeoDN... [13:34:01] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) ^ Fixing it to be self-explanatory! :) [14:17:34] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10BBlack) Some interesting stuff here (see also the Mailing Lists link there in the datatracker for discussion): https://datatracker.ietf.org/doc/draft-moura-dnsop-authoritative-recommendati... [16:09:49] fyi, codfw has been depooled from DNS and eqsin/ulsfo caches redirected to eqiad [16:09:58] for codfw row B recabling [16:16:37] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [16:20:43] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [16:38:48] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [16:40:32] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10ayounsi) 05stalled>03Resolved Actually, this can be closed. [17:08:26] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [17:40:51] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) So I see @Joe has merged up some Dockerfile stuff. 
What's our next step to flip ope... [17:42:12] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) BTW: https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 is a good test job whe... [17:57:28] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [18:56:38] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [19:17:27] XioNoX: it looks like we lost ns1/authdns2001 traffic in codfw from ~16:27 - 16:50 or so earlier. Is authdns2001 on row B today? [19:18:08] bblack: it's on A5 [19:18:26] the times line up with e.g. the RB 503s and such during that period [19:18:36] not sure what could have caused that [19:18:45] well, network maintenance and shitty juniper [19:19:08] yeah, but totally different failure domain (vlan, hardware, etc..) [19:19:10] we don't have to understand the exact mechanism to know they're pretty likely not independent events :) [19:19:32] how come monitoring didn't go off? [19:19:37] https://grafana.wikimedia.org/d/000000341/dns?orgId=1&from=now-6h&to=now [19:20:00] because it was probably reachable from wherever it was monitored from, it just wasn't reachable from outside-world clients that drive its dns req stats [19:20:22] ok [19:20:29] oh wait, I read that stupid stacked graph wrong [19:20:36] it's authdns1001 that dropped off a while there [19:20:38] wtf? [19:21:09] but I think, maybe this is a monitoring/stats anomaly [19:21:29] because when we truly lose external traffic to 1/N authdns, you usually see it also as an increase at the others [19:21:34] bblack: and only the edns graph, no? [19:21:36] and the missing numbers aren't even zero, just no-data [19:22:14] better view: [19:22:15] https://grafana.wikimedia.org/d/000000341/dns?panelId=1&fullscreen&orgId=1&from=now-6h&to=now [19:22:34] bblack: going to repool codfw https://gerrit.wikimedia.org/r/c/operations/dns/+/479262 + https://gerrit.wikimedia.org/r/c/operations/puppet/+/479263 [19:22:37] edns may be aggregated [19:23:04] XioNoX: the nginx alert just fired again heh [19:23:22] I *think* that one may be unreliable under low traffic conditions, though [19:23:28] that one keeps flapping since I depooled the site I think [19:23:33] yeah [19:23:42] (as in, we have a low background rate of 5xx, and without the "normal" traffic it looks bigger percentage-wise) [19:23:59] and the restbase alert, I have no idea what caused it [19:24:18] what ended up being wrong with the mgmt thing? 
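(A quick worked illustration of the "looks bigger percentage-wise" point above, using made-up numbers rather than real traffic figures:)

    normal:   5 background 5xx/s out of 2000 req/s  ->  0.25% error rate
    depooled: 5 background 5xx/s out of  100 req/s  ->  5.00% error rate

(The absolute failure rate is unchanged; only the denominator shrank, which is enough to trip a percentage-based alert threshold and make it flap.)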
[19:24:58] bblack: papaul unplugging cables to be able to (un)rack the switch [19:25:13] ok [19:26:20] it's only the mgmt ports of the servers already planned to be offline, so no impact afaik [19:26:25] ok I'm kinda hovering/waiting to test a dns upgrade too on authdns2001 [19:26:31] but I'll wait till after the dust settles from repool [19:27:23] ah okay, let me know if I should postpone repool, I'm about to merge the CR [19:27:28] no go ahead [19:27:44] they're independent, I just don't want more alerting overlap/confusion if either one causes anything :) [19:30:25] done [19:31:06] actually I think I'll wait till after the MW train in ~2.5h, or defer to tomorrow if I get busy with something else [19:31:19] so no dns software upgrade attempt right now [19:45:38] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) A build against master: https://integration.wikimedia.org/ci/job/operations-dns-lint... [19:46:20] bblack: good afternoon. Giuseppe nicely sprinted to add a docker container for dns.git :) The new job is experimental and you can run it on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/462693/ simply by commenting "check experimental" [19:46:35] and if all your recent refactoring worked fine, it should be a success \o/ [19:47:42] and thank you for the detailed explanation about the 261-byte mock database! [19:48:51] hashar: yeah it seems to work: https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/2/console [19:48:56] yeah it is magic [19:49:06] and sorry for the delay due to miscommunication [19:49:33] + [19:50:08] hashar: my only question now (which is really separate and nonblocking): I just uploaded a slightly newer gdnsd package to stretch-wikimedia a very short time ago (43 minutes), and the docker doesn't have it yet on this run. [19:50:11] I completely forgot about that one after my last comment mid-October. That and I guess I had absolutely no idea how to add the GeoDNS db to the container given they are proprietary [19:50:34] so the docker container was built like 4 hours ago [19:50:37] I'm assuming it's because it only runs apt-get update and/or upgrades packages once in a while [19:50:40] and thus it is already outdated :/ [19:51:04] can we make upgrades automatic? [19:51:06] docker is really just a snapshot of .deb packages at some point in time. The containers get outdated rather quickly unfortunately [19:51:41] or maybe: flip things around to where the installation of the latest package is actually done in realtime inside the docker on test, instead of as part of the static/stale image? [19:52:25] but really even just an easy way we can invalidate the docker and get it rebuilt when we know we've upgraded relevant packages, etc would be fine too [19:52:28] yeah we have the same "problem" for Chromium which is updated every couple months or so [19:52:30] I just don't know what the process is [19:52:56] but while we're going through live beta on gdnsd, there's likely to be more of these upgrades over the next several weeks. [19:53:05] potentially the promise is to use debmonitor to scan containers and report obsolete packages. But the work on that is not a priority [19:53:19] what's the manual process? [19:55:03] hmm [19:55:11] nice to see it working! [19:55:35] it is terrible.
Gotta bump the changelog in integration/config.git:dockerfiles/operations-dnslint/changelog [19:55:39] via a dch -i -c [19:55:45] hashar: in any case, we're better off on the current slightly-stale new stuff than we are on the legacy CI, I'd still like to switch ASAP [19:56:05] commit, gerrit, +2. Once merged, we have a Fabric task to run docker-pkg on contint1001 and get the new container built [19:56:12] hashar: bumping that changelog doesn't sound too awful really [19:56:24] then the Jenkins job gets updated (again commit, gerrit +2, deploy of job) to simply change the container tag [19:56:51] as for the job [19:56:54] yeah we can just switch it [19:57:03] and even rebuild the container right now to have the latest version [19:57:11] sure that'd be swell :) [19:58:54] you could make the argument that in our definition of the installed debian packages in the dockerfile, we should explicitly name the exact revs we want installed, instead of just asking for the latest [19:59:09] at which point it would be justifiable and normal that we have to edit those and bump the changelog to get new stuff :) [19:59:30] yeah [19:59:46] the good point is that with the package versions being stale, CI is slightly more stable [19:59:54] right [20:00:02] previously we ran unattended upgrade on a daily basis and it had a bunch of funky side effects [20:00:23] so once in a while in the morning I had to figure out why X Y Z jobs ended up being broken [20:00:34] so at least now we have some control [20:01:09] the cool thing about this model is we're keeping most of the actual "real" CI work down inside the repo-being-tested now. So other than meta-meta issues and/or upgrades of a few debian packages. [20:01:10] but, when we rebuild a container to get some specific package to be upgraded, all other ones get upgraded as well. So the new image is quite unpredictable [20:01:23] We should be able to fix/upgrade our own CI scripts in-band without involving releng [20:01:36] yeah [20:01:49] that is arguably a huge improvement for everyone! [20:02:26] with time we have switched to delegating the test logic to the developers. CI jobs being as dumb as possible [20:02:46] for a java maven repo, the job just: git clone && mvn clean package (roughly) [20:03:11] anyway [20:03:34] for packages what I thought is to run all containers one by one. Run an apt update && apt upgrade --dry-run and report whether there is any delta [20:03:43] so at least we can notice they are outdated [20:04:00] for gdnsd, I am afraid right now you will have to remember to get the container rebuilt :/ [20:04:20] yeah realistically "grab the latest" isn't great anyways, as we might upload a new one but want to keep running CI against the old for a few days before we actually upgrade the servers' packages [20:04:32] yeah [20:05:03] another fun thing that is doable is to run the operations/dns.git script for patchsets made to gdnsd.git :) [20:05:30] yeah, that's tricky but totally doable [20:05:59] then you have some guarantee that gdnsd.git is backward compatible [20:06:07] I could in theory, since our Docker stuff is public.... [20:06:36] define a travis job over on github that pulls in our ops/dns + ci stuff and checks the current ops/dns repo against every commit to upstream github gdnsd/master [20:07:05] so that it alerts as a build/test fail over in github if I ever commit a change to master that would break current ops/dns :) [20:07:47] but that's pretty far down my wishlist right now of related things to fix!
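(A rough sketch of the "run apt update && apt upgrade --dry-run in every container and report the delta" idea hashar describes above. The image name is an illustrative placeholder rather than the real registry path, and it assumes the images run as root with a working apt configuration:)

    #!/bin/bash
    # For each CI image of interest, simulate an upgrade inside a throwaway
    # container and list the packages a rebuild would pull in.
    images="docker-registry.example.org/releng/operations-dnslint:0.0.1"
    for img in $images; do
        echo "== $img =="
        docker run --rm --entrypoint /bin/sh "$img" -c \
            'apt-get -qq update && apt-get -s upgrade' | grep '^Inst' \
            || echo "   (no pending upgrades)"
    done

(Something like this would at least make staleness visible, e.g. flagging the gdnsd upload from earlier in the afternoon, without yet deciding when to rebuild.)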
[20:07:58] so first thing I guess [20:08:11] is to drop the old jobs and use the new one from now on and close that task [20:08:22] right [20:08:32] then I guess we can loop _joe_ in to figure out a way to get the package to magically upgrade [20:09:22] don't worry about magical upgrades. [20:09:37] the next time I need a gdnsd version bump, I'll just put the version in explicitly and bump changelog, etc [20:12:44] hopefully the next one will be 3.0.0 anyways, but I wouldn't count out my ability to keep deferring that and slipping in last minute feature changes :P [20:13:14] ah yeah +1 on using the gdnsd version as a version of the CI container [20:13:43] or not, either way [20:13:52] https://gerrit.wikimedia.org/r/#/c/479270/ does the upgrade [20:13:58] the container could be -0.0.2 which includes installing gdnsd=3.0.0 or whatever [20:14:03] the only relevant part for you is https://gerrit.wikimedia.org/r/#/c/integration/config/+/479270/1/zuul/layout.yaml :) [20:14:16] namely the new job replaces the two other jobs. Rest is cleanup [20:16:06] in an ideal world, you would tag a new version in operations/debs/gdnsd which would result in a package pushed to apt.wm.o and the CI container to be rebuild [20:16:12] well it removes the two other jobs, the third was already there, but I guess some "experimental" flag blocked it from normal runs? [20:16:19] yeah [20:16:28] test-prio / experimental are pipelines [20:16:39] each pipeline reacts to different event and can have different precedence [20:16:57] "test-prio" reacts on new patchset being uploaded and votes verified -1/+1 [20:17:30] "experimental" reacts solely when someone comments "check experimental" in Gerrit and does not vote (it just report the jobs) [20:17:46] oh I get it now, I didn't see the last file in the patch yet [20:18:01] "test-prio" also reacts when someone comments "recheck" in gerrit [20:18:21] I have deployed the CI change so a "recheck" or a new patchset on https://gerrit.wikimedia.org/r/#/c/operations/dns/+/462693/ would run the new job :) [20:18:58] yeah I think I just raced it earlier, trying again! [20:19:26] ah there we go [20:19:28] thanks so much! [20:19:45] https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 [20:19:49] V+2 :) [20:19:54] 10Domains, 10Traffic, 10Operations, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) a:05CRoslof>03tramm Hi @tramm Did you see the comment above? Can we move forward with this? I got reminded this is still open because today we got... [20:20:06] magiiiic [20:21:09] yeah maybe tomorrow, I'll go through the process to bump to 9944 as an explicit version and see how it all goes [20:21:14] so I have the process down for later [20:21:35] 10Traffic, 10Operations, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) 05Open>03Resolved @BBlack refactored the operations/dns test to mock anything th... 
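(For reference, a sketch of roughly what the relevant zuul/layout.yaml fragment looks like after the change discussed above; the real stanza in integration/config may differ in detail:)

    projects:
      - name: operations/dns
        test-prio:        # runs on new patchsets and "recheck", votes V-1/V+1
          - operations-dns-lint-docker
        experimental:     # runs only on a "check experimental" comment, does not vote
          - operations-dns-lint-docker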
[20:21:56] I have marked the task resolved with a simple summary [20:22:01] ah [20:22:28] really, I should be better at reaching out to the repo owners since usually they are in the best place to figure out how to run the tests [20:23:01] 10HTTPS, 10Traffic, 10Pywikibot: SSL CERTIFICATE_VERIFY_FAILED on generating family file - https://phabricator.wikimedia.org/T211813 (10SgtLion) [20:24:12] 10HTTPS, 10Traffic, 10Operations, 10Pywikibot: SSL CERTIFICATE_VERIFY_FAILED on generating family file - https://phabricator.wikimedia.org/T211813 (10SgtLion) [20:24:26] thanks again :) [20:28:42] bblack: and you totally aced the test refactoring. Thank you for all the mocking!! puppet+geodnsd have been a hassle more than a couple of times :) [20:29:23] np! [20:29:36] <_joe_> bblack, hashar: as far as gdnsd package version goes: it's easy to declare which version we want somewhere in the dns repo and check we have the right version installed [20:30:31] _joe_: just to complain about it in output, or does that lead to letting the docker upgrade its own package on the fly when such a job runs? [20:30:45] <_joe_> why not both? [20:30:52] good question! [20:31:07] I assume since the job is concurrent, there's some tricks about racing if it upgrades itself in-place [20:31:17] <_joe_> in the future, we'll have an automated mechanism to update the container image every time we have a new gdnsd version available [20:31:52] <_joe_> bblack: well in the run.sh script I can check if the version is what's expected, and if not run apt to install the right version of the package [20:32:09] <_joe_> we do it for ops/puppet for new versions of the gem bundle [20:32:15] hmmmm ok [20:32:39] buttt [20:32:57] <_joe_> but I don't really love this solution. it's a stopgap. [20:33:02] apt update && apt install require some root privilege. But I guess that is sortable [20:33:03] is there a preferred standard way to declare the package version metadata in ops/dns? [20:33:28] <_joe_> bblack: for starters, no [20:33:33] <_joe_> hashar: yeah that too [20:34:00] or I'd be fine with it being external in the docker template too [20:34:09] <_joe_> so ideally, we would use a predefined template to define the image and its run script *in the dns repo* [20:34:42] yeah that does sound more ideal [20:34:42] <_joe_> jenkins should be able to read that file, see if a new image version is needed, build it, and run tests within it [20:35:24] <_joe_> and once the change is merged, I donno, publish it via gate-and-submit? [20:35:26] but in the meantime, I think it's probably not-awful to do it manually [20:35:31] I assume: [20:35:34] dockerfiles/operations-dnslint/Dockerfile.template:{% set pkgs_to_install = """gdnsd python3 python python-jinja2""" %} [20:35:46] ^ that there's some syntax here I can use to say gdnsd=1.2.3 [20:35:54] and then bump changelog and shove it through and it will eventually rebuild [20:36:00] <_joe_> it's the string passed to apt-get [20:36:03] right [20:36:36] <_joe_> bblack: so yeah for now you can do that [20:36:37] that's fine for now until some later ideal solution emerges [20:37:19] <_joe_> implementing the ideal solution would include finding some jenkins plugin that does something similar to this, probably [20:37:37] <_joe_> I'm not the right person for that honestly :) [20:37:47] well [20:37:56] that sounds like something for blubber / pipeline? :) [20:38:31] imagine operations/dns producing a container that has the proper gdnsd version, the zones etc.
Then one can run that containers to run the wikimedia dns server locally :) [20:39:51] if the container ran as bare metal under a systemd slice or whatever, it could work [20:39:55] <_joe_> hashar: you could adapt parts of the pipeline, yes [20:40:06] <_joe_> but I doubt blubber supports generic containers [20:40:18] but then there's all the other bits too, the authdns-update stuff that coordinates deploying changes [20:40:32] anyways [20:40:34] for another day! [20:40:53] we've made huge progress on a bunch of debt this week, and I'm unblocked on more things than I can even attack today, so we're good :) [20:41:07] <_joe_> happy to hear that :) [20:41:09] thanks again all! [21:01:43] \o/ [21:09:34] * volans waits for the script to be able to run docker locally :-P [21:09:46] * volans happy too to see this live! [21:13:42] 10netops, 10Operations, 10ops-codfw, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [22:30:18] bblack: I don't understand what's wrong in https://integration.wikimedia.org/ci/job/operations-dns-lint-docker/6/console [22:31:06] for https://gerrit.wikimedia.org/r/c/operations/dns/+/479337 [22:38:52] 10netops, 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) To be pushed: `lang=diff,name=cr1-eqiad [edit interfaces ae2 unit 1120 family inet] address 10.64.22.2/24 { ..... [22:44:25] XioNoX: look for the red ;) [22:44:33] E002|MISSING_OR_WRONG_IP_FOR_NAME_AND_PTR: Missing IPv4 '208.80.155.92' for name 'vip-gw-cloudnet.eqiad.wmnet.' and PTR '92.155.80.208.in-addr.arpa.'. Current IPs are: ['10.64.22.4'] (defined in 155.80.208.in-addr.arpa:45) [22:44:40] volans: yeah but what does that mean? [22:45:11] I should be able to add the same PTR for 2 different IPs, no? [22:45:33] you defined only the PTRs without the related direct A/AAAA records [22:45:45] well for starters, it's odd that a public IPv4 maps to a .wmnet hostname [22:46:13] but also, there's an existing record in templates/wmnet for vip-gw-cloudnet.eqiad.wmnet -> 10.64.22.4 , which is not the IP you just put in the PTR [22:46:41] brandon was quicker than me to write :D [22:46:50] and also, there's no forwards in the patch for any of those new 4? [22:47:51] some of the others have existing entries in wmnet vs the new ones in wikimedia.org too [22:48:01] not all zones/subnets are checked for all things yet, but it all seems a little fishy [22:48:10] templates/10.in-addr.arpa:1 1H IN PTR vrrp-gw-1120.eqiad.wmnet. [22:48:25] vs your public-subnet: [22:48:26] 89 1H IN PTR vrrp-gw-1120.wikimedia.org. [22:48:31] I want to reserve the IPs based on their current names [22:49:13] well it only reserves against other humans editing the file really. You could put in a comment to claim the territory maybe? [22:50:47] yeah, I guess I'll do that for now [22:51:51] well, I should be able to assign PTR for all router IPs except the labs side and let wmcs name it the way they want [23:00:04] XioNoX: PS2 it's still adding warnings though, as we consider bad practice to add PTRs without forward records [23:00:20] ok [23:00:22] I know, CI is not telling you... 
one thing at a time :) [23:00:58] but I'll leave it to Brandon to decide if that should be our policy or not :) [23:01:18] we already have a lot of violations, and the rules are yet to be written in stone ;) [23:03:55] for most of the infrastructure IPs we add PTRs for reservation and to show up in traceroute and co. [23:04:27] looking in the wikimedia.org zone file, I don't see any forward for infra IPs [23:07:05] 10netops, 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) @aborrero Everything is ready to be merged/committed. I used the name `vip-gw-cloudnet.wikimedia.org.` let me know if...
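(To illustrate the warning under discussion: the linter wants each reverse PTR paired with a matching forward record, roughly as below. The name and address come from the conversation above; the exact zone file layout is assumed, not quoted from the repo:)

    ; forward, in templates/wikimedia.org
    vip-gw-cloudnet    1H  IN A    208.80.155.92

    ; reverse, in templates/155.80.208.in-addr.arpa
    92                 1H  IN PTR  vip-gw-cloudnet.wikimedia.org.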