[00:44:44] Domains, Traffic, DNS, Operations, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (BBlack) There's a lot else to be said about the subject of the `.wiki` TLD (much of which has been said before on past tickets), and I tend to...
[08:37:44] pybal's canDepool patch is still looking for reviewers: https://gerrit.wikimedia.org/r/#/c/403677/
[08:54:20] Traffic, Operations, Performance-Team, Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3902078 (Gilles) It turns out to be quite common for load.php calls to take more than a minute: https://logstash.wikimedia.org/goto/7...
[09:35:03] i can review tomorrow
[09:35:12] my tech work day!
[09:36:07] \o/
[09:36:09] thanks!
[13:32:03] Traffic, netops, Operations, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3902515 (Joe)
[15:26:31] Traffic, Android-app-feature-Compilations, Operations, Reading-Infrastructure-Team-Backlog, Wikipedia-Android-App-Backlog: Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3902896 (Fjalapeno)
[15:47:00] so, authdns upgrades
[15:47:16] I've added https://wikitech.wikimedia.org/wiki/Service_restarts#Authoritative_DNS which reflects my understanding of the procedure
[15:47:47] I don't know how to perform the static route changes though (eg: route ns0 to baham instead of radon)
[15:48:45] show route for ns0 looks like this on cr1-eqiad:
[15:48:50] 208.80.153.231/32 *[OSPF/150] 3d 08:55:03, metric 0, tag 0
[15:48:50] > to 208.80.154.194 via ae0.0
[15:48:50] [Static/200] 73w5d 01:24:12
[15:48:50] > to 208.80.154.93 via ae3.1003
[15:50:01] show route for *ns1*, that is
[15:50:58] for *ns0*:
[15:51:00] 208.80.154.238/32 *[Static/5] 8w0d 18:22:05
[15:51:02] > to 208.80.154.93 via ae3.1003
[15:51:02] [OSPF/150] 8w0d 18:21:59, metric 0, tag 0
[15:51:02] > to 208.80.154.194 via ae0.0
[15:51:53] so 208.80.154.238 (ns0) goes to 208.80.154.93 (radon), good
[15:53:04] it's unclear to me why 'Static/5' would win over 'OSPF/150' (wasn't it that the highest number wins?)
[15:54:03] haha no
[15:54:08] > In every routing metric except for the BGP LocalPref attribute, a lesser value is preferred.
[15:56:48] mark, XioNoX: how would one go about routing (for example) ns0 to baham instead? Couldn't we use BGP MED and change the MED values when doing maintenance for this stuff too?
[16:00:08] that's something else actually
[16:00:12] the /5 and /150 are not local preference
[16:00:17] they are "distance"
[16:00:21] so there, lower wins again yes
[16:00:31] static routes win over dynamic routes, basically
[16:00:40] seems reasonable
[16:01:22] easiest is probably to temporarily change the static route, since it exists
[16:01:31] if we drop statics and only use bgp, of course we could do stuff there too
[16:01:33] but atm static wins
[16:03:29] ema, afaik, there is no BGP daemon advertising the VIP from the DNS servers, we could though, maybe with https://phabricator.wikimedia.org/T98006 or pybal
[16:05:54] so the first entry is the static route to radon, the second one (OSPF) is to cr2-eqiad
[16:06:01] what's the purpose of the second entry?
[16:06:51] on cr2-eqiad, the second one points to cr1-eqiad, so clearly some type of failover, but how does it work?
[16:10:24] ema: the routers have a rule that says "export the local static routes to the OSPF neighbors"
[16:13:23] ema: that's because both routers have a leg in the vlan where radon is, so the preferred way from cr1 or cr2 to radon is directly to the host, but the backup way is for example cr1 -> cr2 -> radon
[16:14:53] XioNoX: Cool. Based on what does the router choose to go the backup way?
[16:14:58] as the preferred route is static, that protects mostly from the interface ae3.1003 on the routers going down
[16:15:26] ok so if ae3.1003 is marked as down -> backup way?
[16:16:09] ema: based on the routes and their metric. For the router to go the backup way, the static route would have to be unusable, and for static it's not common
[16:17:47] actually, as the routes are configured with no-resolve, even if the interface goes down, they will most likely stay up
[16:20:09] XioNoX: that doesn't sound good :)
[16:22:52] mmh so many questions
[16:22:54] ema: it would be great if all of it was using BGP
[16:26:20] XioNoX: ok, I have too many questions. :) Let me go back to the original one. How does one do this? https://wikitech.wikimedia.org/wiki/Service_restarts#Authoritative_DNS
[16:27:01] mark mentioned editing the current static routes, can we document how that is done?
[16:27:56] ema: yeah, instead of next-hop radon, the route needs to say next-hop baham, etc.
[16:28:59] and last time we did the esams one, there was a routing issue that blackholed the traffic I think
[16:29:14] ouch
[16:30:09] ema: do you need to do it now/shortly?
[16:31:22] XioNoX: it isn't urgent, no
[16:55:40] I was looking at the mtr reports from here to ns[012]: https://phabricator.wikimedia.org/P6595
[16:56:23] is it normal that in the case of ns[01] the hop before ns[01] is a zayo address while in the case of ns2 it is cr2-esams?
[17:03:56] that's the case for text-lb.{eqiad,codfw} vs text-lb.esams too
[17:24:19] ema: yeah, it's because 64.125.129.70 are the IPs allocated to us by Zayo, but they are on our routers (our side of the zayo-wiki link)
[17:24:56] ah!
[17:25:22] 80.249.209.176 is actually owned by AMS-IX, and they let us configure the reverse DNS, and it's on our router as well (our side of the AMS-IX-wiki link)
[17:28:51] ema: ^ (in case you didn't see the 2nd part)
[17:30:53] XioNoX: thanks!
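Note on the route selection discussed above (15:53 to 16:17): the rule that the lower "distance" wins, and that the OSPF entry only takes over when the static route becomes unusable, can be sketched in a few lines of Python. This is a toy model and not how the router is implemented. The Route class, the select_active helper and the usable flag are hypothetical names introduced for illustration; the prefixes, preference values, next hops and interfaces are copied from the show route output in the log.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Route:
    """One candidate route for a prefix (hypothetical model, not a router API)."""
    protocol: str
    preference: int      # the "distance" after the slash, e.g. Static/5
    next_hop: str
    interface: str
    usable: bool = True  # assumption: stands in for the router's own usability check


def select_active(routes: list[Route]) -> Optional[Route]:
    """Return the usable route with the lowest preference value, or None."""
    candidates = [r for r in routes if r.usable]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.preference)


# Values taken from the cr1-eqiad output quoted in the log.
ns0_routes = [
    Route("Static", 5, "208.80.154.93", "ae3.1003"),
    Route("OSPF", 150, "208.80.154.194", "ae0.0"),
]
ns1_routes = [
    Route("OSPF", 150, "208.80.154.194", "ae0.0"),
    Route("Static", 200, "208.80.154.93", "ae3.1003"),
]

# ns0: Static/5 beats OSPF/150, so traffic goes straight to radon.
active_ns0 = select_active(ns0_routes)

# ns1: the static route has preference 200, so the OSPF route is active.
active_ns1 = select_active(ns1_routes)

# Failover only happens if the static route is considered unusable.
ns0_static_down = [replace(ns0_routes[0], usable=False), ns0_routes[1]]
backup_ns0 = select_active(ns0_static_down)
```

In this model, flipping usable to False is what triggers the backup path. As pointed out at 16:17:47, with no-resolve the real static route will most likely stay up even when ae3.1003 goes down, so the failover case shown last may rarely occur in practice.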
[19:32:38] netops, Operations, hardware-requests, ops-eqiad: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#3903795 (Ottomata) p:Triage>Normal
[19:39:19] Traffic, Operations, media-storage: "Error: 404, Requested domainname does not exist" when accessing Commons categories/images; works on mobile page - https://phabricator.wikimedia.org/T181801#3903849 (Ottomata) p:Triage>Normal
[19:39:43] Traffic, Operations, Page Content Service, RESTBase, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3903853 (Ottomata) p:Triage>Normal
[19:44:00] Traffic, Cloud-VPS, DNS, Operations, Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#3903900 (Ottomata) p:Triage>Normal
[20:07:11] Traffic, Operations, Page Content Service, RESTBase, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3897945 (Pchelolo) I've found another issue here: for mobile content, redirects are returned by #restbase w...
[20:07:14] Traffic, Operations, media-storage: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#3904023 (Ottomata) p:Triage>Normal
[20:09:31] Traffic, Analytics, Operations, Research, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3904037 (Ottomata) p:Triage>Normal
[20:14:05] netops, Cloud-VPS, Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3904055 (Ottomata) p:Triage>Low
[20:50:59] Traffic, Operations, Page Content Service, RESTBase, and 3 others: Inconsistent behavior when fetching redirected pages with Cache-Control header - https://phabricator.wikimedia.org/T184833#3904146 (Pchelolo) Submitted a PR for RESTBase to fix inconsistencies on RB side: https://github.com/wikime...