[05:39:14] <_joe_> mutante: I don't think we need apache-fast-test for anything that doesn't have 100+ servers in the cluster
[05:46:43] https://www.slideshare.net/InternetSociety/international-bandwidth-and-pricing-trends-in-subsahara-africa-79147043
[06:52:44] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4245080 (10ayounsi) Thanks, codfw uses MMF with those optics [[ https://apps.juniper.net/hct/model/?component=QFX-QSFP-40G-SR4 | QSFP+-40G-SR4 ]] ulsfo is SMF with [[ https://apps.juniper.n...
[07:15:52] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4245085 (10ayounsi) a:05ayounsi>03RobH
[09:25:04] 10Traffic, 10netops, 10Operations, 10ops-ulsfo, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4245361 (10ayounsi)
[10:39:28] I've a question about the debmonitor config for the dns-discovery part: I'm wondering if we need the LVS endpoint anyway, or whether, given that it will just be 1 host/DC, we can bypass that
[11:09:42] bblack: for when you're awake, no hurry ;) ^^^
[11:23:16] I'm worried our current puppetization doesn't allow it though
[12:05:46] volans: the puppetization allows for it, it's just a question of whether we're ok with it on a redundancy design level (I think in this case, we are, since it's tooling rather than something in the production runtime path, re: having to use DC-level redundancy as our only redundancy layer)
[12:08:00] it's also an active/passive service fwiw
[12:08:24] and the discovery part will be used only internally by each host to connect to debmonitor for sending updates
[12:09:28] bblack: the puppetization allows for it? I saw hieradata/common/discovery.yaml and related pp files, and it seems that lvs is kinda required
[12:15:25] I guess you're right, the puppetization for the discovery-dns stuff does seem LVS-centric
[12:15:48] hypothetically it doesn't need to be, though. the cache_misc side of things certainly isn't
[12:16:16] Fresh diagrams: https://wikitech.wikimedia.org/wiki/Network_design
[12:19:49] XioNoX: nice :)
[12:20:32] volans: I know we had this conversation before, but remind me how the debmonitor a/a stuff works from the external pov of web clients and/or whatever it's doing towards all the hosts?
[12:20:42] volans: does it need discovery-dns?
[12:20:55] nice work arzhel :)
[12:21:23] 10Traffic, 10Analytics, 10Operations: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066#4245756 (10elukey) p:05Triage>03Normal
[12:22:01] thanks, now I have to keep them up to date (and add eqiad)
[12:22:03] XioNoX: no eqiad picture to match codfw
[12:22:28] bblack: can I get back to you in ~20m? getting some lunch ;)
[12:23:08] volans: ok. I kinda suspect we don't need to do disc-dns for this. we can just stuff it behind cache_misc directly for API and browser access...
[12:24:21] yes and that's https://gerrit.wikimedia.org/r/#/c/436504/, but for the internal endpoint, I'll explain better in a bit
[12:26:01] XioNoX: OCD happiness++
[12:36:33] Am I having a cache invalidation issue? on https://wikitech.wikimedia.org/wiki/File:Wikimedia_network_overview.png do you see the image that has a direct link between eqsin and codfw?
[12:37:05] 10Traffic, 10Analytics, 10Operations: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066#4245793 (10elukey) As reference, `prometheus::node_gdnsd` might be an example about how to proceed.
[12:37:41] which should be the current version but also the one from "12:31, 31 May 2018"
[12:37:44] XioNoX: nope, no link between eqsin and codfw
[12:39:02] current is "Reverted to version as of 12:31, 31 May 2018 (UTC)", the version of "12:31, 31 May 2018" has the correct thumbnail, but the current one doesn't...
[12:44:06] yeah it seems like something odd is going on there
[12:46:01] the 600px thumbnail being used on wikitech isn't recorded as a known thumbnail size in https://wikitech.wikimedia.org/wiki/File:Wikimedia_network_overview.png
[12:46:13] and comes up wrong (no link)
[12:46:45] other thumbnail sizes look correct (have link), but then the original-sized image with no thumbnailing also looks wrong
[12:47:39] if I ask for new unknown thumbnail sizes (new renders), they do have the correct link
[12:47:49] e.g. I just made it render: https://upload.wikimedia.org/wikipedia/labs/thumb/5/5f/Wikimedia_network_overview.png/688px-Wikimedia_network_overview.png
[12:48:21] so it seems like cache invalidation issues on the original and the 600px variant. possibly a race condition?
[12:48:33] that's https://upload.wikimedia.org/wikipedia/labs/archive/5/5f/20180531123416%21Wikimedia_network_overview.png ?
[12:49:30] that's the correct image
[12:49:48] but this is the canonical URL for the latest original, which is still wrong: https://upload.wikimedia.org/wikipedia/labs/5/5f/Wikimedia_network_overview.png
[12:50:15] I also tried ?action=purge on the File: page, but it didn't change any of the issues
[12:50:35] maybe we don't really have purging hooked up for wikitech correctly?
[12:51:29] it's using upload/swift for image hosting, but it's not part of the normal MW server clusters either. there might be issues around how its purging is configured or operates...
[12:51:49] (either the direct-multicast part or the jobqueue part?)
[12:53:25] * volans back
[12:55:06] so bblack for the external endpoint debmonitor is a normal active/passive application behind cache_misc with cache pass
[12:55:16] I updated another image and it's having the same issue
[12:55:24] for the internal endpoint we can either use a CNAME or dns-discovery
[12:55:37] XioNoX: yeah I doubt it's "random purge loss" (which is usually the first thing blamed), it's probably systemic
[12:56:08] unless you had anything else in mind. All hosts need to connect to the active host, so they need something to know which one it is, and this hostname must match the TLS certificate on nginx
[12:58:55] ok, which I guess is planned to be our current LE puppetization for the cert?
[12:59:35] it's internal only, I've created an ecdsa cert with our standard script, signed by the puppet ca
[12:59:49] how does failover happen? I mean, is there some process aside from flipping cache_misc routing and the CNAME, at the application level, to promote/demote the active/passive relationship?
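
As an aside on the stale-image debugging in the thread above: the manual checks being done there (fetching the original and various thumbnail URLs and eyeballing what the cache serves) can be scripted. A minimal Python sketch, assuming nothing beyond the public URLs quoted in the log, that prints the cache-related headers the discussion relies on:

    #!/usr/bin/env python3
    """Rough sketch, not a WMF tool: HEAD the original and a thumbnail variant
    and print the headers that hint at whether a stale cached object is being
    served (Age, Last-Modified, X-Cache)."""
    import urllib.request

    URLS = [
        # canonical latest original (reported stale above)
        "https://upload.wikimedia.org/wikipedia/labs/5/5f/Wikimedia_network_overview.png",
        # freshly-rendered 688px thumbnail (reported correct above)
        "https://upload.wikimedia.org/wikipedia/labs/thumb/5/5f/Wikimedia_network_overview.png/688px-Wikimedia_network_overview.png",
    ]

    for url in URLS:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            print(url)
            for header in ("Age", "Last-Modified", "X-Cache"):
                # a header may be absent depending on which cache layer answers
                print(f"  {header}: {resp.headers.get(header)}")
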
[12:59:52] the external one is just the misc endpoint with the *.wikimedia.org
[13:00:04] it depends on the DB, it's on the m2 cluster
[13:00:23] so either the db migrates too or the config needs to be changed to point to the other one
[13:00:33] ok right
[13:00:56] currently it points to "m2-master.eqiad.wmnet" (as I was told, although it seems wrong to me given that it's DC-specific, but that's another topic ;) )
[13:01:07] either way, we can't blindly flip the switches at the dns/cache layer. someone has to coordinate a db master switch in the app config or in the db layer itself.
[13:02:00] I would dodge the "refactor dns-discovery for non-LVS use" bullet then, and just use a CNAME. You have to do a commit in puppet to switch cache_misc routing for now anyways. What's one more commit on the DNS side in this case?
[13:02:11] yes, also keep in mind that from the web side of things the only users will be us (SRE), and for the internal one any failed update will be reconciled in 24h through a crontab
[13:02:33] it's ok with me I guess
[13:03:09] do we have a dc-agnostic internal domain that is not discovery.wmnet ?
[13:03:37] I still have half-formed thoughts that I don't like the dns-disc layer as part of some final solution, it seems more like a necessary hack while we're in transition to a truly a/a multi-dc world.
[13:04:14] but since this app isn't a/a anyways, and we'll probably never reach a state where all misc apps are a/a... meh...
[13:04:52] once we have TLS for mysql clients it would be easier I guess
[13:05:16] this could easily be converted to a/a on the traffic side if the app can connect cross-dc to the DB
[13:05:28] volans: you can use discovery.wmnet, it's still logically-appropriate, and then it also fits a certain model that if such a service later got multiple hosts per DC and went behind LVS and dns-disc, the hostname clients use wouldn't change.
[13:06:00] sounds good to me!
[13:06:09] there's already one such case apparently, in discovery.wmnet:
[13:06:11] ; Will become a proper discovery endpoint once we add more registries
[13:06:11] docker-registry 300 IN CNAME darmstadtium.eqiad.wmnet.
[13:06:29] * volans will follow the flow then
[13:08:33] my half-formed thoughts about the long-term view of dns-disc: so we have this picture where there are N inter-dependent services, all running on both sides. The purpose of a dns-disc-like mechanism is essentially: (1) allowing some of them to be A/P, where regardless of where the source traffic from another service originates, it must end up on the 1x active side for this service, and (2) for both A/A and A/P, allowing the switching to happen on a per-individual-service level instead of whole-dc.
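
For reference, the CNAME approach settled on above for debmonitor would mirror the docker-registry record quoted from the discovery.wmnet zone; a hypothetical sketch only (the record name and target host are illustrative, not the actual entries that were added):

    ; hypothetical sketch, following the docker-registry pattern quoted above;
    ; the target host is a placeholder, not the real backend
    debmonitor 300 IN CNAME debmonitor-host.eqiad.wmnet.
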
[13:11:31] the desire at that stage to maintain per-service switches seems to be more about abusing dc-level redundancy to avoid maintaining 100% uptime within a single DC (basically cheating on your redundancy design inside 1x DC) [13:11:43] also assuming we reach a fully a/a state, I'm not convinced we could ever reach a state in which for each maintenance we switch off an entire DC, concurrent maintenance on different services will surely make this impossible [13:12:11] well there's different definitions or types of "maintenance" here [13:12:58] :) [13:13:00] but I'd argue a reliable application cluster should be maintainable without ever failing to provide service. We shouldn't have to take down one DC at a time at the per-application level to allow application-level maint (e.g. code upgrades or server swaps) [13:13:32] that application cluster needs a more-resilient design. dc-level failover is solving a different problem at a different layer, don't abuse it to make up for the app cluster's failings. [13:14:14] I know none of these ideals are things we're anywhere near today, so it's all "far off in future/ideal world we may never reach" [13:14:28] yeah [13:14:37] but we should have, at least, a clearer and better-defined picture of what the desired end-state of this multi-dc effort is, and whether it's achievable. [13:14:59] agree with the principle, we should probably take some time at the offsite for this [13:15:02] so we know whether some peices of the puzzle are temporary or not, or some applications are or aren't compatible with the end-goals or not, etc [13:17:20] (of course, some "services" don't need that 100% SLA anyways if they're internal non-critical tooling, it's the things that are ultimately in the pathway of providing reliable wikis to the world that matter) [13:17:47] yes, indeed [13:18:19] (but if most services are stateless like they should be, we should be able to achieve that ideal for the critical-path services at most layers until you get down to the separate stateful layers, which is a whole separate world for multi-dc. [13:18:23] ) [13:18:47] but that's divorced from things like service hostnames and disc-dns-like/cache-routing -like concerns. [13:19:08] application-layer ones, I mean [13:22:20] * bblack end morning random rambling mode [13:22:27] lol [13:23:26] XioNoX: on the wikitech thing: wikitech's wmf-config has the right purging setup for article purges to multicast to cache_misc, but I bet something else is missing for hooking up File: purges over to cache_upload, at some level... [13:24:25] XioNoX: actually, I suspect one part of it is that currently all mediawikis use un-split multicast, and upload doesn't listen to the cache_misc multicast IP, but should to handle this case. But there could be other parts missing in the picture too, re: jobqueue-level stuff for purging? [13:26:09] I think I'd need a diagram to understand that :) [13:26:28] we can at least fix the multicast end of it easily [13:27:17] more seriously, maybe a good topic for Prague [13:27:24] XioNoX: so wikitech.wikimedia.org is behind cache_misc (vs e.g. enwiki being behind cache_text). There's some separate multicast purging IPs that different wikis are configured with for purging, which are still not used in a very ideal way. 
[13:27:28] there seem to be many moving parts
[13:28:18] Related, I cleaned up https://wikitech.wikimedia.org/wiki/Multicast_IP_Addresses with a question for you on https://phabricator.wikimedia.org/T167842#4211880
[13:30:08] XioNoX: I think Krinkle accidentally did that strikethrough at some past point and nobody noticed. It's still in use.
[13:30:18] (from a multicasting perspective)
[13:30:27] ok
[13:30:43] anyways, so short explanation:
[13:30:54] so wikitech.wikimedia.org is behind cache_misc (vs e.g. enwiki being behind cache_text). There's some separate multicast purging IPs that different wikis are configured with for purging, which are still not used in a very ideal way.
[13:31:18] https://wikitech.wikimedia.org/wiki/Multicast_IP_Addresses is effectively what the wikis are configured with (if you look for wgHTCPRouting in the mediawiki-config repo)
[13:32:04] we never did get upload purging split off (perhaps that effort should be revived). So in practice, nothing is really sending image purges to the separate "upload" multicast. so e.g. enwiki purges articles using the cache_text multicast, but also sends image purges to that same address.
[13:32:36] the cache_upload cluster listens on all of the text+upload+maps multicast IPs to catch them, since text mediawikis purge images via the text multicast IP
[13:32:50] but it's not currently listening to the cache_misc multicast IP (.115) the same way
[13:34:45] XioNoX: https://gerrit.wikimedia.org/r/#/c/436526/ would fix that part of the puzzle anyways, I just don't know if there's other missing bits re: jobqueue-based purging-related things, etc...
[13:36:21] oh, that's because wikitech images are hosted on upload
[13:36:33] right
[13:36:55] in general all wikis' images are hosted on upload, but wikitech is unusual in that the text side of it's on cache_misc instead of cache_text
[13:37:07] (which are eventually becoming one thing anyways, but we're not there yet)
[13:37:07] got it
[13:43:35] .... and now that that's deployed, I did a ?action=purge on the File: in question, and now the original updates
[13:43:54] the 600px thumbnail did not update, though
[13:44:30] (but it's also not apparently tracked as an available thumbnail size, so I donno wtf, but it's not a pure cache-layer problem. it's just a variant the app will not purge at present)
[13:45:01] https://upload.wikimedia.org/wikipedia/labs/thumb/5/5f/Wikimedia_network_overview.png/600px-Wikimedia_network_overview.png
[13:45:26] which is the resolution used on the wikitech article about network design, but is not listed in: https://wikitech.wikimedia.org/wiki/File:Wikimedia_network_overview.png
[13:46:32] so I should use one of the defined sizes and not a custom one
[13:46:40] my picture of what happens in those cases is fuzzy, but I think other rendered resolutions like the 600px variant are supposed to be tracked in a database table somewhere and purged as well, but ... yeah I donno what could be going wrong there. I think that part of wikitech's mediawiki doesn't realize it needs to purge that 600px variant anymore
[13:46:59] I don't know if you're supposed to need to know that or use the defined sizes, but it's certainly a valid workaround in this case.
[13:47:50] good to know
[13:48:36] (at least, I should say, I'd hope it purges the sizes it bothers to list on the File: page, since it clearly knows about those! Maybe something else is broken here in the mw-config and such, and no thumbnails are being purged. I have no idea really)
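
To make the multicast end of that concrete: the fix in the change linked above amounts to having the purge listener on cache_upload hosts join one more multicast group (the cache_misc one). A minimal Python sketch, purely illustrative and not the actual purger daemon, of what listening on several HTCP purge groups looks like at the socket level; the group addresses other than .115 are placeholders, not the real assignments:

    #!/usr/bin/env python3
    """Illustrative sketch only: a UDP listener that joins multiple multicast
    groups, the way a cache cluster's purge daemon must join every group it
    should accept purges from. Only .115 is taken from the log (cache_misc);
    the other addresses are placeholders."""
    import socket
    import struct

    HTCP_PORT = 4827  # standard HTCP port
    GROUPS = [
        "239.128.0.112",  # placeholder for the text purge group
        "239.128.0.113",  # placeholder for the upload purge group
        "239.128.0.114",  # placeholder for the maps purge group
        "239.128.0.115",  # cache_misc purge group (the one being added)
    ]

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", HTCP_PORT))

    # Join every group this cache cluster should accept purges from.
    for group in GROUPS:
        mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        # A real purger parses the HTCP CLR packet and extracts the URL to
        # invalidate; here we only show that traffic on any joined group arrives.
        print(f"{len(data)}-byte purge packet from {addr[0]}")
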
[13:49:11] you could make a tiny edit and upload it again, and see what happens this time around on a fresh upload. maybe that would fix the 600px variant? no idea...
[15:00:20] I'm running a minute or two behind, be there shortly
[15:00:27] ack
[15:16:12] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030#4246171 (10RobH) >>! In T196030#4245080, @ayounsi wrote: > > Probably still need to swap RX/TX. I replaced both of the optics with wholly different optics and a wholly different fiber cab...
[15:21:15] hey folks!
[15:21:34] in a couple of weeks I will attend the Netfilter Workshop in Berlin. Perhaps it would be good if I can collect some WMF questions, doubts, suggestions, use cases, requirements, improvements, criticisms, whatever... regarding iptables, nftables, ipvsadm, lvs and friends
[15:21:45] cc bblack ema XioNoX vgutierrez paravoid
[15:21:53] hi arturo, we are in the middle of our weekly sync meeting :) we'll come back to you later :)
[15:22:09] great
[15:23:21] I'm thinking of T187994 as well
[15:23:22] T187994: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994
[15:36:23] * vgutierrez hides
[15:37:15] BTW, what bblack mentioned a few months ago regarding eBPF looks like it's a trend now: https://code.facebook.com/posts/1906146702752923/open-sourcing-katran-a-scalable-network-load-balancer/
[15:40:59] my call is not about comparing tech X with Y
[15:41:35] it's about a concrete set of technologies, which are all related to the Netfilter project, which is the workshop I'm attending :-P
[15:43:49] and in this case, I have some concerns regarding following facebook in their FLOSS path
[15:44:59] nah, I'm not suggesting adopting katran
[15:45:14] I was simply mentioning that eBPF is being adopted for load balancing use cases :)
[15:45:53] well, some of the major developers of eBPF are fb engineers... so it makes sense
[16:01:09] well the important thing here is that the trend-setters are listening to me :)
[16:01:30] it's the only sane thing to do
[16:01:58] (but seriously, it was already a growing trend when I mentioned it, I just happened to be the first to mention it here!)
[16:03:18] I'm probably going to end up writing some eBPF integration stuff in gdnsd-3.x for a different purpose (doing IRQ-pinning->CPU-core pinning "correctly"), so maybe that process will give me a little more insight into how hard it is to implement LB-related things with it too
[16:40:20] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4246465 (10RobH)
[16:43:20] 10Domains, 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4231062 (10Nuria) Option b) sounds good.
[16:52:47] 10Domains, 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations: HTTP 404 on stats.wikipedia.org (Domain not served) - https://phabricator.wikimedia.org/T195568#4246485 (10Krinkle)
[16:54:14] 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle)
[16:54:18] 10Domains, 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations: HTTP 404 on stats.wikipedia.org (Domain not served) - https://phabricator.wikimedia.org/T195568#4231062 (10Krinkle)
[16:54:26] 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle)
[16:55:24] 10Traffic, 10Analytics, 10Analytics-Wikistats, 10Operations, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krinkle) >>! In T195568#4233129, @Dzahn wrote: > option a) delete stats record from the wi...