[05:48:38] Traffic, Operations, conftool, discovery-system, services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (Joe) I think I will try to implement the following RBAC schema: - | user | RO | RW | from | | - | * | - | * | | root | * | * | cumin | | confto...
[06:05:54] Traffic, Operations, ops-eqsin: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (Vgutierrez) Open→Stalled @robh done. let's see how it goes, thanks!
[06:52:16] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Ladsgroup) >>! In T243701#6152282, @Gehel wrote: > I'm wondering if exposing both MySQL lag and WDQS la...
[07:15:16] that looks weird: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?panelId=5&fullscreen&orgId=1&from=1589964227876&to=1590033897416
[07:26:57] vgutierrez: hello can I get a +1 https://gerrit.wikimedia.org/r/c/operations/dns/+/597738 ? thanks :)
[07:27:04] * vgutierrez checking
[07:28:09] +1ed
[07:28:12] thx!
[07:32:19] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[08:12:23] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[08:30:51] Traffic, Operations: check_http and SNI support - https://phabricator.wikimedia.org/T253292 (fgiunchedi)
[08:31:06] would appreciate your input on ^
[08:31:40] I was thinking of trying option #1 on the standby icinga today and see if we get a screenful of red alerts or not
[08:33:41] godog: option 3) seems the easiest of all, no risk and we already have many different check_commands using check_http in different ways so another one does not hurt us?
[08:41:36] mutante: I haven't fully unwrapped option 3) in my head on what it'd mean, e.g. yet another option in puppet to select the "sni only" case
[08:41:53] I agree it is the easiest, though certainly not the simplest
[08:43:11] Traffic, netops, Operations, Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[08:47:25] godog: in modules/nagios_common/templates/checkcommands.cfg.erb add a snippet copied from f.e. line 342 to 345 from "check_https_url", call it check_https_url_sni, add the --sni option. then use check_https_url_sni in check_command parameter of monitoring::service
[08:50:59] Traffic, Operations: check_http and SNI support - https://phabricator.wikimedia.org/T253292 (Dzahn) I think option 3) is the easiest of all, has no risk to break existing checks and we already have many different check_commands using check_http in different ways so another one should not hurt us. In `mo...
[08:55:38] mutante: that and also surface the option on service::catalog, since the monitoring checks come from there
[08:55:52] it might come to option 3) but tbh I'd rather see if we can avoid it first
[08:58:07] i see. i guess that's a downside of the abstraction layers
[08:58:20] ack
[09:26:33] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[09:38:23] Traffic, Core Platform Team, Operations, serviceops, and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (aaron) I'm not fond of the idea of not sending purges for indirect edits nor using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO...
[09:44:00] netops, Operations: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (ayounsi) Good idea! What's the limit? I'd suggest: * Comcast - large US ISP - https://atlas.ripe.net/probes/6080/ - https://atlas.ripe.net/probes/6072/ * RIPE to have somethin...
[10:23:00] bblack: authdns1001 being offline is blocking acme-chief certificate renewal process
[10:50:30] feel free to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/597755 as soon as it's back online again :)
[10:59:24] in hieradata/common/profile/trafficserver/backend.yaml
[11:00:35] if there is a rule like "action: never-cache" for a certain dest_host. can i assume that the dest_host refers to the same name used further up in a "replacement:" line that defines the backend?
[11:01:12] I think so, that host is a HTTP Host header value AFAIK
[11:01:13] let's say i use https://foo.discovery.wmnet in a 'replacement' and in DNS it's a CNAME for an actual host
[11:01:36] ok, thanks. it seems so from looking at others unless they made the same mistake
[11:03:28] thanks for the +1
[11:04:01] np
[11:21:01] netops, Operations: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (jbond) Which measurements do you plan to scrape? - all measurements the anchors are performing outbound? - the anchoring measurements directed at these anchors. If its t...
[12:20:12] Traffic, Operations: check_http and SNI support - https://phabricator.wikimedia.org/T253292 (fgiunchedi) >>! In T253292#6154832, @Dzahn wrote: > I think option 3) is the easiest of all, has no risk to break existing checks and we already have many different check_commands using check_http in different wa...
[12:22:48] Traffic, Operations: check_http and SNI support - https://phabricator.wikimedia.org/T253292 (CDanis) +1, option 3 is the most expedient but creates the most technical debt. Re: option 1, I think it would be relatively straightforward to dump all check_http from the puppet db (or from the icinga config f...
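Editor's sketch for the check_http discussion above (T253292): one way to tally which check_http invocations use TLS without SNI, once the command lines have been dumped from somewhere. This is only an illustration under assumptions: the helper names `classify` and `tally` are hypothetical, the input is assumed to be one fully expanded command line per string, and only the `-S`/`--ssl` and `--sni` flags of check_http are inspected (combined short flags are not handled).

```python
import shlex
from collections import Counter


def classify(command_line: str) -> str:
    """Classify one check_http command line by its TLS/SNI flags."""
    args = shlex.split(command_line)
    # -S and --ssl may carry an optional version, e.g. -S1.2 or --ssl=1.2
    uses_tls = any(
        a == "--ssl" or a.startswith("--ssl=") or a.startswith("-S")
        for a in args
    )
    uses_sni = "--sni" in args
    if not uses_tls:
        return "plain-http"
    return "tls-sni" if uses_sni else "tls-no-sni"


def tally(command_lines):
    """Count how many command lines fall into each class."""
    return Counter(classify(line) for line in command_lines)


if __name__ == "__main__":
    # Hypothetical sample invocations, not taken from the real icinga config.
    samples = [
        "check_http -H example.org -u /",
        "check_http -H example.org -S -u /",
        "check_http -H example.org --ssl --sni -u /",
    ]
    for kind, count in sorted(tally(samples).items()):
        print(kind, count)
```

The "tls-no-sni" bucket is the set of checks whose behaviour would change if `--sni` were added everywhere, which is the population option 1 would need to review.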
[12:50:48] Traffic, Operations, conftool, discovery-system, services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (akosiaris) RBAC without roles isn't really Role Based Access Control, but I digress. LGTM on my side for those permissions.
[12:54:26] Traffic, Operations, conftool, discovery-system, services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (CDanis) These permissions LGTM.
[14:00:28] for your eyes when you get a chance: https://gerrit.wikimedia.org/r/c/operations/puppet/+/597765
[14:04:28] godog: hmmm that would avoid checking the default cert for any service using any of those checks
[14:04:56] I don't know if we have any service besides the unified cert on the caching nodes where we need to check that non-SNI TLS works as expected
[14:08:26] vgutierrez: good point
[14:09:10] I'm wondering if we're checking SNI at all now
[14:10:11] so on the text nodes we do at least when we check the wikiworkshop.org cert
[14:10:42] cause that CN isn't covered by the unified cert
[14:11:31] ah! yeah that makes sense, and that would be check_ssl as opposed to check_http --ssl
[14:11:34] (?)
[14:12:19] yeah, for cp nodes we are using check_ssl_ats and check_ssl_ats_ocsp
[14:15:07] ok thanks, I'll check (ha ha) the rest if non-sni cases are covered
[14:27:50] vgutierrez: sounds like a design problem! :)
[14:31:19] re: sni, I think we've probably crossed the time boundary where non-SNI is basically-irrelevant
[14:31:31] although someone should look into that a little more before we kill any existing checks, etc
[14:32:06] but I suspect with the last round of TLS deprecations (min version = 1.2), it's unlikely a client can still connect to us that doesn't understand how to send SNI
[14:47:14] yeah I'm tallying check_http --ssl usage and see if there's potential for problems in non-sni cases
[14:59:04] Traffic, Operations, Patch-For-Review: check_http and SNI support - https://phabricator.wikimedia.org/T253292 (fgiunchedi) In terms of `check_http` usage ATM we have these invocations, for which we'd stop checking the default (non-SNI) certificate: ` $ # the only check_http check not named after che...
[14:59:13] looks like this ^
[15:03:56] XioNoX: this zayo circuit 120003 fails so often
[15:04:08] about to copy and paste my email from 7 days ago...
[15:05:16] doesn't look like a planned maintenance either
[15:05:19] no
[15:05:39] wait, OSPF is back
[15:05:48] oh okay, apparently fixed itself
[15:05:57] still not great
[15:06:14] I tried to keep track of all the outages in a spreadsheet, but it's just impossible
[15:06:44] it should be automated and the value of such tool is not very high
[15:06:53] had literally just finished adding the penultimate one
[15:06:55] >Since about 14:58 UTC we're seeing our circuit OGYX/120003//ZYO down. Of note is that this is the same circuit that failed on Feb 23rd and on Mar 6th and on Mar 15th and on May 14th; see also TTN-0003904997 and TTN-0003933687 and TTN-0003950338 and TTN-0004088896.
[15:29:36] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[17:24:43] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi)
[17:26:55] Traffic, netops, Operations: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi) Open→Resolved Confirmed that if dns4001 and dns4002 are down, ulsfo will stop advertising `198.35.27.0/24` to the world but still had routes to 198.35.27.27/32 via codfw.
[17:26:59] Traffic, netops, Operations, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[17:31:00] bblack: XioNoX: this has been awesome to watch :)
[17:34:12] yeah, in hindsight probably a bunch of our pages of /msg should've been somewhere like here too, discussing all the tiny details (many of which are still unsolved problems) :)
[17:34:32] we can at least make tickets for some of those issues
[17:34:57] yeah, or a design doc *g*
[17:35:04] 😂
[17:35:35] you seem to be under the illusion that this is a design rather than an emergent system! :)
[17:35:48] most design docs are retrospective in my experience anyway
[17:36:00] but yeah
[17:36:19] we could at least document our half-assed design guesses, and then amend it as we learn how it should really work
[17:36:54] I can start a blogpost about external anycast to complete the previous one
[17:37:48] as things stand right now, we could in theory add this new IP to our NS set and start trying live service out with users, etc
[17:38:01] but there are still a number of unsolved issues that make me uncomfortable with that
[17:38:12] in terms of nailing down the reliability and consistency of it all
[17:38:24] not to mention the IP addressing/space things that are still kinda up in the air
[17:40:43] I think X is going to work on the problem of inconsistent L[34] hashing by the junipers next, and he tends to be good about making sub-tasks for such things
[17:41:07] current best guess is have bird send different meds to cr1 vs cr2 everywhere, so that cr2 prefers to let cr1 make the decisions if cr1's alive
[17:44:26] and then, there's the (much less important, I think, for $reasons) ICMP PTB-routing issue - the above may help? but I'm not even sure that juniper's traffic hashing is smart enough to route those even with a single router in play.
[17:44:33] it's not
[17:44:50] so what does it do with them?
[17:44:57] hash them in L3 and route them
[17:44:57] just forward them randomly to one of the set?
[17:45:01] ok :)
[17:45:27] which may or may not be the same host as the one sending the packet :)
[17:45:31] and then we're still facing some systemd/puppet/daemons -level issues on nailing down a bunch of failure corner cases and how quickly routes are withdrawn, etc
[17:46:15] (as things stand in the currently-merged state of affairs, if a config change causes pdns-recursor to restart, or it crashes, the anycast authdns route to that host will also be pointlessly withdrawn)
[17:46:39] (but the easy fix for that causes some nasty losses of recdns traffic on routine config-restart of pdns-recursor, too, so more thinking needed)
[17:47:16] we're doing L3 hashing (not L4) so it might actually work as expected?
[17:47:42] no, the packet too big would typically be sent by a router in the path
[17:47:42] the PTB can have a different source IP than the real client traffic, some intermediate router
[17:47:52] ah right
[17:48:05] then real client IP is inside the PTB message as data, but the router has to be smart enough to recognize that case and hash on that data
[17:48:17] yep
[17:48:28] linux LVS/ipvs does it, but apparently juniper does not
[17:48:34] if juniper had that smart capability, it would most likely be buggy
[17:48:53] probably because some asic or fpga makes that routing decision and is very simplistic and looks at the fixed-offset initial header bytes
[17:49:15] as bblack knows, I maintain that using the junipers for loadbalancing within the site is asking for trouble :P
[17:49:22] it may be ok as an MVP
[17:49:47] and I have a separate fear that MVP/PoCs have a tendency to stick around for a long time in our org :)
[17:50:10] well yeah
[17:50:39] the general counterpoint, though, is that if we aim too high before we deploy, nothing ever happens
[17:51:00] which this particular case has been a victim of, I think. We're now merging patches referencing a 5-year-old ticket :)
[17:51:22] Traffic, Analytics, Operations: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (RLazarus) p:Triage→Medium a:ssingh Trying to route this -- @ssingh, should this be assigned to you?
[17:52:25] is that longer or shorter than the "fix pybal to do X better" tickets though? *g*
[17:52:29] Traffic, Operations, Readers-Web-Backlog: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (RLazarus)
[17:52:49] heh
[17:53:03] hi traffic friends -- is T252227 for you? ^ any suggested assignee?
[17:53:04] sorry :)
[17:53:04] T252227: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227
[17:53:11] in all seriousness, let's get something in the AP around loadbalancers
[17:53:33] or at least consider it vis-a-vis other priorities?
[17:53:35] Traffic, Analytics, Operations: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (ssingh) Hi, yes that's fine for now. The privacy threats will be more suited for the Security team but I will triage it again when required...
[17:53:55] for the price of several hundred thousand expensive starbucks orders, you too could have a team big enough to manage its own debt! :)
[17:55:55] paravoid: yeah, we should. But swapping roles in this debate with you as we move to the next point: I'd rather not spin wheels on half-assed small fixes, I'd like it to include some serious time on research and experimentation towards a better-designed end state we'll be happy with.
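Editor's sketch of the ICMP PTB hashing issue discussed between 17:44 and 17:48: a toy model of why hashing only on the outer L3 header can deliver a Packet Too Big message to a different anycast host than the one serving the client, and how hashing on the header embedded in the ICMP payload avoids that. Everything here is an assumption for illustration: the backend names, the addresses (documentation ranges), the `Packet` type, and the use of SHA-256 as the hash. It does not model what the Juniper hardware or Linux ipvs actually compute.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical anycast backends behind one router.
BACKENDS = ["dns-a", "dns-b", "dns-c", "dns-d"]


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    # For ICMP errors: the header of the offending packet quoted in the payload.
    embedded: Optional["Packet"] = None


def l3_hash(src: str, dst: str) -> int:
    """Deterministic stand-in for a source/destination L3 hash."""
    digest = hashlib.sha256(f"{src}>{dst}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % len(BACKENDS)


def pick_outer_only(pkt: Packet) -> str:
    """Simplistic behaviour: hash the outer header, whatever the payload is."""
    return BACKENDS[l3_hash(pkt.src, pkt.dst)]


def pick_icmp_aware(pkt: Packet) -> str:
    """Hash on the quoted header when the packet is an ICMP error.

    The quoted packet went from the service to the client, so its addresses
    are swapped back to recover the client's original flow key.
    """
    if pkt.embedded is not None:
        inner = pkt.embedded
        return BACKENDS[l3_hash(inner.dst, inner.src)]
    return pick_outer_only(pkt)


if __name__ == "__main__":
    client, anycast, router = "203.0.113.7", "198.51.100.53", "192.0.2.1"
    query = Packet(src=client, dst=anycast)
    ptb = Packet(src=router, dst=anycast,
                 embedded=Packet(src=anycast, dst=client))
    print("client flow ->", pick_outer_only(query))
    print("PTB, outer-only hash ->", pick_outer_only(ptb))
    print("PTB, ICMP-aware hash ->", pick_icmp_aware(ptb))
```

With the outer-only hash the PTB lands wherever the intermediate router's address happens to hash, which matches the "may or may not be the same host" remark above; the ICMP-aware variant always agrees with the client's flow.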
[17:56:10] haha
[17:57:49] (and maybe whatever that is, it can also be inclusive of better solutions to these anycast-balancing woes as well)
[17:58:59] when you really look long-term and hard at the LB issues, it opens up a huge rabbithole of deeper questions in other related areas too
[17:59:34] about where we're going on our hardware routers in general (do we eventually get some kind of more soft-defined ones, or linux-based ones, etc), and eventual L3-to-the-host, and how we lay out our DC internal networks, etc, etc
[18:00:21] it'd be nice to have an integrated vision of where we'd like to land at on all related things that stretches out a few years, and has some things we can accomplish towards it along the way that aren't wasted efforts down eventually-wrong paths
[18:09:26] Traffic, Operations, Patch-For-Review: Implement a prometheus exporter for rdkafka in golang - https://phabricator.wikimedia.org/T253197 (RLazarus) p:Triage→Medium
[18:12:28] rzl: looking to see how difficult it is
[18:12:52] thanks!
[18:24:55] Traffic, Operations, Readers-Web-Backlog: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (BBlack) Interestingly, the mobile redirect code in varnish doesn't strip any parameters. The problem is that the analytics-side VCL code that consumes the `wprov` parame...
[18:30:42] rzl: I still didn't assign it heh
[18:31:09] Traffic, Analytics, Operations, Readers-Web-Backlog: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (BBlack)
[18:31:22] but maybe analytics will give some input first
[18:31:52] good enough for me with my clinic hat on, thanks for taking a look