[00:26:03] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3208453 (10Cmjohnson) @ayounsi I would like to get an early start on this NLT than 0930 EST. Will that be possible? Thanks
[06:37:23] interestingly, cp3033 caused a couple of 503 spikes right now because of mailbox lag. What's interesting is that: 1) it's cache_text, not upload 2) it recovered on its own 3) the lag grew and recovered too quickly for the icinga check to catch it
[07:46:21] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3208907 (10ayounsi) @Cmjohnson unfortunately there is another maintenance scheduled to end at 10:00 EST (14:00 UTC), doing the maintenance after...
[08:08:31] 10Traffic, 06Operations, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3208932 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving as the swift upgrade is complete and varnish bandaids have been r...
[10:13:37] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3209149 (10elukey) >>! In T148506#3205842, @ayounsi wrote: > **Days before** > Move kafka1020 to row B T163002 Note about this move: today I wil...
[11:05:14] I'm trying to compare a bit the cpu usage of the expiry mailbox thread with and without prio change
[11:05:52] comparing cp2017 (prio change) and cp4007 (default) as varnish-be has been running there for a comparable amount of time
[11:07:00] and the number of lru.locks is also comparable, so they seem decent candidates
[11:08:54] sudo timeout -s INT 10 perf stat --event=task-clock -t $tid
[11:09:06] that gives, on cp2017:
[11:09:08] 1733.915670 task-clock (msec) # 0.175 CPUs utilized
[11:09:11] cp4007:
[11:09:18] 196.316808 task-clock (msec) # 0.020 CPUs utilized
[11:12:25] however, filtering with -p $pid instead and running perf it also seems that cp2017 is busier than cp4007 so perhaps the comparison is not entirely fair
[11:12:37] (0.663 CPUs utilized vs. 0.290)
[11:12:55] still, there seems to be a difference (as expected!)
[11:15:48] I haven't really looked since last night my time, but I assume the codfw backends are all operating sanely and not lagging?
[11:15:58] (I did upgrade them all!)
[11:19:34] they seem fine, yeah
[11:28:38] oh, I've upgraded cp3033 (text!) this morning too
[11:29:44] on that host, mailbox lag grows really quickly and fetches start to fail fast too, then it recovers
[11:30:45] with the patch it does that?
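For context on the perf stat comparison at [11:08:54]-[11:12:55]: below is a minimal sketch of that kind of measurement, not the exact commands used on the hosts. The PID value and the expiry thread name "cache-exp" are assumptions for illustration; verify the real thread names with ps before relying on the grep.

    # Hypothetical values -- fill in the backend varnishd PID for the host:
    pid=12345
    # Find the TID of the expiry thread (thread name is an assumption; check
    # the output of "ps -L -p $pid -o tid,comm" first):
    tid=$(ps -L -p "$pid" -o tid=,comm= | awk '/cache-exp/ {print $1}')

    # CPU time used by just the expiry thread over a 10-second sample:
    sudo timeout -s INT 10 perf stat --event=task-clock -t "$tid"

    # Same measurement for the whole process, as a rough "how busy is this
    # varnishd overall" baseline to normalize the comparison between hosts:
    sudo timeout -s INT 10 perf stat --event=task-clock -p "$pid"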
[11:30:53] without
[11:32:27] I've upgraded it after it got affected by lag/errors, in the last 2 days it misbehaved
[11:33:02] ok
[11:33:05] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams%20prometheus%2Fops&from=1493002805651&to=1493103626517
[11:35:01] yeah that's odd that it did that two days in a row around the same time of day +/- an hour or so
[11:36:14] there's also a probably-unrelated spike of icmp input errors around a day ago: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3033&var-datasource=esams%20prometheus%2Fops&from=1493038041025&to=1493038542429&refresh=1m
[11:38:01] correlating with a decrease in disk usage funnily enough
[11:39:58] yeah there must have been some brief network issue I guess? client request rate went down (hence disk usage)
[11:49:36] 10Traffic, 06Operations: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#3209358 (10ayounsi)
[12:24:49] 10Traffic, 10Mobile-Content-Service, 06Operations, 10RESTBase, and 3 others: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#3209461 (10NHarateh_WMF)
[12:42:51] 10Traffic, 06Operations, 06Performance-Team, 06Reading-Infrastructure-Team, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3209641 (10NHarateh_WMF)
[13:15:49] 10netops, 06Operations: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3209774 (10ayounsi) a:03ayounsi ARIN is also very straightforward (everything can be done online). See this copy of a blog post I wrote in 2013 https://labs.ripe.net/Members/mirjam/mozilla-u...
[14:12:31] hey! I'm getting an error 500 on wikitech when trying to save a page I edited
[14:13:30] I see SAL saves still happening, it must be able to write in general
[14:13:47] did you use the Visual Editor?
[14:14:02] there's a known bug about this, triggered by the datacenter switchover
[14:14:15] paravoid: https://pageshot.net/ygT5f1SLWizGL9t2/wikitech.wikimedia.org
[14:14:33] yeah that's VE
[14:14:40] the WYSIWYG editor
[14:14:56] https://phabricator.wikimedia.org/T163438
[14:14:57] and when I try to switch to the source editor I get "Error loading data from server: apierror-visualeditor-docserver-http."
[14:15:34] ah okay, thx!
[14:56:46] ema: just cache_text lacks 4.9 reboots so far right?
[14:59:47] bblack: correct! Let's fix that
[15:00:31] and cp1008 is running new nginx
[15:00:43] looks sane in basic testing, expected to be sane in general
[15:00:48] nice
[15:01:07] but still, we might want to do some test upgrades
[15:01:15] maybe 1x from each cluster or something
[15:01:48] so now we have 3x different version upgrades in various states of flux
[15:02:02] bblack / ema: when you have some time, I'd love to have review on https://gerrit.wikimedia.org/r/#/c/346148/ and https://gerrit.wikimedia.org/r/#/c/346146/ (adding LVS for relforge)
[15:02:04] the 4.9 kernels, the new varnish packages, and the new nginx packages now
[15:03:49] gehel: I don't know much about relforge. is it an internal-only service or?
[15:04:04] yes, internal only
[15:04:20] who/what/how connects to it?
[15:04:39] It is used only to validate work on elasticsearch / cirrus
[15:04:42] (I ask to be sure we mean the same thing by "internal")
[15:04:50] who validates how? :)
[15:04:54] it is connected to from labs mainly
[15:05:02] labs won't be able to connect to that
[15:05:09] damn!
[15:05:09] (the LVS service you've defined)
[15:05:27] is there a way to expose an LVS service to labs?
[15:05:39] without it being an external service?
[15:05:58] well, LVS isn't really part of the problem
[15:06:22] the better question is "Is there a way for labs to connect to anything in our private internal production networks?" and the answer is no
[15:06:42] (following along in case I can help)
[15:07:07] we have an exception for relforge
[15:07:17] if it's something you need labs to be able to consume, the usual model would be to make it a public service the whole internet can consume, labs being a part of the internet from this perspective
[15:07:19] afaict it's "labsupport", I think it's a bit different but don't really know the details
[15:07:27] no, relforge is in labs-support which is its own thing
[15:07:30] make an external service where the iptables rules on relforge still enforce that it's from a labs address should work
[15:07:42] "make an external service"?
[15:07:51] (public IPs?)
[15:08:03] yes, sorry that's what I meant
[15:08:12] (and yes, labs-support is different than what I was thinking of as "labs" as in instances)
[15:08:26] labs labs labs
[15:08:36] I was responding to gehel for the labs-support thing but yeah terminology here is E_TOMUCHLABS
[15:08:53] even I'm bothered that was TOO
[15:09:01] wasn't, ok I need coffee
[15:09:01] * gehel is going to get some pain killers... probably useful here...
[15:09:04] ok
[15:09:17] so relforge is a production service that labs-support nodes need access to?
[15:09:34] and the way to get that is access over a public IP?
[15:09:37] relforge is a prod server that labs instances hit to do relevancy testing
[15:09:47] and right now instances are allowed to hit that service in labs-support directly
[15:09:52] I think it is the opposite, but I'm not sure I understand our network all that well
[15:10:05] (I have no backstory here on what they are trying to do now just trying to help clarify)
[15:10:05] yes, just what chasemp said
[15:10:39] so right now, you're saying we have some kind of proxy in labs-support to let regular labs instances come through labs-support proxy and hit relforge in prod?
[15:11:03] I think of labs-support as prod
[15:11:21] (using I guess non-LVS relforge, e.g. relforge1001.eqiad.wmnet)
[15:11:23] so relforge1001/1002 (? I think) live in labs-support and instances hit them directly in that vlan
[15:11:34] is my understanding yep
[15:11:36] oh
[15:11:50] relforge exists in labs-support, not in regular production vlans
[15:11:55] yes
[15:12:21] this seems like an odd abuse of the labs-support concept to me, but I clearly know nothing about this :)
[15:12:34] it's a service made just for instances in consume elasticsearch datasets
[15:12:43] to consume even
[15:13:04] elasticsearch datasets that the public can't otherwise reach because they're internal prod stuff... ?
[15:13:16] that's what I'm getting at, why are we poking holes here :)
[15:13:21] it's basically historically what labs-support is for afaiu which is a service we treat like prod that only instances consume
[15:13:22] right
[15:13:48] I always thought of labs-support being stuff that infrastructurally labs relies on
[15:13:57] but I also don't often look or think of that stuff at all!
[15:14:07] like, nfs servers and such
[15:14:10] I think that's true and this got put in that logical bucket
[15:14:13] basically, this is a labs service, but which requires real hardware, there is no reason to access it outside of labs
[15:14:17] which I agree is arguable definitely
[15:15:03] the initial discussion indicated that the correct place for that kind of service is in labs-support since we only have VMs in labs itself
[15:15:06] bblack: so we could drop these two relforge servers in a public vlan and use iptables to restrict to instances and the reason that wasn't done is lots of back and forth about where it should live and it landed on the current
[15:15:14] which I'm not saying is the most loveable of all solutions
[15:15:14] ok skipping past the fact that I only half-understand most of the above and it sounds security-questionable or whatever
[15:15:30] LVS isn't going to work in labs-support
[15:16:00] (in general, because the LVS machines need direct connections to the relevant VLANs, and they only have them to the 4x standard per-row public/private VLANs in each DC)
[15:16:38] (but also, I imagine even if we had the additional interfaces/VLANs connected, there would probably be issues between LVS's direct-routing magic and whatever's happening for labs routing between labs instances and -support, etc)
[15:17:35] I'm missing the context for why LVS is desired here atm
[15:17:38] so on that basis, I can say at least the current approach doesn't look feasible in that patch (that relforge100[12] live in labs-support and you're trying to define an internal LVS service for them on the prod LVS machines, for any reason)
[15:17:42] ok, so if we really want LVS, we need to make it a public service, which we could but isn't really necessary (and feels a bit weird to me, but the whole situation looks a bit weird)
[15:18:50] we don't generally do public LVS services for one-offs, either, only for the cache cluster terminators and a couple of other special cases
[15:19:18] the other special cases being: recdns (not actually public, but uses public IPs for $raisins), and git-ssh->phab
[15:19:21] Adding LVS there was to try to get this setup closer to what we have in production. I came to this from T162037 (trying to align our SSL certs).
[15:19:22] T162037: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037
[15:19:53] * chasemp off to a meeting
[15:20:06] LVS is not really necessary here, things work as they are. So I'll probably just drop that.
[15:20:11] ok
[15:20:59] other question since I already interrupted you...
[15:22:17] We have a hole in the FW from analytics network to some elasticsearch production servers (used to publish updated scores that are calculated in the analytics cluster). It seems that it makes more sense to have a hole for the LVS endpoints than for some specific servers.
[15:22:26] Or am I again misunderstanding something?
[15:23:45] well, probably all of those things are poorly thought out and implemented presently in general, about how we control access between labs/analytics/prod/etc and what our policies really are
[15:24:14] but in general analytics is kind of firewalled off and only gets holes for specific things, because analytics really is a different security domain, since it has accounts for 3rd-party researchers and such
[15:24:43] (and maybe other reasons I'm not thinking of, too)
[15:24:54] yep, that much I mostly understand...
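As a rough illustration of the "public vlan + iptables restricted to instances" idea mentioned at [15:07:30] and [15:15:06]: if relforge were exposed on a public IP, host-level filtering could still limit who reaches it. The source range and port below are placeholders for illustration only, not the real labs networks or elasticsearch ports.

    # Placeholder sketch: allow only the (hypothetical) labs instance range to
    # reach the service port, drop everyone else.
    iptables -A INPUT -p tcp --dport 443 -s 198.51.100.0/24 -j ACCEPT   # labs instance range (placeholder)
    iptables -A INPUT -p tcp --dport 443 -j DROP                        # everyone else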
[15:25:44] also, punching holes for LVS services is complicated at best, probably not reasonably feasible
[15:25:54] because LVS involves asymmetric routing
[15:26:17] and LVS isn't a singular IP or service, even in a given traffic-class, too
[15:27:00] Ok, so that at least explains why we have holes for individual servers
[15:27:01] LVS is basically a router, for only one side of a connection
[15:27:17] the general idea goes something like this (using made up stuff):
[15:27:48] foo100[123].eqiad.wmnet are machines in eqiad in private VLANs, with let's say private IPs 192.0.2.[123]
[15:28:17] foo.svc.eqiad.wmnet is a virtual service hostname to use LVS to round-robin to those three, and it resolves to say 10.2.2.42
[15:29:10] there's an LVS service defined, which maps 10.2.2.42 to those 3 servers (with healthchecks and depool control, etc)
[15:29:34] what that actually does (defining that LVS service) in puppet and real network terms is like this:
[15:30:07] 1) It creates IP address 10.2.2.42 on the loopback interface of say lvs1003.eqiad.wmnet
[15:30:30] 2) It creates the same IP address 10.2.2.42 on the loopback interfaces of foo100[123].eqiad.wmnet as well
[15:31:10] 3) None of the above respond to ARP on real interfaces since they're on loopback, so none of these definitions/uses of 10.2.2.42 are "reachable" in any normal way from random hosts yet...
[15:31:42] 4) lvs1003.eqiad.wmnet advertises to cr[12]-eqiad (our core routers) that it is a router for the IP 10.2.2.42, so forward traffic for that to it
[15:31:47] (over BGP)
[15:32:29] now, when $random_host in some random production network tries to create a tcp connection to 10.2.2.42:80 or whatever, what happens is:
[15:33:18] random_host doesn't have 10.2.2.0/24 as a local network (nobody does, it doesn't exist as a real VLAN anywhere), so it sends the SYN packet that's marked as from $random_host's IP and to 10.2.2.42:80 to the default gateway (one of those core routers)
[15:33:39] the router forwards it to lvs1003.eqiad.wmnet because it knows that from the BGP advert
[15:33:54] That's the part I was missing. The core routers are an integral part of how LVS works.
[15:34:03] Ok, it is starting to make sense
[15:34:22] lvs1003 has a service definition for the IP:port in question, makes a decision to use foo1002.eqiad.wmnet, and forwards the packet to foo1002.eqiad.wmnet (the packet still says src:$random_host, dst:10.2.2.42)
[15:34:47] foo1002.eqiad.wmnet accepts the packet because it does understand that 10.2.2.42 is itself (it's defined on loopback interface)
[15:35:26] when it responds with the SYN-ACK response packet, it uses its normal public interface, and sends a response with 10.2.2.42 as the source address and $random_host as the destination address. Assuming $random_host isn't on its local network, this will go through the router.
[15:35:42] it never saw the BGP advert and doesn't know about LVS, and doesn't forward the back-traffic through LVS
[15:36:21] so the path is asymmetric. one side of the connection goes $random_host->cr1-eqiad->lvs1003->foo1002. The other side goes foo1002->cr1-eqiad->$random_host
[15:36:44] (unless foo1002 and $random_host happen to be on the same actual vlan, in which case cr1-eqiad is skipped on that second side)
[15:38:01] how does it not work with firewall in between...
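A sketch of roughly what steps 1-4 above amount to at the Linux level, reusing the made-up addresses from the example. In production this is all generated from puppet rather than typed by hand; the scheduler, weights, port and sysctl choices below are illustrative assumptions, not the actual configuration.

    # On each realserver foo100[123] -- own the VIP on loopback, but never
    # answer ARP for it on the real interfaces (step 2/3):
    ip addr add 10.2.2.42/32 dev lo
    sysctl -w net.ipv4.conf.all.arp_ignore=1
    sysctl -w net.ipv4.conf.all.arp_announce=2

    # On lvs1003 -- the VIP also lives on loopback (step 1), and IPVS forwards
    # to the realservers' real addresses in direct-routing ("gatewaying") mode,
    # so the packet keeps dst 10.2.2.42 and the reply bypasses the LVS box:
    ip addr add 10.2.2.42/32 dev lo
    ipvsadm -A -t 10.2.2.42:80 -s wrr
    ipvsadm -a -t 10.2.2.42:80 -r 192.0.2.1:80 -g -w 10
    ipvsadm -a -t 10.2.2.42:80 -r 192.0.2.2:80 -g -w 10
    ipvsadm -a -t 10.2.2.42:80 -r 192.0.2.3:80 -g -w 10

    # Step 4 (advertising 10.2.2.42 over BGP to cr[12]-eqiad) is done by a
    # routing daemon on the LVS host and isn't shown here.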
[15:38:18] well it can, but it's complicated
[15:38:27] firewall rules are defined per-VLAN
[15:38:44] LVS service IPs exist outside of the world of our VLANs, and their traffic flows over those VLANs
[15:39:07] so to make it work for analytics to reach the above example service, you'd have to do something like:
[15:39:47] 1) Figure out that foo100[123].eqiad.wmnet actually live on production VLANs private1-a-eqiad and private1-c-eqiad (which rows they're currently installed in)
[15:40:12] I really like to learn about all this, but if you have something more urgent, I can live with "it's too complicated / does not make sense"...
[15:40:31] 2) Add to relevant firewall rules between private1-a-eqiad<->analytics and private1-b-eqiad<->analytics rules for the LVS IP 10.2.2.42 to pass between them
[15:40:44] 10netops, 06Operations: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3210252 (10Multichill) Hey, a new network engineer. :-) Fun info at https://stat.ripe.net/AS43821#tabId=routing and https://stat.ripe.net/AS14907#tabId=routing . Would love to see some progres...
[15:40:53] 3) Add to somewhere else in the firewall that it's also ok for analytics to traffic with .... something somewhere to reach LVS for that IP too
[15:41:34] maybe it would be simpler to avoid defining the specific prod subnets the hosts live on, I donno
[15:42:01] make some generic rule that allows analytics to talk both ways to that LVS IP via any other VLAN
[15:42:05] I donno
[15:42:16] we've never done it and it sounds complicated :)
[15:43:04] but also it's just questionable in general
[15:43:47] Ok, so let's keep holes for each server and not have balancing (as we do at the moment). We don't really need HA there so no real issue atm. It just also looks messy to define holes for each node and to keep them up to date...
[15:44:21] for the labs case, aside from certain true infrastructure (e.g. actual LDAP stuff for authentication, etc), generally I tend to think if we're exposing some $random_application_service_or_data which lives in production to labs instances, we may as well expose it to the public.
[15:44:30] so just set it up as a public service and labs can reach it too
[15:44:46] * gehel feels that the solution is to do something completely different... like having a single communication queue / router between analytics and prod
[15:44:50] analytics is a much trickier case
[15:46:22] well, re: single router between them or whatever, with all of this we're trying to be reasonably efficient too
[15:46:51] in the case of relforge, the only issue in exposing it to the internet is that people might start to rely on it. And since we want to experiment with new ways to rank, we want to be able to break it.
[15:47:02] you could physically divorce it and say "this analytics-subnet stuff gets its own separate row, its own separate switches, its own separate router, and then we place this hardware firewall between that router and the production routers"
[15:47:18] gehel: (for context) I think this setup made sense at the time we wanted to have an elastic cluster accessible in labs with data updated in real time, but for relevancy testing we can maybe evaluate having everything isolated in prod?
[15:47:48] when I say "single" I mean some kind of message queue cluster.
[15:47:49] but in the real picture of how things work in our datacenter that's very wasteful. We can place their equipment in the same rows and switches using the same routers, and just use VLANs and firewall rules on the routers to segregate them and achieve much the same ends in a much simpler way.
[15:48:22] (and not introduce perf bottlenecks to boot)
[15:49:40] bblack: thanks a lot for your time!
[17:36:56] 10/30 cache hosts done, tomorrow morning I'll upgrade the rest
[17:37:03] see you!
[17:37:24] cache_text, that is :)
[20:12:25] 10netops, 06Operations, 10ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks and rows - https://phabricator.wikimedia.org/T163002#3211448 (10Cmjohnson) @ottomata I would like to do this first thing in the morning (0830) 04/26 before the racks are shutdown. I will update this task with the switc...
[20:59:23] 10netops, 06Operations, 10ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks and rows - https://phabricator.wikimedia.org/T163002#3211701 (10Ottomata) Hm, we got a problem! These Kafka nodes are in the Analytics VLAN networks, AND have IPv6 configured. There is no IPv6 VLAN setup in Row B. I'...