[00:19:08] 10Traffic, 06Operations: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2743594 (10BBlack) 05Open>03Resolved a:03BBlack [01:29:29] ema: I rebooted maps + misc clusters for kernels, and the secondary lvs, and the extra eqiad primaries. so we're left with the "real" primaries with active traffic, and text+upload clusters now. [01:30:14] (well 2x of those lvs are still going, I guess a salt command hung, but they'll be done before you read this) [10:43:43] bblack: (whenever you have time) - any special procedure to reboot the primary lvs like you did today? [10:44:09] (I am curious about how to do it safely) [10:45:55] well yeah, it's one of those things where no amount of automation or monitoring is going to make it worry-free enough to fire and forget :) [10:46:23] I usually do whatever change/reboot/etc on the secondaries first (many hours ago in this case), and verify they look sane in the aftermath [10:47:12] and then for the primaries: I double-check the matching secondaries are still healthy just before (e.g. pybal is actually running, ipvsadm -Ln output looks right) [10:47:34] and then I stay logged into the matching secondaries and confirm ipvsadm -Ln shows traffic coming in when pybal stops on the primary [10:47:50] and wait for it to flip back post-reboot as well [10:48:54] in some cases in the past (but not today), I've compared the full set of service hosts in ipvsadm output as well [10:49:25] in case a whole service was configured on one but not the other, or a service was lacking certain backend hosts on one or the other (etcd issues? or differential healthcheck results?) [10:49:41] but I've been looking at that a lot in general lately so it didn't seem high risk [10:50:16] it would be nice to have an icinga check that could compare an lvs-pair [10:50:53] (compare their hierarchically-sorted ipvsadm -Ln outputs to see service-level or host-level state diffs that persist more than a check or two) [10:55:31] 10Traffic, 10netops, 06Operations, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2744318 (10BBlack) The recdns case is fully-fixed now (the old/bad IP no longer present anywhere or functional). [10:59:23] thanks! [11:01:17] I was thinking to in flight TCP connections on the primary LVS when it stops [11:02:27] is the primary going to gracefully drain them or it will just cause rst? [11:02:39] neither :) [11:03:12] so there's two pieces on each LVS host that matter: the actual kernel IPVS stuff, and the pybal daemon [11:03:48] pybal does the BGP advertisement to the router (all LVSes do this, regardless of whether we think of them as primary or secondary) [11:04:22] the router makes the call on which one "wins" in the case of multiple LVS advertising the same IP (pri+sec) [11:05:04] but if pybal stops, the BGP connection is broken. this immediately stops the advertisement of the route to the router, and so the router immediately sends traffic to whichever is still advertising it in BGP. [11:05:28] but when pybal stops, it doesn't actually clear the IPVS state in the kernel. the kernel will still accept the traffic, if the router sends it. [11:06:04] (that covers up any minor races that would otherwise ensue if pybal tried to tear down IPVS state when it shut down. 
it also covers a lot of failure scenarios if pybal crashed out) [11:06:23] this clarifies a lot of doubts that I had [11:06:53] the routers also have static backup routes for the service subnets, so that if all pybal dies (nothing is advertising BGP), the lowest-priority (but only remaining) route is to the one we're calling "primary" LVS (the lowest-numbered of a pair) [11:08:14] and of course, the LVS don't actually terminate TCP connections. they just forward the inbound (from the client) side of their packets to the backend hosts. [11:08:47] this is another good point, they are working at L3 [11:08:49] so, in theory, if a service uses "sh" mapping (like our public services do), and both lvs1001 and lvs1003 route the same client IP to the same backend (cache_text node), TCP doesn't have to RST. [11:09:06] the packets just start flowing through a different router and the connection is fine from client and cache node perspective. [11:09:32] in practice, I think often it's not perfect like that, because sh (and the way pybal manages sh) leads to inconsistencies in their mappings [11:09:59] but worst-case, on LVS flip for a service, a bunch of RST ensue, but the client can reconnect successfully and instantly. [11:11:25] awesome [11:12:24] thanks for the explanation :) [11:14:23] there's two ways we can fix the RST problem: [11:15:19] 1. For "sh" cases (public services), we could improve "sh" itself and/or improve how pybal manages sh endpoints (not sure if that's enough without "sh" changes though), to make sure the mapping is really a consistent hash that would always be consistent across both LVS [11:15:56] (but "sh" is an ipvs kernel module. they are a bit arcane and scary to work on) [11:16:20] Lvs also keeps per connection state, so not sure it works for existing conns [11:16:43] 2. For non-"sh" cases (like all our internal services, which are better off wrr because the set of client IPs is small (a set of cache nodes, for most traffic), and also covering the above case: we could have ipvs do multicast connection-state syncing [11:16:45] I have one such kernel mod already written, but it failed due to that [11:16:50] I think we have a ticket for (2) [11:17:04] It was pretty easy though [11:17:42] I *think* (but I'm not sure) that LVS's state is really just "which host is it mapped to". I don't think it enforces things like TCP flags sanity and such? [11:18:03] but I could be wrong, it's been a while since I looked [11:18:39] for NAT-mode and such it would be different, but I think for DR it doesn't need to track or enforce much state other than the selected endpoint for each socket [11:18:42] i think it only evaluated it on new connections [11:18:53] and would then use its per conn state for ongoing packets [11:19:01] but this was 2008 or thereabouts, a lot might have changed [11:19:35] well the main thing is I don't think it will fail if it picks up a connection that's already established. the question is just whether it routes to the right backend. [11:20:17] yeah [11:20:17] but... I could be wrong! it could only route on SYN and then reject non-SYN that don't have an existing connection table entry (although that seems like a poor idea) [11:20:20] with my module it would [11:20:32] heh i can't even find it in google anymore [11:20:36] it used to live in svn.wikimedia.org [11:21:03] I've come to be fond of the fact that my old code slowly dies and gets harder to find. 
When I do find it, it tends to be embarrassing :) [11:21:25] yes exactly [11:22:37] sh (at least originally) didn't claim to do consistent hashing, but in practice the way it works, if you load up the same IP addresses in the same order, they map for clients the same way. [11:23:03] so in the should-be-common case where nothing ever gets depooled or fails a healthcheck, we happen to get consistent mapping [11:23:17] but as soon as pybal starts removing and re-adding hosts, they get re-ordered in the sh mapping [11:23:29] yeah my module did a libketama like thing [11:23:31] so it would work [11:23:46] exact same thing as I did in the varnish 3 director [11:23:51] yeah [11:24:16] gee i really have to go deep in my backups/archives to find that code apparently [11:25:08] oh [11:25:09] found it [11:25:29] https://phabricator.wikimedia.org/rSVN37389 [11:25:31] T136944 for (2) above [11:25:32] T136944: Set up LVS connection sync - https://phabricator.wikimedia.org/T136944 [11:25:53] so this worked in some 2008 kernel ;-) [11:26:28] my end goal was to be able to horizontally scale lvs balancers [11:26:33] because we were hitting single-system limits [11:26:43] and i wanted to have the router load balance the traffic [11:26:49] which basically runs into the same problems [11:27:06] return addr * 2654435761UL; [11:27:13] ^ knuth multiplicative :) [11:27:17] yes [11:27:34] this was actually a homework assignment as well, at university [11:27:43] i was attending a linux course by some linux kernel hacker [11:28:22] who happens to be a professor at my uni [11:29:25] I think that doesn't hash very well given the effective table size is a power of two. I think it's one of those ones that only works well if the table is prime-sized. [11:29:39] no it didn't work very well [11:29:46] there's plenty of room for improvement, i just made it "work" [11:29:52] yeah :) [11:30:30] and then when I found out that it wouldn't work with ipvs's existing design [11:30:34] i never followed up and improved it [11:30:49] as there was no chance I had the time to fix up ipvs as well [11:31:18] I mean, techops was robh and me at that point ;) [11:31:25] and i was 16 hours a week parttime hehe [11:31:29] ipvs is one of those things... it kinda reminds me of the OpenSSL problem everyone blogged about the past few years [11:31:42] it's so critical to so many ops here and there and everywhere, and it's technically open source [11:31:54] but the docs suck, the code is arcane and difficult to understand, etc [11:31:55] it works well enough for 98% of use cases [11:32:24] the project just doesn't have enough people interested in making it cleaner and clearer and better-documented, etc...
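A toy sketch of the hashing point above: with a multiplicative hash like "return addr * 2654435761UL;" folded onto a power-of-two table (assuming a 256-entry table, which is ip_vs_sh's default as far as I know), the low bits of the product depend only on the low bits of the address, so the network portion of a client IP contributes nothing to the bucket choice. The addresses and the 8-bit mask below are arbitrary illustration, not the real module's code:

    # all four addresses differ only above the last octet, so with an 8-bit
    # table mask they land in the same bucket regardless of the multiplier,
    # since (addr * C) & 0xff depends only on addr & 0xff
    for ip in 10.0.0.5 10.64.0.5 10.128.7.5 192.168.200.5; do
        IFS=. read -r a b c d <<<"$ip"
        addr=$(( (a<<24) | (b<<16) | (c<<8) | d ))
        printf '%-15s -> bucket %d\n' "$ip" $(( (addr * 2654435761) & 0xff ))
    done

A ketama-style ring avoids exactly this, at the cost of more bookkeeping per lookup.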
[11:32:51] I guess because everyone that relies on it either (a) only uses it in trivial obvious ways or (b) has someone around willing to dive through the source code to figure out wtf is going on [11:33:15] * moritzm initially misread that as 16 hours a day parttime :-) [11:33:21] heh [11:33:32] i got paid 16 hours a week, I usually put in more [11:33:40] but refused more hours as I wouldn't possibly finish my degree otherwise [11:34:32] so there was that other ticket too, about weight=0 [11:34:50] "modern" ip_vs_sh supports that, but we have to add some support on the pybal side [11:35:07] (such that you don't remove a host on depool, simply set its weight to zero, and then sh behaves somewhat-more-like a real chash) [11:35:21] that's a big part of the reason for the state machine thing [11:35:36] T86650 [11:35:36] T86650: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650 [11:35:41] when I did a naive attempt at adding it to existing pybal on some flight from SF, i wrote some horrifying code [11:36:30] I remember! [11:45:02] ema: async reboot status update: maps and misc and all lvs/authdns done. text and upload TODO. I think I'm going to start on text soon after a coffee or two. [11:47:04] mark: or paravoid: any idea what will happen if I advertise the same serviceip from two different LVS clusters? I assume the router will make some arbitrary decision between the primaries that share a metric, and things will still "work", I just may not be able to predict which it uses? [11:47:25] that's the definition of anycast [11:48:08] yeah that [11:48:09] well, anycast with no differentiation at all between two possible routes :) [11:48:29] does it route randomly at that point, or use some dumb tiebreaker like "lowest-numbered router IP"? [11:48:35] no, [11:48:42] each router will pick the lvs cluster that's closest [11:48:52] but they're both close [11:49:04] oh like that, then yes [11:49:06] I don't mean 2x DCs, I mean high-traffic1 + high-traffic2 in eqiad [11:49:07] things like lowest ip matter [11:49:24] ok [11:49:42] it will still be deterministic per router, so not random [11:50:13] we have some IPs to move between high-traffic2 + low-traffic. I'm figuring I can stop puppet on all related LVS, push the change, puppet->pybal-restart on the "new" LVS cluster for the IP (now it's getting advertised for both), then do the old to remove it there, and no real hiccup. [11:50:35] yeah that should be fine [11:50:46] you can also force it with a temp static route [11:51:09] there's probably going to be an "sh" hiccup [11:51:17] that's what we discussed above yes ;) [11:51:22] well sure, but we get that even on pri->sec failover for internal services [11:51:30] you could setup the lvs sync thing if it still exists and works [11:51:33] to sync conn state [11:51:45] yeah we have a ticket to look into it [11:51:48] i've done that manually on occasion [11:51:53] just for migrations [11:52:02] but that wouldn't work for sh would it [11:52:07] why not? [11:52:14] I think it would for sh [11:52:43] the sh decision is only for unknown sockets, the state syncer syncs them like known already-mapped ones. 
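For reference, the connection-state syncing being discussed (T136944) is normally driven through ipvsadm's sync daemons; a minimal sketch, assuming eth0 as the multicast interface and an arbitrary syncid:

    # on the currently-active LVS of the pair: push connection state to multicast
    ipvsadm --start-daemon master --mcast-interface eth0 --syncid 1

    # on its standby partner: receive and install that state
    ipvsadm --start-daemon backup --mcast-interface eth0 --syncid 1

    # verify the daemons are running, and that connection entries (the
    # already-mapped sockets mentioned above) show up on the backup
    ipvsadm -L --daemon
    ipvsadm -L -c -n | head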
[11:52:48] yup [11:53:07] as I said above, exactly the reason why I never pursued the wcsh route further [11:53:35] it's pretty trivial to configure, we just need to do some real sanity-testing about how heavy the multicast traffic is, and whether the rate of it's sane in our environment and everything holds up, etc [11:54:13] I used the multicast sync at $job-1 though and in practice it worked fine, at much lower client-connect volumes :) [11:54:21] it had some issues too [11:54:25] which is why I never left it enabled [12:35:04] 10Traffic, 10netops, 06Operations, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2744631 (10BBlack) ocg and git-ssh are fixed as well! [12:36:00] bblack: awesome! [12:36:04] nice work :) [12:36:10] shall we resolve the task then? [12:36:28] I'm gonna do one more commit to document it at least a little in puppet [12:36:35] I documented in revdns commentary already, but JIC [12:43:03] 10Traffic, 10netops, 06Operations, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2744637 (10BBlack) 05Open>03Resolved a:03BBlack [13:12:54] I keep staring at the nginx commit log. It sure seems like nothing else since 1.11.3 could have this effect. It's all relatively-irrelevant changes to modules/code we don't use, except what I've already tried backporting, or stuff that couldn't possibly impact SSL. [13:13:11] (or the stuff I tried reverting) [13:13:56] there's still the dynamic record size thing. The one effect of that patch that can't be runtime-disabled is it changes NGX_SSL_MAX_SESSION_SIZE from 4k to 16k [13:14:20] which seems harmless (if possibly wasteful), but there's some chance it could have some impact on openssl-1.1 [13:14:48] also notable, cloudflare released an updated version of that patch for 1.11.5, 2 weeks ago [13:15:08] but the updated patch is just working around code-style changes and updating line offsets, etc. it didn't functionally change at all. [13:15:42] so either CF isn't using it with nginx-1.11.5+openssl-1.1, or something else is going on. I know they're using a go implementation to test TLSv1.3 [13:16:33] https://github.com/cloudflare/tls-tris [13:16:49] so maybe they're not even trying openssl-1.1+nginx due to their focus on tls-tris [13:26:27] just a thought, Fedora recently made the switch to openssl 1.1 in rawhide, maybe their BTS contains some hints [13:35:34] bblack, elukey: I've added https://wikitech.wikimedia.org/wiki/LVS#Planned_reboot_of_LVS_servers please feel free to correct/expand [13:53:55] bblack: OK to merge the nginx-{full,light,extras}-dbg patch? https://gerrit.wikimedia.org/r/#/c/317790/ [13:56:19] <_joe_> ema: how are you? [13:57:05] ema: I edited your patch and included it in a different build for testing already [13:57:19] _joe_: feeling better, thanks! 
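As a side note on the nginx+openssl-1.1 reset investigation above, a rough way to chase the RSTs in isolation might look like the following; the hostname and address are placeholders, not real config:

    # capture any resets on the TLS port while driving test traffic at the build
    tcpdump -ni eth0 'tcp port 443 and tcp[tcpflags] & tcp-rst != 0' -w /tmp/tls-rst.pcap &

    # drive requests at the test instance; --resolve pins the name to the test IP
    for i in $(seq 1 200); do
        curl -sv --resolve test.example.org:443:192.0.2.10 \
            https://test.example.org/ -o /dev/null 2>&1 | grep -iE 'reset|alert|error'
    done

    # or poke just the TLS layer directly
    openssl s_client -connect 192.0.2.10:443 -servername test.example.org </dev/null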
[13:57:23] <_joe_> :) [13:57:34] ema: not really merging anything yet, till I'm done experimenting and making a mess of the branch [13:57:47] bblack: ok nice, I'm gonna close the CR then [13:57:51] no [13:57:58] I'm using it, to pull from :) [13:58:15] oh sorry I've misread what you did :) [13:58:17] ok [13:58:28] I just edited it to not mess with changelog, so I could bundle it up with a bunch of other changes [13:59:21] a few more test builds today at some point should settle things out and confirm or deny whether it's even possible to have a working nginx+openssl-1.1 at all [13:59:48] (if not, then either openssl-1.1 itself has bugs, or nginx's use of it does up through current latest master) [14:01:03] I'm starting to think it's ngin'x io buffer mgmt needs updating for 1.1, but it's hard to say yet [14:01:17] e.g. the code in src/event/ngx_event_openssl.c that does stuff like: [14:01:26] if (rbio != wbio) { [14:01:26] (void) BIO_set_write_buffer_size(wbio, NGX_SSL_BUFSIZE); [14:03:14] it might have been simpler to spin up a cp1008-like instance for cache_upload and try to get a repro isolated from other traffic, and then maybe debug harder to find the source of the RST in isolation... [14:03:34] but I'm halfway down this path now and it's easy to make some more builds and keep trudging through this kind of testing for now, too [14:04:21] there's a good chance whatever's wrong with nginx+openssl-1.1 would trip sanitizers or valgrind, too [14:04:34] (if these other test builds don't work out and give an easier answer) [14:05:18] ok, can I find your builds on copper? I think a stap probe like this should work but can't confirm (because no -dbg!) :) [14:05:21] probe process("/usr/sbin/nginx").statement("ngx_http_v2_state_headers@src/http/v2/ngx_http_v2.c:1100") { printf("concurrent streams exceeded %ui", $h2c->processing); [14:05:24] } [14:05:45] well probably with a \n [14:06:16] I don't think it's likely to be that [14:06:48] it was a good theory at one point, but I've already tested the exact same nginx-1.11.4 build, just against openssl-1.0.2 instead of 1.1, and everything worked fine. [14:07:38] something's actually wrong with how the nginx+openssl code are interacting, for 1.1 but not 1.0.2, or something's wrong with openssl-1.1 itself, I think. [14:08:13] and possibly what's wrong with nginx+openssl is covered by some post-1.11.4 commit I haven't noticed or backported, or possibly it's fallout of our cloudflare dynamic tls record patch (both seem unlikely, but worth at least trying) [14:08:56] oh right I forgot of the openssl-1.0.2/1.1 test! At any rate it would be useful to be able to write probes and confirm [14:09:00] but probably nginx+openssl-1.1 is just insufficiently tested and lacking in some non-obvious way and we're among the first to start noticing it [14:09:30] yeah it would [14:09:41] but then we also have the reboots coming, which will change the kernel for stap builds too heh [14:10:15] * ema is ready with linux-image-4.4.0-2-amd64-dbg :) [14:10:20] :) [14:10:43] do you install systemtap packages from jessie-backports explicitly? I've noticed I had to do that in some other stap testing recently [14:10:51] (to get stap scripts to compile) [14:11:08] yes [14:11:38] maybe we should add that to our puppetization somehow. for all jessie really, since they're all supposed to be on these newer kernels. [14:11:39] it could be useful to just install and prepare the whole systemtap circus on a single host (copper?) 
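A sketch of what that shared stap setup could look like on a jessie host running the 4.4 kernel (package names as mentioned above; the last line assumes the nginx -dbg package from the pending change is installed):

    # systemtap new enough for these kernels comes from backports
    apt-get install -y -t jessie-backports systemtap
    # debug symbols for the running kernel
    apt-get install -y linux-image-4.4.0-2-amd64-dbg

    # smoke-test that probes compile and run against the running kernel
    stap -v -e 'probe begin { printf("stap works\n"); exit() }'

    # with nginx-*-dbg installed, userspace probe points become resolvable too
    stap -L 'process("/usr/sbin/nginx").function("*")' | head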
[14:11:59] it's ncie to be able to use it on indivudual hosts without compiling elsewhere and copying .ko objects around [14:12:00] in order to have a working stap dev environment for everyone [14:12:43] (to just run the script and let auto-compilation/caching work) [14:13:18] I ended up using it on some nodejs service last week [14:14:06] (to debug that cxserver on scb1001 was the process that was caching outdated /etc/resolv.conf info b stapping for UDP socket operations on the bad address_ [14:14:46] maybe should start reboots with cache_upload to get past that mess and continue nginx testing [14:14:50] and do text later [14:15:42] I'll kick off a loop on neodymium for them now [14:16:48] ok [14:19:07] and then we get to see how many fail to reboot, of course :P [14:19:11] hopefully, not many [14:19:49] heh hopefully less than last time [14:20:17] I'm doing them serially with a 5 minute sleep, I guess I can halt it if too many bad ones pile up [14:20:25] for h in `cat cache_upload`; do hs=`echo $h|cut -d. -f1`; echo === $hs ===; salt -v -t 30 einsteinium.wikimedia.org cmd.run "icinga-downtime -h ${hs} -d 900 -r kernel-reboot"; salt -v -t 10 $h cmd.run 'touch /var/lib/traffic-pool/pool-once; echo reboot |at "now + 1 minute"'; sleep 300; done [14:20:30] ^ that [14:20:57] maybe I should reduce the downtime to something like 7 minutes, to catch dead ones in icinga-wm faster [14:21:06] they shouldn't take that long to reboot anyways [14:21:46] yeah 7 minutes should be fine [14:22:02] either way, it's going to take the better part of 4 hours [14:26:31] we'll probably see some minor 503 issues. I expect ~5-10% failure rate of pool/depool from Raft Internal Error, which will either leave some pooled when they reboot, or fail to repool them when they come up [14:27:55] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2744953 (10mark) [14:28:44] anyways, I'll check on etcd state and icinga state periodically and see what needs fixing up, and hopefully the problems don't pile up much faster than I'm looking [14:48:10] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2744991 (10mark) p:05Triage>03High [14:49:20] well, so far 1x upload cache in each DC rebooted ok :) [14:49:49] nice :) [14:51:56] ema: want to do cache_text, if you're up for it? [14:52:12] they should be able to run independently [14:53:21] (note: text has 1x basically-dead host right now: cp1052, can skip/ignore that one) [14:53:40] bblack: sure, finishing my bcn expense report first for your managerial pleasure [14:53:49] :P [14:54:03] * mark smiles [15:18:20] bblack: ok, pinkunicorn is also already done, right? [15:18:41] I don't see icinga-wm spam, assuming 7 minutes works fine for the downtime [15:18:47] yeah seems to [15:18:54] and yeah cp1008 was done before everything [15:22:49] uh and icinga-downtime -d 420s is exactly what I used last time :) [15:23:31] s is implicit! :) [15:24:04] UUOS [15:24:14] if you're copying from old commands, keep in mind s/neon/einsteinium/ [15:24:21] ah! 
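For the record, a possible variant of the loop pasted above, with the shorter 7-minute (420s) downtime discussed and a post-reboot kernel check tacked on; same host-list file and salt targets as the original:

    for h in $(cat cache_upload); do
        hs=${h%%.*}
        echo "=== $hs ==="
        salt -v -t 30 einsteinium.wikimedia.org cmd.run \
            "icinga-downtime -h ${hs} -d 420 -r kernel-reboot"
        salt -v -t 10 "$h" cmd.run \
            'touch /var/lib/traffic-pool/pool-once; echo reboot | at "now + 1 minute"'
        sleep 300
        # confirm the host came back on the expected kernel before moving on
        salt -v -t 10 "$h" cmd.run 'uname -rv; uptime'
    done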
[15:25:51] going to be interesting to see how the hitrate graph fares with all of cache_upload rebooting over ~4h (or maybe a little less if no hiccups) [15:26:23] they're ordered as 123412341234 etc, and not really linearly within each DC, but that's about it [15:26:32] good test-case :) [15:27:56] so far it's like 1/3 of the way in maybe, and total hitrate has dropped from ~97.4 to ~95.7 [15:28:07] so not awful, it has time to recover the most-important stuff as it goes [15:28:23] I wouldn't be surprised if the graph bottoms out a bit lower before it recovers though. maybe mid 80s? [15:29:00] maybe lower, we'll see [15:29:15] if it starts looking awful, I might have to pause for a few hours and let it recover more content before continuing [15:31:35] I didn't stop the cron restarts either, so that's adding to the mix a bit as well, but should only be a couple during this [15:39:33] ok done cp1054 as a trial to confirm that my script still does the right thing, rolling reboots started (5 minutes between hosts, one DC at a time) [15:45:21] so far the minimum dip on cache_upload total hitrate is 93.8%, but it's still bouncing up and down a bit, it's higher than that now, but may eventually get lower [15:46:39] (of course, another way to think of that which puts it in a scarier light 97.4->93.8 is about double the normal swift traffic heh) [16:16:53] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745288 (10mark) p:05Unbreak!>03High @cmjohnson: "Unbreak Now!" (UBN) is reserved only for critical emergency, "don't go home until this is fixed" kind of things.... [16:29:24] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2745351 (10mark) I've added new ports for the row A-D uplinks to the aggregated links (desc no-mon), so they can be moved one by one. No other ports (transit/transport/etc) yet. [16:49:23] the new lowest-dip on upload hitrate is now 91.9, but we're ~2/3rds through the process at this point, and it's holding up pretty good overall. it tends to bounce back from new lows pretty quick. [16:49:42] (e.g. currently it's at 93.2) [16:53:00] iowait's climbing on swift, too, but so far the load/traffic there seems manageable [16:53:34] we've just recently crossed the 1:1 barrier there on 1min loadvg vs #cpus [16:56:40] meanwhile, 16/28 text nodes rebooted [16:56:56] might be related to the conntrack warnings for swift I see in icinga [16:58:06] yeah might want to raise those limits there [16:58:30] we probably don't want conntrack to be a limiter anywhere near realistic production values [16:58:41] _text hitrate went down from 83 do 80% compared to yesterday same time [16:59:03] no I'm retarded I'm looking at the wrong graph [16:59:43] that was the frontend hitrate, total rate unchanged compared to yesterday [17:00:32] yeah text tends to recover easier, since the dataset is smaller [17:00:49] upload is the edge case because even in steady-state, backend caches are smaller than the total dataset [17:01:52] for upload, there's a small growing trend of spiky 503s out in ulsfo+esams as codfw/eqiad hosts restart [17:02:32] I don't think it's swift fallout, I think it may be depool-fallout (not waiting long enough + heavier-than-usual miss traffic)? 
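On the conntrack warnings for swift, a quick sketch of checking the headroom and raising the limit; the value below is only a placeholder, not a recommendation:

    # current usage vs. limit
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # bump for the running system
    sysctl -w net.netfilter.nf_conntrack_max=524288

    # persist across reboots
    echo 'net.netfilter.nf_conntrack_max = 524288' > /etc/sysctl.d/60-conntrack.conf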
[17:02:53] it's not big enough to be a big problem, yet [17:03:30] there was a strange drop in total #reqs on upload about an hour ago [17:03:41] I wonder if that's missing stats from some errant stats daemons post-restart [17:04:51] hmmm looking by DC it's not really a drop. eqiad+codfw both had an unexpected rise for a while, which later subsisded back to normal at ~16:00 [17:05:09] in the global aggregate it looks different because of the differing underlying daily patterns [17:07:06] I've had no reboot failures so far this time around, so that's a plus. [17:07:13] neither did I [17:08:24] I had one during misc+maps, but it was because of the known memory errors on cp3009. it detected it in some startup test and stopped to ask to press F1 or whatever. [17:11:50] the recent reboots exposed this race in sysctl setting, https://phabricator.wikimedia.org/T136094 with the correct time_wait setting we have about 45% more leeway [17:15:59] yeah that's an interesting edge-case in general [17:16:21] that distros tend to have a sysctl "service", but it may contain settings for modules that aren't even loaded until unrelated late services start [17:17:12] it sounds like somewhere where sysctl handling should be beefed up in general. e.g. a new interface to userspace where a sysctl daemon can listen for modules registering new sysctl slots and then give them the configured values as they appear (maybe even as their initial settings). [17:18:35] systemd would be the ideal candidate, but the capabilities of systemd-sysctl.service are really limited [17:19:15] yeah [17:19:27] it's very similar to the udevd sort of problem [17:19:37] but for sysctl names instead of device-tree entries [17:23:46] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745562 (10Cmjohnson) [17:24:28] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10Cmjohnson) @faidon all 8 switches are accessible via serial and connected to mgmt. [17:24:41] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745564 (10Cmjohnson) [17:25:27] only 2x esams + 1x ulsfo left to start rebooting for upload now [17:26:36] pretty intense on the hitrate drop and swift load, but I think it turned out barely-within-reason all things considered [17:27:19] and now we know the edge-case guess was about right [17:27:51] if we need to deploy an emergency kernel or varnishd restarting patch, our worst-case cluster can do it in ~4h. [17:28:10] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2745581 (10mark) cr1-eqiad's row A-D links have been moved to xe-3/0/[0-3] and xe-3/1/[0-3] respectively. All port descriptions should be correct, and the old port configs have been cleaned... [17:28:12] (it's actually going to end up about 3h20m today, since there wasn't any artificial holdup) [17:28:53] maybe for future non-emergencies, we can space them out more and be safer now that we see how close to the edge this is [17:30:18] right, last time I think I used 15m between upload servers and then went a bit faster towards the end [17:30:38] yeah [17:31:11] I think 10m even-spacing, rotating between the 4x DCs, seems a reasonable target for non-emergency, and then 5m like today for something urgent. 
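One common workaround sketch for the T136094-style race, assuming the racy setting is a module-provided sysctl such as the conntrack time_wait timeout (an assumption about which knob is meant; the value is a placeholder): force the module to load via modules-load.d, which systemd orders before applying sysctl.d, so the key exists when systemd-sysctl runs at boot.

    # load nf_conntrack early, before systemd-sysctl.service applies sysctl.d
    echo nf_conntrack > /etc/modules-load.d/nf_conntrack.conf
    echo 'net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65' \
        > /etc/sysctl.d/60-conntrack-timewait.conf

    # after a reboot, confirm the ordering and that the value stuck
    systemctl list-dependencies --after systemd-sysctl.service | grep -i modules-load
    sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait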
[17:31:16] (for upload) [17:32:32] of course if we get persistence back with ATS, we won't have these limitations so much :) [17:32:42] :) [17:32:53] s/if/when/ ! [17:39:51] 10Traffic, 06Operations: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#2745596 (10ema) [17:40:14] 10Traffic, 06Operations: cp1066.mgmt.eqiad.wmnet is unreachable - https://phabricator.wikimedia.org/T149217#2745622 (10ema) p:05Triage>03Normal [18:04:23] so not a single reboot failure this time [18:04:54] 10netops, 06Operations, 10ops-eqiad: Decommission psw1-eqiad - https://phabricator.wikimedia.org/T149224#2745777 (10mark) [18:05:21] \o/ [18:07:17] 10netops, 06Operations: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#2745808 (10mark) [18:08:42] 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745822 (10mark) [18:08:45] 10netops, 06Operations: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#2745821 (10mark) [18:10:00] just double-checked uname, nothing missed [18:11:10] moritzm: all confirmed: authdns, lvs*, cp* all on: 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) [18:55:57] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2745997 (10mark) All ports on cr1-eqiad FPC5 have been moved to FPC3, except for the uplink to pfw1-eqiad, which we need to schedule downtime for with Fundraising. @Jgreen Let's schedule s... [19:13:09] bblack, around? [19:13:45] Krenair: somewhat :) [19:14:18] how many servers would be ideal for the secure redirect service? Me and Merlijn have been talking [19:14:36] one? [19:14:42] no backup? [19:14:44] it won't be much traffic, or difficult to deal with in perf terms [19:14:59] a backup is nice, but any live loadbalancing just makes the LE solution much harder [19:15:05] yeah so that's the thing [19:15:23] the problem is the acme-challenge request needing to get to the right server [19:15:30] yup [19:15:46] we were talking about something similar for the labs proxies, and he came up with this: https://www.irccloud.com/pastebin/hw2T0a8b/ [19:16:38] seems a little racy, but possibly works [19:17:10] the challenges only have to work in near-realtime, they don't have to persist [19:17:31] I'd say the simplest thing is to do acme challenges locally and treat it mostly like a single server [19:17:45] and do an rsync cronjob to the warm backup that copies over /etc/acme/ [19:17:55] (all the fetched/generated certs/keys) [19:17:59] I wouldn't put this in the initial secure redirect commit but it's a possibility later [19:17:59] hm [19:18:32] even if it's behind on some recent renewals, it just has to redo those when you flip the role from spare->master in puppet or whatever, which turns on allowing it to fetch its own acme stuff [19:18:48] and the renewals are happening many days early anyways, the rsync could be like once an hour [19:18:57] maybe... 
I'd worry about all the failed verifications before the cronjob syncs the files [19:19:37] if we're prepared to be copying certs/keys around then maybe just have the live server running LE [19:19:38] it shouldn't be happening often (re-verification for renewal), and the renewal script should leave an existing unexpired one in place if it fails the verification [19:20:09] (for early renewal) [19:22:00] in general we don't want to encourage much traffic on these anyways [19:22:10] ok [19:22:25] at some ideal future point in time, maybe all the non-canonicals that are servied there just do local 301->https and then serve a static parking page and done heh [19:22:33] (for legal/squatting/whatever) [19:23:11] (but at least initially, I think we'll have to support generic redirects that mimic what happens with apache redirects.conf) [19:24:22] anyways, I'd engineer it mostly as single host. we can tack on a ::spare (or whatever) role to a second machine + rsync mechanism afterwards pretty easily. [19:25:04] so long as all kinds of other desirable properties are in place: that renewals happen very early (like current LE scripts), that a failed renewal doesn't replace a working cert that still has expiry time left, etc [19:26:02] we could maybe do this in k8s or ganeti too, I don't know that the traffic is big enough to even warrant real hardware [19:27:05] (we should probably survey how big the non-canonical traffic is today. I don't know, as a whole. But I know we've dug into individual domains stats before and found them to be things like 0.00001% or something) [19:27:16] yeah [19:27:28] I don't have request log access [19:33:02] I'm doing a rough check now, just on the 1/1000 data with regex filtering out all the canonicals [19:34:12] people apparently put some strange things in Host: headers :P [19:34:24] and/or proxy-style URL hostnames sent to our not-proxies [19:34:58] apparently we get a lot of requests with an http hostname which is an ip address in the 10.25.0.0/16 space, which we don't even use internally :P [19:35:14] but the best by far is: [19:35:18] Host: ${SITE} [19:35:21] (literally) [19:36:08] 10netops, 06Operations, 10ops-eqiad: Decommission psw1-eqiad - https://phabricator.wikimedia.org/T149224#2746100 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson psw1-eqiad has been reset to factory settings, removed from rack and placed in storage. Racktables has been updated. [19:38:05] out of 7798146 requests in a day of 1/1000-sampled data, 3021 are requests that are not for our canonical domainnames, so ~0.0387% of traffic [19:38:19] but that's before filtering a lot of junk out of those 3021 that aren't legit at all [19:40:07] 1654 filtered somewhat for sanity [19:41:37] some of the remainder are still ridiculous, like: googleads.g.doubleclick.net, sdk.open.inc2.igexin.com, picasaweb.google.com, m.movistar.co, ... 
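Roughly the kind of filtering being described, sketched out; the log path, the Host-header field number and the canonical-domain list are placeholders, since the real sampled-1000 format and full domain list aren't shown here:

    # lowercase the Host field first (see the case-insensitivity note below),
    # drop canonical domains and obvious junk, then count what's left
    awk '{ print tolower($9) }' /path/to/sampled-1000.log \
      | grep -vE '(^|\.)(wikipedia|wikimedia|wiktionary|wikiquote|wikibooks|wikisource|wikinews|wikiversity|wikivoyage|wikidata|mediawiki|wikimediafoundation)\.org$' \
      | grep -vE '^[0-9.:]+$' \
      | sort | uniq -c | sort -rn | head -20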
[19:41:55] oh and 60 of them are because I forgot case-inesnsitivity so it caught www.Wikipedia.org [19:42:11] the only ones in double-digits are: [19:42:13] 1356 www.wikipedia.com [19:42:13] 104 wikipedia.com [19:42:13] 60 www.Wikipedia.org [19:42:13] 17 fr.wikipedia.com [19:42:15] 17 en.wikipedia.com [19:42:18] 15 m.wikipedia.com [19:42:20] 12 click.union.ucweb.com [19:42:52] 1523 are in wikipedia.com, that's the top "legit" non-canonical by far [19:43:25] anything else would be insignificant compared to that one [19:43:47] wikipedia.com comes in at ~0.02% of requests [19:43:50] roughly [19:44:23] so we're looking at something on the order of 20 reqs/sec long-term average [19:44:51] (and all nginx has to do with them is match some SNI or host: header stuff and apply some regex to generate a redirect) [19:46:32] (or perhaps emit a short static parking page) [19:46:56] bblack: great, thanks [20:36:03] 10Traffic, 06Analytics-Kanban, 06Operations, 06Performance-Team, 06Reading-Admin: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2746331 (10Nuria) @BBlack Would you be so kind as to look at our latest proposal to bucket users on doc : https://docs.google.com/docum... [20:44:39] 10Traffic, 06Analytics-Kanban, 06Operations, 06Performance-Team, 06Reading-Admin: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2746336 (10ellery) The pseudo code does not quite match the current text description of the Double Bucket proposal. [21:29:30] mh apt on cp1008 is unhappy, I guess that's expected since it is a test host, I wanted to get prometheus-node-exporter going there [21:29:34] also no ferm afaics [21:52:37] no ferm on any caches [21:55:11] I can fix up the apt stuff though [21:56:40] do you need ferm? [21:58:32] bblack: no I just noticed it, I have accidentally loaded e.g. conntrack though by running iptables [21:59:07] if it is wide open that's fine too, but yeah apt fixed would be nice so prometheus-node-exporter can get installed [22:00:51] yeah they're wide open [22:01:04] cp1008 arguably shouldn't be, just haven't made it a priority to move it to .eqiad.wmnet [22:01:12] (and give it an LVS IP all to itself) [22:01:59] godog: apt fixed, and puppet agent run fixed as well. seems it did a bunch of promtheus-related things. [22:02:32] bblack: neat, thanks! yeah it installed the node-exporter to get machine stats [22:02:55] I'm going through the list of hosts prometheus can't poll for metrics [22:02:58] so the expoted listens on :9100 on everything, and the central host(s) connect to the clients? [22:03:02] well "clients" [22:03:52] correct, prometheus servers asks for protobufs over http, though e.g. curl gets the text format back [22:04:48] ok [22:05:15] so long as it doesn't trigger ferm install on the nodes :) [22:05:45] hehe no that's still base::firewall, which wasn't included on logstash100[456] since a year or so, I just discovered [22:06:02] nice! [22:15:39] slightly related to the above for cp hosts, socket/net stats are also exported, e.g. cp1008:~$ curl -s localhost:9100/metrics | grep -E '^node_(sock|net)stat' [22:19:22] that will be fascinating data to have :) [22:20:01] the level of detail in netstat is staggering, but if we're getting it all fleetwide for free, yay :) [22:21:28] e.g. 
be able to see global effects of playing with timewait-related netstats and observing graphs based on [22:21:31] node_netstat_TcpExt_TW 5.0570617e+07 [22:21:34] node_netstat_TcpExt_TWKilled 1.804962e+06 [22:21:36] node_netstat_TcpExt_TWRecycled 91209 [22:21:42] 10netops, 06Operations, 05Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2746698 (10fgiunchedi) [22:21:53] etc [22:22:26] hehe yeah that's already ingested in codfw/eqiad, you can create grafana dashboards with it [22:26:17] 10netops, 06Labs, 06Operations, 05Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2746728 (10Krenair)
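With those counters ingested, the timewait graphs mentioned above come down to rate() over the counters; a sketch against the Prometheus HTTP API, with the server address as a placeholder:

    # roughly, sockets finishing time-wait per second, per host, over 5m windows
    curl -sG 'http://prometheus.example.org:9090/api/v1/query' \
        --data-urlencode 'query=rate(node_netstat_TcpExt_TW[5m])'

    # the same expression can be pasted straight into a grafana panel query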