[00:41:49] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 4 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600579 (10AndyRussG) >>! In T143271#2599100, @Mattflaschen-WMF wrote: > If this affects windo... [06:13:28] oh cache_upload ulsfo exploded overnight I see [06:14:03] cpu_system, mmh [06:14:05] https://ganglia.wikimedia.org/latest/?r=custom&cs=8%2F31%2F2016+12%3A38&ce=9%2F1%2F2016+6%3A46&m=cpu_report&tab=ch&vn=&hide-hf=false&hreg[]=cp40%2805|06|07|13|14|15%29.ulsfo.wmnet [06:14:47] very interestingly, on cp4006 varnishd is using > 100% CPU [06:15:08] *now*, with ulsfo depooled [06:15:27] on the other hosts we're around 10% [06:16:45] I've stopped the systemtap probes, saving all logs so far into T144257-2016-08-31.log [06:16:46] T144257: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257 [06:57:37] so, I thought I would wake up to some answers, woke up and got more questions :) [07:00:06] 1. the probes didn't capture a single CL:zero rewrite, except for cnt_synth with error_code 751 (redirects, should be normal). Did the CL issue actually not happen at all or are the probes looking for the wrong stuff? [07:00:31] 2. why did cpu_sys raise so much? Is it stap fault? [07:00:51] 3. why is varnish-be on cp4006 using > 100% cpu with no traffic whatsoever? [07:02:43] (or almost none) [07:03:22] 4. other depooled varnishes are using ~10% cpu. Is that because of PURGEs? [08:18:06] hi ema [08:18:11] any indication that last night was network related? [08:18:16] i.e. should I investigate or not? :) [08:19:11] mark: I don't think it was network related, but it would be great to be 100% sure :) [08:19:17] ok [08:19:47] do you have access to our routers at all? [08:19:56] nope, only to librenms [08:20:02] perhaps we should change that :) [08:20:07] "show log messages" helps a lot ;) [08:20:15] nice [08:20:34] Sep 1 08:01:54 cr1-ulsfo /kernel: Percentage memory available(19)less than threshold(20 %)- 1860 [08:20:39] that's new and suspicious [08:21:15] oh that would be 20 minutes ago [08:21:32] it's ongoing [08:21:36] * ema found https://wikitech.wikimedia.org/wiki/Network_cheat_sheet [08:23:09] actually apparently it has been doing that for a long time [08:23:21] i think we're going to upgrade it soon anyway [08:25:25] yes, the zayo link went down shortly before the varnish issues [08:25:47] and afterwards [08:25:48] sigh [08:25:53] oh [08:26:10] at what time did it go down? [08:26:53] Sep 1 02:00:27 cr2-ulsfo rpd[1496]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 198.35.26.211 (xe-1/3/0.0) removed due to: the interface is purged [08:26:54] and [08:27:03] Sep 1 05:11:47 cr2-ulsfo rpd[1496]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 198.35.26.211 (xe-1/3/0.0) removed due to: the interface is purged [08:27:28] wow yeah that's when the cpu_system spike happened [08:27:33] 2:00 [08:28:16] a little later really, 02:06 [08:31:39] i'm going to email them [08:32:20] thanks [08:39:56] mark: yesterday I tried to sign up for maint-announce@wikimedia.org, it said my email address is banned :) [08:40:06] who should I ask to fix that? [08:40:12] huh [08:40:14] i'll have a look [08:40:30] are you on noc@ too? [08:40:33] the email alias [08:40:43] I'm on noc, yes [08:40:45] ok [08:46:37] back to questions 3. and 4. 
raised earlier this morning, I've tried firewalling port 3128 on cp4006, hence stopping purge traffic to varnish-be, and still CPU usage stayed high [08:47:25] what does perf top say? [08:47:58] done the same with port 3127, hence stopping PURGE traffic to varnish-fe, cpu usage went down immediately for the frontend process [08:49:10] mark: perf top says 56% varnishd, Symbol insfree [08:49:32] insert free? [08:49:50] possibly something memory management related, or the object caching store [08:49:55] yep [08:50:06] these are the top function calls, 10 seconds sample: [08:50:09] V1L_Write 98933 [08:50:09] http_VSLH 104501 [08:50:09] VSLbt 117124 [08:50:09] http_findhdr 117992 [08:50:11] VRT_count 140825 [08:50:14] http_IsHdr 143614 [08:50:17] VSLb 270089 [08:50:19] VSLbv 270089 [08:50:21] WS_Assert 322509 [08:50:24] vsl_sanity 696616 [08:50:40] function name - number of times it was called ^ [08:51:22] yeah but cpu time spent might be more relevant [08:51:33] indeed [08:51:34] i'm guessing it's spending a lot of time on some large data structure [08:52:53] on cp4005 varnishd does not show up in perf top [08:53:57] gdb time :-) [08:59:59] ema: I don't think maint-announce is actually used [09:00:34] hm or maybe it is [09:00:34] they've changed it [09:02:12] i've updated the ban regexp [09:02:12] could you try subscribing again? [09:02:17] it's now ^.*@(?!wikimedia\.org)$ [09:03:10] mark: "Your subscription request has been received" [09:03:12] thanks! [09:03:24] yes [09:08:51] so zayo is looking into the link issue [09:09:25] of course, it should reroute quickly, with minimal disruption [09:09:35] hopefully not something varnish chokes on [09:10:03] ulsfo should be depooled anyways [09:13:27] yeah let's keep it that way for now [09:13:32] telia is also doing some maintenance somewhere still [09:13:37] haven't looked into it but... [09:14:59] top 20 function calls by execution time: [09:15:01] Lck_CondWait 58097571870 [09:15:01] vbp_task 54183718172 [09:15:01] vbp_poke 54081903175 [09:15:01] VBT_Open 27135026499 [09:15:04] CNT_Request 6411209799 [09:15:06] HSH_DerefObjCore 5120889755 [09:15:09] ObjFreeObj 5009815446 [09:15:11] default_oc_freeobj 4987756585 [09:15:14] STV_free 4905842945 [09:15:16] smf_free 4868398340 [09:15:19] free_smf 4866368104 [09:15:21] insfree 4865238220 [09:15:24] SES_Proto_Req 2769713799 [09:15:26] HTTP1_Session 2761802957 [09:15:29] vcl_call_method 2543592845 [09:15:31] VSLb 1534146680 [09:15:34] VCL_recv_method 1453878823 [09:15:36] vbf_fetch_thread 1116163576 [09:15:39] cnt_vdp 1042150314 [09:15:41] VBF_Fetch 990553855 [09:17:12] so most likely lock contention? [09:18:42] so it seems, insfree is also there [09:18:46] do they use spinlocks or something similar? Used too much could lead to %sys increase [09:27:51] the code is super clear as always (and full of phk diplomacy quotes) [09:29:21] the vbp_task seems to be related to "Poll backends for collection of health statistics" [09:29:51] and uses a lock [09:30:34] cpu_sys is very low at the moment [09:54:07] ah sorry I thought it was one of the problems :) [09:54:46] I saw the ref to 100% CPU and misread [09:54:49] sorryyy [10:04:03] so looking at another cp host (cp4013) it seems like it's not calling insfree at all. cp4006 somehow got stuck calling it instead [10:06:03] that would be in bin/varnishd/storage/storage_file.c, "Insert/Remove from correct freelist" [10:23:56] so, the 4 questions were: [10:23:57] 09:00 < ema> 1. 
the probes didn't capture a single CL:zero rewrite, except for cnt_synth with error_code 751 (redirects, should be normal). Did the CL issue actually not happen at all or are the probes looking for the wrong stuff? [10:24:02] 09:00 < ema> 2. why did cpu_sys raise so much? Is it stap fault? [10:24:04] 09:00 < ema> 3. why is varnish-be on cp4006 using > 100% cpu with no traffic whatsoever? [10:24:07] 09:02 < ema> (or almost none) [10:24:09] answer to 4. is yes [10:24:11] 09:03 < ema> 4. other depooled varnishes are using ~10% cpu. Is that because of PURGEs? [10:25:05] the partial answer to 3. is that varnish-be on cp4006 seems to be up to no good at the moment and insfree is somehow responsible/involved [10:25:43] does insfree have a lock? [10:26:07] Lck_AssertHeld(&sc->mtx); [10:26:08] yes [10:27:36] maybe grep which other functions use that lock [10:28:53] remfree [10:29:09] and others in storage_persistent_silo.c [10:29:23] i thought we wouldn't use persistent anymore? [10:30:18] correct, we're not using it on v4 [10:30:25] oh the problem is on v3? [10:30:55] nope, it's on v4. insfree and remfree are in storage_file.c, which we're using [10:31:17] I just wanted to mention that the same lock is used in storage_persistent_silo.c too [10:32:03] ok :) [10:32:13] so, to recap: insfree and remfree are in storage_file.c, and both use that lock :) [10:33:47] mark: so xe-1/3/0 went down at 2:00, right? Shouldn't we see some gaps here then? https://librenms.wikimedia.org/graphs/to=1472725200/id=7518/type=port_bits/from=1472638800/ [10:34:20] yeah strange [10:34:23] perhaps it was very brief [10:34:36] that spike in in-traffic correlates nicely with the beginning of the cpu_sys spike [10:34:48] https://ganglia.wikimedia.org/latest/?r=custom&cs=8%2F31%2F2016+12%3A38&ce=9%2F1%2F2016+6%3A46&m=cpu_report&tab=ch&vn=&hide-hf=false&hreg[]=cp40%2805|06|07|13|14|15%29.ulsfo.wmnet [10:37:53] maybe put a breakpoint in either of those two functions with gdb and examine the length of that free list? [10:39:39] if it's e.g. very large due to some leak or whatever it might be looping over that for a long time [10:44:32] * mark bbl [11:24:51] lunchtime [11:33:39] ema: https://gerrit.wikimedia.org/r/307931 [12:34:16] that storage freeing is involved still harks back to the stuff yesterday about a half-invalidated busyobj being responsible for CL:0 in a race condition [12:34:45] systemtap could've altered the race, now instead of commonly getting a CL:0 on the race, we get a leak of supposedly-free objects on the race [12:45:36] moritzm: thanks! [12:47:07] bblack: today I've noticed that on machines with the new 4.4.2-3+wmf3 we couldn't load modules [12:47:14] moritzm: new kernel package rev I should update to for that? [12:47:17] eg: iptables -L doesn't work [12:47:23] heh [12:47:36] loading iptables modules on caches could get us in trouble anyways [12:47:47] but not great that modules don't load :) [12:48:16] I've downgraded the kernel on cp4006 to drop connections to varnish-be and see if the cpu usage would go down [12:48:37] that funnily didn't change anything cpu-wise [12:49:21] there's an ABI change which slipped into one of the 4.4.x stable updates, so some kernel modules will fail to load until the kernel is rebooted [12:49:41] sorry I just re-upgraded the installed one on cp4006 out of confusion. 
I saw an available package and thought that was the new-new one :) [12:49:47] the kernel.org handling of those changes is really messy, the fix is to bump the ABI in the kernel package in a rebuild [12:49:58] so that it turns to linux-image-4.4.0-2 [12:50:23] and thus becomes a co-installable kernel which modprobe/kmod treats as differently scoped [12:51:24] bblack: np, I'm done with that experiment anyways [12:51:58] cp4005's backend is still depooled, should we perhaps repool it? [12:53:07] ema: yeah [12:54:03] done [13:01:49] * ema is still puzzled about xe-1/3/0 going down at 02:00 [13:01:58] there is no gap in the graph here https://librenms.wikimedia.org/graphs/to=1472696400/id=7518/type=port_bits/from=1472686200/ [13:02:42] but a big spike at 02:05, correlating timewise with the cpu_sys increase [13:02:49] bblack: any ideas? ^ [13:05:59] no good ones lately about this :) [13:06:50] there's clearly a varnish bug involved, perhaps more than one. and we're finding interesting new ways to trigger it with different effects... [13:07:05] it makes me wonder if we're the first people ever to run v4 with real traffic :P [13:07:43] (on that note, perhaps we're not the first to run it with real traffic, but we could be the first that uses such a high volume of PURGE requests... even though most aren't for upload, just spamming the purge logic path that hard could be triggering) [13:08:23] it might be interesting for a further isolated test, to turn off vhtcpd on a test host and see if problems poof [13:08:56] but really at this point, sadly, I think we should scale back the install until we understand this better [13:09:22] as in, downgrade most of cp4 back to v3, go back to a single v4 node there we can depool [13:09:44] and start again with frontend-only pooled on the v4 node, and look again for these effects [13:10:31] the amount of errors, the unknown causes, the strange behaviors, and the #days we've been staring at this. it just doesn't add up to a state a whole DC should be in persistently for us. 
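[editor's note: a minimal sketch of the gdb approach mark suggested at 10:37, i.e. break in insfree()/remfree() and size the file-storage freelist on cp4006. It assumes the varnish debug symbols are available, that the backend's listen argument contains ":3128" (the port ema firewalled earlier), and that the last matching pid is the cache child rather than the management process; the structure to walk once stopped is whatever bin/varnishd/storage/storage_file.c defines in the 4.1 source, which was not confirmed in the channel. A breakpoint also pauses the (depooled) backend for as long as gdb sits at the prompt.]
    # attach to the varnishd backend cache child (pgrep pattern and the
    # "child = highest pid" heuristic are both approximations)
    pid=$(pgrep -f 'varnishd .*-a :3128' | tail -n1)
    sudo gdb -p "$pid" \
        -ex 'set pagination off' \
        -ex 'break insfree' \
        -ex 'continue' \
        -ex 'bt' \
        -ex 'print *sc'
    # at the prompt, walk the free-list buckets of the smf_sc that was passed
    # in (field names as defined in storage_file.c) and count the entries; a
    # very long list would back the "looping over a huge freelist" theory.
    # "detach" afterwards lets the child run again.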
[13:10:51] yeah, sad but true [13:13:10] one could blame the zayo link as well, but still stuff shouldn't explode so spectacularly :) [13:15:20] yeah [13:15:26] we just happen to be hitting good triggers :) [13:16:15] there's ways to mitigate purge volume for upload, too, but we can't do that for text, so it's good to solve the problem now [13:17:11] ema: https://librenms.wikimedia.org/graphs/to=1472735700/id=7518/type=port_bits/from=1472649300/ [13:17:16] (vhtcpd supports a regex filter at its own level that could cut out all requests that aren't for upload.wikimedia.org on the shared multicast stream, but the quoting of the regex through from puppet->systemd->args has always seemed daunting, and we should've split multicast to handle it instead anyways, but blah blah blah) [13:20:46] Sep 1 02:00:27 cr2-ulsfo mib2d[1553]: SNMP_TRAP_LINK_DOWN: ifIndex 574, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-1/3/0 [13:20:46] Sep 1 02:00:28 cr2-ulsfo mib2d[1553]: SNMP_TRAP_LINK_UP: ifIndex 574, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-1/3/0 [13:20:46] Sep 1 02:00:28 cr2-ulsfo mib2d[1553]: SNMP_TRAP_LINK_UP: ifIndex 557, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-1/3/0.0 [13:20:51] it was down for only 1s [13:21:13] so makes sense we don't see that in graphs [13:21:19] ok [13:21:48] Sep 1 05:11:47 cr2-ulsfo mib2d[1553]: SNMP_TRAP_LINK_DOWN: ifIndex 574, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-1/3/0 [13:21:48] Sep 1 05:11:50 cr2-ulsfo mib2d[1553]: SNMP_TRAP_LINK_UP: ifIndex 574, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-1/3/0 [13:21:49] 3s [13:21:50] etc [13:21:59] mark: I see. Do you think the spike around 02:05 is unrelated to the link going down? [13:22:23] what does "In" mean there? codfw -> ulsfo? [13:23:26] in is from the perspective of the router [13:23:28] so incoming to ulsfo [13:23:48] from codfw [13:24:48] yes [13:25:00] on the OTHER router, this happened: [13:25:06] Sep 1 02:00:32 cr1-ulsfo jddosd[1496]: DDOS_PROTOCOL_VIOLATION_SET: Protocol TTL:aggregate is violated at fpc 0 for 2 times, started at 2016-09-01 02:00:31 UTC [13:25:06] Sep 1 02:01:56 cr1-ulsfo sshd[26782]: Accepted password for rancid from 2620:0:861:2:862b:2bff:fe51:91ff port 63371 ssh2 [13:25:06] Sep 1 02:01:59 cr1-ulsfo /kernel: Percentage memory available(17)less than threshold(20 %)- 1854 [13:25:06] Sep 1 02:02:04 cr1-ulsfo sshd[26782]: Received disconnect from 2620:0:861:2:862b:2bff:fe51:91ff: 11: disconnected by user [13:25:06] Sep 1 02:02:04 cr1-ulsfo inetd[1377]: /usr/sbin/sshd[26782]: exited, status 255 [13:25:07] Sep 1 02:05:32 cr1-ulsfo jddosd[1496]: DDOS_PROTOCOL_VIOLATION_CLEAR: Protocol TTL:aggregate has returned to normal. Violated at fpc 0 for 2 times, from 2016-09-01 02:00:31 UTC to 2016-09-01 02:00:31 UTC [13:25:52] that probably means, too many "ttl exceeded" messages received in those 5 minutes [13:26:46] possibly due to rerouting [13:27:18] hmmm [13:27:47] you wouldn't think the re-routing would cause a huge bump in TTL? did the packets hit a loop? 
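[editor's note: question 4 ("are the depooled varnishes burning ~10% cpu because of PURGEs?") was answered yes, and bblack is talking here about cutting purge volume for upload, so a hedged sketch of how one could measure that load on a cp host. The frontend instance name and the backend running as the default instance are assumptions about the cp-host setup; the varnishncsa/tcpdump invocations themselves are stock Varnish 4 and libpcap usage.]
    # PURGE requests reaching the backend varnishd in 30s (default instance assumed)
    timeout 30 varnishncsa -q 'ReqMethod eq "PURGE"' -F '%m %U' | wc -l
    # same for the frontend instance (assumed to be named "frontend")
    timeout 30 varnishncsa -n frontend -q 'ReqMethod eq "PURGE"' -F '%m %U' | wc -l
    # raw multicast HTCP arriving for vhtcpd (4827 is the standard HTCP port)
    timeout 30 tcpdump -ni any udp port 4827 2>/dev/null | wc -l
Comparing the HTCP count with the subset whose URLs are for upload.wikimedia.org would show roughly how much the regex filter bblack mentions could actually shave off.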
[13:28:01] yeah probably in a loop for some reason [13:28:03] briefly maybe [13:28:17] it says 2 times in 5s [13:28:28] one router being quicker than the other etc [13:28:40] all that traffic coming into ulsfo from codfw didn't hit cp4* machines though [13:28:42] rerouting is messy, honestly [13:29:14] bytes_in actually went down at that point on all cp4* hosts [13:29:37] so that could be a loop [13:39:15] would be handy if you could see snapshots of routing tables at all points eh [13:40:10] and see that both routers point at each other for a certain destination? :) [13:40:41] yes [13:40:50] which could definitely be the case for a couple seconds [13:44:46] bblack, mark: if the zayo madness is over we could repool ulsfo now, right? [13:45:08] well i have no indication that it is [13:45:11] they're working on it [13:45:18] i mean, it's up and all, but might flap occasionally [13:45:18] oh I thought it was over [13:45:39] last update I see is that they're planning maintenance for it [15:13:24] zayo just announced maintenance for tomorrow [16:14:20] mmh I've depooled cp4006 to downgrade it but it's getting some frontend requests (even though ulsfo is depooled, probably the usual broken clients) [16:14:47] well we have our own monitoring reqs that don't go away on DC-depool, too [16:14:56] no these are real clients [16:15:07] I think it could be a pybal issue [16:15:26] oh on pybal depool issue, yes [16:15:35] I mean the DC depool. we also have external browser-emulating monitoring [16:15:38] 3rd party [16:15:53] but I donno if it targets DCs outside of what geodns controls... [16:15:56] perhaps it's T134893 again [16:15:57] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893 [16:16:26] ema: that would explain 4006 gets reqs when supposedly pybal-depooled, but wouldn't explain ulsfo getting browser reqs in general (aside from straggler bad-dns clients) [16:16:40] ipvsadm -Ln on lvs4002 would tell you if the depool worked [16:17:45] bblack: yep, cp4006 is still pooled as far as LVS is concerned [16:18:39] what was the trick at this point? pybal restart? [16:20:08] yes [16:22:00] bblack: on lvs4002, and that's a safe operation right? [16:23:33] ema: relatively-speaking, yes. the traffic will blip over to lvs4004 and back while the BGP session is down [16:23:59] or you could take down pybal on 4004 first and hope static fallback works, but that's probably not the awesomest idea either :) [16:24:13] if it's still gdnsd-depooled none of it's terribly user-relevant [16:24:26] ok, restarting pybal on lvs4002 then [16:27:37] yeah that worked. This bug is bad though. [16:31:52] yeah :/ [16:32:32] it's marked high-prio in phab, that means it should magically resolve itself by someone else's effort right? :) [16:32:41] * bblack is snarky today heh [16:33:22] but yeah we have a few high-prio pybal issues [16:33:30] we really need to set some time aside to fix them somehow [16:34:09] I'm not fond of the notion of putting off important bugs to a future refactor that brings in more invasive changes, either [16:34:14] yep [16:34:16] IMHO, we need bugfix commits with the current architecture [16:35:39] uh while downgrading cp4006 varnish.service didn't start properly. I had to systemctl unmask it [16:36:02] * ema adds that to https://wikitech.wikimedia.org/wiki/Varnish#Downgrading_Varnish_4_to_Varnish_3 [16:36:26] nothing works today :D [16:38:30] it's been a rough week in general heh [16:38:35] they happen! :) [16:39:16] indeed! 
Next week every command will return 0. [16:41:00] moritzm: I could reboot cp4006 with 4.4.2-3+wmf3, is that a good idea or should I wait for a newer kernel? [16:44:03] bblack: current status is: puppet enabled on cp4005 (v4) and cp4006 (v3) [16:44:19] disabled on other cp4* upload hosts, all still running v4 [16:44:22] ok [16:44:45] so basically you want to repool first so that the caches refill in a normal fashion, instead of repooling to empty contents, right? [16:44:51] correct [16:45:08] I guess when we repool depends on zayo stability [16:45:27] if they're still flapping, maybe wait for tomorrow's maint, but if it seems stable for now, we could do the transition and worry about the maint after [16:50:44] * ema has to go [16:50:56] let's see the zayo situation tomorrow! bye :) [16:51:21] ok, bye :) [16:58:12] ema: rebooting is fine, the newer kernel is mostly for those systems that are not yet updated [18:36:23] 10Traffic, 06Operations, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2602326 (10BBlack) This ticket has dragged on and wavered off-topic considerably. We've been supporting chapoly ciphers for about a month now, so the base issue here is resolve... [18:36:30] 10Traffic, 06Operations, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2602327 (10BBlack) 05Open>03Resolved a:03BBlack [19:24:56] 10Traffic, 06Operations: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523#2602609 (10BBlack) [19:25:28] 10Traffic, 10DBA, 06Operations, 06Performance-Team, and 2 others: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2602624 (10aaron)
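[editor's note: closing the loop on the pybal surprise from 16:14-16:27 (T134893: depooled in etcd but still pooled in LVS), a sketch of the check and the workaround that were used. The only command actually quoted in the channel is "ipvsadm -Ln on lvs4002"; the IP lookup, the grep, and the exact pybal restart invocation are illustrative assumptions.]
    # on the LVS host: is the supposedly-depooled realserver still in the
    # kernel's IPVS table?
    ip=$(dig +short cp4006.ulsfo.wmnet)
    if sudo ipvsadm -Ln | grep -q "$ip"; then
        echo "cp4006 still pooled in IPVS despite the etcd depool"
    else
        echo "depool reached IPVS"
    fi
    # the workaround used above: restart pybal, accepting that its BGP session
    # drops briefly and traffic blips over to the backup LVS (lvs4004)
    sudo systemctl restart pybal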