[00:05:42] Traffic, Operations, Page-Previews, Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3200333 (Nirzar) @Gilles Few things. Please, let's not use design principles out of context. >When you load an article page, we...
[00:08:22] paravoid: nice! :)
[00:10:22] I was going to check if they need repooling, but they self-repooled on bootup
[00:10:27] they did?
[00:10:32] ouch!
[00:10:37] OCSPs were long expired
[00:10:41] probably because ema had pre-armed them to do so for the maintenance reboots, which they never came back from
[00:11:25] in hindsight, it's not an amazing system, the way we do that :)
[00:11:36] they have an outstanding alert each
[00:11:39] Check Varnish expiry mailbox lag
[00:11:42] UNKNOWN: cannot run varnishstat
[00:11:45] ok
[00:12:00] also, see the update on the PNG/JPG issue above :/
[00:12:25] the varnishstat thing is probably just another perms bug
[00:16:45] yeah
[00:16:56] those unknowns are fixed with: "chmod 644 /var/lib/varnish/*/*.vsm"
[00:17:10] fighting vsm perms seems to be an ever-ongoing battle :P
[00:18:03] paravoid: what about the PNG/JPG issue above?
[00:18:24] someone said that it's still happening above?
[00:18:29] nah
[00:18:34] they failed to clear their browser cache
[00:18:38] ah
[00:18:41] sorry didn't see the update
[00:18:47] only saw the first one, ok
[00:25:55] re the auto-repool, there's a few things to say about it I guess:
[00:26:36] 1) The technical logic is a bit simplistic. If you touch a certain file in /var, it arms a one-shot auto-self-repool on next bootup. Which, obviously, can be problematic in a situation like this where it fails to reboot then comes back days later
[00:27:19] (perhaps one hack against that is to expand that logic a bit and only trust the var-file if its mtime is less than X old, like maybe 1h)
[00:29:08] 2) We still would really like to have a feature upgrade for conftool depooling in general, where we move from a data model of pooled=yes|no to depooled=[reason1, reason2], where the pooled=yes state is implied by an empty set of reasons
[00:29:48] then auto-de/re-pool scripts around reboots might set and clear their own reason, but the admin might separately apply another reason like "iDRAC failed TNNNNN" to suppress actual repooling
[00:30:00] (and it's just good in the general case. wish puppet disables could stack up reasons like that too!)
[00:31:24] ha! that's such a neat idea
[00:32:03] hadn't thought of that, that's really smart
[00:34:32] 3) Probably we should fix things up such that nginx does not get started unless OCSP files are sane. Either make some kind of pre-start command in the unit file, or a dependency unit, or whatever, that checks them and/or executes the update script that normally goes off once an hour in cron
[00:34:59] across a range of scenarios, we basically don't ever want nginx to start without good OCSPs. and if it had failed to start, lvs would've healthchecked it down
[00:35:53] you could conceivably do a more complicated set of checks in pybal's healthchecks as well
[00:35:58] expired certs, OCSP etc.
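A rough sketch of the nginx pre-start OCSP sanity check floated above (point 3): a small script that a unit-file pre-start hook or dependency unit could run, refusing to let nginx start if any staple is missing, unreadable, or expired. The staple path is illustrative (the real location and hourly update script aren't named in the log) and it assumes the Python "cryptography" package; this is a sketch of the idea, not the deployed mechanism.

    # Pre-start sanity check sketch: exit non-zero if any OCSP staple file is
    # missing, unparseable, non-successful, or past its next_update time.
    # STAPLE_GLOB is a hypothetical location, not the real one.
    import glob
    import sys
    from datetime import datetime, timezone
    from cryptography.x509 import ocsp

    STAPLE_GLOB = "/var/cache/ocsp/*.ocsp"   # illustrative path

    def staple_ok(path):
        try:
            with open(path, "rb") as f:
                resp = ocsp.load_der_ocsp_response(f.read())
        except (OSError, ValueError):
            return False
        if resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
            return False
        if resp.next_update is None:
            return False
        # next_update marks the end of the response's validity window
        return resp.next_update.replace(tzinfo=timezone.utc) > datetime.now(timezone.utc)

    if __name__ == "__main__":
        files = glob.glob(STAPLE_GLOB)
        sys.exit(0 if files and all(staple_ok(p) for p in files) else 1)

Wired in via something like an ExecStartPre= command (or the separate dependency unit suggested above), a failing check keeps nginx down, and LVS health checks would then depool the host rather than let it serve with stale staples.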
[00:36:25] hmmm
[00:36:28] although by that point other stuff on the host (varnish config for example) will probably also be misconfigured/stale and pybal can't check for that
[00:36:55] we could even write a new healthcheck for pybal that just polls icinga for the host the service is on and fails if they have any outstanding icinga non-OK states :)
[00:37:12] it would need to fail gracefully (not set the host down in ipvs) if it can't contact icinga at all, though
[00:37:56] the icinga healthcheck could prevent all kinds of strange one-off cases we wouldn't think of
[00:38:12] like not pooling a service if NTP is broked and the system time is bad, which might cause who knows what unpredictable fallout
[00:39:08] yeah we get tons of random icinga alerts
[00:39:14] and it's easy to break one of them too
[00:39:18] well :P
[00:39:42] I think the only ones that would be noisy spam cases for cp* these days would be:
[00:39:42] puppetmaster fails -> puppet runs fail -> varnish fails
[00:39:47] that's quite the cascading failure
[00:40:02] 1) Random puppetfails (which are still kind of a wtf to me, can we not fix those?)
[00:40:21] 2) IPsec fails on all hosts when one host goes down
[00:40:26] it's a wtf to me too, I'll ask someone to take care of that in a couple of weeks
[00:40:47] but besides the random ones, you can easily imagine just puppet failing for whatever reason
[00:40:54] 3) the current mailbox lag alerts, which mark CRIT when we're really not sure they always mean CRIT
[00:41:00] yeah
[00:41:04] a wrong change pushed to some unimportant place in the tree, the puppetmaster failing for whatever reason etc.
[00:41:08] but other checks not so much
[00:41:26] hrm
[00:41:27] maybe a whitelist of icinga services to check on the host for hard-fail state
[00:41:32] well there's a race with check_dpkg and apt-get :)
[00:42:09] (check_dpkg warns on half-configured packages, which is an intermediate state when upgrading packages)
[00:42:19] yeah
[00:42:25] yeah a whitelist of services may make more sense
[00:42:29] still, a bit scary :)
[00:42:49] the amount of racing that goes on around puppet and/or apt is super-annoying in general
[00:43:08] :/
[00:43:24] even just puppet with itself and manual actions, and the unreliability of basic disable/enable, etc
[00:44:02] it would be nicer if the agent ran as a single daemon with consistent state, but which only executes an agent run when asked to by some CLI command that pings it over a socket or something
[00:44:31] the cron could do that, and en/dis-able state could be managed by that daemon inside itself, and then it also needs the disable-reasons-stack upgrade
[00:44:44] and the ability to enable->run in a single atomic operation
[00:44:53] and all the fancy things we've built scripts around it to handle :P
[00:44:58] puppet can run as a daemon as well, and executing it on demand is called "kick"
[00:45:09] the daemon part is the default even I think
[00:45:13] I remember using it in 0.24 or so
[00:45:17] it was horrible
[00:45:22] hmmm yeah, I imagine it would be
[00:45:28] it got stuck very often, it memory leaked etc.
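A sketch of the icinga-driven healthcheck semantics discussed above, with the service whitelist and the fail-open behaviour. Nothing here is a real PyBal monitor or a real Icinga API: the endpoint, response shape, and service names are invented purely to illustrate the logic.

    # Illustration only: poll a (hypothetical) per-host Icinga status endpoint and
    # report unhealthy only when a whitelisted service is in a hard non-OK state.
    # If Icinga itself is unreachable, report healthy (fail open) so a monitoring
    # outage never takes the whole service down in ipvs.
    import json
    import urllib.request

    SERVICE_WHITELIST = {"HTTPS", "Varnish HTTP backend", "NTP"}   # illustrative names

    def host_healthy(icinga_status_url, host):
        try:
            with urllib.request.urlopen(f"{icinga_status_url}?host={host}", timeout=5) as r:
                services = json.load(r)   # assumed shape: [{"name", "state", "hard"}, ...]
        except Exception:
            return True                   # can't reach icinga: fail open
        return not any(
            s["name"] in SERVICE_WHITELIST and s["hard"] and s["state"] != "OK"
            for s in services
        )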
[00:45:29] but it doesn't have to be, like so many things :)
[00:45:53] I think at some point it was memory leaking so much they replaced it with something small that execve()d the agent
[00:45:57] at which point, I'd rather have cron :)
[00:46:02] everything about how the agent lock/state files and disable/enable works, and how exit-codes work, etc
[00:46:07] it all sucks
[00:46:19] and the random crons going off suck in that everything else has to work around them to be reliable
[00:46:53] (everything being manual work, but also any other scripts/crons we run which an active puppet agent might interfere with)
[00:48:15] ema: remind me we should digest the earlier part of this conversation at least into actionables in phab about preventing auto-repool days later with dead OCSP files
[00:48:21] :)
[00:48:40] haha
[00:49:39] ok I'm off
[00:49:41] bye :)
[00:57:34] bye :)
[09:01:06] Traffic, Operations, Page-Previews, Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3200916 (Gilles) I'm sorry to say, but what operating systems do, what other websites do, is completely irrelevant. You put way too...
[10:07:37] bblack, paravoid: re: auto-repools and OCSP, I've filed T163541. Let me know if the summary of your discussion makes sense :)
[10:07:38] T163541: cache hosts should auto-repool iff OCSP files are sane - https://phabricator.wikimedia.org/T163541
[10:08:36] re: setting N=1024 as the small obj threshold, we have a 97.7% frontend hitrate on cp2011 for such objects, I guess we can safely proceed with https://gerrit.wikimedia.org/r/#/c/349389/
[13:12:17] netops, Operations: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3201340 (ayounsi) Telia ticket #00725688 opened.
[13:24:31] \o/ merging
[13:33:15] I guess once the dust settles on that change, doing a quick ban on "obj.status == 200 && obj.http.Content-Length < 1024" on the backends should clear the affected objects out of storage without having to wait days
[13:34:04] and then hopefully we see some positive effects, like fewer struct smf in bin0, and less propensity for mailbox lag
[13:34:14] heh I don't think we can use lt in the ban
[13:34:54] but yes we should certainly ban them :)
[13:35:16] just tried in vtc to confirm:
[13:35:17] ban(): expected conditional (~, !~, == or !=) got "<"
[13:35:35] regex then
[13:38:56] '^[1-9][0-9][0-9]$' || '^10[0-2][0-4]$' maybe?
[13:40:28] mmh no, that wouldn't match, e.g., 1005. coffee first :)
[13:47:25] ^10([01][0-9]|2[0-3])|[0-9]{3}$
[13:49:01] (and yeah technically that also matches nonsense leading-zero encodings like 009, but those shouldn't exist anyways and if they do we still ban them :P)
[13:52:10] oh that's still wrong
[13:52:22] ^10([01][0-9]|2[0-3])|[0-9]{1,3}$
[13:52:33] or if you want to be pedantic about leading zeros, then:
[13:52:46] ^10([01][0-9]|2[0-3])|[1-9][0-9]{0,2}$
[13:53:32] or if you assume the regex compiler is non-optimizing and you want it to run faster...
[14:00:12] ^1(0([01][0-9]|2[0-3])|[0-9]{0,2})|[2-9][0-9]{0,2}$
[14:00:31] is a start I think, but it could probably be improved in the early half and it's getting more annoying to mentally validate :)
[14:03:29] eh that has to be wrong
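Each candidate here is easy to settle by brute force, the same way the perl loop a bit further down does it. A minimal Python equivalent, shown with the simpler pattern the discussion eventually lands on, which must match exactly the integers 0-1023 (Content-Length values below the new small-object threshold):

    # Brute-force check: the pattern must match exactly the strings "0".."1023"
    # and nothing else in 0..100000 (mirrors the perl validation loop below).
    import re

    candidate = re.compile(r'^(0|10([01][0-9]|2[0-3])|[1-9][0-9]{0,2})$')

    matched = {n for n in range(100001) if candidate.match(str(n))}
    assert matched == set(range(1024)), sorted(matched.symmetric_difference(range(1024)))[:10]
    print("pattern matches exactly 0-1023")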
[14:10:27] maybe this is both more-optimal and correct?
[14:10:30] ^1(0([01][0-9]?|2[0-3]?|[3-9])|[1-9][0-9]?)|[2-9][0-9]{0,2}$
[14:10:57] (in that none of the alternations overlap, it's always a tree choice of distinct values for the next digit without re-winding or re-evaluating an already-evaluated position)
[14:15:01] mmh the last one doesn't seem to match 1 and 10 :)
[14:15:14] ^(0|1(0([01][0-9]?|2[0-3]?|[3-9])?|[1-9][0-9]?)?|[2-9][0-9]{0,2})$
[14:15:25] ^ manually validated that one with a perl loop over the values 0-100000 :)
[14:15:31] matches 0-1023
[14:16:14] cool!
[14:16:59] ^(0|10([01][0-9]|2[0-3])|[1-9][0-9]{0,2})$
[14:17:07] ^ this one validates the same too though and is far simpler to understand
[14:18:32] interestingly in perl the short one benchmarks faster
[14:18:37] (slightly)
[14:20:39] OK so obj.status == 200 && obj.http.Content-Length ~ '^(0|10([01][0-9]|2[0-3])|[1-9][0-9]{0,2})$'
[14:21:00] can I go ahead in codfw?
[14:21:16] yeah
[14:25:11] yeah it's consistent, it takes about 20% more runtime when I do millions of loops over the longer regex vs the shorter one in perl
[14:25:22] very strange!
[14:25:48] I mean, I would've been ok with near-identical times, as I figure perl has a regex optimizer that reduces them to the same thing
[14:26:17] but it's weird that the long one, which is more mechanically-straightforward, takes longer to execute even after they're both optimized.
[14:27:06] I guess that just goes to show all the usual rules about such things :P
[14:30:07] usual rules = keep it simple? :)
[14:30:39] ban on CL < 1024 started in codfw
[14:30:46] usual rules = optimization is hard, premature optimization is not worth the effort, only benchmarking can tell you, etc
[14:31:09] I wouldn't be shocked if the ban execution spikes mailbox lags a little
[14:31:44] maybe in this case usual rules also includes "don't think you understand how the regex engine works unless you wrote it"
[14:38:45] I wonder if obj has a field for the storage name?
[14:38:57] (it might've been nice to have this ban only actually iterate bin0)
[14:39:12] cp2002 finished
[14:39:48] they all did
[14:40:04] yeah
[14:40:16] what was the thing with ban.list entries with 'C' again?
[14:40:26] like 1492784964.136950 4161613 C
[14:40:37] no idea
[14:40:45] but when the regex disappears, it's done processing :)
[14:41:21] it's interesting that struct smf did drop off, but it did so around 14:00 (in between the merge and the ban)
[14:42:28] well on cp2002 anyways
[14:42:35] on cp2026 too
[14:43:05] oh it's in lag anyways, from before we started
[14:43:36] and 304s ramped off about then, it may be the point at which objects first started expiring after its last restart
[14:44:54] 6/10 are in mailbox lag crit currently anyways
[14:45:41] 2020 and 2011 only since recently though (probably started taking off post-merge or post-ban)
[14:47:06] still, fundamentally we're caching far fewer unique objects per backend daemon (and especially in bin0)
[14:47:23] right
[14:47:24] it should make things better in the long run, or at worst have little to no effect because we don't really understand all the drivers
[14:47:44] but that it could take a while for the pattern to settle into a new normal even after the ban seems reasonable
[14:49:09] I wish we had more ways to easily use different axes in our grafana dashboards
[14:49:24] e.g. varnish-machine-stats, you have to pick the source DC and a single host
[14:49:56] it would be nice to be able to also pick a whole cache cluster and optionally filter down by DC as well
[14:50:02] e.g. all codfw cache_upload stacked up
[14:50:48] I don't know if grafana even has the cache cluster metadata avail in that sense
[14:50:55] something like ganglia's "compare hosts" would be cool
[14:52:54] cp2002 is lagging and it's only been running for 2 days
[14:53:45] yeah :/
[14:54:15] well this is the first time we've had such a disparate pattern all in one place I think
[14:54:40] I tend to assume that within the US the access patterns are homogeneous-enough
[14:54:54] but between EU and Asia, probably quite different (different trending topics, interests, wiki languages, etc)
[14:55:12] right
[14:55:23] so yeah for codfw this is an entirely different pattern
[14:55:31] pre-switchover, the Asia upload pattern's primary backend cache is the (smaller) ulsfo backend set, then it misses to codfw (which has very low US load), then that misses to eqiad
[14:55:44] whereas esams has a large backend set for the EU, which misses straight to eqiad
[14:55:45] bblack: I've quickly tested enabling the multiselect for the $server template var in grafana, but unfortunately it's not that easy :)
[14:56:12] so the two patterns blend together in eqiad before, but only after sufficient chance to have their own large caches, so the actual mixing miss-rate at eqiad tends to be low
[14:56:17] I guess it will need some tweaking in the graphs for the queries
[14:56:51] now with ulsfo's small caches -> codfw and esams also missing to codfw (and all the US traffic into codfw directly too), we're seeing a whole global blend in play, which is a very different pattern
[14:57:03] probably far more unique objects above any arbitrary barrier of hotness
[15:00:40] now for something entirely different and a bit crazy: how about malloc+swap? have we ever discussed this?
[15:00:42] if we really need to mitigate that, we could undo the cache::route_table commit
[15:01:08] (we haven't seriously gone down that path (malloc+swap) even mentally, no)
[15:01:56] reverting this: https://gerrit.wikimedia.org/r/#/c/347613/ would re-insert the eqiad backends as a pattern-buffer between esams and codfw, and might ease the strain
[15:02:01] at the cost of not "really" not using eqiad
[15:02:23] you'd think since eqiad total storage size <= esams it would make no difference, but the two mitigating factors there are:
[15:02:44] (1) LRU pushes out different objects for each new object in eqiad than in esams, because one has 11 nodes and the other has 12 for the chashing
[15:02:59] (2) Differently timed weekly backend restarts missing different chunks of objects after each
[15:09:10] side note: the ban lasted ~10 minutes judging by cp2002's disk usage (14:30ish - 14:40ish)
[15:09:16] so I'd theorize that given the above, even though total storage in eqiad is slightly smaller than total storage in esams, re-injecting eqiad between esams and codfw would (after eqiad has a chance to warm up) reduce the absolute rate of "esams misses that reach codfw"
[15:10:29] I was thinking, eqiad has been depooled for ~3d, if (when) we do revert it might be wise to apply the change incrementally in esams?
[15:11:03] I don't think it makes a huge difference there
[15:11:24] either way, it's a buffer between a warm backend cache in esams and a warm backend cache in eqiad, so it should refill without much impact
[15:12:04] the idea on the switchback plan was to do that first, wait a day, then bring users back into eqiad caches (with a now kinda-warm (with a different pattern!) set of data from the esams miss-throughs)
[15:12:33] then a day later we see MW+services switch back to not relying on codfw caching, but everything should be warmed up already from the above
[15:12:36] to warm up eqiad a bit, makes sense
[15:13:28] I've gotta run out and take care of $things, bbl :)
[15:15:06] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3201613 (akosiaris)
[15:15:10] netops, Operations, ops-eqiad, Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 - https://phabricator.wikimedia.org/T163326#3201611 (akosiaris) Open>Resolved This was done as well. There is an action item discovered during this which would be to automa...
[15:15:10] have fun :)
[15:15:40] for when you're back, we haven't set keep back to 3d in upload yet (https://gerrit.wikimedia.org/r/#/c/348698/)
[15:16:53] not sure if it's wise to do that given the current state though
[15:28:38] (meaning that it might be good to wait a bit and evaluate the impact of the current experiment before changing conditions again)
[15:46:18] Traffic, Operations, Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3201700 (Whatamidoing-WMF) a: BBlack>None
[16:53:08] ugh, the varnish code is hard to follow
[16:53:48] so we're failing in VFP_GetStorage, cache/cache_fetch_proc.c:90
[16:55:03] which fails because of st being NULL in cache/cache_obj.c:306
[16:55:23] (which calls objallocwithnuke and so forth)
[16:55:43] now by staring at all this I can't help but wonder
[16:56:11] why do we have to fail with a 503? I mean it's sad that we can't store this object in cache, but hey
[17:00:19] there's also a fetch_chunksize setting that we might want to look into (perhaps?)
[17:01:08] sorry for saying stuff in random order :)
[17:01:45] <_joe_> 5xx are high on upload
[17:01:53] <_joe_> saw the alarm go off various times
[17:02:58] yeah, thanks _joe_
[17:03:49] _joe_: the 503 rate is following a similar pattern to yesterday, we have an experiment running that might (hopefully) improve the situation
[17:05:21] 1) the code path followed, confirmed with stap, is VFP_GetStorage -> ObjGetSpace -> objallocwithnuke. I haven't found out yet where VFP_GetStorage is called in the first place, perhaps we could patch that to avoid failing with 503s?
[17:07:14] 2) looking at some of the 503s, the failing bin is always either bin3 or bin4
[17:16:22] s/always/often/, there are bin0 failures too on cp2022
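As a rough way to quantify the storage-allocation 503s discussed at the end, something like the sketch below could count the relevant FetchError records from the shared memory log. It assumes Varnish 4.x varnishlog behaviour (-d to read the already-written records and exit, -q for a VSL query) and that the VFP_GetStorage failure path logs the FetchError message "Could not get storage"; both assumptions are worth confirming against the deployed varnish before trusting the numbers.

    # Sketch: count storage-allocation fetch errors currently in the VSL buffer.
    # Assumes the "Could not get storage" FetchError string and 4.x varnishlog flags.
    import subprocess

    cmd = ["varnishlog", "-d", "-g", "request",
           "-q", 'FetchError ~ "Could not get storage"']
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    failures = [line for line in out.splitlines() if "FetchError" in line]
    print(f"{len(failures)} storage allocation failures in the current VSL buffer")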