[00:05:42] Traffic, Operations, Page-Previews, Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3200333 (Nirzar) @Gilles Few things. Please, let's not use design principles out of context. >When you load an article page, we...
[00:08:22] paravoid: nice! :)
[00:10:22] I was going to check if they need repooling, but they self-repooled on bootup
[00:10:27] they did?
[00:10:32] ouch!
[00:10:37] OCSPs were long expired
[00:10:41] probably because ema had pre-armed them to do so for the maintenance reboots, which they never came back from
[00:11:25] in hindsight, it's not an amazing system, the way we do that :)
[00:11:36] they have an outstanding alert each
[00:11:39] Check Varnish expiry mailbox lag
[00:11:42] UNKNOWN: cannot run varnishstat
[00:11:45] ok
[00:12:00] also, see the update on the PNG/JPG issue above :/
[00:12:25] the varnishstat thing is probably just another perms bug
[00:16:45] yeah
[00:16:56] those unknowns are fixed with: "chmod 644 /var/lib/varnish/*/*.vsm"
[00:17:10] fighting vsm perms seems to be an ever-ongoing battle :P
[00:18:03] paravoid: what about the PNG/JPG issue above?
[00:18:24] someone said that it's still happening above?
[00:18:29] nah
[00:18:34] they failed to clear their browser cache
[00:18:38] ah
[00:18:41] sorry didn't see the update
[00:18:47] only saw the first one, ok
[00:25:55] re the auto-repool, there's a few things to say about it I guess:
[00:26:36] 1) The technical logic is a bit simplistic. If you touch a certain file in /var, it arms a one-shot auto-self-repool on next bootup. Which, obviously, can be problematic in a situation like this where it fails to reboot then comes back days later
[00:27:19] (perhaps one hack against that is to expand that logic a bit and only trust the var-file if its mtime is less than X old, like maybe 1h)
[00:29:08] 2) We still would really like to have a feature upgrade for conftool depooling in general, where we move from a data model of pooled=yes|no to depooled=[reason1, reason2], where the pooled=yes state is implied by an empty set of reasons
[00:29:48] then auto-de/re-pool scripts around reboots might set and clear their own reason, but the admin might separately apply another reason like "iDRAC failed TNNNNN" to suppress actual repooling
[00:30:00] (and it's just good in the general case. wish puppet disables could stack up reasons like that too!)
[00:31:24] ha! that's such a neat idea
[00:32:03] hadn't thought of that, that's really smart
[00:34:32] 3) Probably we should fix things up such that nginx does not get started unless OCSP files are sane. Either make some kind of pre-start command in the unit file, or a dependency unit, or whatever, that checks them and/or executes the update script that normally goes off once an hour in cron
[00:34:59] across a range of scenarios, we basically don't ever want nginx to start without good OCSPs. and if it had failed to start, lvs would've healthchecked it down
[00:35:53] you could conceivably do a more complicated set of checks in pybal's healthchecks as well
[00:35:58] expired certs, OCSP etc.
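A rough sketch of the nginx pre-start OCSP sanity check floated above (point 3): a small script that a unit-file pre-start hook or dependency unit could run, refusing to let nginx start if any staple is missing, unreadable, or expired. The staple path is illustrative (the real location and hourly update script aren't named in the log) and it assumes the Python "cryptography" package; this is a sketch of the idea, not the deployed mechanism.

    # Pre-start sanity check sketch: exit non-zero if any OCSP staple file is
    # missing, unparseable, non-successful, or past its next_update time.
    # STAPLE_GLOB is a hypothetical location, not the real one.
    import glob
    import sys
    from datetime import datetime, timezone
    from cryptography.x509 import ocsp

    STAPLE_GLOB = "/var/cache/ocsp/*.ocsp"   # illustrative path

    def staple_ok(path):
        try:
            with open(path, "rb") as f:
                resp = ocsp.load_der_ocsp_response(f.read())
        except (OSError, ValueError):
            return False
        if resp.response_status != ocsp.OCSPResponseStatus.SUCCESSFUL:
            return False
        if resp.next_update is None:
            return False
        # next_update marks the end of the response's validity window
        return resp.next_update.replace(tzinfo=timezone.utc) > datetime.now(timezone.utc)

    if __name__ == "__main__":
        files = glob.glob(STAPLE_GLOB)
        sys.exit(0 if files and all(staple_ok(p) for p in files) else 1)

Wired in via something like an ExecStartPre= command (or the separate dependency unit suggested above), a failing check keeps nginx down, and LVS health checks would then depool the host rather than let it serve with stale staples.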
[00:36:25] hmmm
[00:36:28] although by that point other stuff on the host (varnish config for example) will probably also be misconfigured/stale and pybal can't check for that
[00:36:55] we could even write a new healthcheck for pybal that just polls icinga for the host the service is on and fails if they have any outstanding icinga non-OK states :)
[00:37:12] it would need to fail gracefully (not set the host down in ipvs) if it can't contact icinga at all, though
[00:37:56] the icinga healthcheck could prevent all kinds of strange one-off cases we wouldn't think of
[00:38:12] like not pooling a service if NTP is broked and the system time is bad, which might cause who knows what unpredictable fallout
[00:39:08] yeah we get tons of random icinga alerts
[00:39:14] and it's easy to break one of them too
[00:39:18] well :P
[00:39:42] I think the only ones that would be noisy spam cases for cp* these days would be:
[00:39:42] puppetmaster fails -> puppet runs fail -> varnish fails
[00:39:47] that's quite the cascading failure
[00:40:02] 1) Random puppetfails (which are still kind of a wtf to me, can we not fix those?)
[00:40:21] 2) IPsec fails on all hosts when one host goes down
[00:40:26] it's a wtf to me too, I'll ask someone to take care of that in a couple of weeks
[00:40:47] but besides the random ones, you can easily imagine just puppet failing for whatever reason
[00:40:54] 3) the current mailbox lag alerts, which mark CRIT when we're really not sure they always mean CRIT
[00:41:00] yeah
[00:41:04] a wrong change pushed to some unimportant place in the tree, the puppetmaster failing for whatever reason etc.
[00:41:08] but other checks not so much
[00:41:26] hrm
[00:41:27] maybe a whitelist of icinga services to check on the host for hard-fail state
[00:41:32] well there's a race with check_dpkg and apt-get :)
[00:42:09] (check_dpkg warns on half-configured packages, which is an intermediate state when upgrading packages)
[00:42:19] yeah
[00:42:25] yeah a whitelist of services may make more sense
[00:42:29] still, a bit scary :)
[00:42:49] the amount of racing that goes on around puppet and/or apt is super-annoying in general
[00:43:08] :/
[00:43:24] even just puppet with itself and manual actions, and the unreliability of basic disable/enable, etc
[00:44:02] it would be nicer if the agent ran as a single daemon with consistent state, but which only executes an agent run when asked to by some CLI command that pings it over a socket or something
[00:44:31] the cron could do that, and en/dis-able state could be managed by that daemon inside itself, and then it also needs the disable-reasons-stack upgrade
[00:44:44] and the ability to enable->run in a single atomic operation
[00:44:53] and all the fancy things we've built scripts around it to handle :P
[00:44:58] puppet can run as a daemon as well, and executing it on demand is called "kick"
[00:45:09] the daemon part is the default even I think
[00:45:13] I remember using it in 0.24 or so
[00:45:17] it was horrible
[00:45:22] hmmm yeah, I imagine it would be
[00:45:28] it got stuck very often, it memory leaked etc.
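A sketch of the icinga-driven healthcheck semantics discussed above, with the service whitelist and the fail-open behaviour. Nothing here is a real PyBal monitor or a real Icinga API: the endpoint, response shape, and service names are invented purely to illustrate the logic.

    # Illustration only: poll a (hypothetical) per-host Icinga status endpoint and
    # report unhealthy only when a whitelisted service is in a hard non-OK state.
    # If Icinga itself is unreachable, report healthy (fail open) so a monitoring
    # outage never takes the whole service down in ipvs.
    import json
    import urllib.request

    SERVICE_WHITELIST = {"HTTPS", "Varnish HTTP backend", "NTP"}   # illustrative names

    def host_healthy(icinga_status_url, host):
        try:
            with urllib.request.urlopen(f"{icinga_status_url}?host={host}", timeout=5) as r:
                services = json.load(r)   # assumed shape: [{"name", "state", "hard"}, ...]
        except Exception:
            return True                   # can't reach icinga: fail open
        return not any(
            s["name"] in SERVICE_WHITELIST and s["hard"] and s["state"] != "OK"
            for s in services
        )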
[00:45:29] but it doesn't have to be, like so many things :)
[00:45:53] I think at some point it was memory leaking so much they replaced it with something small that execve()d the agent
[00:45:57] at which point, I'd rather have cron :)
[00:46:02] everything about how the agent lock/state files and disable/enable works, and how exit-codes work, etc
[00:46:07] it all sucks
[00:46:19] and the random crons going off suck in that everything else has to work around them to be reliable
[00:46:53] (everything being manual work, but also any other scripts/crons we run which an active puppet agent might interfere with)
[00:48:15] ema: remind me we should digest the earlier part of this conversation at least into actionables in phab about preventing auto-repool days later with dead OCSP files
[00:48:21] :)
[00:48:40] haha
[00:49:39] ok I'm off
[00:49:41] bye :)
[00:57:34] bye :)
[09:01:06] Traffic, Operations, Page-Previews, Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3200916 (Gilles) I'm sorry to say, but what operating systems do, what other websites do, is completely irrelevant. You put way too...
[10:07:37] bblack, paravoid: re: auto-repools and OCSP, I've filed T163541. Let me know if the summary of your discussion makes sense :)
[10:07:38] T163541: cache hosts should auto-repool iff OCSP files are sane - https://phabricator.wikimedia.org/T163541
[10:08:36] re: setting N=1024 as the small obj threshold, we have a 97.7% frontend hitrate on cp2011 for such objects, I guess we can safely proceed with https://gerrit.wikimedia.org/r/#/c/349389/
[13:12:17] netops, Operations: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3201340 (ayounsi) Telia ticket #00725688 opened.
[13:24:31] \o/ merging
[13:33:15] I guess once the dust settles on that change, doing a quick ban on "obj.status == 200 && obj.http.Content-Length < 1024" on the backends should clear the affected objects out of storage without having to wait days
[13:34:04] and then hopefully we see some positive effects, like fewer struct smf in bin0, and less propensity for mailbox lag
[13:34:14] heh I don't think we can use lt in the ban
[13:34:54] but yes we should certainly ban them :)
[13:35:16] just tried in vtc to confirm:
[13:35:17] ban(): expected conditional (~, !~, == or !=) got "<"
[13:35:35] regex then
[13:38:56] '^[1-9][0-9][0-9]$' || '^10[0-2][0-4]$' maybe?
[13:40:28] mmh no, that wouldn't match, e.g., 1005. coffee first :)
[13:47:25] ^10([01][0-9]|2[0-3])|[0-9]{3}$
[13:49:01] (and yeah technically that also matches nonsense leading-zero encodings like 009, but those shouldn't exist anyways and if they do we still ban them :P)
[13:52:10] oh that's still wrong
[13:52:22] ^10([01][0-9]|2[0-3])|[0-9]{1,3}$
[13:52:33] or if you want to be pedantic about leading zeros, then:
[13:52:46] ^10([01][0-9]|2[0-3])|[1-9][0-9]{0,2}$
[13:53:32] or if you assume the regex compiler is non-optimizing and you want it to run faster...
[14:00:12] ^1(0([01][0-9]|2[0-3])|[0-9]{0,2})|[2-9][0-9]{0,2}$
[14:00:31] is a start I think, but it could probably be improved in the early half and it's getting more annoying to mentally validate :)
[14:03:29] eh that has to be wrong
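Each candidate here is easy to settle by brute force, the same way the perl loop a bit further down does it. A minimal Python equivalent, shown with the simpler pattern the discussion eventually lands on, which must match exactly the integers 0-1023 (Content-Length values below the new small-object threshold):

    # Brute-force check: the pattern must match exactly the strings "0".."1023"
    # and nothing else in 0..100000 (mirrors the perl validation loop below).
    import re

    candidate = re.compile(r'^(0|10([01][0-9]|2[0-3])|[1-9][0-9]{0,2})$')

    matched = {n for n in range(100001) if candidate.match(str(n))}
    assert matched == set(range(1024)), sorted(matched.symmetric_difference(range(1024)))[:10]
    print("pattern matches exactly 0-1023")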
[14:10:27] maybe this is both more-optimal and correct?
[14:10:30] ^1(0([01][0-9]?|2[0-3]?|[3-9])|[1-9][0-9]?)|[2-9][0-9]{0,2}$
[14:10:57] (in that none of the alternations overlap, it's always a tree choice of distinct values for the next digit without re-winding or re-evaluating an already-evaluated position)
[14:15:01] mmh the last one doesn't seem to match 1 and 10 :)
[14:15:14] ^(0|1(0([01][0-9]?|2[0-3]?|[3-9])?|[1-9][0-9]?)?|[2-9][0-9]{0,2})$
[14:15:25] ^ manually validated that one with a perl loop over the values 0-100000 :)
[14:15:31] matches 0-1023
[14:16:14] cool!
[14:16:59] ^(0|10([01][0-9]|2[0-3])|[1-9][0-9]{0,2})$
[14:17:07] ^ this one validates the same too though and is far simpler to understand
[14:18:32] interestingly in perl the short one benchmarks faster
[14:18:37] (slightly)
[14:20:39] OK so obj.status == 200 && obj.http.Content-Length ~ '^(0|10([01][0-9]|2[0-3])|[1-9][0-9]{0,2})$'
[14:21:00] can I go ahead in codfw?
[14:21:16] yeah
[14:25:11] yeah it's consistent, it takes about 20% more runtime when I do millions of loops over the longer regex vs the shorter one in perl
[14:25:22] very strange!
[14:25:48] I mean, I would've been ok with near-identical times, as I figure perl has a regex optimizer that reduces them to the same thing
[14:26:17] but it's weird that the long one, which is more mechanically-straightforward, takes longer to execute even after they're both optimized.
[14:27:06] I guess that just goes to show all the usual rules about such things :P
[14:30:07] usual rules = keep it simple? :)
[14:30:39] ban on CL < 1024 started in codfw
[14:30:46] usual rules = optimization is hard, premature optimization is not worth the effort, only benchmarking can tell you, etc
[14:31:09] I wouldn't be shocked if the ban execution spikes mailbox lags a little
[14:31:44] maybe in this case usual rules also includes "don't think you understand how the regex engine works unless you wrote it"
[14:38:45] I wonder if obj has a field for the storage name?
[14:38:57] (it might've been nice to have this ban only actually iterate bin0)
[14:39:12] cp2002 finished
[14:39:48] they all did
[14:40:04] yeah
[14:40:16] what was the thing with ban.list entries with 'C' again?
[14:40:26] like 1492784964.136950 4161613 C
[14:40:37] no idea
[14:40:45] but when the regex disappears, it's done processing :)
[14:41:21] it's interesting that struct smf did drop off, but it did so around 14:00 (in between the merge and the ban)
[14:42:28] well on cp2002 anyways
[14:42:35] on cp2026 too
[14:43:05] oh it's in lag anyways, from before we started
[14:43:36] and 304s ramped off about then, it may be the point at which objects first started expiring after its last restart
[14:44:54] 6/10 are in mailbox lag crit currently anyways
[14:45:41] 2020 and 2011 only since recently though (probably started taking off post-merge or post-ban)
[14:47:06] still, fundamentally we're caching far fewer unique objects per backend daemon (and especially in bin0)
[14:47:23] right
[14:47:24] it should make things better in the long run, or at worst have little to no effect because we don't really understand all the drivers
[14:47:44] but that it could take a while for the pattern to settle into a new normal even after the ban seems reasonable
[14:49:09] I wish we had more ways to easily use different axes in our grafana dashboards
[14:49:24] e.g. varnish-machine-stats, you have to pick the source DC and a single host
[14:49:56] it would be nice to be able to also pick a whole cache cluster and optionally filter down by DC as well
[14:50:02] e.g. all codfw cache_upload stacked up
[14:50:48] I don't know if grafana even has the cache cluster metadata avail in that sense
[14:50:55] something like ganglia's "compare hosts" would be cool
[14:52:54] cp2002 is lagging and it's only been running for 2 days
[14:53:45] yeah :/
[14:54:15] well this is the first time we've had such a disparate pattern all in one place I think
[14:54:40] I tend to assume that within the US the access patterns are homogeneous-enough
[14:54:54] but between EU and Asia, probably quite different (different trending topics, interests, wiki languages, etc)
[14:55:12] right
[14:55:23] so yeah for codfw this is an entirely different pattern
[14:55:31] pre-switchover, the Asia upload pattern's primary backend cache is the (smaller) ulsfo backend set, then it misses to codfw (which has very low US load), then that misses to eqiad
[14:55:44] whereas esams has a large backend set for the EU, which misses straight to eqiad
[14:55:45] bblack: I've quickly tested enabling the multiselect for the $server template var in grafana, but unfortunately it's not that easy :)
[14:56:12] so the two patterns blend together in eqiad before, but only after sufficient chance to have their own large caches, so the actual mixing miss-rate at eqiad tends to be low
[14:56:17] I guess it will need some tweaking in the graphs for the queries
[14:56:51] now with ulsfo's small caches -> codfw and esams also missing to codfw (and all the US traffic into codfw directly too), we're seeing a whole global blend in play, which is a very different pattern
[14:57:03] probably far more unique objects above any arbitrary barrier of hotness
[15:00:40] now for something entirely different and a bit crazy: how about malloc+swap? have we ever discussed this?
[15:00:42] if we really need to mitigate that, we could undo the cache::route_table commit
[15:01:08] (we haven't seriously gone down that path (malloc+swap) even mentally, no)
[15:01:56] reverting this: https://gerrit.wikimedia.org/r/#/c/347613/ would re-insert the eqiad backends as a pattern-buffer between esams and codfw, and might ease the strain
[15:02:01] at the cost of not "really" not using eqiad
[15:02:23] you'd think since eqiad total storage size <= esams it would make no difference, but the two mitigating factors there are:
[15:02:44] (1) LRU pushes out different objects for each new object in eqiad than in esams, because one has 11 nodes and the other has 12 for the chashing
[15:02:59] (2) Differently timed weekly backend restarts missing different chunks of objects after each
[15:09:10] side note: the ban lasted ~10 minutes judging by cp2002's disk usage (14:30ish - 14:40ish)
[15:09:16] so I'd theorize that given the above, even though total storage in eqiad is slightly smaller than total storage in esams, re-injecting eqiad between esams and codfw would (after eqiad has a chance to warm up) reduce the absolute rate of "esams misses that reach codfw"
[15:10:29] I was thinking, eqiad has been depooled for ~3d, if (when) we do revert it might be wise to apply the change incrementally in esams?
[15:11:03] I don't think it makes a huge difference there
[15:11:24] either way, it's a buffer between a warm backend cache in esams and a warm backend cache in eqiad, so it should refill without much impact
[15:12:04] the idea on the switchback plan was to do that first, wait a day, then bring users back into eqiad caches (with a now kinda-warm (with a different pattern!) set of data from the esams miss-throughs)
[15:12:33] then a day later we see MW+services switch back to not relying on codfw caching, but everything should be warmed up already from the above
[15:12:36] to warm up eqiad a bit, makes sense
[15:13:28] I've gotta run out and take care of $things, bbl :)
[15:15:06] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3201613 (akosiaris)
[15:15:10] netops, Operations, ops-eqiad, Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 - https://phabricator.wikimedia.org/T163326#3201611 (akosiaris) Open>Resolved This was done as well. There is an action item discovered during this which would be to automa...
[15:15:10] have fun :)
[15:15:40] for when you're back, we haven't set keep back to 3d in upload yet (https://gerrit.wikimedia.org/r/#/c/348698/)
[15:16:53] not sure if it's wise to do that given the current state though
[15:28:38] (meaning that it might be good to wait a bit and evaluate the impact of the current experiment before changing conditions again)
[15:46:18] Traffic, Operations, Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3201700 (Whatamidoing-WMF) a: BBlack>None
[16:53:08] ugh, the varnish code is hard to follow
[16:53:48] so we're failing in VFP_GetStorage, cache/cache_fetch_proc.c:90
[16:55:03] which fails because of st being NULL in cache/cache_obj.c:306
[16:55:23] (which calls objallocwithnuke and so forth)
[16:55:43] now by staring at all this I can't help but wonder
[16:56:11] why do we have to fail with a 503? I mean it's sad that we can't store this object in cache, but hey
[17:00:19] there's also a fetch_chunksize setting that we might want to look into (perhaps?)
[17:01:08] sorry for saying stuff in random order :)
[17:01:45] <_joe_> 5xx are high on upload
[17:01:53] <_joe_> saw the alarm go off various times
[17:02:58] yeah, thanks _joe_
[17:03:49] _joe_: the 503 rate is following a similar pattern to yesterday, we have an experiment running that might (hopefully) improve the situation
[17:05:21] 1) the code path followed, confirmed with stap, is VFP_GetStorage -> ObjGetSpace -> objallocwithnuke. I haven't found out yet where VFP_GetStorage is called in the first place, perhaps we could patch that to avoid failing with 503s?
[17:07:14] 2) looking at some of the 503s, the failing bin is always either bin3 or bin4
[17:16:22] s/always/often/, there are bin0 failures too on cp2022
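As a rough way to quantify the storage-allocation 503s discussed at the end, something like the sketch below could count the relevant FetchError records from the shared memory log. It assumes Varnish 4.x varnishlog behaviour (-d to read the already-written records and exit, -q for a VSL query) and that the VFP_GetStorage failure path logs the FetchError message "Could not get storage"; both assumptions are worth confirming against the deployed varnish before trusting the numbers.

    # Sketch: count storage-allocation fetch errors currently in the VSL buffer.
    # Assumes the "Could not get storage" FetchError string and 4.x varnishlog flags.
    import subprocess

    cmd = ["varnishlog", "-d", "-g", "request",
           "-q", 'FetchError ~ "Could not get storage"']
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    failures = [line for line in out.splitlines() if "FetchError" in line]
    print(f"{len(failures)} storage allocation failures in the current VSL buffer")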