[13:53:01] the ipv6 internet is just not as reliable as the ipv4 one https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-target_site=ulsfo&var-ip_version=All&var-country_code=All&var-asn=All&from=now-7d&to=now
[14:35:13] elukey: looks like things finally quieted down on the memcache front https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-2d&to=now the shape of the traffic does suggest to me not reparse jobs, but rather, pages needing to be re-rendered after being (for the first time in months?) effectively purged from the CDN quickly
[14:41:22] interestingly, we also had a bunch of micro-bursts of saturation on mc1027, despite the usual TX metric reporting 440Mbps or less
[16:26:19] <_joe_> cdanis: we purged 4.8M pages out of 6M?
[16:26:22] <_joe_> something like that
[16:26:51] <_joe_> also, we purged a lot of broken urls :D
[16:55:05] indeed
[16:55:10] I think more like 4.5M
[16:55:12] but... still.
[16:56:23] vgutierrez: more fun times with ats-tls?
[16:56:45] yey
[16:57:10] I need to compile it with ASAN support and see what's going on
[16:57:50] ugh :(
[17:38:51] cdanis: one question though - why did we see that sqlblob thing hammering mc1028? Just because it was contained in a ton of pages? (the ones getting re-rendered)
[17:39:31] also the rise in bw usage from the slab point of view matches with the change in the module
[17:39:36] elukey: yeah, the template in question was the Lua for one of the two styles of citations used on enwiki
[17:39:48] so I think what's different about this event is that purges in esams were working
[17:40:07] that blob is transcluded in something like 4.5M enwiki pages
[17:42:16] cdanis: so to understand, the change in the lua template/module caused all those 4.5M pages to be purged and hence re-rendered, hammering memcached
[17:44:29] I see that https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/587902/ got merged, after the next train we'll have some new mw metrics to play with :)
[17:44:43] elukey: that's my theory, given the shape of the tx traffic from that host
[17:44:51] I don't know how to actually verify this, ofc
[17:45:26] but if it was jobqueue stuff I'd expect something flatter, not something that looked diurnal
[17:45:40] makes a lot of sense
[17:46:10] yeah here look at this --
[17:46:12] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&from=now-7d&to=now
[17:46:34] network rx on the appserver machines spikes around 12:50UTC that day, which is the right time
[17:46:44] and then tapers off to normal nadir
[17:47:47] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All
[17:47:55] there's some effect on the apiservers, but not nearly as pronounced
[17:48:09] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=84&fullscreen&orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jobrunner&var-instance=All
[17:48:15] and basically no effect on the jobrunners, which is actually kind of odd
[17:48:24] but anyway, I think that mostly confirms this theory
[17:50:50] it was strange this time since the key was relatively small, like 80k
[17:51:01] in the past I saw problems with 200k+
[17:51:34] (or even jumbo keys like 400/500K)
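Note: to put rough numbers on the hot-key discussion above, a back-of-envelope Python sketch of the TX bandwidth a single ~80 KB value can generate from one memcached host. The ~80 KB value size comes from the conversation; the request rates and the 1 Gbps NIC capacity are illustrative assumptions, not measurements from this incident.

    # Back-of-envelope: egress generated by a single hot memcached key.
    # The ~80 KB value size is from the discussion above; the request rates
    # and the 1 Gbps NIC figure are assumptions for illustration only.

    VALUE_BYTES = 80 * 1024          # ~80 KB hot value (e.g. the citation module blob)
    NIC_CAPACITY_MBPS = 1000         # assumed 1 Gbps NIC on the mc10xx host

    def tx_mbps(requests_per_second: int, value_bytes: int = VALUE_BYTES) -> float:
        """Egress in Mbps if every request returns the full value."""
        return requests_per_second * value_bytes * 8 / 1_000_000

    for rps in (100, 500, 700, 1_000, 2_000):
        mbps = tx_mbps(rps)
        print(f"{rps:>5} req/s -> {mbps:8.1f} Mbps "
              f"({mbps / NIC_CAPACITY_MBPS:6.1%} of an assumed 1 Gbps NIC)")

Under these assumptions, roughly 700 req/s for an 80 KB value already lands around the ~440 Mbps TX level mentioned above, and short bursts beyond that can saturate the link without moving the averaged TX metric much, which would fit the reported micro-bursts of saturation.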
[17:59:38] hm, I'm not finding anything that looks super-obvious in either CDN cache hit rate (in either varnish or ats-be stats) or in parsercache (there's *some* effect on pc hit rate, but not hugely dramatic)
[18:10:09] (afk! will read later :)
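Note: for double-checking the diurnal-vs-flat argument above outside Grafana, a minimal sketch for pulling the appserver receive-bandwidth series straight from Prometheus. The base URL below is a placeholder, and the metric/label names (node_network_receive_bytes_total with a cluster="appserver" label) are assumptions based on standard node_exporter naming and the dashboard URLs above.

    # Minimal sketch: fetch cluster-wide appserver network rx from Prometheus
    # and print it as Mbps over time, so the shape (spike around the purge,
    # then a taper back to the normal diurnal nadir) can be eyeballed.
    # PROM is a placeholder URL; metric and label names are assumptions.

    import datetime
    import requests

    PROM = "http://prometheus.example.org/ops"   # placeholder, not the real endpoint
    QUERY = ('sum(rate(node_network_receive_bytes_total'
             '{cluster="appserver",device!="lo"}[5m])) * 8')

    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(days=7)

    resp = requests.get(
        f"{PROM}/api/v1/query_range",
        params={
            "query": QUERY,
            "start": start.timestamp(),
            "end": end.timestamp(),
            "step": "15m",
        },
        timeout=30,
    )
    resp.raise_for_status()

    # One series (the cluster-wide sum); print timestamp -> Mbps.
    for ts, value in resp.json()["data"]["result"][0]["values"]:
        when = datetime.datetime.fromtimestamp(ts, datetime.timezone.utc)
        print(f"{when:%Y-%m-%d %H:%M} {float(value) / 1e6:10.1f} Mbps")

The same query with cluster="api_appserver" or cluster="jobrunner" would give the comparison series discussed above, assuming those label values match the dashboards.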