[00:35:35] 10netops, 10Operations: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10ayounsi)
[01:12:05] 10netops, 10Operations: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (10ayounsi) a:03ayounsi Not answering the question but adding data points. We periodically have similar links flap or cut, and this one doesn't seem different from the others. Lo...
[08:11:25] 10Certcentral, 10Patch-For-Review: store non-config files in /var/lib/certcentral - https://phabricator.wikimedia.org/T209475 (10Vgutierrez) 05Open>03Resolved
[11:23:26] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Urbanecm)
[11:44:41] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) There is no rewrite rule for zh-yue wiktionary; there is one for yue wikipedia. See line 97: https://gerrit.wikimedia.org/r/plugins/gitiles/oper...
[11:44:59] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) p:05Triage>03Normal
[13:57:09] bblack: I think we should lower the threshold as low as possible as long as it's not accelerating the swift storage growth rate visibly. if that means applying it altogether, we can move on to doing it on the backends
[13:58:39] and yes, if we can get rid of the threshold, we can remove the restart, which is for the best
[13:59:01] at some low threshold, it's not worth all the restarting basically
[13:59:35] the ~1-2% webp you were reporting the other day seems odd to me, given how many fetches should be for 100+ hit objects.
[13:59:38] well, there's also the thumbor capacity and swift growth rate to consider
[13:59:53] which makes me wonder if maybe a whole lot of thumbs are over the 256K mark for non-frontend
[13:59:54] maybe not worthwhile from varnish's perspective, but could be a useful gate
[14:00:29] I think what we're seeing is how diverse our traffic is
[14:00:36] it's a very long tail of thumbnails
[14:01:10] I suspect that without a threshold at all on the frontends we would see a significant percentage
[14:01:23] but also maybe a tipping point for swift growth :)
[14:01:50] maybe the next step should be halving the threshold multiple times, to get there more progressively?
[14:02:04] if that
[14:02:11] I think from 100 on down, things might move quickly
[14:02:34] ok, I can prepare a lot of intermediate patches
[14:02:46] and then we'd deploy them in succession as soon as thumbor recovers from the spike
[14:02:47] maybe try 75 and see how much diff it makes?
[14:03:36] like 100 -> 90 -> 80 -> 70, and potentially, if it goes as well as the last one, we'd do them in a row the same day, pushing the next one as soon as the thumbor graphs have flattened
[14:03:46] doing it for the backend-only objects (>256K size as png/jpg) with some kind of hit-conditional system like we have now would be very tricky
[14:04:10] size isn't available at the point of the VCL?
[14:04:17] to determine that it's a backend-only object
[14:04:38] yeah but backend obj.hits isn't readily available in the FE right now
[14:05:03] wouldn't that logic run on the backend?
[14:05:49] if the logic ran solely in the backend, it would run the risk of polluting the frontend cache with webp under jpeg URLs
[14:06:08] I guess we can also put the opposite size-limiter there
[14:06:10] not if you filter out objects that aren't kept in the frontend :)
[14:06:16] right
[14:06:36] but then there are some complexities to think about there, size isn't always an exact science, unfortunately
[14:06:58] if anything leaked through a mismatch of the size filters, it would pollute
[14:07:24] a backend-restarted webp could contain a header to inform the frontend
[14:07:29] and size filtering at different stages of the pipeline... might get screwed up by varnish temporarily due to gzip, although I guess these content-types shouldn't be gzipped
[14:07:37] I mean it already should, with content-type
[14:07:48] true enough, I guess
[14:07:59] well
[14:08:12] it's all about the size filtering being sane I think
[14:08:25] more complicated for sure, but we don't have to worry about that until we've established that we're going to remove the current frontend threshold
[14:08:34] once the backend has done a rewritten fetch for .webp, the frontend will still see the object as a .jpg URL
[14:09:14] anyways
[14:09:30] ah yes, so even for the case where it's on purpose, you do need to rewrite the url on the frontend at the same time as the backend
[14:09:46] so the main concern with turning this all the way on (no thresholding) is: (1) short-term spikes in thumbor load + (2) long-term growth in swift thumb storage?
[14:10:16] mostly the latter, the former is easy to manage
[14:10:16] did swift storage already level out from any increase so far?
[14:10:53] anyways
[14:11:19] it doesn't look like the angle of growth changed at this point
[14:11:20] so long as we have some kind of obj.hits filter on the current code (even obj.hits > 1), it's frontend-only and <256KB objects only
[14:11:53] if we switch away from that to just a straight rewrite with no obj.hits filter and no restarts, it will affect the >=256KB objects as well
[14:11:59] which might be a big turning point in load/storage
[14:12:55] I think before we move on to that, we would make swift auto-cleanup based on edge traffic a priority
[14:12:57] so yeah, we'll have to come up with some kind of scheme for doing that in a less-dramatic way
[14:13:14] does swift have some idea of access times that makes sense?
[14:13:32] (object last fetched by varnish X days ago)
[14:14:16] maybe, but that alone isn't an indicator of how hot the object really is. popular things can fall out of varnish too if they have a brief slump in popularity
[14:14:28] popular things get purged, etc.
[14:14:55] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Hydriz) The WikimediaIncubator extension is working as intended, as it is supposed to direct users to the main page of the wiki. I think a rewrite rule in `...
[14:15:06] no, but if swift can say definitively "nothing fetched this object in the last 365 days", that's a pretty solid indicator of extreme coldness
[14:15:16] and there might be a *lot* of current thumb storage in that kind of category
[14:15:19] I don't think it can do that, as far as I'm aware
[14:15:39] godog ?
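
(A minimal Python sketch of the gate being debated between 13:57 and 14:12 above. Illustrative only: the real logic is VCL on the cache frontends, and the function name, the Accept check and the exact comparisons here are assumptions; only the obj.hits threshold and the ~256K frontend cutoff come from the conversation.)

    # Illustrative sketch, not the production VCL.
    FRONTEND_MAX_BYTES = 256 * 1024   # objects >= ~256K are not kept in the frontend cache
    HIT_THRESHOLD = 100               # current obj.hits threshold; 0 would mean "no threshold"

    def should_rewrite_to_webp(obj_hits, obj_bytes, accepts_webp):
        """Decide whether a cached .jpg/.png thumb fetch should be restarted as .webp."""
        if not accepts_webp:
            return False
        if obj_bytes >= FRONTEND_MAX_BYTES:
            # backend-only object: rewriting only on the backend risks storing
            # webp bodies under .jpg URLs in the frontend (the pollution concern above)
            return False
        return obj_hits >= HIT_THRESHOLD

    print(should_rewrite_to_webp(150, 80 * 1024, True))   # True: hot and small enough
    print(should_rewrite_to_webp(3, 80 * 1024, True))     # False under a threshold of 100
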
[14:15:44] even a one-shot cleanup would be valuable at this point, IMHO
[14:16:18] there's that possibility, just a clean slate, ensuring that we only keep things that don't have an original anymore, for example
[14:16:51] if done slowly enough that thumbor can cope with the extra work
[14:16:51] maybe one way to do it would be to pull the last 90 days of raw reqs to cache_upload from the analytics dataset, uniq'd, and then delete everything not in that list (also filter for creation older than 90d too, so we don't delete brand-new objects that happened while we were processing the data)
[14:16:52] gilles: no such thing inside swift itself, we have swift access logs for the last 90d though
[14:17:05] or that, even better than analytics
[14:17:23] if we're going to do something like that, I'd rather base it on the varnish logs
[14:17:41] then it would really match what people are requesting, regardless of varnish cache status
[14:17:47] varnish can't keep objects afloat on its own. If it hasn't been fetched from swift in 90d, it hasn't been fetched from varnish either.
[14:18:01] the upload caches have 24-hour TTL caps at each layer
[14:18:33] well, caveat :)
[14:18:47] if a 304 refresh doesn't count as an access in swift's access log, then we'd have a problem
[14:19:00] sounds good to me
[14:19:04] it can keep an object afloat for longer times, if it gets 304 refreshes from swift
[14:19:48] I'll check real quick if 304s are in there too
[14:20:29] I'm about to get on a flight, btw, so I might vanish without notice soon
[14:20:50] in any case, I think if we delete all objects that haven't been fetched (via varnish, which I think is implied by swift access too) in the most-recent 90d... (a) it would clean up a ton of storage + (b) there will be some increase in new thumbnailing to replace some of them
[14:21:00] but I think (b) would have to be fairly small and slow over time
[14:21:32] we can control the rate of the cleanup. presumably it would be something very stupid that iterates over objects, given how limited batch tools are in swift
[14:21:37] right
[14:21:44] 10Traffic, 10Continuous-Integration-Infrastructure, 10Operations: trafficserver debian-glue builds failing on integration-slave-jessie-1001: No space left on device - https://phabricator.wikimedia.org/T209703 (10ema)
[14:22:00] 10Traffic, 10Continuous-Integration-Infrastructure, 10Operations: trafficserver debian-glue builds failing on integration-slave-jessie-1001: No space left on device - https://phabricator.wikimedia.org/T209703 (10ema) p:05Triage>03Normal
[14:22:09] if we can make something that gets rid of crap faster than we add webps, all without overwhelming thumbor, we're good
[14:22:23] doing it on an ongoing basis is much trickier
[14:22:28] and without overwhelming swift either
[14:22:32] but one-shot could be worth a lot as a starting point
[14:22:54] well, just keep a list of the recently accessed urls that has expiring entries
[14:23:19] it'd be interesting to know first of all how large this list would be, whether it would fit in memory
[14:23:45] if we're talking megabytes, shove that into a memcache with expiries and there you have your rolling list of recently accessed thumbs
[14:24:32] I misremembered, for swift GETs we have only the last week; 304s are in there though
[14:24:56] presumably if we build this with a memory, it can remember for as long as we want
[14:25:08] a nicer pathway would be to store an efficient approximate last-access time with the objects themselves (don't update it on every fetch, as that turns all reads into writes, but maybe update it on fetch if the previous last-access stamp is more than a week old, or even a month old)
[14:25:12] but it's good that we can prime it with a week's worth of data
[14:25:27] and then you could have something that trawls through them and cleans up objects with approx last-access stamps older than X (months)
[14:26:05] and nuke those without any last-access after a certain amount of time?
[14:26:10] sort of like filesystem relatime
[14:26:22] yeah
[14:27:10] the problem is getting started, it would turn every read into a write initially
[14:27:21] yeah
[14:27:30] you could maybe start with a probabilistic filter
[14:27:36] that write can happen async post-send, though
[14:27:51] with a low prio, if such a thing is possible
[14:27:53] if (no_last_access_metadata && rand(100) < 1) { update it } or something
[14:28:09] and then ramp it up
[14:28:26] but you can't start nuking objects that have no entries until it's been applied to everything for a significant amount of time
[14:28:31] yeah that rand sounds about right, there's about a 100x difference in reads vs writes in swift typically
[14:29:11] so, the full logic would be something like
[14:29:21] godog: can you easily work out unique URLs from the logs? and how large that is?
[14:30:11] if (has_last_access_metadata) { if (last_access older_than 1week) { update it } } else { if (rand(100) < 1) { update it } }
[14:30:39] and then you'd turn up the random cutoff over time while keeping an eye on load, until you can remove the random filter completely
[14:31:15] gilles: about 100G of plaintext per day for swift access logs, unique urls I'm not sure, probably by ingesting it into hbase?
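
(A Python rendering of the pseudocode at 14:27:53 and 14:30:11 above, purely illustrative: the stamp would actually live as swift object metadata and the write would ideally happen asynchronously after the response is sent; the function and parameter names here are made up.)

    import random
    import time

    ONE_WEEK = 7 * 24 * 3600

    def should_update_last_access(last_access_ts, sample_pct=1, now=None):
        """Decide whether this read should also write an approximate last-access stamp.

        Mirrors the pseudocode above: objects that already carry the stamp are only
        re-stamped when it is more than a week old (so reads don't all turn into
        writes); objects without any stamp are stamped with a small random
        probability, which would be ramped up over time while watching swift write load.
        """
        now = time.time() if now is None else now
        if last_access_ts is not None:
            return (now - last_access_ts) > ONE_WEEK
        return random.randrange(100) < sample_pct   # ~1% of reads initially
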
[14:31:17] sounds good to me, but we're probably talking months to get through all those phases
[14:31:21] and then say, wait 90 days, then start periodically nuking everything with access times > 90d + 1w
[14:31:27] gilles: and 100M entries per day roughly
[14:31:59] no sorry, 200M entries
[14:32:37] if you had the access-metadata-based nuking running routinely, the nuked amounts would be small and manageable on each run
[14:32:51] but you'd still want to deal with that first heavy run separately
[14:33:35] and for the first heavy run, we could ask analytics to dump out unique URLs from the past 90d for us, without any real changes on the swift side, and go clean up based on that (slowly).
[14:33:45] just to get us in the ballpark and kill all the years-stale stuff
[14:34:14] that would probably be the most beneficial thing to do at first
[14:34:50] if we knock out a double-digit percentage of our swift storage with an initial big cleanup, then we can probably afford to turn on the webp faucet entirely
[14:35:01] yeah that's similar to what I did some quarters ago when we did the thumbnail dimensions analysis: cross-reference requested sizes vs stored sizes, generate a list of objects to delete and slowly do that
[14:35:05] and then work on making the cleanup permanently ongoing
[14:35:23] there will be lag time between when the analytics 90d list is generated and the (probably long) period over which we run it. So it would need the additional filter of "not created in the past 90 days" too, to avoid nuking objects newer than the analytics dump that aren't in the whitelist.
[14:35:58] ok, boarding, ttyl
[14:36:11] bye gilles
[14:36:16] some script can ratelimit itself and trawl every object in swift, and if it's not in the analytics-90d-whitelist and not created in the past 90d, kill it.
[14:36:22] cya
[14:37:56] yeah sth like that sounds doable, and marking thumbnails we found that have no original but should
[14:38:39] also, just a count of the 90d uniques from analytics, vs known thumb count in swift (is that easy to get vs originals?), would give us an idea what % of the storage it might end up nuking
[14:38:58] and thus some idea how slowly the load needs to be spread out
[14:39:42] yeah, bytes/object counts for thumbs vs originals are easy to get, we push it to graphite
[14:39:48] here's another funny thought:
[14:40:02] overall bytes/objects that is
[14:40:08] if you want to avoid all of this trouble with access dates and logs and hive queries and state
[14:40:56] you could also write a script that executes once per N, picks a random thumbnail object from all of swift thumbs, and nukes it unconditionally.
[14:41:20] and tune the N so it's not too impactful when it hits useful objects, but over time it will also cull all the long-stale ones.
[14:43:12] yeah that seems like another good approach for ongoing maintenance after an initial cleanup
[14:43:27] the reason I'm saying this is because we're talking about 1.2B thumbnails atm
[14:43:41] right
[14:44:34] but if we can make a random-thumbnail-picker that's truly random (that might be harder than it sounds!), and we have some good guesstimate that e.g. 45% of current thumbnails are stale/inaccessed
[14:44:51] we could set N quite high initially until most of them are cleared out, and then tune it down as we go
[14:45:04] (maybe initially as high as "as fast as you can in a serial loop" heh)
[14:45:45] heheh it'd be a good test of how fast swift is at deleting objects for sure
[14:45:57] but a billion could take quite a while
[14:46:38] of course if we could possibly avoid storing more (meta)data per-object like relatime I'm all for it!
[14:46:58] if deletes take a full second on average, it would take ~38 years to delete 1.2 billion objects
[14:47:19] so it would have to be a pretty fast mechanism heh
[14:47:59] it might be simpler to solve it by some kind of partitioning scheme, but I don't know enough about swift architecture to figure out those details
[14:48:09] but the general idea would be:
[14:48:46] declare some new empty storage partitions as the place new thumbnails go, while all the existing thumbs are still readable from the old partitions
[14:49:28] and wait a while, then start nuking large chunks of old thumbnails (whole partitions that represent X% of them), and shifting the freed-up space to the new partitions
[14:49:38] or something like that
[14:50:33] (at some rate where the re-thumbnailing of hotter ones doesn't get overwhelming)
[14:51:12] interesting! yeah if we go down the road of "write elsewhere in swift", that elsewhere should also be where we store fewer than 3x copies
[14:51:32] heh
[14:51:47] I guess that policy is there to protect originals, but happens to also apply to all the thumbs?
[14:51:59] these are the notes I took re: thumbs cleanup at the time https://wikitech.wikimedia.org/wiki/Swift/Thumbnails_Cleanup for https://phabricator.wikimedia.org/T162796
[14:52:12] afternoon vgutierrez
[14:52:18] hi Krenair
[14:52:19] bblack: correct, all 3x per site without distinctions
[14:52:51] vgutierrez, is there anything you need from me on certcentral?
[14:53:14] Krenair: so, we need to test certificate deployment, dunno if you want to give it a try
[14:53:22] um
[14:53:40] otherwise I'll do it as soon as I finish some lvs2010-related stuff
[14:53:41] I could set up a couple of instances in labs somewhere I guess
[14:53:49] bblack: we shard commons objects into 256 containers, so that's a unit of parallelism if we were to massively delete objects, delete in parallel across containers
[14:53:54] well.. we already have a certificate for pinkunicorn.wm.o
[14:54:02] so we could test it on cp1008.eqiad.wmnet
[14:54:06] right but that's in prod where I have no access
[14:55:00] so I'll leave it to you to try with pinkunicorn.wm.o
[14:55:03] ack
[14:55:08] godog: while my wheels are turning in this general area: we could also trial fetching thumbs straight from thumbor->varnish, initially at some low percentage.
[14:55:23] and see how big thumbor would have to scale to reach 100%, and then drop thumbs from swift
[14:56:18] but I suspect, with our current scheme of wiping per-cache-host backend storage once a week, the spikes refilling varnish might be a bit much for it.
[14:56:25] bblack: heh, nice in theory but I don't think thumbor will be able to do resizing with the same latency as swift is serving objects
[14:56:28] maybe later after we get ATS backends and have persistent caches again
[14:56:37] i.e. 150ms p95 for first byte
[14:56:46] oh, that sucks heh
[14:57:39] maybe it gets better for smaller files? we could maybe do something based on the byte-size of the orig.
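
(A sketch of the one-shot cleanup described around 14:33-14:36 above, plus the deletion-rate arithmetic from 14:46:58. Assumptions: the whitelist is a set of object names derived from the analytics/swift access logs, `delete` stands in for whatever swift client call actually removes an object, and the rate limit and per-container fan-out are illustrative, not a plan.)

    import time
    from datetime import datetime, timedelta, timezone

    def one_shot_cleanup(objects, recent_urls, delete, max_per_sec=10.0):
        """Delete thumbs that are neither in the 90d request whitelist nor created
        in the last 90 days (so objects newer than the analytics dump are safe),
        rate-limited so thumbor/swift can absorb the re-thumbnailing that follows.

        objects     -- iterable of (name, created_datetime) pairs for one container
        recent_urls -- set of thumb names requested at the edge in the last 90 days
        delete      -- callable that removes one object from swift
        """
        cutoff = datetime.now(timezone.utc) - timedelta(days=90)
        deleted = 0
        for name, created in objects:
            if name in recent_urls or created >= cutoff:
                continue
            delete(name)
            deleted += 1
            time.sleep(1.0 / max_per_sec)   # crude rate limit
        return deleted

    # Why rate and parallelism matter: at one delete per second, 1.2e9 objects
    # really would take roughly 38 years, so a real run would fan out across the
    # 256 containers mentioned above and go much faster per worker.
    print(round(1.2e9 / (86400 * 365), 1))   # ~38.1 years
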
[14:58:58] the approx-last-access-metadata approach is probably best in the long term if thumbor can't be universally fast enough for users on a miss, though.
[14:59:50] but then again, we do take slow misses through thumbor before a thumb first appears in swift, so it does happen some of the time
[15:00:02] (I think we return a 404 in that case, then later it just starts working?)
[15:00:18] we'd just have a higher rate of those
[15:01:00] mhh yeah I think we serve the correct response if thumbor is able to generate the thumb, so varnish waits in that case yeah
[15:02:11] but swift has much bigger storage than varnish, and is less volatile, so not nearly as often as it would be with direct swiftless fetches
[15:05:14] (listening to the k8s talk)
[15:07:06] <_joe_> bblack: thumbs for large images can take up to 30 s to render
[15:07:35] <_joe_> I'm not really sure we want to re-render them every time the cache is wiped on the varnishes
[15:08:33] <_joe_> where "varnishes" == whatever we have working as an edge cache
[15:13:58] GPUs!
[15:14:26] RPIs
[15:16:04] 10Traffic, 10Operations: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez)
[15:16:41] https://github.com/jbaiter/epeg-cffi#benchmarks
[15:28:06] bblack: what we discussed yesterday: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272
[15:33:13] bblack: hmm yesterday we decided to drop "en" instead of "enp" to avoid network interface names beginning with 0-9, right?
[15:41:53] only if strictly necessary, meaning the length of the ascii is >15?
[15:42:00] so it doesn't affect any existing ones
[15:42:20] basically if len>15, strip len-15 chars from the front of the name
[15:43:34] oh I see, you're stripping from the len before the .vlan is added
[15:43:45] I don't know if it's a rule that vlan ids have to be 4 digits either heh
[15:43:52] (in some sense, here)
[15:44:27] but it would be simpler and more fool-proof to do it at the end anyways
[15:44:51] e.g.
[15:45:06] $full_tag = "${iface[0]}.${vlan_id}"
[15:45:37] then do the size check on full_tag and remove leading chars down to the max len
[15:47:43] hmmmm that would work as well
[15:47:46] ok
[15:49:13] yeah I know, puppet isn't very good at this stuff
[15:49:30] could maybe move the whole thing down inside interface::tagged too, I dunno
[15:50:15] oh
[15:50:35] heh, interface::tagged actually invents the interface name and ignores the $title/$name anyways
[15:50:43] so you have to fix it down in there anyways
[15:51:32] $intf = "${base_interface}.${vlan_id}"
[15:51:54] lovely, thx :)
[15:53:16] hmm I'm pretty tempted to get rid of that line
[15:53:56] $intf = "${base_interface}.${vlan_id}", otherwise we'd have in puppet one name for the catalog resource and a different one for the actual iface name
[15:57:46] then you'd have to update all the callers, too
[15:58:01] (some in openstack)
[15:58:17] got you :)
[15:58:39] 10Traffic, 10Operations, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) > Hi Rob > > > > I'm raising this to our IT to check on. > > Please bear with us while they investigate it. > > > > Best regards...
[16:26:00] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) Does the community want that, if there is a community of users on the incubator?
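
(A Python illustration of the truncation rule sketched at 15:42-15:45 above; the real change belongs in Puppet's interface::tagged as discussed, and the example interface names below are made up.)

    IFNAMSIZ = 16   # Linux interface name limit, including the trailing NUL

    def tagged_iface_name(base_interface, vlan_id):
        """Build "<iface>.<vlan>" and, only if the result exceeds 15 usable
        characters, strip the excess from the front, so existing short names
        are left untouched."""
        full_tag = "{}.{}".format(base_interface, vlan_id)
        max_len = IFNAMSIZ - 1
        if len(full_tag) > max_len:
            full_tag = full_tag[len(full_tag) - max_len:]
        return full_tag

    print(tagged_iface_name("eth0", 1102))            # 'eth0.1102' (unchanged)
    print(tagged_iface_name("enp175s0f0np0", 1234))   # '175s0f0np0.1234' -- note it can
                                                      # still begin with a digit, the
                                                      # concern raised at 15:33
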
[16:32:30] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Dzahn) Yes, renaming of "zh-yue" to "yue" has been stalled / lowest priority for 12 years or more. (2006, before Bugzilla?) T10217 and T30441
[16:32:32] 10Traffic, 10Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema)
[16:32:36] 10Traffic, 10Operations, 10Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) 05stalled>03Open We fixed the `verify_config` issue in ATS 8.0.0-1wm2, so this is not stalled anymore.
[17:39:14] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10RobH) I've updated the firmware for bios/idrac/network on lvs2009 & lvs2010. lvs2007 & lvs2008 don't respond to mgmt interface connection attempts, and do not ping. Sho...
[17:43:01] 10Traffic, 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10RobH) I don't want to upload Dell's firmware drivers to our systems (because I'm sure downloading the Dell software is against some user agreement!) So I'll just li...
[20:28:50] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Urbanecm) >>! In T209693#4754087, @ArielGlenn wrote: > Does the community want that, if there is a community of users on the incubator? I think explicit bl...
[20:29:54] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) Are there other zh-yue and yue projects that also need to be addressed? If we are going to add redirects, we might as well do all that are needed.
[20:47:39] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Dzahn) Wikipedia: zh-classical zh-min-nan zh-yue Wiktionary:
[20:56:03] 10Wikimedia-Apache-configuration, 10Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Dzahn) re: community desire: T30441#2637271 -> "Many editors of Cantonese Wikipedia have been watching this thread for 9 years"
[23:42:28] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10kruusamagi) For me, it seems that the issue has grown even bigger over time. The delay with Estonian Wikipedia is often like 3 weeks (!!!), which means not...
[23:46:51] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Bawolff) >>! In T119366#4754971, @kruusamagi wrote: > For me, it seems that the issue has grown even bigger over time. The delay with Estonian Wikipedia i...
[23:49:54] 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Bawolff) Fwiw: I'm of the opinion that date magic words should reduce the varnish cache to at least 24 hours, maybe six hours. I'm doubtful that super long ca...