[01:54:53] 10Traffic, 06MediaWiki-Stakeholders-Group, 06Operations, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2621906 (10BBlack) [01:54:58] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2621905 (10BBlack) 05Open>03Resolved [01:55:08] 10Traffic, 06MediaWiki-Stakeholders-Group, 06Operations, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323297 (10BBlack) 05Open>03Resolved a:03BBlack [07:06:08] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2622342 (10grin) I respectfully disagree with most of the points, but as it's been said before: I have noted that the topic should be considered c... [07:07:28] ulsfo backends have started nuking lrus already, interesting how much faster it happened compared to codfw [07:07:42] still everything looks stable [08:24:11] bblack: why do we need to upgrade eqiad+esams in parallel? Can't we just (1) re-route esams->codfw (2) upgrade esams first? [08:55:00] back again after the core had fun destroying its own database and all backlogs and so on. :/ [08:58:12] ema: vk is behaving fine :) [08:58:53] elukey: nice! [08:58:55] welcome back Snorri [09:41:38] Thanks. By the way, when I'm finished with my thesis (hopefully soon!) and I've got some time, I'm thinking about researching bloom filters a bit. This sounds very interesting and fun! Maybe when I know more about it we could discuss how to use them in caching scenarios. Something like the 404 filter sounds super helpful! [09:44:07] Snorri: nice.
When it comes to bloom filters, I think the last idea we had to deal with one-hit-wonders was to actually avoid using bloom filters and instead rely on X-Cache-Int [09:44:13] see T144187 [09:44:13] T144187: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187 [09:44:54] and https://gerrit.wikimedia.org/r/#/c/308995/ which hasn't been merged yet given that we're half-way through upgrading cache_upload to varnish 4 at the moment [09:50:55] Sounds reasonable! [09:52:57] Okay...my procrastinating has now reached the level of thinking about what to do with stuff I have no knowledge of! At least better use of my time than checking random websites. [11:29:47] ema: I guess I have the idea stuck in my head that we shouldn't leave esams re-routed too long (or codfw->direct too long, either) [11:29:58] they're both adding latency to some fairly slow misses [11:31:55] it's normally 83ms eqiad<->esams. but esams<->codfw is 119ms. and the codfw<->eqiad link is 35ms for those misses too. [11:32:45] so yeah we could split them up on separate days [11:47:22] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2622824 (10BBlack) >>! In T144508#2622342, @grin wrote: >> it would take me pages just to explain in depth the huge range of problems with that se... [11:49:19] Snorri: spending your time thinking of ways to do things you have no knowledge of sounds like an excellent plan to me. It's the best way to learn new things :) [11:50:12] also our nascent patch linked above is "wrong" I think. I don't think we can usefully blend the ideas of using X-Cache-Int metadata and requiring N+ hits. [11:50:34] we can do X-Cache-Int to avoid DC-global one-hit-wonders, but it's useless to try to filter at the two-hit+ level. [11:51:36] consider that we hash ips into frontends.
filtering on hit/2 (as opposed to hit at all) doesn't mean "hit at least twice between all these frontends", it means "hit by at least two distinct frontends" [11:51:44] yeah...but I should write about the things I do have knowledge of right now. Like what my data shows. Hopefully I can let you 2 proofread my work next week. [11:53:09] well the rest of my statement above was something like "consider the edge case of 1 client requesting the same object 100 times from the same frontend" [11:53:18] does it? If the object is not cached on the hit/1 already it would be a miss on the frontend and the backend would be hit the 2nd time. Then the third request has the hit/2 in the backend? [11:53:23] but before I finished I realized I was wrong about the patch being wrong, so we're back to it being right [11:53:45] I've only had one coffee since I woke up :) [11:55:16] That means my argument is correct, which makes me feel good about understanding stuff. So don't drink more coffee I guess :) [11:55:27] :P [11:55:31] too late! [11:56:06] Oh and the X-Cache-Int metadata thing will be featured in the future work section! [11:56:31] so we now have 3 ideas for improving our basic hitrates that are back-burnered because other stuff is ongoing: [11:56:42] 1. mostly text-specific: sort query params [11:57:15] 2. mostly upload-specific: lower our frontend object size limit (currently 50MB, optimal is probably much smaller). static limit isn't ideal, but it's better than nothing for now. [11:57:38] 3. probably helps everywhere: the various N-hit-wonder ideas [11:57:59] Ah this reminds me of something. [11:59:05] I have no idea how hard it would be to implement (probably very) and how much effect it might have (I guess not that much, sadly) but...knowing how most people I know use wikipedia, most requests from one user are linked together [11:59:23] heh [11:59:37] prefetching? [11:59:44] Like I search for something. Film, Person, Mathematic Equation or something....
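[Editorial aside on the hit/N admission discussion above, 11:50-11:55: the following is a hypothetical toy model in Python, not WMF's actual VCL or the Gerrit 308995 patch. It sketches the scheme the conversation converges on: a frontend admits an object into its cache only once the shared backend's X-Cache-Int-style hit counter reaches a threshold, filtering out one-hit-wonders. Because every frontend misses to the same backend, the counter accumulates across frontends, which is why the "1 client requesting the same object 100 times from the same frontend" edge case still works out.]

```python
# Toy model (hypothetical, for illustration only) of N-hit-wonder
# filtering via a backend hit counter, as discussed above.

class Backend:
    def __init__(self):
        self.cache = {}          # url -> backend hit count

    def fetch(self, url):
        """Return this object's backend hit count (0 on the first fetch)."""
        hits = self.cache.get(url, 0)
        self.cache[url] = hits + 1
        return hits

class Frontend:
    def __init__(self, backend, admit_threshold=1):
        self.backend = backend
        self.cache = set()
        self.admit_threshold = admit_threshold

    def request(self, url):
        if url in self.cache:
            return "fe-hit"
        hits = self.backend.fetch(url)
        if hits >= self.admit_threshold:
            self.cache.add(url)   # admit only after N backend hits
        return "be-hit" if hits > 0 else "be-miss"

be = Backend()
fes = [Frontend(be) for _ in range(4)]   # clients hash by IP to one of 4 FEs

# One client pinned to frontend 0, requesting the same URL repeatedly:
results = [fes[0].request("/wiki/Example") for _ in range(4)]
print(results)   # ['be-miss', 'be-hit', 'fe-hit', 'fe-hit']
```

The one-hit-wonder never reaches frontend storage, while the second request from anywhere promotes the object; the backend is touched only twice in total.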
[11:59:47] exactly [12:00:15] I tend to think prefetching inside our caches will tend to be covered by traffic stats, or at odds with them [12:00:40] Yeah it might be. [12:00:54] in other words, if the to-be-prefetched article is at all popular it's probably already cached by someone else's direct request, or the first of many people that followed the same link being prefetched. [12:01:07] and if it's really not that popular, it's not worth evicting anything more-popular for it [12:01:39] but I think your case is stronger in the backend caches, where we have ample room. [12:02:01] Mhhh [12:02:02] it's the frontends where we can't afford to prefetch->evict on little evidence [12:03:00] Yeah I guess. I thought it might be interesting to prefetch on the frontend level and only lookup on the backend but not the application. So only prefetching already popular articles which might be not on one specific frontend [12:03:39] I think with the space pressure on the frontend it would be hard to justify the tradeoff [12:03:50] But the eviction is a problem I guess. On the backend it could help. But as you said most of the interesting sites would be in the backend already [12:04:01] Yeah. [12:04:11] but: if there are prefetchable links in an actually-viewed article, and those links are cold, it could make sense to get them loaded in backends. 
[12:04:35] FE fetching a backend hit is pretty fast and reasonable (when the prefetched link is viewed) [12:04:43] Still not used to 100GB for website caching being small space^^ [12:04:47] but missing through our whole cache stack back to the applayer sometimes isn't [12:05:41] prefetching to the backend COULD help mitigate the delay from esams to eqiad for example [12:05:58] I donno, on the other hand, prefetching based on access seems like one of those things where you're unlikely to tune it just right, or it's unlikely to adapt well to some load patterns, etc [12:06:20] probably the right way to think of this, if we're assuming backend storage is big enough (which it probably is on cache_text for articles).... [12:06:41] yeah exactly. prefetching everything would probably be a waste and only prefetching the correct things would be hard to get right. [12:07:03] would be to turn it around and say "every time an article is actually-edited and the parsercache updates, have mediawiki trigger the backend caches to fetch the article" [12:07:33] of course if we don't actually have space for them all that could be catastrophic heh [12:07:42] A friend of mine wrote his thesis about trying to give the links a weight based on the content and importance. Something like this might help. But it would be a lot of fine-tuning and might not be of much help in the end. [12:08:00] but still, there's probably a fair amount of overlap between most-recently-edited and likely-to-be-requested-by-users [12:08:54] That could actually be nice. After an edit, prefetch the edited article to the backends. [12:09:16] taking this idea a little further: the applayer already gives us such a signal in the form of PURGE on updated articles... [12:09:29] so all we really have to do is: instead of just purging on purge, purge->refetch.
[12:09:49] which is funny in an awesome way [12:10:15] because I've been complaining for nearly a year that the applayer is sending us an insane rate of PURGE requests for various reasons (much higher than you'd expect based on real human wiki edit rates) [12:10:26] and this would reflect that volume back at them and maybe melt the appservers :P [12:10:36] this could even be combined with the previous "one-hit-wonder" idea: only refetch if hit on backends at least X times for example [12:10:56] Okay...melting the appservers was not what I had in mind actually :D [12:13:48] yes, other than the melty rate of purge, it's actually a good idea [12:13:53] but this could help especially with highly edited articles. News stories, election candidates and so on. articles edited every hour or so should be incredibly popular (normally) and the refetch could mitigate some misses there [12:14:00] it's just funny how it highlights that purge rate needs to be fixed first :) [12:14:40] I don't know if VCL alone or a vmod or a varnish patch would be required to implement it [12:15:37] "it" being something like "on explicit purge of an item (not LRU nuke), if hitcount > N, re-fetch it from the backend immediately". ideally asynch, but even just in-sync with processing the purge req would be maybe-ok. [12:15:47] no idea. Implementing things is really not my strong suit. I actually love the theoretical stuff but writing code has not yet been something I like. [12:30:10] ema, bblack: new kernel with new ABI is now on apt.wikimedia.org, gets pulled in via a new linux-meta package [12:36:56] The following packages have been kept back: linux-meta-4.4 [12:37:08] The following packages will be upgraded: linux-meta [12:37:22] ^ ignoring unrelated packages, that's what I see on basic "apt-get upgrade" on a cache now [12:37:43] so should we upgrade and also install linux-meta-4.4, or?
[12:38:15] moritzm: ^ [12:39:44] bblack: https://gerrit.wikimedia.org/r/#/c/309312/ - anything against me going forward? All the prereq should be in place [12:40:02] when we had 3.19 and 4.4 in parallel, there were separate linux-meta-3.19 and linux-meta-4.4, linux-meta only depends on linux-meta-4.4 as a transition package, so updating linux-meta and linux-meta-4.4 should be fine [12:40:17] (IIRC I just need to merge and then go to ns0.w.o and run sudo -i authdns-update) [12:41:43] moritzm: so I should just dist-upgrade? [12:42:37] elukey: yup, should work [12:43:05] thanks! [12:45:18] bblack: maybe run it on a single host initially, but should be fine [12:49:25] The following NEW packages will be installed: linux-image-4.4.0-2-amd64 [12:49:28] The following packages will be upgraded: libidn11 linux-image-3.16.0-4-amd64 linux-libc-dev linux-meta linux-meta-4.4 [12:49:38] that's what I got on dist-upgrade, testing 1x host just now, looks sane [12:58:25] looks good upgrade-wise [13:41:20] godog: prometheus-varnish-exporter is in new https://ftp-master.debian.org/new/prometheus-varnish-exporter_1.0-1.html \o/ [13:41:50] ema: nice \o/ good job [14:20:09] bblack: ema: do you plan to rename puppet class role::cache::2layer? Classes with a leading digit are no longer supported in recent puppet [14:20:14] I am just wondering :D [14:20:20] ema: ok so assuming we spread out esams/eqiad, do you want to start esams late sunday my time / early monday your time still? I'm mostly worried about my inability to back you up on anything much while traveling for the SF meetings the rest of next week, and then we have like 1 good week left in the quarter and then barcelona [14:20:36] hashar: really it should go away completely in a refactor, they're all 2layer now [14:20:46] ahhh [14:20:51] so that is essentially solved :] [14:21:29] thx !
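[Editorial aside, returning to the purge->refetch idea from 12:15 above: the following is a hypothetical toy model in Python, not VCL, a vmod, or an actual Varnish patch (the discussion left the implementation open). It sketches "on explicit purge of an item (not LRU nuke), if hitcount > N, re-fetch it from the backend immediately", with the refetch done inline for simplicity even though the chat notes it should ideally be asynchronous.]

```python
# Toy model (hypothetical, for illustration only) of purge->refetch
# with an N-hit condition, as discussed above.

REFETCH_HIT_THRESHOLD = 2   # "N" in the discussion; value is arbitrary here

class Cache:
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn      # origin fetch (the applayer)
        self.store = {}               # url -> [content, hit_count]

    def get(self, url):
        if url in self.store:
            self.store[url][1] += 1
            return self.store[url][0]
        content = self.fetch_fn(url)
        self.store[url] = [content, 0]
        return content

    def purge(self, url):
        """Explicit purge: refetch hot objects instead of just dropping them."""
        entry = self.store.pop(url, None)
        if entry and entry[1] > REFETCH_HIT_THRESHOLD:
            # ideally asynchronous; done in-sync here for simplicity
            self.store[url] = [self.fetch_fn(url), 0]

origin_fetches = []
cache = Cache(lambda url: origin_fetches.append(url) or f"body-of-{url}")

for _ in range(5):                    # popular article: 1 miss + 4 hits
    cache.get("/wiki/Hot")
cache.get("/wiki/Cold")               # one-hit-wonder: 1 miss, 0 hits

cache.purge("/wiki/Hot")              # hot: refetched, stays cached
cache.purge("/wiki/Cold")             # cold: simply dropped
print("/wiki/Hot" in cache.store, "/wiki/Cold" in cache.store)  # True False
```

The hit-count condition is what keeps this from reflecting the full (currently excessive) PURGE volume back at the appservers: only objects with demonstrated popularity trigger a refetch.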
[14:30:19] bblack: yes I'm fine with starting esams any time [14:30:39] ok [14:31:23] assuming I don't see evidence of weekend issues, that is, but I don't think we will [14:31:53] doing that during the weekend would also be OK with me as long as I know that in advance (before I start drinking negroni that is) [14:37:51] :) [14:38:32] I finally managed to write some doc about systemtap -> https://wikitech.wikimedia.org/wiki/SystemTap [14:39:03] I'm gonna be pretty busy most of the weekend. realistically I'll probably start esams later-than-ideal, like 02:00 or 03:00, basically after kids are asleep Sunday night for me [14:40:13] ok [14:40:40] what spacing did you use the last time you ripped through ulsfo with live traffic? 15 or 20 min? [14:41:19] for most machines, longer than that, I was looking at the number of frontend objects [14:41:37] and keeping an eye on grafana [14:41:52] ok [14:42:07] the specific times can be found on SAL [14:42:49] right [14:43:07] oh wait, codfw was the one that took longer (fewer clients) [14:43:18] I'll probably be watching graphs too, I'm more just trying to realistically estimate how long the process will go with 12 caches [14:43:50] for the first hosts I think 30 minutes is a valid estimate [14:44:12] (conservative) [14:45:20] but yeah I guess we don't want to stay 24h in a mixed v3/v4 state :) [14:45:58] heh 6h that is [14:47:36] ultimately the time driver's going to be how our miss-rates and esams link traffic play out as it goes, and even ulsfo only gives us an approximate idea of what it will look like. [14:48:11] but the dns turnup of ulsfo the other day only spiking to ~2.2Gbps is a good sign. I expected higher heh.
[14:49:57] right [14:57:09] codfw upgrade times: 17:35 18:31 19:22 19:45 20:41 21:28 21:59 22:24 22:49 23:17 [15:01:54] 56 51 23 56 47 31 25 25 29 -> avg 38 [15:18:03] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623166 (10RobH) [15:30:11] bblack: not sure if I prefer to start late Sunday night or early Monday morning :) [15:30:38] my time? [15:31:37] my time [15:31:52] bblack: oh wait did you mean 02:00 your time or UTC? [15:33:41] I meant 02:00 UTC, but well maybe I should rewind a bit. "not sure if I prefer to start" - esams or eqiad? [15:34:18] the esams time is mostly driven by the esams daily traffic cycles - we're much better off trying to get it closer to the low point there [15:35:19] the fundamental problem is esams low-point isn't a good time for anyone :) [15:36:38] bblack: right, 2:00 UTC would be esams low-point [15:37:06] ideally we'd start even earlier so the ~6h-ish window doesn't run so far into the upramp of traffic [15:37:17] but realistically I won't be there any earlier than that, and it works :) [15:38:43] I could start around 00:00, that way we would hopefully be done around 06:00 [15:39:12] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623236 (10RobH) [15:45:42] if it works out that you can, that's great. either way I'll plan to show up around 02:00-03:00-ish and either take over or start [15:47:01] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623255 (10RobH) a:05RobH>03MoritzMuehlenhoff [15:49:27] yeah I think that should work.
I can start at some point around 00:00, perhaps a bit earlier, and you can take over when you manage to get online [15:52:36] 10Traffic, 06Operations, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2623268 (10BBlack) I'm putting up a straw-man hostname-decom date of 2016-09-19, which is ~10 days out from now. We'll never actually eliminate the trailing traffic before decom, and mo... [15:56:43] so the plan would be: (1) route upload esams to codfw (2) usual upgrade procedure with an eye on the hitrate, right? [15:58:43] and librenms esams link traffic [15:58:49] right [17:03:09] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623166 (10akosiaris) FWIW, those were up to now certificates issued by our internal CA (along with the labvirt* certs from what I remember and can see). II... [20:10:46] 07HTTPS, 10Traffic, 06Operations, 06Performance-Team, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10AlexMonk-WMF) I did some digging and found /shared/pywikipedia/core/pywikibot/comms/rcstream.py in tools defaults to stream.wikimedia.org on port 80 [20:23:59] 07HTTPS, 10Traffic, 10Pywikibot-core: rcstream support defaults to stream.wikimedia.org:80 - https://phabricator.wikimedia.org/T145244#2624453 (10AlexMonk-WMF) [20:31:30] 07HTTPS, 10Traffic, 06Operations, 10Pywikibot-core: rcstream support defaults to stream.wikimedia.org:80 - https://phabricator.wikimedia.org/T145244#2624490 (10AlexMonk-WMF) a:03AlexMonk-WMF [21:35:10] not sure if it was mentioned already, varnish 5.0.0 beta1 was released today [21:35:31] The list of changes are numerous and will not be expanded on in detail. [21:35:34] Major items: [21:35:37] * Experimental support for HTTP/2. [21:35:39] * VCL labels, allowing for per-vhost VCL. 
[21:35:42] * Always send the request body to the backend, making it possible to cache e.g. POST. [21:36:05] https://twitter.com/bsdphk/status/774321722027802628 is... special and infuriating :)
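[Editorial aside: as a quick sanity check of the codfw upgrade timing quoted at 14:57/15:01 above, the per-host gaps and their average can be recomputed from the logged start times. The last gap comes out as 28 rather than the logged 29, presumably because the original arithmetic kept the seconds that these minute-granular times drop; the average still rounds to 38 minutes per host.]

```python
# Recompute the per-host upgrade gaps from the start times logged above.
times = ["17:35", "18:31", "19:22", "19:45", "20:41",
         "21:28", "21:59", "22:24", "22:49", "23:17"]
minutes = [int(h) * 60 + int(m) for h, m in (t.split(":") for t in times)]
gaps = [b - a for a, b in zip(minutes, minutes[1:])]
print(gaps)                           # [56, 51, 23, 56, 47, 31, 25, 25, 28]
print(round(sum(gaps) / len(gaps)))   # 38
```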