[05:25:22] 10Traffic, 10ExternalGuidance, 10Operations, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10santhosh) Confirmed that enwiki redirects to mobile version when accessed from Google transl... [08:40:56] 10Traffic, 10MobileFrontend, 10Operations, 10TechCom, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Tbayer) >>! In T214998#5005391, @Krinkle wrote: >>>! In T214998#4929700, @Jdlrobson wrote: >>... [09:03:52] 10Traffic, 10MobileFrontend, 10Operations, 10TechCom, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10Tbayer) >>! In T214998#4929968, @tstarling wrote: > It complicates SEO in the sense that, whe... [09:35:09] 10Traffic, 10Operations, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10ema) >>! In T202966#5007017, @ayounsi wrote: > cp1099 is the last standing host between me and powering off asw-c-eqiad. > > From this task and the prompt `cp1099 is a Unpuppetised sys... 
[10:06:42] 10Traffic, 10Operations: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10ema) [10:06:44] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (10ema) [10:08:00] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Some esams<->eqiad varnish backend connections closed by peer - https://phabricator.wikimedia.org/T216006 (10ema) [10:19:09] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Some esams<->eqiad varnish backend connections closed by peer - https://phabricator.wikimedia.org/T216006 (10ema) Varnishlog of the varnish **backend** instance serving the request in esams reports the following: ` - ReqMethod... [11:15:06] T216006 is interesting [11:15:06] T216006: Some esams<->eqiad varnish backend connections closed by peer - https://phabricator.wikimedia.org/T216006 [11:15:52] https://releases.wikimedia.org/blubber/linux-amd64/blubber fails, as well as /blubber/windows-amd64/blubber.exe [11:16:21] they're both ~ 9M. /parsoid/parsoid_0.10.0all_all.deb (~46M) works fine [11:19:13] you would thus think that file size is not necessarily a critical component of the issue, but then requesting a smaller portion of the blubber file works around the problem [11:20:07] eg: [11:20:11] curl -H "Range: bytes=0-100000" -L --resolve releases.wikimedia.org:443:91.198.174.192 --http1.1 -v -o /dev/null https://releases.wikimedia.org/blubber/linux-amd64/blubber?x=$RANDOM [11:21:28] are you doing that remotely? maybe try on a frontend box itself? [11:22:12] hi :) [11:22:25] maybe nginx is involved too? 
[11:22:30] TLSv1.2 (IN), TLS alert, close notify (256) [11:22:33] possibly [11:22:56] but even with nginx still there, I'd see if behavior changes when curl isn't across a wan from varnish-fe [11:23:15] ah, yeah, I can repro from a frontend box going through the whole stack [11:25:21] nope, nginx is innocent [11:25:31] can reproduce with: [11:25:37] curl -H "X-Forwarded-Proto: https" --http1.1 -v -o /dev/null -H "Host: releases.wikimedia.org" http://localhost/blubber/linux-amd64/blubber?x=$RANDOM [11:27:28] another thing you might look at is whether there are any obvious correlations between these known cases (the wikidata one and releases) when you query them from the applayer stuff directly. [11:28:08] e.g. if both servers are outputting them as close-delimited and it's varnish that's having to figure out CL and possibly impacting streaming, or both are using gzip encoding incorrectly, or who knows what. [11:28:45] I suspect it's a varnish bug, since restarting fes helped before, but it might be interesting to know the triggers [11:30:05] it has to be something quite funky I'm sure [11:30:45] lunchtime here, bbiab [11:59:28] 10Traffic, 10MobileFrontend, 10Operations, 10TechCom, 10Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (10ovasileva) Just wanted to chime in with a product perspective on this. This change is not cu... 
[12:43:28] mmh, and now after lunch I cannot reproduce anymore [13:21:03] a full stomach solves everything [13:21:39] albeit temporarily, problem is back :) [13:21:44] 10Traffic, 10Core Platform Team, 10Operations, 10Performance-Team, and 3 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10jbond) p:05Triage→03Normal [13:30:02] varnish-fe seems to be closing the connection before -be is done writing, I see loads of RST to :3128 [13:30:38] let's not be overly dramatic, s/loads/various/ [13:31:50] and indeed the source ports of the RST match the varnishlog entries with Debug ~ "Connection reset by peer" [13:38:53] the bulk of those are transient though, while /blubber and the wikidata one can be reproduced quite consistently [13:56:55] ah there you go [13:57:07] -- ExpKill LRU_Exhausted [13:57:07] -- FetchError Could not get storage [13:57:08] -- BackendClose 11217 vcl-ed50cc64-c0ad-4266-89a4-9e4539972e1a.be_cp3033_esams_wmnet [13:57:16] that's what the frontend says [13:58:05] which explains the "Connection reset by peer" on the backend [13:59:51] the larger file /parsoid/parsoid_0.10.0all_all.deb does not fail simply because it's large enough that we pass [14:02:08] I'm not sure I understand why you'd close an origin server connection when failing to get storage [14:02:18] was the LRU exhausted error on the be or the fe? 
[14:02:23] fe [14:02:45] right, so probably the sane answer here is to reduce our fe no-cache size limit [14:03:02] it makes sense that it's hard to evict enough objects from LRU for a giant object vs a small mem cache [14:03:48] what's bothering me here is that the -fe returns 200+partial response [14:04:32] well it's streaming as it tries to acquire storage chunks [14:05:01] true [14:05:06] so that's kind of a natural fallout: if storage fails and you've already streamed through part of the response (incl critically the initiate header and status), not much else you can do [14:05:16] s/initiate/initial/ [14:05:34] -sfile returned 503 when failing to get storage though, right? [14:06:02] I think that depends whether it's streaming the response or not [14:06:38] (or maybe it depends whether it tries to preallocate room for the whole response, which may not even be possible for responses with no CL from the applayer) [14:06:54] but I assume these have a CL, if nothing else I think we create one by turning off streaming at the be [14:07:06] yes they do have CL [14:07:31] directly from the applayer in this case [14:07:50] I don't know if -sfile would 503 in the same case, or not [14:08:48] it would make sense that with streaming enabled it'd behave like the frontend is right now [14:10:20] actually, I think we only have an FE size-limiter in upload [14:10:32] (in VCL) [14:11:00] maybe for the really big cases, if they have CL info, varnish-fe is able to quickly decide to not cache on its own [14:11:04] but the middling cases, it tries and fails [14:13:19] we also have admission_policy set to "none" everywhere now for $reasons [14:13:27] uh, interesting, I thought we did have a size cutoff for text too [14:13:48] but the block in upload-frontend that does the size cutoff checks for policy != "exp", so "none" still uses it [14:14:40] any idea the size distribution (or min observed size) of failing cases? [14:14:46] would the same 256K we put on upload work? 
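[editor's note] The "200 + partial response" behavior discussed above follows naturally from streaming. Here is a toy model (not Varnish code; all names are illustrative) of why a streaming proxy that hits a storage failure mid-body can no longer return a 503 — the status line is already on the wire:

```python
# Toy model of a streaming proxy: the 200 status is flushed before the
# body is streamed, so a mid-body storage failure can only surface as a
# truncated body / connection abort, never as a clean 503.
def stream_response(chunks, fail_after):
    """Return what went on the wire, with storage failing after `fail_after` chunks."""
    sent = ["HTTP/1.1 200 OK"]  # status + headers flushed up front
    for i, chunk in enumerate(chunks):
        if i == fail_after:
            sent.append("<connection reset>")  # the only option left
            return sent
        sent.append(chunk)
    return sent

out = stream_response(["part1", "part2", "part3"], fail_after=2)
```

With `fail_after=2`, the client sees a 200, two body chunks, and then a reset — matching the "Connection reset by peer" seen at the backend layer.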
[14:15:30] let me find out [14:16:13] from the historical pov, we probably never bothered putting the size restriction on text because we didn't expect it to commonly hold truly-huge objects [14:19:39] I'm collecting a sample of CLs for which we fail to acquire storage on cp3030 [14:20:21] so apparently varnish is autonomously deciding to pass when objects are "too big"? TIL [14:21:22] I would guess at some cutoff it would, based on something related to total mem cache size [14:22:56] or it may be implicit in the algorithms, e.g. when the size is really big it quickly "fails" to acquire storage (rips through the nuke limit in one go?) and then proceeds as a pass, but in middle-ground cases it thinks it can get enough, but later fails partway through the transfer. [14:23:10] LRU_Exhausted and the nuke limit are different cases I think? [14:25:36] LRU_Exhausted is what happens after $nuke_limit nuke attempts have been made and not enough space is found [14:26:37] hmmm I thought there were a couple of different similar cases like that, but I could be remembering wrong [14:27:21] or at least that's what I'm concluding given that we try 50 ExpKill before giving up and saying LRU_Exhausted, and nuke_limit defaults to 50 :) [14:27:26] that there was something like an inner loop of "nuke up to X things to make room for this one chunk allocation", vs another one for how many times chunk allocation can fail while doing one object, or something. [14:28:43] smallest failing CL so far is 581253 [14:31:27] maybe try copying the 256K block from upload to text VCL? [14:31:48] chasing the true cause in varnish C code sounds annoying heh [14:31:58] it does :) [14:33:07] yeah I think 256K should be safe, we can keep an eye on the hitrate in the upcoming days [14:33:32] also we might want to follow MAIN.n_lru_limited, if it grows it's almost certainly bad news? 
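[editor's note] For scale: the smallest failing CL sampled above (581253 bytes) is comfortably over the proposed 256K cutoff, so the upload-style block would catch every observed failure. A quick sanity check (the helper name is illustrative, not actual VCL):

```python
# 256K, the size cutoff used in upload's frontend VCL
FE_PASS_THRESHOLD = 256 * 1024  # 262144 bytes

def over_fe_limit(content_length: int) -> bool:
    """Would this object be passed (not cached) by the frontend cutoff?"""
    return content_length >= FE_PASS_THRESHOLD

# smallest failing CL sampled on cp3030 -> would be passed
assert over_fe_limit(581253)
# the ~46M parsoid deb is far over the limit, consistent with it working
assert over_fe_limit(46 * 1024 * 1024)
```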
[14:33:47] the description of the counter is "Reached nuke_limit" [15:02:13] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Some esams<->eqiad varnish backend connections closed by peer - https://phabricator.wikimedia.org/T216006 (10ema) Those connection resets on the varnish backend layer happen when frontend caches are full and varnish cannot make sp... [15:03:54] * bblack refuses to get nerdsniped by ema's last two lines into opening the varnish source code again [15:04:08] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, and 2 others: nuke_limit often reached on esams varnish frontends - https://phabricator.wikimedia.org/T216006 (10ema) [15:05:07] :) [15:08:41] so I've moved the 256K cutoff from upload's cluster_fe_backend_response to the common vcl_backend_response [15:09:24] now I wonder if cluster_fe_backend_response_early and wm_common_backend_response actually have important things for the cutoff [15:12:58] heh, wm_common_backend_response sets the ttl, so yes [15:14:02] * ema shakes fist at vcl [15:22:06] bblack: seems reasonable? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494937/ [15:35:05] ema: nope, because upload's cluster_fe_backend_response ends with a "return (deliver);", which causes the moved code to never execute for upload [15:35:44] so does text heh [15:35:48] oh doh! [15:36:31] both text and upload also call a $cluster_common_backend_response() too, but one before admission stuff and one after lol [15:36:46] upload's is kind of orthogonal and pointless [15:37:12] but text's looks complicatedly-related maybe [15:38:21] but the ordering works I think, hmmm [15:38:49] maybe just move both clusters' "return (deliver)" back over to the common backend code where it says the default might be invoked? 
[15:39:28] oh then I thought about misc and realized the other dimension of this problem space [15:40:00] it's not that varnish had some internal implicit size limits, it's that when we switch VCL for misc (e.g. for releases), the misc VCL has its own "don't cache over 10MB" clause there... [15:40:26] ohhh, right! releases.w.o is misc! [15:40:59] it's the same basic code block, just 10MB instead of 256K [15:41:23] they could all be coalesced in theory [15:41:55] but misc doesn't have the return deliver in its cluster_fe_backend_response, so it actually does currently fall through to varnish default VCL there, probably intentionally. [15:42:44] at this point I would opt for the easy way out and just duplicate the code block from upload to text, and also change misc's copy to match (256K) [15:42:55] it's nice not to duplicate code, but we're not working in a nice environment :P [15:43:12] yes, I wanted to be cool but it's definitely a terrible idea [15:44:08] oh, also in misc frontend we pass on missing CL, while on upload we don't [15:44:27] std.integer(beresp.http.Content-Length, 0) >= 262144 [15:44:29] well on upload we always get CL anyways, I think (maybe double-check maps) [15:44:41] what about text? [15:45:40] but yeah, misc and upload's assumptions really do differ there [15:46:27] misc says "if no CL, assume it's big and pass", and upload says "if no CL, assume it's small and don't pass" (because there probably are a few dumb meta-things on upload like that which are small and not swift media files) [15:47:05] and also in the misc/text case I don't think we exclude small files from the backend storage, whereas in upload we explicitly do. which implies if upload-fe fails to cache small objects, they don't get cached at all. [15:47:06] on both misc and upload we disable streaming if CL is missing at the backend-most layer (hence we should always have CL on the frontends) [15:47:34] but not text? 
[15:47:47] I don't think so, no [15:47:52] this may be one of those things where we tried it on text and it broke something for $stupid_reasons .... [15:48:15] so maybe on text frontends we should also assume it's big if no CL and pass [15:48:29] well [15:48:43] in the text case, we know mediawiki outputs at the applayer commonly have no CL but need caching [15:49:00] so if we don't have the stream-disable-at-the-be hack, and we assume no CL == pass at the fe, things get very uncached [15:49:15] mmh, yeah [15:51:05] so probably the least-invasive fixup that's still general would be: [15:51:36] 1) Leave text/upload assuming "if fe sees no CL, cache in the fe", leave misc assuming "if fe sees no CL, do not cache in the fe" [15:51:46] 2) Copy the upload block over to text for a 256K sanity limit there. [15:52:04] 3) Change misc's value from 10MB to 256K (but see 1, don't copy the logic from upload) [15:53:13] then we can say the text "cluster" (incl misc) has a 256K limit like upload when CLs are known, and we don't screw with the rest of it. [15:53:35] I like it [15:53:44] some of this will get easier to fix once we get past the ATS-backends hurdle [15:54:12] I think we'll have better odds of creating missing CLs in ATS, and not have to do hacks like upload where we skip backend caching on small objects, too. [15:55:14] there's going to be other fe simplifications too (we can get rid of the dc loop checks, for instance) [15:56:11] oh that's backend-only anyways heh [15:57:19] meeting soon! [15:57:29] \o/ [15:58:15] I mostly said that to pre-empt someone reminding me of it when I'm late in a few minutes [15:58:53] /o\ [16:01:41] bblack: meeting! [16:02:01] ema: we miss you :( 
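[editor's note] The three-point plan above amounts to a small per-cluster decision table. A sketch of that logic in Python (function and cluster names are illustrative — the real thing lives in the per-cluster VCL):

```python
FE_SIZE_LIMIT = 262144  # 256K, as on upload

def should_pass_fe(cluster: str, content_length) -> bool:
    """Frontend cache-admission sketch per the plan:
    - missing CL: misc assumes "big, pass"; text/upload assume "small, cache"
    - known CL: uniform 256K limit across all clusters
    """
    if content_length is None:
        return cluster == "misc"
    return int(content_length) >= FE_SIZE_LIMIT
```

Under this model the text "cluster" (incl misc) gets the same 256K limit as upload whenever CL is known, while the differing no-CL assumptions are left untouched.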
I can move to asw2-c7 xe-7/0/6 looks open [16:47:35] ema: we're moving cp1099 to a different rack, so poweroff, etc I assumite it's fine as it's a test server [16:47:40] assume* [16:49:43] yeah [16:51:20] XioNoX: yup! [16:52:18] thx! [16:58:30] bblack: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494937/ updated, though I'm not sure I understand the `beresp.http.Content-Length ~ "^[0-9]{9}"` part. 9 digits surely is way larger than 256K? [16:59:13] right [16:59:24] it can be trimmed to match the others I think. [16:59:52] oh they're all at 9 heh [17:00:44] right now, upload is at 9 and misc at 8 [17:00:46] so the fear there is that the CL might be numerically-misinterpreted if it's past 2^31-1, which is 10 digits long [17:00:47] :) [17:01:01] so assume anything with lots of digits is over-threshold [17:01:06] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:01:34] I'm not even sure anymore how legit a concern that is, but I assume at least at some past point, varnish integers or str->integer conversions might've been 32-bit signed or unsigned. [17:01:44] or some other part of the stack somewhere might make such a mistake [17:02:53] but it doesn't hurt as a safety valve. you only need it to say {10} to prevent the overflow/underflow/whatever issues I guess, but 9 is fine too. [17:03:29] fair enough [17:04:44] Varnish 6.0 (which we haven't upgraded to yet, and I wouldn't dare until after we kill the varnish-be's) release notes say: [17:04:48] Integers in VCL are now 64 bits wide across all platforms (implemented as int64_t C type), but due to implementation specifics of the VCL compiler (VCC), integer literals' precision is limited to that of a VCL real (double C type, roughly 53 bits). [17:04:52] In effect, larger integers are not represented accurately (they get rounded) and may even have their sign changed or trigger a C compiler warning / error. 
[17:05:04] so I assume that implies that 5.x and earlier are still worse-off than that in various ways [17:05:39] 10Traffic, 10Operations, 10Wikidata, 10Wikidata-Campsite, and 2 others: nuke_limit often reached on esams varnish frontends - https://phabricator.wikimedia.org/T216006 (10ema) >>! In T216006#5008346, @ema wrote: > Interestingly, the problem is not reproducible with larger objects, as varnish autonomously d... [17:06:44] 10Traffic, 10Operations, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10Cmjohnson) @ayounsi server moved [17:07:10] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) lvs100[789] network port disabling: ` robh@asw-c-eqiad# show | compare [edit interfaces interface-range LVS-cross-row] - member-range xe-8/0/26 to xe-8/0/28; [edit inte... [17:12:59] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:15:03] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1007.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:15:15] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1008.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... 
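[editor's note] The digit-count guard discussed above can be sanity-checked in a few lines. The regex is the one quoted from the VCL; the 32-bit arithmetic just illustrates the feared misinterpretation of a huge CL:

```python
import re

# 2^31 - 1, the largest signed 32-bit value, is 10 digits long
assert len(str(2**31 - 1)) == 10

# The VCL guard `^[0-9]{9}` matches any CL of 9 or more digits,
# i.e. >= 10^8 bytes, far above the 262144-byte cutoff it backs up.
assert re.match(r"^[0-9]{9}", "100000000")        # 9 digits: matches
assert not re.match(r"^[0-9]{9}", "99999999")     # 8 digits: no match

# What the guard protects against: a 10-digit CL forced into a signed
# 32-bit int wraps negative and would dodge a plain `>= 262144` check.
cl = 3_000_000_000
wrapped = (cl & 0xFFFFFFFF) - (1 << 32) if cl & (1 << 31) else cl & 0xFFFFFFFF
assert wrapped < 0
```

So matching at {9} is a stricter safety valve than strictly needed ({10} would suffice for the overflow case), but any CL it catches is over-threshold anyway.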
[17:15:28] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1009.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:17:45] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1010.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:18:00] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1011.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... [17:18:12] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for lvs1012.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Rem... 
[17:33:29] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) [17:33:41] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10RobH) a:05RobH→03Cmjohnson [18:13:46] 10netops, 10Operations, 10ops-eqiad: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [18:43:09] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) [18:43:30] 10netops, 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) a:05ayounsi→03Cmjohnson [19:13:57] bblack: seeing some interesting stuff while digging through Netflow in Ashburn, for example we're seeing "Airtel Nigeria" there [19:16:04] Also lots of Telecom Austria [19:27:54] I guess it's logged in users [19:32:51] It looks like most of Telia traffic in eqiad is european users being redirected to eqiad, then Indian [19:42:52] Slightly similar topic, I worry that losing AMS-IX saturates the Telia link in esams [19:44:48] unrelated, I think that's the sexiest network map I've seen so far: https://telxius.com/network/interactive-map [19:57:50] so yeah, Ashburn is the default geoip location for stuff that doesn't resolve via MaxMind [19:58:03] it's probably pretty normal to see strange stuff from all over the globe there [19:58:55] but, there's almost certainly some low-hanging fruit there if we dig into it and want to manually engineer things a little better, but I'm also leery of how much manual engineering we want to maintain at that level without developing better tooling first. [19:59:40] (e.g. 
we could look at a case like Airtel Nigeria and see that they need to be going to esams and MaxMind has it wrong, and maybe look up all their netblocks and manually config geoip, and try to push on MaxMind to fix them in the database) [20:00:16] having ASN-level tooling for the manual stuff would be nice. some kind of workflow by which we can just say ASN4234 traffic should go to esams (and pull the near-realtime routes advertised by that AS into the geodns stuff) [20:01:17] how big a volume of "looks misdirected" traffic in eqiad are you looking at, percentage-wise? [20:05:16] bblack: I checked Airtel Nigeria in GeoDNS and they're correctly set to esams [20:05:49] (they only have 2 big prefixes, so that's easy [20:05:51] ) [20:06:14] but their traffic from those same ranges is landing in eqiad? [20:06:43] it might be interesting to dig into what's going on there to see if we have some issues we don't understand [22:02:18] could be just users accessing e.g. dumps.wm.org [22:02:56] netflow isn't just main traffic infra [22:03:42] for that kind of analysis the geoip-enhanced webrequest data in e.g. druid may be more suitable