[02:38:22] 07HTTPS, 10Traffic, 10DBA, 06Operations, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#2618187 (10Dereckson) [02:39:50] 07HTTPS, 10Traffic, 10DBA, 06Operations, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1218427 (10Dereckson) [09:11:16] 10Traffic, 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2618555 (10hashar) 05stalled>03declined From T140257#2595926 and follow up response from ops, we are keeping the status quo of using a public... [10:03:40] ema: my patch to show 4.4.19 in uname didn't work, there seems to be another weird layer of indirection in the Debian kernel build, I'll stick with the old format for now and revisit for the next kernel update [10:20:58] moritzm: ok, where's the patch out of curiosity? [10:21:50] https://gerrit.wikimedia.org/r/#/c/308151/ [10:23:16] wow that's a big patch :) [10:24:29] rules.gen is auto-generated, but it's intended somewhat differently: normally it's meant to be run once by the kernel maintainer doing the upload to the archive, and rebuilding that version is simple; making changes is often messy, though [10:45:28] ema: nukelru yet? [10:47:39] bblack: lots! [10:48:00] awesome [10:48:26] it's kind of an interesting datapoint, the time-to-nukelru after wiping a DC and bringing in "real" traffic [10:48:39] yeah, it takes a really long time [10:48:45] given codfw isn't storage-handicapped like ulsfo with just 6 upload servers [10:48:53] but still, it was under 24h from ulsfo users arriving [10:49:50] I'm sure that says something mathematically complex about the effective/active set size of swift vs the cache size, but I don't know what :) [10:50:30] :) [10:51:52] if we're surviving with heavy nukelru and no ill effects, I'd say next step is re-upgrade ulsfo and put users back there again [10:52:19] with ulsfo routed back to codfw [10:52:24] well, yeah [10:52:28] I guess that's step 1 :) [10:52:41] right, this way we do v4 -> v4 -> applayer [10:53:13] we know from before there's issues with v3 fetching from v4, some 503s I think? Not huge and horrible, but not ideal [10:53:54] the tradeoff here is between moving the users back first and then upgrading (which rolls the storage wipe, they don't all miss across the WAN at once) [10:54:10] (but they face some %fail from v3->v4, probably) [10:54:26] or upgrading all of ulsfo first, and then flipping users back with a big initial spike of misses in an empty DC [10:54:59] I tend to think the second one is better, if we time flipping geodns back for a low-point in ulsfo's daily swing [10:55:13] would be nice to have a weight in dns to add users slowly :) [10:55:46] yeah, we've done that before very manually, by making changes to the mapping itself, but it's painful [10:56:07] very manually [10:56:18] e.g. remove ulsfo from all countries except for a few, then repool ulsfo which only brings those in, etc [10:57:13] weight knobs that worked automagically with geographic mapping would be ideal, but we're pretty far from having them :) [11:03:14] https://github.com/gdnsd/gdnsd/issues/126 + https://phabricator.wikimedia.org/T94697 [11:14:50] moritzm: we're merging stuff together :) [11:15:28] moritzm: can I go ahead with puppet-merge? [11:15:28] shall I or do you want to?
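For illustration, a rough sketch of what that manual "remove ulsfo from all countries except a few" trick looks like in a gdnsd geoip map -- the shape follows gdnsd-plugin-geoip map syntax like our config-geo, but the continent, countries and datacenter lists here are invented placeholders, not the real mapping:

    generic-map => {
        datacenters => [eqiad, codfw, ulsfo, esams]
        map => {
            NA => {
                US => [codfw, eqiad]          # ulsfo temporarily dropped from the list
                CA => [ulsfo, codfw, eqiad]   # one of the "few" countries kept on ulsfo
            }
        }
    }

Repooling ulsfo then only brings back the countries still listing it, which is why doing this by hand is painful; real weight knobs (the gdnsd issue linked above) would avoid editing the map at all.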
[11:15:32] go ahead [11:15:47] done [11:34:37] i should have a look when eqiad caches are due for refresh [11:34:38] can't be long [11:36:32] IIRC the 8x old ones still in service in esams were next-up weren't they? [11:36:45] yes [11:36:47] those first [11:36:58] i was thinking next quarter or early Q3 [11:36:59] for that [11:37:04] ok [11:37:05] i've already started prepping a bit [11:37:24] I really wish we could get over a few more abstraction and cluster-merging hurdles before then, but it's not the end of the world if we don't [11:37:38] what do you mean? [11:38:00] misc has its own 4x4 cluster mostly for logical reasons, not physical ones [11:38:29] if it didn't make our VCL code so much more unintelligible, misc's traffic could be a sub-case within e.g. the text cluster. [11:38:46] it would also confuse stats, but again that's a logical reason not a real traffic-level/perf reason [11:39:35] ok [11:39:37] varnish5's per-blah VCLs could handle that better [11:39:50] there's not a good way to merge them that doesn't cause us pain today [11:40:29] another model I was thinking of, though, was to start having clusters share nodes just in the backend but not frontend layers [11:41:05] the backend VCL is smaller and simpler and easier to make shared in most cases. the frontend is where most of the stats and traffic levels and complex VCL is at. [11:41:47] and then it might be ok for a lower-traffic cluster to have e.g. 2x nodes in a DC, if we know that in an emergency we can always re-role a frontend from an adjacent cluster to maintain redundancy [11:42:13] but none of those ideas are mature enough yet to plan around, that's what I'm complaining about :) [11:44:10] heh ok [12:22:19] bblack: moving the conversation back here :) [12:22:31] I'll upgrade ulsfo soon then [12:23:17] awesome [12:23:32] cool that I don't have to be careful in this case [12:26:14] let's rephrase that: s/be careful/worry too much about timings and such/ [12:26:34] <_joe_> ema: it sounded more daring in the first version :P [12:26:43] right :) [12:28:45] all the risk waits for us at the end, when we accidentally saturate the whole wave to ulsfo for a few seconds :) [12:29:07] for some value of "accidentally" [12:30:17] I really don't think it will, though. we'll see some ugly peak that doesn't completely saturate, and drops off very quickly, probably exponentially [12:31:48] for the esams and eqiad cases, I still like the idea of temporarily moving esams backends to route through codfw, then rolling through upgrades within eqiad and esams, then putting the routing back. [12:32:38] <_joe_> makes sense [12:32:47] <_joe_> so, why is mixing v3/v4 bad? [12:33:03] <_joe_> I mean having v4 varnishes interact with v3 backends?
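As a purely hypothetical sketch of what that esams re-route amounts to (the key name and layout are illustrative, not the actual puppet/hiera data), the inter-cache routing discussed above is just a per-DC table of where each DC's backends fetch from:

    # illustrative only -- not the real hiera key or file
    cache::route_table:
      eqiad: 'direct'   # backends go straight to the applayer
      codfw: 'direct'   # currently also straight to the applayer for upload
      ulsfo: 'codfw'
      esams: 'codfw'    # temporarily via codfw instead of eqiad while the upgrades roll

with esams flipped back to eqiad once both DCs are on v4 and refilled.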
[12:33:33] maybe a good sequence there is: (1) switch esams->codfw in cache routing (2) roll through esams v4 upgrades (3) roll through eqiad v4 upgrades (4) switch esams routing back to eqiad (5) switch codfw routing back to eqiad [12:33:46] potentially 2 and 3 could proceed in parallel, I think [12:34:29] _joe_: we don't know the root cause definitively, but there seem to be incompatibilities mixing v3/v4 requesting from each other, which lead to a low-but-notable rate of either 503s or responses with artificial content-length:0 [12:35:12] <_joe_> yeah I got what the problem was, and that you suspected it was that mix causing it, it just sounded somewhat strange :) [12:35:13] my best guess at this point is that the bug is on the v3 side, and in practice it doesn't affect v3<->v3 or v3<->real users, but it does affect interaction with v4. [12:36:05] we saw some pretty obvious temporary glitches when rolling through v3->v4 upgrade even within one DC, which is from the temporary condition of v3/v4 mixing in both directions. [12:36:20] but even with a whole DC upgraded to v4, we see issues when a v3 datacenter backs it [12:36:33] but they poofed when we switched to a v4-only setup [12:36:57] there's some hints we were looking at with past known-issues in our v3 related to the kind of fallout we're seeing [12:38:44] we did look for a more-definitive answer, but that could take forever, and our custom hacky v3 build is known to be missing at least one upstream bugfix that seems relevant, but conflicted with our custom plus patches in a way that was difficult to resolve in the past. [12:41:52] all of this theory could fall apart, of course, if we re-observe the issues when we put users back in ulsfo on a v4->v4 stack [12:42:00] yeah [12:42:46] but at least then we've got a bug report we can work with. it's hard to work with upstream on "hey there seems to be an interop bug in one of these two varnish versions, one of which is a hacky unsupported variant of v3" [12:43:27] reducing the problem to the v4->v4 case makes things much simpler to think about and report on [12:44:30] still got no replies about the stalls and the gigantic issue with reload after vmod upgrade [12:45:27] I think at this point they're thinking "shit it's them again" [12:46:18] <_joe_> ema: prolly, yes [12:46:36] <_joe_> ema: we also are, traffic-wise, probably their biggest user [12:46:53] <_joe_> I don't know if cloudflare uses varnish [12:47:29] I don't think they do [12:48:08] akamai does :) [12:48:24] but who knows what version or hacks, or in conjunction with what else in their stack [12:48:43] fastly runs v2 heavily modified, it's essentially a private fork apparently [12:49:35] <_joe_> akamai uses varnish? [12:49:39] we might be the biggest user of a relatively unmodified and current open source varnish, at least once we finish moving upload to v4 [12:49:41] <_joe_> they swore to me they didn't [12:50:04] I think they do, but I could be mis-interpreting [12:51:12] _joe_: evidence is http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-120.pdf linked by daniel into our discussion in https://phabricator.wikimedia.org/T144187 [12:52:29] but technically all that's evidence of is that akamai helped/participated in academic research on improving caching, and that research's output included some heavy-duty code written for integration into varnish and such.
[12:53:00] technically, it could just be that varnish was the open-source platform of choice for proving/presenting the results of what they ended up doing with similar algorithms in completely different software at akamai [12:53:55] <_joe_> bblack: akamai engineers told me clearly everything they use is house-built [12:54:06] it's entirely possible [12:54:27] the point of the paper is to explore a general algorithmic approach that could help any cache [12:54:39] <_joe_> it might borrow parts from varnish, but at least their ESI platform is like 10x more complex and evil than what varnish allows [12:55:08] <_joe_> but then again, that could be on /top/ of varnish [12:55:22] again: that akamai participated and provided data, and the paper's solution was written for varnish, doesn't mean the algorithm isn't used with other software inside akamai and varnish is just the open-source reference cache for public paper purposes. [12:55:25] <_joe_> also they had some really, really fancy stuff with image/video caching [12:55:35] but I'd tend to think the simplest answer is that they use varnish [12:57:13] see especially "3.3 Integration with a production system" in the linked PDF [12:57:42] it seems odd they went to the trouble of making the reference implementation of AdaptSize on Varnish so scalable for production workload on a pragmatic level if it's not being used and is just an academic example [12:58:44] "The value of exponential function outgrows even 128-bit number representations. We solve this [12:58:47] problem by using an accurate and efficient approximation for the exponential function using a Padé [12:58:50] approximant [65] that uses only addition and multiplication operations, allowing the use of SSE3 [12:58:53] vectorization on a wide variety of platforms, speeding up the model evaluation by about 10× in our [12:58:56] experiment." [12:59:09] who would care about that for test/proof code, given it doesn't matter to the results, but matters to using the code in a high-scale environment? [13:35:08] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2619143 (10ema) codfw is running fine with v4 routed straight to the applayer. We're going to upgrade ulsfo back to v4 routed to codfw to test v4<->v4 behavior. [13:36:06] bblack: not sure if it is a good moment for some questions about creating a new domain (pivot.w.o) [13:36:46] (dns part + varnish + else if needed) [13:44:21] elukey: sure [13:44:53] bblack: thanks :) [13:45:07] for the DNS part (last one to do), is https://gerrit.wikimedia.org/r/#/c/309312/1/templates/wikimedia.org sufficient? [13:45:32] elukey: yes, but should be merged after the varnish part is fully deployed [13:45:39] yep yep [13:46:03] (just to avoid pointless cached 404s) [13:46:42] and about it: do I need to update the misc director only? IIUC there is no need of a special SSL cert because of the wildcards, but I have some doubts about the nginx -> Varnish part. [13:47:51] Basically my idea is to create a code change like the one that I did for yarn.w.o [13:47:52] yeah assuming the service is already running on some internal service endpoint, the only changes for public traffic are the DNS linked above, and adding stuff to modules/role/manifests/cache/misc.pp's list of servicehost->backendhost stuff [13:48:23] awesome [13:49:08] is there any doc about what geoip!misc-addrs does? I am really curious (I can guess what it does but I'd like to read more).
You can simply tell me goto gdnsd docs :) [13:49:44] if you mean at the gdnsd level, it means "use the geoip plugin to resolve this hostname, and feed it the service label misc-addrs" [13:50:09] which you can then look for in our dns repo in the file /config-geo [13:50:18] misc-addrs => { [13:50:18] map => generic-map [13:50:19] ... [13:52:41] thanks! I am going to check :) [13:57:04] upgraded ulsfo upload to v4 [13:59:36] <_joe_> ema: nice, already done? [13:59:36] \o/ [13:59:37] <_joe_> wow [14:00:53] yeah when you don't care about cache contents it's a pretty quick thing :) [14:06:58] nice [14:07:23] I can do the dns stuff at ~20:00, I doubt you want to be here that late [14:09:17] bblack: that would be great, yes [14:09:33] thanks :) [16:41:09] > 35M frontend nukelrus, ~ 3M backend nukelrus in codfw [16:41:28] and still no explosions [16:41:58] that said, there has been a little 501 increase today [16:42:25] starting a little after 15:40 [16:43:20] oh, Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98) [16:43:20] I forget, what was up with 501 before? [16:43:26] right, that heh [16:43:29] ignorably IMHO [16:43:57] it might even make sense to redirect that UA to a page stating we're in 2016 [16:44:55] "Hi fellow traveler from the past. Please use a contemporary client. -- The Future" [16:45:11] "Welcome to the Internet, Newcomer. Please discard all devices and software over 5 years old before entering!" [16:47:25] Ubuntu Precise turns 5 around the same time MS drops all security support for Vista (April next year) [16:47:47] I hope we're long off it before then heh [16:48:56] imagine the party MS threw when they dropped support for XP [16:49:21] while laughing at the rest of the world that still has to support it :P [16:49:28] right [16:50:44] great plan, guys. dominate the consumer computer market, then start releasing increasingly shitty updates for years that nobody wants to update to because they're incompatible and/or broken and/or ugly and/or privacy-invasive, then drop all tech/sec support for those stuck on the last reasonably-ok version. [16:51:32] (well for some peoples' value of reasonably-ok) [16:51:57] as in it kinda allows me to run office without crashing every day [16:52:12] and yeah, a few viruses, but hey [16:54:16] I've been telling all the old folks in my family to stop using true local machines/software altogether and just try to do e.g. chromebook + google services + various other web services [16:54:32] not because it's ideal, but because anything else is too complicated for them to secure it and I don't want to deal with it either heh [16:56:30] http://futureoftheinternet.org/ good book about this stuff (terrible title though) [16:57:22] walled gardens security vs. openness/hackability/freedom [16:57:26] yeah I hate what pushing them towards google cloud stuff means for their privacy and such [16:58:02] but given the acceptable/simple/old-people-compatible alternatives that are realistic in the market today, it's better than them using local ancient software on windows and participating in botnets routinely [16:58:40] I did get my parents to move to a mac at least, some years back [17:02:51] participating in botnets :) [17:03:25] anything against me merging https://gerrit.wikimedia.org/r/#/c/309315 ? :) [17:05:00] elukey: nope [17:05:12] thanks! 
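To make the geoip!misc-addrs piece above concrete, a sketch of how the two halves fit together for a hostname like pivot -- the zonefile line mirrors the kind of change in the gerrit patch above, while the config-geo addresses below are invented placeholders rather than the real misc-web service IPs:

    ; templates/wikimedia.org zonefile: DYNA hands resolution to the geoip plugin,
    ; which looks up the misc-addrs resource defined in config-geo
    pivot    600    IN    DYNA    geoip!misc-addrs

    # config-geo resource (placeholder addresses)
    misc-addrs => {
        map => generic-map
        dcmap => {
            eqiad => 208.80.154.1
            codfw => 208.80.153.1
        }
    }

The generic-map it references is the per-continent/country mapping that decides which datacenter's address a given resolver gets back.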
[17:35:34] godog: varnish exporter packaged: git+ssh://git.debian.org/git/pkg-go/packages/prometheus-varnish-exporter.git https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=837089 [17:36:23] can't wait to play with prometheus! [17:39:22] ema: really nice! <3 [17:40:06] yeah I guess we'd run one of those per varnish instance on two different ports, the puppet part should be reasonably straightforward [17:40:31] right, the exporter allows passing -n to varnishstat so it should be really quite easy [17:41:27] in which DCs do we have the prometheus "servers"? They're on the bastions right? [17:42:34] codfw/eqiad for now, but yeah I was planning on having it on the ulsfo/esams bastions too [17:43:10] alright [17:53:30] * godog off for real [18:45:05] 10netops, 06Operations: configure port for frdb1001 - https://phabricator.wikimedia.org/T143248#2620310 (10Jgreen) [19:37:31] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, 06Operations: update *.wmflabs.org certificate - https://phabricator.wikimedia.org/T145120#2620595 (10RobH) [19:40:21] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2620630 (10RobH) a:05RobH>03chasemp [19:41:47] 07HTTPS, 10Traffic, 06Labs, 10Labs-Infrastructure, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2620595 (10RobH) p:05Normal>03High [19:57:39] !log repooling normal traffic to cache_upload in ulsfo [19:57:39] Not expecting to hear !log here [19:57:42] :P [20:11:34] so far pretty smooth on ulsfo. looks like network spike (at least, in the 5-min averages shown in librenms) is ~2Gbps [20:12:03] peak anyways. the 10 minute TTL ramp-in on gdnsd spreads things out a little vs what you'd expect from a hard switch of users [20:12:23] it may yet go a little higher [20:21:12] yeah peak (5min avgs) was 2.19Gbps for codfw->ulsfo cache refill traffic, it's dropping back now [20:21:40] cache is slowly filling in, too, with front/local hits replacing remote hits [20:31:26] 10Traffic, 06Analytics-Kanban, 06Operations, 06Performance-Team: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2620857 (10Nuria) [21:04:20] bblack: still looking good? [21:04:38] yeah. I've logged some CL:0, but it's because I'm not filtering 416 well :) [21:04:53] because those 416 start out as 200, so they match RespStatus==200 [21:05:20] yep. As a lame workaround you could run varnishncsa and grep -v " 416 " [21:05:22] now I'm trying this: [21:05:25] varnishlog -cn frontend -g request -q 'ReqMethod ~ "GET" and ReqHeader ~ "Host: upload" and RespStatus == 200 and RespStatus != 416 and RespHeader ~ "Content-Length: 0"' [21:05:35] tried already, doesn't work [21:05:50] it seems to work right (in that if I change the "CL: 0" to "CL: 9", it does log a bunch of normal requests quickly) [21:06:07] will still match a 416? [21:06:15] funnily enough, yes [21:06:31] oh well [21:06:52] maybe with this transactional logging of request processing, they should implement some kind of array syntax :P [21:07:06] varnishlog -cn frontend -g request -q 'ReqMethod ~ "GET" and ReqHeader ~ "Host: upload" and RespStatus[0] == 200 and RespStatus[1] != 416 and RespHeader ~ "Content-Length: 0"' [21:07:10] :) [21:07:37] in any case, no 503 issues either, everything seems ok [21:07:38] :) [21:07:51] yeah I was just looking at grafana [21:08:38] very good then!
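The trouble with the -q attempts above is that in request grouping each query term only needs to match some record in the transaction, so the RespStatus record that starts as 200 satisfies both "== 200" and "!= 416" even when the final status is 416. The varnishncsa route sidesteps that, since it emits one line per completed client transaction carrying only the final status; a rough sketch of that workaround (using awk instead of the grep -v variant, with an illustrative output format, not the exact command used that day):

    # one line per client transaction: final status, response Content-Length, request line
    varnishncsa -n frontend -q 'ReqHeader ~ "Host: upload"' \
        -F '%s %{Content-Length}o "%r"' \
      | awk '$1 == 200 && $2 == "0"'   # string-compare CL so a missing header ("-") doesn't match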
see you tomorrow [21:08:55] on librenms with the ulsfo link, over those first 10 minutes (while DNS TTL going out), it ramped up quickly to the 2.19Gbps peak, and has been very slowly trailing off since [21:09:14] seems stable-ish at ~1.6Gbps now [21:09:25] but I bet it keeps dropping once the backends in ulsfo fill up more [21:09:27] cya :) [22:22:01] 07HTTPS, 10Traffic, 06Operations: wmflabs.org should enforce HTTPS - https://phabricator.wikimedia.org/T144790#2621191 (10AlexMonk-WMF) [22:22:12] 07HTTPS, 10Traffic, 06Labs, 06Operations: wmflabs.org should enforce HTTPS - https://phabricator.wikimedia.org/T144790#2610285 (10AlexMonk-WMF) [22:23:14] 07HTTPS, 10Traffic, 06Labs, 06Operations: wmflabs.org should enforce HTTPS - https://phabricator.wikimedia.org/T144790#2610285 (10AlexMonk-WMF) There's also T102367, but that's specific to tools [22:57:38] ema: assuming ulsfo+codfw hold up (and it seems like they will so far!), let's look at trying to do the eqiad+esams magic on Monday. I can wake up early to help. I'll be traveling starting Tuesday. [22:59:02] my best thinking on that is still basically: (1) re-route esams->codfw (2) roll through eqiad+esams at the quickest reasonable pace in parallel (3) wait for varnish-caching stats to indicate backends mostly-refilled in at least eqiad if not both (4) switch esams->eqiad routing again. [23:05:51] hmmm was just looking at graphs, though, and esams low point is centered somewhere near 02:00, a few hours wide [23:06:27] which isn't really an awesome time for either of us heh, much less both [23:06:51] we don't have to care as much about eqiad's load-timing. It's nice-to-have there, but at least the refill isn't going over a WAN link in that case. [23:08:11] still, eqiad starts dropping off sharply somewhere in the vicinity of ~01:30-02:00, reaching the bottom at ~09:00 [23:10:04] if we assume, say, 20 minute node timing, either DC is something like a 4 hour job spacing them out [23:11:15] let's try 15, makes it slightly more palatable [23:13:32] which makes it a 3h window to do a DC, and spreading eqiad+esams work apart and leaving the routing odd for longer is less an issue than leaving either one running mixed v3/v4 for long. means something like: re-route esams at 01:00, esams upgrade window 01:00-04:00, eqiad upgrade window 06:00-09:00, and then fixing routing probably many hours later when it makes sense. [23:15:20] I could maybe kick that off late Sunday night my time / early Monday morning yours (possibly slightly later than 01:00, but it will still work, those times are approx), and then you can pick it up later when you get online. [23:25:15] or we could do the same thing 24H later I guess [23:26:07] I donno, we can talk over options and think of better solutions maybe tomorrow when you read this :)
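As a back-of-the-envelope sketch of that spacing math (hostnames and the per-node step below are placeholders, not the actual depool/upgrade procedure): the 4h-at-20-minutes figure above implies roughly a dozen nodes per DC, so 15-minute spacing lands at about 3h:

    # placeholder host list and upgrade step, just to show the timing shape
    # ~12 nodes x 15 min = ~3h per DC (vs ~4h at 20 min spacing)
    for host in cp-esams-{01..12}.example; do
        echo "upgrading ${host}"    # stand-in for the real per-node steps (depool, upgrade, wipe, repool)
        sleep $((15 * 60))          # 15 minutes between nodes
    done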