[09:18:33] 10Traffic, 10Analytics-Kanban, 10Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3377967 (10ArielGlenn) p:05Triage>03Normal [09:21:34] greetings, I'm trying to debug a problem with increased 503s since we've deployed thumbor to commons this morning, i.e. after 8:40 here https://logstash.wikimedia.org/goto/8d7dc129eb257ff92e3d9e722c6fd4e9 [09:22:29] matches an increase in failed fetches in eqiad https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1498464980368&to=1498468899083&panelId=3&fullscreen&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=upload thanks to ema's dashboard [09:22:58] though requesting with curl e.g. the top failed url https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Echinocactus_grusonii_Hildm.PNG results in a 404 [09:23:33] interestingly, looking into varnishlog, those actually *are* 404s that become 503s [09:23:36] i.e. with curl -v 'http://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/f/f2/Echinocactus_grusonii_Hildm.PNG' -H 'Host: upload.wikimedia.org' >/dev/null [09:23:37] I'm seeing some binary junk right now on that url opened in chrome [09:23:53] see https://upload.wikimedia.org/wikipedia/de/thumb/7/77/75px-Cyclobutan.png [09:24:02] https://www.dropbox.com/s/i8v6z4cqtxsdejn/Capture%20d%27%C3%A9cran%202017-06-26%2011.23.55.png?dl=0 [09:24:46] wow, no I haven't got binary junk after the error page yet [09:24:55] first time I see that, I've opened a bunch [09:25:20] gilles: browser? [09:25:28] chrome [09:27:21] this time it hung for a while and I have a double error page + a little binary junk [09:27:42] https://www.dropbox.com/s/dgzw617jqhauoj6/Capture%20d%27%C3%A9cran%202017-06-26%2011.27.37.png?dl=0 [09:28:07] godog: curl: (56) Illegal or missing hexadecimal sequence in chunked-encoding [09:28:18] godog: I get that ^ by running the curl command you've mentioned above [09:28:41] same [09:30:00] indeed [09:30:57] godog: also it's a 404 with Content-Length: 69 and Transfer-Encoding: chunked, that might confuse varnish [09:31:11] through ms-fe.svc but not thumbor.svc, so it is rewrite.py [09:31:40] I was about to report the same problem :) [09:32:34] gilles: I think we should revert, varnish seems indeed confused by the above [09:32:53] from the varnishlog that I got it seems what ema is saying, CE chunked + CL ==> Backend Fetch failure for a 404 [09:36:00] so yeah the problem seems to be exactly what curl says [09:36:40] I've tried with netcat, there's no chunk size [09:38:59] yeah looks like to me rewrite.py is passing on t-e but adding a c-l [09:39:20] I see the content-length in thumbor [09:39:29] but no chunked-encoding [09:39:39] I think the actual problem might be sending t-e: chunked while the content isn't chunked [09:40:23] ah, there is an actual body with the 404, makes sense [09:40:31] yep! [09:40:44] 404 Not Found [09:40:49] so chunked encoding is added by rewrite.py/swift? 
[09:41:02] I got it from nginx too afaics [09:41:02] then a bunch of newlines and finally: 404: Not Found404: Not Found [09:41:32] ah yes, nginx has Transfer-Encoding: chunked [09:42:54] nginx has chunked on by default [09:43:50] indeed, though on a 200 from thumbor the transfer isn't chunked [09:43:55] elukey@mw1295:~$ curl --header "Host: upload.wikimedia.org" "localhost/wikipedia/de/thumb/7/77/75px-Cyclobutan.png" -i shows Transfer-Encoding: chunked [09:44:03] iow [09:44:05] curl -v 'http://thumbor.svc.eqiad.wmnet:8800/wikipedia/en/thumb/8/89/Mohammed_Bin_Salman.jpg/100px-Mohammed_Bin_Salman.jpg' -H 'Host: upload.wikimedia.org' > /dev/null [09:45:36] confirmed, varnish returns a 503 with "Error: Body cannot be fetched" on te:chunked responses with non-chunked contents [09:45:41] what would be the harm of turning off chunked on nginx? [09:46:43] see https://phabricator.wikimedia.org/P5630 [09:48:03] I can't think of any harm in the nginx <-> swift <-> varnish case but I might be missing sth [09:48:36] ok, we can turn it off. although to me it looks like it's chunking properly: [09:48:55] https://phabricator.wikimedia.org/P5631 [09:49:24] maybe rewrite.py can't handle the chunking properly? [09:50:07] possible yeah, I don't get why nginx adds chunking to 404 but not the 200 from thumbor heh [09:51:00] this is from ms-fe:80 -> https://phabricator.wikimedia.org/P5632 [09:51:33] yeah, looks like rewrite.py isn't liking that [09:53:41] https://gerrit.wikimedia.org/r/361428 [09:53:56] so thumbor uses te:chunked for 404s but not for 200s? [09:54:29] ema: I believe it is nginx adding te:chunked, I haven't seen thumbor replying chunked yet [09:54:45] gilles: thanks! can you add a comment above to explain why? [09:55:01] thumbor doesn't send back chunked, it's nginx that adds it [09:55:18] ema: e.g. from thumbor1001 curl -v 'http://localhost:8816/wikipedia/en/thumb/8/89/Mohammed_Bin_Salman.jpg/100px-Mohammed_Bin_Salman.jpg' > /dev/null [09:57:41] godog: done [09:58:32] thanks, I'll try it and merge [09:59:14] and indeed nginx doesn't add it to 200s, who knows why... 
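For context on the failure mode being chased above: a response that advertises Transfer-Encoding: chunked but whose body is no longer chunk-framed (and which also carries a Content-Length) is malformed, and varnish gives up on the fetch with a 503. A proxy layer that buffers a backend body and re-frames it with a Content-Length has to drop hop-by-hop headers such as Transfer-Encoding at the same time. The sketch below illustrates that header hygiene with a hypothetical Python helper; it is not the actual rewrite.py/swift code, nor the nginx-side fix in the Gerrit change above.

```python
# Hypothetical sketch (not the actual swift rewrite.py): when a proxy layer
# buffers a backend response body and re-frames it with Content-Length, it
# must drop hop-by-hop headers like Transfer-Encoding (RFC 7230), otherwise
# a downstream cache sees "chunked" framing the body no longer satisfies and
# fails the fetch -- the 404-turned-503 pattern seen in the logs above.

HOP_BY_HOP = {
    'connection', 'keep-alive', 'proxy-authenticate', 'proxy-authorization',
    'te', 'trailer', 'transfer-encoding', 'upgrade',
}

def reframe_response(status, headers, body_chunks):
    """Return (status, headers, body) safe to hand to the next hop."""
    body = b''.join(body_chunks)  # de-chunked, fully buffered body
    clean = [(k, v) for k, v in headers
             if k.lower() not in HOP_BY_HOP and k.lower() != 'content-length']
    clean.append(('Content-Length', str(len(body))))
    return status, clean, body

# Example: a 404 that arrives marked Transfer-Encoding: chunked, like the
# one discussed above; after re-framing, only Content-Length remains.
status, headers, body = reframe_response(
    '404 Not Found',
    [('Content-Type', 'text/plain'), ('Transfer-Encoding', 'chunked')],
    [b'404: Not Found', b'404: Not Found'],
)
assert not any(k.lower() == 'transfer-encoding' for k, _ in headers)
```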
[09:59:40] yeah [10:01:01] almost OT: I've tried adding random characters to the request URL above and I still get a 200: [10:01:04] curl -v 'http://localhost:8816/wikipedia/en/thumb/8/89/Mohammed_Bin_Salman.jpg/100px-Mohammed_Bin_Salman.jpg.whatever.blabla' > /dev/null [10:01:21] oh, that changes Thumbor-Parameters: [10:01:32] "filters": "conditional_sharpen(0.0,0.8,1.0,0.0,0.85):format(blabla)" [10:01:34] interesting [10:02:43] it probably defaults to jpg when you ask for a format that doesn't exist [10:03:02] the URL handler is definitely a bit permissive, but so is mediawiki in some cases [10:04:26] the second mention of the filename in the last part of the thumbnail URL is redundant in practice [10:05:49] sigh I was a little too eager with running puppet on thumbor* but it is restarted everywhere now [10:06:46] yeah looks like 503s are recovering [10:07:06] failed fetches dropping like a rock [10:08:01] opening the old 503s in my browser 404s properly now [10:08:29] I'm wondering if chunked transfers weren't working because proxy_http_version 1.1 is missing from the nginx conf [10:09:30] (the default value is 1.0 apparently) [10:16:25] I've learnt not to poke the nginx dragon [11:46:16] uh, actually in this case it's the swift-proxy / rewrite.py dragons that are doing the wrong thing, turning a chunked response body into a non-chunked response, but still returning the te:chunked header, adding content-length, and messing up the content-type in the process as well [11:46:33] compare https://phabricator.wikimedia.org/P5631 with https://phabricator.wikimedia.org/P5632 :) [12:47:23] well the good thing about chunked is it lets you stream out content (perf win) when you don't know the final length to add a CL header in the first place. [12:47:58] which is why dynamic compressors and such like to emit chunked, so they don't have to cache up the whole compressed output in memory first before emitting the headers inclusive of CL and then the content [12:49:13] but for cases like thumbor/swift, we really want to waste the time/space generating CL and then transmitting it at every step. We only have to determine CL once (when storing an object into swift, or thumbnailing it to be stored into swift) [12:49:38] the upload cache relies on CL information to be smart anyways, it doesn't like CL-less images [12:50:10] (and you can still "stream" out a large object when sending it with a CL, if you know the CL up-front (which we do after the first storage or transmission)) [12:51:05] I guess above: s/at every step/at every place we generated or store a new image/ [12:52:46] so TL;DR - it's good to not support chunked in the beneath-varnish layers for the swift/thumbor sort of case. It avoids the possible messes that would ensue if those layers were for some reason failing to emit CL headers to varnish. [13:29:55] since we are talking about upload and 50Xs, there are periodical 500s in https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X for upload.w.o [13:29:58] :) [13:32:55] bblack: fascinating (re: c-l) thanks! [13:33:10] from cp1073's varnishlog I see engine: wikimedia_thumbor.engine.svg and Internal Server Error returned from the backend [13:33:41] yeah looks like a legit 500 from thumbor [13:34:29] meant to crash in that way?
[13:36:43] maybe not meant but it might happen that some rendering tools choke on uploaded media yeah [13:37:16] ahhh okok [13:54:35] godog: any reason to keep global datasources in the datasource template here https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats or can I filter them out? [13:54:56] ema: yeah safe to filter them out [13:55:09] ok thanks! [14:00:32] bblack: I've merged https://gerrit.wikimedia.org/r/#/c/353274/ today limiting transient storage usage on text to 5G (fe) and 2G (be). Instead of rolling restarting all varnishes I guess we can just wait for cache host reboots (kernel updates) which I was planning on doing later on this week anyways [14:07:04] ema: sounds good :) [14:09:37] not sure if this was linked here in the past, https://twitter.com/keegan_csmith/status/862672319524864001 [14:13:36] nice! [14:13:51] so what we've informally talked about before on this sort of topic [14:14:40] varnish's normal model is there's the cache item's actual TTL, and then a short "grace" window beyond the TTL (both are tunable). during the grace window stampeders can use the slightly-stale copy while one client is triggering the actual refresh. [14:14:58] the whole reason the grace period is short is that you don't want to serve "stale" content too long [14:15:26] so we'd talked about reversing how this works. We start with our base TTL (let's say it's 86400), and instead of having grace go a fixed, say, 5 minutes beyond. [14:16:23] we put the grace inside the TTL as a percentage. So if we want 10% grace, we'd start with TTL=86400, and then transform that to TTL=(86400 - 8640 = 77760), Grace=8640. [14:16:45] that gives us a much broader grace window for the async/"stale" refresh window, but without actually serving any stale content [14:17:48] their idea in that paper seems to mostly be to do a probabilistic early refresh on a single client's non-stale req, or something along those lines [14:18:39] I do like the probabilistic angle, because we undoubtedly have rushes that occur between related separate objects that all share the same TTL values and were initially fetched around the same times. And all kinds of statistical stampeding around combinations of related TTL numbers in long term patterns, etc. [14:19:09] maybe the right way to blend up these ideas is to add some randomness to the "grace-inside-TTL" calculation for each object as it's stored? [14:19:56] e.g. we could set that grace-inside-TTL window as a range of, say, anywhere from 10% -> 20% of the object's life, and choose the exact amount randomly at object storage time. 
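The grace-inside-TTL arithmetic described above is small enough to express directly. The following is a minimal sketch, assuming a hypothetical helper rather than existing VCL or varnishd code: it carves the grace window out of the base TTL instead of adding it on top, and randomizes the slice within the 10%-20% range mentioned above so that objects fetched together don't all expire together.

```python
import random

def split_ttl(base_ttl, min_frac=0.10, max_frac=0.20):
    """Carve a 'grace' window out of base_ttl instead of adding it on top.

    The object is served as fresh for `ttl` seconds; during the following
    `grace` seconds one request triggers the async refresh while the others
    keep being served a copy that is still within the original base TTL.
    The fraction is randomized per object at storage time to de-synchronize
    expiries of objects that were fetched together with identical TTLs.
    """
    frac = random.uniform(min_frac, max_frac)
    grace = base_ttl * frac
    ttl = base_ttl - grace
    return ttl, grace

# With the fixed 10% example above, a base TTL of 86400 splits into
# ttl=77760 and grace=8640; here the slice varies per object in [10%, 20%].
print(split_ttl(86400))
```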
[14:20:23] (it's sort of like combining grace-inside-TTL with "random TTL reduction" in some sense, I guess) [14:20:54] yeah seems like even a small variation might be worth it, especially for upload where it seems likely to have many objects at around the same ttl waiting to expire [14:22:41] I love the sorts of solution he has in the paper in general [14:23:05] the kind where instead of the usual "track all this extra state in data structures and implement this complex book-keeping and checking per request" [14:23:36] it all boils down to "execute this one simple math function with probabilistic and/or random inputs to make a boolean decision on every request, and things work out well without all that other mess" [14:24:07] state is hard, cpu time for math is easy [14:26:57] while we're on the subject of algorithmic improvements we could use around here: [14:27:02] https://research.googleblog.com/2017/04/consistent-hashing-with-bounded-loads.html [14:27:27] ^ the idea there is to modify consistent hashing with something like "if the target node's load is too high, shift the request to the next failover load on the chash ring" [14:27:56] to avoid what we see where frontends might laser-focus a hot un(front)cacheable item onto a single chashed backend and overwhelm it [14:28:18] s/to the next failover load/to the next failover node/ heh [14:29:30] indeed, would be very nice to see that in ipvs' chash [14:30:13] of course you can do that the stateful-ish way where nodes report their own loads back and you're actually looking at the real load of the target servers [14:30:48] BTW speaking of keeping state vs. math, ipvs' chash does keep state in memory for no valid reason last time we checked :) [14:30:50] or the simple and slightly-less-accurate way, where a given frontend calculates a virtual load for each server it's balancing to, based on its own parallelism of reqs sent to that backend or whatever [14:32:55] bblack: that could be pretty inaccurate though, right? other frontends might put a much higher load on the given backend [14:45:20] yeah but it solves the important case of hotness correctly I think [14:45:52] let's say you have a hot item in the backend caches (or beyond), so this means an uncacheable object or e.g. an upload image that's too big for FE caching rules [14:46:23] sufficient hotness to cause problems for us means lots of client IPs. ipvs/lvs hashing means those IPs are distributed reasonably-fairly over all the frontends [14:46:55] but now this one URL is comprising, say, 10% of all requests received or something crazy like that: it will be about the same 10% share in any of the FEs [14:47:30] so they'll all experience the backend that URL is hashing to as one that they're sending excessive request-load through, and they'll all back off some reqs to the next-chash choice [14:48:29] the important thing I think is that nodes' loads are considered relatively [14:48:59] it's not a bound that's set by X reqs/sec, or X parallelism as a fixed limit you do the bounding at [14:49:45] but more like a percentage limit from the norm. "Node 7 has crossed the threshold of having 35% more reqs than the average of all nodes" or whatever. 
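A toy illustration of the bounded-load tweak to consistent hashing being discussed: hash the URL onto the ring as usual, but if the chosen backend has already received a disproportionate share of this frontend's own traffic, spill over to the next node on the ring. This is only a sketch of the decision logic, not ipvs or VCL code; the 35% threshold mirrors the example above and the hostnames are placeholders.

```python
import hashlib
from collections import Counter

class BoundedLoadRing:
    """Toy consistent-hash ring with the 'bounded loads' tweak discussed above.

    Each frontend only looks at the traffic *it* has sent: if the backend the
    URL hashes to has already received more than (1 + epsilon) times the
    per-backend average from this frontend, spill over to the next node on
    the ring.
    """

    def __init__(self, backends, epsilon=0.35):
        self.ring = sorted((self._hash(b), b) for b in backends)
        self.sent = Counter({b: 0 for b in backends})
        self.epsilon = epsilon

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, url):
        h = self._hash(url)
        # first ring slot at or after the URL's hash (wrapping around)
        start = next((i for i, (pos, _) in enumerate(self.ring) if pos >= h), 0)
        avg = max(1.0, sum(self.sent.values()) / len(self.sent))
        limit = (1 + self.epsilon) * avg
        for i in range(len(self.ring)):
            backend = self.ring[(start + i) % len(self.ring)][1]
            if self.sent[backend] <= limit:
                self.sent[backend] += 1
                return backend
        # unreachable when epsilon > 0 (not every node can exceed the
        # above-average bound at once); kept as a safe fallback
        backend = self.ring[start][1]
        self.sent[backend] += 1
        return backend

ring = BoundedLoadRing(['cp1071', 'cp1072', 'cp1073', 'cp1074'])
print(ring.pick('/wikipedia/commons/thumb/f/f2/Example.png/800px-Example.png'))
```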
[14:50:08] (in fancier math terms to make it sound more astute) [14:51:38] so each frontend is solving the problem it is creating, they're all responsible for only their own behavior [14:52:12] "As frontend node X, I notice in my outbound requests I'm sending very disproportionate traffic to backend Z, so I should stop doing that" [15:37:47] 10Traffic, 10Analytics, 10Operations: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560#3379313 (10Nuria) [16:18:19] 10Traffic, 10netops, 10Operations, 10User-Joe: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3379576 (10elukey) Just to be sure I'll shutdown kafka on kafka2001 before https://racktables.wikimedia.org/index.php?page=rack&rack_id=2207, please ping me 5/10 mins before the rack :) [16:43:31] 10Traffic, 10netops, 10Operations, 10User-Joe: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3379729 (10ayounsi) [18:41:26] 10Traffic, 10Operations, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3380116 (10ema) Just a few (partial) answers so far, but here we go! >>! In T164768#3374941, @BBlack wrote: > 1) cache_misc still has a `do_stream = false` case on the backend... [19:11:07] 10Traffic, 10Commons, 10Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3380185 (10BBlack) This task has gotten a bit confusing. Stepping back a bit from the specific case of Commons (be... [19:39:59] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10Anomie) It'd be nice if you define "API": are you talking about res... [20:11:19] Hey all, I'm working towards migrating my wikis to AWS. I'm currently using Varnish on each Apache server behind our load balancers. In AWS, this will use ELB, but I'm wondering about using CloudFront in front of the ELB and whether cache purges could still work correctly. [20:13:16] If you're putting your same apache+varnish stack in your EC2 instances behind ELB, it should be theoretically possible for cache purging of those varnishes to continue working (although I don't know how you're pushing the invalidations to Varnish today? HTCP + vhtcpd is what we use here today, which relies on multicast, which may not work there?) [20:13:34] but as for cloudfront, I don't even know what invalidation methods they support if they're caching in front of ELB for you [20:14:01] but obviously, somehow mediawiki config and/or code would need some kind of updates to forward invalidations to CloudFront in some way [20:14:13] Hmm, didn't know about an option for multicasting Varnish purges, that could solve a different problem. Currently I just have Varnish+Apache on each server and all of the servers' IPs in $wgSquidServers for invalidation. [20:14:37] Could I just add the wikis' external URLs to $wgSquidServers or would that cause other issues? [20:15:00] is invalidation working for you now with your servers' IPs entered in wgSquidServers? [20:15:49] CloudFront aside, the multicast idea is interesting when considering switching to an EC2 autoscaling group of web servers.
I will be using Salt to manage LocalSettings.php and would need it to automatically update the configs on scale up and down events. [20:16:11] Or use salt mine, reactor, etc. with autoscaling events [20:16:13] yeah so stepping back a bit to your current non-EC2 setup [20:16:24] (I had to go check MW docs to make sure how things work heh) [20:16:29] Currently on my physical servers, yes the invalidation is working. [20:16:50] setting wgSquidServers to all the varnish IPs means that it will send HTTP PURGE requests to them, so yeah that works [20:17:38] Right, and that currently works fine. Just trying to consider options that AWS opens up. [20:18:01] the method we use here (because we have tons of mediawiki servers and tons of varnish caches, separately) is to send multicast HTCP packets across our network to send the purge information, and we have a separate daemon "vhtcpd" running on the varnishes that listens to that multicast stream and turns it into HTTP PURGE traffic over localhost to each local varnish [20:18:38] but I'm not really sure, I don't think EC2 really supports multicast in its native normal networking [20:18:56] the vhtcpd bit is here: https://github.com/wikimedia/operations-software-varnish-vhtcpd [20:19:04] Another thought has been to move to dedicated Varnish EC2 instances with an ELB between those and the Apache/PHP servers, but then mediawiki still needs to know the Varnish servers for PURGEs unless, as you suggested, I could come up with a multicast or broadcast approach (within a Varnish security group) that could work [20:19:32] right [20:19:53] but cloudfront's a whole separate issue, I'm not sure which protocols they even accept automated purging via [20:20:22] Yeah, I don't know yet. I just came up with this idea this morning while we were having a meeting with our AWS reps. [20:20:45] apparently they have an amazon API for invalidations [20:21:11] http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html [20:21:44] Plus it's really just me managing the entire wiki environment (I'm the only Linux eng on my team) so I'd also like to keep things as simple as reasonably possible while still creating a fast, cost-effective wiki environment in AWS. [20:21:49] with POST reqs to this API being the mechanism: http://docs.aws.amazon.com/cloudfront/latest/APIReference/API_CreateInvalidation.html [20:22:20] also: "The first 1,000 invalidation paths that you submit per month are free; you pay for each invalidation path over 1,000 in a month." [20:22:33] Yeah saw that too. Could get pricey! [20:22:51] it might be simpler to just limit the cloudfront TTLs accordingly, to just offload the most-extreme bursts [20:23:13] e.g. you might have Varnish with relatively long TTLs and correct purging, just to service content fast without re-rendering at the MediaWiki level, etc... [20:23:42] but then have Cloudfront cache in front with a maximum TTL of, say, 5 minutes, and no real hookup for purging [20:24:18] you get at most 5-minute-stale pages, but if a given URL is getting hammered like crazy, 5-minute cache objects would only refresh 288 times per day [20:25:02] Our wikis are quite busy overall so even 5 minutes would frequently be too long. :/ [20:25:44] Unfortunately I don't have enough insight into caching and invalidation metrics beyond general hit ratios.
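For reference, the CloudFront invalidation API linked above is reachable from Python via boto3's create_invalidation call; the sketch below shows the shape of the request, with a placeholder distribution ID and paths. The per-path pricing quoted above is the reason simply capping CloudFront TTLs can be the cheaper trade-off.

```python
# Sketch of purging URLs through the CloudFront invalidation API mentioned
# above, using boto3 (distribution ID and paths are placeholders). Each path
# counts against the per-month invalidation pricing quoted above.
import time
import boto3

def purge_cloudfront(distribution_id, paths):
    cf = boto3.client('cloudfront')
    return cf.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            'Paths': {'Quantity': len(paths), 'Items': paths},
            # CallerReference must be unique per invalidation request
            'CallerReference': str(time.time()),
        },
    )

# placeholder distribution ID and article paths, just to show the call shape
purge_cloudfront('EDFDVBD6EXAMPLE', ['/wiki/Foo', '/wiki/Foo?action=history'])
```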
[20:26:28] I'd really love deeper insights like the stuff I see at grafana.wikimedia.org (I plan on implementing grafana on top of my graphite system) [20:28:42] well, generally you're going to get up to a handful of invalidations per article edit [20:28:58] I think the most basic set is /wiki/Foo and /wiki/Foo?action=history getting invalidated [20:29:21] if you have the same MobileFrontend plugin we do, that adds invalidations for the mobile-domainname variants of both as well, for 4 total URL purges per article edit [20:29:33] template edits can generate purges for lots of articles that use the template of course [20:30:23] We don't use the MobileFrontend extension and our sites are VERY template-heavy, like templates in templates in templates, etc. Plus semantic mediawiki in the mix... [20:30:30] cache invalidation is a Hard Problem, there are no easy answers, just tradeoffs between TTLs and the work you do to invalidate stale content [20:31:11] Ok, at least this confirms that there's no simple answer and that I should approach any thoughts of CloudFront carefully, with lots of testing. :) [20:31:20] https://twitter.com/codinghorror/status/506010907021828096?lang=en [20:31:58] but if you really have a desire to move into EC2, you can do that without cloudfront and just take the hits on your varnishes like you do today [20:32:21] the only really tricky problem there is the one you outlined about autoscaling and varnish IPs [20:33:13] there are lots of ways to hack that together. I don't think multicast actually works in EC2, even within a VPC, last I heard/checked. [20:33:51] Yeah, definitely moving to AWS, just making sure I think outside the box since AWS brings so many options to the table. For now it'll be a lift-and-shift, setting things up very similar to how they currently are in our datacenter, but once that's done, then I can start looking at more flexible stuff like autoscaling, lambda stuff, etc. [20:33:55] but you could perhaps write a short module/plugin/patch/something for MediaWiki to do something smarter than the statically-configured wgSquidServers [20:34:26] e.g. polling a list from the Amazon APIs for the IPs of the autoscale group so that the purges go to all of them without reloading config constantly. [20:35:15] I'm not really a developer and don't want to go writing custom mediawiki code. I typically like to keep things as simple and off-the-shelf as reasonably possible. [20:35:25] ok [20:36:02] There's a lot I can do with Salt and that's probably where most of my automation effort will go initially, until I'm ready to play around with more "advanced" AWS features. [20:36:17] Thank you very much for the advice, Brandon! :) [20:37:15] I think the hardest thing will be handling autoscaling, whether or not I move Varnish to dedicated servers.
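One way to act on the "poll the Amazon APIs for the IPs of the autoscale group" suggestion above without writing MediaWiki code is to regenerate the purge-target list out-of-band (from Salt or cron) and have LocalSettings.php include it. The sketch below is a hypothetical example of that approach using boto3's standard autoscaling and ec2 describe calls; the group name and output path are made up.

```python
# Sketch of the "poll the autoscaling group and regenerate wgSquidServers"
# idea above, meant to run out-of-band (cron/Salt) rather than inside
# MediaWiki. Group name and output path are hypothetical.
import boto3

def varnish_private_ips(asg_name):
    asg = boto3.client('autoscaling')
    ec2 = boto3.client('ec2')
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    instance_ids = [i['InstanceId']
                    for g in groups['AutoScalingGroups']
                    for i in g['Instances']]
    if not instance_ids:
        return []
    reservations = ec2.describe_instances(InstanceIds=instance_ids)['Reservations']
    return sorted(i['PrivateIpAddress']
                  for r in reservations for i in r['Instances']
                  if i['State']['Name'] == 'running')

def write_purge_targets(ips, path='/etc/mediawiki/squid-servers.php'):
    # Emit a PHP fragment that LocalSettings.php can include, so MediaWiki
    # sends its HTTP PURGEs to the current set of varnish instances.
    entries = ',\n'.join("    '%s'" % ip for ip in ips)
    with open(path, 'w') as f:
        f.write("<?php\n$wgSquidServers = [\n%s\n];\n" % entries)

write_purge_targets(varnish_private_ips('varnish-frontend-asg'))
```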
[20:37:15] np [20:37:48] you might also peek at Cloudflare too, they seem to have a simple purging API that might not have per-purge costs: https://api.cloudflare.com/#zone-purge-individual-files-by-url-and-cache-tags [20:38:07] you could put MW+Varnish in EC2 behind ELB, and then put Cloudflare in front of the ELB IP separately [20:38:36] but you'd still need to write some glue code/module somewhere to support purging via Cloudflare's API from MW (similarly to the Cloudfront case) [20:39:38] I'm not averse to writing code, my degrees although dated are in CS and I'm learning Python now for tons of stuff, but I've not done PHP in about 15 years :P [20:40:21] I actually really enjoy coding, but I try to keep things reasonably simple since I'm the only Linux eng here and I could get hit by a truck. [20:40:34] your brain cells are probably thanking you heartily for that latter concession :) [20:41:10] Yeah, a while back I stumbled on the Fractal of Bad Design article on PHP and remembered the horrors. :P [21:05:13] justinl: that's why you would want to upstream your patches :P [21:10:55] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3323716 (10RobH) So attempting to upload this over the web to the gui mgmt interface times out. It may work a bit better if done locally from eqiad. I'm pushing the file to my home... [21:17:38] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380505 (10Jgreen) [21:26:58] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380529 (10CCogdill_WMF) Major Gifts is discontinuing the events integration we had in place through these sites. The contract ends at the end of the month, so I'm pretty... [22:07:34] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380658 (10BBlack) Yeah we can close this task if the sites are gone. We'll want to remove the current IP address mapping for these hostnames from our DNS when this happ... [22:14:09] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380675 (10CCogdill_WMF) As far as I know, it is just those first two subdomains you listed. I'm not sure benefactors.wikimedia.org goes anywhere, anyway... [22:20:40] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#2359459 (10Dzahn) benefactors.wikimedia.org may not be used for HTTP(S) but it is apparently used for email? T130937 https://phabricator.wikimedia.org/rODNSc6dc7dcb64c4... [22:24:29] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Shop: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3380694 (10BBlack) @Jseddon @Mbeat33 - ping again? The redirect appears to work currently, but still no HSTS header. [22:27:36] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Shop: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3380705 (10BBlack) [22:27:38] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Shop: Canonical URL in Store points to HTTP address, should be HTTPS - https://phabricator.wikimedia.org/T131131#3380702 (10BBlack) 05Open>03Resolved a:03BBlack Currently this looks to be fixed. 
The relevant snippet on the live store site is now: ```... [22:27:51] 10Traffic, 10Operations: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3380706 (10CCogdill_WMF) Oh, I see. I'm not entirely sure about this. @DKaufman I'm trying to identify which domains are getting phased out with the Trilogy system. We u... [22:59:54] 10Traffic, 10Operations, 10Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#3380808 (10BBlack) 05Open>03Resolved a:03BBlack This has been working for some time, at least for the HTTPS issue at the root as tasked here! The other part about doc... [23:02:18] 10Traffic, 10Operations: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380816 (10BBlack) [23:03:12] 10Traffic, 10Operations: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380831 (10BBlack) [23:03:23] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380832 (10BBlack) [23:06:48] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380851 (10BBlack) The original point of this (now ~2 years old) tracking task was to track the very long tail of known but relatively-minor issues preventing us from reaching... [23:07:38] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380860 (10BBlack) [23:07:41] 10HTTPS, 10Traffic, 10Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#3380859 (10BBlack) [23:07:51] 10HTTPS, 10Traffic, 10Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (10BBlack) [23:07:54] 10Traffic, 10Operations, 10Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2715740 (10BBlack) [23:08:05] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [23:08:11] 10HTTPS, 10Traffic, 10Discovery, 10Operations, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884289 (10BBlack) [23:09:11] 10Traffic, 10Operations, 10Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#3380867 (10BBlack) [23:09:14] 10HTTPS, 10Traffic, 10Operations: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1101271 (10BBlack) [23:13:02] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380869 (10BBlack) [23:26:04] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3380884 (10BBlack) [23:26:42] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [23:27:34] 10HTTPS, 10Traffic, 10Operations, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [23:50:09] 10Traffic, 10Operations: 
stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380934 (10BBlack) @Ottomata - Any high level new info about timetables for deprecating and then removing the RCStream stuff in favor of EventStreams ( T130651... [23:52:43] 10Traffic, 10Operations: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3380949 (10BBlack) ( Note also ori did a soft announce of HTTPS transition for it about a year ago, but with no target date for disabling plain HTTP: https://l...