[00:06:54] 10Traffic, 06Operations, 06Zero, 13Patch-For-Review: Use Text IP for Mobile hostnames to gain SPDY/H2 coalesce between the two - https://phabricator.wikimedia.org/T124482#2237531 (10BBlack) This was merged around 2016-04-25 18:40 UTC, and legit caches that honor TTLs correctly should have all stopped handi...
[03:35:41] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#2237848 (10yuvipanda) Yeah, I submitted a patch to upstream to bind to 127.0.0.1 instead, and things are all ok now for me.
[08:05:14] 10Traffic, 06Operations: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2238375 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[09:41:51] 10Traffic, 06Operations: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2238726 (10MoritzMuehlenhoff) apache 2.4.10-10+deb8u4+wmf1 has been built against openssl 1.0.2 and uploaded to carbon. I'll update this bug once all existing jessie systems are upgraded.
[09:43:52] <_joe_> moritzm: if we're maintaining an apache package, we should build it for trusty too
[09:50:18] we don't have openssl 1.0.2 for trusty
[10:33:12] <_joe_> actually we need it for solving an annoying bug in the fastcgi module
[10:33:31] <_joe_> but then I just realized we'll be upgrading appservers to jessie this quarter probably
[13:00:34] 10Traffic, 06Operations: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2239219 (10MoritzMuehlenhoff) Apache on all jessie systems has been upgraded and restarted.
[13:00:49] 10Traffic, 06Operations: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2239220 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None
[13:01:41] moritzm: awesome :)
[13:25:54] bblack: FYI, there's a libgd security update. varnish links against libgd, but only as part of the image_filter module, which we don't use. so I'll install these for posterity on the cp* systems, but we don't need a varnish restart
[13:40:07] moritzm: ack, thanks
[14:18:52] 10Traffic, 06Operations, 13Patch-For-Review: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2239542 (10BBlack) 05Open>03Resolved a:03BBlack thanks @MoritzMuehlenhoff !
[14:36:05] bblack: I tried to add the 'maintenance' option in https://gerrit.wikimedia.org/r/#/c/285364/ but also filed a code review to modify the default Varnish error page https://gerrit.wikimedia.org/r/#/c/285363/
[14:36:33] (target cache::misc for the reimage of stat1001.e.w)
[14:46:55] bblack: so basically only something like if dir.key?('maintenance') and dir[maintenance].key?('message'):
[14:47:09] and then error_synth(503, dir['maintenance']['message'])
[14:47:46] no, I mean: if dir.key?('maintenance') error_synth(503, dir['maintenance'])
[14:48:02] and in the data: maintenance => 'this service is down for blah till foo'
[14:49:45] yes sure sorry I wrote it down in the wrong way.. Any concern about the new error page?
[14:50:03] elukey: while you're in there, maybe put a comment in role::cache::misc about using it, too, where existing comment block is with:
[14:50:06] # misc-cluster specific (for now!):
[14:50:42] sure!
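[On the libgd update mentioned at 13:25 above: the check below is not from the log, just a sketch of one way to confirm that no running varnishd still has an old libgd mapped, which would back up the "no varnish restart needed" conclusion. It only assumes the standard /proc layout on the cp* hosts.]

```bash
# Sketch: look for varnishd processes that still map a libgd which has been
# replaced on disk (mappings of removed files show "(deleted)" in /proc/PID/maps).
# An empty result means nothing is holding the old library, so no restart is needed.
for pid in $(pgrep -x varnishd); do
    if grep -E 'libgd.*\(deleted\)' "/proc/${pid}/maps" >/dev/null; then
        echo "varnishd (pid ${pid}) still maps a deleted libgd"
    fi
done
```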
[14:51:28] elukey: I think the error page change seems sane, at least for this case
[14:51:57] I know last time it was worked on, one of the things was alinging the visual of it with the MediaWiki error page (hence the CSS, etc)
[14:52:12] I don't know if the exact messaging was part of that too in some important way, but I don't think so, trying to figure that out
[14:52:37] *aligning :)
[15:00:17] 10Traffic, 06Analytics-Kanban, 10DNS, 06Operations: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2239690 (10Nuria) p:05Triage>03High
[15:02:01] 10Traffic, 06Analytics-Kanban, 10DNS, 06Operations: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10Nuria) @BBlack : can you confirm whether is OK with ops to deploy this domain to 1001?
[15:05:31] all right https://gerrit.wikimedia.org/r/#/c/285364 should be better
[15:48:26] all right, just tested it with the puppet compiler for 1052,1044 and 1043, no changes as expected. I'll wait for the code review for the default error template and then I'll merge
[15:54:29] ok
[16:04:24] 07HTTPS, 10Traffic, 06Operations: Preload STS for wikimedia.org - https://phabricator.wikimedia.org/T132685#2239986 (10BBlack)
[16:04:26] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2239987 (10BBlack)
[16:04:28] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2239984 (10BBlack) 05Open>03Resolved a:03BBlack
[16:08:10] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2240008 (10BBlack) Status Update: `letsencrypt::cert::integrated` seems to work as expected, and is managing 3x LE certs on carbon with automatic prov...
[16:46:49] bblack: any chance we can continue the varnish maps configuration around noon SF ?
[16:47:18] bblack: I have a meeting at 2:30pm SF, so I'm not going to bed early in any case...
[16:47:44] gehel: works for me. today should be much simpler than yesterday, we're past all the difficult bits :)
[16:48:22] bblack: you also said yesterday that all we had to do were merging a few patches and running a few commands... :P
[16:48:25] I'll be there...
[16:52:26] well technically that's all we did. I guess it depends on your definition of "a few" :)
[17:54:54] 10Traffic, 06Operations: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10BBlack)
[17:55:07] 10Traffic, 06Operations: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240512 (10BBlack)
[17:57:16] 10Traffic, 06Operations: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240517 (10Southparkfan)
[19:12:03] gehel: ping
[19:12:51] bblack: pong
[19:13:05] I was just thinking about you ...
[19:13:37] I'm going to need to take a break at some point to feed Oscar (a bit before 10pm), but I'm all yours until and after then
[19:15:59] ok
[19:16:17] let me put the LE stuff on pause a bit, and let's dig into this
[19:17:50] gehel: ok so where we left off is:
[19:18:15] 1. The new cache_maps hosts are all in their correct new config and upgraded to varnish4, and technically capable of offering service as far as we know
[19:18:42] 2. the route table is set up as esams->eqiad, ulsfo->codfw, eqiad->codfw, codfw->appservers
[19:18:57] 3. the user traffic on the front edge is currently all to eqiad (the new eqiad servers)
[19:19:44] 4. the old eqiad cache_maps boxes (cp104[34]) are dead dead - they're not in any confd/etcd lists, their puppet roles are switched back to just "include standard", and all service daemons are stopped (and there's a ticket for dc-ops to decom them)
[19:20:14] I'm starting questions again... why isn't esams->eqiad? We have a direct network link only to eqiad?
[19:21:07] it's just geography and resiliency. all the sites can reach each other over our private links. but ulsfo is more-directly connected (and with lower latency) to codfw than to eqiad.
[19:21:17] and esams is more-directly connected to (and has lower latency to) eqiad rather than codfw
[19:21:40] I suspected something similar...
[19:21:57] in the long long run when all application-layer services are active:active at both sites, and ignoring failover/fallback/exception states...
[19:22:17] we'd expect our normal configuration to be ulsfo->codfw->appservice.codfw, esams->eqiad->appservice.eqiad
[19:22:28] (two independent halves in Traffic routing terms)
[19:22:50] for now since most services are primary in eqiad, the way we configure most clusters/services is:
[19:23:08] ulsfo->codfw, codfw->eqiad, esams->eqiad, eqiad->appservers.eqiad
[19:23:29] we could in theory, without active:active apps, do that instead as this more-efficiently:
[19:23:37] ulsfo->codfw, codfw->appservers.eqiad, esams->eqiad, eqiad->appservers.eqiad
[19:24:15] but the reason we don't currently jump from caches in one DC directly to appservers in another is we lack any kind of encryption for cross-dc cache->app traffic, whereas we have ipsec protecting cross-dc cache->cache traffic.
[19:24:15] all that is left to do is enable LVS in front of those new servers and enable geo DNS?
[19:25:03] and then maps is a little different than the norm outlined above because currently it only has application servers in codfw, whereas most other services are primary (or only) in eqiad
[19:25:14] so we funnel down to codfw there instead of funneling down to eqiad like the others
[19:25:39] as we are going to get maps servers in both eqiad and codfw, should the goal be to have active-active?
[19:26:22] well we're going to get maps servers in both for sure, we'll at least be able to switch between them for failover
[19:26:49] we haven't had the active:active conversation about maps yet, but I suspect it's one of the ones that will be easy, since there's no real state that's not easily created in duplicate in both places
[19:27:07] yep, but it is a stateless service (from the client point of view) and it serves traffic directly to users
[19:27:32] there might be some nitpicking to do, about what if a client flaps from one to the other, and they're slightly out of sync on updating from upstream OSM data, etc...
[19:28:11] and yes, all that is left to do, fundamentally, is enable LVS in front of the new servers in ulsfo, codfw, and esams, and then enable geodns
[19:28:37] which is https://gerrit.wikimedia.org/r/#/c/268238 + https://gerrit.wikimedia.org/r/#/c/268240
[19:28:48] while we are at it, Elasticsearch is another service that could be active-active (it kind of is already), except for network latency.
[19:28:55] yup
[19:28:58] restbase, too :)
[19:29:11] * gehel is focusing on service he knows a bit...
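[A hedged aside on observing the tiered routing described above from outside: these cache clusters generally expose an X-Cache diagnostic response header listing the cp* hosts a request passed through, so something like the following shows which frontend/backend DCs served a given request. The tile path is only an example URL; the exact header names/format may differ.]

```bash
# Fetch an example maps tile and print only the cache-related diagnostic headers;
# the cp* hostnames in X-Cache reveal which DC/layer handled the request.
curl -s -o /dev/null -D - 'https://maps.wikimedia.org/osm-intl/0/0/0.png' \
  | grep -iE '^(x-cache|x-varnish|age):'
```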
[19:29:24] but ES isn't directly behind public varnish, either
[19:29:38] Yep, that's the limitation...
[19:30:06] it's an internal service, and I guess as a starting point, we'd do active:active for such things via geodns directly, like we do for the front of cache clusters
[19:30:34] as in, define a new internal virtual service hostname that's not DC-specific, and configure it in wmnet zonefile to geodns-route to the 2x internal IPs for service in eqiad+codfw
[19:31:27] and for switching DC? We reconfigure DNS? Or is there an automatic switch in DNS if a service is unavailable?
[19:31:41] (at which point a consuming service like mediawiki uses the new virtual hostname, and thus MW-in-eqiad uses ES-in-eqiad, and MW-in-codfw uses ES-in-codfw. It will still effectively follow on with MW switching, but there will be zero work to do on the ES side when MW switches.)
[19:32:13] well we assume both are up under normal conditions. we can manually disable one or the other and the DNS server will flip all traffic to the remaining one.
[19:32:43] there's also monitoring built into the DNS server that can automatically fail things out, but that's always a little scary on the false-positive side.
[19:32:43] it is kind of a hack at the moment, but there is zero work to do on ES / Cirrus side to switch DC
[19:33:16] anyways back to the task at hand!
[19:33:21] yep
[19:33:39] LVS/pybal changes are scary, because sometimes it's fragile, and because last time I checked, new services require pybal restarts, too.
[19:33:39] so first : https://gerrit.wikimedia.org/r/#/c/268238/ ?
[19:34:25] I did add a service to LVS some weeks ago. I do remember it was not as straightforward as it could have been
[19:35:09] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2241038 (10Dzahn) >>! In T132521#2202415, @Chmarkine wrote: >>>! In T132521#2202254, @BBlack wro...
[19:36:31] gehel: let me amend that, it's missing another bit of the puzzle
[19:37:45] gehel: ok it's fixed up now
[19:38:38] so let's start with disabling puppet on the affected LVSes before we even merge, just to be on the safe side
[19:38:55] ok
[19:39:04] we are screwing with things now that are always a hair's width away from bringing down important big services that everyone will scream about
[19:39:22] you know how to make me feel comfortable..
[19:39:25] :)
[19:40:21] so lvs2002, 2005, 3002, 3004, 4002, 4004 ?
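[The "disable puppet on all of those" step discussed here would look roughly like the sketch below, reusing the salt targeting expression that appears just below; the exact disable command and reason string are an assumption (the later `puppet agent -t` output on lvs4004 shows a reason of 'activating new varnish for maps - T131880'), not copied from the log.]

```bash
# Assumed sketch: disable puppet on the affected LVS hosts before merging.
salt -v -t 10 -b 100 -E '^lvs(200[25]|[34]00[24])' \
    cmd.run 'puppet agent --disable "activating new varnish for maps - T131880"'

# ...then merge, and re-enable + run puppet + restart pybal host by host;
# a final blanket re-enable once everything is done would be:
salt -v -t 10 -b 100 -E '^lvs(200[25]|[34]00[24])' cmd.run 'puppet agent --enable'
```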
[19:40:49] from reading balancer.pp (but not entirely sure if it means what I think it means)
[19:40:59] yeah
[19:41:02] root@neodymium:~# salt -v -t 10 -b 100 -E '^lvs(200[25]|[34]00[24])' cmd.run id
[19:41:10] ^ I did that just to confirm salt regex works as expected
[19:41:32] so, disable puppet on all of those
[19:42:29] then merge -> puppet-merge the first change
[19:42:53] then we'll start with, say, lvs4004, and go enable->run puppet there and see how it looks (probably fine, will show some pybal config change and add a new public IP to the loopback)
[19:43:15] the higher-numbered of each pair is the backup, it's not active for traffic until the other fails
[19:43:23] so 4004 is a relatively-safe option
[19:43:54] but after puppet runs, pybal/LVS still won't be configuring the new service yet, pybal will need a service restart, and then we can confirm the config in 'ipvsadm -Ln'
[19:43:57] ok, let me just check if I disabled puppet correctly (still a bit new to salt)
[19:44:24] root@lvs4004:~# puppet agent -t
[19:44:24] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'activating new varnish for maps - T131880');
[19:44:24] T131880: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880
[19:44:28] seems like it works
[19:45:06] yep
[19:45:56] damn, jenkins not there yet... should I just V+2?
[19:46:23] seems really slow for Jenkins
[19:46:40] yes
[19:47:16] ok, merging
[19:47:32] done
[19:48:12] puppet run on lvs4004 ?
[19:48:41] yeah
[19:50:06] seems sane so far
[19:50:06] we have LV + pybal config changes
[19:50:30] well we have pybal configfile changes, and we have new loopback IPs that look correct
[19:50:46] what's lacking is pybal won't reload its primary config and create a brand-new service without a restart
[19:50:56] which you can see in 'ipvsadm -Ln', which shows the live LVS config
[19:51:17] and pybal restarts are not routine things, that's why puppet doesn't automate them
[19:51:46] I should have checked ipvsadm before to see what changed...
[19:51:49] when pybal is briefly down on restart (or not so briefly if config error?), it stops talking BGP to the routers. If the lvsNNNN in question is currently primary, that causes traffic to at least briefly flip over to the other LVS.
[19:51:56] nothing changed in ipvsadm yet
[19:52:26] on the backup it's not as big a deal, which is where we are now on 4004 (its primary is 4002)
[19:52:44] when it briefly stops BGP to the routers, it doesn't move any real traffic, which is still currently flowing to 4002
[19:53:20] so our next step on 4004 is 'service pybal restart', but please log that before, and then check ipvsadm -Ln output after
[19:53:32] I'm going to have to leave you for ~30' in about 5'...
[19:53:41] ok
[19:53:54] I'm still doing that restart and hoping it goes well
[19:54:01] ok
[19:55:35] seems ok, we now have new ipvs table entries for:
[19:55:36] TCP 198.35.26.113:443 sh
[19:55:40] restarted
[19:55:43] yep
[19:55:45] TCP [2620:0:863:ed1a::2:d]:80 wrr
[19:56:01] and the other protos (v4+v6/80+443)
[19:57:08] is there a way to test that?
[19:57:31] LVS still look a bit like black magic to me...
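[A small sketch of the before/after check being described just above (not verbatim from the log): snapshot the live IPVS table around the pybal restart so the new maps service entries are easy to spot.]

```bash
# On the LVS host being restarted (e.g. lvs4004): capture the live LVS state,
# restart pybal, then diff to see exactly which virtual services were added.
ipvsadm -Ln > /tmp/ipvs.before
service pybal restart
sleep 10                      # give pybal a moment to come back up and program IPVS
ipvsadm -Ln > /tmp/ipvs.after
diff -u /tmp/ipvs.before /tmp/ipvs.after
```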
[19:58:07] there's no really good way to test things yet, but we don't really care if maps' new IPs are working in ulsfo at this point
[19:58:16] the main thing is that we don't break the other services there, like upload.wm.o
[19:58:28] after the primary is restarted too, we can test the new maps IP manually
[19:58:56] do you want to run through the puppet -> pybal restart on 4002 before you go, or too late?
[19:59:10] and a way to test that upload.wm.o is still working? Just reading the conf and hoping we read it correctly?
[19:59:20] let me do 4002
[19:59:35] mostly that the ipvsadm before/after picture didn't change much for upload's IPs
[20:00:01] sorry, I'm going to let you do 4002
[20:00:03] and when the primary one's pybal restart executes, our primary test that upload.wm.o still works is that we're not getting spammed in icinga, and paged on all our phones, and users swarming in with complaints, etc
[20:00:12] I'll be back as soon as I can...
[20:00:16] ok, cya when you get back
[20:00:23] thanks again !
[20:08:55] 10Traffic, 06Operations, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2241161 (10BBlack)
[20:09:58] so now that certificates have "no cost" (if I understand it)
[20:10:03] can we come up with subdomains ? :-D
[20:10:23] like www.ci.wikimedia.org jenkins01.ci.wikimedia.org etc ?
[20:10:42] please no :P
[20:11:05] the no cost is money, not ops pain :)
[20:11:27] so zero capex but still a good share of opex
[20:11:28] :D
[20:13:54] subdomains are a PITA, there has to be a good reason for them, basically
[20:14:10] was asking for the sake of it
[20:14:31] another related question is does our misc cache support routing URL paths to different backends ?
[20:14:40] usually the best reason for a subdomain is that you're crossing a real administrative-control barrier with a completely different DNS server administered by different people.
[20:14:58] like https://integration.wikimedia.org/jenkins01 https://integration.wikimedia.org/jenkins02 https://integration.wikimedia.org/somethingelse
[20:14:59] such as we do with delegating corp.wm.o -> corp OIT
[20:15:05] with each three served on different hosts?
[20:15:16] yup
[20:15:46] hashar: it's possible, but we don't have any cases like that today as examples
[20:16:04] there's some minor VCL infrastructure to plumb for it, to template that out in generic terms so we can manage the config as data.
[20:16:55] instead of the erb / ruby / vcl spaghetti code?
[20:17:06] (sorry not meant to be offending, it's just very hard to follow / get what is going on in our vcl.erb templates)
[20:17:26] well, it's just more-abstracted erb/ruby/vcl code. but we're better with an abstraction than a bunch of one-off stanzas for specific services in the VCL directly
[20:17:34] e.g. right now we have this in the manifest: https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/cache/misc.pp#L36
[20:18:02] and this in the backend VCL: https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/misc-backend.inc.vcl.erb#L5
[20:18:32] the comment is threatening !
[20:18:37] ".and for sanity's sake, there should be no overlap among them" :D
[20:19:02] the two together mean we can define new cache_misc services in the data in the manifest, and give them an attribute like req_host => integration.wikimedia.org
[20:19:12] we just need to extend that to include path regexes as well
[20:19:32] yeah that misc-backend ruby code is very intimidating. At first glance I can tell it generates the routing rule based on the hash. But that seems super hard to hack
[20:20:01] well it's fundamentally a hard problem we're facing here, there's not much fixing for that.
[20:20:30] but we're still better with everyday things like "add a new service foo mapped through cache_misc in the following way" being data updates and not VCL code/template updates
[20:20:35] so it's better to abstract it that way
[20:20:42] yeah definitely
[20:21:03] the $app_directors hash is straightforward and nicely abstracts all the low level logic
[20:21:09] which is nice
[20:21:20] yeah I'm hoping to continue expanding on that to remove a lot of one-off custom VCL
[20:21:42] the ticket for the long-term view is at https://phabricator.wikimedia.org/T110717
[20:21:55] right now only cache_misc has that at all, and it's a very minimal version of it so far
[20:23:19] neat. I am copy pasting all the above to a doc and will think again ;)
[20:23:25] thanks !
[20:23:41] and kudos for letsencrypt!
[20:24:47] :)
[20:26:08] back on the cache_maps topic, for gehel's backlog-reading:
[20:26:39] lvs4002 was successful, and the way to test that is to look up maps-lb.ulsfo.wikimedia.org's IP ( 198.35.26.113 )
[20:26:53] and do this or equivalent:
[20:26:54] curl -sv https://maps.wikimedia.org/ --resolve maps.wikimedia.org:443:198.35.26.113
[20:27:13] which is basically bypassing our DNS (which doesn't yet hand out that IP), and checking it manually
[20:27:31] I'm starting on codfw now
[20:31:29] Thanks, I still need about 15'
[20:34:25] codfw checks out ok, moving on to esams
[20:41:24] esams seems ok too
[20:43:00] last little step is: https://gerrit.wikimedia.org/r/#/c/268240
[20:43:47] which is the dns repo, so s/puppet-merge on palladium/authdns-update on radon/ (really any authdns server will do. radon happens to be ns0)
[20:44:57] gehel: waiting on you for that^ one, no rush
[20:47:34] I'm back!
[20:47:54] so we have working service IPs at all the sites now
[20:48:00] just missing the DNS bit to send users to them
[20:48:52] with any other major service we might have concerns about cache warmup, etc
[20:49:27] how would we do cache warmup ? Just some script to fetch a bunch of URLs ?
[20:49:35] but maps is low traffic and that's way more trouble than it's worth, and the cache misses will mostly be inter-cache anyways, as the codfw backends have been caching things since yesterday for users, and thus all the new cache misses only fetch as far as there in the common case, not the applayer
[20:49:41] Or do we have a way to send traffic progressively?
[20:50:03] you could do that, if you had a list of the top-1000 URLs or something
[20:50:27] in the rare case this is an issue on a major cluster in the past, we've remapped geoip to only bring in a few small countries first, things like that
[20:50:48] so, let me rebase and merge https://gerrit.wikimedia.org/r/#/c/268240/
[20:50:54] yup
[20:55:14] sorry for the delay, got a merge conflict...
[20:55:21] on mine?
[20:55:50] I think so, there was a modification on mobile. I think I messed up the merge...
[20:55:57] lemme review that...
[20:57:05] yeah it's not right
[20:58:09] the m line should be text-addrs, not mobile-addrs
[20:58:17] basically, don't touch the non-maps lines :)
[20:59:05] eep actually it's all kinds of messed up. gerrit makes that not so obvious
[20:59:53] I can fix the rebase if you want
[21:00:18] I got it, just give me 30 more seconds...
[21:00:24] ok
[21:01:34] Ok, don't know what I did for the first merge... Can you check I did not screw up my fix?
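[Stepping back to the per-site verification described earlier in this exchange (the curl --resolve test bblack ran for ulsfo, then codfw and esams): a rough sketch of repeating it for all three new sites, assuming the per-site service hostnames follow the maps-lb.<site>.wikimedia.org pattern seen above for ulsfo.]

```bash
# For each site, look up its maps-lb service address and hit it directly with the
# production hostname/SNI, bypassing whatever public DNS currently hands out.
for site in ulsfo codfw esams; do
    ip=$(dig +short "maps-lb.${site}.wikimedia.org" A | head -n 1)
    echo "== ${site} (${ip}) =="
    curl -s -o /dev/null -w '%{http_code}\n' \
        --resolve "maps.wikimedia.org:443:${ip}" \
        'https://maps.wikimedia.org/'
done
```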
[21:02:03] yeah it's good now on PS7
[21:03:13] Ok, so merge it? Anything special to deploy it?
[21:04:26] yes merge, and yes special
[21:04:46] On a not completely unrelated note: I still see the old varnishes in the Maps caches eqiad group in Ganglia ...
[21:04:49] you have to ssh to any one of the 3x authdns servers and type "authdns-update" as root
[21:05:10] yeah ganglia's always a trailing problem with any kind of changes
[21:05:15] low-prio I guess
[21:05:40] eeden.w.o ?
[21:05:50] yeah that works, that's the one in esams
[21:07:42] update running
[21:07:50] Done
[21:08:04] nice
[21:08:22] as the 10 minute TTL expires, we should start seeing what little traffic there is move into the other DCs
[21:08:42] oh actually not 10 minutes, 1 hour
[21:08:47] (was the old TTL)
[21:09:34] so, nothing really left to do here except close up related tickets and wait
[21:09:35] do we have a dashboard of traffic / cache cluster?
[21:09:42] yes
[21:10:05] well we have lots of different ways to look at some of the same data
[21:10:15] for exploring while making changes and such, I tend to prefer this one: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[21:10:56] great!
[21:11:02] you can drop-down select for maps cluster, and then look at eqiad vs non-eqiad DCs, etc
[21:12:27] ok, so not much traffic yet, but there was not all that much traffic on codfw to start with...
[21:12:32] LVS's own monitoring requests are a big chunk of current maps traffic normally, though, so there's a significant baseline
[21:12:35] s/codfw/eqiad/
[21:12:55] that's why you can see it rise in two stages. the first jump at the new DCs is when pybal came in and started monitoring them
[21:13:53] 10Traffic, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2241394 (10BBlack) 05Open>03Resolved
[21:14:18] 10Traffic, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542014 (10BBlack)
[21:14:22] 10Traffic, 06Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2241395 (10BBlack) 05Open>03Resolved
[21:14:36] 10Traffic, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2241397 (10Gehel) Varnish Maps cluster now fully configured, some traffic can already be seen on https://grafana.wikimedia.org/dashboard/db/varnish-a...
[21:15:39] \O/
[21:15:54] bblack: Well, thanks a lot for taking the time! I'm sure you'd have done it twice as fast without me !
[21:16:27] it's always a good experience. it's easier to overlook how retarded some things are when you're just wading through your usual routines alone.
[21:16:35] when you have to explain them, it's a whole different thing :)
[21:16:59] yeah, you learn a lot by teaching!
[21:17:31] I don't think anyone's really made any kind of meta-task for something like "give maps service full 'production' status" or something like that
[21:17:58] I haven't seen it...
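[Going back to the authdns-update step above: one way to confirm the new geodns answers are being served, before recursive caches' old 1-hour TTL expires, is to query the authoritative servers directly. This is a sketch, not from the log; ns0 is confirmed above ("radon happens to be ns0"), ns1/ns2 are assumed to follow the same naming, and a single vantage point only sees the geodns answer for its own location.]

```bash
# Ask each authoritative nameserver for the maps record directly, bypassing
# any recursor that may still be caching the pre-change answer.
for ns in ns0.wikimedia.org ns1.wikimedia.org ns2.wikimedia.org; do
    echo "== ${ns} =="
    dig +short "@${ns}" maps.wikimedia.org A
done
```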
[21:18:27] which would've parented the stuff we just finished up, and also some task for configuring up the software and getting it running on the 2x4 kartotherian servers currently being ordered in codfw+eqiad, which would parent their procurement tickets
[21:18:47] and then there's some other little bits I'm sure. for one, actually monitoring the service with full alerting and all that
[21:19:05] cache_maps doesn't have monitoring/paging currently, it's disabled because it's non-production and we don't want it waking us up
[21:19:24] I'm sure there's some equivalent backend monitoring to turn on, too
[21:20:09] someone should organize that up in phab so we have a more-concrete idea of when "maps is fully production-ified"
[21:21:16] I'll create the task and ask around if anyone has ideas to put in it
[21:21:42] Just time for a smoke before my next (and last) meeting. Again, thanks a lot for taking the time!
[21:25:18] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2241409 (10ori) >>! In T96848#2231806, @BBlack wrote: > If you have time and want to do it (next week!), by all means go for it, I have lots else to keep me busy indefinitely :) My ba...
[21:25:52] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2241411 (10BBlack) Ok, np!
[21:36:21] 10Traffic, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2241433 (10Gehel)
[22:00:35] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2241465 (10BBlack) >>! In T132521#2241038, @Dzahn wrote: >>>! In T132521#2202415, @Chmarkine wro...