[07:39:10] 10Wikimedia-Apache-configuration, 06Labs, 10wikitech.wikimedia.org: Have a short link to the signup page - https://phabricator.wikimedia.org/T48175#2234264 (10Qgil) It's better, indeed. Maybe this is not as big of an issue as I thought it was back in the day. Feel free to decline.
[10:46:26] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments-General: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234606 (10SindyM3) The solution is: Lemonbit will create a so-called CSR, a certificate request, for wikilovesmonuments.org. You can send this CSR to...
[11:01:03] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments-General: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234629 (10Akoopal) Quick translation: Lemonbit will create a CSR, a certificate request, for wikilovesmonuments.org. This CSR can be sent to the mai...
[12:14:00] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments-General: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234742 (10SindyM3) @Akoopal Thank you!
[13:33:11] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments-General: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234931 (10JanZerebecki) That would work, but there is an easier solution: Use https://letsencrypt.org/ . It should be self-service for the server admin.
[13:43:06] 10Traffic, 06Operations, 10RESTBase, 06Services, 03Mobile-Content-Service: Varnish not purging RESTBase URIs - https://phabricator.wikimedia.org/T127370#2234977 (10fgiunchedi)
[13:43:34] 10Wikimedia-Apache-configuration, 06Labs, 10wikitech.wikimedia.org: Have a short link to the signup page - https://phabricator.wikimedia.org/T48175#2234978 (10Krenair) 05Open>03declined Okay. Let's not start creating custom apache configs for specific mediawiki pages on a per-site basis, especially just...
[14:51:42] 10Traffic, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 06Operations, 13Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1287413 (10fgiunchedi) looks like this was waited for until the fundraising was over, can it be resumed now? also w...
[14:54:38] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709#2235231 (10faidon) I'm not sure I see this as a problem, although having them in VCS would definitely be nice. Regardless, t...
[15:04:50] 10Traffic, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 06Operations, 13Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1287413 (10BBlack) It's mostly been blocked on lack of anyone having time to work on it, too. At this point, it's...
[15:05:25] 10Traffic, 07Varnish, 06Operations: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503#2235279 (10BBlack)
[15:05:27] 10Traffic, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 06Operations, 13Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#2235278 (10BBlack)
[15:19:58] 10Traffic, 06Operations, 13Patch-For-Review: Update prod custom varnish package for upstream 3.0.7 + deploy - https://phabricator.wikimedia.org/T96846#2235358 (10BBlack) 05stalled>03declined At this point we're not investing further in varnish3 except for security fixes, so there's no point pursuing a re...
[15:23:01] 07HTTPS, 10Traffic, 06Operations: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack)
[15:23:17] 07HTTPS, 10Traffic, 06Operations: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235389 (10BBlack)
[15:23:19] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2235390 (10BBlack)
[15:23:26] 07HTTPS, 10Traffic, 06Operations: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack)
[15:23:28] 10Traffic, 06Operations, 06WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#2235392 (10BBlack)
[15:25:47] 10Traffic, 06Operations, 06WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#2235431 (10BBlack) >>! In T101048#1826732, @Platonides wrote: > Is there any technical reason not to have the servers using 700 certificates, using SNI fo...
[15:28:56] 07HTTPS, 10Traffic, 06Operations: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2235476 (10BBlack)
[15:29:58] 07HTTPS, 10Traffic, 06Operations: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack)
[15:30:00] 07HTTPS, 10Traffic, 06Operations: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack)
[15:30:27] 10Traffic, 06Operations, 06WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#2235483 (10BBlack)
[15:30:29] 07HTTPS, 10Traffic, 06Operations: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack)
[18:14:49] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980389 (10fgiunchedi) anything to be done on this task? related patch has been abandoned
[18:31:10] 10Traffic, 06Operations, 10Parsoid, 10RESTBase, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2236267 (10Pchelolo) 05Open>03Resolved a:03Pchelolo The new redirect handling code has been deployed in production.
[18:32:42] 10Traffic, 10DNS, 06Operations, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#2236275 (10BBlack) p:05Normal>03Low tbh I'm not sure, maybe leave it open as low-prio. It is technically a bug (in the responses of our pdns-b...
[19:03:15] hi :)
[19:03:18] bblack: here I am...
[19:03:58] ok, so bringing myself back up to speed (I haven't looked at any of this since it was planned out ~3 months ago), and bringing you up to speed at the same time, out loud :)
[19:04:23] the tracking ticket I was working off of back then was https://phabricator.wikimedia.org/T109162
[19:04:29] and the patches I prepped were these 4x:
[19:04:33] https://gerrit.wikimedia.org/r/#/c/268236/
[19:04:37] https://gerrit.wikimedia.org/r/#/c/268237/
[19:04:41] https://gerrit.wikimedia.org/r/#/c/268238/
[19:04:46] https://gerrit.wikimedia.org/r/#/c/268240/
[19:07:17] there's also a file /root/exmobile on neodymium which should list the 16x hosts involved here being re-roled into cache_maps
[19:07:20] * gehel reading all that...
[19:07:31] root@neodymium:~# cat exmobile
[19:07:32] cp2003.codfw.wmnet,cp2009.codfw.wmnet,cp2015.codfw.wmnet,cp2021.codfw.wmnet,cp1046.eqiad.wmnet,cp1047.eqiad.wmnet,cp1059.eqiad.wmnet,cp1060.eqiad.wmnet,cp3003.esams.wmnet,cp3004.esams.wmnet,cp3005.esams.wmnet,cp3006.esams.wmnet,cp4011.ulsfo.wmnet,cp4012.ulsfo.wmnet,cp4019.ulsfo.wmnet,cp4020.ulsfo.wmnet
[19:08:18] the human explanation is that all 16 of those are the new maps servers (4x per site, 4x sites). the current maps servers are cp1043 and cp1044 in eqiad (only).
[19:09:13] varnish servers are also IPSec endpoints? (reading the gerrit changes)
[19:09:16] the first patch re-roles all of those 16 cache servers into their new config correctly, and removes cp104[34] from the hieradata lists used mostly just for ipsec interconnections (of which there are none for maps today anyways)
[19:09:20] yes
[19:10:50] not entirely related to the task at hand, but what do we use IPSec for here? Communication between caches in different DCs? Or something more generic?
[19:11:13] after that, the 16 new hosts are running as a correct new cluster in the cache cluster sense, but DNS+LVS will still only be sending traffic to the 2x older hosts (DNS only looking at eqiad, and eqiad has 2x old + 4x new in confctl config, but the 4 new are depooled when first added, only the old two are pooled from before)
[19:11:36] gehel: yes, to secure user traffic flowing between DCs, which has no HTTPS because varnish doesn't implement HTTPS
[19:11:57] just let me know if I ask too many questions...
[19:12:11] so after merging the first patch and confirming all puppetized up and functional everywhere (well, especially in eqiad for the moment)
[19:12:49] then we do a manual pool-switching step for the eqiads: we pool in the 4x new machines in eqiad one at a time, and then depool the 2x older machines in eqiad. That shifts the traffic from "both old eqiad caches" to "all 4 new eqiad caches"
[19:13:11] at that point we can do the second patch, which removes the 2x older eqiad maps caches from conftool config completely
[19:14:01] looking at conftool-data/nodes/esams.yaml (patch #1) what exactly do we declare there?
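The pool-switching step described above boils down to a handful of confctl invocations from palladium. A minimal sketch, assuming the tag scheme and service names that appear later in this log (varnish-fe and nginx are the user-facing services; in practice the new hosts are pooled one at a time with checks in between):

    # pool the four new eqiad frontends, then depool the two old ones
    for n in cp1046 cp1047 cp1059 cp1060; do
      for s in varnish-fe nginx; do
        confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action set/pooled=yes $n.eqiad.wmnet
      done
    done
    for n in cp1043 cp1044; do
      for s in varnish-fe nginx; do
        confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action set/pooled=no $n.eqiad.wmnet
      done
    done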
[19:14:17] when we're sure the other 12 are puppeted correctly at the other DCs, we can go pool them all in confctl for the first time, and then apply the 3rd patch which turns on LVS services for them in the other 3x DCs for the first time (which I think requires pybal restarts, which are a bit manual, critical, and scary)
[19:15:10] when we're sure service is actually working with manual verification through the public IPs at the other 3 sites, we can do the 4th (DNS) patch, which starts doing geodns routing of traffic to all 4x sites instead of sending everyone to eqiad
[19:15:14] then we're basically done
[19:16:15] gehel: that esams.yaml diff in patch #1 adds the 4x new nodes in esams cache_maps to the conftool data, which is what populates the etcd live data that LVS/pybal use to know which nodes to pool up for which LVS services
[19:16:33] when you add new nodes in the config and they're first added to the live etcd data, they're added in a depooled state
[19:16:45] pool/depool/weight-change commands are runtime CLI stuff we execute from e.g. palladium
[19:18:42] and what is that list of services? conftool is aware of the frontend/backend communication? I was expecting varnish probes at that level.
[19:19:24] varnish does probe for healthcheck, but this is higher-level administrative pool/depool from the set of backends varnish can see and healthcheck
[19:19:50] the 4x services listed there: [varnish-fe, varnish-be, varnish-be-rand, nginx]
[19:20:10] varnish-fe and nginx are LVS services - front-edge user traffic to one of our DNS IPs for the service, which LVS/pybal map into the cache frontends
[19:20:31] varnish-be and varnish-be-rand control inter-cache traffic
[19:20:51] (varnish frontends or other varnish backends talking to a certain pool of varnish backends)
[19:21:13] I remember varnish-fe / varnish-be from your presentation, I can't remember varnish-be-rand...
[19:21:30] varnish-be-rand is identical to varnish-be, it should always be the same list of hosts
[19:22:03] the distinction is the varnish->varnish traffic over the 'varnish-be' set is consistent-hashed on URL, and via 'varnish-be-rand' it's sent to random members of the set
[19:22:27] consistent-hash is better if it's a request that's known to be cacheable in the target backend, so that we spread our cache size over the whole set's storage
[19:22:45] but for things we think are probably uncacheable, we're better off randomizing that traffic so an uncacheable hotspot doesn't all hit one server
[19:23:00] makes sense...
[19:23:42] last question (for the moment). I'm going to show my ignorance, but I have never seen a DYNA record type in DNS. Something magic?
[19:24:00] anyways, before patch #1, we should disable puppet on cp1043 and cp1044, as we want them to just keep doing what they're doing right now until we're done, and not get broken by these changes as we move off of them.
[19:24:10] gehel: yes, it's something magic :)
[19:24:43] DYNA isn't a real record-type from the DNS RFC, it's a special type used by gdnsd (our authdns servers) to do geographic routing of A/AAAA data
[19:24:51] I love magic, but I'm scared of it :-)
[19:25:19] there's a file "config-geo" in the root of our operations/dns repo, which maps out the world's geography to our 4x DCs, and then for each service (like cache_maps) assigns distinct IPs for the 4x DCs
[19:25:34] Just found the doc...
[19:25:50] So should I start by disabling Puppet on cp1043/cp1044 ?
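Since the geodns behaviour comes up again just below: a DYNA record hands back different A/AAAA answers depending on where the query appears to come from. A sketch of how to see what the authoritative servers return, assuming ns0.wikimedia.org as one of the authdns hosts (until the final DNS patch in this plan is merged, maps.wikimedia.org should still answer with eqiad addresses only):

    dig +short maps.wikimedia.org A @ns0.wikimedia.org
    dig +short maps.wikimedia.org AAAA @ns0.wikimedia.org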
[19:25:50] if you're closer to esams, when you lookup a hostname that says in the zonefile 'foo DYNA geoip!maps-addrs', it's going to send you to the esams IPs
[19:25:53] etc
[19:26:01] gehel: yup, whenever you're ready!
[19:26:09] !log in operations channel too I guess
[19:26:09] Not expecting to hear !log here
[19:26:20] we don't have !log set up here, that's a meta-task that should happen
[19:27:41] (also btw, I managed to get out of my upcoming obligations on the hour, so I'm probably good until whenever you need to go)
[19:28:54] gehel: note: after disabling puppet there, make sure it's not still already running (e.g. 'ps -ef|grep puppet')
[19:29:22] puppet disabled and not running
[19:29:23] sometimes if you rush through 'puppet agent --disable' -> 'merge a new affecting patch', you can actually change the results of an already-running client invocation :/
[19:29:43] we'll see when my eyes start to close or my brain shuts down...
[19:29:48] so next step: merge the first patch: https://gerrit.wikimedia.org/r/#/c/268236
[19:30:27] you already rebased? Thanks!
[19:30:44] yeah just making sure there weren't any scary conflicts with my own other recent work before I threw it at you :)
[19:31:01] patch #1 merged
[19:31:03] ok
[19:31:16] so on neodymium, salt the new machines with puppet, which is like:
[19:31:21] running puppet-merge on palladium
[19:31:42] salt -v -t 10 -b 100 -L `cat /root/exmobile` cmd.run 'puppet agent -t'
[19:31:47] (after puppet-merge)
[19:32:00] I fully expect some of those to fail their first puppet runs btw
[19:32:19] on cache re-role, there are probably some dependency-graph issues on re-puppeting and such. we can fix it by doing a few runs until they come up clean
[19:33:01] I see the conftool updates post `puppet-merge`, waiting for those to finish
[19:33:06] yeah
[19:33:18] make sure it doesn't spit out any errors, too, sometimes it fails on just one or a few changes
[19:33:26] in which case you just need to re-run conftool-merge
[19:33:46] lots of warnings, but seems to just be about new services being pooled=no, just like you said
[19:34:17] before doing the salted puppet
[19:34:22] should confirm the eqiad pool state too...
[19:35:32] for s in nginx varnish-fe varnish-be varnish-be-rand; do echo === $s ===; confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action get 're:.*'; done
[19:35:42] ^I just ran that too, but it's interesting to run/look anyways
[19:36:24] there's a minor weight problem, but we can deal with it after puppets, before pooling things in
[19:36:48] did you take that command out of some doc somewhere?
[19:36:55] (that confctl loop is on palladium)
[19:37:01] no I just made it up
[19:37:31] jo.e is already working on better syntax for confctl, so that we can do that more-elegantly than that ugly for-loop :) I think he already has an unmerged test patch for it.
[19:38:03] * gehel starts taking some notes...
[19:38:52] let me know how the salted puppet from neodymium works out
[19:39:10] I'll probably follow one of them manually in syslog too so I can see what the initial failures look like
[19:39:41] salt running, no feedback yet
[19:40:46] what's the weight issue you were talking about?
[19:41:32] salt finished, looks all green... checking
[19:42:06] yeah salt won't really tell you based on green-ness, I see cp1060 that I was following failed some puppet stuff
[19:42:45] Oh yeah, there are quite a few warnings in green. confusing...
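The disable-and-verify step from earlier (disable puppet on the old caches, then make sure no agent run is still in flight) can also be done in one pass from the salt master; a sketch, with a made-up disable message:

    salt -v -L 'cp1043.eqiad.wmnet,cp1044.eqiad.wmnet' cmd.run \
      'puppet agent --disable "cache_maps re-role in progress"; ps -ef | grep "[p]uppet"'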
[19:42:48] so I think we need a fixup, because of other things that have happened since those patches were prepped
[19:42:51] just a sec
[19:43:10] the VCL failed to load in some of the puppet output
[19:43:33] I think it's because of the cache_route::table stuff, there's a map exception in that stuff because it was eqiad-only at the time that happened, after those maps patches were made.
[19:45:50] fixup: https://gerrit.wikimedia.org/r/285241
[19:46:01] I was still checking varnishtest...
[19:46:33] let's puppet-merge that, and then go for another salted puppet round
[19:46:41] I'll trust you on that one and merge ...
[19:46:53] there are also known race-conditions on conftool data changes + VCL puppetization when making pool changes like these, sometimes that just takes 2x puppet runs in general
[19:47:00] but without the fix above it's never going to work on all hosts
[19:47:42] ok, waiting on Jenkins...
[19:47:58] for that one, I'd just V+2 and go with it
[19:48:59] I think I found one more problem with the first patch, hang on a sec
[19:49:11] salt already running, sorry
[19:51:51] it's ok
[19:52:12] so there's a second fallout, also related to the lengthy delay between when these patches were originally written and all the changes that happened since
[19:52:44] https://gerrit.wikimedia.org/r/#/c/285245/
[19:52:56] site.pp was updated for the correct new nodes to use, but the hieradata/conftool stuff wasn't
[19:53:04] (in patch #1)
[19:53:06] so that fixes up that part
[19:53:35] lemme see / merge ...
[19:55:41] basically back when these patches first went in, the esams part of the plan used cp3015-18, but later the plan was switched to cp3003-6, but the patch was only partly updated to reflect that
[19:55:49] conftool errors post puppet-merge...
[19:55:53] (so that's my fault again heh, but I blame the delays!)
[19:56:05] ERROR:conftool:load_node Backend error while loading node: Backend error: The request requires user authentication : Insufficient credentials
[19:56:22] https://phabricator.wikimedia.org/T125485 is when the nodes in esams changed
[19:57:03] gehel: how did you run the puppet-merge, exactly? it may require a full root shell rather than running under sudo?
[19:57:19] if not that, I'd guess random failure if it was just 1/N failing like that
[19:57:32] sudo puppet-merge
[19:57:42] yeah do a "sudo i" to get a login shell
[19:57:48] probably needs a sudo -E
[19:57:56] then re-run the last part that was invoked by puppet-merge, which is "conftool-merge"
[19:58:07] sorry I meant "sudo -i" above
[19:58:19] sudo -E is the opposite of what we want
[19:58:51] Ok, not my env, but standard root env...
[19:58:54] right
[19:59:26] I have some religious issues with having a root shell opened...
[19:59:40] anyways, the puppet-merge handles sudo issues fine, just conftool-merge doesn't
[19:59:44] ok, conftool-merge successful
[20:00:07] ok, now try the salted puppet again, see if things finally clean up...
[20:00:16] esams in particular may still need 2 runs after that last merge before it does
[20:00:18] lemme check the list in /root/exmobile...
[20:00:25] that one's fixed
[20:00:36] (was already correct before we started, I mean)
[20:00:38] ok, running salt...
[20:01:17] sorry, this got a lot more complicated than expected, with those two issues in the first patch
[20:01:31] I guess expecting the unexpected should be my expectation, though! :)
[20:01:36] complicated is good! That's how I learn stuff!
[20:01:50] :)
[20:02:16] I still get failed dependencies, lemme look at the logs...
[20:02:28] hopefully mostly on esams, looking a bit too
[20:03:26] I'm looking at a failed VCL reload on eqiad
[20:04:46] yeah it's missing codfw directors, which is probably related to the same stuff as fixup #1
[20:06:09] oh, it's probably that the other sites are depooled still, including codfw that others need backend defs for
[20:06:43] ^ I did not understand that...
[20:07:03] How did you find that the error is with codfw directors?
[20:07:56] the file /etc/varnish/directors.backend.vcl on one of the eqiad nodes (probably similar elsewhere)
[20:08:05] its stanzas have no hosts in them and such
[20:08:26] the actual VCL error doesn't help much, it says one filename and points at a line number from something else, or something
[20:08:35] anyways
[20:08:55] varnishtest output is almost as bad as puppet logs... :P
[20:08:55] also, while we could fix that with confctl commands alone, let's fix the route_table for maps while we're at it before we fix this
[20:09:12] maps is still "special" in that it only has applayer services at codfw and not eqiad
[20:09:32] so long as cp104[34] remain puppet-disabled, we can fix that now for the rest without affecting the running service on the old ones
[20:10:33] that's a question I meant to ask at some point: why do we have Maps app servers only in codfw...
[20:11:16] because cache_maps was set up poorly in the past, and is now finally getting real hardware allocations
[20:11:55] we gave it some leftover old hardware that was available to beta-test on, which happened to be 2x junky old caches in eqiad, and 4x of some kind of re-usable node in codfw for the applayer
[20:12:13] we're fixing the cache part now with a real allocation, and they've ordered new hardware for the applayer in eqiad+codfw, but it's not in place yet
[20:12:27] merging https://gerrit.wikimedia.org/r/#/c/285250/ ?
[20:12:31] yeah
[20:13:15] so let's puppet that now, which will almost certainly still fail puppet in some ways
[20:13:18] merged, running puppet ...
[20:13:30] and then after that's done, I'll fix confctl pooling
[20:13:41] you have no faith!
[20:14:22] confctl pooling? The weight issue you were talking about earlier? I'm still not sure I understood what that is...
[20:15:00] no, not the weight issue. sorry just more over-complications in all of this. we don't normally bootstrap whole new cache clusters often, especially half-configured ones with old outdated patches, etc...
[20:15:32] because the varnish-be and varnish-be-rand conftool data is all pooled=no at 3 of the sites, that screws up the VCL definitions for the backends at other sites
[20:15:49] the fixup is to pool those now:
[20:15:51] for d in codfw esams ulsfo eqiad; do for s in varnish-be varnish-be-rand; do echo === $d/$s ===; confctl --tags dc=$d,cluster=cache_maps,service=$s --action set/pooled=yes 're:.*'; done; done
[20:16:08] that pools up inter-cache backend defs and such
[20:16:40] in practice it won't affect cp104[34] because they haven't been puppeting since all this started. worst case they would pick up an invalid VCL change and fail to reload it and keep chugging
[20:16:46] I understand most of that... just the 're:.*' part, I'm not really sure what it matches
[20:17:00] it matches the regex '.*' on the hostnames within that dc+cluster+service
[20:17:01] I still got errors in puppet...
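For reference, the final argument to confctl selects nodes within the given dc/cluster/service tags, either as an exact hostname or as a regex with the 're:' prefix, as in the loop above; a couple of illustrative get calls:

    # every node in the esams cache_maps backend pool
    confctl --tags dc=esams,cluster=cache_maps,service=varnish-be --action get 're:.*'
    # just one node
    confctl --tags dc=esams,cluster=cache_maps,service=varnish-be --action get cp3003.esams.wmnet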
[20:17:08] yeah errors are still expected
[20:17:29] after the confctl loop above, puppet will start clearing up
[20:17:41] (although again, due to confctl race conditions that are known-issues, it may still take 2x runs of puppet)
[20:18:08] ok, let me run that double loop on neodymium
[20:18:14] ok
[20:20:08] I'm also re-running puppet agent now on kafka10* to help clear up ipsec host list issues causing alerts on them, that were fixed up in one of the fixup commits earlier
[20:20:12] no confctl on neodymium... running on palladium (we have too many nodes!)
[20:20:27] heh :)
[20:20:39] yeah puppet-merge + confctl on palladium, salt on neodymium
[20:21:27] Yes, I am sure of what I am doing. (kind of)
[20:21:33] :)
[20:22:13] done
[20:22:31] running puppet again (via salt)?
[20:22:51] yeah
[20:23:12] after that confctl, it will probably take 2x salted runs of puppet before they all clean up their act
[20:23:15] if no more issues heh
[20:23:44] let's see
[20:23:48] I'm following along in syslog on cp1060 to see what it does
[20:24:31] ok cp1060 failed again
[20:24:52] still waiting on salt to get back to me...
[20:25:52] heh
[20:26:03] so many problems with this convoluted re-mapping of clusters...
[20:26:10] but this time the VCL check passed, but varnish start failed?
[20:26:17] I still see one more problem, which will probably only affect esams+eqiad hosts in puppet
[20:26:48] the problem is cp104[34] are not defined as backends, even though confctl still knows about them
[20:26:51] I see at least 1 codfw host failing
[20:26:57] cp2015.codfw.wmnet
[20:27:07] yeah probably affects that too
[20:27:10] so anyways
[20:27:32] the tricky part is, we could fix this by depooling cp104[34] in confctl, but then that will runtime-affect the running service on cp104[34] and break them
[20:27:35] so...
[20:28:10] let's do 'service confd stop' on cp104[34] before continuing, so they're cut off from confctl updates, too
[20:28:59] (you, not me)
[20:29:31] that will prevent them from seeing future pool/depool work we do with confctl, to disable cp104[34] themselves
[20:29:49] eventually we will route around them at the LVS layer when everything else is working out ok
[20:29:59] confd stopped for cp104[34]
[20:30:04] ok so now
[20:30:54] for n in cp1043 cp1044; for s in varnish-be varnish-be-rand; confctl --tags dc=eqiad,cluster=cache_maps,service=$s set/pooled=no $n.eqiad.wmnet; done; done
[20:31:13] that should get those nodes out of the etcd data that all the other nodes are pulling in, which is messing up their VCL loads still
[20:31:43] (they don't have the matching VCL node defs needed for those 2x hosts, since they've run puppet since we pulled them out of hieradata)
[20:32:23] I think I missed some "do" above heh, what I get for typing commands into IRC!
[20:32:36] yep
[20:32:36] for n in cp1043 cp1044; do for s in varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_maps,service=$s set/pooled=no $n.eqiad.wmnet; done; done
[20:33:23] and --action set/pooled=no ?
[20:33:33] oh, yes, that :)
[20:33:56] for n in cp1043 cp1044; do for s in varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action set/pooled=no $n.eqiad.wmnet; done; done
[20:33:57] see now you're doing better than me, I can retire in peace
[20:34:14] done
[20:34:28] so one more time with the salted puppet, maybe two more times, if no further issues...
[20:34:36] I can fix syntax errors, that does not mean I have the semantics correct...
[20:34:44] :P
[20:35:21] I hope it works this time... my brain is slowing down...
[20:35:58] I see icinga recovering ...
[20:36:32] yep, starting to look good!
[20:40:39] 5 minute break to switch laptop... brb
[20:40:44] and get a coffee...
[20:41:14] ok
[20:41:27] the remaining errors in icinga right now seem reasonable-ish for the state we're supposed to be in
[20:41:32] FWIW
[20:42:08] (traffic-pool not working on the new hosts, confd dead on the old)
[20:43:55] I'm running another salted puppet now just to be sure things are stable
[20:45:59] ok, I'm back...
[20:49:52] ok me too
[20:50:16] so basically the next bit is to confirm maps traffic through the new eqiad nodes works manually, before users end up there
[20:52:10] ok, let me find a test URL... (I have not actually worked all that much with maps yet...)
[20:53:20] well we can use the root page, https://maps.wikimedia.org/
[20:53:46] so if you curl that from the outside world, you can see what the public response to that looks like
[20:54:01] bblack-mba:~ bblack$ curl -v https://maps.wikimedia.org/
[20:54:11] 200 OK + < Cache-Control: public, max-age=0
[20:54:17] < X-Cache: cp1044 miss(0), cp1043 frontend miss(0)
[20:54:21] (old servers)
[20:54:25] and some html content there
[20:54:58] if we go on the new eqiad machines, since they actually have the public IP for that mapped to their loopbacks, if you fetch that from them, they'll hit themselves
[20:55:11] root@cp1060:~# curl -v https://maps.wikimedia.org/
[20:55:21] seems to work on cp1060
[20:55:25] and we see things look about the same, but:
[20:55:26] < X-Cache: cp2009 miss(0), cp1046 miss(0), cp1060 frontend miss(0)
[20:55:35] which means it's flowing through all new nodes, not old
[20:55:47] and that it's obeying cache::route_table (eqiad -> codfw -> direct)
[20:56:34] so if that holds on all 4 of the new machines, we know it's valid (from a correctness standpoint) to start sending users through the new eqiad machines
[20:56:53] we probably want to ramp them in a bit, so it's not a sudden cache miss for all requests on the fresh new caches
[20:57:09] with bigger services it's far more important. for this honestly we could probably live with it, it's in beta and lower-traffic.
[20:57:15] but it's good to go through the exercise anyways
[20:58:31] so this is the current pooling/weighting for the frontend services (nginx=443, varnish-fe=80) in eqiad:
[20:58:34] for s in varnish-fe nginx; do echo === $s ===; confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action get 're:.*'; done
[20:59:14] it happens that the default weighting is all weight:1, which is where we want to end up in the end. that the old ones happen to have weight:10 is historical cruft, but affects how this all blends up as we move users...
[21:00:21] so we'll have 10x less traffic on new servers, correct?
[21:00:26] so step one is to pool in the new nodes with light traffic. given the old ones are at weight:10, pooling them in at weight:1 should be sufficient (the sum weight is 24, 20 to old and 4 to new)
[21:00:36] so to pool them in at current weight:
[21:01:16] for s in varnish-fe nginx; do echo === $s ===; confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action set/pooled=yes 're:cp10(46|47|59|60)\.eqiad\.wmnet'; done
[21:01:29] (I think, I've not done complex regexes with this often before, we'll see how that goes)
[21:02:53] ok, pooling them in!
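For the later ramp-down of the old servers from weight:10 (mentioned again near the end of the session), the knob is the same confctl data. A sketch only, assuming weight can be set through confctl the same way pooled is (worth confirming against confctl's own help output first):

    # step cp1043/cp1044 down from weight:10 before depooling them entirely
    for s in varnish-fe nginx; do
      confctl --tags dc=eqiad,cluster=cache_maps,service=$s --action set/weight=5 're:cp104[34]\.eqiad\.wmnet'
    done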
[21:04:26] done
[21:05:13] so I'm watching with "varnishlog -n frontend |grep RxURL" just to kinda see reqs flowing, on cp1060
[21:05:20] I'm looking at the varnish HTTP errors grafana dashboard, anything else I should keep an eye on?
[21:05:39] maps traffic is small, so it's hard to even tell, but I do see real tile requests occasionally
[21:06:14] heh
[21:06:31] so we're good up to here, but I just noticed another little side-issue
[21:06:56] bblack-mba:hieradata bblack$ git grep varnish_version4
[21:06:56] hosts/cp1043.yaml:varnish_version4: true
[21:06:57] hosts/cp1044.yaml:varnish_version4: true
[21:07:08] we upgraded cache_maps to varnish4, but on a per-hostname basis :P
[21:07:20] so the "new" caches are running the older varnish (the VCL/config is compatible with both)
[21:07:32] should probably go ahead and fix that now
[21:08:16] ok, let's do that. Varnish version is pinned in Puppet ?
[21:10:21] I think ema puppetized everything, it's supposed to be a pretty smooth process
[21:10:29] but let me make sure there aren't commands that need running too
[21:10:37] pushing up the puppet part now, but wait
[21:13:07] I know he documented some stuff on switching back and forth for varnish3<->4, I think there were a few commands involved
[21:15:15] ok I see some stuff from his bash history heh
[21:15:25] let me play with cp1060 first for procedure
[21:16:56] * gehel is reading puppet code about cache and varnish...
[21:17:19] gehel: I disabled puppet on all the exmobile group, so we can merge that varnish4 hieradata thing, and then I can try the procedure on one node
[21:17:30] ok
[21:24:34] sorry, sorting this part out is getting complicated
[21:24:43] if you need/want to bail, it's perfectly ok :)
[21:25:35] the apt pinning part of the change doesn't seem to be taking proper effect...
[21:25:39] I still have some time, if you're ready to walk me through it, I'm happy to learn...
[21:26:01] I just found that code in puppet...
[21:26:50] yeah
[21:27:03] there's some missing puppetization bits in there somewhere on making the 3->4 switch I think
[21:28:04] I'm just starting to understand what's in there. I have no idea what is missing :P
[21:28:17] so here's what I think I've got it down to now
[21:28:38] I'm working on cp1060 and seeing if I can get it upgraded, I think I just did, but not yet re-pooled
[21:29:00] anyways, basically:
[21:29:27] 1. disable puppet (already done for all, will leave them like that and move through the rest on them one at a time)
[21:29:45] 2. echo deb http://apt.wikimedia.org/wikimedia jessie-wikimedia experimental > /etc/apt/sources.list.d/wikimedia-experimental.list
[21:30:04] (the packages are in the experimental repo and that sources.list entry isn't puppetized, so that's the really-missing bit that stopped me)
[21:30:28] 3. depool; service varnish-frontend stop; service varnish stop; apt-get -y remove libvarnishapi1; rm -f /srv/sd*/varnish*
[21:30:43] (depool node, stop varnishes, remove varnish packages, kill persistent storage)
[21:30:56] 4. apt-get update
[21:31:04] (to get the new varnish4 packages listed from experimental)
[21:31:12] 5. puppet agent --enable; puppet agent -t
[21:31:27] (may have to run the above a few times to get through dependency hell)
[21:31:31] 6. pool
[21:31:39] (when sure we can repool the new as working varnish4)
[21:32:13] ok, want me to try that on cp1059?
[21:32:15] I had a lot more false starts on cp1060, I'm going to try just the steps above on cp1059 after cp1060 is repooled
[21:34:27] Ok, let me know how that works...
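Since the apt pinning was the sticking point above, a couple of standard checks on a cache host show whether the experimental-repo change actually took effect and which varnish is really installed (plain apt/varnish commands, nothing WMF-specific):

    apt-cache policy varnish    # candidate version and pin priorities per repo
    varnishd -V                 # version of the currently installed varnishd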
[21:34:35] we're gonna have a bunch more to do heh :)
[21:36:44] I'm not here for all that much longer. I'm not trusting myself on prod servers after a certain time at night...
[21:36:47] so the procedure above works. first puppet run does a lot of things then fails, second puppet run does more things and succeeds. third puppet run does yet more things and succeeds. fourth puppet run is finally a no-op
[21:37:05] so to make it a little more compact:
[21:37:37] echo deb http://apt.wikimedia.org/wikimedia jessie-wikimedia experimental > /etc/apt/sources.list.d/wikimedia-experimental.list; depool; service varnish-frontend stop; service varnish stop; apt-get -y remove libvarnishapi1; rm -f /srv/sd*/varnish*; apt-get update; puppet agent --enable; puppet agent -t; puppet agent -t; puppet agent -t; puppet agent -t
[21:37:45] I love puppet...
[21:37:48] the 4th and final agent there should be the usual no-op
[21:37:55] want to try that on the next eqiad?
[21:37:58] pooling cp1059...
[21:38:37] 1046 or 1047 take your pick
[21:38:39] yep, lemme do that.
[21:38:44] I'll take 1046
[21:38:53] ok trying 1047 then
[21:39:15] doing them in // ?
[21:39:28] I guess load is not really an issue here ...
[21:39:36] yeah it's ok for this pair
[21:40:02] load isn't an issue, but in the other DCs we might temporarily break puppet again if we "depool" all 4x nodes at once, even though they have no traffic.
[21:40:16] 1046 depooled, I'll let you know when finished
[21:41:28] I got consistent results on 1047, 4th puppet run no-op
[21:41:59] well no green text anyways, there are always a few lines of white spam about tlsproxy, unrelated
[21:42:22] sorry, did not start yet, had to check planning with my better half...
[21:42:23] pooling cp1047
[21:43:15] basically with the varnish4 issue fixed in eqiad, next step is to fix it everywhere else the same way (step through the 4 hosts though to avoid breaking puppet worse)
[21:43:49] and then ramp down the weight:10 on the 2x old servers in eqiad for nginx+varnish-fe services (say cut it to 5 then 1), then actually remove them from service with pooled=no
[21:44:17] at that point it's merge the next patch (which just kills the 2x depooled old eqiad servers)
[21:44:59] Ok, it's going to be time to get some sleep pretty soon for me. I'm going to either bail at that point or do it tomorrow...
[21:45:10] then merge the 3rd patch (add definitions to other DCs' LVS) and pool all the nodes at the other DCs in confctl (all at once)
[21:45:34] then confirm manual traffic in other DCs via their public IPs, then merge the final (geodns) patch to actually send users to those
[21:45:51] that's the remainder in a nutshell, other than the unexpected, and some restarts of pybal daemons
[21:46:11] yeah I figure you're bailing soon, so figured I'd summarize the rest for you from where we are now
[21:46:37] summarized like this, it sounds so simple :P
[21:46:52] heh
[21:47:08] I don't like the excess complexity either. it's way better than it used to be. we'll get there someday :)
[21:47:45] but honestly, for one-off issues provisioning new cache clusters, it's hardly worth fixing them or even documenting them. new cache clusters are extremely rare events, and by the time you hit another one everything will have changed again anyways.
[21:48:31] 1046 done + repooled?
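Between the last puppet run and the 'pool' step in the recipe above, a quick sanity check that the host really came back on varnish 4 and serves locally is cheap; a sketch (the Host header value is the public service name used for testing earlier, and the frontend is assumed to be listening on port 80 locally):

    varnishd -V 2>&1 | head -1
    curl -s -o /dev/null -D - -H 'Host: maps.wikimedia.org' http://localhost/ | head -5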
[21:48:50] just finished checking, repooling now
[21:49:00] done
[21:49:33] well basically fixing varnish4 at the other DCs, then adjusting weights and merging the next patch, gets us to a good sane stopping point
[21:49:47] and those things, you're not going to learn much from anyways, you've seen them already
[21:50:03] I'd say bail now, I'll finish varnish4 + weighting + merge next patch to kill old servers
[21:50:15] we can leave it like that for the night, running on the new infra but users still only coming in through eqiad
[21:50:24] and do the rest tomorrow sometime, turning up traffic in the other DCs
[21:52:45] sounds like a plan! Thanks a lot for the lesson!
[21:52:59] np! get some good sleep :)
[21:53:44] good night!
[22:24:05] 10Traffic, 07Varnish, 06Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2237213 (10BBlack) Noticed doing v3 -> v4 upgrades on the new cache_misc hosts, I had to do the following to effect the upgrades: 1. Disable puppet on affected node...
[23:06:11] 10Traffic, 06Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2237325 (10BBlack) @Gehel and I made partial progress on this today, to resume tomorrow. Current situation: 1. all 16x new cache_maps mac...