[04:08:16] 10Traffic, 10DNS, 10Operations, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4012780 (10Prtksxna) >>! In T188362#4010716, @Dzahn wrote: > Access is still denied to me on T175445 which is surprising because... [09:43:56] 10Wikimedia-Apache-configuration, 10Operations, 10User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4013161 (10Joe) [10:02:28] 10Traffic, 10DNS, 10Operations, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4013204 (10Aklapper) T175445 is neither NDA nor Security but custom... Going to https://phabricator.wikimedia.org/maniphest/task/... [11:01:48] so I've added a graph plotting numa_vmstat_(hit|mis) rate per second [11:02:17] there are differences between cp4021 (numa_networking: on) and cp4022 (numa_networking: off) [11:02:21] (surprise!) :) [11:02:27] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=68&fullscreen&edit&orgId=1&var-server=cp4021&var-datasource=ulsfo%20prometheus%2Fops [11:02:35] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=68&fullscreen&edit&orgId=1&var-server=cp4022&var-datasource=ulsfo%20prometheus%2Fops [11:03:57] it seems that on cp4022 (numa off) hits decrease with the increase of misses [11:04:15] that is not the case on cp4021 (numa on) [11:04:46] note that in our case numa_miss==numa_foreign [11:05:07] > numa_miss A process wanted to allocate memory from another node, but ended up with memory from this node. [11:05:14] > numa_foreign A process wanted to allocate on this node, but ended up with memory from another one. [11:06:22] (I have no idea why they are the same, just staring at the graph) [11:20:33] <_joe_> well for every numa_miss on one node, you should have a numa_foreign on another node [11:20:36] <_joe_> right? [11:21:26] <_joe_> how do you collect those metrics btw? [11:21:34] oh, because we have only two nodes! [11:23:26] <_joe_> well even if you have 100, you're not showing per-node stats, right? [11:23:41] <_joe_> so the total number of events will always be the same [11:23:52] that's what I'm trying to understand :) [11:24:33] right, the stats I'm plotting now are node_vmstat_numa_hit, which are "global" [11:24:38] so yes, you're right [11:25:14] if we'd use node_memory_numa_*, instead, they'd be per-node [11:25:23] <_joe_> yes [11:25:30] <_joe_> not sure how interesting it is [11:25:43] <_joe_> it could show hot nodes [11:26:03] _joe_: they're coming from prometheus-node-exporter (vmstat and meminfo_numa collectors) [11:27:26] <_joe_> very cool anyways [11:27:42] <_joe_> I would love to have the time to do similar monitoring/tuning on the appservers [13:27:24] I suspect the TL;DR will be that the basics of the numa_networking approach are a sound thing to do on any network centric server that can have multiple reuseport-based listening sockets. [13:27:49] (in separate threads, and with an appropriate ethernet adapter) [13:28:44] and so eventually we'll probably want to default it on for any such cases we know of where the sw+hw combination makes sense, which is easy to derive just for applicable software classes + facts, and not need a hiera setting. 
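(For reference, the numa_hit/numa_miss/numa_foreign counters discussed above can be eyeballed by hand on any cache host with a minimal shell sketch using only standard procfs/sysfs paths, nothing WMF-specific. The global view is what the node exporter's vmstat collector exports as node_vmstat_numa_*, and the per-node files are what the meminfo_numa collector would expose as node_memory_numa_*; on a two-node box the per-node view makes it clear why summed numa_miss equals summed numa_foreign.)

    # Global counters (source of node_vmstat_numa_hit etc.):
    grep '^numa_' /proc/vmstat

    # Per-node counters (what node_memory_numa_* would be built from);
    # every numa_miss charged to one node shows up as numa_foreign on
    # the node the allocation was originally intended for:
    for n in /sys/devices/system/node/node*/numastat; do
        echo "== $n"; cat "$n"
    done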
[13:28:53] the hiera setting is more about the experimental phase of gradually testing it [13:33:41] morning bblack [13:34:08] I was giving some love to your last week assignment, aka deep into nginx releasing here at WMF [13:34:53] so I was trying to find in gerrit the changes involving taking changes from upstream to our repo [13:35:13] ok [13:35:20] 8572f8c Merge tag 'debian/1.13.6-2' into wmf-1.13 --> let's take this one [13:35:41] that commit has the following change-id associated: I70f8ced4257f0039c519c0d76f3092f393d93b3c [13:35:50] but I'm not able to find that change-id in our gerrit [13:36:21] yeah, I tend not to push the upstream imports through gerrit [13:36:47] (because then it would be pointless gerrit-churn of something not actually being reviewed, for N commits straight) [13:36:48] I suspected that, but the change-id confused me [13:36:50] but it's so much fun if you do [13:37:10] the change-id is auto-generated by the commithook [13:37:41] we've had a few different approaches to the nginx situation, the branch history overall is confusing [13:37:53] yup [13:39:04] but yeah, currently the basic idea is wmf-1.13 is a copy of debian's 1.13, and has direct local commits where we've patched/modded things, and we just keep merging new debian tags in on top from time to time. It's not ideal either, but not as bad as some other approaches we've had before :) [13:39:51] the only thing that makes it sane to handle, is the implicit rule that we don't patch the sources directly in git there when we make local changes. [13:40:23] if there's any diff between debian's 1.13 tags and ours (after merging in theirs) coming from local wmf commits/work, it's in debian/ [13:40:23] for that you use the patches/*.patch stuff, right? [13:40:28] yup [13:40:41] mostly it will be in debian/patches/, but sometimes it's a change to other debian/ files outside of patches, too [13:41:06] but you can kinda summarize it by seeing the debian<->wmf diffs of debian/, and looking for the commits that caused them. [13:41:56] and then we're also transitionally supporting both jessie and stretch packagings of this [13:42:13] stretch is our default packaging in the wmf-1.13 branch now, and wmf-1.13-jessie is the jessie packaging of it [13:43:05] I see that we are a few minor releases behind... debian/1.13.9-1 VS 1.13.6-2 [13:43:47] yeah, in general we usually wait to upgrade until we see something in the changelog on hg.nginx.org that looks noteworthy for us [13:44:37] there hasn't been anything super-compelling lately that I've seen, but I also haven't trawled that list in a while [13:44:59] and in general, it would be better if we just routinely updated instead of sporadically :) [13:45:10] call me paranoid.. *) Bugfix: in the ngx_http_v2_module. [13:45:16] that smells like a security bugfix [13:45:35] yeah there's a few decent ones in the 1.13.9 set in particular, it looks like [13:45:42] included in 1.13.9 according to http://nginx.org/en/CHANGES :) [13:46:47] I'd pay less attention to CHANGES and more to the hg commitlog to dig in though [13:47:15] e.g.
new in 1.13.9 is this: http://hg.nginx.org/nginx/rev/8b0553239592 [13:47:41] but it's quite possible the issue it's fixing was also introduced in 1.13.9 with this: http://hg.nginx.org/nginx/rev/3d2b0b02bd3d [13:47:49] (but I haven't read deeply into the two to see if that's the case) [13:48:48] before we move to a new nginx release, it would also be nice to complete https://phabricator.wikimedia.org/T164456, I think at this point most nginx installations should be the same on jessie and stretch [13:49:39] good point! [13:50:18] actually, elastic is running an older version still as well, but should be straightforward to fix [13:51:01] although it doesn't use tlsproxy, so unrelated [13:51:12] IOW only conf* is running an older version at this point [13:51:44] yeah so like I said in the comment, the upgrade itself should be seamless. I'd just not want to be doing it to someone else's server without them :) [13:53:17] I guess conf* is j.oe? [13:53:41] the upgrade on mw* needed some trickery wrt conffile handling, in the end I used "apt-get install nginx-full -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" [13:53:48] yeah, mostly _joe_ [13:57:38] vgutierrez: in any case, you can still prep+work on 1.13.9 packaging (TL;DR - merge in new upstream, re-quilt our patches to still apply cleanly (hopefully trivial!), maybe remove the do_ssl_wait_shutdown patch as that experiment proved pointless) [13:57:49] just don't upload them yet till we resolve the switch of everything to -light [13:58:46] I'm not sure which kind of workflow you're starting to use with gerrit [13:59:06] I tend not to use gerrit-specific tooling, and just use generic git CLI and memorize the rules of refs/for/X [13:59:11] I'm mostly suffering with gerrit :) [13:59:25] basically you can treat it like a normal hosted git repo [13:59:26] right now I rely on git-review [13:59:39] but yep... I saw the hard-way mode [13:59:45] the way I think of it is like this: [13:59:57] (assuming origin and sitting on a master branch matching upstream master) [14:00:14] git push origin master # would be the norm for a normal raw git repo hosted somewhere [14:00:34] git push origin master # also works for gerrit, but completely bypasses review, and our settings usually prevent it for most repos [14:00:57] git push origin HEAD:refs/for/master # works with gerrit, and puts your changes up through HEAD into the review queue for master instead [14:02:09] the other end of it is pulling down others' patches to look at / review / edit (or your own, because you deleted them from your local branch to work on other stuff after you uploaded them days ago) [14:02:28] for that I use gerrit's web UI links [14:02:37] e.g. you can start by going to https://gerrit.wikimedia.org/r/#/c/415204/ [14:02:48] hit the Download dropdown in the upper right [14:03:01] copy the Cherry-pick command, and it gives you this in your pastebuffer: [14:03:05] git fetch ssh://bblack@gerrit.wikimedia.org:29418/operations/puppet refs/changes/04/415204/4 && git cherry-pick FETCH_HEAD [14:03:34] which will cherry-pick that one commit-in-review back on top of your current local branch, preserving the commitid so new revs of it can be pushed up again using refs/for/ [14:04:02] yup..
that's basically git review -d change_number I guess [14:04:20] I find git-review gets increasingly-confusing the more advanced you try to make use of it [14:04:25] but personal choice, some love it :) [14:06:03] in the case of nginx upgrade, my normal-ish workflow is something like: [14:06:20] 1) locally merge debian's 1.13.9-1 or whatever onto wmf-1.13 [14:07:06] 2) Push that directly (no review) [14:07:28] 3) do local commits for fixups (e.g. re-quilting our patches to apply cleanly, removing dead patches, etc) [14:07:35] 4) do a changelog commit for our new release [14:07:44] 5) push those changes through for-review using refs/heads/ [14:08:14] I think sometimes I put 3+4 in the same commit if it's a simple case [14:09:36] "702c661eb Release 1.13.6-2+wmf1 for stretch" was such a case, it has a patch line numbering quilt fixup in the same commit as the changelog entry [14:13:31] s,refs/heads/,refs/for/, above in (5), oops :) [14:33:25] git log --left-right origin/wmf-1.13...debian/1.13.9-1 --oneline |grep ">" |wc -l [14:33:33] 24 [14:33:41] 24 new commits from upstream [14:33:45] not that bad [14:55:37] ema: I'm still slightly concerned/hedging about deploying a reload-vcl that autodiscards, in case we haven't found all the varnish bugs that may randomly crop up, but maybe I'm being paranoid at this phase? [14:55:54] ema: how do you feel about it? [14:57:19] bblack: I feel slightly concerned as well :) How about we deploy it on a few selected hosts and see how it goes? [14:57:20] we could potentially limit it a bit to avoid possibility of mass problems. even something simple like putting a random 10% chance around the autodiscard code block for now, so that if there's some systemic issue it doesn't spread too fast, and then ramp that up later if things look sane [14:57:53] ema: if we wanted to selectively deploy the new script as a whole, it's problematic because the flags have changed, etc [14:58:12] I think the rest of the change is pretty sane and healthy, it's mostly the autodiscard that scares me a little. [14:59:17] (or make autodiscard a CLI option and then control the rollout of that option in the set via puppet) [14:59:18] maybe add a flag to reload-vcl to skip discarding and deploy with that flag on first? [14:59:31] ok [14:59:34] yeah :) [15:04:47] Based on customer feedback, we will be enhancing the SSH command line [15:04:47] interface in a future release of the iLO 4 firmware. [15:04:47] Release Notes on www.hp.com/go/iLO for additional information. [15:04:54] (and that's a 404) [15:06:08] lol [15:08:52] why are those consoles always so incredibly bad btw [15:09:00] 7jBZCBy{Nʄ?h(Zԫʔ^kjk [15:09:07] this is what I get on lvs1005 ^ [15:09:24] (and more gibberish which I'm sparing you) [15:09:55] that's a drac to be fair, they all have their own reasons to be hated [15:11:56] even just the initial SSH exchanges are slow at times, meh [15:12:24] the general feeling I get when it works is "yeah cool this time it kinda works, who knows next time!" [15:12:34] when it doesn't work, I'm not particularly surprised [15:12:46] the slow SSH exchanges are due to outdated firmwares [15:13:07] https://phabricator.wikimedia.org/T171041 [15:16:55] 10Traffic, 10Operations, 10Page-Previews, 10RESTBase, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4014055 (10Niedzielski) @bblack, @phuedx sorry to be a bother. This seems like an important issue as we're trying to rollout page previews to prod th... 
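(A rough shell sketch of the upstream-merge workflow bblack outlines above in steps 1-5, under a few assumptions: a local checkout of the wmf nginx packaging repo, Debian's packaging repo added as a remote named "debian", and an illustrative version string following the 1.13.6-2+wmf1 pattern; the remote name and the quilt/dch details aren't confirmed in the discussion.)

    # 1) merge the new debian tag onto our branch
    git checkout wmf-1.13
    git fetch debian --tags        # "debian" remote name is an assumption
    git merge debian/1.13.9-1

    # 2) push the import directly (no review churn for upstream commits)
    git push origin wmf-1.13

    # 3) re-quilt our local patches so they still apply cleanly
    export QUILT_PATCHES=debian/patches
    quilt push -a                  # fix offsets/rejects and quilt refresh as needed
    quilt pop -a

    # 4) changelog entry for our release (version string is illustrative)
    dch -v 1.13.9-1+wmf1 "Merge upstream 1.13.9-1"

    # 5) push the fixup + changelog commits through review
    git push origin HEAD:refs/for/wmf-1.13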
[15:25:19] ema: added "-a" flag to turn on auto-discard, which is not currently set and thus inactive, and re-tested the new python on cp5001, seems ok [15:25:27] if it runs and passes pep8 it has to be good, right? [15:25:30] :) [15:26:06] ema: https://gerrit.wikimedia.org/r/#/c/415204/4..5/modules/varnish/files/reload-vcl [15:26:51] (I split the new func because line lengths were getting overlong from indent. thanks python+pep8 for forcing me not to create long functions :P [15:26:54] ) [15:28:28] bblack: LGTM yeah, minor unrelated nitpick is that we could re.compile out of the for loop first (but whatever) [15:29:46] seems like premature optimization if it adds lines and nobody will ever know the difference :) [15:30:31] (in any case, shouldn't that be a language-level optimization if the re-string is a constant?) [15:31:45] \_/ it [15:32:03] lol [15:32:14] you like my advanced ascii art ship? [15:32:28] making it so! [15:32:49] !log disabling puppet on A:cp for deploy of https://gerrit.wikimedia.org/r/#/c/415204/ and friends [15:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:23] (btw, that really should echo over to -ops, I wish it did) [15:33:38] logmsgbot does that for other channels' !log statements [15:36:05] PCC has failed us lol [15:36:15] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Invalid relationship: Service[varnish] { require => File[/etc/default/varnish] }, because File[/etc/default/varnish] doesn't seem to be in the catalog [15:39:01] bblack: there's still a reference to /etc/default/varnish in modules/varnish/manifests/instance.pp which I think shouldn't be there? [15:39:35] yes, fixed [15:39:35] ah yeah you just merged a patch addressing that :) [15:39:39] but how did PCC miss that? [15:40:14] dunno [15:42:49] I vaguely remember that for some valid reason PCC isn't able to do certain things, which might include the scenario above [15:43:25] yeah, I know I usually have to remember stuff about catalog caches and new facts and such [15:43:51] but you'd think if a puppet server is unable to compile because there's a service dep on a completely-missing resource, that it couldn't possibly do a test-compile for comparison. [15:43:58] I suspect my understanding of how PCC works is flawed :) [15:44:31] so the error there is 'Could not retrieve catalog from remote server' [15:44:53] perhaps the catalog was indeed compiled properly but the next step went wrong, and pcc takes care of the former only [15:45:01] perhaps [15:45:14] step 1: compile to some intermediate form, step 2: resolve relationships? 
[15:45:20] and PCC only does step 1 [15:46:35] Puppet configures systems in two main stages: [15:46:38] <_joe_> catalog gets compiled onto the master [15:46:45] <_joe_> then it's executed by the agent [15:46:50] <_joe_> which resolves execution order [15:47:11] yeah :) [15:47:11] https://puppet.com/docs/puppet/4.10/architecture.html [15:47:14] <_joe_> I still haven't found a handy way to make puppet compute the graph from the catalog without doing a puppet run [15:47:45] <_joe_> it needs someone to dive deep into puppet internals and extract the ruby incantation to just do that, given a catalog [15:47:58] <_joe_> tbh, it's not a pleasant experience [15:48:19] <_joe_> it's some deeply convoluted ruby with all kinds of wizardry and indirections [15:48:36] <_joe_> maybe the new clojure version is better, no idea [15:48:55] s/wizardry/insanity/ [15:50:08] anyways, with the fixup it's rolling out cleanly so far [15:51:19] great [15:51:55] 10Traffic, 10DNS, 10Operations, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4014170 (10Dzahn) Yes, I can see the ticket now, thanks @Prtksxna and @Aklapper. [15:52:15] I'm upgrading lvs1003, last LVS host for kernel upgrades (CC: moritzm) [15:52:38] OK! [15:53:37] I want to fix the numa_networking stuff before we do cache kernel reboots, FYI [15:54:15] (it can be enabled at runtime, but we'll probably get better data on how they behave if they're rebooted anyways, vs long-running varnishds adapting to the new patterns at runtime) [16:00:45] sounds good [17:00:51] so the reload-vcl thing seems sane so far [17:00:55] going to push the probe updates [17:02:17] well after another run through PCC to confirm the ms values [17:07:25] ema: re: numa_networking, maybe better than going site-by-site would be to turn it on for a percentage of hosts at all sites, for easier cross-comparison given similar traffic/load [17:07:39] in which case I can leave the puppetization alone for now and do hosts/foo.yaml for them for now [17:08:06] (say 2 hosts per cluster+site combo for now, and then look at doing it for all after that has run for a couple weeks) [17:08:07] yeah that seems reasonable for easier comparison (and fewer puppet struggles) [17:09:05] which reminds me that I wanted to move the numa graph closer to the one about memory :) [17:16:27] done, they look interesting! https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp4022&var-datasource=ulsfo%20prometheus%2Fops [17:17:32] yeah [17:18:19] so the numa misses are higher on 4021, the "more-optimal" configuration [17:18:27] but also less-spiky [17:19:04] I suspect the stats are mostly driven by varnish, not nginx [17:19:52] (and the situation for varnish is different when nginx numa_networking:on) [17:20:26] they do more cross-node allocs, but it's a consistent penalty instead of spiking around, because node1 isn't affected by network traffic causing evictions as much?
[17:20:29] or some such theory [17:21:27] since we only have 2-node machines, and miss==foreign when summed across them [17:21:49] it might make sense to dump numa_foreign from the graph, but then also split by-node (graph hit/miss separately for node0 + node1) [17:22:30] 10Traffic, 10Operations, 10Page-Previews, 10RESTBase, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4014590 (10Niedzielski) I tested a few more endpoints just using the [[ https://en.wikipedia.org/api/rest_v1/ | documentation site ]] and here's what... [17:22:51] yup! [17:23:09] gotta run out for a bit now, will make it so when I'm back :) [17:31:38] another random interesting datapoint: [17:32:23] when looking at backend health probes headed from eqsin->(codfw+eqiad) [17:32:42] there's some latency variance which is fairly sticky per-host, so it's not random [17:32:58] some backends respond to the simplistic /varnishcheck faster than others, by enough to be concerning [17:34:04] for codfw the average is ~417ms and for eqiad it's ~483ms, which are reasonable averages all things considered (we're only looking for ballpark ~2xRTT) [17:34:26] but in both cases, the variance from best<->worst is about a 40ms spread [17:34:48] and it stays consistent as to which specific backend hosts are slowest or fastest [17:35:28] for an example, for upload healthchecks from eqsin->eqiad [17:36:44] the current best is cp1063 at ~464ms (but there are a few close to this minimal value) [17:36:57] but cp1062 is consistently much higher at ~507ms [17:37:31] in the net, the average of them all is about in the middle (the best are ~20ms better than avg, and the worst are ~20ms worse than avg) [17:38:41] at first I assumed it might be a varnish-level effect (for whatever reason, some varnishds are laggier), but it's there in ping data too: [17:38:49] cp5001->cp1062: [17:38:54] rtt min/avg/max/mdev = 256.965/256.993/257.033/0.254 ms [17:38:58] cp5001->cp1063: [17:39:05] rtt min/avg/max/mdev = 248.021/248.054/248.108/0.621 ms [17:41:05] these two hosts (1062 + 1063) are also in the same row/vlan [17:41:11] so no obvious explanation there [17:42:49] they have consistent ping times for both from anywhere else I test [17:42:59] even bast5001 gives them the same times (both ~248) [17:43:43] just not from cp5001... [18:48:06] 10Traffic, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, and 3 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4015126 (10Ottomata) [18:54:39] 10Traffic, 10Analytics, 10Analytics-Cluster, 10Operations, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4015160 (10Ottomata) [20:43:02] 10Traffic, 10Citoid, 10Operations, 10RESTBase, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#4015516 (10mobrovac) [21:26:27] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015735 (10Volker_E) @Dzahn: Repos are requested. [21:38:14] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015774 (10Dzahn) @Volker_E Alright, thanks for the update! As you can see above i started uploading some changes and merged what... 
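(A small shell sketch of the kind of per-backend comparison described above, run from cp5001: pair the ICMP RTT with a few timed fetches of /varnishcheck for the fast and slow eqiad hosts. The .eqiad.wmnet names, the backend port 3128 and plain HTTP for the check are assumptions here, not confirmed in the discussion.)

    for host in cp1062.eqiad.wmnet cp1063.eqiad.wmnet; do
        echo "== $host"
        ping -qc 10 "$host" | tail -1
        for i in 1 2 3; do
            # time_total covers the whole request, so it should land in
            # the ballpark ~2xRTT range mentioned above
            curl -so /dev/null -w '%{time_total}s\n' "http://$host:3128/varnishcheck"
        done
    done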
[21:39:14] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015776 (10Volker_E) Great! [22:31:36] ema: FYI I rolled out numa stuff to a bunch of nodes (2/N for all sites' text+upload clusters): https://gerrit.wikimedia.org/r/#/c/415631/ . All seems reasonable so far. [22:39:59] ema: also, our vcl.list counts are insane in general. The worst offender I found so far is cp3034 (upload) frontend at 181 VCLs in the list [22:40:23] I'm gonna try manually discarding out a few of these nodes and see how it goes (carefully) [22:53:56] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015968 (10Bawolff) > Please also consider for your planning enough time for the security team to do a review. They will be be abl...
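(For the vcl.list cleanup mentioned above, a hedged one-off sketch: list the loaded VCLs on a frontend instance and discard the ones no longer active. The "frontend" instance name and the vcl.list column layout are assumptions, and the output format differs between Varnish versions, so eyeball the list before piping it into vcl.discard.)

    varnishadm -n frontend vcl.list
    # discard everything not currently active; the in-use VCL is listed
    # as "active", never "available", so it is skipped automatically
    varnishadm -n frontend vcl.list |
        awk '$1 == "available" {print $NF}' |
        while read -r vcl; do
            varnishadm -n frontend vcl.discard "$vcl"
        done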