[07:44:45] oh so vhtcpd segfaults were due to varnish frontend mem resizing, interesting
[07:45:09] I assume https://gerrit.wikimedia.org/r/#/c/352601/ fixed it
[08:00:15] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3246519 (10ayounsi)
[08:35:38] 10Traffic, 06Operations: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3246599 (10ema) p:05Triage>03Normal
[08:38:04] 10Traffic, 06Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3234359 (10ema) p:05Triage>03Normal
[10:07:32] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3246796 (10Qse24h)
[10:08:25] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (10Qse24h)
[10:09:16] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (10Qse24h)
[10:12:44] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (10Qse24h)
[10:13:39] 10netops, 06Operations, 10fundraising-tech-ops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (10Qse24h)
[10:15:23] 10Traffic, 06Operations: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3244737 (10Qse24h)
[12:14:59] ema: yeah, and probably. we'll see if the new sizes work :)
[12:16:16] volans: re the formula, I think the "remove a constant factor" part mirrors the reality of the situation better, but there's a lot of unknowns too
[12:17:12] in some simple sense you'd just split the avail mem between the two major needs: frontend malloc and backend disk's buffer/cache, say 66% FE and 33% left to buffer disks
[12:17:43] but then there's also a constant factor for the smaller amount of memory taken up by everything else on the host and reasonable headroom
[12:19:23] those other "constant" bits really shouldn't scale up with mem size. they should be things like the relatively-static amounts monitoring tools take, or nginx takes, etc.
[12:32:05] https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/
[12:32:19] ^ APNIC has a nice BBR article. If you already know TCP history, skip the first bit :)
[12:51:57] bblack: I'd assumed that if the solution was as easy as (TOTMEM - RESERVED) * 0.66 you would have already used that :)
[12:52:32] so I just looked 3 minutes at something that might fit the trends of those numbers... thanks for the article, looking
[13:43:58] ema: can you check out https://gerrit.wikimedia.org/r/#/c/352826/ etc?
[13:56:46] bblack: sure
[14:06:43] bblack: so on misc and maps we want to keep default_ttl as it is, right?
[14:07:48] yeah maps was already 1d and misc is 1h
[14:07:55] (I think!)
[14:08:01] yeah, on maps it's already 24h
[14:09:22] looks good!
[14:11:09] double-checking the exp_thread patch
[14:16:14] bblack: both lgtm
[14:17:09] ok
[14:17:32] 10netops, 06Operations, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3248147 (10ayounsi) 05Open>03Resolved Now that LibreNMS has been upgraded to the most recent version, I've been able to poke more at thi...
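(A rough shell sketch of the sizing split described at 12:17-12:19 and the (TOTMEM - RESERVED) * 0.66 formula at 12:51; the variable names and the fixed reserve below are illustrative assumptions, not the values actually used in puppet.)

    # Subtract a constant reserve for everything else on the host (monitoring, nginx,
    # headroom), then split what remains roughly 2:1 between the frontend's malloc
    # storage and memory deliberately left free as buffer/cache for the backend disks.
    TOTMEM_MB=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
    RESERVED_MB=8192    # assumed constant; does not scale with total memory
    FE_MALLOC_MB=$(( (TOTMEM_MB - RESERVED_MB) * 2 / 3 ))
    BE_CACHE_MB=$(( (TOTMEM_MB - RESERVED_MB) - FE_MALLOC_MB ))
    echo "frontend malloc: ${FE_MALLOC_MB} MB, left for backend disk buffer/cache: ${BE_CACHE_MB} MB"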
[14:19:49] the maps thing, of course, is a little more complicated!
[14:20:33] (and in any case, in my mind still blocked on getting backends to do storage-wipe-on-restart, and then deploying some altered storage binning percentages)
[14:20:53] (and also, we need to prepare analytics for webrequest_maps data showing up in webrequest_upload instead)
[14:34:31] (yes please :)
[14:41:17] 10netops, 10Monitoring, 06Operations, 13Patch-For-Review: nagios monitor transit/peering links and alert on low/high traffic - https://phabricator.wikimedia.org/T80273#3248224 (10ayounsi) 05Open>03Resolved As we have some link with 0.2% of outbound traffic, I added a LibreNMS rule to alert if any traff...
[14:47:47] 10netops, 10Monitoring, 06Operations, 13Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#3248266 (10ayounsi) A new check has been added to LibreNMS to monitor "show system alarms" (yellow and red) As well as all the moving parts (PSU/FAN/etc...)
[14:48:47] 10Traffic, 06Operations, 13Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3248269 (10BBlack) @elukey - I think the only real analytics fallout here is that the data that is currently feeding to you as `webrequest_maps` will become data that's mix...
[14:49:50] 10Traffic, 06Operations, 13Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3248278 (10elukey) I am going to ask to my team and report back asap!
[14:50:26] godog, bblack: qdisc collector for node_exporter ready for your review/testing pleasure! https://phabricator.wikimedia.org/P5408
[15:05:59] bblack: so, the nginx upgrades are done right?
[15:12:47] ema: nice! there's also a tool in prometheus to check/lint metric names that can be useful https://github.com/prometheus/prometheus/issues/1953
[15:18:03] ema: yes, and I forgot to even mention it in our weekly update :)
[15:25:52] bblack: cool, so the next thing in the todo list would be adding backend storage wipe to the varnish-be systemd unit I guess
[15:26:25] yeah
[15:26:43] mkfs would be the most-assured way to have no lingering effect
[15:26:57] but might take longer than rm -f /srv/sd*/*; sync
[15:27:13] (and the sync does nothing anyways I think, and free space calculation tends to be delayed)
[15:27:30] which is essentially what we currently do in varnish-backend-restart I think
[15:28:04] oh I forgot we already had that in there heh
[15:28:26] yeah, with a sleep hacked in there because of T149881
[15:28:27] T149881: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881
[15:28:28] so actually, our existing restart can handle the binning change. modulo possible concerns about free space accounting races
[15:29:07] I do wonder if the "rm -f all files on the filesystem" solution is actually as-clean as an mkfs or similar
[15:29:32] or if it leaves a bunch of block group accounting and such in a weird state that could mess up getting linear contiguous allocations out of fallocate() and such...
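(A hedged sketch of the storage-wipe step discussed at 15:25-15:28, e.g. as an ExecStartPre= helper for the varnish-be unit; the paths follow the rm -f /srv/sd*/* example above and the sleep length is an assumption, not necessarily what varnish-backend-restart actually does.)

    # Remove the old persistent storage files so a restarted backend starts empty.
    # An mkfs of each filesystem would be the more thorough option, at the cost of time.
    for f in /srv/sd*/varnish.*; do
        rm -f "$f"
    done
    sync
    sleep 10    # crude allowance for delayed free-space accounting (cf. T149881)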
[15:32:42] (I guess since all the upload backends have been doing weekly-or-manually-faster varnish-backend-restarts for a while now, we could maybe just audit whether the current files are contiguous and that would answer that question)
[15:40:37] bblack: filefrag's output is interesting
[15:41:04] on cp2024, for example, we've got:
[15:41:05] /srv/sda3/varnish.bin2: 185 extents found
[15:41:08] bblack: green light for maps, if you could just give us an heads up but that would be all. We'll need to change some data gathering config in hadoop after the fact
[15:41:16] /srv/sdb3/varnish.bin3: 290 extents found
[15:41:51] bin3 is much smaller than bin2 though
[15:46:24] where is that?
[15:46:34] I was looking at cp1071 and it shows zero extents for all of them
[15:46:35] bblack: cp2024
[15:47:05] root@cp1071:~# filefrag -vx /srv/sd*/varnish.*|grep extents
[15:47:05] /srv/sda3/varnish.bin0: 0 extents found
[15:47:05] /srv/sda3/varnish.bin2: 0 extents found
[15:47:05] /srv/sda3/varnish.bin4: 0 extents found
[15:47:05] /srv/sdb3/varnish.bin1: 0 extents found
[15:47:07] /srv/sdb3/varnish.bin3: 0 extents found
[15:47:43] bblack: oh, I've used filefrag without -x
[15:47:44] says zero on cp2024 for me too
[15:47:46] oh
[15:48:14] hmmm
[15:49:26] I see
[15:50:02] so yeah, there are small holes in contiguity in some sense
[15:59:09] 10Traffic, 06Operations: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3248632 (10RobH) So I'll add a few options/items for review: * The router/switches are racked at the tops of the racks. If we move them to mid rack level, there are no power plugs at the middle of the rack to...
[15:59:39] of course, we can also question how much block-linearity even matters on an SSD
[16:00:09] but it might be interesting to experiment further with mkfs optimizations over the existing setup (to use larger cluster sizes, allocate fewer inodes, etc)
[16:33:24] 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3248720 (10BBlack) APNIC has a good writeup here (first half is TCP history redux, second half goes into interesting details and new data on BBR)...
[16:38:40] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3248727 (10elukey) Tried to strace all the nginx processes and this is the relevant part: ``` [pid 36881] connect(97, {sa_family=AF_INET, sin_port=htons(80), sin_addr=in...
[19:07:47] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3249214 (10BBlack) You might want to look at the other side of the nginx proxy as well. Perhaps apache is terminating its connection to the local nginx with RST, and thi...
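(A hypothetical mkfs experiment along the lines of the 16:00 remark about larger cluster sizes and fewer inodes; the device name, filesystem type, and the numbers below are illustrative assumptions, not what the cache hosts actually use.)

    # - no reserved root blocks (-m 0)
    # - roughly one inode per 4 MiB of data, since only a few huge files live here (-i)
    # - 64 KiB clusters via bigalloc, to cut down on block-group bookkeeping (-O bigalloc -C)
    mkfs.ext4 -F -m 0 -i 4194304 -O bigalloc -C 65536 /dev/sdX3
    # after remounting and restarting varnish-be, re-run filefrag -v on the new
    # varnish.bin* files to see whether the fallocate()d extents come out contiguous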
[20:33:48] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249552 (10Framawiki)
[20:34:18] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249567 (10Framawiki)
[20:38:51] can we add wikispecies.org to the main wildcard cert ^
[20:39:01] i know species.wikimedia.org is the official project domain
[20:39:14] but that was probably a mistake and it should have used wikispecies.org from the beginning
[20:39:18] years ago
[20:40:58] well, as things stand now if species.wikimedia.org is canonical, wikispecies would be a redirect to lump in with the pile of other redirects that we end up keeping
[20:41:44] probably the right thing to pursue here is a move to make wikispecies.org canonical. and technologically that would start with adding it to our wildcard list (or getting it a separate cert temporarily until the next renewal, I guess).
[20:42:05] but I'd like to see that switching the canonical URL is something that will actually happen before we do that.
[20:42:49] personally, I have no idea about the project
[20:43:10] but at some point in the long run, we'll probably need to have some standards about adding canonical 2LDs to our list, too
[20:43:50] yea, that's of course all correct, i guess i'm asking for an exemption for a special case where it was done wrong long time ago and because i know the "make wikispecies.org canonical" isn't going to happen
[20:43:59] (because the overall size of our list of canonical domains matters, so there has to be some barrier of project popularity or usefulness or something before consuming slots there... although that probably raises some chicken-and-egg questions too about how to get the popularity up without it)
[20:44:58] I'd say we really shouldn't add redirect-only domains to the canonical (unified cert) list
[20:45:15] yea, if there is a barrier of popularity that will be bad for wikispecies. realistically if you just bring that name up in -staff people will say something like "lol, who uses that" / "should be merged into wiktionary" etc, :p
[20:45:19] but I don't see why it would never happen, to get approval to switch canonicals
[20:45:45] 10Domains, 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249633 (10Urbanecm)
[20:46:49] sticking a new domain into unified adds bytes to everyone's negotiations for everything, basically. and there are buffer/packet limits we trip over with that.
[20:47:03] (one of the reasons for our custom patch for 8K local buffers in the openssl lib, for instance)
[20:47:26] it just doesn't make sense for a low-volume redirect-only
[20:47:52] gotcha, ok. it's just another redirect then like others
[20:48:09] of course for all I know some of our existing canonicals would fail that litmus test of traffic too, I've never really looked since we have so many other lower-hanging issues on that front :)
[20:48:12] they should have just used it from the beginning
[20:48:21] it's the only project inside .wikimedia.org
[20:48:33] and has always been the odd one out that
[20:50:44] on a side note: one day it will become a frontend of wikidata, merely generating content that has all been imported
[20:50:50] i think
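(For context on the 20:46 point about the unified cert's size: a quick way to eyeball the SAN list that every handshake already carries; the hostname is just an example.)

    echo | openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org 2>/dev/null \
        | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'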