[00:52:15] bblack: btw re goland/splice() -- looks like there is some support for the tcp->tcp case: https://go-review.googlesource.com/c/go/+/107715/
[00:53:48] s/goland/golang/
[09:58:20] hashar: morning! When you have a sec: https://gerrit.wikimedia.org/r/#/c/integration/config/+/473503/
[10:08:03] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[10:08:05] Traffic, Operations, Patch-For-Review: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (ema) Open>Resolved
[10:08:22] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[10:12:37] ema: deploying. Thank you for taking care of adding CI jobs to repos :-)
[10:17:42] ema: I did a recheck on the latest merged change ( https://gerrit.wikimedia.org/r/#/c/operations/software/fifo-log-demux/+/473432/ ) the build passed https://integration.wikimedia.org/ci/job/debian-glue-non-voting/2351/ :)
[10:17:58] \o/
[10:18:00] ty hashar
[10:18:57] hashar: so, we might as well use -voting I guess?
[10:23:43] hashar: https://gerrit.wikimedia.org/r/#/c/integration/config/+/473712
[10:29:38] ema: yes :) deploying that one
[11:08:17] ema: deployed, it is now voting
[11:08:18] :)
[11:08:48] hashar: nice!
[13:54:05] sigh.. something is wrong with lvs2010 interfaces
[13:54:11] only 2 of them are getting IP addresses
[13:56:14] but they are apparently configured in /etc/network/interfaces
[13:56:15] hmm
[14:04:22] no way...
[14:04:27] root@lvs2010:~# ifup enp59s0f1d1.2017
[14:04:27] Error: argument "enp59s0f1d1.2017" is wrong: "name" too long
[14:24:51] netops, Operations: asw2-a-eqiad FPC2 reboot - https://phabricator.wikimedia.org/T209588 (ayounsi) p:Triage>Normal
[14:34:53] Krenair: please when you get the chance check https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/473706/ and https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/473713/
[14:35:04] I'd like to release that before testing certificate deploy through puppet
[14:58:12] vgutierrez, hm we didn't end up using the dns challenges dir did we?
[14:58:30] Krenair: currently certcentral stores the challenges there
[14:58:37] but we could get rid of it
[14:59:36] right but we don't actually read out of it do we?
[15:00:24] anyway I merged the first commit and made the cherry-pick that needs to be merged with the debian branch commit
[15:03:08] yep, that's why I was saying that we could get rid of it
[15:03:50] yeah
[15:10:15] vgutierrez: on the lvs topic, we need a new naming scheme for the vlan interfaces basically?
[15:10:28] bblack: indeed
[15:10:42] vgutierrez: I don't think there's much (maybe nothing) hardcoded that depends on the current scheme, it's just arbitrary to make it easy for humans
[15:10:58] bblack: we can discuss it in the meeting if you want :)
[15:12:08] yeah
[18:37:06] Traffic, Operations, Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (RobH) > Vivian, > > Today I received a notice about "COMPLETED - Scheduled Generator Capacity Upgrade at the SG3 IBX [5-168459376275]" to my rhals...
[18:45:45] bblack: still good to go for https://gerrit.wikimedia.org/r/#/c/472436/ today?
[18:48:59] gilles: yes!
[18:49:17] gilles: now-ish?
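The `ifup` failure above comes from the Linux interface-name limit that iproute2's `ip` enforces: `IFNAMSIZ` is 16 bytes including the trailing NUL, so a name can be at most 15 characters, and `enp59s0f1d1.2017` is 16. Below is a minimal Python sketch of that check, roughly mirroring the rules in the kernel's `dev_valid_name()`. The `vlan_ifname()` helper and its `vlan<id>` scheme are hypothetical, just one way a shorter naming scheme could look, not what was decided here.

```python
# Linux limits interface names to IFNAMSIZ bytes *including* the trailing NUL
# (see <linux/if.h>), so the longest usable name is 15 bytes.
IFNAMSIZ = 16


def valid_ifname(name: str) -> bool:
    """Rough mirror of the kernel's dev_valid_name() rules."""
    raw = name.encode()
    if not raw or len(raw) >= IFNAMSIZ:
        return False
    if name in (".", ".."):
        return False
    # '/', ':' and whitespace are rejected anywhere in the name
    return not any(c in "/:" or c.isspace() for c in name)


def vlan_ifname(vlan_id: int) -> str:
    """Hypothetical short scheme: 'vlan<id>' instead of '<parent-nic>.<id>'."""
    return f"vlan{vlan_id}"


print(len("enp59s0f1d1.2017"), valid_ifname("enp59s0f1d1.2017"))  # 16 False
print(vlan_ifname(2017), valid_ifname(vlan_ifname(2017)))          # vlan2017 True
```

The tradeoff with a scheme like `vlan<id>` is that it drops the parent NIC from the name, so it only works if the same VLAN ID never needs to appear on two different parent interfaces on one host.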
[18:49:42] bblack: sure, let's do it
[18:53:10] trying to figure out rebase bs
[18:53:22] ah yes, classic gerrit
[18:54:17] got it
[18:55:56] it's rolling out via cumin now, takes a few mins
[18:59:39] done
[19:00:07] thumbor seems displeased, roll back?
[19:00:09] gilles: ^?
[19:00:16] a big spike is expected
[19:00:42] there was a paging alert on thumbor being unresponsive to icinga
[19:00:47] let's wait a bit more
[19:01:14] yeah it's probably that some requests are failing/taking a while due to the high concurrency
[19:01:33] it's mostly 200s coming out, though
[19:03:12] seems to be over the bump already, we'll know for sure in a few minutes
[19:04:32] the initial spike might've been lower if I had let it gradually roll out naturally over ~30m to all the cache hosts, too
[19:04:36] this is a situation where the haproxy thing would have helped/prevented the alert
[19:06:09] when it errors it should fall back to the existing jpg/png
[19:14:18] all good in icinga now, right?
[19:14:26] yeah
[19:15:24] well, maybe
[19:16:10] digging a little deeper, it seems odd that a bunch of ms-be systems have HP RAID checks in an unknown state
[19:16:50] 1037 6h ago, 1038 2h ago, several of them ~20-40mins ago
[19:17:02] and the most-recent 5 happened since the change
[19:17:09] I think it's unrelated, but looking in case
[19:18:23] ms-be1034 is the latest example, and has in syslog:
[19:18:29] Nov 15 19:03:19 ms-be1034 smartd[1286]: Device: /dev/sda, failed to read Temperature
[19:18:37] (for all devices, all at the same time)
[19:18:57] godog: ^ ? (if you're around)
[19:20:49] assuming it's unrelated, since the pattern goes further back than this change
[19:25:56] looking at a random esams varnish, we're now serving 2% of our image responses as image/webp (not differentiating originals and thumbs)
[19:26:17] when the threshold was 1000 it was 0.035%
[19:29:31] nice!
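The threshold being tuned here gates which frontend cache hits get retried as WebP. Below is a minimal Python sketch of that kind of decision, assuming the inputs are the request's `Accept` header, the cached object's hit count and size, and the threshold. The function names are hypothetical and this is not the actual VCL. The size cap reflects the log's note that objects over 256K are never cached at the frontend (assumed here to mean 256 KiB).

```python
# Assumed frontend cache size cap: larger objects are never cached at the
# frontend, so they can never be frontend hits and never enter the experiment.
FRONTEND_MAX_OBJ = 256 * 1024


def accepts_webp(accept: str) -> bool:
    """True if the Accept header explicitly lists image/webp with q > 0."""
    for media_range in accept.split(","):
        fields = [f.strip() for f in media_range.split(";")]
        if fields[0].lower() != "image/webp":
            continue
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    return float(value) > 0
                except ValueError:
                    return False  # malformed q-value: be conservative
        return True  # no q parameter means q=1
    return False


def should_try_webp(accept: str, hits: int, size: int, threshold: int) -> bool:
    """Decide whether a frontend hit should be retried for a WebP variant."""
    if size > FRONTEND_MAX_OBJ:
        return False  # not frontend-cacheable, outside the experiment
    return hits >= threshold and accepts_webp(accept)


example_accept = "image/webp,image/apng,image/*,*/*;q=0.8"
print(should_try_webp(example_accept, hits=1000, size=40_000, threshold=1000))  # True
print(should_try_webp("image/png,image/*;q=0.8", 5000, 40_000, 1000))          # False
```

The sketch requires the literal `image/webp` token rather than honouring `image/*`, since wildcard media ranges are also sent by browsers that cannot decode WebP. Dropping the threshold to 0 corresponds to the "not counting hits at all" option.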
[19:31:27] looking at requests that have "Accept: image/", 45% of these have webp in their list
[19:31:40] that's the "potential" we're looking at, if there was no threshold
[19:33:12] the webp corpus is still building up, though. the curve is flattening on thumbor but we're still at almost 2x the amount of 200s as before
[19:38:53] hah, that 2% is going down, now 0.9%. I guess the hottest images are quick to get into people's caches
[20:30:13] bblack: let's lower it again next week?
[20:30:48] as long as the swift storage trend line doesn't tilt, I think we're good to keep lowering the threshold
[21:10:43] how much lower do you think we should go? are we aiming towards not counting hits at all and just restarting requests from all agents that support webp?
[21:13:45] or for that matter, not even restarting, but just rewriting
[21:16:15] the other thing to think about, is at present we're only doing this at frontends for frontend hits
[21:16:47] all objects >256K in size aren't cached at the frontend at all, so they're not really in this experiment. I'm not sure how many thumbs are that big, but possibly many
[22:04:13] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) Ok, picking this back up! I emailed into our support case 91912127436 > Support, > > This was dropped and not picked back up, so I'm trying to determine the status now. >...
[22:18:39] bblack: semi-interesting paper: https://arxiv.org/abs/1811.04288 "IP Geolocation through Reverse DNS". As you may imagine it's not a general solution but some of the data is interesting IMHO
[22:43:41] volans: plus it has machine learning, can't go wrong :)
[22:44:19] lol, indeed!
[22:44:43] the idea is not that bad, just to really applicable globally as many providers don't have that information in the reverse DNS
[22:45:27] s/to/not/
[22:48:38] or they might just stop providing it :)
[22:48:43] or start lying!
[22:49:41] both true!
[22:50:18] to be honest I was hoping for something better from the title, and then reading the actual paper left me quite meh about the final results
[22:50:28] the abstract is a bit misleading
[23:02:17] there are still projects I have to go fetch with "svn" if I want to dig through their history properly, and it's nearly 2019 :/
[23:02:33] (github mirrors don't really cut it when they don't have appropriate tag metadata, etc)
[23:02:56] there's even a couple still on CVS
[23:03:10] notably: http://software.schmorp.de/pkg/libev.html
[23:03:13] Perl's Net::DNS is still using SVN :)
[23:03:20] and RT
[23:03:21] :(
[23:03:31] libev is otherwise a very fancy piece of code that's widely used
[23:03:57] we're way past the point where git should be the default thing for almost everything :P
[23:04:14] rotfl
[23:04:31] Krenair: Net::DNS is actually what I was checking out with svn just now, prompting my comment
[23:04:42] heh
[23:04:59] I have to go checking out bisections of all their recent version history and re-running tests against it to find out what I'm compatible with
[23:05:21] because it would be *way* too much to ask if they bothered documenting breaking changes to APIs in some release notes somewhere :P
[23:06:34] I got into Wikimedia tech just as the Git migration happened
[23:06:56] so never really had exposure to SVN
[23:07:19] ah
[23:07:33] had to fiddle around with it to find what I wanted in the net dns repo
[23:07:42] I had exposure to "sccs" and just about every widespread thing since heh
[23:07:59] they've all been painful transitions. If I can manage, so can everyone else :P
[23:08:58] I spent some significant time (as in, projects I really worked on), on sccs, rcs, svn, svk, then git. others, I just had to do quick checkouts and looking around or whatever.
[23:09:04] wow
[23:10:39] anyways, I'm ending this ramble with:
[23:10:46] All software is horrible :)
[23:10:50] :D
[23:12:06] reading parts of this made me search phabricator for "bzr"
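"Checking out bisections of all their recent version history and re-running tests" is a binary search over releases. Below is a minimal Python sketch, assuming the breakage is monotonic (once a release fails the test suite, every later one does too) and that `passes` is a hypothetical callback that checks out a given version and runs the tests. The version labels in the usage lines are made up.

```python
from typing import Callable, Optional, Sequence


def first_incompatible(versions: Sequence[str],
                       passes: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest version for which passes() is False, or None.

    Assumes versions are in release order and failures are monotonic:
    once a version fails, all later versions fail too. Calls passes()
    about log2(len(versions)) times instead of once per version.
    """
    lo, hi = 0, len(versions)  # invariant: first failing index is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(versions[mid]):
            lo = mid + 1  # still compatible: the break comes later
        else:
            hi = mid      # broken here: the break is at mid or earlier
    return versions[lo] if lo < len(versions) else None


releases = [f"1.{n:02d}" for n in range(10, 20)]  # made-up version labels
tested = []


def fake_passes(version: str) -> bool:
    """Stand-in for 'check out this version and run the test suite'."""
    tested.append(version)
    return version < "1.16"


print(first_incompatible(releases, fake_passes), len(tested))  # 1.16 4
```

If the monotonic assumption does not hold (a break that is later reverted, say), the search can land on any failing release rather than the first one, so testing every release is the only safe option in that case.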