[00:17:49] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045220 (10RobH) >>! In T188045#4045150, @Platonides wrote: > Well, if the server itself is needed, it will be doing its work with a different IP address than the one of wdqs1004, sinc...
[00:59:15] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045325 (10Smalyshev) Thanks @RobH! Created {T189548} for loading the data back. @Gehel if you don't see anything else wrong then this one can be resolved.
[02:26:19] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4045378 (10ayounsi) @Cmjohnson Can you cable lvs1016 as listed bellow? | Hostname | Hostport | Switchport | note | |---|---|---|---| | lvs1016 | eth0 | asw2-d:xe-7/0/17 | |...
[03:01:40] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4045408 (10ayounsi) p:05Triage>03Normal
[04:58:27] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4045522 (10Prtksxna)
[09:14:58] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045800 (10Gehel) 05Open>03Resolved Yay! Thanks @faidon for finding the issue! @RobH / @Papaul : the symptoms of wdqs2006 mgmt interface look vaguely similar (T189318). Any chance...
[09:25:45] "morning" bblack
[09:27:37] a few hours ago cp3040 looked a little bit stressed regarding backend connections, but it recovered by itself
[09:27:40] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=16&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1520919051178&to=1520932437213
[09:45:47] yeah 3043 too
[09:45:51] indeed
[09:45:56] I followed both of them this morning
[09:47:05] neither of them are in the ~6d+ uptime range, though
[09:56:22] I'm not completely sure what to make of the logstash side of things
[09:57:00] I think we can make some modest improvements, but really we need better indexing/searching there. I'm not logstash-savvy enough yet to see how/where/why to do that.
[09:57:30] (to make some of the fields we're sending as "extra" data indexable and such)
[10:02:48] same here.. first time using logstash on a prod environment
[10:02:56] we used to burn money into sumologic
[10:19:35] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4045911 (10ayounsi) a:03ayounsi Only looking at the asw ports with link up for now, using LibreNMS: @Papaul If the ports with LLDP neighbors are correct, I can mass add th...
[10:32:12] I noticed that dns5002 is currently depooled, I'm about to reboot it for the spectre kernel update, is it intentionally depooled (hardware/software reasons) or was it maybe overlooked?
then I would repool it when the reboot is complete
[10:36:02] heh, overlooked, never repooled after hw issues were fixed
[10:36:23] perfect, I'll repool when the server is rebooted
[10:36:37] either way, breaking eqsin-specific things is still reasonably-fair-game for another week or two :)
[10:41:27] <_joe_> we're using elasticsearch, in fact, and we decided to store our data in gelf format there
[10:41:34] <_joe_> we might want to change that in the future
[11:38:23] hmmm, already some new interesting logstash entries! time-waitinglist: 166.960184
[11:38:34] that was for Special:Random on an eqiad backend...
[11:39:19] no, frontend heh
[11:40:15] you'd think that'd be a hfm/p kind of case as an uncacheable output...
[11:40:28] (but I bet the waitinglist time has something to do with coalesce)
[11:41:19] 10Traffic, 10DNS, 10Operations, 10Release-Engineering-Team, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4046204 (10jhsoby) >>! In T188776#4021634, @Varnent wrote: >>>! In T188776#4021611, @Bawolff wrote: >> That sa...
[12:01:37] 10Traffic, 10netops, 10Operations, 10ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4046290 (10faidon) a:05faidon>03ayounsi > We're happy to announce that your RIPE Atlas anchor is functioning properly and is now connected to the RIPE Atlas network. > > You can see...
[12:15:15] 10Traffic, 10netops, 10Operations, 10ops-eqsin, 10Patch-For-Review: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4046328 (10BBlack)
[12:15:53] 10Traffic, 10netops, 10Operations, 10ops-eqsin, 10Patch-For-Review: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3711364 (10BBlack) 05Open>03Resolved >>! In T179042#4046290, @faidon wrote: > Only thing left is monitoring, right? I think so AFAIK, and done above, showing...
[12:23:02] so, tl;dr on things that have changed in the past ~24h with the varnish stuff:
[12:24:08] 1) Minor vcl fixups for possible hfm storage exacerbation in https://gerrit.wikimedia.org/r/#/c/418920/ . Was the only such case I could find, but I don't expect this codepath gets much traffic anyways (if it does, that's probably an issue in its own right)
[12:26:06] 2) Weekly restarts are now "semiweekly" (2x/week). One of the two times is identical to the existing weekly time, the other is 3.5 days offset. For hosts that were already past the 3.5-day-uptime mark when I merged, they won't restart for the first time post-patch until they reach 7d uptime. But for the rest (and all hosts after the first 3.5 days), varnishd backend lifetimes should cap at 3.5
[12:26:12] d.
[12:27:17] 3) varnishslowlog catches more cases and has more info, which will be interesting in the next investigation (or to stare at ahead of time and try to pick out hints of problematic cases even in good times, like I am now)
[12:28:03] currently one of the strangest puzzles for me there is that Special:Random shows up with some frequency as having waitinglist delays (coalesce delays).
[12:28:35] generally the apache backend-timing in these cases is short, but the waitinglist timer in varnish is huge (e.g. 1-5 *minutes*).
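As context for the hit-for-pass discussion that follows, here is a minimal VCL sketch of the pattern being described. This is not the production VCL from operations/puppet: the uncacheability test and the surrounding structure are simplified assumptions for illustration, while the grace/keep/X-CDIS/return(pass(601s)) lines mirror the snippet quoted later in this log.

    sub vcl_backend_response {
        # Simplified stand-in for the real uncacheability checks; the actual
        # production condition is more involved than this header match.
        if (beresp.http.Cache-Control ~ "(?i)(private|no-cache|no-store)") {
            # Intent as described below: keep a ~10min hit-for-pass object so
            # that a hot uncacheable URL (e.g. Special:Random) bypasses request
            # coalescing, with a short grace window meant to keep traffic
            # flowing coalesce-free while the hfp object is refreshed.
            set beresp.grace = 31s;
            set beresp.keep = 0s;
            set beresp.http.X-CDIS = "pass";
            # Varnish 5.x: creates the hit-for-pass object with a 601s TTL.
            # Note the observation below that the grace set above appears to be
            # discarded at this point ("TTL HFP 601 0 0" in varnishlog).
            return (pass(601s));
        }
    }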
[12:29:55] MW's output headers for Special:Random indicate uncacheability properly in multiple ways, and our VCL should be converting these into hit-for-pass objects with ttl=601s + grace=31s IIUC
[12:30:24] hit-for-pass should normally prevent any kind of coalesce behavior which would put spammy special:random traffic into waitinglists
[12:31:29] and the intent of the ~10m ttl + ~30s grace is that in the case that special:random is hot, even when the hfp object expires, the traffic should keep flowing coalesce-free during the grace period, while also asynchronously refreshing from the backend to get a fresher hfp object for replacement.
[12:32:18] so... clearly something is amiss here. quite possibly, our understanding of hfp and grace for this case, or simply VCL code bugs on our end.
[13:19:23] 10netops, 10Operations: Config discrepencies on network devices - https://phabricator.wikimedia.org/T189588#4046514 (10ayounsi) p:05Triage>03Low
[14:29:41] running puppet on lvs1007 spews some errors, known issue?
[14:29:44] Notice: /Stage[main]/Main/Node[__node_regexp__lvs10078910.eqiad.wmnet]/Lvs::Interface_tweaks[eth0]/Interface::Rps[eth0]/Exec[ethtool_rss_combined_channels_eth0]/returns:
[14:29:45] Cannot set device channel parameters: Invalid argument
[14:29:47] Error: ethtool -L eth0 combined 16 returned 1 instead of one of [0]
[14:30:40] indeed
[14:30:54] bblack mentioned that a few days ago
[14:31:12] ok, just wanted to make sure it wasn't a crazy side effect of https://gerrit.wikimedia.org/r/419159
[15:19:51] yeah I broke the lvs1007 ethernet stuff, they're due to move to the spare role and whatnot and lose all the special lvs puppetization, which will undo that.
[15:37:50] fascinatingly, it seems all of the queries to Special:Random I end up seeing in slow logs are coming from AWS or GCE instances.
[15:38:36] I don't yet know if that's because AWS and/or GCE are statistically-likely consumers of S:R in general, or if there's something special about these particular requests where the client induces them to be slow somehow...
[15:40:52] I guess we have our fair amount of crawlers living in AWS/CRE
[15:44:12] *GCE
[15:48:55] right
[15:49:13] S:R seems like an odd target for something doing useful programmatic work though
[15:49:28] unless they're trying to get a random sampling for some research, but relying on us for the randomness, I donno
[15:50:39] it could also be from proxy services hosted there, too
[15:50:47] (e.g. google's AMP stuff)
[15:57:29] back on S:R and hit-for-pass in general:
[15:57:34] -- TTL VCL 0 0 0 1520956353
[15:57:34] -- TTL VCL 0 31 0 1520956353
[15:57:34] -- TTL VCL 0 31 0 1520956353
[15:57:34] -- BerespHeader X-CDIS: pass
[15:57:34] -- TTL HFP 601 0 0 1520956353
[15:58:02] ^ I observe this on a random live one.
I think this implies that the 31s grace setting is reset to zero on doing the 601s return(pass)
[15:58:41] our code that corresponds with those log lines looks like:
[15:58:45] set beresp.grace = 31s;
[15:58:45] set beresp.keep = 0s;
[15:58:46] set beresp.http.X-CDIS = "pass";
[15:58:54] return(pass(601s));
[15:59:35] the whole topic of how the internal HFP objects work has always been confusing, so I'll have to re-read some code
[15:59:44] ===== NODE GROUP =====
[15:59:44] (24) lvs[2001-2006].codfw.wmnet,lvs[1007,1010].eqiad.wmnet,lvs[5001-5003].eqsin.wmnet,lvs[3001-3004].esams.wmnet,lvs[4005-4007].ulsfo.wmnet,lvs[1001-1006].wikimedia.org
[15:59:47] ----- OUTPUT of 'apt-cache policy pybal |grep Inst' -----
[15:59:47] \o/
[15:59:49] Installed: 1.15.2
[16:00:19] because when you "hit" on a hit-for-pass object, you still fetch from the backend and usually encounter the same conditions (uncacheable response headers), which triggers your same code that creates an hfp object.
[16:00:41] naively you'd assume this commonly creates a new replacement hfp object on each fetch through hitting on one
[16:01:08] but I think once you're in the context of having hit on one, you're attached to some kind of ephemeral request object and a new persistent hfp that other requests can see doesn't actually get created.
[16:02:00] which leaves us in the position of still needing a grace-window, so that we don't suffer a coalesce pileup on a hot URL that's usually hit-for-pass, each time the hfp object expires.
[16:02:41] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151#3682454 (10Vgutierrez) Pybal 1.15.2 has been successfully deployed in our LVSs, the UDP monitor is now available.
[16:03:00] vgutierrez: awesome work :)
[16:04:47] bblack: yey.. but you'll troll me forever with the ghost BGP issue | journalctl eated my log!
[16:05:40] actually I need a t-shirt with that
[16:06:33] :)
[16:14:23] now we can push to production the BGP established sessions check: https://gerrit.wikimedia.org/r/c/415260
[16:16:38] btw, talking about that.. godog, can we do something to help getting https://gerrit.wikimedia.org/r/c/413142/ merged? :)
[16:21:12] vgutierrez: good question! is it blocking you now isn't it? I've been mostly looking at the puppet goal heh
[16:21:59] nope
[16:22:13] I just remembered that you said that it was going to get merged a few weeks ago
[16:22:23] so I got curious
[16:23:03] vgutierrez: ah ok, yeah it is on my radar but behind the puppet goal heh
[16:23:10] ack
[16:23:15] too many "heh", canadian influence I guess
[16:25:44] I won't get picky with anybody's English till mine gets better O:)
[16:30:34] to continue rambling on these Special:Random reqs. Another thing I've noticed: MW's S:R response is a 302 with a Location header and Content-length:0 (all sane)...
[16:30:54] but the varnish output is gzipped with a content-length:20 (presumably just a gzip header and no data)
[16:31:22] the CE header is right, too, but it seems crazy/pointless that varnish would re-encode a zero-byte response to add gzipping :P
[16:32:31] yup.. we could save some CPU cycles there
[16:34:39] at least
[16:35:04] it's so senseless that it might represent a violation of too
[16:58:49] bblack: what would be better there? avoid compression in 3XX requests or in content-length==0 requests?
[16:59:07] CL:0 I think. I'm making a patch, I may cut off a little higher than zero.
[16:59:30] well..
<= 20 at least
[16:59:49] if you need to transfer something lighter than the gzip headers..
[17:02:22] 10netops, 10Discovery, 10Operations, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4047252 (10faidon) So post-mortem, I think there are 4 different things here: - T189519: Audit switch ports/descriptions/enable (and do this on an ongoing basis) - T189522: Detect IP a...
[17:03:59] right but also if gzip wasn't able to remove 20 bytes via compression it's a net loss anyways, seems probably common for say a ~100 byte file even.
[17:04:29] and then there's the fact that at sub-packet sizes you're not really saving a lot of transfer. and aside from server resources the client has to expend cpu on decompressing the tiny object as well.
[17:08:26] vgutierrez: seem reasonable? https://gerrit.wikimedia.org/r/#/c/419228/
[17:10:43] yup
[17:14:01] oh of course I failed at integer conversion in VCL of course, but minor details! :)
[17:14:16] bblack: some say that google recommends something between 150 a 1000 bytes
[17:14:21] 150-1000
[17:14:36] it could be nice getting some metrics regarding this change
[17:15:18] yeah, throw it in the bin of "once we get past 100 other things that are more important and everything is smooth and sane and we have free time, we could experiment on tuning this value" :)
[17:15:35] yup
[17:18:14] also known as the write/append only bin
[17:18:25] bblack: that < 860 is right?
[17:18:35] no, it's wrong :)
[17:19:10] I think PS4 is good
[17:20:39] makes sense
[17:21:37] regarding traffic reduction, maybe investing some time on https://phabricator.wikimedia.org/T137979 would be better than optimizing that 860 bytes figure
[17:48:04] * volans about to merge a DNS patch https://gerrit.wikimedia.org/r/c/419193/
[17:48:26] quickly asking JIC: is there anything ongoing I should hold for, or has the authdns-update procedure changed recently?
[17:50:21] bblack: ^^
[17:53:53] !log installing ncurses updates from stretch point release
[17:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:18] moritzm: you probably wanted -ops but it works here too :D
[17:55:02] oops :-)
[17:56:51] volans: no, nothing unusual going on
[17:57:11] ack, thanks
[18:29:21] blerg, I have a logstash entry at 18:00, well after the full deploy of the anti-gzipping patch, which still shows a gzipped output of S:R's zero content-length.
[18:30:37] I could maybe make the case that the HFP object that already exists stores the do_gzip=true for a while after that patch goes live, but I don't think our HFPs' TTLs should've been long enough to still be in play, either.
[18:32:47] oh I get it now
[18:32:58] my curl-based tests directly to MW weren't using AE:gzip
[18:33:07] it's MW that's gzipping the zero-length output :P
[18:33:38] (well MW or its apache)
[18:33:42] (or hhvm)
[18:35:04] I thought once way way back, we ended up disabling gzip on the MW apaches because it triggered various other edge cases.
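As a rough illustration of the minimum-size gzip cutoff discussed above (https://gerrit.wikimedia.org/r/#/c/419228/), here is a minimal VCL sketch. It is not the merged change: the content-type test, the exact threshold handling, and the std.integer() fallback behaviour are assumptions made for the example.

    import std;

    sub vcl_backend_response {
        # Only ask varnish to gzip compressible responses above a minimum size:
        # gzip adds ~20 bytes of header/trailer overhead, and sub-packet objects
        # save little transfer while still costing CPU on both ends.
        if (beresp.http.Content-Type ~ "^(text/|application/json)"
            && !beresp.http.Content-Encoding
            && std.integer(beresp.http.Content-Length, 861) >= 860) {
            # The fallback of 861 means "gzip anyway" when Content-Length is
            # absent (e.g. chunked/streamed responses of unknown size).
            set beresp.do_gzip = true;
        }
    }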
[18:38:55] right: https://gerrit.wikimedia.org/r/c/314317/3/wmf-config/CommonSettings.php
[18:40:53] yeah but that's probably an hhvm-level thing with wgGzip
[18:41:01] err, wgUseGzip
[18:41:13] our apache config does load mod_deflate and compresses outputs
[18:41:24] (and has for a long time, at least before 2014)
[18:43:00] ah it appears in MW core code, wgUseGzip only affects their "File Cache" in includes/cache/FileCacheBase.php
[18:43:50] (to pick up .gz files stored on disk as alternatives to non-.gz)
[18:44:49] well anyways, I think the 860-byte cutoff in varnish is still rational. it's just unfortunate that in this case the gzipping is happening elsewhere.
[18:46:08] I'll make a task to followup about this later
[18:48:23] or maybe I'll re-read the existing https://phabricator.wikimedia.org/T125938 first :)
[18:49:19] lol
[18:49:22] https://phabricator.wikimedia.org/T125938#2694407
[18:49:48] ^ this is where I note the wgUseGzip change didn't do what we wanted, which was disable gzipped output entirely wherever it's happening at the applayer
[18:49:58] no further substantial followup since late 2016 :P
[18:50:27] the rationale at the time, a little above that, was:
[18:50:31] "Given the number of issues we've seen over time with hhvm/php gzip output under various error conditions causing big confusion (by hiding interesting 500s behind 503s), I really think we should disable hhvm gzip output entirely. The cache layers will gzip gzippable content for the cache storage and the user regardless, without inducing this buggy confusion."
[18:51:34] also, if apache gzips something it does it in a streaming/chunked fashion, which means even if at the php-layer MW wanted to give an explicit Content-Length (which we like), apache's gzipping will remove it in common cases I think.
[18:51:47] (except very tiny ones like this S:R example)
[19:05:23] 10Traffic, 10Operations, 10Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4047636 (10BBlack) So, recapping this ticket that's been stale for quite a while: * We've had past applayer bugs with gzipped outputs in e...
[19:23:30] 10Traffic, 10Operations, 10Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#4047699 (10BBlack) 05Open>03Resolved
[23:13:48] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4048335 (10ayounsi) 05Open>03stalled
[23:21:50] bblack: I've got a DNS question for you. In our OpenStack deploy we have a script that sets up PTR records for the public ips that projects use. One of the ips is the HTTP reverse proxy that handles *.wmflabs.org. Today there are 409 .wmflabs.org forward records pointing to that IP. Is it reasonable from a protocol perspective to set up that many PTR records for an IP?
[23:23:29] what would be the point in having them?
[23:24:00] I would place a single PTR to that ip that points to its nature
[23:24:05] bd808: yeah generally PTR records are more-or-less for debugging and information rather than function
[23:24:29] bd808: you can do a single one for its functional name like revproxy.wmflabs.org or something
[23:25:07] it looks like designate currently responds with ~600 reverse mappings for 208.80.155.156, many of which are out of date because we changed the config limits
[23:25:43] *nod* I'll see if I can figure out a clean way to do that in the maintenance script
[23:25:49] you will eventually probably have a problem with so many
[23:26:13] because they won't fit in an un-truncated UDP response, and any client which does look them up for informational purposes will have to retry over TCP and slow things down, etc
[23:26:19] that would also be a nice target for dns amplification attacks
[23:26:34] WP:BEANS Platonides
[23:26:41] hehe
[23:27:05] that too, depending on the behavior of the server, but it's really on the dns server to have a reasonable amp limit there (maximum edns0 response sizes, and whether it truncates for partial answers or completely)
[23:27:59] thanks for the sanity check bblack
[23:28:04] (the latter part meaning if 609 PTRs don't fit in the edns0 response size, does it cram in as many as it can when returning the "truncated" bit, or just return a minimal empty response with the truncated bit)
[23:29:03] * bblack out, will peek back in a few hours