[03:29:04] HTTPS, Traffic, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3746533 (Arlolra) Parsoid seems to be configured correctly since https://wiki.dronelaws.io:8000/localhost/v3/page/html/Main_Page/2 renders just fin...
[07:28:35] HTTPS, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3746663 (ema)
[07:29:46] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875#3746666 (ema) p:Triage>Normal
[07:30:01] Traffic, Operations: LVS IPv6 IPs should all be recorded in DNS - https://phabricator.wikimedia.org/T179026#3746667 (ema) p:Triage>Normal
[07:35:35] Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3746668 (ema) p:Triage>Normal
[07:38:30] Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3696395 (ema) Anything else left to do here? Is the problem solved for you @Chicocvenancio?
[07:40:05] Traffic, MediaWiki-Authentication-and-authorization, Operations, Security-Core: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604#3746674 (ema) p:Triage>Normal
[12:10:53] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3747466 (Ladsgroup) This is not my decision to make, our PM is not around, I'll ask her when she's back
[12:20:18] TIL you can have numa nodes with memory only and no CPU, e.g. https://www.intel.com/content/www/us/en/solid-state-drives/optane-ssd-dc-p4800x-mdt-brief.html
[12:29:21] yeah it's going to be fascinating to see how that plays out on price/perf and architectures
[12:29:31] (the optane stuff)
[12:33:52] all text/upload cp nodes upgraded to 4.9.51 and rebooted, only a bunch of spares left now :)
[12:35:06] those are our future ATS testbed nodes :)
[12:36:05] * ema looks at them with a quick glance full of hope
[14:21:24] bblack: any more comments re: https://gerrit.wikimedia.org/r/#/c/388064/ ?
[14:24:18] moritzm: all cache and LVS hosts upgraded and rebooted \o/
[14:25:39] ema: awesome! shall we also reboot the dnsauth servers for completeness?
[14:30:24] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748043 (BBlack) I haven't had time to analyze it deeply/manually, but I managed to capture/filter down tcpdump verbose/stamped outputs for exactly one...
[14:32:08] ema: yes, maybe, stalling...
[14:32:46] moritzm: I don't think we've had any recent cases where authdns reboots went smoothly heh. We can do it, but carefully.
[14:35:37] yeah, I remember that we had a few hiccups related to name resolution on the host itself not being immediately available during reboot.
we should probably still bite the bullet and move those to 4.9, though: dnsauth are among the few jessie hosts still running 4.4
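For reference, a minimal sketch of the kind of tcpdump capture bblack describes at 14:30: verbose, timestamped output filtered down to ICMP destination-unreachable packets. Illustrative only, not the exact command from the ticket; eth0 is an assumed interface name.

    # Print verbose, timestamped decodes of ICMP dest-unreachable packets
    # arriving at a cache host (eth0 is an assumption; icmptype/icmp-unreach
    # are standard pcap-filter keywords).
    tcpdump -n -i eth0 -vvv -tttt 'icmp[icmptype] == icmp-unreach'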
[15:35:37] Traffic, Operations, ops-ulsfo, Patch-For-Review: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3748216 (RobH)
[15:57:11] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748250 (BBlack) Annotating some basic thoughts on the above (keep in mind with various kinds of offload in play, packetization/MTU/checksum will often...
[16:03:20] Traffic, Operations, ops-ulsfo: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3716775 (RobH) These are idling as role spare with the OS installed, ready for service.
[16:03:28] XioNoX: check out the last ticket update above, does any of that ring any bells for you on the ICMP thing?
[16:05:04] will do
[17:05:42] bblack: https://github.com/Exa-Networks/exabgp - https://packages.debian.org/fr/stretch/exabgp
[17:08:23] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748444 (ayounsi) Another data point, https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&from=1507633459013&to=1507680000000...
[17:39:09] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748534 (BBlack) I'm pretty sure all of the TCP application-level data flows match up roughly with the expected sequence of TLS HANDSHAKE -> CLIENT HTTP...
[17:45:03] bblack: re your comment, how long is the idle period on the cp?
[17:47:54] XioNoX: you mean nginx's configured timeout to close an idle client conn?
[17:48:00] bblack: yeah
[17:49:34] our keepalive timeout is 60s
[17:49:42] looking to see if some others could be in play
[17:50:21] and lingering timeout is 5s (if it's trying to do a lingering close)
[17:52:10] bblack: how did we come up with 60s, is that after tests? or recommended default, or?
[17:52:18] reset_timedout_connection is another nginx param that might be relevant (default is off; if "on" then it explicitly uses RST on timed-out connections to avoid FIN_WAIT_1 ... which is the state those ICMP'd connections end up in for a bit...)
[17:52:56] oh interesting
[17:53:03] XioNoX: not sure. we actually have it configured in two different places: 65s for the whole daemon, and then a more-specific 60s setting for our unified proxy. would have to dig in git blame to know.
[17:54:14] the one in nginx.conf predates git history:
[17:54:17] de059228933 templates/nginx/nginx.conf.erb (Ryan Lane 2011-09-07 22:28:35 +0000 65) keepalive_timeout 65;
[17:54:31] de059228 being the initial import into git from svn
[17:55:29] 755cbeddcc7 modules/protoproxy/templates/localssl.erb (Mark Bergsma 2013-07-24 16:55:45 +0200 37) keepalive_timeout 60;
[17:55:53] ^ that's the 60s value in localssl.conf; it's just the original value chosen when mark first wrote that config file to initially set this up.
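A minimal sketch of the nginx directives under discussion, using the values quoted in the log above; this is illustrative, not the actual puppetized config:

    http {
        keepalive_timeout 65;      # the whole-daemon value from nginx.conf.erb
        lingering_timeout 5s;      # nginx default; the 5s quoted at 17:50

        server {
            keepalive_timeout 60;  # the more-specific localssl.erb value
            # Off by default; "on" would reset timed-out connections with RST
            # instead of walking through FIN_WAIT_1 (the option floated at
            # 17:52):
            # reset_timedout_connection on;
        }
    }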
[17:56:16] either way, the time before this FIN+ACK shows up is far less than keepalive_timeout
[17:58:02] on the experimental front I'm going to try the close_notify thing a little later
[17:58:21] on the research front, it would be interesting to dig/refresh a bit on whether FIN+ACK is a valid way to start closing a connection, etc
[17:58:38] (or if that almost certainly implies either a broken client or an injected FIN)
[17:59:55] for all we know this could just be broken client-side software that only exists in certain models of mobile phones with certain software revs, which is only common across a range of carriers in the EU, or something like that.
[18:00:26] (would be easy to verify that by correlating the destunreach client IPs to varnishlog of their UA strings)
[18:01:20] yeah, or aggregate the source ASNs using netflow
[18:01:43] default nginx value for that is 75s btw
[18:07:48] bblack: from https://www.ietf.org/rfc/rfc793.txt fin,ack seems legit. `FIN-WAIT-1 --> --> CLOSE-WAIT`
[18:08:45] yeah but FIN-WAIT-1 is a state you only reach after someone sent a FIN already
[18:09:40] also, in the trace there's a 30s delay between the last received packet (which is an ack) and the fin-ack
[18:10:00] you wouldn't expect the implementation to wait that long to piggyback the ack :)
[18:10:29] and there was no need to piggyback an ack. it already ack'd up to that point.
[18:10:35] that too
[18:10:48] (I honestly don't remember if it's legit to ACK the last bit of the peer's data + send an initial FIN together. maybe?)
[18:11:51] but so far my best theories for that initial FIN+ACK packet are that something injected a FIN from us first, or that the FIN+ACK back to us is injected or altered, or there's a code bug in the client's stuff.
[18:12:33] FWIW there was no obviously-persistent broken mobile UA on the unreachables
[18:12:45] but what was interesting is they're almost all Firefox/Chrome, and no MSIE
[18:12:56] which could again be a pointer at different TLS close_notify behaviors and such
[18:13:46] if it was exactly 30s it would make it clearer that it's hitting a timeout on the client side (or middlebox), but 34.6s is quite random
[18:15:29] eh
[18:15:38] could be some randomization thing. 30s +/- rand
[18:16:02] if we buy the FF/Chrome-vs-MSIE theory though, then we still have to ask why it's only esams clients heh
[18:16:07] the diagram on page 2 of http://www.cs.northwestern.edu/~agupta/cs340/project2/TCPIP_State_Transition_Diagram.pdf is useful, at least to give a clear image
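A hedged sketch of the varnishlog correlation floated at 18:00: for a client IP seen in the dest-unreach captures, list the User-Agent strings it sent. The X-Client-IP header name and the IP are placeholders/assumptions; the -q query syntax is the varnish 4+ VSL query language.

    # Pull request headers for transactions from one (placeholder) client IP,
    # then filter down to User-Agent; X-Client-IP is an assumed header name.
    varnishlog -g request -i ReqHeader \
        -q 'ReqHeader:X-Client-IP eq "192.0.2.1"' | grep -i 'User-Agent'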
[18:17:25] does this happen over the weekend too?
[18:17:55] the pattern drops off during weekends
[18:17:59] ha
[18:18:07] https://grafana.wikimedia.org/dashboard/db/tcp-fast-open?panelId=3&fullscreen&orgId=1&from=1509659003733&to=1510240262964
[18:18:09] (not completely, but much lower)
[18:18:33] that's failed TFO, mostly affecting esams, dropping during the weekend
[18:19:01] yeah but in this case there was no TFO in my capture (the initial handshake was dataless), and the dstunreach happens after normal data exchange + timeout
[18:19:29] right but it might point in the direction of some shitty middleboxes used by people at work Mon-Fri
[18:19:35] yeah
[18:19:51] a lot of shitty TLS cipher choices known to be made by middleboxes have that kind of pattern, too
[18:20:07] (and shitty TLS cipher choices made mostly by work-deployed outdated client machines)
[18:21:00] * ema runs out for dinner
[18:21:00] o/
[18:21:38] I'm trying the do_wait_shutdown experiment on cp3030
[18:22:02] !log cp3030: puppet-disabled + manual nginx ssl_do_wait_shutdown config
[18:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:29] is it for the TCP RSTs?
[18:28:30] nice!
[18:28:37] curious about the results
[18:28:57] there's a bump in all the cp3030 TCP stats from that, but they seem to normalize back to about the same quickly
[18:29:23] still 2x old-config nginx workers running though, waiting for those last ones to drain and die off
[18:31:26] icmp unreach, RST, retrans... they all seem to normalize to roughly the same values as before with do_wait_shutdown
[18:34:49] a bit unrelated, but shouldn't we set the keep-alive timeout value in the header `< Connection: keep-alive`? cf. http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_timeout
[18:36:05] maybe
[18:36:45] for clients that honor it, if we set it a bit shorter than our own keepalive timeout, it might cause more client-close instead of server-close, which is a good thing for TIME_WAITs and such
[18:36:49] it might help if it's a browser misbehaving, but not if it's middlebox-related
[18:37:11] but also, setting our keepalive timeout to be commonly longer than most browsers' would have the same effect
[18:39:03] firefox has a default of 115s
[18:42:20] the tradeoff there is, of course: the longer the total avg keepalive time, the more connection parallelism we have opened up into at least the front of nginx
[18:42:42] which is why just raising it to 3 minutes or whatever to get it higher than browser timeouts probably isn't a great idea
[18:43:01] but we could leave it like it is at 60 and send browsers the header for something like 55 to get them to close first.
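The 18:43 idea, sketched as config: keep the server-side idle timeout at 60s but advertise 55s to clients so well-behaved browsers close first. nginx's keepalive_timeout takes an optional second argument that emits a "Keep-Alive: timeout=55" response header (per the nginx docs URL cited at 18:34). As the following messages note, the header's spec'd semantics turned out to be a minimum rather than a maximum.

    # Sketch only: 60s server-side idle timeout, with 55s advertised to
    # clients via "Keep-Alive: timeout=55" (the optional second argument).
    keepalive_timeout 60 55;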
[18:43:17] http://gabenell.blogspot.com/2010/11/connection-keep-alive-timeouts-for.html (from 2010)
[18:44:05] I wonder which modern browsers honor the server-sent keepalive header, too
[18:45:34] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive "Indicating the minimum amount of time an idle connection has to be kept opened (in seconds)"
[18:45:45] *minimum*
[18:45:55] "Also, Connection and Keep-Alive are ignored in HTTP/2; "
[18:45:57] damn
[18:46:01] heh
[18:46:26] anyways, I'm gonna try a full daemon "upgrade"-style restart, in case my do_ssl_wait_shutdown patch has some quirk that doesn't enable it correctly on a mere reload
[18:46:59] !log cp3030 - round 2 of ssl_do_wait_shutdown test
[18:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:49] https://tools.ietf.org/id/draft-thomson-hybi-http-timeout-01.html#rfc.section.2.1
[18:50:26] yeah seems to do the opposite of what I would've hoped
[18:53:11] reconfirmed in case I was remembering wrong: it is the side that first initiates active-close (sends the first FIN) that suffers the TIME_WAIT
[18:53:26] so in the HTTP(S) case ideally we do want the client to close first rather than the server
[18:54:36] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748769 (RobH) Swapped mainboard yesterday, but during the installer today got the following: [ 457.538179] BUG: soft lockup - CPU#19 stuck for 23s! [apt-get:38504] │ [ 493.53...
[18:59:52] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748777 (BBlack) So, looking at all the crash messages we've managed to record since the beginning of this ticket, the CPU# indicated has had a history of: 41, 23, 47, 47, 1, 19 . T...
[19:00:43] !log cp3030 - end experimentation, puppetizing back to normal config
[19:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:46] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748879 (RobH) I've updated Dell, and they want me to move it to socket 1 and repeat. I'm asking them to just send me a replacement CPU, we'll see what happens.
[20:29:04] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3749116 (RobH) They agreed and are dispatching a replacement part. I'll likely go ahead and do the proposed swap with the existing, but this will eliminate my having to make two tri...
[23:23:41] Traffic, netops, Operations, Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (ayounsi)
[23:46:04] Traffic, netops, Cloud-VPS, Operations: Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (ayounsi)
[23:58:09] Traffic, netops, Cloud-VPS, Operations: Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (madhuvishy) Noting here that proprietary software is not usually installed on WMCS environments per https://wikitech.wikimedia.org/wiki/Wikitech...
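A quick way to eyeball the 18:53 point on a cache host: count sockets sitting in TIME_WAIT, the state paid for by whichever side sent the first FIN. A sketch using ss from iproute2; the tail skips ss's header line.

    # Count local sockets currently in TIME_WAIT (TCP only, numeric output).
    ss -tn state time-wait | tail -n +2 | wc -l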