[03:29:04] HTTPS, Traffic, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3746533 (Arlolra) Parsoid seems to be configured correctly since https://wiki.dronelaws.io:8000/localhost/v3/page/html/Main_Page/2 renders just fin...
[07:28:35] HTTPS, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3746663 (ema)
[07:29:46] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 3 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875#3746666 (ema) p:Triage>Normal
[07:30:01] Traffic, Operations: LVS IPv6 IPs should all be recorded in DNS - https://phabricator.wikimedia.org/T179026#3746667 (ema) p:Triage>Normal
[07:35:35] Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3746668 (ema) p:Triage>Normal
[07:38:30] Traffic, Operations, PAWS, Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3696395 (ema) Anything else left to do here? Is the problem solved for you @Chicocvenancio?
[07:40:05] Traffic, MediaWiki-Authentication-and-authorization, Operations, Security-Core: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604#3746674 (ema) p:Triage>Normal
[12:10:53] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3747466 (Ladsgroup) This is not my decision to make, our PM is not around, I'll ask her when she's back
[12:20:18] TIL you can have numa nodes with memory only and no CPU, e.g. https://www.intel.com/content/www/us/en/solid-state-drives/optane-ssd-dc-p4800x-mdt-brief.html
[12:29:21] yeah it's going to be fascinating to see how that plays out on price/perf and architectures
[12:29:31] (the optane stuff)
[12:33:52] all text/upload cp nodes upgraded to 4.9.51 and rebooted, only a bunch of spares left now :)
[12:35:06] those are our future ATS testbed nodes :)
[12:36:05] * ema looks at them with a quick glance full of hope
[14:21:24] bblack: any more comments re: https://gerrit.wikimedia.org/r/#/c/388064/ ?
[14:24:18] moritzm: all cache and LVS hosts upgraded and rebooted \o/
[14:25:39] ema: awesome! shall we also reboot the dnsauth servers for completeness?
[14:30:24] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748043 (BBlack) I haven't had time to analyze it deeply/manually, but I managed to capture/filter down tcpdump verbose/stamped outputs for exactly one...
[14:32:08] ema: yes, maybe, stalling...
[14:32:46] moritzm: I don't think we've had any recent cases where authdns reboots went smoothly heh. We can do it, but carefully.
[14:35:37] yeah, I remember that we had a few hiccups related to name resolution on the host itself not being immediately available during reboot.
we should probably still bite the bullet and move those to 4.9, though: dnsauth are among the few jessie hosts still running 4.4
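For reference, a minimal sketch of the kind of tcpdump capture bblack describes at 14:30: verbose, timestamped output filtered down to ICMP destination-unreachable packets. Illustrative only, not the exact command from the ticket; eth0 is an assumed interface name.

    # Print verbose, timestamped decodes of ICMP dest-unreachable packets
    # arriving at a cache host (eth0 is an assumption; icmptype/icmp-unreach
    # are standard pcap-filter keywords).
    tcpdump -n -i eth0 -vvv -tttt 'icmp[icmptype] == icmp-unreach'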
[15:35:37] Traffic, Operations, ops-ulsfo, Patch-For-Review: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3748216 (RobH)
[15:57:11] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748250 (BBlack) Annotating some basic thoughts on the above (keep in mind with various kinds of offload in play, packetization/MTU/checksum will often...
[16:03:20] Traffic, Operations, ops-ulsfo: setup/deploy dns400[12]/wmf721[56] - https://phabricator.wikimedia.org/T179204#3716775 (RobH) These are idling as role spare with the OS installed, ready for service.
[16:03:28] XioNoX: check out the last ticket update above, does any of that ring any bells for you on the ICMP thing?
[16:05:04] will do
[17:05:42] bblack: https://github.com/Exa-Networks/exabgp - https://packages.debian.org/fr/stretch/exabgp
[17:08:23] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748444 (ayounsi) Another data point, https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&from=1507633459013&to=1507680000000...
[17:39:09] Traffic, netops, Operations: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3748534 (BBlack) I'm pretty sure all of the TCP application-level data flows match up roughly with the expected sequence of TLS HANDSHAKE -> CLIENT HTTP...
[17:45:03] bblack: re your comment, how long is the idle period on the cp?
[17:47:54] XioNoX: you mean nginx's configured timeout to close an idle client conn?
[17:48:00] bblack: yeah
[17:49:34] our keepalive timeout is 60s
[17:49:42] looking to see if some others could be in play
[17:50:21] and lingering timeout is 5s (if it's trying to do a lingering close)
[17:52:10] bblack: how did we come up with 60s, is that after tests? or recommended default, or?
[17:52:18] reset_timedout_connection is another nginx param that might be relevant (default is off; if "on" then it explicitly uses RST on timed-out connections to avoid FIN_WAIT_1 ... which is the state those ICMP'd connections end up in for a bit...)
[17:52:56] oh interesting
[17:53:03] XioNoX: not sure. we actually have it configured in two different places: 65s for the whole daemon, and then a more-specific 60s setting for our unified proxy. would have to dig in git blame to know.
[17:54:14] the one in nginx.conf predates git history:
[17:54:17] de059228933 templates/nginx/nginx.conf.erb (Ryan Lane 2011-09-07 22:28:35 +0000 65) keepalive_timeout 65;
[17:54:31] de059228 being the initial import into git from svn
[17:55:29] 755cbeddcc7 modules/protoproxy/templates/localssl.erb (Mark Bergsma 2013-07-24 16:55:45 +0200 37) keepalive_timeout 60;
[17:55:53] ^ that's the 60s value in localssl.conf; it's just the original value chosen when mark first wrote that config file to initially set this up.
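A minimal sketch of the nginx directives under discussion, using the values quoted in the log above; this is illustrative, not the actual puppetized config:

    http {
        keepalive_timeout 65;      # the whole-daemon value from nginx.conf.erb
        lingering_timeout 5s;      # nginx default; the 5s quoted at 17:50

        server {
            keepalive_timeout 60;  # the more-specific localssl.erb value
            # Off by default; "on" would reset timed-out connections with RST
            # instead of walking through FIN_WAIT_1 (the option floated at
            # 17:52):
            # reset_timedout_connection on;
        }
    }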
[17:56:16] either way, the time before this FIN+ACK shows up is far less than keepalive_timeout
[17:58:02] on the experimental front I'm going to try the close_notify thing a little later
[17:58:21] on the research front, it would be interesting to dig/refresh a bit on whether FIN+ACK is a valid way to start closing a connection, etc
[17:58:38] (or if that almost certainly implies either a broken client or an injected FIN)
[17:59:55] for all we know this could just be broken client-side software that only exists in certain models of mobile phones with certain software revs, which is only common across a range of carriers in the EU, or something like that.
[18:00:26] (would be easy to verify that by correlating the destunreach client IPs to varnishlog of their UA strings)
[18:01:20] yeah, or aggregate the source ASNs using netflow
[18:01:43] default nginx value for that is 75s btw
[18:07:48] bblack: from https://www.ietf.org/rfc/rfc793.txt fin,ack seems legit. `FIN-WAIT-1 --> --> CLOSE-WAIT`
[18:08:45] yeah but FIN-WAIT-1 is a state you only reach after someone sent a FIN already
[18:09:40] also, in the trace there's a 30s delay between the last received packet (which is an ack) and the fin-ack
[18:10:00] you wouldn't expect the implementation to wait that long to piggyback the ack :)
[18:10:29] and there was no need to piggyback an ack. it already ack'd up to that point.
[18:10:35] that too
[18:10:48] (I honestly don't remember if it's legit to ACK the last bit of the peer's data + send an initial FIN together. maybe?)
[18:11:51] but so far my best theories for that initial FIN+ACK packet are that something injected a FIN from us first, or that the FIN+ACK back to us is injected or altered, or there's a code bug in the client's stuff.
[18:12:33] FWIW there was no obviously-persistent broken mobile UA on the unreachables
[18:12:45] but what was interesting is they're almost all Firefox/Chrome, and no MSIE
[18:12:56] which could again be a pointer at different TLS close_notify behaviors and such
[18:13:46] if it was exactly 30s it would make it clearer that it's hitting a timeout on the client side (or middlebox), but 34.6s is quite random
[18:15:29] eh
[18:15:38] could be some randomization thing. 30s +/- rand
[18:16:02] if we buy the FF/Chrome-vs-MSIE theory though, then we still have to ask why it's only esams clients heh
[18:16:07] the diagram on page 2 of http://www.cs.northwestern.edu/~agupta/cs340/project2/TCPIP_State_Transition_Diagram.pdf is useful, at least to give a clear image
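A hedged sketch of the varnishlog correlation floated at 18:00: for a client IP seen in the dest-unreach captures, list the User-Agent strings it sent. The X-Client-IP header name and the IP are placeholders/assumptions; the -q query syntax is the varnish 4+ VSL query language.

    # Pull request headers for transactions from one (placeholder) client IP,
    # then filter down to User-Agent; X-Client-IP is an assumed header name.
    varnishlog -g request -i ReqHeader \
        -q 'ReqHeader:X-Client-IP eq "192.0.2.1"' | grep -i 'User-Agent'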
[18:17:25] does this happen over the weekend too?
[18:17:55] the pattern drops off during weekends
[18:17:59] ha
[18:18:07] https://grafana.wikimedia.org/dashboard/db/tcp-fast-open?panelId=3&fullscreen&orgId=1&from=1509659003733&to=1510240262964
[18:18:09] (not completely, but much lower)
[18:18:33] that's failed TFO, mostly affecting esams, dropping during the weekend
[18:19:01] yeah but in this case there was no TFO in my capture (the initial handshake was dataless), and the dstunreach happens after normal data exchange + timeout
[18:19:29] right but it might point in the direction of some shitty middleboxes used by people at work Mon-Fri
[18:19:35] yeah
[18:19:51] a lot of shitty TLS cipher choices known to be made by middleboxes have that kind of pattern, too
[18:20:07] (and shitty TLS cipher choices made mostly by work-deployed outdated client machines)
[18:21:00] * ema runs out for dinner
[18:21:00] o/
[18:21:38] I'm trying the do_wait_shutdown experiment on cp3030
[18:22:02] !log cp3030: puppet-disabled + manual nginx ssl_do_wait_shutdown config
[18:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:29] is it for the TCP RSTs?
[18:28:30] nice!
[18:28:37] curious about the results
[18:28:57] there's a bump in all the cp3030 TCP stats from that, but they seem to normalize back to about the same quickly
[18:29:23] still 2x old-config nginx workers running though, waiting for those last ones to drain and die off
[18:31:26] icmp unreach, RST, retrans... they all seem to normalize to roughly the same values as before with do_wait_shutdown
[18:34:49] a bit unrelated, but shouldn't we set the keep-alive timeout value in the header `< Connection: keep-alive`? cf. http://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_timeout
[18:36:05] maybe
[18:36:45] for clients that honor it, if we set it a bit shorter than our own keepalive timeout, it might cause more client-close instead of server-close, which is a good thing for TIME_WAITs and such
[18:36:49] it might help if it's a browser misbehaving, but not if it's middlebox-related
[18:37:11] but also, setting our keepalive timeout to be commonly longer than most browsers' would have the same effect
[18:39:03] firefox has a default of 115s
[18:42:20] the tradeoff there is, of course: the longer the total avg keepalive time, the more connection parallelism we have opened up into at least the front of nginx
[18:42:42] which is why just raising it to 3 minutes or whatever to get it higher than browser timeouts probably isn't a great idea
[18:43:01] but we could leave it like it is at 60 and send browsers the header for something like 55 to get them to close first.
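The 18:43 idea, sketched as config: keep the server-side idle timeout at 60s but advertise 55s to clients so well-behaved browsers close first. nginx's keepalive_timeout takes an optional second argument that emits a "Keep-Alive: timeout=55" response header (per the nginx docs URL cited at 18:34). As the following messages note, the header's spec'd semantics turned out to be a minimum rather than a maximum.

    # Sketch only: 60s server-side idle timeout, with 55s advertised to
    # clients via "Keep-Alive: timeout=55" (the optional second argument).
    keepalive_timeout 60 55;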
[18:43:17] http://gabenell.blogspot.com/2010/11/connection-keep-alive-timeouts-for.html (from 2010)
[18:44:05] I wonder which modern browsers honor the server-sent keepalive header, too
[18:45:34] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive "Indicating the minimum amount of time an idle connection has to be kept opened (in seconds)"
[18:45:45] *minimum*
[18:45:55] "Also, Connection and Keep-Alive are ignored in HTTP/2; "
[18:45:57] damn
[18:46:01] heh
[18:46:26] anyways, I'm gonna try a full daemon "upgrade"-style restart, in case my do_ssl_wait_shutdown patch has some quirk that doesn't enable it correctly on a mere reload
[18:46:59] !log cp3030 - round 2 of ssl_do_wait_shutdown test
[18:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:49] https://tools.ietf.org/id/draft-thomson-hybi-http-timeout-01.html#rfc.section.2.1
[18:50:26] yeah seems to do the opposite of what I would've hoped
[18:53:11] reconfirmed in case I was remembering wrong: it is the side that first initiates active-close (sends the first FIN) that suffers the TIME_WAIT
[18:53:26] so in the HTTP(S) case ideally we do want the client to close first rather than the server
[18:54:36] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748769 (RobH) Swapped mainboard yesterday, but during the installer today got the following: [ 457.538179] BUG: soft lockup - CPU#19 stuck for 23s! [apt-get:38504] │ [ 493.53...
[18:59:52] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748777 (BBlack) So, looking at all the crash messages we've managed to record since the beginning of this ticket, the CPU# indicated has had a history of: 41, 23, 47, 47, 1, 19 . T...
[19:00:43] !log cp3030 - end experimentation, puppetizing back to normal config
[19:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:46] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3748879 (RobH) I've updated Dell, and they want me to move it to socket 1 and repeat. I'm asking them to just send me a replacement CPU, we'll see what happens.
[20:29:04] Traffic, Operations, ops-ulsfo, Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3749116 (RobH) They agreed and are dispatching a replacement part. I'll likely go ahead and do the proposed swap with the existing, but this will eliminate my having to make two tri...
[23:23:41] Traffic, netops, Operations, Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (ayounsi)
[23:46:04] Traffic, netops, Cloud-VPS, Operations: Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (ayounsi)
[23:58:09] Traffic, netops, Cloud-VPS, Operations: Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3749582 (madhuvishy) Noting here that proprietary software is not usually installed on WMCS environments per https://wikitech.wikimedia.org/wiki/Wikitech...
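A quick way to eyeball the 18:53 point on a cache host: count sockets sitting in TIME_WAIT, the state paid for by whichever side sent the first FIN. A sketch using ss from iproute2; the tail skips ss's header line.

    # Count local sockets currently in TIME_WAIT (TCP only, numeric output).
    ss -tn state time-wait | tail -n +2 | wc -l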