[08:07:35] hello traffic team
[08:07:48] cp3055 was down, I depooled it and powercycled
[08:08:26] will wait for somebody to triple check before repooling :)
[08:54:37] hi elukey
[09:05:22] ciao ema
[09:13:57] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema) On 2019-12-10 cp3055 went down too: ` 19:33 <+icinga-wm> PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100% ` Depooled and power-cycled by @elukey on 2019-12-11T08:04.
[09:14:04] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema)
[09:19:15] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (ema)
[09:19:22] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (ema) p:Triage→Normal
[09:19:41] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema)
[09:23:24] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (jcrespo) See: T240177 T237730 backup2001 was updated to new bios last time it crashed.
[09:26:11] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) Do we have somewhere to collect the kernel versions of the hosts and whether they were upgraded before/after the crash? I upgraded db2125's kernel when it crashed to: ` root@db2125:~# u...
[09:50:25] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[11:05:20] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (MoritzMuehlenhoff) Some observations: - I'm pretty sure this is unrelated to the kernel, we've seen these crashes with both 4.9 and 4.19 - backup2001 had latest firmware when it crashed - backup200...
[11:06:05] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (ema) >>! In T238305#5731093, @jcrespo wrote: > See: T240177 T237730 backup2001 was updated to new bios last time it crashed. cp3053 too (T239041) and has been running fine since, FWIW.
[11:07:38] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (MoritzMuehlenhoff) Given that the firmware updates itself were still showing these symptoms, this wouldn't hurt, but I doubt it's a complete fix, I wrote up some proposal at https://phabricator.wikimedia.org/T238305#5731421,...
[11:21:00] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (jcrespo) >>! In T238305#5731424, @ema wrote: >>>! In T238305#5731093, @jcrespo wrote: >> See: T240177 T237730 backup2001 was updated to new bios last time it crashed. > > cp3053 too (T239041) and...
[12:26:21] Domains, Traffic, DNS, Operations: Donate wikiźródła.pl and wikisłownik.pl to the Foundation - https://phabricator.wikimedia.org/T240446 (tomasz)
[14:17:35] uh!
[14:17:36] $ curl -s -v -H "Host: en.wikipedia.org" 'http://appservers.svc.eqiad.wmnet/wiki/Main_Page' 2>&1 | grep Transfer
[14:17:39] < Transfer-Encoding: chunked
[14:18:02] that's not the case for s/http/https/
[14:18:14] netops, Operations: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (Phamhi)
[14:21:02] lol
[14:21:05] nice find!
[14:24:29] that can't be great for ttfb :)
[14:39:38] so I guess nginx is de-chunking?
[14:40:26] so it seems! I haven't found out why though
[14:40:37] we have `proxy_buffering off;`
[14:41:53] but other buffers are on-ish
[14:41:57] see the rest of the template...
[14:41:58] hey, does nginx use http/1.0 for proxying?!?
[14:42:07] Default:
[14:42:08] proxy_http_version 1.0;
[14:44:59] oh but we override that
[15:02:01] yeah a lot of tlsproxy was built around what we *had* to do to make nginx<->v-fe work in #traffic
[15:02:08] some of those settings are probably non-ideal for this case
[15:02:29] it's also set to do one request-per-connection too
[15:03:08] or we could stop investing in that and just move it to envoy and hope it works better!
[15:07:02] even haproxy mapping conns 1:1 would be better (acting as a purely TLS/TCP revproxy, not HTTP)
[15:07:15] right
[15:08:04] and since appservers.svc is a roundrobin LB...
[15:08:20] the sessionid cache is probably nearly-useless. what are the odds of hitting the same server twice?
[15:09:08] either way that's probably a relatively-minor concern so long as we have low reconnection rates from ats-be->appservers
[15:09:23] but the chunking/buffering/etc interference could be a real factor
[15:10:12] any interesting gains from the session/token cacheability stuff and/or changing do_not_cache()?
[15:10:27] nope
[15:11:06] we could s/https/http/ in the appservers remap for a eqiad cp host and see exactly how much overhead nginx is introducing
[15:11:16] I expect that should be very visible here https://grafana.wikimedia.org/d/7-ZqK8-Wz/varnish-frontend-ttfb-comparison?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-backend_a=cp1075&var-backend_b=cp1085&var-percentile=75
[15:12:31] just to get an idea of course, not as a "fix" of any sort :)
[15:12:33] +1, give it a shot
[15:12:52] 75 is the ats?
[15:12:58] correct
[15:13:36] wow that fe<->be hit metric is interesting, ~30ms for nothing?
[15:14:17] at p75 anyways
[15:14:34] maybe we have a problem with insufficient connection pooling/reuse at the v-fe->ats-be level, too
[15:14:53] although still 30ms is a lot to establish a dc-local HTTP conn
[15:15:21] I've got to step away for a few, bbiab!
[15:15:50] cya
[16:37:08] o/
[16:37:13] ok decided on some names
[16:37:16] https://gerrit.wikimedia.org/r/c/operations/dns/+/556411
[16:37:19] aklsoi
[16:37:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/556413
[16:37:23] also*
[16:40:27] \o/
[16:40:31] * godog drove-by +1s
[16:41:21] heheh :)
[17:41:47] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (jcrespo) a:ema Proposing merging this ticket into T238305 (or resolve it), unless there is some host-specific tasks pending for cp3055, like upgrading the firmware and assigning to someone that could do that (@robh remot...
[17:47:19] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (RobH) I was tagged into this, so I'm guessing the info is needed for firmware? The server is running the following: Bios 2.2.11 - this is very outdated, urgent flagged update currently is 2.4.8 ilom 3.34.34.34 - this is...
[17:47:43] bblack: i suspect when the new cp systems were racked
[17:47:45] no one updated the bios
[17:47:48] which is non-ideal
[17:47:50] =[
[17:48:05] i get why it happened, forgotten in shuffle, but meh
[17:48:21] (reference cp3055)
[17:50:45] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (jcrespo) @Robh Indeed this would need owner confirmation and depooling, not asking you to do anything. Was tagging you just to confirm a remote upgrade was possible and reasonable for 3xxx datacenter, given its particular lo...
[17:50:50] https://phabricator.wikimedia.org/T238305#5731421
[17:51:06] robh: if bios rev is included in "firmware" there, I think that's not it?
[17:51:34] Traffic, Operations: Investigate trafficserver-tls crash on cp3064 - https://phabricator.wikimedia.org/T240183 (jcrespo) Same comment as T240425#5732940
[17:51:37] 3057 also crashed twice, but not sure if it was updated between (back in nov)
[17:52:34] i dont get the question sorry
[17:52:45] i just know jaime tagged me in about the firmware of the bios
[17:52:47] and its outdated
[17:53:01] and part of our racking is supposed to be to update bios before handing off to you
[17:53:01] yeah, but moritz is also saying we've seen this crash on firmware-updated nodes, too
[17:53:06] oh, ok
[17:53:12] so that may be true, but may not explain our problems, either
[17:53:14] yeah i wasnt arguing for the firmware being the cause
[17:53:19] oh ok
[17:53:25] just mentioning it should likely have been updated before we handed it to you =]
[17:53:51] can we poll firmware revs over ipmi or something?
[17:54:04] I know they'll be different for every model, but it's metadata we could track and manage anyways
[17:55:11] Traffic, Operations: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (RobH) >>! In T240425#5732979, @jcrespo wrote: > @Robh Indeed this would need owner confirmation and depooling, not asking you to do anything. Was tagging you just to confirm a remote upgrade was possible and reasonable for 3...
[17:55:20] I hesitate to say alert on it, because then we'll build a dataset saying all R440 should be at rev 2.34.23 (etc), and then someone will update that every time a new rev comes out, and then we might start just applying pointless bios updates all the time with no cause just because new ones are out.
[17:55:33] and the history of bios update quality says that's probably not a great idea without cause heh
[17:56:08] well, no alerting imo
[17:56:18] but it would be nice to list it all and we can snag the huge outliers
[17:56:31] ie: perhaps try to keep them within a year of latest or whatever
[17:56:54] most systems have to reboot annually so updating annually with that seems ok to me... if they do reboot that often, but i think they all do...
[17:59:15] if all our stuff was virtual we wouldn't care, we'd just migrate pods or vms or whatever Elsewhere and reboot for hardware-level stuff whenever we feel like it without disruption :)
[18:45:57] netops, Operations: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (brion)
[18:49:18] netops, Operations: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (jcrespo) p:Triage→High @ayounsi could you have a look if it is something we could do something about? It was reported on IRC and more than one person said it was expe...
[18:52:47] netops, Operations: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (jcrespo) I am told he is on vacations right now, maybe @akosiaris or @faidon can have a look? I didn't see anything obvious packet loss on grafana metrics or librenms.
[19:19:02] heya, I’m having major pebkac with the patch to add an lvs for kibana-next https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556443/ and the associated dns change https://gerrit.wikimedia.org/r/#/c/operations/dns/+/556442/ could I ask someone who makes lvs changes on the regular for a review?
[20:34:21] netops, Operations: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (CDanis) I did some [[ https://atlas.ripe.net/measurements/23604772/#!probes | ICMP pings ]] and [[ https://atlas.ripe.net/measurements/23604785/#!probes | TCP port 443 tracer...
[20:49:24] netops, Operations: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (CDanis) We also have two probes on Comcast's network constantly performing pings towards our RIPE Atlas anchor in ulsfo. Their network performance looks relatively stable ov...
[21:19:03] Traffic, DNS, Operations, Research: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (leila) >>! In T240303#5726814, @BBlack wrote: > I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS ser...
[21:51:45] netops, Operations: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (brion) `sudo mtr -z -s 1000 -T -P 443 phabricator.wikimedia.org` gives similar results: ` Host Loss% Snt...
[23:06:10] Traffic, DNS, Operations, Research: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (Krinkle) This question isn't directly related but might help indirectly clear some confusion: Who will pay for the domain name when it expires? (Noting that DNS is where...