[06:52:11] netops, Operations: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105 (ayounsi) Open→Resolved p:Triage→Normal
[07:07:14] netops, Operations: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105 (ayounsi) Resolved→Open As a test, I pushed the following: ` [edit chassis] - alarm { - management-ethernet { - link-down ignore; - } - } ` As cr2-eqdfw doesn't have a mgm...
[07:56:06] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1085.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[08:23:31] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1085.eqiad.wmnet'] ` and were **ALL** successful.
[09:00:19] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1087.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[09:24:29] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1087.eqiad.wmnet'] ` and were **ALL** successful.
[09:26:38] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The log can be found in `/var/log/wm...
[09:57:00] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3064.esams.wmnet'] `
[10:45:42] Traffic, Operations, Performance-Team, Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) Today at 7:30ish we've disabled the compress plugin everywhere. It's clearly buggy and [[ https://grafana.wikimedia.org/d/7-ZqK8-Wz/...
[11:50:10] Traffic, DNS, Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Aklapper)
[12:37:09] Traffic, DNS, Mail, Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Reedy) Did this ever work? Or did someone just start using the email and expect it to work?
[12:40:43] Traffic, DNS, Mail, Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (HakanIST) It's been working since 2016. I've just checked the queue and latest received email is dated 08/01/2019.
[12:47:51] Traffic, DNS, Mail, Operations: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Dzahn) Looks like this changed in https://gerrit.wikimedia.org/r/c/operations/dns/+/533219 + @Vgutierrez
[12:58:25] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log can be found in `/var/log/wm...
[12:59:43] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1089.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[12:59:46] Traffic, Operations, Patch-For-Review: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (MoritzMuehlenhoff) I did a reinstall of netflow4001 (had missed this task update and thought it was a botched install) and tested migrations/draining a node, a master failover and a r...
[13:24:20] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1089.eqiad.wmnet'] ` and were **ALL** successful.
[13:25:47] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` and were **ALL** successful.
[13:57:21] Traffic, Operations, Performance-Team: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (Vgutierrez)
[14:07:52] vgutierrez: without synchronized TFO keys on the appservers cluster, TFO isn't going to help really
[14:08:40] and SO_KEEPALIVE... what's the thinking there?
[14:09:04] turning it on causes probes to be sent which could tear down an already-broken connection, but they're pretty slow by default
[14:10:41] so in our reasoning, we were thinking that it would be better to have spare connections against origin servers ready to use
[14:10:49] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema)
[14:11:03] to avoid clients having to wait till an origin connection is established
[14:11:17] right, I get that part
[14:12:07] but SO_KEEPALIVE's actual effect is to make a connection die faster (in the case that the peer has vanished from the network without ever bothering to RST/FIN, e.g. because they lost link suddenly), not stay alive more :)
[14:12:13] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema) Open→Resolved a:ema cp2023 and cp1089 were the last two hosts running Varnish as backend cache. We now have exclusively ats-be across the fleet!
[14:12:33] and of course, upgrading to TLS 1.3 would help a lot in new connections
[14:12:41] bblack: that's also good IMHO :)
[14:12:58] well yeah but do we even have that as a case?
[14:13:32] and TFO without any good odds of it succeeding will basically waste bandwidth
[14:13:49] (since we know the destination is an RR-set of tons of hosts with unsynchronized TFO eys)
[14:13:52] *keys
[14:15:00] going back to the 15% ticket: there was mention of making appservers hash on client IP instead of RR, but that's a really big step to take...
[14:15:37] (we'd probably want to do some analysis on that, and testing, and even then...)
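Editorial sketch of the SO_KEEPALIVE point made at 14:09 and 14:12: the option does not keep a connection open. It only makes the kernel probe an idle peer so that a dead one is noticed. The helper names and the timer values below are illustrative assumptions, not anything configured in ATS or discussed in the channel. The `TCP_KEEP*` constants are platform-dependent, so the code skips any that are missing. With the stock Linux timers (7200 s idle, 75 s interval, 9 probes) a vanished peer is detected only after roughly two hours, which matches the "pretty slow by default" remark.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive probes and tune the timers where the platform allows.

    The idle/interval/count defaults here are hypothetical example values.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket timer options are not available on every platform.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the kernel gives up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def worst_case_detection_seconds(idle, interval, count):
    """Approximate time to notice a peer that vanished without sending RST/FIN."""
    return idle + interval * count


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
    # Stock Linux timers: 7200 s idle, 75 s interval, 9 probes.
    print("default detection time (s):", worst_case_detection_seconds(7200, 75, 9))
```

Shorter timers detect a dead origin sooner at the cost of extra probe traffic on every idle pooled connection.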
[14:16:08] I think we didn't know that ATS doesn't support TLS session reuse (as a client) when that was written
[14:16:33] considering that, I don't know if chashing on client IP is still on the table
[14:16:45] well even if it did
[14:17:00] there are lots of upsides to that RR, given the variable nature of MW requests' load, etc
[14:17:01] Traffic, Operations: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (ema) Open→Resolved a:ema Having finished the transition to ATS T227432, there is no routing between cache backends anymore.
[14:17:20] there's also the fact that we have.... 78 cache nodes total around the globe
[14:17:38] and there are 275 mw servers between eqiad and codfw
[14:18:04] so a hash on cache source IP would put all edge connections through to ~28% of the appservers and leave the rest idle :P
[14:18:15] (or something like that)
[14:18:45] yep
[14:19:08] I think that our best bet right now on the short term is to be sure that ATS<-->origin server connections get reused as much as possible
[14:19:21] and on the mid term getting TLS1.3 in place
[14:20:07] yeah which will require buster on both ends (already in Q3 for us, not sure about MW servers?)
[14:20:22] but the best-best thing we can do is just not have connections closing for random bad reasons
[14:20:50] we have a ton of natural request->conn amortization possible, and with the idle conn pool helping... we shouldn't be bearing conn startup costs often
[14:21:50] that's the idea yes
[14:21:55] I'm really curious if MW even has the slow outputs that would give downsides to a machine-local proxy buffer at all
[14:22:37] because it could be pretty easy to do store-and-forward in the mediawiki's nginx, and then the interruption/timeout/etc cases just become 503s that don't break reused conns.
[14:22:48] AFAIK nginx already does that
[14:22:57] does what?
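Editorial sketch of the ~28% estimate given at 14:18: if each of the 78 cache nodes is pinned to one appserver by hashing its source IP, at most 78 of the 275 appservers can ever receive traffic. The function names, the plain modulo hash, and the `10.0.0.x` addresses are hypothetical stand-ins chosen for illustration. They are not the hashing scheme that was actually under consideration. Hash collisions can only push the real coverage below the upper bound.

```python
import hashlib


def max_fraction_used(num_sources, num_servers):
    """Upper bound on the share of servers reachable when each source maps to one server."""
    return min(num_sources, num_servers) / num_servers


def servers_hit(source_ips, num_servers):
    """Distinct server indexes selected by a simple (non-consistent) hash of the source IP."""
    chosen = set()
    for ip in source_ips:
        digest = hashlib.sha256(ip.encode("ascii")).digest()
        chosen.add(int.from_bytes(digest[:8], "big") % num_servers)
    return chosen


if __name__ == "__main__":
    cache_nodes = 78    # figure quoted in the log
    appservers = 275    # figure quoted in the log
    print(f"upper bound: {max_fraction_used(cache_nodes, appservers):.1%}")

    # Made-up source addresses, one per cache node.
    ips = [f"10.0.0.{i}" for i in range(1, cache_nodes + 1)]
    used = servers_hit(ips, appservers)
    print(f"distinct appservers hit in this run: {len(used)} of {appservers}")
```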
[14:23:38] nah forget it, my brain mixed stuff
[14:23:40] it has some packet buffering, but it doesn't slurp the response from mediawiki before emitting it on the connection to ATS, it streams it through as available
[14:23:48] that nginx is buffering incoming POST requests
[14:23:51] but not responses
[14:23:58] which is why if the mw<->nginx part breaks during a response, the nginx<->ats also has to break
[14:24:31] but the potential downsides of going store-and-forward there: more nginx memory usage on mw*, and also if MW has slow-streamed outputs they'll get even slower from the ATS pov.
[14:24:55] (e.g. if it takes 500ms to generate a given page output, but it's stuffing out the first bytes at the 20ms mark and spooling them out over that whole time...)
[14:29:54] Traffic, Operations, Performance-Team, Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) >>! In T238494#5753680, @ema wrote: > There are thus two fronts to work on now: (1) increase connection reuse, and (2) decrease the c...
[14:31:05] hi :)
[14:31:58] * ema waves
[14:37:09] I'm disabling xdebug on cp1075 and cp4028 too to get ready for the holidays
[14:45:24] holidays, yeah :)
[14:45:35] I guess the one in esams was already cleaned up earlier right?
[14:45:42] yup
[15:13:05] netops, Operations, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (CDanis) a:CDanis
[15:13:11] Traffic, netops, Operations, observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (CDanis) a:CDanis
[16:08:59] o/
[16:09:01] can we just merge this?
[16:09:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/559528
[16:09:09] removes an unused public routing from varnish + ats
[16:09:15] or should yall handle that?
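Editorial sketch of the store-and-forward trade-off raised at 14:24: a streaming proxy passes the first bytes on as soon as the origin emits them, while a buffering proxy cannot send anything until the whole response has arrived. The function name and the chunk timeline are hypothetical, built from the 20 ms / 500 ms example in the log, and network latency is ignored.

```python
def first_byte_ms(chunks, buffered):
    """Time at which the downstream side (ATS here) sees the first response byte.

    chunks: list of (emit_time_ms, payload) pairs produced by the origin.
    buffered: True for store-and-forward, False for streaming pass-through.
    """
    if not chunks:
        raise ValueError("response has no chunks")
    times = [t for t, _ in chunks]
    # Store-and-forward waits for the last chunk; streaming forwards the first one.
    return max(times) if buffered else min(times)


if __name__ == "__main__":
    # Hypothetical page: first bytes at 20 ms, generation finishes at 500 ms.
    response = [(20, b"<html>"), (250, b"<body>...</body>"), (500, b"</html>")]
    print("streaming first byte (ms):", first_byte_ms(response, buffered=False))
    print("buffered first byte (ms):", first_byte_ms(response, buffered=True))
```

The upside mentioned in the log is not modelled here: with buffering, an origin failure partway through the response can be turned into a clean 503 instead of a broken reused connection.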
[16:10:47] metrics.wm.o -> Error: 502, connect failed at 2019-12-19 16:10:36 GMT
[16:10:57] I don't see any good reason not to just merge
[16:12:53] merged!
[16:12:53] ty
[16:25:50] ottomata: should we kill the metrics.wm.o public hostname as well?
[16:26:10] it's a 1-liner in the dns repo basically
[16:29:07] ya
[16:29:13] will push
[16:30:44] https://gerrit.wikimedia.org/r/c/operations/dns/+/559533
[16:32:52] thanks!