[00:09:03] Traffic, Operations, decommission, ops-codfw: decommission cp2001.codfw.wmnet - https://phabricator.wikimedia.org/T248815 (Papaul)
[00:09:24] Traffic, Operations, decommission, ops-codfw: decommission cp2002.codfw.wmnet - https://phabricator.wikimedia.org/T248818 (Papaul)
[00:09:57] Traffic, Operations, decommission, ops-codfw: decommission cp2004.codfw.wmnet - https://phabricator.wikimedia.org/T248824 (Papaul)
[00:10:24] Traffic, Operations, decommission, ops-codfw: decommission cp2005.codfw.wmnet - https://phabricator.wikimedia.org/T248848 (Papaul)
[00:11:51] Traffic, DC-Ops, Operations, decommission, ops-codfw: decommission cp2006.codfw.wmnet - https://phabricator.wikimedia.org/T248856 (Papaul)
[05:21:19] Traffic, Operations, Performance-Team, Wikimedia-Site-requests, and 2 others: Remove "Cache-control: no-cache" hack from wmf-config - https://phabricator.wikimedia.org/T247783 (Joe)
[05:37:20] Traffic, Operations: Servers freezing across the caching cluster (November 2019) - https://phabricator.wikimedia.org/T238305 (Vgutierrez) @faidon actually the cp hosts are running buster (T242093) since February 13th. I do believe we haven't seen more occurrences of this issue on the cache cluster since...
[07:33:11] Traffic, Operations: ATS TLS session cache efficiency reduced in TLSv1.3 - https://phabricator.wikimedia.org/T245502 (Vgutierrez) This has been mitigated by providing support for TLS Session Tickets (T245616) and reducing the number of issued tickets on new connections from 2 to 1 by submitting this patc...
[07:33:26] Traffic, Operations: ATS TLS session cache efficiency reduced in TLSv1.3 - https://phabricator.wikimedia.org/T245502 (Vgutierrez) Open→Resolved
[07:33:31] Traffic, Operations, Goal, Patch-For-Review, Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Vgutierrez)
[07:33:45] Traffic, Operations: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 (Vgutierrez) Open→Resolved
[07:33:48] Traffic, Operations: ATS TLS session cache efficiency reduced in TLSv1.3 - https://phabricator.wikimedia.org/T245502 (Vgutierrez)
[08:06:38] netops, Operations: netflow hosts spamming /var/log - https://phabricator.wikimedia.org/T249177 (ayounsi)
[08:06:41] netops, Operations: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (ayounsi)
[08:08:58] netops, Operations: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (ayounsi) Note that Fastnetmon 1.1.4 is in Debian 11. Back-porting it might be easy.
[08:11:06] netops, Operations: netflow2001 kafkatee-webrequest restart loop - https://phabricator.wikimedia.org/T249176 (ayounsi) As long as we drop RPKI invalids, we can remove RPKI counter.
[13:34:20] Traffic, Operations, decommission, ops-codfw: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/3 { ... } + member xe-7/0/5; [edit interfaces] - xe-7/0/5 { - description cp2...
[13:37:20] Traffic, Operations, decommission, ops-codfw: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (Papaul)
[14:26:43] netops, Analytics, Analytics-Kanban, Operations: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey) For the moment I am happy with TLS encryption only, since we'll probably move to kerberos authentication soon and it doesn't make much...
[14:26:53] netops, Analytics, Analytics-Kanban, Operations: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey)
[16:36:47] netops, Operations: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (ayounsi) Thanks to Moritz I backported (locally only for now) FNM 1.1.4 and installed it on netflow4001. `Unpacking fastnetmon (1.1.4-1~deb10u1) over (1.1.3+dfsg-8.1)...
[17:03:04] ema: random async thought that occurred to me in a meeting just now, that's probably worth pursuing...
[17:03:19] the varnish-fe code for "retry 503 once in frontend instances, to paper over transient issues"
[17:04:02] which came of that whole line of thinking about "we can retry once at the outermost layer, and by policy other layers on the inside should never retry to avoid multiplying failing requests"
[17:04:26] [but side note, apparently some internal things do retries anyways, apparently the old decisions have rotted a bit]
[17:04:50] we now also have a number of known cases where internal services are, for better or worse, making sub-requests back into the frontend caches...
[17:05:06] which could still cause a multiplication from our retry-503-once code
[17:05:58] we could probably mitigate that impact, by putting a conditional around the retry-503-once code, so that it doesn't happen when X-Client-IP is in private/WMF address space
[17:06:12] (only retry for outsiders, not for insiders)
[19:10:41] Traffic, MediaWiki-Debug-Logger, Operations, Developer Productivity: noc.wikimedia.org with X-Wikimedia-Debug routes to mwdebug but host is not served there - https://phabricator.wikimedia.org/T245552 (Krinkle)
[19:12:45] Traffic, MediaWiki-Debug-Logger, Operations, Developer Productivity: noc.wikimedia.org with X-Wikimedia-Debug routes to mwdebug but host is not served there - https://phabricator.wikimedia.org/T245552 (Krinkle) Ah okay, so we're stuck between a rock and a hard place. * Before: We don't route XWD...
[19:15:39] Traffic, MediaWiki-Debug-Logger, Operations, Developer Productivity: noc.wikimedia.org with X-Wikimedia-Debug routes to mwdebug but host is not served there - https://phabricator.wikimedia.org/T245552 (Krinkle)
[19:35:38] Traffic, Operations: Servers freezing across the caching cluster (November 2019) - https://phabricator.wikimedia.org/T238305 (faidon) Ah! That's awesome to hear. May I suggest to resolve this (and the associated "upgrade firmware"?) task then, and reopen if we have another one of these?
[21:41:45] Traffic, Operations, User-DannyS712: 503 error on enwikinews - https://phabricator.wikimedia.org/T249280 (DannyS712)
[21:42:03] Traffic, Operations, User-DannyS712: 503 error on enwikinews - https://phabricator.wikimedia.org/T249280 (DannyS712)
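
Editor's note on the 17:05:58 suggestion ("only retry for outsiders, not for insiders"): the decision logic could look roughly like the Python sketch below. This is an illustration of the idea only, not the varnish-fe VCL. The function names, the `extra_networks` parameter, and the choice to treat an unparseable X-Client-IP as external are all assumptions. Only the RFC 1918 ranges are built in; the log names no WMF address ranges, so the caller would have to supply them.

```python
import ipaddress

# RFC 1918 private IPv4 ranges. Organisation-owned public ranges are NOT
# listed in the log, so they are left for the caller to pass in.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(client_ip, extra_networks=()):
    """Return True if client_ip falls in a private or caller-supplied range.

    Assumption: a missing or unparseable address counts as external, which
    keeps today's behaviour (retry once) for such requests.
    """
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False
    networks = list(PRIVATE_NETWORKS)
    networks += [ipaddress.ip_network(cidr) for cidr in extra_networks]
    # Membership is False when the address and network IP versions differ.
    return any(addr in net for net in networks)


def should_retry(status, client_ip, retries_so_far, extra_networks=()):
    """Retry a 503 at most once, and only for external clients."""
    if status != 503 or retries_so_far > 0:
        return False
    return not is_internal(client_ip, extra_networks)
```

For example, `should_retry(503, "10.2.1.5", 0)` returns False (internal sub-request, no retry), while the same call with a public client address returns True on the first attempt.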