[07:56:06] netops, Operations, Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3204919 (ayounsi) Another point, if the servers saturates its uplink, this also means it needs more capacity. In addition to making the al...
[10:54:25] Traffic, Operations, Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3205422 (ema) @BBlack suggested that the possible underlying issue could be lock contention between the expiry thread and the worker threads. Indeed this seems...
[12:53:42] Traffic, Operations: Test.wikipedia,org is reporting bad gateways outside of the main page - https://phabricator.wikimedia.org/T163684#3205783 (Zppix)
[12:55:26] Traffic, Operations: Test.wikipedia,org is reporting bad gateways outside of the main page - https://phabricator.wikimedia.org/T163684#3205783 (TTO) Works for me... https://test.wikipedia.org/wiki/Hello_there_apples_and_bananas! (silly URL to bust cache) is fine.
[12:57:43] Traffic, Operations: Test.wikipedia,org is reporting bad gateways outside of the main page - https://phabricator.wikimedia.org/T163684#3205805 (Zppix) Correction I get a error 400 bad request...
[13:05:18] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3205842 (ayounsi) From the feedback I collected here is what I believe the maintenance will look like. please edit this comment or let me know...
[13:18:02] Traffic, netops, Operations: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3205947 (ayounsi)
[14:22:47] excuse the noise if it's known but cp2005, cp2017, cp2024 all have mailbox lag CRITs
[14:23:40] yeah
[14:23:55] we're working on creative solutions, but not very visibly I guess :)
[14:24:17] cp2002 has an updated package with a code patch that may get us past this for good
[14:25:37] yeah I saw that already
[14:26:33] mailbox lag crits are interesting to us and useful to keep an eye on growing problems, but they don't imply 5xx on their own
[14:26:42] nod
[14:26:53] so ideally they're not CRITs, so nobody gets bothered by them, I guess
[14:26:58] but the world is imperfect for now
[14:28:43] cp2002 (and soon all, if things go ok) are running ema's new patch: https://phabricator.wikimedia.org/P5317
[14:29:17] the idea being to give the expiry thread realtime priority, and put priority inheritance policy on the main mailbox lock
[14:29:57] it should be virtually impossible to get mailbox lag like this, I think
[14:30:08] but we'll see how the tradeoffs work! :)
[15:16:19] Traffic, Operations: Special:RecentChanges in etwiki displays error - https://phabricator.wikimedia.org/T163696#3206368 (Zppix) p:Triage>High
[15:21:48] Traffic, Operations: Special:RecentChanges in etwiki displays error - https://phabricator.wikimedia.org/T163696#3206411 (matej_suchanek) p:High>Unbreak! >>! In T158004#3206258, @matej_suchanek wrote: > https://cs.wikipedia.org/wiki/Speciální:Poslední_změny (cswiki RC) **down for me with beta ORES...
[15:40:53] Traffic, netops, Operations, Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3206531 (BBlack) Resolved>Open hmm, no, it is the HTTPS check, not the IdleConnection one. I wonder why it's RST and not regular close?
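Editor's note on the 14:29:17 message: the contents of P5317 are not reproduced in this log, so the following is only a minimal sketch of the two POSIX mechanisms the message names (realtime scheduling for one thread, and a priority-inheritance mutex), not the actual Varnish patch. The function names `pi_mutex_init` and `thread_request_realtime` are hypothetical, and the choice of `SCHED_FIFO` at the minimum realtime priority is an assumption made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: initialise a mutex that uses the priority-inheritance
 * protocol, so a low-priority holder is temporarily boosted while a
 * higher-priority thread waits on it.
 * Returns 0 on success or an errno-style error code. */
int pi_mutex_init(pthread_mutex_t *lock)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(lock != NULL);

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(lock, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical helper: request SCHED_FIFO for the calling thread at the
 * lowest realtime priority (an assumed value, chosen for illustration).
 * Returns 0 on success or an errno-style error code; without the needed
 * privileges (e.g. CAP_SYS_NICE on Linux) this typically yields EPERM. */
int thread_request_realtime(void)
{
    struct sched_param sp;

    sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```

In a setup like the one described, the thread that must not fall behind would call `thread_request_realtime()` on itself, and the lock it shares with ordinary worker threads would be created with `pi_mutex_init()`. Inheritance matters because a worker holding the lock would otherwise keep its normal priority while the realtime thread waits on it.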
[20:45:11] HTTPS, Traffic, MediaWiki-General-or-Unknown, Operations: Make default interwiki map links protocol-relative - https://phabricator.wikimedia.org/T33327#353861 (demon) I disagree. We should use https for those that support it, and http for those that don't. Protocol-relative URLs were a useful to...
[20:46:15] HTTPS, Traffic, MediaWiki-General-or-Unknown: Make default interwiki map links protocol-relative - https://phabricator.wikimedia.org/T33327#3207865 (demon) (Also, this was **never** an operations or traffic bug)
[20:48:15] Fuckin' herald
[20:49:54] netops, Operations, ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3207869 (elukey) @Cmjohnson yep exactly! But it should be done before the 26th, the major goal is to avoid to loose two kafka nodes for extended maintenance at the...