[08:02:34] Traffic, Operations, HHVM, User-Joe, User-mobrovac: Enable TLS termination on the MediaWiki clusters - https://phabricator.wikimedia.org/T153042#2867337 (Joe)
[16:48:42] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2869377 (faidon) After a few back and forths with JTAC, the case was escalated to the Advanced TAC (aka ATAC). The issue was thankfully replicated in their...
[16:56:02] hi, I encountered an issue where nginx returned a 504 to the browser for the page Special:Search
[16:56:49] for some complex search queries we'd like to use a timeout-based approach so that we can warn the user that the results displayed are partial
[16:57:55] today the timeout we use between MW and elastic is wrongly set to 40s but in fact should be 80s
[16:58:29] when I fixed the issue (to actually use an 80s timeout) nginx started to complain
[16:59:58] the question is: what is the timeout used by nginx to serve a page like Special:Search?
[17:04:41] dcausse: hey, the nginx timeout should be 180s IIRC
[17:05:45] yeah proxy_read_timeout 180s
[17:06:02] ema: I'm not 100% sure it's effective, elukey verified that the backend returned a 200 in 80s
[17:06:32] dcausse: nope, it returns a 200 in 40s https://phabricator.wikimedia.org/T152895#2869401
[17:06:38] (in prod)
[17:07:05] ema: yes, it's a bug, the patch I deployed to mwdebug was meant to fix that and use an 80s timeout
[17:07:17] oh ok
[17:07:18] and at that time nginx started to complain
[17:07:32] interesting
[17:10:09] dcausse: was the change active between roughly 22:30 yesterday and 13:30 today?
[17:10:24] the 80s bugfix I mean
[17:11:17] oh but you've probably applied it to mwdebug only
[17:11:26] the bugfix was active on mwdebug1002 around 4pm CET today
[17:11:36] yes
[17:12:03] I was looking at https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&var-site=All&var-cache_type=text&var-status_type=5&from=now-24h&to=now
[17:13:07] and there seems to be a 504 plateau during the timeframe I've mentioned, but that seems to happen every day and it's really just a few requests so nothing to see there
[17:13:45] this type of query is not widely used, so I'm not sure it'd be noticeable on a dashboard :/
[17:14:20] btw why are we returning a 200 when the error happens with 40s as a timeout?
[17:15:11] if it's an error surely we should use an error status code right? :)
[17:15:31] ema: it's an application error like 'Search backend error, please retry later'
[17:15:59] and I'd like to change it to: 'Partial results returned, please optimize your query'
[17:18:14] netops, Discovery, Operations, Wikidata, and 2 others: wdqs2003 switch port configuration - https://phabricator.wikimedia.org/T153094#2869454 (Papaul)
[17:20:24] dcausse: is the patch still applied on mwdebug1002?
[17:20:40] ema: no :(
[17:21:50] dcausse: ok now I've gotta go anyways, it would be nice to apply the patch again tomorrow if possible and try to repro
[17:22:08] ema: sure, thanks for your help
[17:22:28] np :)
[18:10:48] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2869697 (faidon)
[18:11:59] netops, Operations, ops-eqiad, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (faidon) D1 to D8 was patched with fiber QSFP+s (et-1/1/0 <-> et-8/1/0). The no-name optics we bought in T149726 appear as QSFP+-40G-CU...
[20:18:43] netops, Operations, ops-codfw: ms-fe200[5-8] switch port configuration - https://phabricator.wikimedia.org/T152627#2870328 (RobH) Open>Resolved Done!
[21:46:11] Traffic, Mobile-Content-Service, Operations, RESTBase, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2870728 (Pchelolo) Open>Resolved We didn't hear about this problem for a while, let's assume separating it to a n...
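
Note on the 16:56-17:22 timeout discussion: the log quotes two values, nginx's proxy_read_timeout of 180s and the intended 80s timeout between MW and elastic, and leaves open which layer actually produced the 504. One rough way to reason about such a chain is to list the layers from outermost to innermost and flag any inner timeout that is not shorter than a timeout in front of it. The sketch below does only that. The `Layer` type, the function name, and especially the 60s "hypothetical intermediate hop" are illustrative assumptions, not anything identified in the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One hop in a request path and the timeout it enforces, in seconds."""
    name: str
    timeout_s: float


def find_timeout_inversions(layers):
    """Return (outer, inner) name pairs where the inner timeout is not
    strictly shorter than the outer one.

    `layers` must be ordered outermost first. An inversion means the outer
    layer can give up (for example with a 504) before the inner one answers.
    """
    problems = []
    for i, outer in enumerate(layers):
        for inner in layers[i + 1:]:
            if inner.timeout_s >= outer.timeout_s:
                problems.append((outer.name, inner.name))
    return problems


# The two values quoted in the log.
quoted_chain = [
    Layer("nginx proxy_read_timeout", 180),
    Layer("MediaWiki -> Elasticsearch", 80),
]

# Same chain with a made-up hop in between (ASSUMPTION, not from the log),
# to show what an inversion would look like.
chain_with_unknown_hop = [
    Layer("nginx proxy_read_timeout", 180),
    Layer("hypothetical intermediate hop", 60),
    Layer("MediaWiki -> Elasticsearch", 80),
]

if __name__ == "__main__":
    print(find_timeout_inversions(quoted_chain))
    print(find_timeout_inversions(chain_with_unknown_hop))
```

With only the two quoted values the check reports nothing, which matches the puzzlement in the channel: 80s fits inside 180s, so if a timeout is the cause, some layer not named here would have to be involved.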