[03:31:10] 10Traffic, 10Operations, 10Performance-Team (Radar): Some HTTP requests for MW failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (10matmarex) Has anyone seen this issue again in the past two weeks? If not, the VE patch might have fixed it…
[04:45:49] 10Traffic, 10netops, 10Operations, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Marostegui) 05Open→03Resolved Going to close this for now. Feel free to reopen if needed.
[10:40:48] ema: it looks like cp3034 has a much larger failed fetch error rate
[10:40:50] than the others
[10:41:10] do you know what's wrong?
[10:41:19] just curious, I ran into it
[10:42:48] effie: ENOEMA ;) cc vgutierrez
[10:43:11] that's interesting
[10:43:27] cp3034 is running ATS as the TLS termination layer
[10:43:35] effie: where are you seeing that?
[10:45:53] let me find it again
[10:49:46] hmmm https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=1568576222591&to=1568803747192
[10:49:47] that one?
[10:50:19] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?panelId=3&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=now-6h&to=now
[10:50:28] yeah
[10:51:52] wonderful
[10:53:27] it's consistent for all the instances running ATS-TLS instead of nginx
[10:53:34] 🍿
[10:54:05] at least it is consistent :p
[10:54:07] 10Traffic, 10Operations, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10WMDE-leszek) @Dzahn the instance was no longer in use, so I've just deleted it.
[10:54:22] should I file a task?
[10:54:32] please :)
[10:54:36] haha
[10:54:39] I was going to do it otherwise
[10:54:43] sure
[10:55:18] T231433 that's a nice parent
[10:55:19] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[10:56:29] tx
[10:56:54] I'm wondering if....
[10:57:46] let me know the task id..
[10:57:53] sure
[10:59:35] I've got a suspect already... let me check the TS source code...
[11:01:16] 10Traffic, 10Operations: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10jijiki)
[11:01:52] 10Traffic, 10Operations: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10jijiki)
[11:01:55] 10Traffic, 10Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10jijiki)
[11:20:24] hmmm
[11:20:53] something is telling me that the Proxy-Connection: close header sent by ATS is messing with varnish-fe
[11:21:51] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?panelId=5&fullscreen&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend
[11:23:50] or even worse, with the ats-be behind varnish-fe :)
[11:49:56] yup... Proxy-Connection looks like the culprit :)
[12:18:23] 10Traffic, 10Operations: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10Vgutierrez) It looks like ats-tls setting `Proxy-Connection` to `Close` is messing with varnish-fe<-->ats-be connections as can be seen in https://grafana.wikimedia.org/d/000000352...
[12:26:28] 10Traffic, 10Operations: Higher failed fetches error rate on some caching servers - https://phabricator.wikimedia.org/T233205 (10Vgutierrez) 05Open→03Resolved p:05Triage→03Normal a:03Vgutierrez Solved by preventing Proxy-Connection from spreading across varnish-fe and ats-be, thanks for reporting the...
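The resolution above works by preventing the Proxy-Connection header from spreading between hops. As a minimal illustration in Python (not the actual ATS/Varnish configuration, and all function names here are hypothetical): RFC 7230 treats Proxy-Connection and Connection, plus any header they name, as hop-by-hop headers that a proxy must consume rather than forward. A TLS terminator that forwards `Proxy-Connection: close` downstream can cause the next hop to tear down its connections, which matches the failed-fetch pattern diagnosed in T233205.

```python
# Sketch of hop-by-hop header stripping per RFC 7230; names are illustrative.
HOP_BY_HOP = {
    "connection", "proxy-connection", "keep-alive", "te",
    "trailer", "transfer-encoding", "upgrade",
}

def strip_hop_by_hop(headers):
    """Return a copy of `headers` that is safe to forward to the next hop."""
    # Headers listed inside Connection/Proxy-Connection are also hop-by-hop.
    named = set()
    for name, value in headers.items():
        if name.lower() in ("connection", "proxy-connection"):
            named.update(tok.strip().lower() for tok in value.split(","))
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in HOP_BY_HOP and name.lower() not in named
    }

incoming = {
    "Host": "upload.wikimedia.org",
    "Proxy-Connection": "close",  # the header implicated in T233205
    "Accept": "image/webp,*/*",
}
print(strip_hop_by_hop(incoming))  # Proxy-Connection is dropped
```

In the real deployment the equivalent stripping happens in the proxy layer itself; the sketch only shows why dropping the header at each hop keeps `close` from propagating from ats-tls through varnish-fe to ats-be.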
[12:26:32] 10Traffic, 10Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[13:46:19] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Thanks to the awesome work of @Jclark-ctr an-presto1001 and an-presto1003 are now reimaged, but an-p...
[13:47:59] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Proposed fix for asw2-b: ` delete interfaces interface-range cloud-hosts1-b-eqiad member xe-4/0/5 s...
[13:55:48] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10akosiaris) >>! In T225128#5503053, @elukey wrote: > Proposed fix for asw2-b: > > ` > delete interfaces inte...
[14:02:56] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, 10ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Committed: ` elukey@asw2-b-eqiad# show | compare [edit interfaces interface-range vlan-cloud-hosts1...
[14:41:40] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) Ok so current status: * All hosts reimaged to buster and working * Renamed hostnames in netbox * Wai...
[14:46:20] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) AWESOME thank youuuu
[17:28:54] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10RobH) p:05Triage→03Normal
[17:29:05] 10Traffic, 10Operations, 10ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10RobH)
[18:45:15] 10Wikimedia-Apache-configuration, 10Performance-Team, 10Patch-For-Review: Apache configuration: SVGs served by MediaWiki aren't gzipped - https://phabricator.wikimedia.org/T232615 (10Krinkle)
[19:20:47] 10Traffic, 10Analytics, 10Operations: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10herron) p:05Triage→03Normal
[20:34:36] 10netops, 10Operations: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 (10ayounsi) 05Open→03Resolved From Level3: > I appreciate your patience while we worked on gathering the data on these repair tickets. I've attached the repair ticket log abov...
[21:13:59] 10Traffic, 10Operations, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) Thanks @WMDE-leszek! I think we can close this now @BBlack @Vgutierrez I don't see more leftovers to clean up.
[21:16:06] 10Traffic, 10netops, 10Operations: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) 05Open→03Resolved All primary links of all transport pairs now have damping configured.
[21:56:27] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (10Cmjohnson)
[23:14:31] 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Check for faulty optic asw-c-eqiad to cr1-eqiad - https://phabricator.wikimedia.org/T233265 (10Cmjohnson) Swapped both optics on cr1-eqiad and asw2-c xe-2/045. Giving it 24 hours to see if any errors return.