[02:02:35] bblack: repooling ulsfo
[02:24:31] ncredir just paged
[02:24:40] where's the icinga bot?
[02:24:48] oh I'm staring at the wrong channel, that's why :)
[07:44:44] vgutierrez: hello!
[07:44:56] yey, I was waiting for you :)
[07:45:01] hehehe
[07:45:19] I got lost with syslog on relforge
[07:45:45] no problem
[07:45:58] whenever you're ready
[07:46:00] here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/535528
[07:46:03] I'm ready
[07:46:11] that one first before the LVS patch
[07:46:31] cool
[07:49:43] did you run puppet on wdqs hosts?
[07:52:19] I just trigger the puppet run
[07:52:29] CI was slow on verifying the rebase
[07:54:12] vgutierrez@lvs1016:~$ nc -w 2 -zv wdqs1004.eqiad.wmnet 8888
[07:54:12] wdqs1004.eqiad.wmnet [10.64.0.17] 8888 (?) open
[07:54:14] nice...
[07:54:17] this looks better :)
[07:54:28] alright!
[07:55:06] so the big one :)
[07:55:53] sure, again, I'll disable puppet on the affected LVSs
[07:58:28] alright.. cool!
[07:59:24] merging...
[08:00:54] cool.. let's run puppet on eqiad secondary LB for low-traffic
[08:02:32] alright
[08:04:46] as you can see, monitors are happy with heavy queries on lvs1016: https://grafana.wikimedia.org/d/000000421/pybal?panelId=8&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1016&var-service=wdqs-heavy-queries_8888&from=now-5m&to=now
[08:05:19] nice!
[08:05:25] thanks a lot!
[08:05:29] let's go for the secondary on codfw
[08:06:00] nah.. you did all the heavy lifting here
[08:06:56] :)
[08:08:23] same for codfw: https://grafana.wikimedia.org/d/000000421/pybal?panelId=8&fullscreen&orgId=1&from=now-5m&to=now&var-datasource=codfw%20prometheus%2Fops&var-server=lvs2006&var-service=wdqs-heavy-queries_8888
[08:08:29] let's pool the servers
[08:09:02] yep
[08:14:32] all looking good
[08:14:49] I'm going to restart the primary LVSs for low-traffic on eqiad|codfw
[08:16:13] alright. cool
[08:21:25] onimisionipe: all good. thanks for choosing traffic edge systems <3
[08:22:32] lol... traffic edge systems
[08:24:02] Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: Refreshing live OCSP response for certificate unified / rsa-2048
[08:24:02] Sep 12 08:23:31 acmechief-test1001 acme-chief-backend[457]: live OCSP response refreshed successfully for unified / rsa-2048
[08:24:03] \o/
[08:28:56] Acme-chief: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 (Vgutierrez) After upgrading acme-chief on acmechief-test1001, a tiny storm of OCSP requests was generated: ` Sep 12 08:23:24 acmechief-test1001 acme-chief-backend[457]: Missing/invalid DNS zone updater CMD timeout, u...
[08:36:38] https://www.irccloud.com/pastebin/2sTtNvrc/
[08:36:44] looking good :D
[08:54:10] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (Mathew.onipe) @Addshore @Ladsgroup @WMDE-leszek, can you test that you can reach wdqs.svc.eqiad.wmnet on port 8888. LVS and othe...
[09:05:26] netops, Operations, observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (fgiunchedi)
[09:47:31] Traffic, netops, Operations: 503 errors when trying to log in to Wikimedia sites - https://phabricator.wikimedia.org/T232698 (Aklapper)
[09:47:39] netops, Operations, observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (jbond) I think this is a great idea. As to which host, the cumin server makes sense to me or perhaps bastion? The user is a bit of a pain, it would be nice if we could ha...
[10:21:36] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (Ladsgroup) The requests work but the TLS ones give me this error: ` ladsgroup@stat1007:~$ curl https://wdqs.svc.eqiad.wmnet:8888...
[10:25:02] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (Mathew.onipe) @Ladsgroup there's no TLS termination on that port for now. We should have and I will work on it in the nearest fu...
[10:58:17] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez)
[10:58:29] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez) p:Triage→High
[11:03:45] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez) Enabling the session cache debug on a local instance shows this: `willikins:~ vgutierrez$ docker logs -f ats_ats_1 |fgrep timeout [E. Mgmt] log ==> [TrafficManager] using root directory '/usr'...
[11:05:31] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez)
[11:06:04] netops, Operations, observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (jbond) p:Triage→Normal
[11:06:44] Traffic, Analytics, Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (jbond) p:Triage→Normal
[11:07:49] Traffic, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (jbond) p:Triage→Normal
[11:09:33] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez)
[11:09:35] Traffic, Operations: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 (Vgutierrez)
[12:07:48] netops, Operations, observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (fgiunchedi) Yeah I think cumin host would be ok and ditto for user atlas, and we can also use the `ripe-atlas-tools` debian package!
[12:50:51] Traffic, Operations, Wikimedia-Logstash, observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (fgiunchedi)
[13:42:44] Traffic, Operations, Wikidata, Wikidata-Query-Service, Patch-For-Review: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (Lucas_Werkmeister_WMDE)
[14:30:49] netops, Operations, ops-eqiad: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi)
[15:07:43] Traffic, Operations, Wikimedia-Logstash, observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (Krinkle) >>! In T189333#5483346, @fgiunchedi wrote: >>>! In T189333#5481492, @Krinkle wrote: >> I re-ran my analysis today, and oddly eno...
[15:10:14] netops, Operations, ops-eqiad, Patch-For-Review: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi)
[17:46:16] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[17:52:59] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez) After debugging why the eviction is triggered: `#0 ssl_rm_cached_session (ctx=0x188db10, sess=0x2ad5e802a910) at SSLUtils.cc:304 #1 0x00002ad5d30030f9 in remove_session_lock (ctx=0x188db10,...
[18:14:58] netops, Operations, ops-eqiad: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (ayounsi) Open→Resolved Alright everything here is done. And was quite smooth. Some notes: * k8s1005 and k8s1006 only had v4/v6 sessions to cr1 and not cr2, which...
[18:15:00] Traffic, netops, Operations: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi)
[18:18:42] hmmm ok yes, it's there
[18:19:01] I thought multiple BGP peers was in the set of features that were master-only but hadn't made it back to our deployed 1.15 stuff
[18:19:12] but confirmed in the repo, and in the deployed code on a live LVS, the feature is in fact there
[18:19:43] so we just need to test it and puppetize it really
[18:20:10] hmmm
[18:20:20] XioNoX: got a sec to confirm it?
[18:20:47] bblack: yep
[18:20:52] which one?
[18:21:13] 1016, will restart it to connect to both (with manual config hack)
[18:21:35] ok, adding the config, one sec
[18:22:49] alright, 1016 is configured on both
[18:22:57] ok restarted
[18:23:23] I see established connections from here to both
[18:24:34] of course we don't have per-service med or anything, or per-peer
[18:24:58] but still I don't think this would cause a functional issue
[18:25:38] bblack: I'm seeing all the prefixes on both
[18:27:21] yeah
[18:27:40] ok turning off experiment :)
[18:28:11] removed the config on the router side
[18:29:09] yeah will have to thought-experiment this a bit and think of anything dumb this might cause
[18:29:28] but I think it's actually fine
[18:29:38] (to use the dual-bgp peering like that, from all the LVSes)
[18:29:46] (at all the DCs, too)
[18:29:55] that would be great!
[18:30:00] I can do some puppet patches and fire one off today, to gradually roll it out
[18:30:12] maybe just the backup LVS in codfw or something for now, and then work through the rest next week.
[18:36:22] Traffic, Operations: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez) It looks like the culprit is https://github.com/apache/trafficserver/commit/03734d05e28af8a7b105a0579056c913fb5d1bc5, I've tested ` https://gerrit.wikimedia.org/g/operations/debs/trafficserve...
[19:11:13] netops, Operations, observability: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (ayounsi) LGTM!
[19:20:24] Traffic, Operations, Patch-For-Review: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (BBlack) What's missing here is turning on BGP peering with all local routers, which is available in our current 1.15 pybal releases. Will fix that up here and then resolv...
[19:27:57] Traffic, Operations, Patch-For-Review: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (BBlack) T180069 - Ticket from the feature add for pybal itself
[22:14:00] bblack: let me know when you want to sync up to push the change in codfw
[22:21:13] Traffic, Analytics, Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria) cc @Ottomata just in case he can do the change too
[23:02:54] Traffic, Analytics, Operations: Images served with text/html content type - https://phabricator.wikimedia.org/T232679 (Nuria)