[00:03:43] HTTPS, Traffic, Collection, Operations, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3003811 (Dzahn) So does pediapress have to fix their cert on tools.pediapress before this can be merged or is it unrelated to that cha...
[10:45:09] alright so yesterday's 503s had nothing to do with pybal/lvs nor with the reboots themselves. We currently get a 503 spike every time the backend VCL is reloaded with /usr/share/varnish/reload-vcl, regardless of the DC
[10:45:57] of course the impact is amplified if the reloads happen in eqiad/codfw as in that case they affect esams/ulsfo too
[10:46:10] https://github.com/varnishcache/varnish-cache/issues/2195
[11:13:21] Traffic, Operations: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3004933 (ema)
[11:13:28] Traffic, Operations: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3004949 (ema) p:Triage>High
[11:16:10] I'm not gonna proceed with cache_upload reboots till we find a decent solution to this. The proposed one is good old sleep: https://github.com/varnishcache/pkg-varnish-cache/issues/61
[11:25:19] also setting .initial to 2 (same value as .threshold) in the 'varnish' probe seems to help
[11:26:43] the default is threshold-1, meaning that the backend is considered sick at the beginning and after one successful poll it's considered fine
[11:40:23] Traffic, Operations: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3004993 (ema) I've confirmed on cp3040 that the issue is not reproducible by doing either of the following: * set .initial to the same value as .threshold in the [[ https://github.com/wikimedia/ope...
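To make the .initial / .threshold point above concrete, here is a minimal sketch in Python of how many successful polls a freshly loaded backend needs before it counts as healthy. It is a simplified model and not Varnish's actual code: the function name `polls_until_healthy` is made up for illustration, the `window` default of 8 is assumed from the Varnish probe defaults, and every poll is assumed to succeed.

```python
from collections import deque


def polls_until_healthy(threshold: int, initial: int, window: int = 8) -> int:
    """Count the successful polls needed before a new backend is healthy.

    Simplified model (an assumption, not Varnish source): the backend starts
    with `initial` polls pre-marked as good, and it is healthy once at least
    `threshold` of the last `window` polls are good.
    """
    if threshold > window:
        raise ValueError("threshold can never be reached within the window")
    # Pre-fill the poll history with the polls that are assumed good at load.
    history = deque([True] * initial, maxlen=window)
    polls = 0
    while sum(history) < threshold:
        history.append(True)  # hypothetical: every real poll succeeds
        polls += 1
    return polls


# Default described in the log: initial = threshold - 1, so the backend
# starts sick and needs one real poll.
print(polls_until_healthy(threshold=2, initial=1))
# Workaround from the log: initial = threshold, so the backend starts healthy.
print(polls_until_healthy(threshold=2, initial=2))
```

Under this model the default leaves a short window after each reload in which the new VCL's backends are still sick, which would be consistent with the 503 spike described above.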
[12:04:27] Traffic, Operations: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3005029 (ema)
[12:10:30] <_joe_> ema: ewww
[12:14:35] it's still not really clear to me why the frontends are not affected given that the probe settings should be the same, perhaps because they're DC-local so it's faster for them to toggle from the initial sick state to healthy?
[12:31:04] netops, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005077 (elukey)
[12:31:21] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005090 (elukey) p:Triage>Normal
[12:46:12] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005127 (elukey) Adding install1002's IP to the whitelist should be: ``` edit set firewall family inet filter analytics-in4 term analytics-publicIP-v4 from destination-address 208.80....
[13:24:47] modules/role/manifests/lvs/balancer.pp contains the definition for the wikimedia-experimental repo, since contint will need that as well for staging jenkins 2, I'd move it to a common place and make it hiera-configurable
[13:25:28] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005305 (elukey) Stale things found while reviewing: * term udplog is probably not worth keeping * term kafka is missing kafka2003's IP * term archiva should contain meitnerium's IP,...
[13:25:34] but I'm wondering if it's still needed for the varnish hosts, with 4.4 being the default kernel and the migration to varnish 4 completed, it's obsolete, right?
[13:25:54] if so, I'd not enable it via hiera when making the change
[13:41:03] moritzm: I don't think we need experimental on cache hosts any longer, no
[13:46:57] ok
[14:02:44] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005522 (elukey) Most urgent fixes: * Remove old AQS IPs ``` delete firewall family inet filter analytics-in4 term aqs from destination-address 10.64.0.123/32 delete firewall family i...
[14:23:10] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005589 (Ottomata) +1 to all of these. But, seeing as there has been an IPv6 with the ACLs for a while, maybe we should ask Ops about the use of continuing to support this VLAN. Not...
[14:41:21] netops, Analytics-Kanban, Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005658 (elukey) Old/New elastic search IP from Discovery: https://etherpad.wikimedia.org/p/analytics-acls
[16:06:20] Traffic, Analytics-Kanban, Operations, Performance-Team, Reading-Admin: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#3005963 (Nuria) a:Nuria
[17:41:23] is https://gerrit.wikimedia.org/r/#/c/336440/ bblack-approved?
[17:47:42] +1 Tremendous
[17:47:58] I'd sign into gerrit, but ...
[17:48:42] chromium update yesterday and now it doesn't load external plugins by-default. So I can't access my password manager until I close my browser and relaunch it with new arguments, and I'm in the middle of a meeting in it
[17:51:12] <_joe_> ema: jokes aside, is there any way to know when it's ready?
[17:52:14] _joe_: it's ready after healthcheck probes complete, to mark the live backups as "up", they default to dead
[17:52:18] s/backups/backends/
[17:52:30] and the healthcheck probes in question are configured with a 500ms timeout
[17:52:56] so 2s seems like "not too long to be super annoying, but long enough to have a good margin of error on a 500ms healthcheck"
[17:53:43] technically I guess we could check with backend.list
[17:53:58] kinda
[17:54:52] that seems more complex and error-prone, though
[17:55:54] you'd have to get the hash of the new vcl, use that in the backend.list filtering, find that they exist and have probed, but not care about sick/healthy
[17:56:03] (unless maybe they're all sick, maybe that indicates a bad reload)
[17:56:28] and you'd still have to sleep in a loop while you re-checked backend.list output
[17:59:01] there's probably all kinds of edge cases in that, too
[17:59:29] <_joe_> I agree
[17:59:35] or some crazy race condition where varnishadm connects to the wrong socket as it gets replaced on reload or something dumb, or gets deadlocked until vcl.use happens :P
[18:13:32] Traffic, Operations, Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3006343 (ema) Open>Resolved a:ema Tested https://gerrit.wikimedia.org/r/336440 on an esams host, no 503 spikes. Closing.
[19:03:40] HTTPS, Traffic, Collection, Operations, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3006606 (Ckepper) Thanks for the heads up. If there is no other option, we can get a wildcard SSL cert for tools.pediapress.com.
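For comparison with the fixed 2s sleep, the backend.list polling idea discussed between 17:53 and 17:56 could look roughly like the Python sketch below. This is only an illustration of the "sleep in a loop while re-checking" shape, not what the merged patch does. The `list_backends` callable is hypothetical: it stands in for whatever would run varnishadm, filter on the new VCL, and report per backend whether a probe has completed (ignoring sick/healthy), and no varnishadm output format is assumed here.

```python
import time
from typing import Callable, Mapping


def wait_for_probes(
    list_backends: Callable[[], Mapping[str, bool]],
    timeout: float = 2.0,
    interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every backend of the new VCL has completed a probe.

    `list_backends` is a hypothetical helper returning
    {backend_name: has_probed}. Returns True once all backends have probed,
    or False if `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    while True:
        states = list_backends()
        # An empty result means the new VCL's backends are not visible yet.
        if states and all(states.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with a fake helper: the first backend reports a probe on the third check.
checks = {"count": 0}


def fake_list_backends() -> Mapping[str, bool]:
    checks["count"] += 1
    return {"be_one": checks["count"] >= 3, "be_two": True}


print(wait_for_probes(fake_list_backends, interval=0.0))
```

Even in this small form the loop needs a timeout, an empty-result case, and a helper that parses CLI output, which illustrates the point made in the log that polling is more complex and error-prone than a plain sleep.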
[19:11:30] HTTPS, Traffic, Collection, Operations, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3006662 (Dzahn) @Ckepper How about using https://letsencrypt.org/ it's easy with https://certbot.eff.org/ , you don't have to spend an...
[19:52:29] HTTPS, Traffic, Collection, Operations, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3006787 (Ckepper) Thank you @Dzahn, that's an excellent suggestion. I will look into it and try to set it up for tools.pediapress.com.
[22:22:56] HTTPS, Traffic, Collection, Operations, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3007531 (Platonides) Only the last patch depends on pediapress changing their certificate. The other hosts have a valid certificate...