[09:28:41] <_joe_> hello traffic team
[09:28:56] <_joe_> I was planning to rebuild pybal today with the 0.1 secs of sleep for etcd
[09:29:08] <_joe_> and distribute over this week
[09:29:24] <_joe_> but I'm not sure it's ok given the upcoming vacations
[10:28:26] maybe next week? :)
[10:28:40] ema is off for 2 weeks AFAIK, I'll be off the rest of the week...
[10:28:51] so I'd say that the timing is far from ideal
[11:35:21] <_joe_> ok
[13:47:28] _joe_: do you have some reference for the "0.1 secs of sleep for etcd" thing?
[13:48:03] <_joe_> bblack: what do you mean? Why we had 1 second of sleep in the first place?
[13:48:30] <_joe_> It was an attempt at resolving the pybal etcd connections getting lost from time to time
[13:48:34] no I meant like a link to the change or something
[13:48:49] <_joe_> the change has been reviewed and merged already
[13:49:01] <_joe_> what I need to do is cut a release, and deploy it
[13:49:14] <_joe_> it's not urgent though
[13:50:38] bblack: https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/631686
[13:50:51] I just meant: I'm not familiar with the context enough to understand your sentence earlier, so I feel like I should be, so I want a reference link so I can educate myself :)
[13:50:54] thanks!
[13:52:54] <_joe_> ahah ok sorry :)
[13:53:35] what kind of race conditions are those?
[13:53:42] (I am procrastinating from my actual job, clearly...)
[13:53:49] <_joe_> mark: haahha
[13:54:47] <_joe_> mark: so the problem is sometimes you watch an item in etcd for a *very* long time before it changes, and if your connection is lost for any reason, by default you're trying to request changes since you last got an event
[13:54:54] and even if pybal can theoretically depool and repool that fast from the ipvs table, won't it still be pretty disruptive to the existing traffic flows / conns?
[13:55:18] I think j.ji addressed that in a ticket comment already
[13:55:26] <_joe_> so you add an 'old_id' parameter, but sometimes that revision is out of the log
[13:55:50] <_joe_> bblack: no, and let me explain why. 1) we sometimes do cumulative actions (like depooling 5 servers)
[13:56:00] <_joe_> those now take 5 seconds to be completed
[13:56:08] <_joe_> there is no reason for that
[13:56:44] depooling 5 servers becomes 5 reconnects to etcd?
[13:56:49] <_joe_> 2) for actual rolling restarts, we have other guards to protect actual traffic (like draining the server for N seconds) that don't cause 100 pools/depools per second
[13:57:21] <_joe_> but at the same time, for a pool of 100 servers, we're artificially adding 100 seconds of pause we didn't want.
[13:57:45] <_joe_> bblack: yes, every change in the store sends back a response and the connection is closed
[13:58:20] <_joe_> so we didn't have that sleep for years, we added it and it didn't really fix our problems
[13:58:24] that seems ... like an unusual and unique design decision!
[13:58:38] <_joe_> bblack: http long polling? not really
[13:58:52] <_joe_> bblack: oh we close the tcp connection because twisted, ofc
[13:59:00] but a set of 5 rapid-fire data updates -> 5 rapid fire disconnect->reconnect?
[13:59:02] <_joe_> if that's what you were referring to
[13:59:10] that's what I'm puzzling on
[13:59:14] <_joe_> 5 http request/responses
[14:00:10] <_joe_> but we don't do keepalive from twisted
[14:01:06] ok
[14:01:12] tcp keepalive? I guess we could though
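(Editor's note: a minimal sketch of the etcd v2 watch pattern _joe_ describes above — long-polling a key with wait/waitIndex, getting the connection closed after every event, and sleeping briefly before reconnecting. It is an illustration using the blocking requests library, not pybal's actual Twisted-based driver; the endpoint, key name, and sleep value are placeholders.)

```python
import time
import requests

ETCD_BASE = "http://etcd.example.org:2379"   # placeholder endpoint
KEY = "/conftool/v1/pools/example"           # placeholder key
RECONNECT_SLEEP = 0.1                        # the "0.1 secs of sleep" discussed above


def handle_change(node):
    print("changed:", node.get("key"), node.get("value"))


def watch_forever():
    wait_index = None
    while True:
        params = {"wait": "true", "recursive": "true"}
        if wait_index is not None:
            # Ask only for events after the last one we saw; if that index has
            # already been purged from etcd's bounded event log, etcd answers
            # with an EventIndexCleared error and we must re-read current state.
            params["waitIndex"] = wait_index
        resp = requests.get(ETCD_BASE + "/v2/keys" + KEY, params=params)
        body = resp.json()
        if "node" in body:
            node = body["node"]
            wait_index = node["modifiedIndex"] + 1
            handle_change(node)              # apply the pool/depool change
        else:
            # e.g. errorCode 401: our waitIndex fell out of the event window,
            # the race condition mentioned above; start over from current state.
            wait_index = None
        # etcd closes the connection after each event, so every change costs a
        # new HTTP request; the sleep only spaces those reconnects out, which is
        # why 5 queued depools end up taking 5 x sleep to complete.
        time.sleep(RECONNECT_SLEEP)
```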
[14:01:13] that seems pretty un-ideal all around
[14:02:44] sorry :) it's "nitpick a random design you don't understand" day
[14:02:58] <_joe_> bblack: actually it works pretty well, what's un-ideal - assuming an http interface to the datastore
[14:03:45] <_joe_> not websockets, nor any other protocol. plain ole http 1.1
[14:04:08] <_joe_> you can't do much more than sending back "something changed at index Y"
[14:04:17] there's considerable costs at some levels somewhere, to reconnecting TCP for every one of many rapid updates, vs just keeping one connection open and sending another long-poll req on it?
[14:04:45] <_joe_> given we don't get 1k updates/s, I assume it's pretty negligible
[14:05:30] <_joe_> it's a 5th level optimization compared to "we'll wait 1 sec between requests because we thought this might mitigate a race condition"
[14:05:41] <_joe_> so to be clear, I think we should just remove that sleep
[14:05:56] <_joe_> I'm reducing it for now to reduce the risk of the change
[14:06:46] https://twisted.readthedocs.io/en/twisted-20.3.0/web/howto/client.html#http-persistent-connection
[14:08:10] <_joe_> twisted 20 is what we use now?
[14:08:16] i have no idea
[14:08:43] to use persistent, you have to use a connection pool, which apparently can be limited to 1 connection in the pool :)
[14:16:32] given that the current code seems to handle connection/request/retry handling itself to some extent I think you could even hack it in as is, but that seems not ideal
[14:20:57] in master there's a kubernetes client that uses Twisted's HTTP Client Agent: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/pybal/+/refs/heads/master/pybal/kubernetes.py
[16:06:39] dear traffic, I am triaging https://phabricator.wikimedia.org/T267578 as medium, but feel free to triage it differently
[16:06:48] I am not sure how many people this affects
[16:07:01] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO: Puppet disabled in beta cluster varnish deployment-cache-text06 - https://phabricator.wikimedia.org/T267578 (10jijiki) p:05Triage→03Medium
[16:07:27] same for https://phabricator.wikimedia.org/T267561
[16:07:27] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10jijiki) p:05Triage→03Medium
[16:08:17] and https://phabricator.wikimedia.org/T267435
[16:08:32] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO: Beta cluster seems to be extremely slow for logged in user during page navigation - https://phabricator.wikimedia.org/T267435 (10jijiki) p:05Triage→03Medium
[16:17:12] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO: Puppet disabled in beta cluster varnish deployment-cache-text06 - https://phabricator.wikimedia.org/T267578 (10thcipriani) Even with puppet disabled, packages were upgraded which broke this again. I set the packages to...
[16:21:22] 10HTTPS, 10Traffic, 10Operations, 10Wikidata, 10wikiba.se website: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10jijiki) p:05Triage→03Medium
[16:40:59] 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (10aborrero) 05Stalled→03Declined >>! In T180179#6611341, @Aklapper wrote: > Could #netops please answer T180179#4965646? Asking a...
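(Editor's note: a small sketch of the persistent-connection approach linked in the Twisted docs above at ~14:06 — an Agent backed by an HTTPConnectionPool capped at one connection per host, so successive long-poll requests can reuse the same TCP connection instead of reconnecting after every change. The host, path, and callback wiring are placeholders; this is not pybal's current etcd driver.)

```python
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

# A persistent pool; maxPersistentPerHost=1 keeps a single reusable
# connection to the etcd host instead of opening a new one per request.
pool = HTTPConnectionPool(reactor, persistent=True)
pool.maxPersistentPerHost = 1

agent = Agent(reactor, pool=pool)


def watch(wait_index=None):
    url = b"http://etcd.example.org:2379/v2/keys/conftool?wait=true"  # placeholder
    if wait_index is not None:
        url += b"&waitIndex=%d" % wait_index
    d = agent.request(b"GET", url)
    d.addCallback(readBody)
    d.addCallback(on_event)
    return d


def on_event(body):
    print("event:", body)
    # a real client would parse modifiedIndex from the body and call watch()
    # again here, reusing the pooled connection
    reactor.stop()


watch()
reactor.run()
```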
[16:58:17] o/ I just came across this https://wikitech.wikimedia.org/wiki/URI_Path_Normalization Does this path normalization do any query parameter normalization? (and ordering)? Or will foo.php?one=two&three=four end up separately cached to foo.php?three=four&one=two ? (my assumption being they end up cached twice)
[17:02:39] addshore: no, we don't currently do anything fancy for the query part. We've discussed it many times (maybe a few in tickets even), but it's scary territory.
[17:03:36] ack! I guess if you normalize when caching you also need to normalize when purging. Sounds nice regarding some ticket we worked on recently (would have made things easier), but also not really needed
[17:03:49] the only kinds of normalization the edge layer really does today are the normalization of the hostname part (rejecting invalid data, case-insensitivity, etc) and the encoding-normalization of the path part (e.g. replacing %41 with A)
[17:04:20] the more things we can normalize about our URIs the better, but it can be hard to not break things too
[17:06:30] the unattainable ideal is that each real unique resource (http content output) has exactly one canonical URI, and that all possible input URIs can be swiftly transformed by the edge into either their appropriate canonical URI (by normalization of some kind) or rejected as invalid
[17:07:12] but there are many challenges on the way to that ideal, and some of them are true "known limits of computer science" hard.
[18:18:40] heya, are you all doing anything with deployment-cache-text06 in beta? puppet is disabled there, and varnishkafka is not installed
[18:18:45] ema: maybe ?
[18:24:01] ottomata: there's a ticket about needing a varnish6 upgrade there, possibly related
[18:24:24] https://phabricator.wikimedia.org/T267561
[18:24:37] thcipriani was also doing some poking at varnish on beta cluster -- i do think it's likely broken there, somehow
[18:25:01] well it's certainly broken. we hit this with almost every major upgrade
[18:25:10] varnish 5 is still installed
[18:25:15] shall I uninstall and try to install 6?
[18:25:18] beta pulls our prod puppetization of varnish config + VCL, but nobody goes and does the package upgrades to match
[18:25:51] ottomata: it may or may not turn out to be a lot more complicated than that. But it's broken now, and you're welcome to take a stab and possibly resolve the ticket in the process :)
[18:26:03] * ottomata begins stabbing
[18:42:52] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10Ottomata) I just tried to fix by installing varnish 6, but clearly (and obviously) it isn't that simple. ` $ sudo apt-get remove varnish varnish-modules libvmod-netma...
[19:31:48] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10ArielGlenn) I guess that T267439 might be related.
[19:47:20] Wikimedia Foundation is operational @ DE-CIX Dallas.
[19:47:41] nice :)
[20:48:34] 10Traffic, 10Desktop Improvements, 10Operations, 10Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10akosiaris) >>! In T266373#6616625, @sdkim wrote: > Given we, Product Infra, are not finding issues at our...
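(Editor's note: a tiny illustration of the query-parameter normalization discussed above at ~17:00 — sorting parameters so that foo.php?one=two&three=four and foo.php?three=four&one=two map to the same cache key. Per bblack, the edge does not do this today; this is only a sketch of what such a step could look like, using Python's standard library.)

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_query(uri: str) -> str:
    """Return the URI with its query parameters sorted by key (then value)."""
    parts = urlsplit(uri)
    # keep_blank_values so parameters like "?foo=&bar" survive the round trip
    params = parse_qsl(parts.query, keep_blank_values=True)
    canonical = urlencode(sorted(params))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, canonical, parts.fragment))

assert normalize_query("https://example.org/foo.php?one=two&three=four") == \
       normalize_query("https://example.org/foo.php?three=four&one=two")
```

Even this simple version changes the wire form in places (a bare `bar` becomes `bar=`), which hints at why it is "scary territory": caching, purging, repeated keys, and percent-encoding would all have to agree on the same canonical form.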
[20:54:26] 10Traffic, 10Desktop Improvements, 10Operations, 10Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10akosiaris) > Interestingly, proton returns transfer-encoding: chunked responses, that don't have a Content...
[20:55:28] 10Traffic, 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10Urbanecm)
[20:55:39] 10Traffic, 10Desktop Improvements, 10Operations, 10Product-Infrastructure-Team-Backlog, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10Urbanecm)
[21:00:44] 10Wikimedia-Apache-configuration, 10Android-app-Bugs, 10Fundraising-Backlog, 10Operations, and 5 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10DStrine)
[21:40:30] qq, what is the default cache ttl in varnish-frontend if the backend doesn't set a Cache-Control header?
[22:22:45] ottomata: there's a max ttl of 24h
[22:22:56] (even if you set a longer s-maxage, it's capped to that)
[22:27:51] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10thcipriani)
[22:27:55] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations, 10Release-Engineering-Team-TODO: Puppet disabled in beta cluster varnish deployment-cache-text06 - https://phabricator.wikimedia.org/T267578 (10thcipriani)
[22:28:01] 10Traffic, 10Beta-Cluster-Infrastructure, 10Operations: Beta needs to be upgraded to Varnish 6 - https://phabricator.wikimedia.org/T267561 (10thcipriani)
[22:28:04] max ya, but is that a default if s-maxage isn't set
[22:28:16] also cdanis, we just set s-maxage to 60 s
[22:28:33] after 60 seconds, we get a result with age > 60, and get the old content
[22:28:45] but then the subsequent request, age < 60 and get the new content
[22:28:59] shouldn't the first request after 60 seconds cause varnish to miss and re-cache?
[22:29:09] if you don't set a cache-control at all, i'm not sure what happens, but certainly nothing will be cached longer than 24h
[22:29:22] is this for the same URL?
[22:29:26] yes
[22:29:55] anything in https://schema.wikimedia.org
[22:30:04] right, but, the same exact URL you see that behavior?
[22:30:31] yes
[22:31:14] I don't even see it doing caching, I see a frontend miss / backend pass on all the URLs I've looked at
[22:31:50] oh I have a cookie set on this domain somehow
[22:32:05] i'm just checking in curl
[22:34:08] yeah I'm not sure
[22:34:23] I don't have more time to look now sorry
[22:34:28] ya no prob no biggie
[22:34:34] thought maybe we were missing something obvi
[22:34:36] ty anyway
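(Editor's note: a sketch of how the s-maxage=60 behaviour above could be observed from the client side — polling the same URL and printing Age, Cache-Control, and X-Cache so you can see whether the edge is caching and when the object is refreshed. The URL is a placeholder, and the X-Cache header name is an assumption based on the "frontend miss / backend pass" wording in the discussion; note that, as cdanis found, a cookie on the domain can turn every request into a pass.)

```python
import time
import requests

URL = "https://schema.wikimedia.org/example"   # placeholder URL

def poll(count=10, interval=15):
    for _ in range(count):
        # plain GET with no cookies/session: a cookie on the domain can
        # force the cache to pass instead of hit
        r = requests.get(URL)
        print(
            r.status_code,
            "age=%s" % r.headers.get("Age"),
            "cc=%s" % r.headers.get("Cache-Control"),
            "x-cache=%s" % r.headers.get("X-Cache"),
        )
        time.sleep(interval)

if __name__ == "__main__":
    poll()
```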