[01:14:56] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Android unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3994992 (10Krinkle)
[01:15:35] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3436479 (10Krinkle)
[03:23:58] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3436479 (10BBlack) https://stackoverflow.com/questions/13578428/duplicate-headers-recei...
[08:15:34] current pybal test coverage: https://phabricator.wikimedia.org/P6734
[09:45:11] <_joe_> vgutierrez: to put that into prespective, 2 years ago it was 0%
[09:45:13] <_joe_> :P
[09:45:38] best thing I found so far.. was flake8 running BUT configured to exclude everything
[09:45:39] *sigh*
[09:57:36] <_joe_> vgutierrez: eheh yeah, well, twisted is not pep8 compliant
[09:57:54] <_joe_> and mark's code isn't either :D
[09:58:12] <_joe_> so I'd rather work on the 2.0 version, tbh
[10:07:29] yeah... I need to use a paginator on flake8 bgp.py output
[10:56:09] 10Traffic, 10Operations, 10Pybal: Some etcd connections not established at startup - https://phabricator.wikimedia.org/T188087#3995831 (10ema)
[10:56:35] 10Traffic, 10Operations, 10Pybal: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#3995835 (10ema)
[11:11:51] 10Traffic, 10Operations: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#3995874 (10ema) p:05Triage>03High
[11:12:10] <_joe_> ema: wow it's the week of bugs for everyone
[11:12:28] <_joe_> in addition to pybal's issue, we found two infinite loops in mediawiki
[11:13:34] _joe_: yeah elukey told me about those. We're living interesting times :)
[11:14:07] wow
[11:14:29] <_joe_> vgutierrez: this is definitely *not* our average week :P
[11:14:35] <_joe_> don't get scared
[11:14:41] * ema confirms
[11:14:44] it's not my fault!
[11:14:46] xDDD
[11:14:50] <_joe_> uhm right
[11:15:04] <_joe_> what changed this week? let's revert just in case :P
[11:38:15] * mark wants some style changes in joe's 2.0 code too
[11:38:19] but that we'll have to discuss later
[13:20:45] you gotta love pybal code: "sendNotificationWithoutOpen = True # No bullshit" bgp.py:844
[13:21:08] what about it?
[13:21:38] funny comment :)
[13:22:22] I'm studying bgp.py to increase test coverage, hopefully it will help with T188085
[13:22:22] T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[13:23:37] good, that's the biggest remaining gap
[13:23:54] i have some unsubmitted patches for unit test coverage of server/coordinator.py too
[13:24:16] i wrote them like a month ago on the plane and haven't had a chance to get back to it since
[15:25:04] _joe_: what would you suggest as a way to tackle https://phabricator.wikimedia.org/T154801#3989098? Basically we want to ensure that a minimum percentage of backends is always pooled in a given DC.
[15:25:31] backends=varnish backends in this context
[15:26:24] <_joe_> ema: annual planning, sorry
[15:26:55] <_joe_> just re-ask me on monday :P
[15:27:02] _joe_: no worries, this is not urgent. Just wanted to pick your brain but it can definitely wait :)
[15:27:21] <_joe_> I'd be very happy to work on this instead than annual plans, trust me
[15:38:46] ema: https://gerrit.wikimedia.org/r/#/c/413740/
[15:39:15] TL;DR - the fini bug you hit in netmapper when testing on cp5 was because cp5 dont' have netmapper databases in the first place (yet)
[15:39:33] and the code didn't handle the case od destructing something that never happened to exist...
[15:40:01] (pretend I corrected all those typos above, it's first-coffee-30)
[15:41:01] I should file a separate task about that so we can track down whomever is currently capable/responsible.
[15:41:18] there's a whitelist somewhere on the zerowiki side that allows our caches to pull netmapper files, and eqsin isn't in it.
[15:41:51] (I was hoping the recent update to wgSquidServersNoPurge might be what they were referencing, but I guess not!)
[15:44:21] 10Traffic, 10Operations, 10Patch-For-Review: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#3996469 (10BBlack)
[15:44:23] 10Traffic, 10Operations: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3996468 (10BBlack)
[15:46:13] I couldn't find it by staring at the code. It wasn't until I was tracing the updater threads that it hit me what was going on.
[15:46:41] for future reference, when looking at "thread names" (comm field, which pthreaded-programs can override with calls like pthread_setname_np to give per-thread naming)
[15:47:09] there's only one true "cache-main" thread named by varnish. All other "cache-main" threads in the same process are the netmapper updater threads, since they spawn from that one and don't set their own name.
[15:47:27] maybe should add a name to netmapper while we have to rebuild anyways
[15:47:34] yeah, that would be great
[15:50:12] I guess that's more in general true for all vmods? all cache-main threads but one are vmod-spawned?
[15:51:10] no
[15:51:48] netmapper is "special", vmods in general just have code hooks in them that run in various existing varnishd-owned/managed threads (like main or workers)
[15:52:03] but netmapper spawns a new pthread without varnishd's involvement :P
[15:52:47] (so that it can async-reload its database files when they change on disk without taking a lock or perf hit in the worker-thread runtime lookups)
[15:54:39] that's what all the "Crazy hack" section is about. managing the side-thread without varnish providing any sane hooks for vmods to do so in the general case
[15:55:38] oh that's not accurate, the "Crazy hack" section is about managing our RCU stuff wrt cache-worker thread lifetimes.
[15:55:47] vcl_fini takes care of the side-thread
[15:56:04] either way... I don't think the vmod API really envisioned things like side-threads or RCU
[15:56:56] vcl_fini, which so far we are not calling as it's only invoked when triggered by vcl.discard
[15:58:31] right, which means our resource leaks include, in the common case, 2x threads per frontend VCL
[15:58:54] which all sit in a loop sleeping for 89s then doing a stat on a JSON file (and reloading it if it changed)
[15:59:32] if we stack up 20 cold VCLs, and zero updates the carriers.json file, it gets reloaded/processed 20 times in parallel pointlessly :)
[16:03:02] anyways, added thread name thing
[16:04:32] hmmm, maybe I haven't pulled in a while. I bumped the master version from 1.4 to 1.5, but I see 1.5 is the package already deployed...
[16:04:50] yeah
[16:06:16] fixed!
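The updater behaviour described above (a side-thread that sleeps, stats the JSON database, and reloads it only when the file changed on disk) can be sketched as a single iteration of that loop. This is an illustrative Python sketch of the technique, not vmod_netmapper's actual C internals; the function name and signature are made up:

```python
import json
import os

def maybe_reload(path, last_mtime, interval=89):
    """One iteration of an mtime-gated reloader: stat `path`, and if its
    mtime differs from last_mtime, re-parse the JSON and return
    (data, new_mtime); otherwise return (None, last_mtime).

    The real updater thread repeats this roughly every `interval`
    seconds, which is why each leaked cold-VCL thread re-parses the
    file in parallel whenever it changes.
    """
    mtime = os.stat(path).st_mtime
    if mtime == last_mtime:
        return None, last_mtime  # unchanged: no parse, no lock, no work
    with open(path) as f:
        return json.load(f), mtime
```

The point of gating on mtime is that the common case (file unchanged) costs only a stat, so the worker-thread lookups never need to take a lock for reloads.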
[16:07:45] ah, I was just coming back here to say that
[16:08:17] and it looks like the varnishapi-dev dependency on the debian branch is now >= 5.1.3
[16:08:48] may be easiest to build the new package for v5 only, and fix cold VCL after finishing the v5 conversion of text
[16:09:21] +1
[16:10:06] supporting two different varnish versions in prod really is a nightmare
[16:12:50] wait till we're parallel-patching backend-VCL + Lua :)
[16:22:03] oh my god I'm so happy that setting WIKIMEDIA_EXPERIMENTAL works to pull in varnishapi-dev>=5.1.3
[16:26:01] testing on cp5004
[16:26:26] oh good, because I patched blindly, I'm not even 100% sure it compiles :)
[16:26:39] it does!
[16:28:04] so another way we could resolve the v5-vs-netmapper thing, is that since cp5 are the only ones missing database files in practice
[16:28:35] oh wait, after upgrading vmod-netmapper we need to restart varnish-fe to use the updated vmod
[16:28:43] we could do the cold VCL cleanup stuff now, and just exclude it from cp5 until either of (zero giving us files for eqsin | text v5 upgrade complete -> netmapper upgrade) happens
[16:28:48] which means we lose access to all those nice cold vcls ready to be discarded :)
[16:28:58] well you can spam some vcl-reload -n frontend
[16:29:36] sure, they take a while to cool down though
[16:36:23] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996633 (10BBlack) p:05Triage>03High
[16:36:48] 10Traffic, 10Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#3996648 (10BBlack)
[16:39:27] bblack: child survived discard, threads show up as netmap \o/
[16:39:41] merging, packaging, and all that
[16:41:19] ema: \o/
[16:48:32] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996678 (10BBlack) Assuming it is a whitelist of the private networks containing prod caches, the new additions to the list for ipv6+ipv4 would be: ``` 2001:df2:e500:10...
[17:14:05] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996786 (10Mholloway) I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that zerofetch.py is using the deprecated `acti...
[17:39:30] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996911 (10BBlack) >>! In T188111#3996786, @Mholloway wrote: > I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that z...
[20:22:18] 10Traffic, 10Operations, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3997484 (10Mholloway) >>! In T188111#3996911, @BBlack wrote: > I looked into this a little bit, and while I do see there's a deprecation warning is...
[20:41:51] 10Traffic, 10Operations, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3997503 (10BBlack) Merged your patch (thanks). New failure in eqsin is: `Exception: API login phase2 gave result Failed with reason "Incorrect us...
[22:59:05] 10Traffic, 10Operations, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3998054 (10BBlack) I've tested setting the `HTTPS_PROXY` environment variable before a manual script run from eqsin, causing the request to be prox...
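As background on the `HTTPS_PROXY` test in that last comment: Python's standard library reads proxy settings from the environment at call time, which is why exporting the variable before a manual script run is enough to redirect the request. A minimal illustration (the proxy URL is a made-up placeholder, not any real prod proxy, and zerofetch.py's actual request code may differ):

```python
import os
import urllib.request

# urllib (and libraries built on its getproxies() helper) consults
# HTTPS_PROXY / https_proxy from the environment each time proxies
# are resolved, so no code change is needed to route via a proxy.
os.environ["HTTPS_PROXY"] = "http://webproxy.example:8080"  # hypothetical proxy
proxies = urllib.request.getproxies()
print(proxies["https"])
```

Clearing the variable again restores direct connections for subsequent requests.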
[23:12:49] 10Traffic, 10Operations, 10Zero, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3998106 (10Mholloway)
[23:12:57] 10Traffic, 10Operations, 10Zero, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996633 (10Mholloway)
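As a footnote to the T188085 discussion earlier in the log (pybal stuck in OPENSENT while the peer reached ESTABLISHED): the slice of the BGP finite state machine involved can be sketched as below. This is a hedged illustration of the RFC 4271 session states, not pybal's actual bgp.py classes; all names here are invented for the sketch:

```python
# The five session states relevant to the bug, in RFC 4271 terms.
IDLE, CONNECT, OPENSENT, OPENCONFIRM, ESTABLISHED = range(5)

class BGPSessionSketch:
    """Toy FSM: a speaker in OPENSENT has sent its OPEN and waits for
    the peer's OPEN; on receiving it, the speaker answers with a
    KEEPALIVE and waits in OPENCONFIRM for the peer's KEEPALIVE before
    reaching ESTABLISHED. If the received-OPEN event is never
    processed, the speaker sits in OPENSENT even though the peer
    (having seen our OPEN and KEEPALIVE) may already be ESTABLISHED."""

    def __init__(self):
        self.state = IDLE

    def connect_and_send_open(self):
        # TCP session up, OPEN sent.
        self.state = OPENSENT

    def receive_open(self):
        # Valid OPEN from peer: reply with KEEPALIVE, advance.
        if self.state == OPENSENT:
            self.state = OPENCONFIRM

    def receive_keepalive(self):
        # Peer's KEEPALIVE confirms the session.
        if self.state == OPENCONFIRM:
            self.state = ESTABLISHED
```

Unit tests for bgp.py along these lines (driving one state transition per event and asserting the resulting state) are the kind of coverage being discussed above.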