[01:14:56] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Android unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3994992 (10Krinkle)
[01:15:35] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3436479 (10Krinkle)
[03:23:58] 10Traffic, 10Commons, 10Operations, 10Thumbor, 10media-storage: Unable to render file from upload.wikimedia.org "Error 349 ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION" - https://phabricator.wikimedia.org/T170605#3436479 (10BBlack) https://stackoverflow.com/questions/13578428/duplicate-headers-recei...
[08:15:34] current pybal test coverage: https://phabricator.wikimedia.org/P6734
[09:45:11] <_joe_> vgutierrez: to put that into prespective, 2 years ago it was 0%
[09:45:13] <_joe_> :P
[09:45:38] best thing I found so far.. was flake8 running BUT configured to exclude everything
[09:45:39] *sigh*
[09:57:36] <_joe_> vgutierrez: eheh yeah, well, twisted is not pep8 compliant
[09:57:54] <_joe_> and mark's code isn't either :D
[09:58:12] <_joe_> so I'd rather work on the 2.0 version, tbh
[10:07:29] yeah... I need to use a paginator on flake8 bgp.py output
[10:56:09] 10Traffic, 10Operations, 10Pybal: Some etcd connections not established at startup - https://phabricator.wikimedia.org/T188087#3995831 (10ema)
[10:56:35] 10Traffic, 10Operations, 10Pybal: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#3995835 (10ema)
[11:11:51] 10Traffic, 10Operations: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#3995874 (10ema) p:05Triage>03High
[11:12:10] <_joe_> ema: wow it's the week of bugs for everyone
[11:12:28] <_joe_> in addition to pybal's issue, we found two infinite loops in mediawiki
[11:13:34] _joe_: yeah elukey told me about those. We're living interesting times :)
[11:14:07] wow
[11:14:29] <_joe_> vgutierrez: this is definitely *not* our average week :P
[11:14:35] <_joe_> don't get scared
[11:14:41] * ema confirms
[11:14:44] it's not my fault!
[11:14:46] xDDD
[11:14:50] <_joe_> uhm right
[11:15:04] <_joe_> what changed this week? let's revert just in case :P
[11:38:15] * mark wants some style changes in joe's 2.0 code too
[11:38:19] but that we'll have to discuss later
[13:20:45] you gotta love pybal code: "sendNotificationWithoutOpen = True # No bullshit" bgp.py:844
[13:21:08] what about it?
[13:21:38] funny comment :)
[13:22:22] I'm studying bgp.py to increase test coverage, hopefully it will help with T188085
[13:22:22] T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[13:23:37] good, that's the biggest remaining gap
[13:23:54] i have some unsubmitted patches for unit test coverage of server/coordinator.py too
[13:24:16] i wrote them like a month ago on the plane and haven't had a chance to get back to it since
[15:25:04] _joe_: what would you suggest as a way to tackle https://phabricator.wikimedia.org/T154801#3989098? Basically we want to ensure that a minimum percentage of backends is always pooled in a given DC.
[15:25:31] backends=varnish backends in this context
[15:26:24] <_joe_> ema: annual planning, sorry
[15:26:55] <_joe_> just re-ask me on monday :P
[15:27:02] _joe_: no worries, this is not urgent. Just wanted to pick your brain but it can definitely wait :)
[15:27:21] <_joe_> I'd be very happy to work on this instead than annual plans, trust me
[15:38:46] ema: https://gerrit.wikimedia.org/r/#/c/413740/
[15:39:15] TL;DR - the fini bug you hit in netmapper when testing on cp5 was because cp5 dont' have netmapper databases in the first place (yet)
[15:39:33] and the code didn't handle the case od destructing something that never happened to exist...
[15:40:01] (pretend I corrected all those typos above, it's first-coffee-30)
[15:41:01] I should file a separate task about that so we can track down whomever is currently capable/responsible.
[15:41:18] there's a whitelist somewhere on the zerowiki side that allows our caches to pull netmapper files, and eqsin isn't in it.
[15:41:51] (I was hoping the recent update to wgSquidServersNoPurge might be what they were referencing, but I guess not!)
[15:44:21] 10Traffic, 10Operations, 10Patch-For-Review: VCL discards crash varnish frontend child process - https://phabricator.wikimedia.org/T188089#3996469 (10BBlack)
[15:44:23] 10Traffic, 10Operations: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#3996468 (10BBlack)
[15:46:13] I couldn't find it by staring at the code. It wasn't until I was tracing the updater threads that it hit me what was going on.
[15:46:41] for future reference, when looking at "thread names" (comm field, which pthreaded-programs can override with calls like pthread_setname_np to give per-thread naming)
[15:47:09] there's only one true "cache-main" thread named by varnish. All other "cache-main" threads in the same process are the netmapper updater threads, since they spawn from that one and don't set their own name.
[15:47:27] maybe should add a name to netmapper while we have to rebuild anyways
[15:47:34] yeah, that would be great
[15:50:12] I guess that's more in general true for all vmods? all cache-main threads but one are vmod-spawned?
[15:51:10] no
[15:51:48] netmapper is "special", vmods in general just have code hooks in them that run in various existing varnishd-owned/managed threads (like main or workers)
[15:52:03] but netmapper spawns a new pthread without varnishd's involvement :P
[15:52:47] (so that it can async-reload its database files when they change on disk without taking a lock or perf hit in the worker-thread runtime lookups)
[15:54:39] that's what all the "Crazy hack" section is about. managing the side-thread without varnish providing any sane hooks for vmods to do so in the general case
[15:55:38] oh that's not accurate, the "Crazy hack" section is about managing our RCU stuff wrt cache-worker thread lifetimes.
[15:55:47] vcl_fini takes care of the side-thread
[15:56:04] either way... I don't think the vmod API really envisioned things like side-threads or RCU
[15:56:56] vcl_fini, which so far we are not calling as it's only invoked when triggered by vcl.discard
[15:58:31] right, which means our resource leaks include, in the common case, 2x threads per frontend VCL
[15:58:54] which all sit in a loop sleeping for 89s then doing a stat on a JSON file (and reloading it if it changed)
[15:59:32] if we stack up 20 cold VCLs, and zero updates the carriers.json file, it gets reloaded/processed 20 times in parallel pointlessly :)
[16:03:02] anyways, added thread name thing
[16:04:32] hmmm, maybe I haven't pulled in a while. I bumped the master version from 1.4 to 1.5, but I see 1.5 is the package already deployed...
[16:04:50] yeah
[16:06:16] fixed!
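The updater behaviour described above (a side-thread that sleeps, stats the JSON database, and reloads it only when the file changed on disk) can be sketched as a single iteration of that loop. This is an illustrative Python sketch of the technique, not vmod_netmapper's actual C internals; the function name and signature are made up:

```python
import json
import os

def maybe_reload(path, last_mtime, interval=89):
    """One iteration of an mtime-gated reloader: stat `path`, and if its
    mtime differs from last_mtime, re-parse the JSON and return
    (data, new_mtime); otherwise return (None, last_mtime).

    The real updater thread repeats this roughly every `interval`
    seconds, which is why each leaked cold-VCL thread re-parses the
    file in parallel whenever it changes.
    """
    mtime = os.stat(path).st_mtime
    if mtime == last_mtime:
        return None, last_mtime  # unchanged: no parse, no lock, no work
    with open(path) as f:
        return json.load(f), mtime
```

The point of gating on mtime is that the common case (file unchanged) costs only a stat, so the worker-thread lookups never need to take a lock for reloads.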
[16:07:45] ah, I was just coming back here to say that
[16:08:17] and it looks like the varnishapi-dev dependency on the debian branch is now >= 5.1.3
[16:08:48] may be easiest to build the new package for v5 only, and fix cold VCL after finishing the v5 conversion of text
[16:09:21] +1
[16:10:06] supporting two different varnish versions in prod really is a nightmare
[16:12:50] wait till we're parallel-patching backend-VCL + Lua :)
[16:22:03] oh my god I'm so happy that setting WIKIMEDIA_EXPERIMENTAL works to pull in varnishapi-dev>=5.1.3
[16:26:01] testing on cp5004
[16:26:26] oh good, because I patched blindly, I'm not even 100% sure it compiles :)
[16:26:39] it does!
[16:28:04] so another way we could resolve the v5-vs-netmapper thing, is that since cp5 are the only ones missing database files in practice
[16:28:35] oh wait, after upgrading vmod-netmapper we need to restart varnish-fe to use the updated vmod
[16:28:43] we could do the cold VCL cleanup stuff now, and just exclude it from cp5 until either of (zero giving us files for eqsin | text v5 upgrade complete -> netmapper upgrade) happens
[16:28:48] which means we lose access to all those nice cold vcls ready to be discarded :)
[16:28:58] well you can spam some vcl-reload -n frontend
[16:29:36] sure, they take a while to cool down though
[16:36:23] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996633 (10BBlack) p:05Triage>03High
[16:36:48] 10Traffic, 10Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#3996648 (10BBlack)
[16:39:27] bblack: child survived discard, threads show up as netmap \o/
[16:39:41] merging, packaging, and all that
[16:41:19] ema: \o/
[16:48:32] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996678 (10BBlack) Assuming it is a whitelist of the private networks containing prod caches, the new additions to the list for ipv6+ipv4 would be: ``` 2001:df2:e500:10...
[17:14:05] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996786 (10Mholloway) I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that zerofetch.py is using the deprecated `acti...
[17:39:30] 10Traffic, 10Operations, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996911 (10BBlack) >>! In T188111#3996786, @Mholloway wrote: > I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that z...
[20:22:18] 10Traffic, 10Operations, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3997484 (10Mholloway) >>! In T188111#3996911, @BBlack wrote: > I looked into this a little bit, and while I do see there's a deprecation warning is...
[20:41:51] 10Traffic, 10Operations, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3997503 (10BBlack) Merged your patch (thanks). New failure in eqsin is: `Exception: API login phase2 gave result Failed with reason "Incorrect us...
[22:59:05] 10Traffic, 10Operations, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3998054 (10BBlack) I've tested setting the `HTTPS_PROXY` environment variable before a manual script run from eqsin, causing the request to be prox...
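As background on the `HTTPS_PROXY` test in that last comment: Python's standard library reads proxy settings from the environment at call time, which is why exporting the variable before a manual script run is enough to redirect the request. A minimal illustration (the proxy URL is a made-up placeholder, not any real prod proxy, and zerofetch.py's actual request code may differ):

```python
import os
import urllib.request

# urllib (and libraries built on its getproxies() helper) consults
# HTTPS_PROXY / https_proxy from the environment each time proxies
# are resolved, so no code change is needed to route via a proxy.
os.environ["HTTPS_PROXY"] = "http://webproxy.example:8080"  # hypothetical proxy
proxies = urllib.request.getproxies()
print(proxies["https"])
```

Clearing the variable again restores direct connections for subsequent requests.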
[23:12:49] 10Traffic, 10Operations, 10Zero, 10ZeroPortal, 10Patch-For-Review: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3998106 (10Mholloway)
[23:12:57] 10Traffic, 10Operations, 10Zero, 10ZeroPortal: Cannot fetch Zero carriers/proxies JSON files from eqsin - https://phabricator.wikimedia.org/T188111#3996633 (10Mholloway)
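As a footnote to the T188085 discussion earlier in the log (pybal stuck in OPENSENT while the peer reached ESTABLISHED): the slice of the BGP finite state machine involved can be sketched as below. This is a hedged illustration of the RFC 4271 session states, not pybal's actual bgp.py classes; all names here are invented for the sketch:

```python
# The five session states relevant to the bug, in RFC 4271 terms.
IDLE, CONNECT, OPENSENT, OPENCONFIRM, ESTABLISHED = range(5)

class BGPSessionSketch:
    """Toy FSM: a speaker in OPENSENT has sent its OPEN and waits for
    the peer's OPEN; on receiving it, the speaker answers with a
    KEEPALIVE and waits in OPENCONFIRM for the peer's KEEPALIVE before
    reaching ESTABLISHED. If the received-OPEN event is never
    processed, the speaker sits in OPENSENT even though the peer
    (having seen our OPEN and KEEPALIVE) may already be ESTABLISHED."""

    def __init__(self):
        self.state = IDLE

    def connect_and_send_open(self):
        # TCP session up, OPEN sent.
        self.state = OPENSENT

    def receive_open(self):
        # Valid OPEN from peer: reply with KEEPALIVE, advance.
        if self.state == OPENSENT:
            self.state = OPENCONFIRM

    def receive_keepalive(self):
        # Peer's KEEPALIVE confirms the session.
        if self.state == OPENCONFIRM:
            self.state = ESTABLISHED
```

Unit tests for bgp.py along these lines (driving one state transition per event and asserting the resulting state) are the kind of coverage being discussed above.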