[00:30:42] 10Traffic, 06Commons, 06Operations, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2730435 (10matmarex)
[00:43:09] 10Traffic, 06Commons, 06Operations, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2724723 (10BBlack) If I had to venture a guess after-the-fact, I'd guess that some bad mime-type headers slipped into at least some of the caches for these fil...
[02:00:23] 10Traffic, 06Operations, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2730602 (10BBlack) This will soon be the last cache_misc backend left that doesn't conform to the new normal (single service hostname handled by LVS), so it's becoming a blocker for furthe...
[02:00:54] 10Traffic, 06Operations, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2730603 (10BBlack)
[09:32:01] tlsproxy-related OCD -> https://gerrit.wikimedia.org/r/#/c/316947/
[09:34:51] lol +1 if the result is properly indented :-P
[09:35:15] volans: https://puppet-compiler.wmflabs.org/4447/
[09:52:06] any joy with the jenkins / ssl issues from yesterday? I won't be able to reproduce anymore since my desktop is now packed
[09:55:38] godog: TL;DR blame the jenkins backend
[09:56:29] the slightly longer version is that we were able to reproduce it also when connecting to the other DC, and none of the other sites on the same SSL terminators show that behaviour
[09:57:33] so the idea is that a closed/timed-out connection from the backend chain (nginx-varnish[multi layer]-apache-jenkins) is shown as an SSL error on the client
[10:02:10] ah! thanks volans, odd indeed that this was only jenkins
[10:03:58] godog: also, if connecting to eqiad you don't see the effect on the UI because the browser/varnish retries are able to get all the resources, but you can still see the failures in the network tab
[10:04:47] while connecting to esams, for example, with an additional varnish hop, only some retries succeed, hence you get a broken UI
[10:06:35] ah, none of jenkins is cached by varnish? i.e. all reloads were hitting jenkins
[10:06:54] it's all pass
[10:08:23] heh that explains it alright :( anyways thanks for looking into it, unexpected behaviour heh
[10:18:59] what's with cp1047's exim errors?
[10:33:11] +1, I was about to ask
[10:33:19] (cronspam)
[10:44:02] how dare you make the tlsproxy config files readable :P
[10:44:26] bblack: security by obscurity? :-P
[10:45:12] paravoid: I dunno, *but*, I think cp1047 might be one people have experimented on and left in a different package/repo state than the others...
[10:46:35] yeah, other caches have exim4 4.84.2-2+deb8u1, cp1047 has 4.87-3~bpo8+1
[10:46:50] I think it has to do with how the experimental repo is set up? it may be using experimental by default or something?
[10:46:58] (set up on that host)
[10:48:06] no, not experimental, "backports"
[10:48:32] apt-cache policy on cp1047 selects backports, on the others it doesn't
[10:51:47] even the priorities look the same on the exim4 policy, but it still selects backports
[10:51:50] hmmmm
[10:54:09] everything in /etc/apt/ seems identical to the other hosts though
[11:01:33] maybe some component from backports was pulled in manually and exim was updated indirectly via some dependency?
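(A minimal sketch, not part of the log: the apt-cache policy check discussed above can be scripted per host. It parses the Installed/Candidate lines of apt-cache policy output and flags a candidate version that would come from backports. The package name exim4 and the "bpo" substring test are assumptions taken from the conversation.)

    #!/usr/bin/env python3
    """Hypothetical helper: report installed vs candidate version for a
    package and warn when apt would pull the candidate from backports."""

    import subprocess


    def apt_policy(package):
        """Parse 'Installed:' and 'Candidate:' from apt-cache policy output."""
        out = subprocess.check_output(
            ["apt-cache", "policy", package], universal_newlines=True
        )
        info = {}
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("Installed:"):
                info["installed"] = line.split(":", 1)[1].strip()
            elif line.startswith("Candidate:"):
                info["candidate"] = line.split(":", 1)[1].strip()
        return info


    if __name__ == "__main__":
        pol = apt_policy("exim4")  # package name assumed from the discussion
        print("installed: %(installed)s  candidate: %(candidate)s" % pol)
        if "bpo" in pol.get("candidate", ""):
            print("WARNING: apt would pull this package from backports")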
[11:15:12] I have no idea
[11:15:58] we have a general problem in the cp* world that people (myself included, but I think many people) occasionally experiment on one of the 101 servers and install arbitrary/crazy test packages, and then some host ends up in a different state than the rest, sometimes significantly, as in this case
[11:16:26] I find leftovers of experiments when checking package upgrade lists and whatnot more often than I'd like
[11:18:00] yeah, I often run into these when checking debdeploy failures (if the error shows up in an inconsistent dpkg state)
[11:18:32] should we just reimage a stateless host after experiments that have left it in a weird state?
[11:22:11] volans: ideally, but reimaging has in the past been pretty non-trivial to manage
[11:22:55] it would be nice to rope off experiments to just one host, but then there are often reasons experiments need to be in a given cluster and/or DC, so you really need 16 designated test hosts at that point
[11:23:17] and it's not always just a test/experiment, sometimes it's installing packages to use them to investigate something about live traffic or whatever
[11:23:40] moritzm: strange, exim4 4.87 appears only in term.log, not in history.log (/var/log/apt), and together with many others, but with no timestamp in term.log
[11:24:14] maybe once it flipped to backports it kept upgrading from there in a normal upgrade or dist-upgrade
[11:24:45] there was a dist-upgrade yesterday, but according to history.log it upgraded just 2 packages
[11:25:14] yeah, they've all had a dist-upgrade yesterday at some point
[11:25:36] (and also not that long ago for the big jessie point release)
[11:26:14] going to see what happens if I try to install non-backports exim4
[11:26:16] and apt-cache rdepends exim4 is quite long
[11:27:05] bblack: the dist-upgrade of 2016-10-18 installed a lot of bpo packages
[11:27:44] and exim4 was upgraded in that run
[11:27:49] sorry, I was misreading before
[11:27:53] exim4:amd64 (4.84.2-2+deb8u1, 4.87-3~bpo8+1)
[11:28:17] yeah
[11:28:32] the question is why, since the /etc/apt/ contents are the same and it didn't happen on other servers
[11:28:58] after a manual:
[11:29:00] apt-get install exim4=4.84.2-2+deb8u1 exim4-base=4.84.2-2+deb8u1 exim4-daemon-light=4.84.2-2+deb8u1 exim4-config=4.84.2-2+deb8u1
[11:29:10] it now doesn't want to re-upgrade to backports, and policy selects the installed one
[11:29:31] I also checked the timestamps of /etc/apt/preferences.d
[11:29:37] and dist-upgrade doesn't want to mess with it either
[11:30:27] was it an explicit dist-upgrade -t jessie-backports?
[11:30:35] although history.log says: Commandline: apt-get -y dist-upgrade
[11:30:48] I dunno
[11:32:34] yeah, there was no -t
[11:37:29] 10Traffic, 06Commons, 06Operations, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2731008 (10Aklapper) p:05High>03Low Lowering priority as this cannot be reproduced anymore.
[11:43:47] weird
[11:44:51] my last suggestion to fix all related things in the long term was to just start a policy of constant slow reimaging
[11:45:28] as in: line up a standard schedule where we reinstall one cache node every day from some list as just normal process, taking ~4 months per cycle of reinstalling them all.
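(Another illustrative sketch, not part of the log: the manual digging through /var/log/apt/history.log described above can be approximated by scanning each logged run for upgrades whose new version contains "bpo". The field names are the standard apt history.log ones; rotated history.log.*.gz files are ignored here for brevity.)

    #!/usr/bin/env python3
    """Hypothetical helper: list apt runs in history.log that upgraded
    anything to a ~bpo (backports) version, with the run's command line."""

    import re

    LOG = "/var/log/apt/history.log"


    def backports_upgrades(path=LOG):
        """Yield (start_date, commandline, [(pkg, old, new), ...]) per run."""
        start, cmdline, hits = None, None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith("Start-Date:"):
                    start, cmdline, hits = line.split(": ", 1)[1], None, []
                elif line.startswith("Commandline:"):
                    cmdline = line.split(": ", 1)[1]
                elif line.startswith("Upgrade:"):
                    # entries look like "pkg:arch (old-version, new-version)"
                    for pkg, old, new in re.findall(
                        r"(\S+?):\w+ \(([^,]+), ([^)]+)\)", line
                    ):
                        if "bpo" in new:
                            hits.append((pkg, old, new))
                elif line == "" and hits:
                    yield start, cmdline, hits
                    start, cmdline, hits = None, None, []
        if hits:
            yield start, cmdline, hits


    if __name__ == "__main__":
        for start, cmdline, hits in backports_upgrades():
            print("%s  (%s)" % (start, cmdline))
            for pkg, old, new in hits:
                print("  %s: %s -> %s" % (pkg, old, new))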
[11:46:20] that could even be automated maybe, if we had some way to easily reboot-to-PXE and the installation finished up without intervention for puppet keys/storedconfig, salt keys, etc
[11:46:32] ii systemd 230-7~bpo8+2 amd64 system and service manager
[11:46:37] this host needs a reimaging
[11:46:49] root@cp1047:~# dpkg -l |grep -c bpo8
[11:46:49] 62
[11:47:00] it also has a backports kernel installed, but it's not booted into it
[11:47:03] including systemd..
[11:47:09] yeah
[11:47:18] I'm frankly surprised that everything still works
[11:47:25] :)
[11:47:36] but I'd depool it regardless :)
[11:48:02] brb
[11:48:14] paravoid: yes, the dist-upgrade of the 18th was with all bpo's
[11:59:21] 10Traffic, 06Operations, 10ops-esams: cp3009 hw issues - https://phabricator.wikimedia.org/T148722#2731071 (10BBlack)
[12:00:02] 10Traffic, 06Operations: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2731084 (10BBlack)
[12:00:42] bblack: there is also https://phabricator.wikimedia.org/T148422 about cp3009
[12:01:01] (saw it yesterday while checking tickets)
[12:03:04] 10Traffic, 06Operations, 10ops-esams: cp3009 hw issues - https://phabricator.wikimedia.org/T148722#2731105 (10BBlack)
[12:03:26] 10Traffic, 06Operations, 10ops-esams: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2722360 (10BBlack) It's depooled from service as of yesterday as well (didn't see this ticket!).
[12:03:33] elukey: thanks :)
[12:34:29] porting varnishrls is proving more interesting than expected :)
[12:34:46] 1) the varnishapi.py issue reported here: https://github.com/xcir/python-varnishapi/issues/65
[12:35:15] 2) we should now use -q instead of relying on -I and friends
[12:35:40] 3) for some funny reason I don't get any matching transactions as soon as I use -i
[12:37:33] regarding 3), even with a tuple as simple as (('n', 'frontend'), ('i', 'ReqURL')), no transactions match
[12:46:27] uh, ok, 3) is due to VarnishLogProcessor.handle_log_record expecting Timestamp and Resp
[12:58:20] buuuuu
[12:58:22] :)
[12:59:10] elukey: finally managed to print tags related to /w/load.php :)
[12:59:50] \o/
[13:29:03] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Release-Engineering-Team, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2471564 (10elukey) ping :)
[13:33:01] 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2731387 (10elukey) p:05Triage>03Normal
[13:42:44] if anybody has time for https://phabricator.wikimedia.org/T148412, I'd like to get some suggestions
[13:43:17] I am not sure what kind of issue this is
[13:43:54] from a newbie perspective it seems that the varnish frontend selects the backend, then something doesn't go as planned and the request ends up in a weird state (at least for logging)
[13:44:12] it doesn't happen often, but it triggers our dear data consistency checks
[13:45:43] maybe something weird is happening in the chain frontend -> (local DC) backend -> eqiad backend
[14:29:05] elukey: does the issue ever arise in eqiad itself? Or is it something you've noticed in esams only?
[14:31:38] I noticed it in eqiad too, but I haven't created varnishlog instances there
[14:31:44] I could try to do it
[14:32:46] yeah, if you think that hopping through multiple varnishes might make the problem worse, it could be interesting to check in eqiad as well
[14:38:14] the main problem is that even if I group by request I can't see a backend req logged
[14:38:21] and this is super weird
[14:38:26] it seems like it doesn't happen
[15:50:10] on the recdns IP switch, statistically it looks like ~99.7% of reqs have moved to the new IP at this point on their own after the puppet change
[15:50:17] so probably the final cleanup won't be too bad
[15:52:56] from a quick check on the traffic, it seems like one big case is that frack must independently have config to hit our recursors
[15:53:06] lots of the remaining reqs are from frack hosts
[16:00:18] 10Traffic, 06Analytics-Kanban, 06Operations, 06Performance-Team, 06Reading-Admin: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2731863 (10dr0ptp4kt)
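(For reference, a minimal sketch related to the varnishrls porting notes earlier in the log: the actual port goes through python-varnishapi, but the same -n/-g/-q/-i combination can be exercised from the shell via the varnishlog utility, including the Timestamp and RespStatus tags the log processor expects. The instance name "frontend" and the /w/load.php pattern are taken from the conversation; the rest is assumed.)

    #!/usr/bin/env python3
    """Illustrative sketch: tail varnishlog for load.php transactions,
    grouped per request and filtered with a VSL query."""

    import subprocess

    # -g request groups records per request, -q filters with a VSL query,
    # and -i limits output to the tags we actually need; omitting the
    # Timestamp/RespStatus tags here is what broke matching in issue 3 above.
    CMD = [
        "varnishlog",
        "-n", "frontend",
        "-g", "request",
        "-q", 'ReqURL ~ "^/w/load.php"',
        "-i", "ReqURL,Timestamp,RespStatus",
    ]


    def tail_loadphp():
        """Print matching log records as they arrive; Ctrl-C to stop."""
        proc = subprocess.Popen(
            CMD, stdout=subprocess.PIPE, universal_newlines=True
        )
        try:
            for line in proc.stdout:
                print(line.rstrip())
        except KeyboardInterrupt:
            proc.terminate()


    if __name__ == "__main__":
        tail_loadphp()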