[00:30:42] 10Traffic, 06Commons, 06Operations, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2730435 (10matmarex)
[00:43:09] 10Traffic, 06Commons, 06Operations, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2724723 (10BBlack) If I had to venture a guess after-the-fact, I'd guess that some bad mime-type headers slipped into at least some of the caches for these fil...
[02:00:23] 10Traffic, 06Operations, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2730602 (10BBlack) This will soon be the last cache_misc backend left that doesn't conform to the new normal (single service hostname handled by LVS), so it's becoming a blocker for furthe...
[02:00:54] 10Traffic, 06Operations, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2730603 (10BBlack)
[09:32:01] tlsproxy-related OCD -> https://gerrit.wikimedia.org/r/#/c/316947/
[09:34:51] lol +1 if the result is properly indented :-P
[09:35:15] volans: https://puppet-compiler.wmflabs.org/4447/
[09:52:06] any joy with the jenkins / ssl issues from yesterday? I won't be able to reproduce anymore since my desktop is now packed
[09:55:38] godog: TL;DR blame the jenkins backend
[09:56:29] the slightly longer version is that we were able to reproduce it also when connecting to the other DC, and none of the other sites on the same SSL terminators show that behaviour
[09:57:33] so the idea is that a closed/timed-out connection from the backend chain (nginx-varnish[multi layer]-apache-jenkins) is shown as an SSL error on the client
[10:02:10] ah! thanks volans, odd indeed that this was only jenkins
[10:03:58] godog: also, if connecting to eqiad you don't see the effect on the UI because the browser/varnish retries are able to get all the resources, but you can still see the failures in the network tab
[10:04:47] while connecting to esams, for example, with an additional varnish hop, only some retries succeed, hence you get a broken UI
[10:06:35] ah, none of jenkins is cached by varnish? i.e. all reloads were hitting jenkins
[10:06:54] it's all pass
[10:08:23] heh that explains it alright :( anyways thanks for looking into it, unexpected behaviour heh
[10:18:59] what's with cp1047's exim errors?
[10:33:11] +1, I was about to ask
[10:33:19] (cronspam)
[10:44:02] how dare you make the tlsproxy config files readable :P
[10:44:26] bblack: security by obscurity? :-P
[10:45:12] paravoid: I dunno, *but*, I think cp1047 might be one people have experimented on and left in a different package/repo state than the others...
[10:46:35] yeah, other caches have exim4 4.84.2-2+deb8u1, cp1047 has 4.87-3~bpo8+1
[10:46:50] I think it has to do with how the experimental repo is set up? it may be using experimental by default or something?
[10:46:58] (set up on that host)
[10:48:06] no, not experimental, "backports"
[10:48:32] apt-cache policy on cp1047 selects backports, on the others it doesn't
[10:51:47] even the priorities look the same on the exim4 policy, but it still selects backports
[10:51:50] hmmmm
[10:54:09] everything in /etc/apt/ seems identical to the other hosts though
[11:01:33] maybe some component from backports was pulled in manually and exim was updated indirectly via some dependency?
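(A minimal sketch, not part of the log: the apt-cache policy check discussed above can be scripted per host. It parses the Installed/Candidate lines of apt-cache policy output and flags a candidate version that would come from backports. The package name exim4 and the "bpo" substring test are assumptions taken from the conversation.)

    #!/usr/bin/env python3
    """Hypothetical helper: report installed vs candidate version for a
    package and warn when apt would pull the candidate from backports."""

    import subprocess


    def apt_policy(package):
        """Parse 'Installed:' and 'Candidate:' from apt-cache policy output."""
        out = subprocess.check_output(
            ["apt-cache", "policy", package], universal_newlines=True
        )
        info = {}
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("Installed:"):
                info["installed"] = line.split(":", 1)[1].strip()
            elif line.startswith("Candidate:"):
                info["candidate"] = line.split(":", 1)[1].strip()
        return info


    if __name__ == "__main__":
        pol = apt_policy("exim4")  # package name assumed from the discussion
        print("installed: %(installed)s  candidate: %(candidate)s" % pol)
        if "bpo" in pol.get("candidate", ""):
            print("WARNING: apt would pull this package from backports")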
[11:15:12] I have no idea
[11:15:58] we have a general problem in the cp* world that people (myself included, but I think many people) occasionally experiment on one of the 101 servers and install arbitrary/crazy test packages, and then some host ends up in a different state than the rest, sometimes significantly, as in this case
[11:16:26] I find leftovers of experiments when checking package upgrade lists and whatnot more often than I'd like
[11:18:00] yeah, I often run into these when checking debdeploy failures (if the error shows up in an inconsistent dpkg state)
[11:18:32] should we just reimage a stateless host after experiments that have left it in a weird state?
[11:22:11] volans: ideally, but reimaging has in the past been pretty non-trivial to manage
[11:22:55] it would be nice to rope off experiments to just one host, but then there are often reasons experiments need to be in a given cluster and/or DC, so you really need 16 designated test hosts at that point
[11:23:17] and it's not always just a test/experiment, sometimes it's installing packages to use them to investigate something about live traffic or whatever
[11:23:40] moritzm: strange, exim4 4.87 appears only in term.log, not in history.log (/var/log/apt), and together with many others, but with no timestamp in term.log
[11:24:14] maybe once it flipped to backports it kept upgrading from there in a normal upgrade or dist-upgrade
[11:24:45] there was a dist-upgrade yesterday, but according to history.log it upgraded just 2 packages
[11:25:14] yeah, they've all had a dist-upgrade yesterday at some point
[11:25:36] (and also not that long ago for the big jessie point release)
[11:26:14] going to see what happens if I try to install non-backports exim4
[11:26:16] and apt-cache rdepends exim4 is quite long
[11:27:05] bblack: the dist-upgrade of 2016-10-18 installed a lot of bpo packages
[11:27:44] and exim4 was upgraded in that run
[11:27:49] sorry, I was misreading before
[11:27:53] exim4:amd64 (4.84.2-2+deb8u1, 4.87-3~bpo8+1)
[11:28:17] yeah
[11:28:32] the question is why, since the /etc/apt/ contents are the same and it didn't happen on other servers
[11:28:58] after a manual:
[11:29:00] apt-get install exim4=4.84.2-2+deb8u1 exim4-base=4.84.2-2+deb8u1 exim4-daemon-light=4.84.2-2+deb8u1 exim4-config=4.84.2-2+deb8u1
[11:29:10] it now doesn't want to re-upgrade to backports, and policy selects the installed one
[11:29:31] I also checked the timestamps of /etc/apt/preferences.d
[11:29:37] and dist-upgrade doesn't want to mess with it either
[11:30:27] was it an explicit dist-upgrade -t jessie-backports?
[11:30:35] although history.log says: Commandline: apt-get -y dist-upgrade
[11:30:48] I dunno
[11:32:34] yeah, there was no -t
[11:37:29] 10Traffic, 06Commons, 06Operations, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2731008 (10Aklapper) p:05High>03Low Lowering priority as this cannot be reproduced anymore.
[11:43:47] weird
[11:44:51] my last suggestion to fix all related things in the long term was to just start a policy of constant slow reimaging
[11:45:28] as in: line up a standard schedule where we reinstall one cache node every day from some list as just normal process, taking ~4 months per cycle of reinstalling them all.
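(Another illustrative sketch, not part of the log: the manual digging through /var/log/apt/history.log described above can be approximated by scanning each logged run for upgrades whose new version contains "bpo". The field names are the standard apt history.log ones; rotated history.log.*.gz files are ignored here for brevity.)

    #!/usr/bin/env python3
    """Hypothetical helper: list apt runs in history.log that upgraded
    anything to a ~bpo (backports) version, with the run's command line."""

    import re

    LOG = "/var/log/apt/history.log"


    def backports_upgrades(path=LOG):
        """Yield (start_date, commandline, [(pkg, old, new), ...]) per run."""
        start, cmdline, hits = None, None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith("Start-Date:"):
                    start, cmdline, hits = line.split(": ", 1)[1], None, []
                elif line.startswith("Commandline:"):
                    cmdline = line.split(": ", 1)[1]
                elif line.startswith("Upgrade:"):
                    # entries look like "pkg:arch (old-version, new-version)"
                    for pkg, old, new in re.findall(
                        r"(\S+?):\w+ \(([^,]+), ([^)]+)\)", line
                    ):
                        if "bpo" in new:
                            hits.append((pkg, old, new))
                elif line == "" and hits:
                    yield start, cmdline, hits
                    start, cmdline, hits = None, None, []
        if hits:
            yield start, cmdline, hits


    if __name__ == "__main__":
        for start, cmdline, hits in backports_upgrades():
            print("%s  (%s)" % (start, cmdline))
            for pkg, old, new in hits:
                print("  %s: %s -> %s" % (pkg, old, new))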
[11:46:20] that could even be automated maybe, if we had some way to easily reboot-to-PXE and the installation finished up without intervention for puppet keys/storedconfig, salt keys, etc
[11:46:32] ii systemd 230-7~bpo8+2 amd64 system and service manager
[11:46:37] this host needs a reimaging
[11:46:49] root@cp1047:~# dpkg -l |grep -c bpo8
[11:46:49] 62
[11:47:00] it also has a backports kernel installed, but it's not booted into it
[11:47:03] including systemd..
[11:47:09] yeah
[11:47:18] I'm frankly surprised that everything still works
[11:47:25] :)
[11:47:36] but I'd depool it regardless :)
[11:48:02] brb
[11:48:14] paravoid: yes, the dist-upgrade of the 18th was with all bpo's
[11:59:21] 10Traffic, 06Operations, 10ops-esams: cp3009 hw issues - https://phabricator.wikimedia.org/T148722#2731071 (10BBlack)
[12:00:02] 10Traffic, 06Operations: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2731084 (10BBlack)
[12:00:42] bblack: there is also https://phabricator.wikimedia.org/T148422 about cp3009
[12:01:01] (saw it yesterday while checking tickets)
[12:03:04] 10Traffic, 06Operations, 10ops-esams: cp3009 hw issues - https://phabricator.wikimedia.org/T148722#2731105 (10BBlack)
[12:03:26] 10Traffic, 06Operations, 10ops-esams: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2722360 (10BBlack) It's depooled from service as of yesterday as well (didn't see this ticket!).
[12:03:33] elukey: thanks :)
[12:34:29] porting varnishrls is proving more interesting than expected :)
[12:34:46] 1) the varnishapi.py issue reported here: https://github.com/xcir/python-varnishapi/issues/65
[12:35:15] 2) we should now use -q instead of relying on -I and friends
[12:35:40] 3) for some funny reason I don't get any matching transactions as soon as I use -i
[12:37:33] regarding 3), even with a tuple as simple as (('n', 'frontend'), ('i', 'ReqURL')), no transactions match
[12:46:27] uh, ok, 3) is due to VarnishLogProcessor.handle_log_record expecting Timestamp and Resp
[12:58:20] buuuuu
[12:58:22] :)
[12:59:10] elukey: finally managed to print tags related to /w/load.php :)
[12:59:50] \o/
[13:29:03] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Release-Engineering-Team, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2471564 (10elukey) ping :)
[13:33:01] 10Traffic, 10ArticlePlaceholder, 06Operations, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2731387 (10elukey) p:05Triage>03Normal
[13:42:44] if anybody has time for https://phabricator.wikimedia.org/T148412, I'd like to get some suggestions
[13:43:17] I am not sure what kind of issue this is
[13:43:54] from a newbie perspective it seems that the varnish frontend selects the backend, then something doesn't go as planned and the request ends up in a weird state (at least for logging)
[13:44:12] it doesn't happen often, but it triggers our dear data consistency checks
[13:45:43] maybe something weird is happening in the chain frontend -> (local DC) backend -> eqiad backend
[14:29:05] elukey: does the issue ever arise in eqiad itself? Or is it something you've noticed in esams only?
[14:31:38] I noticed it in eqiad too, but I haven't created varnishlog instances there
[14:31:44] I could try to do it
[14:32:46] yeah, if you think that hopping through multiple varnishes might make the problem worse, it could be interesting to check in eqiad as well
[14:38:14] the main problem is that even if I group by request I can't see a backend req logged
[14:38:21] and this is super weird
[14:38:26] it seems like it doesn't happen
[15:50:10] on the recdns IP switch, statistically it looks like ~99.7% of reqs have moved to the new IP at this point on their own after the puppet change
[15:50:17] so probably the final cleanup won't be too bad
[15:52:56] from a quick check on the traffic, it seems like one big case is that frack must independently have config to hit our recursors
[15:53:06] lots of the remaining reqs are from frack hosts
[16:00:18] 10Traffic, 06Analytics-Kanban, 06Operations, 06Performance-Team, 06Reading-Admin: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2731863 (10dr0ptp4kt)
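(For reference, a minimal sketch related to the varnishrls porting notes earlier in the log: the actual port goes through python-varnishapi, but the same -n/-g/-q/-i combination can be exercised from the shell via the varnishlog utility, including the Timestamp and RespStatus tags the log processor expects. The instance name "frontend" and the /w/load.php pattern are taken from the conversation; the rest is assumed.)

    #!/usr/bin/env python3
    """Illustrative sketch: tail varnishlog for load.php transactions,
    grouped per request and filtered with a VSL query."""

    import subprocess

    # -g request groups records per request, -q filters with a VSL query,
    # and -i limits output to the tags we actually need; omitting the
    # Timestamp/RespStatus tags here is what broke matching in issue 3 above.
    CMD = [
        "varnishlog",
        "-n", "frontend",
        "-g", "request",
        "-q", 'ReqURL ~ "^/w/load.php"',
        "-i", "ReqURL,Timestamp,RespStatus",
    ]


    def tail_loadphp():
        """Print matching log records as they arrive; Ctrl-C to stop."""
        proc = subprocess.Popen(
            CMD, stdout=subprocess.PIPE, universal_newlines=True
        )
        try:
            for line in proc.stdout:
                print(line.rstrip())
        except KeyboardInterrupt:
            proc.terminate()


    if __name__ == "__main__":
        tail_loadphp()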