[09:02:04] <_joe_> marostegui: https://phabricator.wikimedia.org/T223952#5199667
[09:02:18] <_joe_> TL;DR the problem was concentrated on enwiki, and enwiki alone
[09:02:26] <_joe_> across all appservers and apis
[09:02:43] <_joe_> at least in the two occurrences I checked from today and yesterday
[09:03:01] <_joe_> so I guess we should take a look at the databases, if any sign of struggle is visible there
[09:03:36] I will take a look at the aggregated graphs and the hosts individually
[09:03:36] <_joe_> jijiki, elukey this also means that we can reopen traffic to php7 I think
[09:03:48] <_joe_> marostegui: I'm not sure, it's just a clue
[09:03:55] <_joe_> we had the same slowness on php7 and hhvm
[09:04:12] <_joe_> and this time we didn't exhaust apache workers, so they must have been slow together
[09:04:36] <_joe_> now, I could think of a few things shared across enwiki specifically and enwiki alone. The most important is the databases
[09:04:48] <_joe_> so I'd look there as a next step
[09:04:52] _joe_ lemme grab some traffic from mw1238 before so we can see why mcrouter's traffic is different, but yes +1
[09:05:12] <_joe_> elukey: cool so I'll wait for your green light
[09:05:50] <_joe_> I'm still looking at numbers to confirm if all occurrences from the last couple days show the same behaviour
[09:06:36] jijiki is also rebuilding the prometheus exporter so we'll have a baseline for gets too
[09:07:54] ok one thing at a time
[09:08:11] _joe_: no worries, I will take a look from the DB side and report back on the ticket
[09:08:29] _joe_: we can do it in a couple of hours, I spoke with petr to migrate another job
[09:09:04] <_joe_> jijiki: ok I think you and elukey can coordinate on reopening php7
[09:09:32] <_joe_> I'm off to other things; specifically APC bumping
[09:09:58] yeah we can
[09:50:31] _joe_: This is what I have seen so far https://phabricator.wikimedia.org/T223952#5199794
[09:51:31] <_joe_> yeah I was reading
[09:51:42] <_joe_> basically a few things changed with wmf.3 I would say
[09:52:23] <_joe_> can you concentrate on yesterday and today too? the issues seem to be larger in the last couple days
[09:52:38] <_joe_> and very specifically on enwiki
[09:53:30] <_joe_> but there's also that increase in mean query time on april 17th which is puzzling
[09:53:43] yeah at 12:58
[09:54:06] from SAL 12:58 mobrovac: bootstrap restbase1021-c - T219404
[09:54:07] T219404: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404
[09:55:28] and all the other new ones followed after that, not sure if it could be related at this point
[09:58:08] _joe_: there is nothing from the last 2-3 days really, just an increase on reads (but without changing any metric on the response time)
[09:58:12] just more activity it seems
[09:58:38] <_joe_> so, my next best guess is... nutcracker
[10:06:26] <_joe_> but I find nothing whatsoever there
[10:06:36] <_joe_> nothing in the logs either
[10:07:05] <_joe_> our next best hope would be to go look at sampled perf data from around that time
[11:15:35] for anyone interested ripe78 is on this week https://ripe78.ripe.net/live/main/
[13:04:43] so.. do we have a strong opinion on using create_resources() in puppet?
[13:05:12] I know sh.dubsh does ;)
[13:06:25] it always seemed like a useful tool to me in the past, without which certain things are very hard to factor out
[13:06:36] but, I think modern puppet syntax makes a lot of things doable without it?
[13:07:23] I can't say I fully understand what I'm about to say but I believe that creating external resources makes the potential future use of Puppet environments very tricky
[13:08:28] I don't think create_resources() is specific to external, it's just a macro-level to iterate the creation of normal resources from a data structure, basically.
[13:08:39] oh, right. nevermind then
[13:08:39] yeah
[13:08:55] although people often do use it with "exported" resources
[13:09:49] looks like we have 42 uses of it currently
[13:10:17] vgutierrez: what are you trying to do? in some cases it makes total sense
[13:11:20] but .each might be enough in most cases
[13:11:56] indeed, .each is enough
[13:36:52] <_joe_> vgutierrez: there are better ways to do things than create_resources
[13:36:58] <_joe_> the simplest is
[13:37:45] <_joe_> $data.each |$name, $params| { myresource { $name: * => $params } }
[13:44:32] marostegui: how dare you add a comment like https://phabricator.wikimedia.org/T223952#5199794 that is so short and without details? :D
[13:44:44] *to add
[13:44:51] (really nice btw)
[13:50:06] elukey: was this you when you saw it? https://media.giphy.com/media/EeIzKI0uDz916/giphy.gif ?
[14:31:09] marostegui: I was more like https://giphy.com/explore/not-bad
[14:31:41] anybody up to a quick DNS change review ? https://gerrit.wikimedia.org/r/#/c/operations/dns/+/511713/
[14:31:47] hahaha
[14:31:48] *for a
[14:38:42] jbond42: thanks!
[14:38:52] np
[14:44:55] jbond42: godog do you know anything about the restbase-dev cron spam ?
[14:45:21] jijiki: broken disk
[14:45:50] oh poor disk
[14:46:09] tx volans
[14:46:11] I've also opened T223938 btw
[14:46:12] T223938: facter 3: add timeout to custom facts external calls - https://phabricator.wikimedia.org/T223938
[14:47:31] volans: sorry:/ I didn't catch it
[14:48:15] it caught my attention because -services were doing some changing and I thought that maybe something didn't go well
[14:48:22] changes*
[14:49:00] bizarre, I think that's the first time I see facter fail like that on a broken disk, perhaps that's facter 3
[14:49:05] but yeah what volans said jijiki
[14:50:00] unfortunately we have a bunch of emails that start when there is a broken disk on hosts that use JBOD, like smart-data-dump, the mdadm email, etc.
[14:50:09] *start spamming
[14:51:32] let me see how facter2 reacts
[15:39:15] i have updated T223938 with some more information however tl;dr facter version2 didn't honour the timeout value and the command to resolve the checks takes just over 30 seconds
[15:39:16] T223938: facter 3: add timeout to custom facts external calls - https://phabricator.wikimedia.org/T223938
[15:39:27] s/checks/facts/
[15:40:13]
[15:40:17] whoops
[15:40:26] I was going to say, thanks jbond42 for taking a look
[15:41:43] no probs, fyi if people want to test facter version 2 still you can just install the gem `gem install facter --user` and run it from your home dir `sudo ./.gem/ruby/2.1.0/bin/facter -p lvm_vgs`
[15:42:24] `rm -rf ~/.gem` when done though :)
[18:15:41] Can someone pls invite me to #_security.
[18:15:57] * onimisionipe shamelessly asked
[18:22:17] onimisionipe: you should be on the invite-not-needed list now, try joining again?
[18:26:33] onimisionipe: (let me know if it doesn't work, ofc)
[23:41:02] one more runbook. this for "unmerged changes on ..." https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes