[02:37:40] History
[02:38:01] History & deleted by me
[02:42:00] ?
[02:43:39] 10Tool-Labs-tools-Other, 7I18n: [[Intuition:Raun-help p4/en]] i18n issue - https://phabricator.wikimedia.org/T123214#1923799 (10Macofe) 3NEW
[04:15:58] 6Labs, 10Tool-Labs, 5Patch-For-Review: Install python-hunspell (and dictionaries?) - https://phabricator.wikimedia.org/T123192#1923813 (10Yamaha5) when I want to use hunspell it shows this error python >>> import hunspell Traceback (most recent call last): File "", line 1, in ImportError: d...
[08:17:28] 6Labs, 10Tool-Labs, 5Patch-For-Review: Install python-hunspell (and dictionaries?) - https://phabricator.wikimedia.org/T123192#1923906 (10valhallasw) https://bugs.launchpad.net/ubuntu/+source/pyhunspell/+bug/1326518 So: I think we should deinstall the ubuntu packaged python-hunspell, and @Yamaha5 can then u...
[13:11:26] (03CR) 10Hashar: "recheck" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[13:12:13] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[13:37:03] (03CR) 10Hashar: "I have enabled CI ( https://gerrit.wikimedia.org/r/#/c/263344/ ) to run npm/tox . See inline comment on tox.ini to have the test env defin" (031 comment) [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[13:58:29] (03PS51) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296
[13:59:15] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[13:59:44] (03CR) 10Ricordisamoa: "It works!" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[14:01:35] (03CR) 10Ricordisamoa: Initial commit (031 comment) [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[14:05:36] (03PS52) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296
[14:07:26] (03CR) 10Ricordisamoa: "PS52 adds __all__ to myjson to fix flake8" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[14:07:38] (03CR) 10Ricordisamoa: [C: 04-2] Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[14:11:55] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1924086 (10Beetstra) Valhallasw, it spawns many subprocesses to be able to keep up with wikipedia editing. It needs to parse in real time as anti-spam bots and work depend on it.
[14:59:05] (03CR) 10Hashar: "Well done! Poke me as needed for the CI part. Removing myself from reviewers list." [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[16:09:50] (03PS53) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296
[16:09:53] 6Labs: Provide a simple way to backup arbitrary files from instances - https://phabricator.wikimedia.org/T104206#1924272 (10Halfak) Seems like that would be a reasonable strategy to me.
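The hunspell ImportError above is truncated in the paste, and the follow-up comment is cut off mid-suggestion, so the exact recommendation is not visible here. As a minimal sketch only: one way a tool owner could check the system module and fall back to building pyhunspell themselves, assuming a per-tool virtualenv and pip access are acceptable (the venv path and the `hunspell` PyPI package name are illustrative assumptions, not taken from the task):

```bash
# Check whether the system python-hunspell package imports at all:
python -c 'import hunspell; print(hunspell.__file__)' \
  || echo "system python-hunspell is broken or missing"

# Hypothetical per-tool workaround: build pyhunspell inside a virtualenv
# (assumes the hunspell development headers are available on the host).
virtualenv ~/venv-hunspell
. ~/venv-hunspell/bin/activate
pip install hunspell          # pyhunspell's PyPI name, stated as an assumption
python -c 'import hunspell'   # verify the venv copy works
```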
[16:12:15] (03CR) 10Ricordisamoa: "PS53 marks some DragHelper methods with @private" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[16:20:26] (03PS54) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296
[16:26:48] (03CR) 10Ricordisamoa: "PS54 completes documentation for all DragHelper methods" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa)
[17:11:19] YuviPanda: in case you’re looking for something to do… we were puzzling over the proper home of grrrit-wm: https://phabricator.wikimedia.org/T123167
[17:11:49] hey guys, I do not have enough bandwidth, but please keep an eye on https://tendril.wikimedia.org/report/slow_queries?host=^labsdb1003&user=&schema=&qmode=eq&query=&hours=1
[17:11:58] andrewbogott: I fixed that yesterday night, it's back on k8s now
[17:12:06] https://tools.wmflabs.org/replag/
[17:12:46] jynus: ok! is that recovering?
[17:12:46] YuviPanda: great! Will you update & close the ticket?
[17:12:51] andrewbogott: yeah doing so now
[17:12:56] no, it is getting worse
[17:13:01] thanks, sorry for the post-hoc nag
[17:13:06] kill queries if necessary
[17:13:23] have to disconnect now, sorry
[17:14:37] 6Labs, 10Tool-Labs, 10grrrit-wm: Rogue grrrit-wm running... somewhere - https://phabricator.wikimedia.org/T123167#1924427 (10yuvipanda) 5Open>3Resolved a:3yuvipanda It's back on k8s now, and I've killed it from SGE. The reason it didn't work for valhallasw is that my homedir is terribly organized, and...
[17:14:47] ok
[17:15:01] andrewbogott: ok, so looks like we'll have to go and kill some queries now!
[17:16:08] YuviPanda: ‘increasing replag’ = ‘too many giant queries’? In this case, at least?
[17:16:20] s51559
[17:16:25] seems to be running a fair bunch of them
[17:17:23] can you walk me through your process? If you’re feeling patient?
[17:17:36] andrewbogott: so I looked at the tendril page
[17:17:49] andrewbogott: and I can see there that there's huge looking queries from that user
[17:17:55] andrewbogott: now I Need to find out who that user is
[17:18:09] ok, I already don’t know what the tendril page is. There’s a labs-specific one?
[17:18:48] andrewbogott: jynus just pasted that ink
[17:18:50] *link
[17:18:56] https://tendril.wikimedia.org/report/slow_queries?host=^labsdb1003&user=&schema=&qmode=eq&query=&hours=1
[17:18:59] is the page
[17:19:08] getent passwd | grep 51559
[17:19:12] shows me what tool it is
[17:20:01] !log tools stopped webservice for quentinv57-tools
[17:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:20:12] WOW there’s a lot of text on that page
[17:20:35] yeah
[17:20:37] huge queries
[17:21:23] do you think we should kill the existing queries as well?
[17:21:48] andrewbogott: am looking at existing queries now
[17:21:51] you can do that by
[17:21:56] ssh labsdb1003.eqiad.wmnet
[17:21:58] sudo su
[17:22:00] mysql
[17:22:02] show processlist
[17:23:22] andrewbogott: it doesn't look like there's too many of them from that user still running
[17:23:30] andrewbogott: I've to run now to catch train (hopefully!)
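The replag walkthrough above is spread across a dozen messages; pulled together, the investigation looks roughly like this (a sketch of exactly the steps described, assuming Tool Labs admin access and root on the replica):

```bash
# Map the numeric MySQL account from tendril (here s51559) to the owning tool:
getent passwd | grep 51559

# Then inspect what that account is running on the lagging replica:
ssh labsdb1003.eqiad.wmnet
sudo su
mysql
# ...and at the mysql prompt:
#   SHOW PROCESSLIST;
```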
brb
[17:23:48] yep, agreed they must be dying off on their own
[17:23:58] andrewbogott: lag is still increasing tho
[17:24:05] andrewbogott: this definitely requires more mysql knowledge than either of us have
[17:25:29] whoops, I missed this train too lol ok
[17:25:35] so I'm here for another 20mins
[17:25:36] YuviPanda: is killing a query something you do within mysql or is it just a process kill?
[17:25:43] andrewbogott: you can do that in mysql too
[17:25:48] should do that even
[17:26:50] andrewbogott: if you run
[17:26:52] > SELECT ID, USER, TIME FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND = 'Query' AND TIME > 10 ORDER BY TIME DESC;
[17:27:19] it'll list all running queries + their time + id + username
[17:27:22] and we can kill them
[17:28:04] I will start with everything from s51559
[17:28:45] andrewbogott: that's already dead
[17:28:48] I think?
[17:28:59] still has four running queries
[17:29:37] ah
[17:29:40] ok
[17:31:05] ok, I have killed all the big ones.
[17:31:28] shall we send an email to quentinv57-tools explaining why their service is stopped?
[17:32:18] lag is still increasing
[17:32:49] andrewbogott: yeah, can you do that? you can use http://tools.wmflabs.org/contact/ to find them
[17:33:03] YuviPanda: yep
[17:34:40] re: large queries... I have an old quarry -query which seems to be hanging.. if it's not already stopped, it may be stopped.. https://quarry.wmflabs.org/query/6797
[17:34:58] Stigmj: those are usually stopped at the mysql level and querry doesn't notice
[17:35:03] ok
[17:38:05] andrewbogott: am now looking at difference between 'show all slaves status' on labsdb1001 (which is fine) vs 1003 (which isn't)
[17:41:55] YuviPanda: email sent, feel free to follow up if it’s off the mark.
[17:51:01] !log tools kill all queries running on labsdb1003
[17:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:00:18] andrewbogott: was there a page or soemthign I didn't get?
[18:00:22] on replag
[18:00:47] I don’t think it was big enough to page. Jaime just popped in to tell us that it was increasing.
[18:01:07] gotcha
[18:11:31] andrewbogott: YuviPanda https://gerrit.wikimedia.org/r/#/c/263394/
[18:11:36] example http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1452285919.889&target=servers.labstore1001.nfsd.input_output.bytes-read&target=servers.labstore1001.nfsd.input_output.bytes-written&from=-72h
[18:11:49] been running it to catch some data for a bit, seems fine
[18:14:23] !ping
[18:14:23] !pong
[18:16:06] chasemp: that seems useful :)
[18:57:03] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1924967 (10Legoktm) Can we set up a separate exec node for linkwatcher processes?
[19:19:32] 6Labs, 10Tool-Labs: Make gridengine exec hosts also submit hosts - https://phabricator.wikimedia.org/T123270#1925004 (10yuvipanda) 3NEW
[19:24:44] 6Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#1925019 (10valhallasw) Let me first stress that we value your work. Anti-vandalism is important, and it's clear that LiWa makes the jobs of editors much easier. At the same time, Tool Labs is a coop...
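The SELECT quoted at 17:26:52 lists the long-running queries; terminating them is done with MySQL's KILL statement (KILL QUERY stops the statement but keeps the connection open). A sketch of both steps, run as root on the replica; the 10-second threshold and the s51559 filter are simply the values from this incident:

```bash
# List queries that have been running for more than 10 seconds:
mysql -N -B -e "SELECT ID, USER, TIME FROM INFORMATION_SCHEMA.PROCESSLIST
                WHERE COMMAND = 'Query' AND TIME > 10 ORDER BY TIME DESC;"

# Kill every remaining query from the offending account:
for id in $(mysql -N -B -e "SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST
                            WHERE USER = 's51559' AND COMMAND = 'Query';"); do
  mysql -e "KILL QUERY $id;"
done
```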
[19:26:51] 6Labs, 10Tool-Labs, 7Shinken: Lots of hosts' services missing from Shinken - https://phabricator.wikimedia.org/T123271#1925026 (10scfc) 3NEW
[19:45:34] bd808: When you have a moment, could you log in to puppet-andrew-cgroup.puppet.eqiad.wmflabs and confirm that it has the cgroup/whatever settings you want for vagrant?
[19:45:51] And, if so, review the patch responsible: https://gerrit.wikimedia.org/r/#/c/262838/2
[19:46:41] andrewbogott: sure! I'll try to get to it between meetings. I glanced at the patch sometime last week
[19:46:56] thanks
[19:54:36] valhallasw`cloud: I'm about to set maxujobs to 128. objections?
[19:54:43] YuviPanda: sounds good
[19:55:02] YuviPanda: can we puppetize it?
[19:55:19] 6Labs, 10Tool-Labs, 5Patch-For-Review: Install python-hunspell (and dictionaries?) - https://phabricator.wikimedia.org/T123192#1925103 (10scfc) I have filed https://bugs.launchpad.net/ubuntu/+source/pyhunspell/+bug/1532923 for fixing this in Trusty. Meanwhile, Puppet complains everywhere: ``` Error: Could...
[19:56:04] valhallasw`cloud: don't think so unfortunately :(
[19:56:11] oh well
[19:56:16] all of the things there are unpuppetized
[19:56:18] just document it somewhere?
[19:56:23] yeah
[19:56:30] on the tools help page perhaps?
[19:56:47] mhm, I was thinking of the admin page
[19:56:57] general unpuppetized grid engine settings list
[19:57:13] oh yeah
[19:57:16] that too
[19:57:27] valhallasw`cloud: I think the two main things are the scheduler interval which is at 10s now and then this
[19:58:04] ok -- I have no clue, to be honest
[19:58:40] me neither :D
[19:58:46] I'll document these on the admin page now
[19:59:06] !log tools Set maxujobs (max concurrent jobs per user) on gridengine to 128
[19:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:59:52] 6Labs, 10Tool-Labs: Make gridengine exec hosts also submit hosts - https://phabricator.wikimedia.org/T123270#1925111 (10yuvipanda) I've setup a 128 concurrent job limit.
[19:59:54] YuviPanda: shall I test it? >:)
[20:00:03] valhallasw`cloud: :D yes
[20:02:11] ok, 150x 'sleep 5' coming up
[20:02:48] well, I couldn't submit that in 5 secs
[20:03:12] and I have lots of them left in error state
[20:03:50] so I'm not even sure how to test this sensibly :P
[20:04:50] I'll make it a 5 minute sleep instead
[20:05:03] and then figure out why they get sent to an error state
[20:05:15] YuviPanda: assuming maxujobs is concurrent jobs per user or soemthing
[20:05:30] chasemp: yeah, it's per user
[20:05:59] what prompted the change?
[20:06:02] but submitting that many jobs is actually not that easy
[20:06:10] https://phabricator.wikimedia.org/T123121#1924967
[20:06:25] basically, job submission on grid hosts is disabled because of the risk of runaway job submission
[20:07:57] got it, missed this thanks reading up now
[20:10:28] YuviPanda: is it global or per-queue?
[20:10:37] valhallasw@tools-bastion-01:~$ qstat | wc -l
[20:10:37] 225
[20:10:46] but all in the cyberbot queue so they aren't running
[20:11:56] YuviPanda: ah, I think we need max_u_jobs
[20:13:00] max_u_jobs is a parameter of sge_conf (qconf -sconf) and controls the
[20:13:00] total number of active jobs (running, qw, on hold, etc) for each user.
[20:13:01] maxujobs is a parameter of sched_conf (qconf -ssconf) and controls the
[20:13:01] total number of jobs a user may have running.
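The max_u_jobs/maxujobs distinction explained above maps onto two different qconf configurations. A sketch of the corresponding commands, run as an SGE admin user on the grid master:

```bash
qconf -sconf  | grep max_u_jobs   # global config: total active (running + qw + held) jobs per user
qconf -ssconf | grep maxujobs     # scheduler config: jobs a user may have *running* at once

# The matching editors open the configuration in $EDITOR:
qconf -mconf     # edit the global config (max_u_jobs)
qconf -msconf    # edit the scheduler config (maxujobs, the value set to 128 here)
```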
[20:15:44] 6Labs, 10Tool-Labs: Make gridengine exec hosts also submit hosts - https://phabricator.wikimedia.org/T123270#1925177 (10valhallasw) ``` max_u_jobs is a parameter of sge_conf (qconf -sconf) and controls the total number of active jobs (running, qw, on hold, etc) for each user. maxujobs is a parameter of sched_...
[20:41:41] 6Labs, 10Tool-Labs: Make gridengine exec hosts also submit hosts - https://phabricator.wikimedia.org/T123270#1925291 (10scfc) Why? Limiting `max_u_jobs` would effectively limit the number of jobs a user can submit, thus he would have to account for the possibility that `jsub` may fail. This would break many...
[20:53:54] 6Labs, 10Tool-Labs: Make gridengine exec hosts also submit hosts - https://phabricator.wikimedia.org/T123270#1925318 (10scfc) >>! In T60949#616510, @scfc wrote: > (In reply to comment #1) >> Reading the IRC log, I don't quite understand why you need a *node* of your >> own. Apparently, you want to run 200 job...
[21:07:08] 6Labs, 10Tool-Labs, 5Patch-For-Review: Replace all references to "tools" by references to $labsproject in operations/puppet and labs/toollabs - https://phabricator.wikimedia.org/T87387#1925360 (10scfc) 5Open>3Resolved There is one or two occurence of "tools." left, but those are in code paths unlikely to...
[21:15:54] my jstart jobs is in a state of qw..(including cron) is this a labs error?
[21:25:05] rohit-dua: in theory there seems to be capacity https://tools.wmflabs.org/?status
[21:27:22] nothing obviously related in https://phabricator.wikimedia.org/project/feed/832/ although https://phabricator.wikimedia.org/T123270 and https://phabricator.wikimedia.org/T123121 might have affected something
[21:29:46] rohit-dua: which job?
[21:31:28] valhallasw: mworker0 and cron job in bub tool
[21:32:44] hrm.
[21:33:12] valhallasw`cloud: if you load horizon.wikimedia.org do you get graphics, menus, formatting etc?
[21:33:22] hmm, can't get the webservice of the erwin85 tools restarted
[21:33:51] andrewbogott: no css, just html it seems. Looking at SGE now, so can't check in detail
[21:34:05] valhallasw`cloud: that’s fine, just making sure it’s not a local issue. thanks
[21:35:01] ok, it seems I broke SGE :(
[21:37:06] YuviPanda: can we get a little help over here?
[21:37:19] so what happened is https://phabricator.wikimedia.org/T110994 as a result of the load testing earlier
[21:37:53] hello
[21:37:56] sorry, was away for lunch
[21:37:57] plan of attach: reduce "eight number of jobs started in the last X minutes by setting load_adjustment_decay_time to something low and back to 7:30
[21:38:00] * YuviPanda reads backlog
[21:38:03] YuviPanda: do you know how to do that?
[21:38:24] YuviPanda: should be same command as setting maxujobs
[21:38:43] qconf -mconf
[21:38:55] ok, let me do that
[21:39:01] ok
[21:39:31] -msconf?
[21:40:33] !log tools temporarily sudo qconf -msconf to 0:0:1
[21:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:40:48] !log tools that's load_adjustment_decay_time
[21:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:41:12] it's not helping, though.
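The debugging loop that follows is essentially "tweak the scheduler config, watch the qw backlog". A sketch of that loop, using the same commands that appear later in the log; the 10-second refresh interval is arbitrary:

```bash
# Count jobs stuck in the qw (queued, waiting) state, refreshed every 10s:
watch -n 10 'qstat -u "*" | grep qw | wc -l'

# load_adjustment_decay_time and job_load_adjustments live in the scheduler
# configuration, edited interactively with:
sudo qconf -msconf
```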
[21:41:35] hmm
[21:41:40] I'm reading throguh the bug still
[21:41:45] !log tools currently 353 jobs in qw state
[21:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:42:44] !log tools resetting to 0:7:30, as it's not having the intended effect
[21:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:43:42] !log tools qstat -j shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
[21:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:43:52] valhallasw`cloud: should we restart master?
[21:44:59] thta's an option
[21:45:05] objections to doing it?
[21:45:09] alternatively ,we can set job_load_adjustments to 0
[21:45:19] none, I think
[21:45:26] shouldnt affect queued jobs
[21:45:30] but may not solve the issue
[21:45:35] done
[21:45:40] ah, number is going down slowly now
[21:45:41] !log tools restarted gridengine master
[21:45:43] back to 342
[21:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:47:20] 309
[21:47:33] wooo
[21:47:59] the numbers are still the same though (~14 started jobs in the last N minutes, times 0.5)
[21:49:58] valhallasw`cloud: where are you getting the number from?
[21:50:06] qstat -u "*" | grep qw | wc -l
[21:50:09] now back to 320
[21:50:56] and 336
[21:50:59] so that didn't actually help
[21:51:24] valhallasw`cloud: so job_load_adjustments to 0 now?
[21:51:36] no
[21:52:11] another issue is that people clearly assume they don't need -once
[21:52:44] although
[21:53:29] akoopal: webservice/lighttpd should be OK actually?
[21:55:07] YuviPanda: let's first try job_load_adjustments_decay_time = 0:0:0?
[21:55:16] ok
[21:55:49] I'm watching the qw numbers
[21:55:55] valhallasw`cloud: you're doing the mconf?
[21:55:59] !log tools set job_load_adjustments_decay_time = 0:0:0
[21:56:00] ya
[21:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:56:27] doesn't help -- it's clear now, but jobs are still not starting, it seems?
[21:57:23] !log tools that cleared the measure, but jobs still not starting. Ugh!
[21:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:57:40] !log tools reset to 7:30
[21:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:57:47] huh?!
[21:57:55] and we're back at queue instance "continuous@tools-exec-1217.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.865000 (= 0.115000 + 0.50 * 14.000000 with nproc=4) >= 1.75
[21:58:06] some lighttpd ones are too
[21:58:08] hmm
[21:58:29] valhallasw`cloud: I wonder what happens if we just drop all those?
[21:58:47] the queued jobs?
[21:59:30] yeah
[21:59:33] don't think that'll be of much help
[22:00:44] there's also 1400 jobs running, which is 400 more than two hours ago
[22:01:07] oh no, sorry
[22:01:19] subtract queued, gives 1000, so that's the same, roughly
[22:04:22] YuviPanda: I'm going to reset maxujobs
[22:04:25] ok
[22:04:29] I can't think of a reason why that would be the reason
[22:04:35] it was 0, right?
[22:04:37] yeah
[22:04:41] and we set it to 128
[22:04:44] 0 was unlimited
[22:04:46] well
[22:04:53] unless maxujobs is for *everyone* for some reason?
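For the record, the "overloaded" message quoted at 21:57:55 appears to combine the raw normalized load with a per-recently-started-job penalty divided by the core count; this is a reconstruction from the quoted numbers, not from SGE documentation:

```bash
# 0.115 (raw np_load_avg) + 0.50 (job_load_adjustment) * 14 (recent jobs) / 4 (nproc)
echo '0.115 + 0.50 * 14 / 4' | bc -l   # -> 1.865, which is >= the 1.75 queue threshold
```

That is why removing the 0.50 penalty entirely (job_load_adjustments = NONE) is the next thing tried below.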
[22:05:09] yeah, that's the only thing I can think of
[22:05:45] no luck
[22:05:53] !log tools set maxujobs back to 0, but doesn't help
[22:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:06:19] so for some reason the counter of jobs in the last X minutes doesn't seem to decrease
[22:06:29] ok, well, let's set job_load_adjustments to none.
[22:06:36] ok
[22:07:41] !log tools set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
[22:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:08:50] still nothing
[22:09:26] hmm
[22:11:10] back from 426 to 394
[22:11:26] I see it at 448 now
[22:11:41] and 449
[22:11:53] sorry, was excluding merl jobs
[22:12:09] ya, 449.
[22:12:19] oh, was it higher?
[22:12:26] restart master again?
[22:12:30] it's climbing
[22:12:32] yeah doing
[22:12:40] !log tools restarted gridengine master again
[22:12:43] maybe it needs a restart for the conf change to actually work?
[22:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:12:47] it'd ropped
[22:12:49] 406
[22:12:52] valhallasw`cloud: hmm that's a good point. maybe
[22:13:14] it's holding at 406 now
[22:13:20] and climbing again
[22:13:25] 01/11/2016 22:13:09|worker|tools-grid-master|W|Skipping remaining 360 orders
[22:13:25] 01/11/2016 22:13:09|schedu|tools-grid-master|E|scheduler tries to schedule job 2221233.1 twice
[22:13:25] 01/11/2016 22:13:18|worker|tools-grid-master|E|scheduler tries to schedule job 2221233.1 twice
[22:13:59] hmm
[22:14:02] fractional job ids?!
[22:14:04] log files, super useful.
[22:14:13] that fixed it
[22:14:31] yeah
[22:14:34] 50 now
[22:14:43] which is all the merl jobs
[22:14:45] so does qconf need restarts?
[22:15:25] I qdel'ed 2221233
[22:15:28] I think that fixed it
[22:15:38] ah that's why I didn't find it
[22:15:42] what is 221233?
[22:15:44] *was?
[22:15:48] was that one of your jobs?
[22:15:55] no, phetools
[22:16:04] but scheduled around the same time
[22:16:04] did it have a timestamp?
[22:16:06] ah
[22:16:09] ok
[22:16:11] submission_time: Mon Jan 11 21:00:22 2016
[22:16:16] lol, so does this mean bdb doesn't really have transactions?
[22:16:27] should we put the user max jobs back?
[22:17:21] I think this was the DB corruptio nagain
[22:18:24] right
[22:18:28] that's scary
[22:18:45] lesson learned: *read the messages file*
[22:19:09] ok, let me set back config
[22:19:49] !log tools reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30
[22:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:20:41] ok, everything seems alive again.
[22:21:13] woooo
[22:21:15] valhallasw`cloud: <3
[22:23:34] "Somehow sending an e-mail to labs-l seems to resolve issues magically" -- valhallasw`cloud
[22:23:56] labs-l is doomed next time this happens :p
[22:26:26] 6Labs, 10Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1925619 (10Andrew) this should be at least partially fixed now. The issue is that python-compressor was doing: from django.utils import simplejson which didn't work because modern versions of django don't inclu...
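The actual fix above came from reading the qmaster messages file and then a plain qdel. As a sketch only (the spool path below is the stock Debian gridengine layout and may well differ on this installation):

```bash
# Watch the qmaster log for scheduler errors such as
# "scheduler tries to schedule job 2221233.1 twice":
tail -f /var/spool/gridengine/qmaster/messages

qstat -j 2221233   # inspect the stuck job (this is where submission_time came from)
qdel 2221233       # deleting it is what unwedged the scheduler here
```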
[22:28:20] Krenair: hahahaha
[22:28:41] or, y'know, anything goes wrong in labs
[22:29:58] cetero censeo SGEm esse delendam
[22:33:28] 6Labs, 10Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1925645 (10hashar) Looks fine to me. JS/CSS is back and I can browse the UI. The last oddity which might or might not be related and for which maybe a task has been opened is : we can't switch between projects.
[22:49:22] 6Labs, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 7WorkType-Maintenance: deployment-mediawiki03 : apt broken trying to reach out webproxy.eqiad.wmnet - https://phabricator.wikimedia.org/T122953#1925722 (10greg)
[22:50:00] I killed wikibugs again with a bulk action in phab
[22:50:24] heh
[22:57:04] 6Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure, 6operations, and 2 others: Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins) - https://phabricator.wikimedia.org/T122731#1925848 (10gr...
[22:57:06] gonna do it again :)
[23:02:48] 6Labs, 10Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1925951 (10Andrew) project-switching is unrelated, it has to do with incompatible changes in in horizon vs. the keystone api that we're using. I'm working on it.
[23:02:52] 6Labs, 10Labs-Infrastructure: [Horizon] Design broken - https://phabricator.wikimedia.org/T120646#1925952 (10Andrew) 5Open>3Resolved a:3Andrew
[23:03:55] 6Labs: Unable to change projects in horizon - https://phabricator.wikimedia.org/T123310#1925969 (10Andrew) 3NEW a:3Andrew
[23:28:06] andrewbogott: I got as far as logging into puppet-andrew-cgroup.puppet.eqiad.wmflabs but haven't actually figured out how to test for the cgroup working as hoped. The only place I've hit it before has been related to firing up an LXC container.
[23:29:09] bd808: you’re welcome to set up vagrant on that box, or whatever you need. Can I help somehow?
[23:29:30] That instance is a one-off just for testing this. I can make another one in a different project if that helps
[23:31:02] * bd808 checks to see if he has sudo there...
[23:31:38] andrewbogott: I think I can test by manually setting up for LXC if it's fair game to mess with thing
[23:31:45] yep, feel free
[23:39:16] andrewbogott: hey!
[23:39:26] * andrewbogott waves
[23:39:28] andrewbogott: so I just created a project and it has the default security groups stuff missing
[23:39:34] andrewbogott: should I leave it to debug or just fix it manually?
[23:39:42] fix manually
[23:39:47] ok
[23:39:50] it seems to happen about 1/10 of the time :(
[23:40:08] you could try deleting and recreating also
[23:56:28] 6Labs: Access needed to mwui.wmflabs.org - https://phabricator.wikimedia.org/T123316#1926132 (10Volker_E) 3NEW
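The cgroup question at 23:28:06 is left open in the log. As a generic sketch of how one might sanity-check cgroup support on such an instance before trying LXC (standard Linux/LXC commands, not what was actually run on puppet-andrew-cgroup):

```bash
cat /proc/cgroups        # controllers the kernel knows about and whether they are enabled
mount -t cgroup          # which controller hierarchies are currently mounted
ls /sys/fs/cgroup        # per-controller mount points on a sysfs-style layout
lxc-checkconfig          # summarizes kernel support, if the lxc package is installed
```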