[00:37:59] Tool-Labs, Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1188169 (JAnstee_WMF)
[00:39:34] Tool-Labs, Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1181554 (JAnstee_WMF) Thanks for the tag, changed it to our team rather than specific reporting project task group.
[03:35:24] Labs, Tool-Labs, Labs-Q4-Sprint-2, Patch-For-Review, ToolLabs-Goals-Q4: Send metrics from service manifest monitor to graphite - https://phabricator.wikimedia.org/T95256#1188399 (yuvipanda) Open>Resolved a: yuvipanda
[03:35:26] Labs, Tool-Labs, Labs-Q4-Sprint-2, Patch-For-Review, ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1188401 (yuvipanda)
[03:35:37] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Create debian package for service manifest monitor - https://phabricator.wikimedia.org/T95255#1188402 (yuvipanda)
[03:36:01] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Create debian package for service manifest monitor - https://phabricator.wikimedia.org/T95255#1184555 (yuvipanda) It's a python3 package, and I've packaged the dependency (statsd). Now needs an upstart script + packaging.
[03:55:09] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[04:20:12] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:08:10] Labs: Investigate spikes in Labs NFS network usage - https://phabricator.wikimedia.org/T95392#1188473 (yuvipanda) NEW
[05:11:48] YuviPanda: Still awake? You need to sign up on Melange as a mentor. Not urgent, but do it within the week. :)
[05:12:06] Niharika: hey! whoops, yes I'll do it tomorrow
[05:12:14] Thanks!
[06:01:54] (CR) Ricordisamoa: Initial commit (1 comment) [labs/tools/ptable] - https://gerrit.wikimedia.org/r/202610 (owner: Ricordisamoa)
[06:36:27] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[06:38:53] RECOVERY - Free space - all mounts on tools-webgrid-04 is OK: OK: All targets OK
[07:01:22] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:04:00] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0]
[07:17:07] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[07:42:07] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:22:18] Tool-Labs: Remove unneeded tools - https://phabricator.wikimedia.org/T91740#1188911 (scfc) It would have been so much more intelligent to remove the service groups //before// the file system stuff as `toolwatcher` recreated the directories in the meantime :-). I'll delete them again after checking.
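A note on the "Send metrics from service manifest monitor to graphite" task and the statsd dependency packaged above: the usual pattern is for a monitor like this to push counters and gauges to a statsd daemon that feeds graphite. Below is a minimal sketch using the Python statsd client; the statsd host, port, metric names and numbers are illustrative assumptions, not the real monitor's configuration.

```python
# Minimal sketch of pushing monitor results to graphite via statsd.
# The statsd host/port and metric names are illustrative assumptions,
# not the actual service manifest monitor configuration.
import statsd

# StatsClient speaks the plain statsd UDP protocol; graphite sits behind it.
client = statsd.StatsClient('statsd.example.wmflabs', 8125,
                            prefix='tools.manifest-monitor')

def report(started, failed):
    """Send one round of monitor results as a counter plus gauges."""
    client.incr('runs')                    # how many times the monitor ran
    client.gauge('webservices.started', started)
    client.gauge('webservices.failed', failed)

if __name__ == '__main__':
    report(started=3, failed=0)
```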
[13:56:33] Tool-Labs: Multiple webservices running for one tool - https://phabricator.wikimedia.org/T76578#1189479 (scfc)
[13:56:35] Tool-Labs: Tool Labs: jsub starts multiple instances of tasks declared as "once" - https://phabricator.wikimedia.org/T62862#1189480 (scfc)
[14:15:03] Tool-Labs, Continuous-Integration: labs-toollabs-debian-glue fails apparently with a timeout - https://phabricator.wikimedia.org/T91247#1189504 (hashar) Open>Resolved a: hashar Maybe some transient issue? We might have had an issue on the slaves when the job ran. From the build history at https://in...
[14:27:04] Wikimedia-Labs-General: Request to access redacted webproxy logfiles of (Tool) Labs - https://phabricator.wikimedia.org/T61222#1189544 (scfc) Open>Invalid a: scfc With the task reporter away, I'm closing this for the moment. This is one of those tasks that require a high degree of interaction betwee...
[15:50:28] hey, if anyone's available, could you assign [[User:L235]] rename privileges on the beta cluster so I can try [[Special:GlobalRenameQueue]]
[15:57:06] L235: maybe. Any particular reason why or just testing?
[15:59:02] JohnFLewis: mainly testing, yeah
[16:00:17] L235: okay, I'll take a look :)
[16:01:05] (and wow, Twinkle is not working correctly)
[16:08:11] L235: couldn't see any groups with the right so
[16:08:16] (Global rights log); 16:07 . . John F. Lewis (Talk | contribs | block) changed global group membership for User:L235 from (none) to globalrenamer (testing)
[16:08:16] (Global rights log); 16:07 . . John F. Lewis (Talk | contribs | block) changed group permissions for Special:GlobalUsers/globalrenamer. Added renameuser, centralauth-rename; Removed (none) (testing)
[16:08:33] JohnFLewis: Thanks!
[16:09:16] L235: and of course if things break, poke here and/or me so they can be resolved :)
[16:09:24] JohnFLewis: yep
[16:09:51] (I'm going to be hit by the rate limit, but I'll complain about that later; I need to go for now)
[16:17:57] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:24:18] Is tools down?
[16:24:19] L235: um, what are you doing?
[16:24:21] > It's not just you! http://tools.wmflabs.org looks down from here.
[16:24:21] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 762729 bytes in 6.036 second response time
[16:24:21] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0]
[16:24:21] back now?
[16:24:22] o/ Coren
[16:31:35] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:31:38] umm down again?
[16:31:38] seems so, I also can't ssh there
[16:32:56] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 762705 bytes in 6.544 second response time
[16:33:24] it's going up and down
[16:42:38] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:42:39] and it's down again. Seems to me like a network issue
[16:51:01] [09:44:27] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[16:51:01] [09:48:58] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [35.0]
[17:00:20] halfak: Heya
[17:00:29] Hey!
[17:00:56] I am just running to a meeting, but I was going to ask about the work on making flask services easy to stand up in tool labs.
[17:01:13] I talked to Yuvi about "flask in a bottle"; do you know if anything came of that?
[17:02:12] Coren, ^
[17:02:27] halfak: it's pretty easy, and there are docs too!
[17:02:36] Cool. Where are those docs?
[17:02:43] halfak: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_.28uwsgi.29
[17:03:52] Thanks legoktm.
[17:03:54] :)
[17:05:34] (PS1) Southparkfan: Minor improvements for web pages [labs/tools/WMT] - https://gerrit.wikimedia.org/r/202773
[17:05:53] halfak: Listen to legoktm; I'm just a python neophyte and he's a pro. :-)
[17:06:19] kk.
[17:06:40] I appreciate your response anyway :)
[17:09:21] (CR) John F. Lewis: [C: 1] Minor improvements for web pages [labs/tools/WMT] - https://gerrit.wikimedia.org/r/202773 (owner: Southparkfan)
[17:09:55] (CR) Southparkfan: [C: 2 V: 2] Minor improvements for web pages [labs/tools/WMT] - https://gerrit.wikimedia.org/r/202773 (owner: Southparkfan)
[17:17:22] Hello Coren!
[17:17:23] https://gerrit.wikimedia.org/r/#/c/202664/
[17:17:29] Debian package!!!
[17:17:41] With that done I can do the puppet role and be almost done
[17:20:02] Tool-Labs: Clean up cruft in urlproxy - https://phabricator.wikimedia.org/T95442#1190543 (scfc) NEW
[17:22:30] Tool-Labs: Clean up cruft in urlproxy - https://phabricator.wikimedia.org/T95442#1190566 (scfc) p: Triage>Low a: scfc
[17:23:32] YuviPanda: Holá. Lemme look at it.
[17:23:45] Cool :D
[17:24:00] I've tested it and spent some time even trying to understand it :)
[17:24:16] It's a Debian native package and I think that's ok
[17:32:11] "Quibble" might even be too strong a term.
[17:33:18] I've been using the equivalent yet slightly terser ISC license everywhere else. Either works, and I've no beef with MIT, but it might be simplest to just keep the same everywhere. Or not.
[17:33:51] I just enjoy the "the whole license is a single statement" thing. :-)
[17:37:58] ooh, ISC is nice.
[17:38:20] I may start using it instead of MIT! I can just swap them out, can't I, since they're equivalent?
[17:41:03] Coren: the repo already had an MIT license from a few commits earlier.
[17:41:23] harej: I... I'm not sure. They both say to keep the permission notice, it's not immediately clear that substituting one for the other is okay unless you're the sole author.
[17:41:42] I don't mind moving if you insist but I do generally like MIT better because it is better known - I haven't heard of ISC at all before the Labs packages
[17:42:01] YuviPanda: Hadn't noticed. But like I said, it's a really trivial thing and there is no problem with mix-n-match.
[17:42:11] :) yeah
[17:42:25] Outside of that everything good?
[17:43:16] Well, the comments by Filippo are relevant but I see nothing else of issue.
[17:44:16] Oh Filippo commented?
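For reference, the Help:Tool_Labs/Web page linked above covers running Python tools under uwsgi, which is the "flask in a bottle" workflow halfak was asking about. A minimal sketch follows; the ~/www/python/src/app.py location and the module-level name `app` follow that help page's usual convention, so treat them as assumptions if the page has changed.

```python
# ~/www/python/src/app.py -- minimal Flask app for a Tool Labs uwsgi webservice.
# The file path and the module-level name "app" are the convention the linked
# help page describes (assumption); uwsgi imports this module and serves "app".
import flask

app = flask.Flask(__name__)

@app.route('/')
def index():
    return 'Hello from Tool Labs!'

if __name__ == '__main__':
    # Handy for local testing only; in production uwsgi serves the app.
    app.run(debug=True)
```

The tool would then be started from the tool account with something like `webservice uwsgi-python start`, with the exact invocation being whatever the linked help page prescribes.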
[17:44:19] Looking
[17:44:23] I'm still groggy
[17:44:29] (On the license thing, I agree that MIT is most known but when the entire license is a single statement familiarity certainly isn't as much of a factor) :-)
[17:44:33] My body isn't used to waking up in the mornings
[17:44:53] I argue that with MIT for all intents and purposes the license is three chars long ;)
[17:45:28] Oh I see
[17:45:34] No need to do the init override
[17:45:39] Stupid documentation
[17:45:51] (Or stupid me, far likelier)
[17:47:50] Tool-Labs, ToolLabs-Goals-Q4: Put toolserver.org redirect configuration in git - https://phabricator.wikimedia.org/T85165#1190669 (Ricordisamoa) I guess http://toolserver.org/~dartar/cite-o-meter/ should be redirected to http://tools.wmflabs.org/cite-o-meter/
[18:01:36] Coren YuviPanda: a while back there was an issue where curling http://tools.wmflabs.org from within the login server (or now tools-bastion-01) would get you a 301 instead of a 200
[18:01:48] it seems to be happening again, but only for our tool xtools-articleinfo
[18:02:05] for xtools and xtools-ec curl is sending back the correct response
[18:02:33] something to do with the proxy settings?
[18:03:59] ah, never mind, happening on all of our tools
[18:05:03] 301 when curling http://tools.wmflabs.org and a 404 when curling any subpage thereof, such as http://tools.wmflabs.org/xtools-ec
[18:11:30] Hm.
[18:14:08] Hm.. looks like wikitech lost all login sessions again (I'm logged in but all openstack actions result in cryptic empty responses)
[18:14:54] Krinkle: Works for me. Perhaps your keystone token expired?
[18:15:02] Krinkle: (7 days, iirc)
[18:15:15] No, I'm logged in just fine
[18:15:19] I can interact with the wiki
[18:15:29] but openstack's internal session or something keeps breaking every other day
[18:15:37] so Special: nova instances gives me an empty list
[18:15:48] and deleting instances tells me the instance doesn't exist etc.
[18:15:55] Keystone token is distinct from just your wiki cookie. Every other day you say? So it's been just over a day since your last login?
[18:16:22] I logged in 3 days ago
[18:16:39] I used to report this to Andrew and Ryan Lane whenever it happened (I think there's an open bug about it)
[18:16:40] Yeah, that's way too short. I know andrewbogott_afk has been beating his head on the wall over this.
[18:16:58] But yeah, it's a weird thing
[18:17:13] Krinkle: Logging out and back on will fix it; but there seems to be no good reason why that happens.
[18:17:24] Yeah, I've done that already. I know the drill :D
[18:26:56] I'm sure andrewbogott will be so "pleased" to know the bug is still biting.
[18:27:37] Coren: same NFS issues as last week?
[18:28:02] Hm? No, NFS is behaving. Keystone token going barfy.
[18:28:13] oh, dammit.
[18:28:18] Do I need to do anything or is it already restarted?
[18:28:37] I continue to suspect that virt1000 is oom'ing and that's killing keystone
[18:28:37] No, Krinkle had the same usual issue but a new login fixed it.
[18:28:46] oh, /that/ bug :(
[18:28:54] it will never die
[18:29:54] can the software not negotiate a new token for the backend given the user has a valid session on the wiki?
[18:30:27] It doesn't store your password. But, it shouldn't need to negotiate anyway, those tokens last forever.
[18:30:51] "For various definitions of 'forever'" :-)
[18:30:57] I don't really understand the problem anymore; I fixed a bunch of potential causes.
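The 301/404 behaviour described above is easy to check from a script rather than eyeballing curl output from the bastion. A small sketch using `requests`; the xtools-ec URL is simply the tool named in the discussion.

```python
# Reproduce the symptom discussed above: from tools-bastion-01 the front page
# returned a 301 and tool subpages a 404.  Plain status-code check.
import requests

URLS = [
    'http://tools.wmflabs.org/',
    'http://tools.wmflabs.org/xtools-ec/',   # tool mentioned in the discussion
]

for url in URLS:
    # allow_redirects=False so a 301 shows up as a 301 instead of being followed
    resp = requests.get(url, allow_redirects=False, timeout=10)
    print(url, resp.status_code, resp.headers.get('Location', ''))
```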
[18:31:17] 'forever' in this case is >= the length of the mediawiki session
[18:31:30] ok, gotta go again, back soon
[18:35:18] Wikimedia-Labs-wikitech-interface: Include role::analytics::hadoop roles in default list of labs puppet groups - https://phabricator.wikimedia.org/T70391#1191047 (Ottomata) p: Triage>Normal
[20:22:53] YuviPanda: btw, what is the status of backups for labs instances? specifically things like graphite.wmflabs?
[20:23:29] greg-g: graphite.wmflabs.org is not on a labs instance
[20:23:41] greg-g: it is on a prod box - labmon1001
[20:23:49] and thus backed up and such, yes?
[20:23:55] Practically speaking there are no backups unless you make them yourself
[20:23:58] Well there is raid
[20:24:04] RINAB
[20:24:05] I don't think we back up metrics
[20:24:11] Not sure if we should either
[20:24:23] why not?
[20:25:25] Why?
[20:25:31] I asked first.
[20:25:33] :P
[20:25:58] "because they're used" would be my response. and losing data is a non-zero issue
[20:27:10] Hmm
[20:27:18] Is this for the availability metric?
[20:27:30] Prod graphite isn't backed up either
[20:27:59] And I think putting availability metrics on graphite is the wrong thing to do, and I do recognize I'm not offering better options :)
[20:28:12] greg-g: so. 'File a bug'?
[20:28:19] right, so, this is what we got, and it's the only thing we can reasonably do
[20:28:47] we can't piggyback on prod metrics because we aren't important enough to warrant the cost, but then we get requirements to show availability metrics...
[20:29:37] where is the argument that beta should not be in prod graphite? curious on the thread of thought
[20:29:56] I guess a lot of logistical challenges
[20:33:52] chasemp: it could be but needs holes in firewalls and security things
[20:34:15] could we monitor external beta presence from prod via normal external means?
[20:34:22] i.e. what does being available mean
[20:34:51] IMO that's the right thing to do
[20:34:58] Use same external monitoring for prod and beta
[20:35:14] legoktm: testing renaming and rename request interface
[20:35:29] L235: how are you going to hit the rate limit?
[20:35:45] legoktm: new accounts
[20:36:16] L235: what *specifically* are you testing?
[20:37:11] Accepting/declining rename requests, what the declined request blacklist covers
[20:42:29] Hmm, I was under the impression there was a blacklist when rejecting requests, as the mock-ups show
[20:43:11] Someone could waste a lot of steward/global renamer time by filing a lot of frivolous requests
[20:54:29] greg-g: I'm going to run into the same problem very soon (since toollabs also needs availability metrics)
[20:54:59] YuviPanda: solve it for us!
[20:55:06] I kid, but... let's work on it
[20:55:09] :)
[20:55:27] greg-g: another person who has the same problem for performance metrics is ori, and of course he has patches :)
[20:55:45] usually 5 different ones attacking the same problem from a few angles
[20:55:57] yeah so he's using rrdtool
[20:56:34] greg-g: https://github.com/wikimedia/operations-puppet/blob/production/modules/webperf/files/rrd-navtiming
[20:56:54] greg-g: I think that's a valid way to attack the problem
[20:56:57] allows backups, etc too
[20:57:53] greg-g: I guess I should point that to, uh, twentyafterfour? :)
[20:58:21] (we can't use the exact same code, of course, but a very similar solution would work)
[20:58:22] YuviPanda: what's up?
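On the "no backups unless you make them yourself" point above: graphite keeps one whisper file per metric on disk, so the do-it-yourself option is simply copying that tree somewhere else on a schedule. A rough sketch; /var/lib/carbon/whisper is the common default location (an assumption, it may differ on labmon1001) and the destination host is purely illustrative.

```python
# Rough self-service backup of graphite's whisper files, per the
# "no backups unless you make them yourself" point above.
# The whisper path is the usual default (assumption) and the rsync
# destination is purely illustrative.
import subprocess

WHISPER_DIR = '/var/lib/carbon/whisper/'
BACKUP_DEST = 'backup-host.example:/srv/graphite-backup/'

subprocess.check_call([
    'rsync', '-a', '--delete',
    WHISPER_DIR,
    BACKUP_DEST,
])
```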
[20:58:34] twentyafterfour: see scrollback + link I just pasted
[21:00:02] L235: if you want to do that kind of stuff, you're probably better off setting up your own test wiki and playing with it
[21:00:04] scrollback starting at 20:22 UTC
[21:01:09] legoktm: yep, understood. I'm done, btw; you can remove me from the group and probably delete the global group entirely
[21:03:34] why use a custom thing instead of graphite? seems like one more thing to maintain that duplicates the same functionality ...
[21:04:13] admittedly graphite is annoying to work with
[21:04:21] twentyafterfour: 'graphite' in our case means that graphite, statsd and the host you're sending metrics from are all up and working
[21:04:29] if you use a smaller thing, it's a much smaller failure zone
[21:05:59] but what's the frequency of graphite+statsd failures relative to the importance of these metrics? are they that critical?
[21:06:34] I'm not against using rrd-navtiming or whatever, just curious
[21:06:54] +1 to the idea that another shiny thing is just more overhead
[21:06:58] not that I have a stake in it
[21:07:37] Labs, hardware-requests, operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1191765 (RobH) a: Andrew>RobH Understood, so we won't replace virt1000 but instead have a warm(ish) standby for use. I'll steal this task back.
[21:08:52] well, if ori's already using it at least it's not much extra maintenance really... and I do like minimalist things
[21:09:26] sounds like maybe you guys could buddy up idk
[21:09:42] chasemp: twentyafterfour I think it's the desire to not have the thing that does availability metrics depend on 3 things itself
[21:10:24] so you make a new thing with less visibility that fails in novel ways fewer people are watching :)
[21:10:31] that's also true
[21:10:39] that was kinda my point
[21:11:06] I hate txstatsd however, because if you don't send it any metrics because idk you are down or something it keeps repeating the last metric you sent it
[21:11:13] but I also am less than thrilled with how graphite works for the availability metrics that I've been working on
[21:11:17] which is IMO shitty for availability metrics in particular, and generally shitty
[21:11:26] no doubt txstatsd sucks but it's dying at least
[21:11:26] and soon
[21:11:30] yeah
[21:12:03] I didn't experience that problem with my diamond collector, does it not use txstatsd?
[21:12:08] it does :P
[21:12:24] you just haven't had a point where your collector is dead and not sending metrics
[21:12:28] (like when the host it is on dies)
[21:12:34] and then you get this flat line
[21:12:48] and you have no idea if it's flat because it was flat (at 100%?) or because the host sending the metrics died
[21:12:52] yeah it's pretty much asinine for an avail metric
[21:13:07] yeah, and IMO completely makes it untrustworthy
[21:13:23] 'does this 100% mean we were always up or down for an indeterminable amount of time? not sure!'
[21:13:43] but probably a better reason to tear out txstatsd than add more layers
[21:13:48] true
[21:13:58] that's happening tomorrow, isn't it?
[21:14:00] my metric is a composite of a bunch of hosts so I guess it's not really as vulnerable to that problem
[21:14:07] true also
[21:16:09] chasemp: https://phabricator.wikimedia.org/T95462 What... why do I have 50 emails from this guy?
[21:16:25] you made a friend?
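One way around the "flat line" ambiguity discussed above is for the probe itself to emit an explicit 1 (up) or 0 (down) sample on every run, so a missing datapoint can only mean the probe host died rather than "everything was fine". A sketch, reusing the same illustrative statsd assumptions as earlier; the host names and metric prefix are not real configuration.

```python
# External availability probe sketch: always record an explicit 1 (up) or
# 0 (down) gauge, so "no data" (probe host dead) is distinguishable from
# "service healthy".  Host names and metric prefix are illustrative.
import requests
import statsd

client = statsd.StatsClient('statsd.example.wmflabs', 8125,
                            prefix='tools.availability')

def probe(url):
    """Return 1 if the URL answers with HTTP 200 within 10s, else 0."""
    try:
        resp = requests.get(url, timeout=10)
        return 1 if resp.status_code == 200 else 0
    except requests.RequestException:
        return 0

client.gauge('homepage', probe('http://tools.wmflabs.org/'))
```

Run from cron on a host outside the environment being measured, the gaps in the series then line up with probe outages rather than being silently papered over by a repeated last value.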
[21:16:32] or more likely they are either spam or
[21:16:37] not a spambot in phab
[21:16:41] please no
[21:16:43] completely clueless and now blocked
[21:16:49] they happen sometimes
[21:17:05] note the dot by their name
[21:17:08] means disabled
[21:17:15] ah
[21:17:35] chasemp: FOCUS
[21:17:35] I find the Maxxc0m one's real name kind of funny
[21:17:45] YuviPanda: yep sorry
[21:17:51] Negative24 oh no
[21:17:59] Negative24: I was ribbing on
[21:18:00] > Excus that ppl is not me. Use your eye and focus...
[21:18:02] from that ticket
[21:18:02] :D
[21:18:23] I puzzled over the upload
[21:18:24] https://phab.wmfusercontent.org/file/data/liexpti3jpm6hi4rqy3i/PHID-FILE-lbhnklyjuvcoxoop3k4t/vp2vw6wjmdm2352p/P_20150403_221506.jpg
[21:18:26] :D
[21:18:33] I don't... know what the hell that is for
[21:18:34] yeah wtf that's all strange
[21:18:36] that's the strangest spam ever
[21:19:05] so I blocked them and brace for repercussions
[21:19:09] attempted nerd-snipe gone horribly wrong
[21:20:53] (or horribly right, of course)
[21:20:59] lol
[21:21:16] my instinct is clueless rather than malicious
[21:21:17] but idk
[21:22:03] hey Coren! Is https://phabricator.wikimedia.org/tag/labs-q4-sprint-2/ accurate for you?
[21:22:07] "Never attribute to malice that which is adequately explained by stupidity."
[21:22:28] Labs, Tool-Labs, Labs-Q4-Sprint-2, Patch-For-Review, ToolLabs-Goals-Q4: Review and productionize service manifest monitor - https://phabricator.wikimedia.org/T95210#1191846 (yuvipanda)
[21:22:31] Labs, Tool-Labs, Labs-Q4-Sprint-2, Patch-For-Review, ToolLabs-Goals-Q4: Create debian package for service manifest monitor - https://phabricator.wikimedia.org/T95255#1191844 (yuvipanda) Open>Resolved a: yuvipanda
[21:22:48] twentyafterfour: fyi I staged phab-01 with latest phab + sprint
[21:22:51] if you can poke at it
[21:23:00] awjr is going to validate sprint app functions
[21:23:37] that was weird. Couldn't send to irc
[21:26:58] Labs: Abolish use of ec2id - https://phabricator.wikimedia.org/T95480#1191859 (Andrew) NEW a: Andrew
[21:28:06] Labs: Fix monitor_labs_salt_keys.py to handle the new labs naming scheme - https://phabricator.wikimedia.org/T95481#1191871 (Andrew) NEW a: ArielGlenn
[21:30:20] Labs: Abolish use of ec2id - https://phabricator.wikimedia.org/T95480#1191887 (Andrew) p: Triage>Normal
[21:39:07] starting webservices takes so long
[21:39:20] Coren: can we make the poll time for Grid Engine shorter?
[21:39:24] it's what, 10s now?
[21:39:42] You mean, the scheduling interval?
[21:40:00] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Ensure that all running webservices have a services.manifest file - https://phabricator.wikimedia.org/T95095#1191936 (yuvipanda) Doing this now.
[21:40:00] I think so, but there are caveats. Lemme check something.
[21:40:06] Coren: ok
[21:41:00] Tool-Labs, Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1191937 (yuvipanda) NEW
[21:42:15] YuviPanda: We might try on-demand scheduling. I've never used it before (it's relatively recent) but since we have a dedicated master that's probably our best bet.
[21:42:29] chasemp: looks good on phab-01
[21:42:51] at least the security stuff is working correctly (once I set up herald and added the security project)
[21:43:10] Coren: what does 'on demand scheduling' do?
[21:43:45] YuviPanda: Basically, it starts a scheduling run whenever a new request comes into the queue as opposed to at fixed intervals.
[21:44:02] ah
[21:44:08] Coren: ah, that'll be ideal...
[21:44:12] Coren: how disruptive is that?
[21:44:53] YuviPanda: Well, since the master is on a dedicated VM the worst that can happen is that there is more overhead if lots of jobs get scheduled close to each other (since they are not batched anymore)
[21:45:10] that should be fine. when we move master to trusty we can make it xlarge, I guess
[21:45:12] (if it isn't already)
[21:45:26] twentyafterfour: what's going on with phab-01?
[21:45:35] I think it's a m1.small atm - and mostly idle anyways.
[21:45:53] haha, right
[21:45:54] * Coren reads up on how to do that.
[21:48:30] Ah, actually, I can tune how many seconds after a job submission a scheduler run will take place - still allows some batching. How does 2secs sound to you?
[21:49:20] I'll try 1 for now - we can always increase it if it causes an issue.
[21:49:44] YuviPanda: I think that should do it. Try it now?
[21:49:48] bbl - dinner
[21:50:05] Coren: w00t, definitely much faster :D
[21:50:09] Coren: we need to puppetize that as well :D
[21:51:01] Tool-Labs, Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1191988 (yuvipanda) Ah, actually, I can tune how many seconds after a job submission a scheduler run will take place - still allows some batching. How does 2secs sound...
[22:00:08] !log tools.jackbot disable cron of webservice restart every minute
[22:00:11] Logged the message, Master
[22:13:11] Labs, Tool-Labs, Labs-Q4-Sprint-2, ToolLabs-Goals-Q4: Ensure that all running webservices have a services.manifest file - https://phabricator.wikimedia.org/T95095#1192154 (yuvipanda) Done for all webservices with a bigbrotherrc \o/
[22:24:07] For some reason I can't log in to flow-tests.eqiad.wmflabs (ssh_exchange_identification: Connection closed by remote host). I can log in to another labs instance we have on the same project, ee-flow.eqiad.wmflabs. Any idea why?
[22:24:24] haven't changed any of my local ssh config in some time
[22:24:29] ebernhardson: it might be dead.
[22:24:44] YuviPanda: hmm, i can probably force reboot from the wiki?
[22:24:45] let me try
[22:24:54] I saw that when we had the labstore1001 communication problem
[22:24:55] ebernhardson: yup, root key doesn't work either. do reboot
[22:26:13] YuviPanda: on clicking reboot i get "The requested host does not exist." :)
[22:26:22] ebernhardson: whelp.
[22:26:27] ebernhardson: try logging out and back in? :)
[22:26:35] 'A dog ate my instance'
[22:26:36] not sure if srs
[22:26:40] ebernhardson: totally srs
[22:27:24] YuviPanda: Not puppetizable per se; that's runtime config.
[22:27:59] Coren: oh, ugh.
[22:28:13] Coren: this is going to be super super fun when we move this all to trusty
[22:28:28] Thankfully, it's relatively easy to export the config.
[22:28:31] Coren: can you mention it on the phab ticket and close it?
[22:28:32] insane, logging out and logging back in found the instance
[22:28:42] ebernhardson: :D Totally srs :D
[22:28:50] ebernhardson: solution is to kill wikitech with fire, which is in progress.
[22:29:11] i wouldn't mind just having a cli client like i had for rackspace and aws
[22:30:02] ebernhardson: there's one, just nobody has access to it.
[22:30:06] because auth, etc.
[22:30:18] (or more like - it's 3 of us doing all this, and there's not enough time to fix all the things :( )
[22:30:26] * YuviPanda goes afk for tea
[22:31:11] ebernhardson: On the up side, we have Horizon (the proper web interface) on the plate.
[22:31:19] So an alternative is in view.
[22:31:27] Tool-Labs, Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1192265 (coren) Open>Resolved a: coren For the record, that is ```flush_submit_sec 1 flush_finish_sec 1``` in the scheduler config (qc...
[22:37:24] Coren: i'm happy to hear it :)
[22:37:40] Coren: Would that be VMware?
[22:37:59] Negative24: Err no, we use openstack. :-)
[22:38:37] that makes a bit more sense than vdi viewers
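For context on the closing Phabricator comment above: the on-demand scheduling change lives in the gridengine scheduler configuration, which `qconf -ssconf` prints as key/value lines, so it is straightforward to check (and eventually puppetize) the flush settings from a script. A small sketch, assuming `qconf` is available on the grid master's PATH.

```python
# Read the gridengine scheduler configuration and report the flush settings
# that control on-demand scheduling, as discussed above.
# Assumes the qconf binary is available on the grid master.
import subprocess

output = subprocess.check_output(['qconf', '-ssconf']).decode()

settings = {}
for line in output.splitlines():
    parts = line.split(None, 1)
    if len(parts) == 2:
        settings[parts[0]] = parts[1]

for key in ('schedule_interval', 'flush_submit_sec', 'flush_finish_sec'):
    print(key, settings.get(key, '<not set>'))
```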