[03:21:38] :(
[03:21:50] my labs-vagrant instance keeps falling over with 'too many connections' on the mysql server
[03:23:52] …and i don't know the root password for mysql :D heh
[03:24:13] ah good it's 'vagrant'
[03:24:15] clever
[03:25:33] i'll leave a shell connected in screen and check the processlist next time it dies
[03:31:09] interesting. even in just a few minutes the number of connected threads has gone up from 22 to 47
[03:31:16] all idle
[03:31:51] i wonder if hhvm's not closing things out properly, or if it's just spawning a huge number of idle threads
[04:08:25] brion: It's hhvm, switch to zend and the problem goes away
[04:08:35] whee
[04:08:50] brion: https://bugzilla.wikimedia.org/71370
[04:27:31] Wikimedia Labs: DNS problem with instance on Wikimedia Labs (apparently two instances with same name) - https://bugzilla.wikimedia.org/71595#c5 (Prateek Saxena) I've removed my instance with the same name. Sorry for the trouble.
[04:37:16] [nagf] Krinkle pushed 2 new commits to master: https://github.com/wikimedia/nagf/compare/7e89da5a6460...bed7ca2284f5
[04:37:16] nagf/master 297e90c Timo Tijhof: Make overview graph generator configurable
[04:37:16] nagf/master bed7ca2 Timo Tijhof: graphs: Add puppetagent
[04:52:17] wikimedia/nagf#17 (master - bed7ca2: Timo Tijhof) The build passed. - http://travis-ci.org/wikimedia/nagf/builds/37899435
[06:43:07] !log Pooled integration-slave1004
[06:43:08] Pooled is not a valid project.
[07:33:34] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (30.00%)
[07:43:53] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[08:00:16] hmm, missed Krinkle|detached again
[08:34:04] I will let you know when I see andrewbogott around here
[08:34:04] @notify andrewbogott
[09:58:30] Wikimedia Labs / Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - https://bugzilla.wikimedia.org/71783#c14 (Kartik Mistry) NEW>RESO/FIX https://cxserver-beta.wmflabs.org/ is built fine (deployment-cxserver03), so closing this now.
[10:22:21] <`fox`> Hi, for the second time in a month or so the webservice for a tool of mine went down
[10:22:33] <`fox`> can anybody help me figure out what happens?
[10:26:43] hey `fox`.
[10:27:06] `fox`: what's the name of the tool?
[10:27:12] `fox`: and does webservice start bring it back up?
[10:37:00] <`fox`> YuviPanda, wikiwatchdog
[10:37:08] <`fox`> yes it brings it back up
[10:37:15] `fox`: then it probably just ran out of memory...
[10:37:26] and it gets restarted a number of times per day, and then it just isn't..
[10:37:38] <`fox`> :|
[10:37:56] let me find docs
[10:37:58] <`fox`> i didn't get the last part... is there an automatic restarter?
[10:38:24] `fox`: yup, let me find docs...
[10:40:10] `fox`: https://lists.wikimedia.org/pipermail/labs-l/2014-July/002758.html
[10:40:13] apparently it is 'opt in'
[10:40:33] `fox`: but yeah, usually 'my web service just stopped working' is when your script uses more than 4G of RAM
[10:42:25] `fox`: if you set it up, it'll restart your web services when they fail...
[10:44:41] <`fox`> YuviPanda, how?
[10:44:52] `fox`: It's described in that link...
[10:45:20] <`fox`> oh ok cool
[10:45:22] <`fox`> thanks :)
[10:45:25] `fox`: yw! :)
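For reference, the opt-in restarter linked above is the Tool Labs bigbrother watchdog; a minimal sketch of enabling it, assuming the ~/.bigbrotherrc format that announcement describes (the file name and line format here are from memory, not from this log):

```
# hedged sketch: opt in to automatic webservice restarts on Tool Labs,
# assuming the ~/.bigbrotherrc mechanism described in the linked email
ssh tools-login.wmflabs.org
become wikiwatchdog                   # switch to the tool account
echo "webservice" >> ~/.bigbrotherrc  # ask bigbrother to restart the web service if it dies
```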
[10:53:01] Tool Labs tools / Database Queries: DBQ-196 one side categories on en.wiki - https://bugzilla.wikimedia.org/59479#c3 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment, if you would still like this task to be completed.
[10:53:16] Tool Labs tools / Database Queries: DBQ-206 List of English Wikipedia articles along with creation date, length, and categories - https://bugzilla.wikimedia.org/59488#c2 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment,...
[10:53:16] Tool Labs tools / Database Queries: DBQ-200 Edit history and protection log of semi-protected articles - https://bugzilla.wikimedia.org/59482#c2 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment, if you would still like t...
[10:53:17] Tool Labs tools / Database Queries: DBQ-205 Number of power users using enhanced recentchanges - https://bugzilla.wikimedia.org/59487#c4 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment, if you would still like this task...
[10:53:17] Tool Labs tools / Database Queries: DBQ-208 List of all composer or songwriter names, in sorted alphabetical order - https://bugzilla.wikimedia.org/59491#c2 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment, if you would...
[10:53:18] Tool Labs tools / Database Queries: DBQ-204 SQL query for pages created on dates between October 2011 and March 2013 - https://bugzilla.wikimedia.org/59486#c3 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment, if you woul...
[10:53:19] Tool Labs tools / Database Queries: DBQ-198 Create and update database reports on fr.wikipedia - https://bugzilla.wikimedia.org/59481#c2 (This, that and the other (TTO)) NEW>RESO/WON Closing this old request as WONTFIX. Please reopen this bug, or leave a comment, if you would still like this task...
[10:58:00] Tool Labs tools / Database Queries: DBQ-205 Number of power users using enhanced recentchanges - https://bugzilla.wikimedia.org/59487 (Nemo) RESO/WON>REOP
[10:58:15] Tool Labs tools / Database Queries: DBQ-205 Number of power users using enhanced recentchanges - https://bugzilla.wikimedia.org/59487 (Nemo) REOP>NEW
[11:47:30] my .tclsh8.6_history file got truncated. is this my fault or something else?
[12:38:43] Dunno.
[12:38:55] Sometimes having multiple shells open at the same time can cause weirdness.
[12:39:00] Plus most shells have a history limit.
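The multiple-shells failure mode is the classic one: on exit each shell rewrites the history file, so the last shell to quit wins, and a size cap silently truncates the file. A sketch of the usual bash-side mitigations (tclsh keeps its own history file, so this is illustrative only):

```
# illustrative bash settings; tclsh's history handling is separate
shopt -s histappend   # append to the history file instead of overwriting it
HISTSIZE=10000        # in-memory history limit
HISTFILESIZE=10000    # on-disk limit -- a smaller value truncates the file on exit
```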
[14:02:56] (PS1) Alexandros Kosiaris: Add passwords::phabricator [labs/private] - https://gerrit.wikimedia.org/r/166569
[14:12:11] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add passwords::phabricator [labs/private] - https://gerrit.wikimedia.org/r/166569 (owner: Alexandros Kosiaris)
[14:29:24] andrewbogott: Ping me when you have a minute
[14:29:48] Coren: now is ok -- what's up?
[14:30:35] Can you take a look at https://gerrit.wikimedia.org/r/#/c/166351/ so that we can push the new images for reals?
[14:33:43] It looks right to me. Looks to me like that patch shouldn't affect anyone who's already used role::labs::lvm::mnt with the new images -- you agree?
[14:36:34] It shouldn't. The only ones that could possibly be affected, IMO, are people who have an old-style instance without a separate /var/log and who want to use labs_lvm::biglogs
[14:36:35] helloooo
[14:36:56] whatever you fixed last week managed to let us create a Precise instance on labs with the role::labs::lvm class :-]
[14:37:05] deployment-cxserver03.eqiad.wmflabs works for us now
[14:38:18] hashar: yep, that's Coren's new image with fancy lvm magic. I'm about to merge a patch that should allow resizing of partitions (within the space allocated to the instance)
[14:44:31] andrewbogott: It's still a horrible hack that we had to do this at firstboot, but I really saw no way to create LVM volumes from within vmbuilder since it doesn't actually run the kernel it's installing like a preseed does.
[14:59:17] oh horrible hacks are fine to me :]
[15:15:33] I prefer clean solutions; but horrible hacks will do in a pinch. :-)
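The "horrible hack" amounts to doing at first boot what a preseed would normally do at install time; a rough sketch of the idea, with all device and volume names illustrative rather than taken from the actual labs_lvm module:

```
# illustrative firstboot sketch -- names are made up, not the real labs_lvm code
pvcreate /dev/vda4         # turn the spare partition into an LVM physical volume
vgcreate vd /dev/vda4      # one volume group for the instance
lvcreate -n mnt -L 10G vd  # carve out a logical volume...
mkfs.ext4 /dev/vd/mnt      # ...put a filesystem on it...
mount /dev/vd/mnt /mnt     # ...and mount it
```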
[15:21:00] Wikimedia Labs / tools: Tool Labs: Node.js and npm broken due to outdated certificate (install minor update to fix certificate) - https://bugzilla.wikimedia.org/70120#c10 (Marc A. Pelletier) (In reply to Ed Summers from comment #9) > Can tools labs easily be upgraded to Trusty? Not all of it, but ther...
[15:29:40] YuviPanda|brb: Is tools-proxy-test still relevant?
[16:11:25] Oh, man, the initial puppet run is ridiculously long nowadays.
[16:12:02] Well, for tools. Lots and lots of things to install.
[16:12:50] andrewbogott: By the way, what have your experiments with the openstack web interface go?
[16:12:56] s/what/how/
[16:13:16] Coren: I think it has potential but will require a fair bit of custom dev.
[16:13:41] Sometime soon I hope to set up a side-by-side install of Horizon so we can use it as an alternative for certain tasks.
[16:13:48] So we're stuck spending some serious dev time either way. On the extension or on Horizon.
[16:13:53] Right.
[16:14:02] Probably better to spend it on Horizon since that goes upstream
[16:14:10] How receptive is upstream to patches?
[16:14:14] Heh. GMTA
[16:14:43] Hard to know how receptive they'll be until we start. But the Openstack folks are mostly good to work with.
[16:15:26] Yeah, I feel that spending dev time on the extension is not a good investment - I very much doubt very many people would even consider managing their vm infrastructure from mediawiki except us.
[16:15:54] I suppose it's all written in python?
[16:16:34] yeah
[16:16:53] Is there an extension/plugin framework we can use?
[16:18:10] yeah, it's designed to be extensible, I think the intention is that each deployment will use horizon as a framework for a mostly-customized install.
[16:19:05] It uses django, if that means anything...
[16:20:09] That's good because it implies that upstream is then receptive to the scenario in the first place; even if most of what we end up doing isn't generally useful and doesn't make it upstream, we're not going to be fighting to make our stuff work with upstream changes.
[16:20:38] I know /of/ Django. More to the point, Django is a known quantity.
[16:26:16] I fixed a few bugs in Horizon a couple of years ago and found it pretty reasonable. I haven't dug into the code much recently but there are a lot of docs with titles like 'customize horizon'
[16:27:37] What are our biggest chunks of coding? aaa integration, I'm guessing, and...?
[16:28:43] I haven't done a big enough audit to know for sure. I think it'll be a lot of little things -- horizon as it is isn't very feature rich, it's more like a roughed-in outline.
[16:30:12] Not like the extension is that feature-rich to begin with. :-)
[16:32:37] Tintle: What are you doing?
[16:33:08] Maarten Dammers
[16:34:29] Tintle: Either that was a really bad cross-linguistic accidental double entendre, or it was way TMI!
[16:42:20] Coren: Please kickban Tintle or give me +o so I can kick geo's ass out of here
[16:42:40] Please ignore multichill
[16:43:28] Coren pls kick this annoying troll
[16:43:45] andrewbogott: can we back up instances on a daily basis? example: to be able to restore beta cluster easily in case we lose 1..all instance(s) ?
[16:44:48] hashar: instance snapshot is supported in openstack (I think) but we don't have active support for it and I've never seen it work very well.
[16:44:59] If we wanted to allow it we would need a ton more disk space...
[16:45:01] so technically yes
[16:45:09] but in real world no :-D
[16:45:18] Kind of :) Is puppet + selective file backups insufficient?
[16:45:37] well when virt1005 died we have been lucky to only lose cxserver
[16:45:43] which was not that important
[16:46:04] if we were to lose like half of the beta cluster, it would have been a much more serious outage :D
[16:46:16] My knee-jerk reaction to this is "If you can't recreate every one of your instances in 30 minutes via puppet you're doing something wrong"
[16:46:31] fully agree
[16:46:49] that is why I made a point of having everything in puppet
[16:46:53] but there might be glitches :D
[16:47:20] I guess I will attempt to figure out what amount of work is needed in case we lose all the instances
[16:47:27] probably not that long
[16:47:43] andrewbogott: thanks :-]
[16:47:56] You could institute a policy of daily random instance deletion :)
[16:51:16] smart! :D
[16:52:25] andrewbogott: Russian roulette with the instances?
[16:52:44] exactly. If you have intentional outages all the time then the unexpected outages will be no big deal
[16:53:13] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools-trusty.diskspace._var.byte_avail.value (50.00%)
[16:54:16] Prepare for the worst. Relying on one system (Puppet) doesn't really sound like a good idea. You should have multiple layers of security preventing real problems
[17:28:15] multichill: Nevertheless, instances should be viewed as cattle, not pets. If they are not fungible, then you have a problem.
[17:47:41] andrewbogott: hey! There's another SWAT today, at 23:00 UTC, would you be around to deploy on wikitech if I get https://gerrit.wikimedia.org/r/#/c/166543/ and the follow up merged?
[17:48:04] * andrewbogott calculates timezones slowly
[17:48:28] So that is… five hours from now?
[17:48:58] I will only barely still be here, I have dinner plans. How about the early swat tomorrow instead?
[17:48:58] andrewbogott: yeah
[17:49:07] andrewbogott: sure, works for me!
[17:49:17] I am unable to create any new ssh session to toollabs but two of the connected terminals work fine
[17:49:33] multiwiki: tools-login?
[17:49:39] any known issues that anyone is aware of?
[17:49:55] multiwiki: is it a dns failure or is your key rejected?
[17:49:57] andrewbogott: I'll get them merged today anyway. Provides anon access to instance info, and the other thing provides anon access to puppetvars/puppetclass info
[17:49:58] (works for me, btw)
[17:50:22] HostName tools-login-eqiad.wmflabs.org
[17:50:31] yeah for the tools-login
[17:50:51] multiwiki: try tools-login.wmflabs.org?
[17:51:29] I get the following: ssh: Could not resolve hostname tools-login-eqiad.wmflabs.org: nodename nor servname provided, or not known
[17:52:12] multiwiki: I'm pretty sure that tools-login-eqiad.wmflabs.org has never been defined
[17:52:47] andrewbogott: No- it was, for the migration. But it's been cleaned up some time ago seeing as how the migration has been done for quite some time.
[17:52:55] ah ok
[17:53:21] multiwiki: Just use tools-login.wmflabs.org or login.tools.wmflabs.org; both of those names are canonical for "log me into a bastion wherever" :-)
[17:54:15] Yes I did the migration sometime ago
[17:56:34] thanks
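When a labs hostname fails like this, it helps to separate DNS problems from key problems before asking; a quick checklist:

```
host tools-login.wmflabs.org    # should resolve; tools-login-eqiad no longer exists
host login.tools.wmflabs.org    # the other canonical bastion name
ssh -v tools-login.wmflabs.org  # -v shows whether DNS, TCP, or key auth is failing
```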
[17:59:03] I tried to 'qdel' a job, and it's been in "deleting" status for over 24 hours, what to do?
[18:12:33] notconfusing_: it might need a more violent kick; have you looked on the actual exec node what happened to your job?
[18:12:51] Coren, how do I do that?
[18:13:33] notconfusing_: If you do a qstat, you can see what node it's running on. What job number is it?
[18:13:59] job-ID: 4769960
[18:14:07] Coren: hmm, so if a job doesn't exit on SIGINT, there's no SIGKILL sent?
[18:14:26] i guess it's on tools-exec-10
[18:14:34] YuviPanda: There should be, but the corruption of the job queue last week may have messed things up a bit.
[18:15:55] ah, right
[18:16:05] notconfusing_: Right. Once there you can look for it, and I see it running as pid 17291. You could try to figure out why it's not quitting, or just out and kill it
[18:22:52] Coren, you mean to ssh into that machine?
[18:23:14] notconfusing_: Yes. :-) sorry if I was unclear.
[18:24:33] Wikimedia Labs: Puppet resets ownership of /srv/mediawiki in role::mediawiki-install::labs to root:root - https://bugzilla.wikimedia.org/72046 (spage) NEW p:Unprio s:normal a:None Flow has a shared labs instance ee-flow.wmflabs.org. We want devs to update it as themselves, so we set [core]...
[18:24:46] Wikimedia Labs: Puppet resets ownership of /srv/mediawiki in role::mediawiki-install::labs to root:root - https://bugzilla.wikimedia.org/72046 (spage)
[18:24:46] Wikimedia Labs / Infrastructure: MediaWiki files set up by role::mediawiki-install::labs don't have proper permissions - https://bugzilla.wikimedia.org/62368 (spage)
[18:25:21] Coren: the SIGINT/SIGKILL change was after the job queue corruption, though, so older jobs should have been SIGKILLed
[18:25:39] uh
[18:25:42] is bugzilla dead?
[18:25:52] YuviPanda: {{worksforme}}
[18:26:05] valhallasw`cloud: yeah, now for me too
[18:26:09] valhallasw`cloud: I'm pretty sure that the kill methods aren't saved with the jobs.
[18:26:14] Works for me.
[18:26:52] Coren what did you use to find the pid of the job-id?
[18:27:05] Coren: ohhh, I see. the qdel was after the change, of course.
[18:27:29] I killed that pid number on the tools-exec-10, confirmed it was dead, but qstat still shows it as running
[18:28:53] notconfusing_: 'ps' and look for processes owned by your tool's userid. :-) 'ps fuU tools.recitation-bot'
[18:29:22] ok, great
[18:30:11] notconfusing_: IIRC, the sge_shepherds report every 30s or so; it may take a bit before gridengine notices.
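Pulling Coren's walkthrough together, the whole procedure for a job stuck in "deleting" looks roughly like this (job ID, exec node, pid, and tool name are the ones from the conversation above):

```
qstat                        # the queue column shows the exec node, e.g. ...@tools-exec-10
qstat -j 4769960             # full details for the stuck job
ssh tools-exec-10            # hop onto the exec node
ps fuU tools.recitation-bot  # find the job's processes, owned by the tool's userid
kill 17291                   # then kill -9 if it ignores SIGTERM
qdel -f 4769960              # last resort: force-delete from the queue (may need an admin)
```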
[18:33:36] Coren: what is the tools-trusty instance for?
[18:33:39] bastion?
[18:33:50] Yeah, it's a trusty bastion.
[18:34:20] Necessary/useful if people are to develop things for the soon-to-be-made-available-for-real trusty node.
[18:34:27] /var is full there..
[18:35:07] How inane. The instance is all of 6 hours old or so.
[18:35:31] Ah. /var/cache from installing that many packages I'm guessing.
[18:35:33] wat, I've no idea what it's full with...
[18:35:36] oh
[18:35:37] yeah
[18:35:39] it is
[18:35:51] Coren: see, 2G var is pretty small even if you don't have a lot of logs :)
[18:36:32] That's just effin broken - the precise instances certainly don't fill /var/cache/apt with nearly 2G of crap
[18:37:16] it's cleared up now, I guess you cleaned it up...
[18:37:29] I did. It's just a cache of recently downloaded debs.
[18:37:37] right
[18:37:45] And apparently apt-get lets it grow unbounded.
[18:47:32] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
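A 2G /var really can fill up from package downloads alone on a fresh instance; a sketch of keeping /var/cache/apt in check (run as root):

```
du -sh /var/cache/apt/archives  # see how much space cached .debs are using
apt-get clean                   # drop all cached package files
apt-get autoclean               # gentler: only drop .debs that can no longer be downloaded
# or let the apt daily cron do it periodically (illustrative config):
echo 'APT::Periodic::AutocleanInterval "7";' >> /etc/apt/apt.conf.d/10periodic
```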
[20:07:56] YuviPanda: you're here
[20:07:57] I'm here
[20:07:58] what's up
[20:08:27] Krinkle: was wondering if you could help get https://gerrit.wikimedia.org/r/#/c/166543/3 merged into wikitech, once that's done I can have an 'archive dead instances' script running on cron...
[20:11:12] Coren: also, puppet has been failing on tools-webproxy with: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to fetch instance ID at /etc/puppet/modules/base/manifests/init.pp:198 on node i-000000e6.eqiad.wmflabs
[20:11:19] no idea why that would be a problem...
[20:11:39] Coren: running ec2metadata succeeds just fine...
[20:12:20] and the curl inside the output of that runs fine too
[20:12:28] andrewbogott: ^
[20:17:01] Krinkle: I also responded on https://bugzilla.wikimedia.org/show_bug.cgi?id=71995, so you could either change the mime you're emitting to application/javascript, or if you think text/javascript is common enough I can make a patch
[20:25:40] YuviPanda: the problem is with resolv -- hostname -d returns nothing
[20:26:09] uh
[20:26:41] hmm, is that because it has public DNS set? probably not, since tools-login resolves fine.
[20:27:50] I've seen this problem on instances that were migrated from pmtpa, but this is different
[20:29:20] andrewbogott: interestingly(?) it was also reporting itself as 'tools' instead of 'tools-webproxy' so far
[20:32:18] o_O
[20:32:21] That's just odd.
[20:32:56] this might be a goodish time to switch the proxy over to another machine, and make that trusty...
[20:33:01] so we can get rid of our homebrew package...
[20:33:25] and also have two machines doing the proxying, with DNS round robin.
[20:33:28] or maybe not :)
[20:41:07] I'd wait until we have codfw up before making a second proxy.
[20:47:01] And put it /there/ instead.
[20:47:36] I still think we should have two proxies per DC, on different hardware
[20:47:48] so the machine the proxy is on going down (like last time) won't affect all of tools
[20:57:00] That still works if they are in different DCs
[20:57:25] Coren: well, latency if codfw proxy proxies to tool running on eqiad...
[20:57:46] If we lose a virt host, some latency is not going to be a major issue.
[21:35:16] * YuviPanda also pokes Krinkle with https://bugzilla.wikimedia.org/show_bug.cgi?id=71995
[21:45:08] YuviPanda: I found those gzip rules already
[21:45:13] YuviPanda: But you removed text/html from it
[21:45:16] so where is that getting compressed?
[21:45:24] how is that happening?
[21:45:33] text/html from php that is
[21:45:44] > Enables gzipping of responses for the specified MIME types in addition to "text/html". The special value "*" matches any MIME type (0.8.29). Responses with the "text/html" type are always compressed.
[21:45:48] from http://nginx.org/en/docs/http/ngx_http_gzip_module.html
[21:45:58] Aha
[21:46:00] That was not obvious
[21:46:04] yup
[21:46:10] YuviPanda: So, yeah, in that case text/javascript should be added
[21:46:23] application/javascript is a standards hipster thing, "nobody" uses that.
[21:46:27] haha
[21:46:30] De facto standard is text/javascript.
[21:46:57] so except for things nginx itself is in full control over, it'll be text/javascript.
[21:47:10] in case of static files it works right now because it does its own mime type determination
[21:47:17] but for anything php proxies, it'll be text/javascript.
[21:48:50] Krinkle: yeah, https://gerrit.wikimedia.org/r/166676.
[21:49:01] Krinkle: tools-webproxy puppet is still broken, though, so won't get applied immediately...
[21:49:56] YuviPanda: It's a mystery how puppet being broken for days can become snafu.
[21:49:56] Krinkle: let me put in a local hack that has the same effect though
[21:50:09] I'm blocked on 2 different projects as well.
[21:50:22] Krinkle: it is indeed, both Coren and andrewbogott are busy with other things atm :(
[21:50:34] they didn't break puppet though
[21:51:01] Krinkle: which other projects?
[21:51:06] cvn and integration.
[21:51:17] is puppet broken there?
[21:51:35] can't do squat. I mean, I could. I could investigate a million lines of puppet and learn everything ops knows, and do a local puppetmaster and fix it there, but I'm not going to do that.
[21:51:41] puppet run is failing
[21:52:05] Krinkle: ok, tools proxy should now compress text/javascript
[21:52:09] Krinkle: What broken puppet? Point me at the bz?
[21:52:22] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=72014
[21:52:23] looks like it got fixed this morning
[21:52:23] yay
[21:52:25] Coren: It's just ops being silly
[21:52:34] Coren: no bz for tools-webproxy, it's hostname -d returning empty strings...
[21:52:40] Coren: so can't apply the change that was just merged...
[21:53:37] Wikimedia Labs / tools: WMFLabs: Gzip text/javascript responses from fcgi (just like text/html) - https://bugzilla.wikimedia.org/71995#c4 (Yuvi Panda) NEW>RESO/FIX Fixed by https://gerrit.wikimedia.org/r/166676
[21:54:07] Wikimedia Labs / tools: WMFLabs: Gzip text/javascript responses from fcgi (just like text/html) - https://bugzilla.wikimedia.org/71995#c5 (Krinkle) > dynamicproxy: Compress text/javascript as well > Change-Id: I91c83b557b3af3977919412325c9bb950480e88f
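Per the nginx documentation quoted above, the fix is a one-directive change, and it can be checked from outside the proxy; a sketch, with the tool URL illustrative:

```
# the merged change amounts to (text/html is always compressed by nginx):
#   gzip_types text/javascript application/javascript;
# check that a JS response now comes back compressed:
curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' \
    'http://tools.wmflabs.org/wikiwatchdog/' | grep -i '^content-encoding'
# expected output: Content-Encoding: gzip
```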
[21:56:52] Krinkle: although, I think running a project requires some amount of puppet skills
[22:01:37] Wikimedia Labs / tools: Tool Labs: Node.js and npm broken due to outdated certificate (install minor update to fix certificate) - https://bugzilla.wikimedia.org/70120#c11 (Marc A. Pelletier) ASSI>RESO/FIX There is now a Trusty instance available to tool labs (per the email on labs-l) which can be...
[22:03:18] YuviPanda: Eh? There's no custom puppetmaster by default
[22:03:25] And I can't be responsible for operations/puppet
[22:03:35] they introduced a change that would consistently break no matter what
[22:03:53] and this happens about once a month.
[22:04:26] usually by qa, labs or ops.
[22:04:33] breaking one of the others
[22:04:55] I don't break no puppet, except for Labs itself.
[22:06:16] Krinkle: hmm, in this case I guess the problem is with whoever merged the original change...
[22:06:18] Mostly because I generally don't touch manifests outside lab's own. :-)
[22:06:23] * YuviPanda should've read the bug before responding
[22:06:47] Coren: I strace'd hostname -d, and that's just plain reading from nscd and responding...
[22:06:50] I've no idea what's going on here.
[22:07:54] YuviPanda: lemme go take a look.
[22:08:00] Coren: cool, thanks
[22:08:22] I'm going to go sleep
[22:13:38] YuviPanda|zzz: 10.68.16.4 tools-webproxy tools.wmflabs.org in hosts file. There's your problem.
[22:14:05] oh, hmm. that's unnecessary here, yeah, but that's been around forever...
[22:14:27] tools-webproxy shouldn't be using the hosts file, really. Ima running puppet now.
[22:14:42] cool :)
[22:14:50] and yay, icinga will be all green after that :)
[22:15:27] Coren: also, any nginx conf changes require a manual restart...
[22:23:36] So yeah, Trusty seems to work fine.
[22:24:06] do websockets work with tools? I see that the websocket connection is established, but then websocket communication silently fails (using gevent-socketio and the portgrabber thing)
[22:24:09] Coren: is this the same trusty instance we created way back?
[22:24:40] sitic: it *should* just work...
[22:24:48] although I don't think anyone else has tested it so far..
[22:24:51] yuvipanda: Yeah, I *finally* got around to working the gridengine settings to make it available (but also make a Trusty bastion)
[22:24:54] the proxy supports it.
[22:25:08] YuviPanda|zzz: ok thanks, I'll look more closely into it
[22:25:18] Coren: nice. I want to slowly migrate all of everything to trusty over time, I guess.
[22:25:36] sitic: I suggest making a small 'hello world' type websocket server, and seeing if that works.
[22:27:52] YuviPanda|zzz: that's what I'm currently trying: http://tools.wmflabs.org/watchr , but there seems to be some issue with the backend. I'll look more into it. Thanks :-)
[22:28:10] sitic: cool :) definitely want to support it on tools :)
[22:28:53] Coren: btw, we need a trusty webgrid (or the equivalent of -tomcat) since a major use case for node is web apps...
[22:30:59] Yeah, we'll just need to create one and set its release=trusty
[22:31:06] yeah
[22:31:23] Coren: why don't we just make new trusty nodes also act as -tomcat type nodes?
[22:32:17] Because the resource management is not the same for web services and jobs.
[22:32:41] There is a lot of overcommit for web services given that the number of actually different executables is low.
[22:33:07] Coren: even on -tomcat? If it's something like lighty, that makes sense, but for random other services (go, node, tomcat?)
[23:12:32] Even on -tomcat; especially because those services tend to be very large in vmem footprint; even if there were six different ones, you'd still have at most one of each.
[23:12:47] Whereas the number of scripts that can be run on general nodes is essentially unbounded.
[23:13:19] (More specifically, the webservices don't allocate on vmem at all as it's a poor metric)
[23:14:01] Coren: ah, hmm. that makes sense. we should just name them something else other than tomcat tho :)
[23:14:49] That's just an hysterical raisin. tools-webgrid-not-lighttpd-01 is quite a mouthful though. :-)
[23:14:57] heh :D
[23:15:07] tools-generic-webgrid or something like that maybe.
[23:15:16] but we'll cross that bridge when we get to it :D
[23:22:05] !log tools removed stale puppet lockfile and ran puppet manually on tools-exec-07
[23:22:08] Logged the message, Master
[23:29:52] PROBLEM - ToolLabs: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: tools.tools-trusty.puppetagent.failed_events.value (100.00%)
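For reference, the /etc/hosts entry Coren found at 22:13 explains the empty hostname -d: name resolution consults /etc/hosts before DNS, and hostname -f, -d, and the short name all derive from the canonical (first) name on the matching line, so both the missing domain and the instance calling itself 'tools' trace back to that entry. A short illustration:

```
getent hosts tools-webproxy  # shows which source wins (nsswitch: files before dns)
hostname -f                  # FQDN = canonical name from that lookup
hostname -d                  # empty, because "tools-webproxy" contains no dots
# a hosts line that would not break this puts the FQDN first, e.g.:
#   10.68.16.4 tools-webproxy.eqiad.wmflabs tools-webproxy
```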