[00:06:33] andrewbogott: "Error: /Stage[main]/Role::Labs::Instance/Mount[/home]: Could not evaluate: Execution of '/bin/mount /home' returned 32: mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/mediawiki-core-team/home failed, reason given by server: No such file or directory" -- new instance mw-core-wbstatus in mediawiki-core-team project, NFS apparently not set up [00:06:56] bd808: believe it or not, that’s caused by instances building too quickly :) [00:07:06] I put a ‘sleep’ in the jessie boot just to avoid that. [00:07:17] Reboot your instance and it’ll be fine. It’s just a race with the NFS server. [00:07:29] k. I'll try that [00:08:33] andrewbogott: works :) thanks [00:14:00] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [00:14:10] nnoooooooo [00:28:55] andrewbogott: Nope, not me. [00:28:56] RECOVERY - Puppet staleness on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [3600.0] [00:29:29] Coren: any guess what it’s for and/or what it would break if I remove it? [00:30:06] I'm pretty sure nothing can break if you do - nothing delegates that reverse there anyways. [00:30:20] Anything that relied on it wouldn't work in the first place. [00:30:39] (And besides, an A record in in-addr.arpa will never be looked up) [00:31:18] 10Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1075068 (10coren) 5Open>3Resolved a:3coren service gridengine-master start :-) [00:33:52] Coren: great!
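The race andrewbogott describes — a fresh instance trying to mount its NFS home before the server has exported the new project share — is worked around at boot time; the actual jessie image just sleeps. A retry loop is another sketch of the same idea (the function name, attempt count, delay, and the MOUNT_CMD override hook are all assumptions for illustration, not what the image really does):

```shell
# Keep retrying the NFS mount until the server-side export exists,
# instead of failing the whole first boot on the first attempt.
mount_with_retry() {
    target=$1; attempts=${2:-5}; delay=${3:-10}; i=1
    while [ "$i" -le "$attempts" ]; do
        # MOUNT_CMD is overridable so the loop can be exercised without root
        if "${MOUNT_CMD:-/bin/mount}" "$target"; then
            return 0
        fi
        echo "mount $target failed (attempt $i/$attempts), retrying in ${delay}s" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
# e.g. near the end of first boot:  mount_with_retry /home 5 10
```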
That will fix my bug :) [00:49:18] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [00:49:29] Khaaaaaaaaaaaaan [00:50:09] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1075112 (10scfc) Don't know if it is related, but in T90787 you spoke of changing the LDAP entry of wmflabs.org a few days ago. [00:56:36] Coren: looks like tools-webgrid-06 is full up again [00:57:20] andrewbogott: Yeah, much to my annoyance. I'm going to allow overcommit for now, until I figure out what changed between the releases [00:59:30] !log tools set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down. [00:59:34] Logged the message, Master [01:00:09] !log tools Also That was -webgrid-05 [01:00:12] Logged the message, Master [01:00:33] !log tools Set vm.overcommit_memory=0 on -webgrid-05 (also trusty) [01:00:37] Logged the message, Master [01:01:20] This should prevent the puppet runs from failing. It's a bit more dangerous to actually run OOM but since usage on the trusty nodes is <25% atm I'm not worried in the short term [01:04:58] ‘k [01:05:22] Hm, host entries must be in wikitech’s memcache… have to wait a while to see if I fixed anything [01:08:52] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1075170 (10Andrew) It's not related, unless I uncovered a latent issue.
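For reference, the sysctl Coren !logs above selects the kernel's memory overcommit mode: 0 is the default heuristic (allow reasonable overcommit), while 2 is strict accounting, under which large processes can fail to fork() because the kernel pre-reserves the child's full address space — presumably what was making puppet runs fail here. Applied live with `sysctl -w vm.overcommit_memory=0`; a hypothetical persistent version (the path and filename are a conventional choice, and in practice the setting would be puppetized) would look like:

```
# /etc/sysctl.d/60-overcommit.conf  (hypothetical path and filename)
# 0 = heuristic overcommit, the kernel default: large processes can
# fork() without the full address space being reserved up front.
vm.overcommit_memory = 0
```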
Anyway, I have now removed that entry -- we will see what breaks :( [01:09:19] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [01:19:27] "Error 400 on SERVER: Could not parse for environment production: invalid byte sequence in US-ASCII at /etc/puppet/manifests/role/planet.pp:1 on node bd808-vagrant.eqiad.wmflabs" [01:19:33] seems to be repeatable [01:19:56] 5 manual puppet agent attempts in a row [01:24:26] andrewbogott, Coren: could the locale be wrong on the labs puppet master? https://projects.puppetlabs.com/issues/22563 seems to match the error I'm seeing above [01:25:18] How did you get to that error? [01:25:59] running `puppet agent --test --verbose` on bd808-vagrant.eqiad.wmflabs [01:26:09] which is a new instance? [01:26:13] yes [01:26:21] what project? [01:26:37] mediawiki-core-team [01:27:01] It hasn't been rebooted since first provision [01:27:11] but the nfs mounts are there [01:27:20] just built it a few hours ago [01:34:54] bd808: I don’t know what gives. I created new instances in that project, they’re fine. [01:36:00] bd808: it looks to me like that instance was configured to be a puppet master, and then unconfigured...? [01:36:12] Certainly it’s /etc/puppet/puppet.conf is very different from that of a new instance [01:36:12] andrewbogott: *shrug* I'll try turning it off and on again and if that doesn't work nuke it and try again later [01:36:16] um… its [01:36:23] it is? [01:36:29] huh [01:36:47] I can’t guess how it got that way [01:37:23] I'll just nuke it then [01:37:28] no worries [01:37:59] bd808: there are two instances with that same name one in wikimania-support [01:38:11] ah. dumb me [01:38:17] I’m surprised that’s possible, but it’s most likely the problem. [01:38:21] it bet that makes things crazy [01:38:41] I don’t know… it /should/ be doing the puppet lookup by instance id. 
But… it’s not exactly a tested use case :) [01:40:20] It's a bit wonky that wikitech let me stomp on an existing host name like that [01:41:39] so when I ssh'd in I got the existing instance, which is probably broken due to a messed-up self-hosted puppetmaster or something [01:42:03] * bd808 needs to be more creative with his vm names obviously [01:42:55] Yeah, it shouldn’t allow duplicates. There’s even a patch in nova to prevent it. [01:43:24] * bd808 spins up shaved-yak.eqiad.wmflabs to replace it [01:43:51] Personally, I'd have hostnames be $host.$project.$cluster.wmflabs [01:44:00] +1 [01:44:02] So hosts only need to be project-unique [01:44:19] Yep, would be better. I usually try to do projectname-host.eqiad.wmflabs at least [01:44:38] And if you want to be extra smart, you add $project.$cluster.wmflabs to the resolv.conf search [01:44:52] (and set ndots=1 globally) [01:45:16] * andrewbogott is out for now. Good weekend all! [01:45:24] * andrewbogott plans to spend his weekend rebooting virt1012 over and over [01:45:32] o/ [01:45:52] andrewbogott: I'd rather not. Let's hope it'll be gentle on us. [03:56:49] 10Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1075326 (10yuvipanda) 5Resolved>3Open I think the admin documentation should be written in a way that anyone from ops who isn't familiar with tools should be able to do almost all the steps. The admin docs for fai... [04:27:40] MySQLdb is not available on uwsgi instances, now how am I going to continue? [04:31:00] 6Labs, 10Tool-Labs: Please install python's MySQLdb module in all uwsgi instances on labs tools. - https://phabricator.wikimedia.org/T91155#1075347 (10Mjbmr) 3NEW [04:32:27] 6Labs, 10Tool-Labs: Please enable multithreading for uwsgi on labs tools. - https://phabricator.wikimedia.org/T91156#1075356 (10Mjbmr) 3NEW [04:43:21] Mjbmr: build it yourself and use a virtualenv.
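Coren's virtualenv suggestion would look something like the session below. The venv path is just a choice, and MySQL-python is the usual pip distribution that provides the MySQLdb module; this is an untested sketch, and the build requires the MySQL client development headers to be present on the host:

```
$ virtualenv ~/venv                 # location is a choice, not a convention
$ . ~/venv/bin/activate
(venv)$ pip install MySQL-python    # builds the MySQLdb C extension
(venv)$ python -c 'import MySQLdb'  # succeeds once the build worked
```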
[06:45:51] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [06:58:10] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [07:00:53] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [07:18:10] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [07:37:24] 6Labs, 10Tool-Labs, 5Patch-For-Review: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1075433 (10yuvipanda) @scfc I agree. Am going to kill tomcat and uwsgi nodes, and merge them all into 'generic'. [07:49:03] 6Labs, 10Tool-Labs: Move uwsgi jobs to be run on generic hosts, retire uwsgi hosts - https://phabricator.wikimedia.org/T91065#1075439 (10yuvipanda) [07:51:20] !log tools create tools-webgrid-07 [07:51:25] Logged the message, Master [07:51:48] 6Labs, 10Tool-Labs, 5Patch-For-Review: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1075445 (10yuvipanda) Hand moving the tomcat jobs onto the generic node now. [08:03:28] (03PS1) 10Yuvipanda: Point users to webservice2 for tomcat [labs/toollabs] - 10https://gerrit.wikimedia.org/r/193559 (https://phabricator.wikimedia.org/T91066) [08:07:52] 6Labs, 10Tool-Labs, 5Patch-For-Review: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1075459 (10yuvipanda) Moved them all off, and they all seem to work! yay! Just need to merge https://gerrit.wikimedia.org/r/#/c/193559/ and then I can kill the no... 
[08:09:48] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [08:10:02] (03PS2) 10Yuvipanda: Point users to webservice2 for tomcat [labs/toollabs] - 10https://gerrit.wikimedia.org/r/193559 (https://phabricator.wikimedia.org/T91066) [08:57:16] PROBLEM - Puppet failure on tools-webgrid-generic-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [11:38:23] how do I submit a patch on phabricator? [12:04:06] !log tools.family Project created [12:04:09] Logged the message, Master [12:22:37] khushbu_: You submit it to Gerrit (gerrit.wikimedia.org), or Phabricator [12:22:56] s/or/not [12:24:52] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0] [12:44:22] no not yet [12:45:09] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1075608 (10Kelson) I have reset the hostname for mwoffliner2. Problem is still there. [16:17:52] YuviPanda|brb: Already back? :P [16:19:31] multichill: from vacation? Yeah :) [16:19:40] Sat night at the beach tho [16:19:53] We're on a hackathon, mind approving https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Metaodi ? [16:20:01] Hacking on the beach? Nice :-) [16:20:13] Aha [16:20:22] In 10min? [16:20:28] Check out https://tools.wmflabs.org/family/ancestors.php?q=Q7742&lang=en [16:20:39] That would be awesome. Don't get sand in your laptop ;-) [16:21:09] Woah. Sweeeeey [16:33:35] multichill: done [16:33:42] ty [16:36:08] yw [16:36:09] am off now tho [16:37:12] !log tools.family Added Metaodi? to the project [16:37:17] Logged the message, Master [16:56:01] GerardM-: https://tools.wmflabs.org/family/ancestors.php?q=Q154946 [16:56:14] (you can just click on a name) [16:56:53] ok, but why doesn't it show it for Amalia?
[16:58:00] There's a limit on the number of levels; try https://tools.wmflabs.org/family/ancestors.php?q=Q855749&lang=en&level=10 [16:58:15] @seen nakon [16:58:15] Cyberpower678: Last time I saw nakon they were quitting the network with reason: no reason was given N/A at 2/27/2015 4:31:03 AM (1d12h27m12s ago) [16:58:22] There's still a small bug in the code for the last layer [16:58:54] GerardM-: ^ [17:00:57] :) [17:19:57] RECOVERY - Puppet staleness on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [3600.0] [17:41:58] hi, I seem to have problems restarting the webservice for my tool on trusty [17:42:30] when I run 'webservice status' it tells me [17:42:33] queue instance "webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs" dropped because it is temporarily not available [17:43:00] webservice stop or restart never terminates [17:43:03] any known issues? [17:50:47] 10Tool-Labs-tools-Other: Move Gerrit reports crontabs to Tools - https://phabricator.wikimedia.org/T88384#1075854 (10MZMcBride) >>! In T88384#1011393, @MZMcBride wrote: > Oh, sorry, I'm still waking up. The configuration portion is probably what's actually needed of me, at least to continue using the same bot ac... [17:56:15] legoktm: can you tell me how to enable threading? [17:58:30] schaffert: what tool is this? [18:00:33] wikidata primary sources [18:00:46] it is just an fcgi binary [18:00:53] I still need its name [18:00:56] :-) [18:01:00] ok :) [18:01:25] wikidata-primary-sources to be precise [18:02:13] but the thing seems to be that the 'webservice' command cannot communicate with the grid server [18:02:32] No, that's not quite it - you have a stuck job. [18:02:39] ah ok [18:02:48] but the job is working [18:02:54] And webservice politely waits for it to die. [18:03:27] ok, how can I solve it? [18:03:54] I'm trying to find out /why/ it's stuck first. [18:04:08] ok :) [18:04:49] Hm.
I'm not sure /why/ it got wedged like this - I presume it's because of yesterday's outage - but I've force-deleted it so you should be okay. [18:05:06] so it is shut down now? [18:06:01] It's dead, Jim. [18:06:14] You can just start it afresh. [18:06:14] no, it's not [18:06:19] the webservice is still responding [18:06:27] curl https://tools.wmflabs.org/wikidata-primary-sources/entities/any [18:06:30] Oh? Hm. [18:06:52] Interesting. Apparently the grid lost track of it - I presume that's why it couldn't control it anymore. [18:07:03] * Coren tracks it down and murders it. [18:07:09] seems so :) [18:07:11] +1 [18:08:41] I can't use threading with default uwsgi, can I run my service separately and create a proxy for that? I can't create a new proxy. [18:09:32] Mjbmr: That is not recommended. Open a phab ticket with what you need to do the right thing with uwsgi and we'll see to it. [18:09:39] schaffert: It's been murderized. [18:10:01] I already did. it's urgent. [18:10:26] great, thx [18:10:36] Mjbmr: Define "urgent"? [18:10:49] It's wiki-related. [18:11:19] besides, with the uwsgi you guys set up, you can't find out when an instance is down or something is wrong with your code. [18:11:22] hi, I'm trying to run a simple continuous script which writes rc changes to a log file. when I run it directly I see the file created, but when it is run by "jsub -once -continuous -N testRC -mem 500m -quiet python rc_read.py" it doesn't (but it also doesn't fail with some error). So how do I check that a continuous job works properly? [18:12:09] That applies to absolutely every tool, Mjbmr, pretty much by definition. Why, specifically, is it urgent? What depends on this? [18:12:29] It's difficult to prioritize backlog when everything needs to be done first. :-) [18:13:06] eranroz: If there are issues with running the tool, you'll get details in testRC.log and testRC.err normally.
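For jobs like eranroz's that depend on environment variables, the most reliable pattern (which Coren spells out a little later in this log) is a small launcher script that sets the environment and execs the real job, submitted to the grid in place of the Python script. A sketch, where the pywikibot checkout path, the tool subdirectory, and the wrapper name are all assumptions:

```shell
# Write a launcher, then point jsub at it instead of the Python script.
cat > rc-wrapper.sh <<'EOF'
#!/bin/bash
export PYTHONPATH="$HOME/pywikibot-core:$PYTHONPATH"  # assumed checkout path
cd "$HOME/rc-tool" || exit 1  # grid jobs start in the tool home by default
exec python rc_read.py "$@"
EOF
chmod +x rc-wrapper.sh
# Then submit as before, but via the wrapper:
#   jsub -once -continuous -N testRC -mem 500m ./rc-wrapper.sh
```

Because the wrapper sets everything itself, the job behaves the same whether it is started by jsub, cron, or an interactive shell.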
[18:13:50] I'm making a tool for two wikis which I'm an admin on, and I need this tool done for those projects. I started it already and I don't want to waste time on nothing, so please tell me how to create a proxy for that. [18:14:00] eranroz: Common things: not enough memory, or the directory your tool runs in isn't the one it expects (by default, every job is run in the tool's home and not a subdirectory) [18:15:23] Mjbmr: What is the task number in phab? [18:16:09] T91156 [18:17:29] tools-webproxy:8282 ? [18:18:05] in /usr/local/bin/tool-uwsgi needs '--enable-threads' [18:22:03] Coren: thanks. found the error log :) [18:26:15] Mjbmr: If you have a ~/www/python/uwsgi.ini it will be given to uwsgi as an --ini option. You can add configuration there. [18:26:37] Thank you so much. :) [18:26:45] How can I tell a continuous job to properly import pywikibot? I have overridden the PYTHONPATH in the .bash_profile and for non-continuous jobs it works [18:27:49] eranroz: .bash_profile is sourced only for interactive sessions. The most reliable way to pass environment variables to a job so that it works always regardless of how you have started it is with a shell script. [18:29:16] Coren: So I will wrap the script with a shell script. thanks! [18:29:34] eranroz: .bashrc is sourced by non-interactive shells, and would also work for most purposes. But starting a script that sets environment and execs the actual work script is the most reliable way - especially if you have more than one job. [18:29:57] (Also, you'll get serious headaches if .bashrc ever diverges from .bash_profile) [18:32:45] Coren: now some process is repeatedly killing my service [18:32:53] can I find out why? [18:34:06] schaffert: On the grid? You're almost certainly busting your memory limit. Check qacct -j; it'll show how much vmem it ended up taking.
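Putting together Coren's note that ~/www/python/uwsgi.ini is handed to uwsgi via --ini, a hypothetical file for the threading Mjbmr is asking about could look like this (the thread count is an arbitrary example, not a recommended value):

```ini
; ~/www/python/uwsgi.ini -- merged into the tool's uwsgi invocation
[uwsgi]
enable-threads = true
threads = 4
```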
[18:34:40] Or, alternately, start it with a more generous limit, and check its actual usage after some time with 'qstat -j ' [18:35:05] You can then adjust down to something around that actual usage + a bit of elbow room. [18:35:34] how do I get the jobid of the webservice? [18:35:45] 'qstat' while it runs. [18:36:27] Wait, that's a webservice? The limit is 4G of vmem - that's a lot, and a bit hard to bust. Are you trying to handle huge database results in memory? [18:38:07] sorry, lost connection, how can I get the jobid for a webservice? [18:38:34] 'qstat' while it runs. [18:38:36] Wait, that's a webservice? The limit is 4G of vmem - that's a lot, and a bit hard to bust. Are you trying to handle huge database results in memory? [18:39:08] no, I am running a small service using sqlite3 implemented in C++ [18:39:24] implemented myself with no known memory leaks :) [18:40:02] Tell me the job number of one of your failed runs once you get it. [18:41:11] I just have lines like this in the error log [18:41:13] 2015-02-28 18:30:42: (server.c.1512) server stopped by UID = 0 PID = 2344 [18:42:44] Well, it's running right now with job number 8512696 [18:42:54] at the moment it seems to work, like it did for a week before [18:42:57] It also has a vmem usage of 2.387g at [18:42:57] so no worries [18:42:58] atm [18:43:24] Most of this should be shared libraries, etc. [18:43:29] yes [18:43:43] ... and it just restarted [18:43:47] yes [18:43:49] I did it [18:44:01] Ah. [18:44:18] That, by the way, will also give you "server stopped by UID = 0 PID = xxx" [18:44:32] I was experimenting with some redirection that I don't manage to get working :) [18:44:42] Because that's exactly what happens: you asked the grid to stop it.
:-) [18:44:50] yes in this case I agree [18:45:06] but previously it died twice without my manual intervention :) [18:46:15] if you have some minutes, maybe you can help me with my redirection issue - problem is it works on my local machine but not with the configuration on the grid [18:46:46] since the service is a webservice, I was trying to redirect human users to the documentation when they request [18:46:52] Well, I'm on my weekend and need to go have lunch soon - but if you're still around when I come back I'll see what I can do to help. [18:46:55] http://tools.wmflabs.org/wikidata-primary-sources/ [18:47:03] ok, no worries then [18:47:08] we can try on Monday [18:47:17] the main webservice is working [18:47:20] so nop [19:33:12] 6Labs, 10Tool-Labs: Please enable multithreading for uwsgi on labs tools. - https://phabricator.wikimedia.org/T91156#1075894 (10coren) 5Open>3declined a:3coren uwsgi configuration can be made by placing directives in ~/www/python/uwsgi.ini, including threads. [20:12:32] 10Tool-Labs-tools-Other: Move Gerrit reports crontabs to Tools - https://phabricator.wikimedia.org/T88384#1075931 (10Nemo_bis) > I think the only remaining piece here is to set up the crontab. Thanks! have you already disabled yours? [22:38:48] wikibugs is down? [22:40:04] Dunno. [22:50:57] legoktm: wikibugs.log doesn't show anything, so maybe phabricator is borking with the stream? [22:51:18] or maybe nothing is happening =p [22:52:07] !log tools.wikibugs restarted wb2-phab to see if we get stuff from phab again [22:52:10] Logged the message, Master [22:52:48] oh wait, it's on UTC not CET [22:52:54] so it is getting messages [22:53:38] !log tools.wikibugs false alarm, messages were reported (2015-02-28 22:50:43,202 - irc3.wikibugs - DEBUG - > PRIVMSG #mediawiki-parsoid :10Parsoid, 10VisualEditor, 10VisualEditor-EditingTools, etc) which is a few minutes ago [22:53:40] Logged the message, Master [22:55:49] valhallasw`cloud: Something seems weird. [22:55:55] Fiona: ?
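As for schaffert's earlier question about sending human visitors to the documentation: with a lighttpd-based webservice, a per-tool lighttpd configuration fragment along these lines might work — the user-agent heuristic, the matched path, and the /docs target URL are all placeholders, not a tested recipe:

```
# Redirect browser-looking requests for the tool root to the docs,
# leaving API endpoints (e.g. /entities/...) untouched.
$HTTP["useragent"] =~ "Mozilla" {
    url.redirect += ( "^/wikidata-primary-sources/?$" => "/wikidata-primary-sources/docs" )
}
```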
[22:55:57] #mediawiki-feed has very little wikibugs activity. [22:56:29] yeah, because there is very little activity on phab :P [22:56:44] [07:54] Multimedia, operations, HHVM: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1075615 (Joe) @Bawolff first of all, thanks for the list of urls to test, this is a great starting point. Then, I apologize for lagging behind on this, but I have been busy with other priorities... [22:56:54] That's the last message I see in #mediawiki-feed. [22:56:55] hrm [22:56:59] 2015-02-28 22:53:34,305 - irc3.wikibugs - DEBUG - > PRIVMSG #mediawiki-feed :10Apex, 10MediaWiki-skins-MonoBook, 10MediaWiki-skins-Vector, 10OOjs-UI, and 2 others: OOjs UI uses font-size of 12.8px (0.8×16px) for windows and toolbars, but Vector skin uses 14px (0.875×16px) for body content, creating inconsistency between widgets in wi... - [22:56:59] https://phabricator.wikimedia.org/T91152#1075982 [22:57:01] that's odd [22:57:03] From wikibugs. grrrit-wm is a lot louder. [22:57:31] Oh, the bot is missing from the channel. [22:57:32] !log tools.wikibugs also restart wikibugs; it seems the PRIVMSGs in the log don't actually show up on irc [22:57:34] Logged the message, Master [22:57:47] Maybe netsplit? [22:58:07] no, I did see wikibugs but also didn't see something in -dev that the log file said was reported there [22:59:14] Seems better now.