[00:04:58] Hm.
[00:05:34] Something odd is going on.
[00:10:21] Coren: ?
[00:11:00] Ryan_Lane: NFS problem. Wasn't last time a week to the hour?
[00:11:08] no clue
[00:11:30] Ah, no, two weeks to the day.
[00:12:27] o_O NFS server is cringing under the load.
[00:12:47] What in blazes?
[00:13:11] look at dmesg
[00:14:29] what in the world is causing nfsd to eat up so much cpu?
[00:14:37] That's what I'm trying to figure out.
[00:14:47] I see nothing odd going on.
[00:14:55] Well, except the symptom.
[00:16:16] what's with all the nslcd errors on the syslog?
[00:20:15] Those are "usual". I've seen them on all instances
[00:21:10] Ryan_Lane: The NFS processes aren't cpu bound, they're crusted under IO load.
[00:21:16] crushed*
[00:21:17] yeah
[00:21:25] what's causing it?
[00:21:45] there's no iowait
[00:21:46] It's a bit hard to parse through a tcpdump. :-)
[00:22:11] Not much iowait, the shelf is keeping up.
[00:25:39] hm. we don't have mount stats? :(
[00:26:09] * Coren installs iftop
[00:27:27] tools-login is the biggest one by far, though nothing I'd call extraordinary.
[00:27:57] Actually, net traffic is now rather low. WTF?
[00:28:29] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Labs%2520NFS%2520cluster%2520pmtpa&tab=m&vn=
[00:28:39] just in general things went pretty wonky almost immediately
[00:30:43] Restarting the kernel nfs daemon brought the load back down instantly to normal levels.
[00:31:03] yep
[00:31:09] ... and it comes back up. Something tools-login is doing is evil.
[00:32:19] On tools-login, I got umptillion wedged cron jobs.
[00:33:04] the NFS mounts are working okay though.
[00:34:01] ... sorta.
[00:34:19] I can df and ls, but trying to actually read a file doesn't work.
[00:35:19] wedged on D
[00:36:13] nothing in the logs
[00:36:18] What in *bleep* is going on?
[00:37:15] Coren: I can't even login via ssh by the way
[00:37:23] AzaToth: Symptom
[00:38:03] sure it's nfs and not the bloody clusterfuckfs?
[00:39:25] clusterfuckfs?
[00:39:30] you mean gluster?
[00:39:34] it's not being used in tools
[00:40:13] k
[00:40:17] and funny enough, it's working without issue :)
[00:40:21] hehe
[00:41:03] I'm not getting it.
[00:44:44] I've never seen this. I can browse the NFS mount no problem, but any attempt to actually /read/ a file deadlocks.
[00:46:58] * Ryan_Lane grumbles
[00:48:01] Oh-ho!
[00:48:32] Something funky with lockd.
[00:49:34] 81.69% nfsd [kernel.kallsyms] [k] mutex_spin_on_owner
[00:49:34] 13.29% nfsd [nfsd] [k] nfsd4_process_open2
[00:49:34] 1.37% nfsd [nfsd] [k] test_share.isra.48
[00:52:07] did you restart the nfs server again?
[00:52:25] well crap, I wonder if the load was gone when I ran perf. heh
[00:53:48] *something* is holding locks
[00:58:41] I think I found it.
[00:58:55] I'm seeing zero traffic to statd.
[01:05:09] wtf happened to lslk in precise?
[01:17:21] * Coren sighs.
[01:20:43] Uses the BIG hammer.
[01:21:04] hm
[01:21:17] what? reboot?
[01:21:21] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1006446
[01:24:04] how is the filesystem mounted from the client?
[01:24:15] what's the rsize and wsize?
[01:25:05] https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/879334
[01:25:08] Ryan_Lane: 64k
[01:26:59] And that unwedged.
[01:27:01] I am displeased.
[01:27:34] All of tools is back to operation as expected.
[01:27:52] * Coren inspects those bugs in detail.
[01:31:59] Aha.
[01:32:15] we're using udp, and not tcp, right?
[01:32:56] Hm. Looks like we're using TCP. I didn't specify, so nfs4 picked its default.
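
The "we don't have mount stats" aside and the pasted perf percentages suggest a few generic diagnostics. This is a sketch of how such numbers can be gathered, not necessarily what was run here:

    # Client side: show the mount options (proto, rsize/wsize) actually in effect
    nfsstat -m
    # Per-mount NFS statistics, if the kernel exposes them
    cat /proc/self/mountstats
    # Server side: sample where the kernel nfsd threads spend their time
    perf record -a -g -- sleep 10 && perf report --sort comm,dso,symbol
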
[01:33:19] The rsize/wsize things might be the problem though.
[01:33:23] ugh
[01:33:39] hm. wait. I thought nfs4 only ever used udp?
[01:35:14] ah. wait. that's right it changed the default to tcp
[01:35:18] "All NFS version 4 servers are required to support TCP, so if this mount option is not specified, the NFS version 4 client uses the TCP protocol."
[01:35:34] * Coren switches to UDP with 8k reads.
[01:35:37] I was thinking backwards. it's been a while since I've really needed to deal with nfs :)
[01:35:46] tcp should actually be the better protocol to use
[01:36:05] It should, in theory.
[01:36:42] we should also consider switching the scheduler to deadline
[01:37:44] That part seems iffier to me. cfq should normally be superior for a bunch of unprioritized threads.
[01:39:36] So, 8k... udp or tcp?
[01:45:42] Coren: https://bugzilla.redhat.com/show_bug.cgi?id=448130
[01:45:49] I'd say leave tcp
[01:46:30] I wonder if this is something that was fixed in ubuntu as well as redhat
[01:48:25] «Did you know that you should never run your bot directly on login server, instead run it using jsub!» what is jsub?
[01:55:06] gry: man jsub will give you the details but tl;dr: it sends your job to the gridengine for execution where there are resources.
[01:55:21] Ryan_Lane: It's 2010, this has got to have made it upstream by then.
[01:55:31] yeah. you'd hope :)
[01:58:45] hm. seems ibm recommends deadline over cfq for kvm hosts and guests
[01:58:48] http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaatbpdeadline.htm
[03:26:22] qsub or jsub?
[03:27:36] is tools still broken?
[03:28:02] no idea.
[03:39:36] shouldn't be
[03:56:20] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Managing_Jobs "You can submit jobs with qsub," Should that be changed to "jsub"? Other parts of the page, such as examples, use jsub, as does the text warning on login.
[03:56:36] Wrong section link; just search for the phrase on the page.
[04:04:09] gry: qsub and jsub will both work, jsub is just easier
[04:05:29] "You submit a job to a work queue from a submission server (-login) and the web servers.; " so it looks like I have to submit from tools-login, not login-toolname. Examples do the latter though. What do I do with that line?
[04:06:24] erm so
[04:06:34] ssh legoktm@tools-login.wmflabs.org
[04:06:36] then
[04:06:38] become toolname
[04:06:45] then jsub blah.sh
[04:07:00] why does it say "submit a job ... from ... -login" ?
[04:07:39] jsub submits the job to the grid engine
[04:07:51] yes, the question is where to submit it from
[04:07:51] the grid engine then runs it on one of 6? execution hosts
[04:07:57] from tools-login
[04:08:20] if I ssh to tools-login and then become toolname, am I still at tools-login?
[04:08:29] ah I see, yes
[04:09:46] am I allowed to use things like perlbrew on local-toolname@tools-login?
[04:12:36] i have no clue. there aren't any real rules yet, besides don't destroy things, and it has to be open source.
[06:11:51] gry: You can install perlbrew, but you very probably shouldn't.
[06:13:40] Coren: BTW, during the hiccup I got "/bin/sh: execle: Cannot allocate memory" and "/bin/sh: 1: /home/scfc/bin/replagstats: Cannot allocate memory", so memory seems to have been tight.
[06:14:29] scfc_de: Symptom, not cause: as the filesystem wedged the number of started but hung cron jobs grew to ridiculous levels.
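
For reference on the transport/rsize/wsize and scheduler discussion further up: the knobs being debated look roughly like this on a Linux client and server. Server path, mount point and device are placeholders, not taken from the log:

    # Hypothetical NFSv4 mount with the transport and block sizes spelled out:
    mount -t nfs4 -o proto=tcp,rsize=65536,wsize=65536 nfs-server:/exports /data/project
    # Check and switch the IO scheduler for a block device (cfq vs deadline), as root:
    cat /sys/block/sda/queue/scheduler
    echo deadline > /sys/block/sda/queue/scheduler
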
[06:22:19] I'm looking for a few modules absent in distro packages, and prefer using cpanm, so if there is a place to request a system-wide install it could be useful
[06:25:34] gry: Just file a bug in Bugzilla: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Wikimedia%20Labs&component=tools
[07:03:21] wow, if all software requests have to be made at bugzilla I'll become a bugzilla spammer
[07:03:43] since perl isn't a particularly shiny language when it comes to being clear about missing modules
[07:10:16] gry: Nothing wrong with that.
[07:53:46] hi
[07:54:13] gry: you don't need to create a ticket for every package ;)
[07:54:23] you can just make 1 big ticket
[07:55:05] !tools-bug is https://bugzilla.wikimedia.org/enter_bug.cgi?product=Wikimedia%20Labs&component=tools
[07:55:05] Key was added
[08:31:20] !log toolsbeta switching all instances to nfs
[08:31:22] Logged the message, Master
[08:44:42] @notify mutante
[08:44:42] This user is now online in #wikimedia-tech. I'll let you know when they show some activity (talk, etc.)
[08:44:46] @notify andrewbogott_afk
[08:44:46] This user is now online in #wikimedia-labs. I'll let you know when they show some activity (talk, etc.)
[09:46:37] I'm having trouble again ssh-ing to tool-labs
[09:46:50] (from bastion)
[09:52:27] Oren_Bochman did you try directly to tools-login
[09:52:39] yes that works
[09:52:40] I can't help you with bastion :(
[09:52:57] not trusted enough to have access there :P
[10:08:46] Oren_Bochman: One of my cron jobs failed due to not enough memory around the time you reported, so it might have been a temporary bug.
[10:09:13] Hmmm. ganglia.wmflabs.org is down again, so no easy confirmation.
[10:10:18] * Oren_Bochman retries
[10:10:28] looks like it is a different issue
[10:10:38] I could ssh directly
[10:10:49] last time andrew - fixed it
[10:11:04] he said my /home process was hanging
[10:11:18] not sure what to do about it myself
[10:22:22] scfc_de: so memcached - master is apparently version 1.4, and actual dev release is 1.6, which is in branch engine-pu
[10:22:44] scfc_de: that does seem to have better autotools, however it doesn't actually work
[10:24:38] Coren: help - ssh issues once again
[10:26:54] YuviPanda: "Source and Development" on memcached homepage links to code.google, which then in turn links to GitHub -- I love those setups :-).
[10:27:05] scfc_de: :D
[10:27:11] it seems fairly stagnant
[10:52:16] petan: http://ganglia.wmflabs.org/ is completely down this time.
[10:53:19] Also received a number of OOM errors on tools-login about 15 minutes ago.
[10:54:23] liangent: what's project Category Sorting about?
[11:02:25] I don't get this. For OOM on tools-login, something must suck 1 GByte of memory really fast.
[11:16:09] !log deployment-prep created /data/project/apache/uncommon/master , owned by mwdeploy:mwdeploy and mode 0755.
[11:16:12] Logged the message, Master
[11:28:41] scfc_de checking
[11:31:16] @labs-project-instances ganglia
[11:31:16] Following instances are in this project: aggregator1,
[11:31:33] !log ganglia rebooting
[11:31:34] Logged the message, Master
[11:39:33] scfc_de what kind of OOM message did you get on -login
[11:39:40] I can't find any message from killer in syslog
[11:42:39] scfc_de ganglia is back up
[12:10:32] petan: Yes, ganglia is up for me as well. Re OOM: "Cannot allocate memory at $SCRIPT" (calling external program) at 10:34Z, 10:36Z and 10:39Z, "Out of memory!" at 10:35Z and 10:40Z.
[12:10:47] scfc_de is that on -login or exec nodes?
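
On the Perl module question near the top of this stretch: besides requesting a system-wide package, modules can usually be installed into the tool's own directory. A sketch, assuming cpanm is available (Some::Module and the paths are placeholders):

    # Install into a local lib owned by the tool account instead of system-wide:
    cpanm --local-lib ~/perl5 Some::Module
    # Make the local lib visible to the tool's scripts:
    export PERL5LIB=~/perl5/lib/perl5
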
[12:22:11] CGI was randomly throwing 500s for about a minute? :/
[12:23:57] what is the "correct" way to query a backend script on the tools web servers, from the tools web servers?
[12:24:59] querying tools.wmflabs.org/ourtool/stuff.py doesn't work - external IP
[12:25:19] using an internal IP works but skips load balancing
[12:25:26] petan, Coren?
[12:25:33] hi
[12:26:21] JohannesK_WMDE: That's a known limitation of OpenStack; it is unable to loop back onto its public IPs
[12:26:50] hi Coren
[12:27:03] for some reason on toolsbeta puppet is throwing errors
[12:27:20] petan: I may be able to help in a few hours.
[12:27:24] ok
[12:27:35] Coren: ah. so we should use one of the webserver-01/02 ips? we found something called tools-webproxy, is that the load balancer which we should query?
[12:27:36] it doesn't want to install the sql tool because of some getaddr issue
[12:28:03] JohannesK_WMDE the IP address is retrieved from /data/project/.system/webservers
[12:28:34] Strictly speaking, you can query any of the active webservers.
[12:28:43] (they are all identical)
[12:31:43] /data/project/.system/webservers -- so the "balancing" is per-tool then? (or could be... atm it's all the same server)
[12:31:54] not really
[12:31:58] cb is using -02
[12:32:05] all other tools are using 01
[12:33:02] so we just use -01
[12:33:49] petan: It was on tools-login.
[12:45:28] It looks like something is broken with wikitech's logo. The SVG at https://wikitech.wikimedia.org/wiki/File:Wikimedia_labs_logo.svg has a transparent background, but all the rendered PNGs are showing a white background.
[13:01:01] anomie: Is http://commons.wikimedia.org/wiki/Help:SVG, "How do I get rid of the transparent background?" relevant here? Looks like it's intended that way.
[13:03:43] scfc_de: No, the problem is that the transparent background is missing. Also, I looked at the actual SVG code and there appears to be no white background in it.
[13:16:30] anomie: "The SVG at https://wikitech.wikimedia.org/wiki/File:Wikimedia_labs_logo.svg has a transparent background" vs. "the transparent background is missing"?!
[13:17:06] scfc_de: Yes. The SVG has a transparent background. But the rendered PNGs are showing a white background.
[13:17:10] scfc_de, anomie someone explained this by the svg preview / thumbnail generator being broken
[13:17:19] this indeed is a bug
[13:17:24] it's in bugzilla
[13:17:35] petan: Ok, good.
[14:09:47] JohannesK_WMDE: to host a demo installation of https://bugzilla.wikimedia.org/show_bug.cgi?id=44667
[14:17:06] liangent: chinese collations? sure that is the right link? i mean the project named Category Sorting
[14:19:53] JohannesK_WMDE: it is
[14:20:53] ah, i see
[14:47:50] andrewbogott https://wikitech.wikimedia.org/wiki/Special:NovaSudoer doesn't work :/
[14:48:41] petan: ok, I'll look in a moment
[14:51:50] petan, can you be more specific?
[15:03:33] petan: when having the tools project in the filter?
[15:03:38] I think I've seen this
[15:04:00] we're probably doing something horribly inefficient in that special page
[15:07:09] hm… I see the problem, will investigate.
[15:19:32] Ryan_Lane: just an update, I started upstreaming patches to memcached. From talking to the maintainer on IRC he has very little time, and there is high reluctance to touch the codebase for anything.
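
Tying together the internal-webserver advice earlier in this stretch: assuming /data/project/.system/webservers simply lists addresses one per line (its exact format is not shown in the log), the backend call could look roughly like this, with the tool path as a placeholder:

    # Query the tool's backend via an internal webserver instead of the public IP:
    WEBSERVER=$(head -n 1 /data/project/.system/webservers)
    curl "http://${WEBSERVER}/ourtool/stuff.py"
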
[15:19:42] I see
[15:19:45] already have a small patch in there waiting for review
[15:19:54] Ryan_Lane: but there are like patches waiting for review from other people for like, 8 months
[15:20:12] fun times
[15:20:19] Ryan_Lane: I did get told to keep pestering him, so will do that, and see if that can work
[15:20:40] Ryan_Lane: we do have a redis install, though - that has a config feature to disable 'dangerous' commands, and we do have it disabled
[15:21:06] i've a tool (the GitHub <-> Gerrit bot's Gerrit -> GitHub sync) using it right now, will write up docs once it runs for like a week or so without major issues
[15:21:38] unsure if it is in puppet though, will make sure it is, eventually
[15:42:51] YuviPanda: awesome. redis has a puppet module
[15:42:57] awesome
[15:43:04] it's in use in production
[15:43:13] I'd imagine it'll work for you. it may need modification
[15:43:46] Ryan_Lane: yeah, I'll look at it. I've never used Puppet before, so would be a good way to dive in
[15:43:56] this is, of course, assuming that tools-mc is not already puppetized :)
[16:22:43] !log deployment-prep varnish-t3 (mobile cache): cleaned up operations/puppet local repo and re ran puppet. Still blocked :/ {{bug|49700}}
[16:22:47] Logged the message, Master
[16:25:42] !log deployment-prep Apache was down on apache32. Restarted it as well as on apache33.. Solved {{bug|49700}}
[16:25:45] Logged the message, Master
[17:07:38] Coren, ?
[17:31:10] YuviPanda of course it is
[17:31:18] ?
[17:31:21] puppetized?
[17:31:24] including the config?
[17:31:37] yes, but it doesn't have the redis thing at all
[17:31:47] but the box itself is puppetized
[17:31:55] it is in modules/toollabs
[17:31:55] any admins on? Still having an issue on tools-login where with the tool account I get a Permission denied when trying to crontab -e
[17:32:03] sdamashek hi
[17:32:08] hi
[17:32:19] any idea what the problem is?
[17:32:22] petan: ah
[17:32:24] sdamashek: can u tell me the name of a tol
[17:32:28] voxelbot
[17:32:29] tool
[17:32:49] petan: so the redis is not puppetized but the box is
[17:32:50] ok
[17:33:07] Ryan_Lane: could you review https://gerrit.wikimedia.org/r/#/c/69126/1 ? I thought it would fix the sudoer page and it doesn't… but is still probably useful.
[17:34:22] !log tools petrb: /var/spool/cron/crontabs/ has -rw------- 1 8006 crontab 1176 Apr 11 14:07 local-voxelbot fixing
[17:34:26] Logged the message, Master
[17:35:04] petan: many thanks, it works now
[17:35:10] yw
[17:35:31] YuviPanda that means, if you want to puppetize this, use the existing file
[17:35:52] which, tbh I am not sure if it was merged by Coren yet
[17:48:28] btw YuviPanda I am setting up the toolsbeta, which is a great place for this kind of experiment :P
[17:48:48] Coren, ?
[17:50:07] petan, did you kill Coren?
[17:50:09] :p
[17:50:14] yes
[17:50:42] ok. Just wondering. :p
[17:50:50] I decided doing stuff alone is better
[17:51:18] petan, can you then tell me what the status of S7 is and if legal has made any headway on archive?
[17:51:26] no I can't
[17:51:36] I have absolutely no access to db hardware whatsoever
[17:51:56] But you're doing everything alone? You have doomed us all on labs. :p
[17:52:07] nah this is Asher's job
[17:52:20] he is the DB guy
[17:52:23] He never seems to be on IRC.
[17:53:50] Cyberpower678: binasher is usually on some of the ops channels
[17:54:23] YuviPanda, channel?
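
On the crontab problem logged above (the file in /var/spool/cron/crontabs owned by numeric uid 8006 rather than the tool account): the exact commands used are not in the log, but a plausible fix matching that ls output would be to give the file back to the tool account:

    # Hypothetical repair for the ownership shown in the !log line:
    chown local-voxelbot:crontab /var/spool/cron/crontabs/local-voxelbot
    chmod 600 /var/spool/cron/crontabs/local-voxelbot
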
[17:55:02] Cyberpower678: #wikimedia-operations
[17:55:50] petan: actually, coren is handling the redaction stuff
[18:02:01] andrewbogott: yeah, this patch makes sense
[18:02:22] It gets the page to load in finite time rather than infinite time, but I need to add more caching elsewhere.
[18:04:27] heh
[18:04:28] * Ryan_Lane nods
[18:04:31] let me approve
[18:05:15] merged
[18:05:36] @search gerrit
[18:05:37] Results (Found 6): gerrit, whitespace, git-puppet, gerritsearch, ryanland, gitweb,
[18:05:45] !git-puppet
[18:05:45] git clone ssh://gerrit.wikimedia.org:29418/operations/puppet.git
[18:05:51] !coren
[18:05:52] The toolmeister: http://www.mediawiki.org/wiki/User:MPelletier_(WMF)
[18:06:01] !coren
[18:06:02] The toolmeister: http://www.mediawiki.org/wiki/User:MPelletier_(WMF)
[18:06:04] !Coren
[18:06:39] Coren is Coren is dead. petan killed him. He now roams about as a zombie.
[18:06:55] !Coren is Coren is dead. petan killed him. He now roams about as a zombie.
[18:06:56] Key was added
[18:20:13] Coren: how insane would I be thinking about allowing mounts of all projects in the bastion project?
[18:20:56] assuming we mount them under directories that only allow access to people in the group?
[18:26:32] Ryan_Lane what for
[18:27:04] petan: so that people can directly scp/sftp to their project
[18:27:15] hmm
[18:27:28] also so that people can push to gerrit from repos without needing to forward their agent past the bastion
[18:28:55] we don't allow root on bastions, so it *should* be safe
[18:29:15] assuming we put a managed directory in front with proper permissions
[18:31:06] depending on how reliable the shared storage is, that could also allow people to do most of their work on the bastion
[18:31:41] we could also turn the bastion into a salt-trusted peer
[18:31:49] I'd still want salt-api for that, though
[18:38:02] marktraceur: so, we have a bunch of −1'd changes now :)
[18:38:31] I'll start on the −1'd issues for the php parts today
[18:38:40] and work towards the JS
[18:39:02] Ah yeah
[18:39:04] ori mentioned we should be using an MVC framework, but I think that's outside the scope of our current changes
[18:39:15] It probably is
[18:39:22] maybe something we should do in a refactor
[18:39:40] Ryan_Lane: I may not be able to look at it for a while, or... need to look at it on a weekend maybe
[18:39:45] no rush
[18:39:52] *nod*
[18:40:08] We wanted to merge a bunch of Nischayn22's code this week, so that'll be my Thursday and Friday
[18:54:45] andrewbogott: heh. the ldap log on virt0 is going insane
[18:54:51] I have a feeling you're testing things :)
[18:55:07] Yeah, but I thought the log was off...
[18:55:27] it looks like it's actually loading every single user
[18:55:36] in the project
[18:55:44] that would definitely make this incredibly slow
[18:55:58] Yeah, it's always been doing that, that's what I'm working on. I need a quick way to convert a name to a uid and back. Probably I'll cache that whole list.
[18:56:31] that may not be easy
[18:56:39] it'll hit search limits
[18:57:25] I think I can do it per project.
[18:57:39] but maybe I can avoid doing that conversion entirely… gotta read more
[18:58:32] I turned the log off, meanwhile :)
[19:01:22] https://gerrit.wikimedia.org/r/69133
[19:02:29] Reedy: what did you mean?
[19:02:31] :P
[19:29:07] Is there a faster way to "SELECT COUNT(1) FROM abuse_filter_log WHERE afl_user_text=%s"?
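
The COUNT question just above goes unanswered in the log. The usual first step for a query like that is to check whether it can use an index on afl_user_text; a sketch, where the credentials file, host and database are illustrative and not taken from the log:

    # Run EXPLAIN against the replica to see whether an index is used:
    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p \
        -e 'EXPLAIN SELECT COUNT(1) FROM abuse_filter_log WHERE afl_user_text = "ExampleUser"\G'
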
[19:30:54] * anomie replies to something an hour ago
[19:30:54] Ryan_Lane: As far as scp and sftp go, I finally got around to setting up the proper ProxyCommand rules in my ssh config (as recommended at [[wikitech:Help:Access]] and [[wikitech:Server access responsibilities]]) and scp/sftp to instances behind the bastion seem to Just Work.
[19:33:17] anomie: ah. cool
[19:33:36] I'd still like to make that workflow easier for people :)
[19:33:47] especially folks who use graphical clients
[19:40:22] Ryan_Lane: don't swear
[19:40:30] AzaToth: :)
[19:40:44] ツ
[19:41:49] how many public IP(v4) addresses do you have available?
[19:44:13] not totally sure :)
[19:44:58] 42 left
[19:45:40] most of which are allocated to projects, but aren't associated with any instanced
[19:45:43] *instances
[19:45:57] Ryan_Lane: did you see my e-mail on the sf.net user admin panel? I don't need a response, just checking if you received it :-)
[19:46:08] we have 13 completely free
[19:46:15] valhallasw: I did. thanks for that! :)
[19:46:22] ok, cool :-)
[19:46:23] I need to work out a sane way of implementing it
[19:46:29] we have some projects with like 50 users
[19:47:08] some with more (tools has over 100)
[19:48:53] Ryan_Lane: yeah, it shows some cracks for large amounts of users. It's not even sorted :|
[19:49:06] heh. yeah.
[19:49:22] Ryan_Lane: but the basic idea '[-] button next to users' and '[+] button that creates a new field to add a user' seems sane enough
[19:49:24] the way mediawiki handles this is by just using a text box and only allowing single actions
[19:49:27] yep
[19:50:01] we could also use a chosen multi-select for the addition of users
[19:50:33] Ryan_Lane: So I assume you have like a C network totally?
[19:50:42] AzaToth: nope
[19:51:08] AzaToth: we use /32s
[19:51:19] and route larger segments
[19:51:25] ok
[19:51:25] we don't do subnetting at all
[19:51:53] we're using static routes, which will eventually be BGP announced routes
[19:52:22] (that saves a few IPs per block)
[19:52:23] thought you have a 24 for disposal
[19:52:29] had*
[19:52:37] we have about 42
[19:52:44] but really about 13
[19:53:03] we can scrounge 42 if we pull allocated but unused IPs from projects
[19:53:11] as you said 42 left, I assumed you meant you had more that was already in use
[19:53:28] we have 230
[19:53:31] ok
[19:53:58] some are being used for a project and will go away when it's done
[19:54:28] but those 230 are not linked together?
[19:54:38] like as logically part of a sub network?
[19:56:07] AzaToth: correct
[19:56:15] static routes directly to the network node
[19:56:22] ok
[19:56:42] and eventually BGP
[19:57:08] Coren: jstop is giving me:
[19:57:10] Use of uninitialized value $mode in string eq at /usr/bin/job line 55, line 13.
[19:57:10] Use of uninitialized value $mode in string eq at /usr/bin/job line 59, line 13.
[20:11:37] anomie: https://gerrit.wikimedia.org/r/#/c/65421/29/api/ApiNovaAddress.php,unified
[20:11:54] anomie: so. does $this->dieUsage escape messages?
[20:12:04] do we not care about internationalization of api messages?
[20:13:56] Ryan_Lane: The API doesn't internationalize error messages at all. There's a bug for that, the long-term plan is that clients would somehow-or-other specify a language code and whether they want basic messages from the i18n ("bot mode") or locally-customized messages ("ui javascript mode").
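
For context on the error-message discussion around here: what an API client actually receives is a machine-readable error code plus a human-readable info string. A hedged illustration using a deliberately invalid request (the wiki URL is just an example, and the exact wording varies by MediaWiki version):

    curl -s 'https://en.wikipedia.org/w/api.php?action=nosuchaction&format=json'
    # => roughly {"error":{"code":"unknown_action","info":"Unrecognized value for parameter 'action': ..."}}
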
[20:15:08] For bots, it'd be annoying to get back some huge wad of presentational wikitext in the error message for your log file, while that's exactly what you probably want in a user script to present to the user.
[20:15:28] (or worse, a bot getting a huge wad of HTML)
[20:24:50] marktraceur: ^^
[20:24:57] marktraceur: so, that answers our questions :)
[20:25:31] marktraceur: basically we should just hardcode error messages into the api die outputs
[20:28:49] Ah hm.
[20:28:54] That's... OK
[20:30:05] not really what I was expecting. heh
[20:37:05] anomie: when using dieUsage, I don't need to specify the message being used in getPossibleErrors, right?
[20:37:20] I just spit out a message?
[20:41:30] Ryan_Lane: To tell the truth, I have no idea what the use case is for getPossibleErrors. But technically, yes, you're supposed to return the code and something at least vaguely resembling the message from getPossibleErrors.
[20:42:03] code?
[20:42:09] I guess I need to look at that function again
[20:42:22] docs for this are… less than great :)
[20:44:52] Ryan_Lane: Errors have a code and a description, the two args to dieUsage(). Clients will check the code to determine what the error was and how to handle it, while they'll use the description for logging (and maybe showing to the user, if they're too lazy to do something better).
[20:45:09] * Ryan_Lane nods
[20:45:40] dieUsage looks like the error code is a string
[20:45:41] heh
[20:46:37] Yeah, it's a string. Superficially it's usually very similar to a message key.
[20:47:59] As in, returning the message key is probably going to be about right in most cases
[20:48:51] New patchset: Tim Landscheidt; "job: Initialize $mode." [labs/toollabs] (master) - https://gerrit.wikimedia.org/r/69157
[20:49:11] * anomie digs into git history. Apparently getPossibleErrors was requested in bug 18771, which still doesn't have any actual use case beyond "so I can see what the possible errors are".
[20:49:30] heh. I love getUsageMsg & parseMsg
[20:49:42] getUsageMsg is basically only really usable by core
[20:49:54] that's a pretty evil function
[20:51:25] and $messageMap is public static
[20:51:27] * Ryan_Lane shudders
[20:54:23] Ryan_Lane: What's worse, some of the API hooks require its use. I had to write a hook function that handled errors with code like "ApiBase::$messageMap['key'] = array( 'code' => 'foo', 'info' => 'bar' ); $message = array( 'key' ); return false;".
[20:54:26] With a comment "Bad design, API".
[20:55:04] that is scary
[21:01:32] wasn't there a bot command to search for instance names here?
[21:02:22] !help
[21:02:23] !documentation for labs !wm-bot for bot
[21:04:20] @labs-project-instances testlabs
[21:04:20] Following instances are in this project: webserver-lcarr, build-precise1, util-abogott, testlabs-abogott-dev, asher-m1, testlabs-buildtest, py-misc-projects, robh-sp, robh-sp1,
[21:04:42] would like to search for an instance name and get the project it's in
[21:05:57] @labs-project-instances php
[21:05:57] Following instances are in this project:
[21:12:09] @labs-project-instances packaging
[21:12:09] Following instances are in this project: udp-filter, php-packaging, build-lucid1,
[21:12:16] aha :) there it is, nice
[21:13:41] !log packaging adding myself to project to use wmf-build script from luasandbox repo
[21:13:42] Logged the message, Master
[21:21:06] @labs-resolve nova-precise2
[21:21:07] The nova-precise2 resolves to instance I-00000553 with a fancy name nova-precise2 and IP 10.4.1.57
[21:21:18] heh.
doesn't list the project
[21:21:29] @labs-info nova-precise2
[21:21:29] [Name nova-precise2 doesn't exist but resolves to I-00000553] I-00000553 is Nova Instance with name: nova-precise2, host: virt6, IP: 10.4.1.57 of type: m1.small, with number of CPUs: 4, RAM of this size: 2048M, member of project: openstack, size of storage: 30 and with image ID: ubuntu-12.04-precise (deprecated)
[21:21:42] mutante: ^^
[21:22:04] @labs-resolve php-packaging
[21:22:04] The php-packaging resolves to instance I-000003ae with a fancy name php-packaging and IP 10.4.0.177
[21:31:34] @labs-resolve init
[21:31:34] I don't know this instance, sorry, try browsing the list by hand, but I can guarantee there is no such instance matching this name, host or Nova ID unless it was created less than 14 seconds ago
[21:31:51] crazy bot
[21:32:05] Ryan_Lane, petan: OK, this patch makes the sudoer page much faster (although still not actually 'fast'): https://gerrit.wikimedia.org/r/#/c/69155/2
[21:48:59] andrewbogott: is getMembers not used anywhere else?
[21:49:16] andrewbogott: it's used in a bunch of places
[21:49:26] lots and lots of places :)
[21:50:11] Ryan_Lane: I've found a bug in labs
[21:50:24] Oren_Bochman: there's lots of bugs. specifics? :)
[21:50:46] I can't ssh any longer from bastion to tools-login
[21:51:51] or to other hosts
[21:52:23] but I can ssh directly to tools-login
[21:52:47] andrewbogott: helped me out with this last week
[21:52:53] but I don't know how
[21:53:03] he mentioned /home hanging
[21:53:54] anyhow I don't know enough details to write up a bug report
[21:54:02] or to fix it
[21:55:53] Ryan_Lane, I didn't remove getMembers, it just relies on the member variable now (which, in turn, is backed by the cache.)
[21:56:13] Oren_Bochman, what instance are you trying to ssh to?
[21:56:34] tools-login
[21:56:54] and he-moodle-24
[21:57:19] and he-moodle-25 etc
[21:58:26] those instances look fine to me. Are you forwarding your key when you connect to bastion?
[21:58:35] yes
[21:59:00] the weird thing is that I'm already using bastion to ssh to tools-login
[21:59:15] but I can't make any new connections
[22:00:34] Well… I'm pretty sure that something is broken on your end. What error do you get?
[22:00:38] And, are you on Windows or linux, or?
[22:00:43] Windows
[22:00:48] using putty
[22:01:09] I get permission denied
[22:01:18] (public key)
[22:03:00] that definitely sounds like your key isn't forwarded. I don't know how to debug in putty though.
[22:03:00] I get "Host 'tools-login' is known and matches the RSA host key."
[22:03:07] so key forwarding works
[22:03:42] Not necessarily, I think that's about the host validation (not your private key)
[22:03:48] Oren_Bochman: on bastion type: ssh-add -l
[22:04:33] what's the output
[22:04:34] ?
[22:04:47] Could not open a connection to your authentication agent.
[22:05:22] New patchset: Krinkle; "job: Initialize $mode" [labs/toollabs] (master) - https://gerrit.wikimedia.org/r/69157
[22:05:34] I can make a pastebin of the ssh -vv tools-login
[22:05:35] andrewbogott: ah. sorry. misread the diff
[22:05:46] Oren_Bochman: so. yeah. your key isn't forwarded
[22:05:56] on your local computer: ssh-add -l
[22:06:40] it's Windows - and it is set to forward
[22:07:02] is the key in your pagent?
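
For OpenSSH users following this exchange, a minimal sketch of a client-side ~/.ssh/config that covers both the agent forwarding being debugged here and the ProxyCommand hop mentioned earlier; usernames and the instance domain suffix are assumptions, and PuTTY users would set the equivalents via Pageant plus its "Allow agent forwarding" option instead:

    Host bastion.wmflabs.org
        User yourusername
        ForwardAgent yes
    Host *.pmtpa.wmflabs
        User yourusername
        ForwardAgent yes
        ProxyCommand ssh -W %h:%p bastion.wmflabs.org

With something like this in place, ssh-add -l on the bastion should list the key's fingerprint, which is exactly the check being suggested in the log.
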
[22:07:17] in fact I'm running tmux with 3 ssh sessions from a couple of days ago
[22:07:21] yes it is
[22:07:40] to tools-login
[22:07:51] if "ssh-add -l" on bastion isn't showing your key's fingerprint, then it isn't being forwarded to that instance
[22:08:13] are you selecting a saved profile for bastion?
[22:08:20] maybe it's not set up in that specific profile?
[22:08:31] I am - I'll check
[22:08:34] ok
[22:09:25] but like I said - it's in a profile which was successfully doing key forwarding - until it stopped.
[22:09:55] the way to test this is always: ssh-add -l
[22:11:17] Oren_Bochman: This is a silly question, but… are you ssh'd directly to bastion from putty, or did you hop from bastion -> somewhere -> bastion?
[22:11:51] I've killed all my sessions
[22:12:23] 'k
[22:13:11] but I'm running: tmux a -t bastion || tmux new -s bastion
[22:13:19] as my connection command
[22:14:37] andrewbogott: merged
[22:14:42] how can I check for the above loop
[22:14:45] could somebody grant my user 'gwicke' admin rights on en.wikipedia.beta.wmflabs.org so that I can import articles to test?
[22:14:54] hashar: ^^
[22:14:56] petan: ^^
[22:15:05] no clue who else is the right person for that :)
[22:15:10] Ryan_Lane: ^^ *** ERROR CYCLE DETECTED ***
[22:15:20] Ryan_Lane: thanks for forwarding ;)
[22:15:22] gwicke: will do
[22:15:24] gwicke: yw :)
[22:15:28] if I find out how to do that haha
[22:18:13] gwicke: you might be a Global developer, Global sysops, Importers and Staff now
[22:18:35] whatever that can mean, our interface is really horrible :(
[22:18:56] Importers sounds promising
[22:19:22] is there a limit on the number of ssh sessions from bastion to any given host
[22:19:34] hashar: that seems to work, thanks!
[22:20:37] \O/
[22:21:51] exit
[22:22:04] isn't it /exit ?
[22:22:06] * hashar tries
[22:22:52] works for me
[22:23:09] oren: /join 0
[22:23:55] I need to close a tmux session - without killing the irc client ;-)
[22:24:15] and I've remapped the keys
[22:24:35] Oren_Bochman: Screen is C^a d if that helps :)
[22:24:54] * Oren_Bochman is getting tired of Chatzilla
[22:26:13] hmmm I messed it up
[22:26:51] is there no escape ;-)
[22:30:54] exit
[22:34:01] Oren_Bochman: tmux/screen can break if your agent socket changes
[22:34:08] since it keeps your old environment
[22:34:24] so, if pagent restarts, it'll lose your agent
[22:34:39] It seems to work quite well
[22:34:52] yeah, it will as long as your agent doesn't restart :)
[22:35:17] I've restarted sessions a number of times - it always comes back
[22:35:29] I keep my screen connected to my agent by writing the socket location into a dot file and having screen always source that
[22:35:58] interesting
[22:36:09] I've no idea how to do that
[22:36:48] well, the easiest thing to do is to just check locally if your key is added to your agent
[22:36:51] via ssh-add -l
[22:37:12] if not, then your tmux has lost its agent
[22:37:22] it works now
[22:37:57] tmux
[22:40:24] * Ryan_Lane nods
[22:42:23] I'm writing a PHP class that queries the database.
[22:43:13] does anyone have a code snippet of how to read the replica.my.conf instead of hard coding it into the php
[22:44:50] Thanks, ryan_lane -- I've deployed both those patches.
[22:44:54] cool
[22:45:03] thanks for the fixes :)
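
The "write the socket location into a dot file" trick described above might look roughly like this; the dot-file name is an arbitrary choice, not something from the log:

    # On each fresh login, record the current agent socket:
    echo "export SSH_AUTH_SOCK=$SSH_AUTH_SOCK" > ~/.agent-socket
    # Inside a long-lived tmux/screen shell, pick it up again:
    source ~/.agent-socket
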
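The replica-credentials question at the end goes unanswered in the log. A hedged sketch of the usual approach: the per-tool credentials file (commonly named replica.my.cnf) is in MySQL option-file format, so command-line clients can read it directly, and code can parse it rather than hard-coding the values. Host and database names below are placeholders:

    # Let the MySQL client read the credentials file itself:
    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p
    # From code, the same file can be parsed as an INI file; PHP's
    # parse_ini_file(), for example, returns its user and password entries.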