[00:13:46] 6Labs: Labs homedirs owned by root for new instances - https://phabricator.wikimedia.org/T100478#1313531 (10Gage) 3NEW [00:14:31] 6Labs: Labs homedirs owned by root for new projects - https://phabricator.wikimedia.org/T100478#1313548 (10Gage) [00:31:20] 10Tool-Labs, 5Patch-For-Review: Reduce amount of Tools-local packages - https://phabricator.wikimedia.org/T91874#1313567 (10scfc) [00:37:18] 6Labs: Labs homedirs owned by root for new projects - https://phabricator.wikimedia.org/T100478#1313568 (10Dzahn) p:5Triage>3High [08:28:30] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:28:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1207 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:30:08] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:30:32] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 100.00% of data above the critical threshold [0.0] [08:31:18] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1202 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:31:34] PROBLEM - Puppet failure on tools-master is CRITICAL 100.00% of data above the critical threshold [0.0] [08:31:50] PROBLEM - Puppet failure on tools-mail is CRITICAL 100.00% of data above the critical threshold [0.0] [08:31:54] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:32:42] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:33:23] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:33:43] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:34:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:39:19] PROBLEM - Puppet failure on tools-shadow is CRITICAL 100.00% of data above the critical threshold [0.0] [08:41:31] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 100.00% of data above the critical threshold [0.0] [08:44:20] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 100.00% of data above the critical threshold [0.0] [08:44:38] PROBLEM - Puppet failure on tools-precise-dev is CRITICAL 100.00% of data above the critical threshold [0.0] [08:45:08] PROBLEM - Puppet failure on tools-submit is CRITICAL 100.00% of data above the critical threshold [0.0] [08:46:00] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 100.00% of data above the critical threshold [0.0] [09:11:36] 10Tool-Labs, 10Pywikibot-compat-to-core, 10pywikibot-core, 5Patch-For-Review: Install all pywikibot python dependencies on tool labs - https://phabricator.wikimedia.org/T86015#1314057 (10jayvdb) [09:15:41] 6Labs, 10LabsDB-Auditor, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, and 8 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1314070 (10faidon) >>! In T89270#1245165, @Aklapper wrote: > Alright. > > What I see left here (I won't have time the next two weeks for... 
[09:16:43] (03CR) 10John Vandenberg: [C: 031] Add python-ipaddress package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209978 (https://phabricator.wikimedia.org/T86015) (owner: 10Merlijn van Deen) [09:42:21] [13intuition] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/Krinkle/intuition/commit/07e75268a7d681cf2628b43edd3bdaa00df28735 [09:42:22] 13intuition/06master 1407e7526 15Timo Tijhof: cvnoverlay: Remove unused message "lastupdate-db"... [10:07:28] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1314129 (10faidon) @Coren, what's the status of this? [12:54:18] (03CR) 10Mattflaschen: [C: 04-1] "Some things that should be fixed before we continue troubleshooting." (0318 comments) [labs/tools/flow-oauth-demo] - 10https://gerrit.wikimedia.org/r/213590 (owner: 10Rjain) [12:57:12] (03CR) 10Mattflaschen: "Also, please add the LICENSE file from https://raw.githubusercontent.com/Stype/mwoauth-php/master/LICENSE like we discussed." [labs/tools/flow-oauth-demo] - 10https://gerrit.wikimedia.org/r/213590 (owner: 10Rjain) [13:35:07] 10Tool-Labs, 10Pywikibot-compat-to-core, 10pywikibot-core, 5Patch-For-Review: Install all pywikibot python optional dependencies on tool labs - https://phabricator.wikimedia.org/T86015#1314396 (10jayvdb) [15:12:04] 6Labs, 10LabsDB-Auditor, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, and 8 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1314661 (10Andrew) I don't object to anything that Faidon said, except that I'm confused about Labs-Vagrant. There /is/ a feature that all... [15:59:19] since when that joe-shit editor is default on labs? [15:59:54] petan: you mean nano? [16:00:02] no, nano is sort of fine [16:00:10] but on tool labs there is joe as default editor [16:00:16] oh, I don’t know then. I know that nano is now the default on Trusty which makes me miserable. [16:00:19] which seems like a total crap to me [16:00:25] I can’t imagine that’s on purpose. [16:00:38] well, nano is probably default editor on every ubuntu, and maybe even debian these days [16:00:45] but I would be much more happy for vim [16:00:49] seems like it was vim in precise [16:01:15] I doubt there was vim in past, more likely some puppet manifest did change it [16:02:05] If you can figure out why it’s defaulting to joe and can change it to vim I expect a patch will be welcome [16:02:10] you’ll have my +1 at least [16:02:18] I don't think it's joe by default [16:02:24] I wonder why joe is even installed [16:03:21] petan: I switched it to nano on tools-login by hand. am investigating why joe is around now [16:03:24] There was a request for joe to be an option, I added it. [16:03:34] aaah [16:03:40] But that was a year ago [16:03:42] well, but default? D: [16:03:48] I haven’t touched it since then. [16:04:02] so something else has changed. [16:04:28] um it was default at least when I did git commit, but now when I updated my .bash_profile [16:04:30] it's gone [16:04:37] fortunatelly... :o [16:04:40] I just switched it [16:04:44] to see if puppet switches it back [16:05:13] * petan votes for vim as default. everywhere. EVERYWHERE [16:05:21] me too [16:05:27] you should put it in your .bashrc then :P [16:05:30] * YuviPanda has a fairly souped up .vimrc [16:05:30] I did [16:05:31] but vim is pretty hostile to newbies. [16:05:36] but my tools have different .vimrc [16:05:37] yeah. 
[16:05:57] I mean .bashrc [16:06:22] so I would need to do for every account I use on labs for every project, which could be hundreds of files :3 [16:06:50] andrewbogott: did you install it on labs abse image or just for toollabs? [16:06:56] just toollabs [16:06:59] petan: ok [16:07:01] err [16:07:01] andrewbogott: ok [16:07:11] 10Tool-Labs: Investigate why Joe is default editor on toollabs - https://phabricator.wikimedia.org/T100526#1314825 (10yuvipanda) 3NEW [16:07:13] alright, so ^ [16:08:08] PROBLEM - Puppet failure on tools-exec-1212 is CRITICAL 44.44% of data above the critical threshold [0.0] [16:08:53] andrewbogott: is that you futzing with dnsmasq? [16:09:30] 6Labs: point puppet.eqiad.wmflabs to virt1000.wikimedia.org in labs dns - https://phabricator.wikimedia.org/T100317#1314832 (10Andrew) Done, but the trouble with this is that a given labs instance can't tell the difference between 'puppet' the production puppetmaster and any given labs instance named 'puppet'.... [16:09:44] YuviPanda: I restarted it, which apparently is enough to cause diamond to throw warnings. [16:09:49] heh [16:13:34] 6Labs: Don't hardcode virt1000 as the labs puppetmaster - https://phabricator.wikimedia.org/T100317#1314845 (10Andrew) [16:14:16] does that ~/error.log work for php errors? [16:14:43] mhm it does... it just takes some time for them to appear in it [16:33:05] RECOVERY - Puppet failure on tools-exec-1212 is OK Less than 1.00% above the threshold [0.0] [16:43:00] 6Labs: Document, explain, diagram labs vlans and network setup - https://phabricator.wikimedia.org/T100529#1314920 (10Andrew) 3NEW [17:32:07] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Abshir was created, changed by Abshir link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Abshir edit summary: Created page with "{{Tools Access Request |Justification=I want to help Somalian wiki by running a bot. also i will translate the bots from en.wiki other wikis to be run in over small wiki. |Com..." [17:49:47] Hi all [17:52:03] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Abshir was modified, changed by Abshir link https://wikitech.wikimedia.org/w/index.php?diff=160745 edit summary: [17:59:16] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Abshir was modified, changed by Merlijn van Deen link https://wikitech.wikimedia.org/w/index.php?diff=160751 edit summary: [18:24:05] 10Tool-Labs: Gridengine upgrade causes puppet failures - https://phabricator.wikimedia.org/T100073#1315158 (10valhallasw) The package with +wmf2 (instead of the ~wmf1 prerelease) version number is now on apt.wm.o. Running ``` sudo apt-get update sudo puppet agent -tv ``` on tools-exec-1202 updates both package... [18:24:37] 10Tool-Labs: Gridengine upgrade causes puppet failures - https://phabricator.wikimedia.org/T100073#1315162 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Yup, I tested and it seems ok. [18:25:09] YuviPanda: you are on vacation. stop doing wmf stuff :p [18:25:19] am I on vacation?! 
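On the default-editor question from earlier (T100526): the usual knobs on a Debian/Ubuntu host are a per-user environment override and the system-wide alternatives link. A minimal sketch, assuming a Tool Labs bastion; it does not explain why joe ended up selected there, which is what the task is for.
```
# Per-user override; most tools honour EDITOR/VISUAL before any system default.
echo 'export EDITOR=vim VISUAL=vim' >> ~/.bashrc

# System-wide default (root): lists the installed candidates (nano, vim, joe, ...)
# and shows which one the "editor" alternative currently points at.
sudo update-alternatives --config editor
update-alternatives --display editor
```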
[18:25:26] hmm, I could be [18:29:20] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0] [18:29:21] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0] [18:29:29] \o/ [18:30:13] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0] [18:30:42] !log revscoring created instances ores-web-01 and ores-web-02 [18:30:49] Logged the message, Master [18:31:03] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [18:31:05] * valhallasw hopes tools-master was just restarting juts now [18:31:08] https://www.irccloud.com/pastebin/Zr5hRCoi [18:31:16] Unable to run job: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": got send error. [18:31:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [18:31:33] but probably just gridengine-whatever being updated [18:31:41] YuviPanda: could I get a review of the extremely tedious https://gerrit.wikimedia.org/r/#/c/213543/? I’d like to schedule possible outages and apply it tomorrow (among other things). [18:31:50] YuviPanda: did you force an apt run on all hosts, or do they run apt just before a puppet run? [18:32:03] valhallasw: I didn't do anything, they run apt just before puppet run [18:32:06] ouch [18:32:06] YuviPanda, which, btw, since that patch may break puppet runs, it would be nice to have puppet runs actually working before that. Is that grid-engine noise still happening? [18:32:15] YuviPanda: perfect [18:32:20] andrewbogott: it's being resolved as we speak [18:32:25] awesome [18:32:47] oh, so I see! I guess I’d been tuning out those shinken alerts, didn’t notice that they were recoveries :) [18:32:49] valhallasw: am forcing a run on master now [18:33:43] RECOVERY - Puppet failure on tools-exec-1202 is OK Less than 1.00% above the threshold [0.0] [18:34:16] valhallasw: meh, needed a manual gridengine-master start [18:34:21] :( [18:34:25] valhallasw: I guess there's no service { 'gridengine-master': } defined [18:34:34] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [18:34:45] valhallasw: check if job submision and stuff works now? [18:35:19] YuviPanda: nope [18:35:30] uh oh [18:35:32] what error? [18:35:34] * YuviPanda tries [18:35:37] same [18:35:48] yup the process is dead [18:35:50] looking [18:36:10] I thought we were just doing a rename of the package and coren had tested the actual package contents... [18:36:18] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1202 is OK Less than 1.00% above the threshold [0.0] [18:36:19] hmm, it's not in /var/log is it [18:36:43] maybe the master wasn't supposed to be updated somehow...? [18:38:10] valhallasw: hmm, it's dying without error messages [18:38:14] let me force it to the previous version [18:38:16] ... :X [18:38:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0] [18:38:28] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [18:38:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1207 is OK Less than 1.00% above the threshold [0.0] [18:39:09] YuviPanda: maybe try running /usr/sbin/sge_qmaster directly? [18:39:28] and/or strace it? 
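Picking up the suggestion to strace the daemon: running sge_qmaster under strace is what later surfaces the "sge_gethostbyname failed" write. A minimal sketch, assuming the /usr/sbin/sge_qmaster path quoted above; run as root on tools-master.
```
# Follow forks and write the trace to a file for later inspection.
sudo strace -f -o /tmp/stracelog /usr/sbin/sge_qmaster

# Then look for failed lookups, connects and error writes:
grep -E 'gethostbyname|connect\(|write\(2' /tmp/stracelog
```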
[18:40:08] valhallasw: [pid 27542] write(2, "error: sge_gethostbyname failed\n", 32) = 32 [18:40:11] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1208 is OK Less than 1.00% above the threshold [0.0] [18:40:33] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0] [18:40:36] valhallasw: and that's with the unpatched gridengine [18:40:56] YuviPanda: and what host does it try to resolve? [18:41:26] also, try again? just added tools-master to /etc/hosts [18:41:33] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0] [18:41:47] YuviPanda: and as far as I can see, tools-master was never upgraded to ~wmf1 either [18:41:51] RECOVERY - Puppet failure on tools-mail is OK Less than 1.00% above the threshold [0.0] [18:41:56] RECOVERY - Puppet failure on tools-mailrelay-01 is OK Less than 1.00% above the threshold [0.0] [18:42:01] valhallasw: it's back without the wmf patches now tho [18:42:11] oh [18:42:12] hmm [18:42:46] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [18:43:05] am stracing it as it runs [18:44:11] valhallasw: > error: unable to send message to qmaster using port 6444 on host "localhost": got send error [18:44:15] is it trying to hit *localhost*? [18:44:24] err [18:44:30] it is [18:44:38] connect(3, {sa_family=AF_INET, sin_port=htons(6444), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) [18:44:40] wtf [18:44:48] YuviPanda: *what* is? [18:44:48] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK Less than 1.00% above the threshold [0.0] [18:44:53] valhallasw: qstat [18:44:56] on tools-bastion-01 [18:45:13] wat [18:46:45] tools.nlwikibots@tools-bastion-01:~$ cat /data/project/.system/gridengine/default/common/act_qmaster [18:46:46] localhost [18:46:52] valhallasw: yeah [18:46:56] valhallasw: just changed that to tools-master [18:47:00] error: commlib error: access denied (server host resolves destination host "tools-master.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [18:47:50] adding tools-master.eqiad.wmflabs to /etc/hosts as well... [18:48:24] hm. [18:48:41] error: commlib error: access denied (server host resolves destination host "tools-master.eqiad.wmflabs" as "localhost") [18:48:41] ok, so maybe that's not it. [18:48:59] or it should be ordered differently? [18:49:06] no, hostname should be after localhost [18:49:32] YuviPanda: I'll email labs-l in the meanwhile [18:49:38] valhallasw: thanks [18:50:05] 23316 connect(3, {sa_family=AF_INET, sin_port=htons(6444), sin_addr=inet_addr("10.68.16.9")}, 16) = -1 EINPROGRESS (Operation now in progress) [18:50:06] so it does connect [18:51:43] YuviPanda: can we switch to -shadow as master? 
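Before attempting a failover to the shadow master, it helps to confirm what the submit hosts think the master is and whether its port answers at all. A minimal sketch using the act_qmaster path and port 6444 quoted above; getent and nc are assumed to be available on the bastion.
```
# What the clients believe the active qmaster is:
cat /data/project/.system/gridengine/default/common/act_qmaster

# How libc (hosts file + DNS, via nsswitch) resolves that name:
getent hosts tools-master.eqiad.wmflabs

# Whether the qmaster port is reachable from a submit host:
nc -vz tools-master.eqiad.wmflabs 6444
```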
[18:52:01] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin looking [18:52:32] PROBLEM - Puppet failure on tools-master is CRITICAL 60.00% of data above the critical threshold [0.0] [18:52:43] valhallasw: sge_qmaster is running fine on tools-master so I'm not sure if that'll help but we can try [18:52:48] ok [18:53:07] valhallasw: I'm doing it now [18:54:19] ok [18:54:31] YuviPanda: note that you need to kill the lock file probably [18:54:46] oh no, I guess it should be ok [18:55:58] valhallasw: theoreitcally [18:56:06] yeah [18:56:16] valhallasw: hmm, so basically master can't resolve stuff [18:56:17] 05/27/2015 18:52:25|worker|localhost|E|no execd known on host tools-exec-1215.eqiad.wmflabs to send conf notification [18:56:21] and similar error messages [18:56:23] YuviPanda: restart master? [18:56:45] valhallasw: happened multiple times [18:56:55] valhallasw: also I killed it manually in an attempt to get shadow to start but that doesn't work [18:57:03] YuviPanda: as in the entire server [18:57:06] oh [18:57:06] hmm [18:57:18] I'm not sure why ping does work [18:57:53] valhallasw: switchover failed, I think - running gridengine-master start on tools-shadow does nothing [18:58:02] *nod* ok [18:58:11] let's restart tools-master [18:58:26] hmm, alright [18:58:40] !log tools rebooting tools-master after switchoer failed and it can not seem to do DNS [18:58:45] Logged the message, Master [18:58:50] valhallasw: done [18:59:40] now tools-shadow is trying to taking over but failing as well [18:59:54] error: commlib error: got read error (closing "tools-shadow.eqiad.wmflabs/qmaster/1") [18:59:54] error: commlib error: got select error (Connection refused) [18:59:54] error: unable to send message to qmaster using port 6444 on host "tools-shadow.eqiad.wmflabs": got send error [19:01:52] valhallasw: I'm going to try to force start on -master, but I don't think that'll have any effect [19:02:06] did that, back to the localhots issues [19:02:07] issue* [19:02:15] ugh [19:02:17] valhallasw: you're running it directly, as root [19:02:28] no, service X restart? [19:02:54] ah ok [19:02:59] hmm, I saw two processes there for a minute [19:03:35] did you edit act_qmaster manually the last time, or was that automagic? [19:03:49] I edited it manually last time [19:03:58] I guess master starts, tries to get ip, thinks its localhost and them boom [19:04:02] yeah [19:04:11] let me remove the /etc/hosts agian and restart [19:04:26] ok [19:06:09] valhallasw: did that work? [19:06:12] it's not listening to service stop O_o [19:06:12] I did echo -n 'tools-master.eqiad.wmflabs' > default/common/act_qmaster [19:06:13] last time [19:06:17] heh [19:07:42] valhallasw: I'm going to wait for you to before I try anything so we don't cross streams [19:07:51] kk [19:08:39] YuviPanda: are there two different versions of oge running at once now? [19:09:06] well there's always been two versions - trusty on the exec nodes adn precise on exe / master [19:09:32] ah, ok. Just wondering if your problem is maybe api version disagreements. 
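A quick way to test the "two versions at once" theory is to compare the gridengine packages installed on the master, the shadow and an exec node. A minimal sketch; the exact set of gridengine-* package names on these hosts is an assumption.
```
# Run on tools-master, tools-shadow and one exec node, then diff the output:
dpkg -l 'gridengine*' | awk '/^ii/ {print $2, $3}'

# What apt would install or roll back to:
apt-cache policy gridengine-master gridengine-common
```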
[19:09:37] But, I’ll stay out of it :) [19:09:50] andrewbogott: that's possible, I'm reverting on shadow too [19:09:54] right, so on start it overwrites the sge_qmaster file with localhost [19:10:26] !log tools reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master [19:10:31] Logged the message, Master [19:11:35] no, nothing [19:11:37] ugh [19:12:14] oh, there's /tmp/sge_messages [19:12:21] wat [19:12:41] > 05/27/2015 18:59:13| main|localhost|C|admin_user "sgeadmin" does not exist [19:13:06] yeah. 'wat' indeed [19:13:23] valhallasw: we have sgeadmin [19:13:36] could refer to the SGE user instead of local user? [19:13:37] not sure [19:16:39] error: commlib error: access denied (server host resolves destination host "tools-master.eqiad.wmflabs" as "tools-master") [19:16:41] ... [19:16:43] valhallasw: not sure either. I'm attempting to page Coren now. [19:16:50] sounds good. [19:17:56] I think we're on our own for now. Not sure if he's working today.[ [19:18:12] so I wonder how it's doing its resolving [19:18:19] because I'm not seeing a gethostbyname [19:18:44] *nod* [19:18:52] valhallasw: are you the one running a dpkg-reconfigure? [19:19:01] YuviPanda: yeah, but I immediately got a scary prompt [19:19:07] yes I see that too :) [19:19:28] "Please choose whether you wish to configure SGE automatically (with debconf). If you do not configure it automatically, the daemons or client programs will not │ [19:19:28] │ work until a manual configuration is performed. " [19:19:40] I wanted to see what dpkg thought the config was :/ [19:20:13] valhallasw: we have /var/lib/gridengine on NFS as well so that means we can't just try it on one host [19:20:21] right [19:21:02] although entering through should not change anything [19:21:26] valhallasw: I'm quite scared of what things logically 'should' on gridengine [19:21:42] ok, killed it [19:21:46] valhallasw: moritz found that for example the gridengine build scripts check to see if it's on linux 2.6 or 3.0 and if not assume it's on solaris. [19:21:51] wat [19:21:54] yeaaah [19:21:58] and was a pain to build, apparently [19:23:07] valhallasw: I'm stracing it and attempting to see wtf it's trying to do [19:23:50] thanks [19:26:47] YuviPanda: ok I found something. apparently sge wants a fairly specific hosts file [19:26:53] https://scidom.wordpress.com/2012/01/18/sge-on-single-pc/ [19:27:55] valhallasw: ok, want me to try that or are you? [19:28:00] I'm trying., [19:28:04] heh I see you are [19:28:05] ok [19:28:38] act_master is still localhost tho [19:28:44] yeah [19:28:50] I wonder what happens if we just get rid of the localhost from /etc/hosts [19:28:54] let me try that [19:29:19] k [19:29:26] no dice [19:29:33] have you restarted gridengine? [19:29:39] I put it back in [19:29:40] yes I did [19:29:43] ok [19:30:21] valhallasw: http://star.mit.edu/cluster/mlarchives/2005.html seems similar [19:30:53] (gethostname is shipped with SGE in the util dir.) wat [19:31:21] valhallasw@tools-master:/usr/lib/gridengine$ ./gethostname [19:31:21] Hostname: tools-master [19:31:21] Aliases: [19:31:22] Host Address(es): 127.0.0.1 [19:31:24] ... [19:31:48] why does it not have the external ip there... [19:32:00] ... [19:32:59] strace log in /tmp/stracelog [19:33:25] why is it even reading /proc/cpuinfo [19:34:10] it connects to nscd and asks it stuff [19:34:35] either way, it gets back hostname as tools-master [19:35:41] yeah [19:36:36] http://talby.rcs.manchester.ac.uk/~ri/_notes_sge/name-and-address-resolution-and-troubleshooting.html ! 
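That checklist mostly amounts to making SGE's own resolver helper agree with libc. A minimal sketch using the helper path shown above (/usr/lib/gridengine/gethostname); whether this build ships the other SGE host utilities alongside it is an assumption.
```
# SGE's view of the local host; it should list the instance's 10.x address,
# not just 127.0.0.1:
/usr/lib/gridengine/gethostname

# libc's view, for comparison:
getent hosts "$(hostname)" "$(hostname -f)"

# Which sources libc consults, and in what order:
grep '^hosts:' /etc/nsswitch.conf
```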
[19:37:14] YuviPanda: could you run that checklist? [19:38:23] doing [19:38:24] what the heck [19:38:30] check http://informatics.malariagen.net/2011/06/01/gridengine-the-ubuntu-debian-way/ as well [19:38:34] there should be no localhost [19:38:47] but how did it work so far [19:38:52] just 127.0.0.1 tools-master and 10.x.y.z tools-master tools-master.eqiad.wmflabs [19:38:55] did someone hand hack /etc/hosts earlier? [19:38:56] hand-coded etc/hosts [19:39:03] ... [19:39:05] and no restarts [19:39:07] and then overwritten by your db hosts change [19:39:08] yeah [19:39:35] let me do the /etc/hosts hack [19:39:40] nope [19:39:43] I just did [19:39:51] even better [19:39:55] doesn't work [19:40:08] let me actually read them [19:41:51] I'm kind of lost now [19:42:10] valhallasw: I wonder if puppet logs might have a diff of the /etc/hosts change [19:42:22] LostPanda: I'll grep [19:42:26] valhallasw: ok [19:43:05] no, too long ago :/ [19:43:12] yeah [19:43:44] and coren is the only wmf person who knows any sge? [19:44:03] pretty much. [19:45:19] * polybuildr hugs LostPanda [19:45:44] YuviPanda and valhallasw are on the job, you'll be "found" soon enough. :P [19:46:27] !log tools echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster haaail someone? [19:46:31] Logged the message, Master [19:46:38] valhallasw: so with ^ I see [19:46:38] 2877 connect(3, {sa_family=AF_INET, sin_port=htons(6444), sin_addr=inet_addr("10.68.16.9")}, 16) = -1 EINPROGRESS (Operation now in progress) [19:46:41] so it's got the IP right [19:46:55] ok [19:47:13] but I guess the master doesn't agree [19:52:21] LostPanda: forced acc_qmaster again and we're back at error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [19:52:24] dafuq [19:53:01] wat, tools-bastion-01? [19:53:23] yes, that's qstat running on bastion-01 [19:53:35] and master is like 'who on earth are you?' [19:53:46] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315443 (10valhallasw) 3NEW [19:58:20] hey! [19:58:27] qstat is working again [19:58:28] somehow [19:58:38] valhallasw: bblack offered to take a look so maybe it's him [19:58:49] valhallasw: not for me [19:58:55] or not. it was silent, at least, but crashes for nlwikibots again [19:58:56] still getting connection refused [19:58:58] heh [19:59:31] valhallasw: there's a strace log in /tmp/log that I'm looking at again [20:01:48] valhallasw: I maybe-fixed something, can you look at gethostname again? [20:02:10] should I try to start gridengine-master again? [20:02:13] and see what happens? [20:02:16] (it's not running atm) [20:02:20] lemmesee [20:02:41] valhallasw@tools-master:/usr/lib/gridengine$ ./gethostname [20:02:42] Hostname: tools-master [20:02:42] Aliases: tools-master tools-master.eqiad.wmflabs [20:02:46] Host Address(es): 127.0.0.1 10.68.16.9 [20:02:46] that looks sane! [20:02:57] I restarted nscd, I think it was caching Bad Things for some reason [20:03:03] just on tools-masters [20:03:08] err tools-master [20:03:10] I'll restart gridengine.. [20:03:23] bblack: hmm, we restarted nscd and the host itself at some point [20:03:36] act_qmaster is now tools-master [20:03:50] btw, why is the actual expected DNS IP + 127.0.0.1 together considered a sane result? [20:03:52] qstat is working!!! [20:04:03] I have no clue how any of this works [20:04:08] neither do we [20:04:32] ^ [20:04:36] bblack: did you also adapt /etc/hosts? [20:04:54] no, I didn't touch it. 
/etc/hosts is where the 127.0.0.1 is coming from of course [20:05:09] the DNS source has just the 10/8 IP, and nscd via libresolve looks at both sources [20:05:16] hosts also has the local ip [20:05:32] right [20:05:34] 127.0.0.1 tools-master [20:05:34] 10.68.16.9 tools-master.eqiad.wmflabs tools-master [20:05:49] actually I guess nscd->libresolve+nsswitch == it's getting all its answers from the hosts file in this case [20:05:49] we've been futzing around with /etc/hosts in an attempt to 'fix' this. It's had localhost and not had localhost and the actual hostname and not the actual hostname (multiple combinations) [20:06:13] but every time you muck with the hosts file, nscd can interfere as well probably [20:06:19] so past testing may be invalidated by that [20:06:30] so after changing /etc/hosts, restart nscd? [20:06:41] try just commenting out the 127.0.0.1 line as well and restarting nscd? [20:07:13] it's working now, so I'd rather not touch it until Coren is available again [20:07:23] and it's broken again [20:07:24] UGH [20:07:30] tools.nlwikibots@tools-bastion-01:~$ qstat [20:07:31] error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") [20:07:36] honestly, I don't know what the current state/bugs is wrt to nscd and /etc/hosts caching. You'd think it would watch mtime, but I don't expect much out of nscd. It's been a source of issue since like 20 years ago, even in the non-Linux world. [20:08:40] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315482 (10valhallasw) @bblack jumped in to help, and we seemed to have SGE working again, but it's down again. Okay, with /etc/hosts being ``` 127.0.0.1 tools-master 10.68.16.9 tools-master.eqiad.wmflabs tools-master ``` and ```... [20:09:00] I'm also not sure what SGE expects and does and does not [20:09:20] I /am/ giving up for tonight [20:09:47] LostPanda, can you send an update to labs-l? [20:10:01] tools.nlwikibots@tools-bastion-01:~$ qstat [20:10:01] job-ID prior name user state submit/start at queue slots ja-task-ID [20:10:04] ----------------------------------------------------------------------------------------------------------------- 970883 0.30064 lighttpd-n tools.nlwiki r 05/27/2015 06:32:00 webgrid-lighttpd@tools-webgrid 1 [20:10:08] ? [20:10:12] valhallasw: ok, will do [20:10:19] bblack: what on earth [20:10:36] I just jumped to that host as root and su'd to that user and ran the command to try to repro [20:10:43] but looks ok? [20:10:45] !log tools disabled puppet on tools-shadow too [20:10:48] yeah, it's suddenly working again [20:10:49] Logged the message, Master [20:11:00] yeah, works for me... [20:11:07] maybe it's flitting between 127.0.0.1 and the correct IP for some part of something [20:11:34] was there some config change that immediately preceded this breaking? [20:11:56] it was just a gridengine restart after a package upgrade [20:12:05] we downgraded the package but that didn't fix anything [20:12:08] I think it's a change that happened a few weeks ago. /etc/hosts was changed, but gridengine had not been restarted afterwards [20:12:25] ah [20:12:34] and of course /etc/hosts is a manual hack with no version history right? :) [20:12:45] it probably was, yes. [20:12:51] awesome! 
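Putting the last few exchanges together: keep /etc/hosts in the shape quoted above, and invalidate nscd every time it is edited so stale answers don't mask the change, then restart the master so SGE re-reads its idea of the hostname. A minimal sketch; 10.68.16.9 is the tools-master address quoted above, and the restart ordering is an assumption.
```
# /etc/hosts on tools-master, as quoted above:
#   127.0.0.1   tools-master
#   10.68.16.9  tools-master.eqiad.wmflabs tools-master

sudo nscd --invalidate=hosts      # or: sudo service nscd restart
sudo service gridengine-master restart
```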
[20:12:59] but we don't know, because puppet overwrote it at some point [20:13:03] it's still intermittently failing [20:13:21] it was unpuppetized, we thought every host had just localhost + db hosts, and we replaced it with a puppetized version [20:14:33] so the immediate next-layer cause of the intermittent qstat failover over on the bastion host is: [20:14:36] root@tools-master:/usr/lib/gridengine# host tools-bastion-01.eqiad.wmflabs [20:14:39] tools-bastion-01.eqiad.wmflabs has address 10.68.17.228 [20:14:41] Host tools-bastion-01.eqiad.wmflabs not found: 3(NXDOMAIN) [20:14:44] Host tools-bastion-01.eqiad.wmflabs not found: 3(NXDOMAIN) [20:14:55] s/failover/failure/ [20:15:15] I'm running qstat in a loop [20:15:17] intermittent failures [20:16:09] bblack: wait, actual dns failures? [20:16:17] oh yeah [20:16:30] yeah, over on the master, for the bastion's hostname [20:16:34] also intermittent, heh [20:16:37] on the bastion itself too [20:16:49] also, when stracing the traffic for that lookup, the DNS responses are terribly odd [20:16:59] andrewbogott: around? you've been fiddling around with dnsmasq today as well? [20:17:09] the response packets are 64K in size for the NXDOMAIN, filled with 0xFF garbage or something.... [20:17:23] LostPanda: give me a minute [20:18:42] what's the nameserver implementation @ 10.68.16.1 ? it returns an address for an A-record query, and then SERVFAIL for AAAA, as if it's some software that predates IPv6 and doesn't know about it :/ [20:18:55] (on the same name, those two queries, I mean) [20:19:10] bblack: that's another piece of fail, dnsmasq [20:19:27] that could somehow, through a conflation of related DNS bugs, result in an NXDOMAIN from a cache for the whole name I imagine... [20:20:08] we have a non terrible DNS server at 208.80.154.12 [20:20:19] we can't fully switch to it yet but we could probably switch it to it and see? [20:20:27] * LostPanda is basically flying blind atm [20:20:28] I consistently get the right A-record answer from 10.68.16.1, which is in the master's resolv.conf [20:20:47] I'm not sure why the "host" command keeps getting something different + NXDOMAINs, yet... [20:22:31] oh the 64K response size thing is a red herring, I was interpreting strace of recvmsg iov's incorrectly [20:23:50] host -v says it gets back 64 bytes and then 2 48 byte fails [20:24:20] host -t A succeeds consistently [20:24:42] I guess that's just it trying to get AAAA records [20:25:13] it's AAAA and something else, just a sec... [20:25:21] LostPanda: ok, what’s up? [20:25:27] Sorry, there’s a lot of backscroll [20:25:38] oh it's checking A, AAAA, and MX [20:25:57] MX is SERVFAIL too lol [20:26:01] dnsmasq is awesome [20:26:55] if I had to guess, I'd say maybe tools-bastion-01 and other hosts of its class had entries in tools-master's /etc/hosts before all this, which paper over all of this. [20:27:29] now it's having to use DNS, and some combination of AAAA-fail from dnsmasq + nscd/libresolve/sge interpretation of that -> interrmittent loss of resolution for the tool [20:27:34] those submit hosts have been moved around a bit and renamed, so I dunno - either someone has been meticulously updating hostnames there without telling anyone or it's something else. [20:27:43] hmmmm [20:27:58] andrewbogott: basically ^^. gridengine failure because of DNS / etc/hosts / nscd issues. [20:28:11] LostPanda: a /change/ in DNS behavior? [20:28:23] andrewbogott: we don't really know - it was restarted today for a package update. 
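The A-versus-AAAA/MX behaviour described above is easy to reproduce directly against dnsmasq, which takes `host` and its multi-query behaviour out of the picture, and a slow loop catches the intermittent case (this anticipates the looping test discussed below). A minimal sketch against the resolver address quoted above.
```
# Single queries: A should answer, AAAA/MX were coming back SERVFAIL.
dig @10.68.16.1 tools-bastion-01.eqiad.wmflabs A    +short
dig @10.68.16.1 tools-bastion-01.eqiad.wmflabs AAAA +comments | grep 'status:'
dig @10.68.16.1 tools-bastion-01.eqiad.wmflabs MX   +comments | grep 'status:'

# Loop until the first empty/failed A answer:
while true; do
    out=$(dig @10.68.16.1 tools-bastion-01.eqiad.wmflabs A +short)
    [ -z "$out" ] && { echo "$(date -u +%T) bad answer"; break; }
    sleep 0.1
done
```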
[20:28:29] ok [20:28:42] LostPanda: can you try just stopping/disabling the nscd service completely and then doing your restart or whatever and see if things still are intermittently-broken? [20:28:50] bblack: looking [20:28:52] at least that would eliminate nscd as a catalyst here [20:29:14] (nscd is just a perf hack, shutting it off might cost latency but is supposed to be a functional no-op) [20:29:29] bblack: heh, it's 100% failing [20:29:38] you mean with nscd gone? [20:29:38] with a host not resolvable [20:29:52] I had stopped it and was running host in a loop before, so not sure if it was even running before [20:30:07] wat and it suddenly started working again [20:30:16] host still works for me [20:30:20] consistently working now [20:30:21] (qstat) [20:30:51] perhaps nscd is managing to interpret AAAA-SERVFAIL -> NXDOMAIN somehow [20:31:07] it's consistently working now. [20:31:11] * LostPanda is unsure what's happening or if it's all random [20:31:25] You've just described my entire life and worldview [20:31:31] :P [20:31:36] haha [20:32:33] so, for now, disable nscd -> temporarily fixed, investigate deeper later? [20:32:38] assuming it doesn't start randomly failing more [20:32:50] yeah, I've disabled puppet on both the masters as well [20:32:51] I'm ok with that [20:33:25] bblack: thanks a lot :D and you just restarted nscd in the beginning to trigger all this, right? [20:33:49] yeah 1x restart of nscd earlier on is all I've touched other than readonly commands [20:34:21] I only tried that because I noticed ./gethostname strace hitting the nscd socket, which was kinda odd as I don't normally expect to see it in use in general (nscd, I mean) [20:34:49] yeah [20:35:05] I didn't try that every time because it was the first thing I tried and it had no effect [20:36:36] bblack: I'll write up a summary and then go yell at a wall. [20:36:37] nscd as both a codebase and a concept has a long history of bugs heh. and in general, it kinda falls into that whole black hole of "bad caching of DNS at non-DNS layers" thing, plus adds a layer of results-interpretation. [20:37:00] bblack: I suspect nscd isn't the only thing to blame here. gridengine was indeed restarted and there were probably some hacks in place earlier that the restart invalidated? [20:37:01] I'd say as policy we should avoid ever turning it on, unless we have a clear and measurable performance reason and can't find a better way to get it. [20:37:19] isn't nscd on by default like everywhere? [20:37:24] I don't think so [20:37:25] it is on labs at least [20:37:46] not in prod [20:37:49] not on the prod hosts I regularly touch anyways [20:37:59] it's not in prod [20:38:05] ldap pulls it in [20:38:32] yeah I see puppet refs to nscd in ldap and labs -related modules only [20:40:28] bblack: I'll file a bug for that as well [20:40:58] ok cool, thanks [20:41:19] * bblack is terrible at filing bugs to back up his random irc proclamations [20:42:12] bblack: catchpoint also says it's ok [20:43:10] bblack: haha, a trail of failures again, and then it starts working again. [20:43:12] for about 2s [20:44:04] so maybe the failure is not in nscd, but the failure is intermittent for very short periods, and nscd caching extends those periods [20:45:07] yeah, that's probable [20:45:24] i can switch the DNS server to the newer one temporarily and see if it gets any better [20:45:48] but then I'll never be able to tell if it's really better of that's just intermittent [20:46:57] have we ever seen a failure directly in DNS? 
maybe do a quick looping test on dig @10.68.16.1 ? [20:47:44] if that's dnsmasq.... I know in debugging something related months ago, we had situations where that dnsmasq was ultimately/indirectly sourcing the results from LDAP, and there was an intermittent LDAP failure leading to interrmitent DNS failure type of scenario [20:48:56] LostPanda: unmerged change on palladium, proxy_pass oresweb [20:49:02] public DNS for labs instances (e.g. beta.wmflabs.org) comes from a pdns server that is backed by ldap. [20:49:05] mutante: bah, sorry [20:49:08] wow my typing sucks today. Errno: 42 (EINSUFFICIENTCAFFIENE) [20:49:11] private DNS shouldn’t have anything to do with ldap. [20:49:12] LostPanda: i can do it. np [20:49:17] thanks [20:49:29] eh, already done now somehow [20:49:30] LostPanda: where does dnsmasq get the result from then? [20:49:51] bblack: it's the dhcp server as well [20:49:58] so it gets it from... itself [20:50:32] I'm headed down that rabbithole now, I found labsnet1001 heh [20:51:27] so nova publishes a file to /var that dnsmasq consumes and serves [20:51:39] oh wow, so it's not even the dhcp part? [20:52:21] well, that is the dhcp part [20:52:26] oh right. [20:52:31] I think the DNS resolution comes from the same file as the DHCP stuff [20:53:58] I notice dnsmasq's uptime is longer than the last change to that DHCP file, so I guess it doesn't have to restart on change [20:54:09] I did mess with dnsmasq today, applying this patch: https://gerrit.wikimedia.org/r/#/c/213629/ [20:54:10] it doesn't [20:54:14] ah it re-reads on SIGHUP [20:54:18] I’ve now decided that it can die, and there’s a pending revert [20:54:33] https://gerrit.wikimedia.org/r/#/c/214077/ [20:54:47] andrewbogott: that change doesn't look very scary [20:54:50] I’m not going to revert while y’all are looking, but feel free to merge that if it makes your lives simpler. [20:54:55] Yes, almost certainly unrelated, but... [20:55:08] If something broke today then it’s worth suspicion [20:55:20] let's see if dnsmasq is even the problem first [20:55:29] we need to reproduce an interrmitent failure on a direct DNS query to it [20:55:44] *intermittent! I've typo'd that word like 20 times in the past hour [20:58:20] I have a loop running now trying it 10x/sec, it will break if it ever gets a bad result from dnsmasq [20:59:33] (I'm testing resolving tools-bastion01 from tools-master, which I think is the failing underlying case for the qstat thing) [20:59:40] yeah [20:59:43] tools-bastion-01 I mean [20:59:45] my qstat seems to succeed so far [20:59:59] 20:43 < LostPanda> bblack: haha, a trail of failures again, and then it starts working again. [21:00:05] yeah [21:00:06] ^ was that from some ongoing looped test? [21:00:09] yes [21:00:20] like, 10s of qstat failure and then back on [21:00:28] I'm making it slightly smarter now [21:07:04] bblack: yeah, I just got a fail [21:07:07] !log Deleted 4G of logs on jobrunner01 [21:07:07] Deleted is not a valid project. [21:07:08] and back again [21:07:21] LostPanda: I still haven't logged a DNS-fail [21:07:23] hmm [21:07:24] !log deployment-prep Deleted 4G of logs on jobrunner01 [21:07:30] Logged the message, Master [21:07:54] !log deployment-prep Deployed https://gerrit.wikimedia.org/r/#/c/208852/ [21:07:58] Logged the message, Master [21:08:12] bblack: lots of other stuff also fails if it's a general DNS error. 
inboxes flood with diamond errors, releng complains about scap failing, catchpoint complains, etc [21:08:35] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315702 (10jeremyb-phone) [21:09:19] yeah I was just hoping maybe this was some intermittent bug in dnsmasq responses for this one case. [21:09:23] so far, it doesn't seem so [21:09:34] I'm heading to bed. Thank you all for your help! [21:09:43] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315704 (10yuvipanda) puppet is disabled on -master and -shadow, and nscd is disabled on -master as well. /etc/hosts is in some hacked up state, and 'tools-master.eqiad.wmflabs' was hand entered into /var/lib/gridengine/default/act... [21:09:48] valhallasw: <3 thanks [21:10:02] so then we have to ask what layers lie between dnsmasq and sge, which is (with nscd gone) the eglibc libresolve/nsswitch code, which is also backending to the hosts file of course [21:10:16] plus whatever sge might be doing after it's done querying that [21:10:20] hey, my syslog is full of "May 27 21:07:06 kartotherian1 salt-minion[30582]: [ERROR ] The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate" [21:10:20] bblack: we also don't know if gridengine has this much of a 'normal' failure rate. [21:10:28] is it something important? [21:10:45] bblack: it could be happening all the time as well, we don't know [21:11:14] as in, this smaller fail-rate was happening before and nobody cared and just retried shit? [21:11:56] we don't know [21:12:02] nobody's reported that tho [21:14:01] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1315712 (10yuvipanda) 3NEW [21:15:11] !log deployment-prep populated jobqueue:aggregator:s-wikis:v2 with 1000 fake wiki keys for load testing [21:15:16] Logged the message, Master [21:17:29] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1315736 (10yuvipanda) Related to T100554 [21:21:37] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315763 (10yuvipanda) I've been running: ```while /bin/true; do qstat > /dev/null if [ $? -ne 0 ]; then exit fi echo -n . done ``` on a shell and it hasn't broken in a while (it used to bre... [21:21:49] bblack: ^ I've filed bugs and added info [21:39:40] (03CR) 10Tim Landscheidt: [C: 04-1] "Besides the general issues discussed in the other change, there was just recently a discussion about whether to include python-ipaddress a" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/209978 (https://phabricator.wikimedia.org/T86015) (owner: 10Merlijn van Deen) [21:58:35] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315867 (10Matthewrbowker) Experiencing errors myself, attempting to start a webservice ``` error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)") error: u... [22:00:31] 6Labs, 6operations: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1315872 (10scfc) It was introduced with de059228933681b6b0f97a818a561cde22901e1e ("Initial commit of public puppet repo."). In the past IIRC we have increased the caching TTLs several times to reduce network l... 
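The watch loop quoted in T100554 above exits on the first failure, which makes it hard to see how often and for how long qstat actually breaks; a variant that timestamps failures and keeps going shows the intermittent pattern instead. A minimal sketch along the same lines.
```
while true; do
    if ! qstat > /dev/null 2>&1; then
        echo "$(date -u '+%F %T') qstat failed"
    fi
    sleep 1
done
```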
[22:05:57] 6Labs, 10Tool-Labs: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1315890 (10Qgil) 5Open>3Resolved [22:06:21] 10Tool-Labs: Tool-labs meeting agenda for Lyon Hackathon - https://phabricator.wikimedia.org/T98912#1315892 (10Qgil) 5Open>3Resolved a:3Qgil [22:08:38] 6Labs: iPython for Labs: call for an interactive coding plattform - https://phabricator.wikimedia.org/T92506#1315894 (10Qgil) 5Open>3declined a:3Qgil This session doesn't appear in the program. I'm assuming that I didn't happen. Otherwise, please rectify, recycle it for #Wikimania-hackathon-2015, etc. [22:10:07] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1315907 (10yuvipanda) It failed twice and then started working right away afterwards :| I did a dig on tools-master and that was the only intervening thing I did between it failind and not failing, but I don't know / think that's r... [22:13:44] 10Tool-Labs: Conduct a Tool Labs workshop at Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1315917 (10Qgil) 5Open>3declined This session doesn't appear in the program. I'm assuming that I didn't happen. Otherwise, please rectify, recycle it for #Wikimania-hackathon-2015, etc. [22:15:43] 10Tool-Labs: Conduct a Tool Labs workshop at Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1315920 (10yuvipanda) Yes, was replaced by T100160 which was more appropriate for the crowd present. [22:17:28] chasemp: following up on shell access to phab-01 [22:17:58] I tried putting the suggested text in .ssh/config and something ended up spawning hundreds of ssh sessions and bringing my laptop to a crawl [22:18:14] here's the file I used as ~/.ssh/config [22:18:31] https://www.irccloud.com/pastebin/2eFo46h8 [22:18:59] whats going on with phab-01? [22:19:07] is there documentation I can follow to figure this out? [22:19:24] are you wmf staff? [22:19:35] Negative24: yes. it was proposed that I use phab-01 as a testing group for a script to export some data from Phabricator to do historical reporting. [22:20:37] s/group/ground/ [22:21:51] ok. please explain this s/word/english thing I see whenever someone types wrong. I'm lost :P [22:22:50] it's an idiom from sed; that's the syntax for search and replace. [22:22:51] Negative24: that generally means ‘here is a regular expression which should be applied to correct my previous comment' [22:22:53] https://en.wikipedia.org/wiki/Regular_expression [22:23:16] ah regexp. I don't speak regexp :) [22:23:32] Negative24: in vi/vim it would replace the words [22:23:48] I'm getting an access denied for labs [22:23:58] I'm trying to use jsub [22:25:27] jaufrecht: try adding the bottom part of http://pastebin.com/Ee7M8hHn to your config [22:26:15] What's going on with jsub? [22:27:06] New ssh security config? [22:27:48] No one is ever here anymore. :/ [22:29:33] jaufrecht: what did you type to connect? [22:30:31] ssh phab-01.eqiad.wmflabs [22:30:44] Negative24: I guess that breaks it out of the infinite loop? [22:30:54] yep [22:31:01] okay, here goes [22:31:04] Its trying to connect to the proxy with the same proxy [22:31:28] you can do this too: [22:31:52] Host phab-01 !bastion-restricted.wmflabs.org [22:32:01] nope, still an infinite loop [22:32:03] ! to exclude the bastion [22:32:03] hello :) [22:32:09] ! hi [22:32:09] hello :) [22:32:21] ! 
[22:32:21] hello :) [22:32:28] jaufrecht: let's start without using any wildcards then [22:32:53] mine uses *.eqiad.wmflabs I don't know if that has anything to do with it [22:33:29] jaufrecht: to keep it simple for testing. just this: https://phabricator.wikimedia.org/P695 and then "ssh phab-01" [22:33:30] first let me remember how I killed the infinite loop last time [22:34:05] oh, wrong bastion host, sorry [22:34:22] not "bastion-restricted.wmflabs.org" [22:34:35] just bastion.wmflabs.org [22:35:20] Negative24: You've never used sed? :-) [22:35:32] well [22:35:38] I have just not the same context [22:35:50] it didn't "click" [22:36:55] okay, attempt #3: ssh phab-01 using the config from P695 [22:37:01] :-) [22:37:48] much better - no infinite loop, just failure. Permission denied (publickey). [22:38:04] which is not surprising because I never gave anyone my public key for this [22:38:19] jaufrecht: so it's about being added to the project in the wikitech ui [22:38:52] let me check that [22:38:54] I did put my public key somewhere in wikitech [22:39:16] I am in the shell group, thank to Tim Landscheidt [22:39:52] jaufrecht: you are? well then it should already work [22:40:04] What is it? [22:40:08] checks the project anyways [22:40:21] Fiona: he needs shell on phab-01 [22:40:24] Right. [22:40:36] If he's part of the shell group, that doesn't give him access to phab-01, as I understand it. [22:40:43] Just access to bastion. [22:40:52] He has to be separately added to the group, as you say. [22:40:56] yea, you are in the project, even admin [22:40:58] chasemp was going to do something about that [22:41:11] he did apparently [22:41:26] Are you forwarding your key? [22:41:26] JAufrecht is in project phabricator [22:41:37] forwarding? [22:41:40] Fiona: no, he is using proxycommand as he should [22:41:48] Hah, all right. [22:42:17] I'm using id_dsa.pub instead of id_rsa.pub as suggested in the Help:Getting_Started [22:42:20] could that matter? [22:43:06] You can do like "ssh -vvv" to see which keys are being tried. [22:43:16] the answer is no; added rsa key as well [22:43:22] jaufrecht: do you have the "ssh-add" command? [22:43:23] and it still didn't work. permission denied. [22:43:40] joel@kali:~$ ssh-add [22:43:40] Could not open a connection to your authentication agent. [22:43:50] 10Tool-Labs-xTools: wikiviewstats rapidly fills error.log - https://phabricator.wikimedia.org/T100572#1316020 (10scfc) 3NEW [22:44:26] jaufrecht: try this: ssh -i /home/jaufrecht/.ssh/id_dsa.pub phab-01 [22:44:46] -i /path/to/your/key/whatever/it/is/called [22:44:51] d'oh. [22:44:56] https://www.irccloud.com/pastebin/4G8I79E8 [22:45:38] so for a second I thought it was joel vs jaufrecht [22:45:40] but: [22:45:44] https://www.irccloud.com/pastebin/cHQ5ZJN3 [22:45:52] 10Tool-Labs-xTools: wikiviewstats rapidly fills error.log - https://phabricator.wikimedia.org/T100572#1316030 (10scfc) [22:46:06] jaufrecht: i should not have added the .pub [22:46:20] jaufrecht: gotta find out where your key is locally in the filesystem [22:46:22] right. still, same results [22:46:37] my key is definitely ~/.ssh/id_dsa [22:46:41] it doesnt have to be in .ssh, it could be anywhere [22:46:45] but we have to tell it where then [22:46:57] a) that's the default. b) I'm looking at it. [22:46:58] what does "id" say? [22:47:13] https://www.irccloud.com/pastebin/pQAT34sX [22:47:55] is there another host within bastion that I should be able to connect to? 
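The "hundreds of ssh sessions" symptom earlier is the classic ProxyCommand recursion: a Host pattern that also matches the bastion, so the jump-hop connection tries to jump through itself. The negated pattern mentioned above breaks the loop, and putting the username on the bastion hop matters because the inner ssh does not inherit User from the outer stanza. A minimal sketch of that shape; the exact wildcard patterns in the user's real config are an assumption.
```
# ~/.ssh/config
Host *.wmflabs.org *.eqiad.wmflabs !bastion.wmflabs.org
    User jaufrecht
    ProxyCommand ssh -W %h:%p jaufrecht@bastion.wmflabs.org
```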
[22:47:59] /home/joel/.ssh/id_dsa then [22:48:02] that doesn't require special permission like phab-01? [22:48:21] https://www.irccloud.com/pastebin/GV63bXqQ [22:48:21] they all require being a member of the project the instance belongs to [22:48:30] but let's try if you can connect directly to the bastion host [22:48:54] and then next step to jump via the bastion host [22:49:08] https://www.irccloud.com/pastebin/k4UWJzyz [22:49:15] let me connect to phab-01 myself [22:49:37] could I have screwed up adding the public key? I put it in https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-openstack [22:49:39] hmm.. so this key you are using [22:50:26] did you start that with "ssh-dsa" ? [22:50:33] maybe cut off something when pasting? [22:50:43] I created it with ssh-keygen -t dsa [22:50:53] and it starts ssh-dss AAAAB3NzaC1kc3MA... [22:51:25] I also get permission denied (publickey) when I use -i to force the rsa key [22:51:33] i can't connect to phab-01 either it seems :/ [22:51:41] the public rsa key starts ssh-rsa [22:51:42] Connection closed by UNKNOWN [22:51:48] so happy that it's not just me [22:51:56] but I can't connect to bastion.wmflabs.org either [22:52:21] 10Tool-Labs: Grid engine masters down - https://phabricator.wikimedia.org/T100554#1316042 (10scfc) Note sure if this is related, but the felt frequency of mails from sudo: ``` From: Subject: *** SECURITY information for tools-redis-slave *** To: root@tools.wmflabs.org Date: Mon, 25... [22:52:23] me neither, but that's because i'm ops [22:52:30] i can use bastion-restricted though [22:52:48] I can't log in to bastion-restricted either [22:53:02] bastion1 is working for me [22:53:06] i am getting this error now [22:53:09] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000005d0.eqiad.wmflabs [22:53:13] [4b42ab1f] 2015-05-27 22:52:52: Fatal exception of type MWException [22:53:26] and phab-01 [22:53:26] that's when i clicked the phab-01 instance in wikitech ui [22:54:03] me and rush are on phab-01 [22:54:36] but I'm getting an internal error on wikitech [22:55:30] Negative24, the whole shebang is down right now :) [22:55:35] ugh, yes [22:55:40] the internal error is like on all appservers [22:56:00] Negative24: can you see if jaufrecht has a home dir? [22:56:01] ssh is still working [22:56:36] mutante: he doesn't [22:56:49] fwiw the error my bot got from the api is: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php [22:56:56] ls /home => "demon gmetric negative24 phd rush spage twentyafterfour vcs-user yuvipanda" [22:57:13] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL 20.00% of data above the critical threshold [0.0] [22:58:01] Negative24: so he is a member and even admin in the project, but it gets only created on first login i think [22:58:19] I believe so [22:59:59] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL 40.00% of data above the critical threshold [0.0] [23:00:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 37.50% of data above the critical threshold [0.0] [23:00:39] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL 30.00% of data above the critical threshold [0.0] [23:00:42] jaufrecht: i can connect to phab-01 as well now [23:00:56] jaufrecht: somebody had removed me from the project i created, added myself back and it worked [23:01:09] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL 55.56% of data above the critical threshold [0.0] [23:01:25] I still can't connect. 
same thing, permission denied [23:01:47] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 60.00% of data above the critical threshold [0.0] [23:01:57] Big spike on nfs just when we were having problems [23:02:03] is the paste into the OpenStack tab in Preferences whitespace-sensitive? could that be a problem? [23:02:17] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL 55.56% of data above the critical threshold [0.0] [23:02:20] yes it is [23:03:34] jaufrecht: so i think it's somehow about "rsa" vs. "dsa", because: [23:03:45] i see 2 keys for you [23:03:57] one ssh-rsa and one ssh-dss [23:04:01] both called "joel@kali" [23:04:15] I just deleted the dsa key [23:04:24] ack, confirmed [23:04:26] https://www.irccloud.com/pastebin/kgsZzd8d [23:04:31] i am checking that in LDAP on terbium [23:05:08] and you have that rsa key in .ssh/id_rsa ? [23:05:13] yes [23:05:40] ssh -i /home/joel/.ssh/id_rsa jaufrecht@bastion.wmflabs.org [23:05:45] just tried that [23:05:46] it worked [23:06:14] ok:) step 1 [23:06:16] what about: [23:06:18] https://www.irccloud.com/pastebin/digrxgvT [23:06:26] ssh-add /home/joel/.ssh/id_rsa [23:06:46] https://www.irccloud.com/pastebin/aK8VablW [23:06:58] so that part is a bit odd [23:07:08] would like the agent to be running [23:07:29] my ssh key doesn't have a password [23:07:34] so I don't think an agent is necessary [23:07:46] I mean, even if it did, the agent would just save me retyping the password [23:07:55] ok, back to the .ssh/config and the ProxyCommand then [23:08:27] https://www.irccloud.com/pastebin/6vsOumcN [23:09:00] could phab-01 put ssh keys in a different place than the web interface for wikitech? [23:09:32] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 50.00% of data above the critical threshold [0.0] [23:09:58] jaufrecht: do you have the line "User jaufrecht" in your ssh config file? [23:10:02] yes [23:10:08] hrmmm [23:10:34] that last try looked like bastion took it but the key wasn't forwarded [23:10:42] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL 50.00% of data above the critical threshold [0.0] [23:10:51] to phab-01 [23:10:53] could you manually put my id_rsa.pub file into phab-01:~/.ssh/authorized_keys ? [23:11:09] ssh-rsa [23:11:09] AAAAB3NzaC1yc2EAAAADAQABAAABAQC+IKG8mYtde3cwW3oiFpqYaP+XALx4KxqlnXPPTJoLIfWMOHZs7Sn6ep9siRBwUEqHVL7ZwY+Qn3yqRVFuuEtx8fEvKiTrnKo6QHJsugFods8vB5wHKCTjcRDw4+nq8nGSKzFqY3UPcISvh+VqqQODvKVW6WtWyzKfPvmG3fRYiF90bxfxYm/OeINNKTV6YGVjOij20etl5gFTCWZTqURSFD7lAHIUQmImdasZ3VVN+K0OsjynZF0hFq+hoP1ZbSgp0YKcYOpVp8wQR7uNrRgjsTVTt8z5G8VZmYHaWXct74XDO5D2QHWIujkzAoFa8fnu1E3+B9K/ [23:11:10] uLI8+hkLRCWf joel@kali [23:11:37] jaufrecht: no, your home doesn't exist yet :/ [23:11:58] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL 60.00% of data above the critical threshold [0.0] [23:12:19] and there's something on the other end that's taking incoming ssh and matching it against my public key stored in wikilabs? and that's working for other people but not me right now? [23:12:30] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL 40.00% of data above the critical threshold [0.0] [23:13:21] out of time for today. thanks for helping. I got halfway .... [23:13:25] jaufrecht: i think it's the part that the agent isnt running [23:13:28] should I file a ticket or something for followup? 
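When a key is rejected like this, the quickest checks are to see which identities the client actually offers on each hop and to compare the local key's fingerprint against what was pasted into wikitech. A minimal sketch; with a ProxyCommand there are two hops, and the verbose output shows both.
```
# Which keys get offered, and which hop says no:
ssh -vvv -i ~/.ssh/id_rsa phab-01 2>&1 | grep -iE 'offering|denied|accepted'

# Fingerprint of the local key, to compare with the one stored in wikitech/LDAP:
ssh-keygen -lf ~/.ssh/id_rsa.pub
```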
[23:13:41] jaufrecht: if you could somehow start that and try again [23:14:00] the agent on my desktop should only be involved in providing the private ssh file; since my private ssh file isn't passworded, that shouldn't be an issue [23:14:08] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL 66.67% of data above the critical threshold [0.0] [23:14:56] jaufrecht: yet if i do it like you do i get the exact problem you describe [23:14:57] https://www.irccloud.com/pastebin/mkfOqg17 [23:15:51] jaufrecht: then.. ticket sounds good, yes [23:15:55] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL 50.00% of data above the critical threshold [0.0] [23:15:58] okay, thanks. to whom or what? [23:16:37] labs and LDAP project [23:19:11] thank you [23:19:32] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316097 (10JAufrecht) 3NEW a:3chasemp [23:23:37] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316121 (10chasemp) Hmm I believe this is in the phabricator project in which you are an admin. I am not at a computer now but maybe @yuvipanda or @coren could confirm your perms are on the right project? [23:23:57] jaufrecht: so you can ssh into bastion fine but not phab-01? [23:25:27] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316125 (10Dzahn) Yes, he is a member (and even admin) of project phabricator, which has instance phab-01 in it. He also has the RSA key in LDAP and he uses that to connect to the bastion just fine. Also, other users can connec... [23:26:10] RECOVERY - Puppet failure on tools-exec-1204 is OK Less than 1.00% above the threshold [0.0] [23:26:46] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0] [23:26:56] jaufrecht: aah!! [23:27:13] jaufrecht: one more try, really quick? add the user name here: [23:27:14] RECOVERY - Puppet failure on tools-checker-02 is OK Less than 1.00% above the threshold [0.0] [23:27:22] before: ProxyCommand ssh -W %h:%p bastion.wmflabs.org [23:27:29] after: ProxyCommand ssh -W %h:%p jaufrecht@bastion.wmflabs.org [23:27:34] ^ do that [23:27:48] (also keep the "User jaufrecht" line below ) [23:28:57] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316140 (10Dzahn) In your .ssh/config instead of: ProxyCommand ssh -W %h:%p bastion.wmflabs.org use ProxyCommand ssh -W %h:%p jaufrecht@bastion.wmflabs.org and leave the "User jaufrecht" as it is. 
then try again [23:29:57] RECOVERY - Puppet failure on tools-exec-catscan is OK Less than 1.00% above the threshold [0.0] [23:30:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0] [23:30:38] RECOVERY - Puppet failure on tools-exec-1402 is OK Less than 1.00% above the threshold [0.0] [23:31:06] mutante: was just about to say that http://pastebin.com/i78dC1kt [23:31:09] :D [23:31:15] 6Labs, 7LDAP: error accessing phab-01 - https://phabricator.wikimedia.org/T100578#1316153 (10Dzahn) a:5chasemp>3Dzahn [23:32:16] RECOVERY - Puppet failure on tools-exec-1205 is OK Less than 1.00% above the threshold [0.0] [23:36:55] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK Less than 1.00% above the threshold [0.0] [23:39:08] RECOVERY - Puppet failure on tools-exec-1210 is OK Less than 1.00% above the threshold [0.0] [23:39:32] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0] [23:40:44] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK Less than 1.00% above the threshold [0.0] [23:42:32] RECOVERY - Puppet failure on tools-bastion-01 is OK Less than 1.00% above the threshold [0.0] [23:45:51] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK Less than 1.00% above the threshold [0.0]
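For reference, the configuration suggested at the end of T100578, pulled into one place. A minimal sketch; the Host alias and IdentityFile line are assumptions (the P695 paste is not reproduced in this log), while the User line and the username on the ProxyCommand hop are the parts spelled out above.
```
# ~/.ssh/config
Host phab-01
    HostName phab-01.eqiad.wmflabs
    User jaufrecht
    IdentityFile ~/.ssh/id_rsa
    ProxyCommand ssh -W %h:%p jaufrecht@bastion.wmflabs.org
```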