[03:55:28] is it possible for an admin to add a security group to an instance post-creation?
[04:11:54] Hey, if an administrator is about, could you add [[User:Symmachus]] to the tools project? He's working with me
[04:14:39] s/administrator/tools project admin/
[06:15:33] Coren: What's the rules about storing people's emails? I want to take the person's email when they submit a request, so it can email them when the results are ready.
[06:15:57] Hmm, what're*
[06:41:58] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[07:06:41] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Labs+NFS+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report
[07:06:57] Loads for me now
[07:07:05] network traffic is coming back as wel
[07:07:19] SSH broke pipe, Mosh reconnected.
[07:08:53] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 754621 bytes in 2.978 second response time
[07:09:54] Oh well
[07:10:35] Looks like it fixed itself before I got out of the restaurant
[07:11:03] Krenair: what 'massive delete'
[07:11:09] How massive?
[07:11:45] like 1.8TB massive
[07:13:58] a .out file which was being written to (with what I had thought were trivial irc logs) from roughly 2014-07-23 to 2015-02-26
[07:16:16] from a bot that would log into IRC daily, send a few messages and then quit (logging the whole raw socket data - the IRC protocol overhead of quit/join messages and MOTD was much larger than the messages sent/received)
[07:16:57] I noticed it earlier, tried to get it in a state where I could work out what the hell was making it so large, but eventually gave up and just deleted it
[07:17:32] shortly after, alerts started coming up, labs ssh down, beta down, tools down, etc.
[07:20:18] I don't know if it was actually the cause, but ganglia showed labs NFS load shoot up 6-7x, network out fell to very little
[07:21:59] Krenair: can you give me timestamps?
[07:22:07] Of when this was deleted?
[07:22:19] don't have that terminal window open anymore, sorry
[07:22:20] I'll check in an hour
[07:23:34] Krenair: alright :)
[07:23:47] on the other hand, tools now has 52% use of /home rather than 54%
[07:24:21] ah, history shows it
[07:24:29] Surely a deletion is just marking the file as deleted?
[07:24:47] It's not spinning all the disks to erase is physically, is it?
[07:24:48] 2015-03-22 06:48:47
[07:24:51] it*
[07:26:39] Not sure. It sounds like it should be a dimple operation ya
[07:26:43] *simplr
[07:27:43] YuviPanda|zzz, that .nfs file that appears after you do something like this...
[07:27:49] that's not important, is it? >_>
[07:28:13] There is a .NFS file?!
[07:28:23] .nfs0000000016cab79100000303, specifically
[07:28:34] I only have vague ideas of how our NFS setup works...
[07:28:45] And I'm still on my phone.
[07:28:48] Not sure what that is at all actually
[07:29:20] http://serverfault.com/questions/201294/nfsxxxx-files-appearing-what-are-those - 'Nothing is going wrong. This is your NFS client trying to maintain proper "delete on later close" unix behavior within its own operational abilities. This NFS behavior is known as "silly rename":'
[07:32:03] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:45:51] Should have labstore1001 alerts
[07:46:13] YuviPanda|zzz: Is there something I can do so I don't need to run "webservice2 uwsgi-python restart" every time I change something?
[07:57:10] a930913: yes but it kills performance...
[07:59:34] a930913: you can configure uwsgi to autoreload
[08:18:13] YuviPanda|zzz: Hi
[08:18:34] Are you awake or still zzz ?
[09:10:59] a930913 yes? :)
[09:11:51] ToAruShiroiNeko: I thought you were joining for the excitement, from #wikipedia-en.
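Editor's note on the .nfs0000... file discussed above: the "delete on later close" behaviour quoted from serverfault can be illustrated on a local POSIX filesystem. This is only a sketch under that assumption; it does not reproduce the NFS "silly rename" itself, and the function name and payload are made up for the illustration. It shows that unlinking a file which a process still holds open removes the name but leaves the data reachable until the descriptor is closed.

```python
import os
import tempfile


def unlink_while_open(payload: bytes = b"raw irc socket data\n"):
    """Delete a file that is still open and report what remains reachable.

    Assumes a local POSIX filesystem. On NFS the client would instead
    rename the file to a .nfsXXXX name until the last close.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                    # removes the directory entry only
        name_gone = not os.path.exists(path)
        links = os.fstat(fd).st_nlink      # typically 0 on a local filesystem
        os.lseek(fd, 0, os.SEEK_SET)
        still_readable = os.read(fd, len(payload)) == payload
    finally:
        os.close(fd)                       # space can only be reclaimed after this
    return name_gone, links, still_readable


if __name__ == "__main__":
    print(unlink_while_open())
```

If the bot writing the .out file was still running when the file was deleted, this is one plausible reason a .nfs file appeared; the log does not confirm it.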
[09:14:27] oh I am always excited :p
[09:14:50] me being a regular here shouldn't imply otherwise :P
[09:30:44] Vivek: hey
[09:30:47] am still on phone
[09:30:59] been a busy week, will continue to be a busy week I guess
[09:31:05] * YuviPanda|zzz is moving continents on thursday
[09:57:48] Ok
[09:57:56] All the best.
[09:58:33] I am going through wikitech, getting familiar with stuff.
[10:35:00] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0]
[10:37:30] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[10:37:42] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[10:47:43] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:50:10] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:52:22] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]
[11:36:32] Crontab stopped working on tool labs at around 7:00 UTC
[11:38:09] A job, which runs every 5 minutes normally ran at 06:35, 06:40, 06:45 - then 3 times at 07:06 and then stopped working. For different projects. And even "crontab -l" does not work.
[11:56:39] apper: Yeah, labs went b0rk at that time.
[12:38:20] How can I view a job status from a web node? 'denied: host "tools-webgrid-generic-01.eqiad.wmflabs" is neither submit nor admin host'
[13:14:12] PROBLEM - SSH on tools-submit is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:07] crontab times out at the moment. Is tools-submit down?
[14:10:18] YuviPanda|zzz: any idea what happened?
[14:50:51] I just got here.
[14:51:29] I know what happened: rsync. Apparently, near the end of a run, it tries to build a list of files that need deletion. It appears that this list requires holding the list of files in memory.
[14:52:25] So rsync grew to eat pretty much all available memory, consuming all the buffer and cache space, until it ran out and died and freed it all suddenly.
[14:54:11] andrewbogott_afk: YuviPanda|zzz; ^^
[14:54:35] So this was a one-off issue that self-resolved, but it also means that rsync-between-dc is not going to work. Eff.
[14:54:58] What's rsync being used for?
[14:56:01] Fiona: It was meant to back up a snapshot of the filesystem to the other DC at interval. But there are too many files.
[14:56:16] I got 99 problems and they're all files.
[14:56:20] * Coren tries to find other alternatives.
[14:57:09] Actually, it might work iff directories that shouldn't/need not be backup up were marked somehow. Most of the bulk of the files are cache directories.
[14:58:22] Also temp directories, thumbnails, etc. All the kind of things that are around for efficiency but can be rebuilt.
[14:58:27] How many files are we talking about?
[14:58:34] 160 million or so.
[14:58:38] That's a lot of files.
[14:58:48] What on earth.
[14:59:23] A visual grep tells me all of that is reasonable actually, or at least the vast majority is.
[14:59:54] I looked at the bigger users and they're all quite legitimate.
[15:01:13] But much of this is actually quite disposable at need, it's just that there is no way to tell when doing the backup.
[15:01:46] (For instance, it's kinda pointless to back up the beta site thumbnails)
[15:17:14] Coren: https://rsync.samba.org/FAQ.html#4 suggests using --delete-delay or running it in chunks
[16:10:24] Coren: I guess it’s an unusual case where ‘every file that needs deletion’ is super big. Still - seems like a pretty serious rsync bug.
[16:10:37] Suppose there’s some reason it can’t just do them on the fly?
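Editor's note on the rsync suggestions above ("--delete-delay or running it in chunks", plus skipping cache, temp and thumbnail directories): a minimal sketch of how the two ideas could be combined, one rsync invocation per top-level directory with disposable directories excluded. The paths, host name, chunk names and exclude patterns are hypothetical; the log does not say how the backup was actually invoked, and nothing here establishes that this would have avoided the memory blow-up. The sketch only builds the argument lists and does not run rsync.

```python
from pathlib import PurePosixPath

# Hypothetical patterns for data that "can be rebuilt" (caches, temp, thumbnails).
DISPOSABLE = ("cache/", "tmp/", "thumb/")


def rsync_commands(src_root, dest, chunks, excludes=DISPOSABLE):
    """Build one rsync argv per chunk so no single run has to track every file."""
    commands = []
    for chunk in chunks:
        src = str(PurePosixPath(src_root) / chunk) + "/"
        dst = f"{dest.rstrip('/')}/{chunk}/"
        cmd = ["rsync", "-a", "--delete-delay"]
        cmd += [f"--exclude={pattern}" for pattern in excludes]
        cmd += [src, dst]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Assumed source root, destination and chunk names, for illustration only.
    for argv in rsync_commands("/srv/project", "backup.example:/srv/backup",
                               ["tools", "beta"]):
        print(" ".join(argv))
```

Each list could be handed to `subprocess.run` once the excludes have been reviewed; since `--delete-delay` removes files on the receiving side, a dry run with `-n` first would be prudent.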
[16:11:18] Labs, Monitoring, Patch-For-Review: Monitor nova services - https://phabricator.wikimedia.org/T90784#1139226 (Andrew) Today, the nova-api process was running but api calls were timing out. So that's another thing to watch for.
[17:08:49] andrewbogott_afk/Coren /YuviPanda|zzz xtools webserver is down
[17:47:39] hi and help please, crontab isn't working any more
[17:47:49] crontab -e does not work
[17:48:07] Connection closed by 10.68.17.1
[17:48:31] Coren? What can I do?
[17:49:25] project on labs: taxonbot
[17:50:10] that looks like some tools service, rather than a labs project
[17:51:01] Krenair: unless submit.tools is down
[17:51:11] ?
[17:51:12] meaning I am owner of taxonbot, but crontab is working
[17:51:17] I can't edit it
[17:51:39] I can't edit crontab on one of my tools either :/
[17:51:57] What's going on?
[17:52:34] still the rsync issue? I thought that had solved itself, but maybe it is an issue that keeps repeating itself
[17:52:55] "15:52 So rsync grew to eat pretty much all available memory, consuming all the buffer and cache space, until it ran out and died and freed it all suddenly."
[17:53:37] has rsync to do with crontab?
[18:05:49] it's overloading the same server, if I understand correctly
[18:09:30] Tool-Labs: Require membership in the Tools project for mail forwards to work - https://phabricator.wikimedia.org/T93526#1139319 (scfc) NEW
[18:10:37] Labs, Wikimedia-Labs-wikitech-interface: Wikitech registration requires labs shell access - https://phabricator.wikimedia.org/T88092#1139330 (Krenair) Why do we suggest that they ask for that check when it'll be performed automatically anyway? Can't we just remove that text, and open a separate task to ch...
[18:11:10] wikibugs needs a restart. It's only in here, nowhere else (including -feed). <-- YuviPanda|zzz legoktm
[18:11:43] valhallasw`cloud, ^
[18:11:44] quiddity: sec
[18:11:48] :)
[18:12:11] let me first check the log files...
[18:12:44] * quiddity looks blearily around for coffee...
[18:14:05] !log tools.wikibugs wikibugs is not in -feed and other channels again. Logging seems to be dead since 2015-03-02 as well!? Restarting.
[18:14:34] !log tools.wikibugs valhallasw: Deployed 23240bd0dc5aebcc2a94b6f1ac268e2e3ad41114 Add more projects for devtools and mobile wb2-irc
[18:16:36] !log tools.wikibugs still no log file. Killing job and restarting using fab start_jobs
[18:18:26] !log tools.wikibugs log was in wb2-irc.log instead, apparently the job was started incorrectly last time. Is OK now (log is in redis2irc.log, which is auto-rotated)
[18:18:52] meh, we should also log messages wikibugs gets from the server
[18:19:03] or just join all channels even if we already joined?
[18:40:43] Tool-Labs-tools-WMT-bots, Patch-For-Review: Redesign how setlists are generated - https://phabricator.wikimedia.org/T91941#1139353 (JohnLewis) a:JohnLewis>Southparkfan As part of a code rewrite, this should be handled solely by SQL to be honest. For easy work, hack the code to work with an SQL (eith...
[18:41:01] Tool-Labs-tools-WMT-bots: Redesign how setlists are generated - https://phabricator.wikimedia.org/T91941#1139355 (JohnLewis)
[18:43:44] Tool-Labs-tools-WMT-bots: Fix bot-operator so we can give commands to the bots - https://phabricator.wikimedia.org/T62858#1139359 (JohnLewis) a:Southparkfan
[18:45:31] Tool-Labs-tools-WMT-bots: Implement whitelisting function - https://phabricator.wikimedia.org/T91940#1139360 (JohnLewis) a:Southparkfan
[19:18:30] Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1139406 (Teslaton) After 4 days of continuous operation since the limit update advertised, some outages today, the longest one being ~90 min: http://tools.freeside.sk/monitor/http-kmlexport.html (time is in CET - UTC +0100)
[19:24:00] (CR) Merlijn van Deen: [C: 1] "Hm, I'm not sure if I'm a fan of having the colon in the project name if there is no next lower level, but I guess this is required in yam" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/196852 (owner: Awight)
[19:37:56] something broken? the crontab thing is broken for me, corntab -e , -l etc. does not work
[19:39:19] Coren?
[19:39:20] yes,
[19:39:27] the same problems
[19:39:39] but crontab itself is working
[19:46:50] for me complety broken since 07:05
[19:46:54] Gerrit-Patch-Uploader: Allow amended patch sets without rebase with gerrit patch uploader - https://phabricator.wikimedia.org/T92736#1139414 (valhallasw) p:Triage>Normal I see why this would be useful, but it's actually somewhat hard to implement. The way the uploader works is as follows: 1. user subm...
[19:48:36] Tool-Labs, operations: Crontabs broken on toolslabs - https://phabricator.wikimedia.org/T93530#1139416 (Steinsplitter) NEW
[19:49:17] thank you Steini for phab
[19:50:12] Tool-Labs: Normal jobs sometimes run on tools-webgrid-tomcat.eqiad.wmflabs - https://phabricator.wikimedia.org/T64942#1139423 (valhallasw) Open>Invalid Doh! Yes, that makes sense. Added it now, so tonight it should run on a normal host.
[19:51:46] @RC+ wikitechwiki Nova_Resource:Tools/Access_Request/*
[19:51:46] Permission denied
[19:51:49] :{
[19:51:52] @access list
[19:51:58] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.6.2.0 [libirc v. 1.0.2] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features
[19:51:58] @help
[19:52:10] I trust: .*@wikimedia/.* (trusted), .*@mediawiki/.* (trusted), .*@wikimedia/Ryan-lane (admin), .*@wikipedia/.* (trusted), .*@nightshade.toolserver.org (trusted), .*@wikimedia/Krinkle (admin), .*@[Ww]ikimedia/.* (trusted), .*@wikipedia/Cyberpower678 (admin), .*@wirenat2\.strw\.leidenuniv\.nl (trusted), .*@unaffiliated/valhallasw (trusted), .*@mediawiki/yuvipanda (admin), .*@wikipedia/Coren (admin),
[19:52:10] @trusted
[19:53:26] @RC+ wikitechwiki Nova_Resource:Tools/Access_Request/*
[19:53:26] Unable to insert the string to the list because there is no such wiki site known by a bot, contact some developer with svn access in order to insert it
[19:55:17] Tool-Labs, operations: Crontabs broken on toolslabs - https://phabricator.wikimedia.org/T93530#1139428 (Steinsplitter)
[19:55:44] $ ssh magog@tools-login.wmflabs.org
[19:55:46] Permission denied (publickey,hostbased).
[19:55:48] this is a new development
[19:55:54] something is wrong
[19:57:35] Magog_the_Ogre, not your fault, it's also rejecting me
[19:58:31] maybe filesystem broken (AGAIN)
[19:58:46] Labs: Puppetize & fix tools-db - https://phabricator.wikimedia.org/T88234#1139430 (yuvipanda) @Coren: Any update on this?
[19:58:54] Wikibugs: wikibugs is having issues joining channels on startup - https://phabricator.wikimedia.org/T92301#1139431 (valhallasw) p:Triage>High The issue seems to be that wikibugs doesn't keep track of channels correctly, or something like that. I haven't found anything clear in the logs, though, as we d...
[19:59:10] hi
[19:59:31] Magog_the_Ogre: try trust.tools.wmflabs.org
[19:59:47] I am all sunday’d out but am going to try debug this anyway
[19:59:56] !log tools reboot tools-submit
[20:00:10] somuchsunday
[20:00:16] yuvipanda: *trusty :)
[20:00:18] yeah, i should stop working, too.
[20:00:22] right
[20:00:24] trusty.
[20:00:25] hi JohnFLewis
[20:00:30] hi
[20:00:41] $ ssh magog@trust.tools.wmflabs.org
[20:00:41] ssh: Could not resolve hostname trust.tools.wmflabs.org: Name or service not known
[20:00:56] Magog_the_Ogre: yeah, sorry, typo. trust.tools.wmflabs.org
[20:00:57] Magog_the_Ogre: trusty.tools.wmflabs.org is correct :)
[20:01:18] Magog_the_Ogre: trusty.tools.wmflabs.org
[20:01:30] confirmed to not work on tools-login, however
[20:01:33] am investigating
[20:01:46] PROBLEM - Puppet staleness on tools-submit is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [43200.0]
[20:02:14] trusty-tools seems to be working
[20:02:17] looks like crontab is running again
[20:02:56] Betacommand: yeah, I rebooted tools-submit
[20:03:47] fwiw, trusty is slower than tools-login normally is. Not sure if this is relevant.
[20:04:01] in terms of what, exactly? network or ?
[20:04:04] RECOVERY - SSH on tools-submit is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[20:04:10] (it shouldn’t be)
[20:04:14] Betacommand: ^ so tools-submit is back up
[20:04:21] so cron should be
[20:05:11] yuvipanda, probably networking.
[20:05:29] I suppose if the processor is under stress, even the command line might run slowly.
[20:05:38] Magog_the_Ogre: seems fine on my end.
[20:06:12] crontabs back.
[20:06:13] :-)
[20:06:14] Magog_the_Ogre: yeah, seems fine on my end too
[20:06:21] Steinsplitter: yup, the reboot fixed it
[20:08:08] Tool-Labs, operations: Crontabs broken on toolslabs - https://phabricator.wikimedia.org/T93530#1139442 (Steinsplitter) Open>Resolved a:Steinsplitter @yuvipanda has reeboted. Back now. Thanks.
[20:08:26] Tool-Labs, operations: Crontabs broken on toolslabs - https://phabricator.wikimedia.org/T93530#1139446 (Steinsplitter) a:Steinsplitter>yuvipanda
[20:09:26] yeah my process is running plenty quickly, so must be my network
[20:09:35] Wikibugs: Wikibugs should report the Priority if it is set during task creation - https://phabricator.wikimedia.org/T91202#1139448 (valhallasw) p:Triage>Normal This should be possible, but I'm not immediately sure how. The information normally comes from the 'transaction' object, but the transaction do...
[20:16:46] RECOVERY - Puppet staleness on tools-submit is OK: OK: Less than 1.00% above the threshold [3600.0]
[20:37:06] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:38:13] Hi all, I haven't worked on my tool on labs for a while, and now I can't remember how to log in: I can log into bastion via public-key, but I get "Permission denied (publickey,hostbased)." when I try "ssh slashme@tools-login.wmflabs.org"
[20:40:39] slashme: yup, seems to be a problem with tools-login. try ‘ssh trusty.tools.wmflabs.org'
[20:40:58] Aah, thanks!!
[20:42:00] Great, that works. I thought I was being dense :-D
[20:42:05] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:43:21] (PS1) Merlijn van Deen: Citoid and Design stuff [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/198649
[20:46:18] (CR) Merlijn van Deen: [C: 2] "I think we should have a smarter way of doing this at some point, as this has one minor issue, but lgtm for now." [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/198649 (owner: Merlijn van Deen)
[20:46:34] (Merged) jenkins-bot: Citoid and Design stuff [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/198649 (owner: Merlijn van Deen)
[20:52:05] !log tools.wikibugs Updated channels.yaml to: d4de555cf25b1bf7e4eb79e34b9947c84afa927a Citoid and Design stuff
[21:22:04] Can someone log into tools-login or has an active session there? (If not, I'm planning to reboot it.)
[21:22:37] scfc_de: yuvipanda is looking at it I believe
[21:22:55] scfc_de: nah, feel free to.
[21:23:04] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[21:23:16] yuvipanda: tools-bastion-01 = tools-login?
[21:23:26] scfc_de: new one I’m setting up, yeah...
[21:23:40] scfc_de: actually
[21:23:47] scfc_de: I was able to login to tools-login as root
[21:23:57] scfc_de: and everything was fucked up. root was uid something huge, and gid 500
[21:24:05] scfc_de: I wonder if the spammers also registered the ‘root’ account?!
[21:24:23] No, *I* registered the root account, but haven't done anything with it since :-).
[21:24:34] scfc_de: ah, I see. I wonder if just registering it fucked things up
[21:24:44] I’m logging into tools-login again
[21:24:45] wait
[21:24:55] root@tools-login:/$ id
[21:24:55] uid=11924(root) gid=500(wikidev) groups=500(wikidev)
[21:24:56] root@tools-login:/$
[21:24:59] scfc_de: ^
[21:25:02] so that ain’t right
[21:26:08] yuvipanda: Ha! Mea culpa maxima: "ldaplist -l passwd root". Could you delete the LDAP entry, please? Sorry.
[21:26:15] scfc_de: :)
[21:26:33] (But that doesn't explain why tools-login is fubar, but tools-dev & Co. work perfectly.)
[21:27:18] yuvipanda: Is there load on tools-login or something similar?
[21:28:06] > Cpu(s): 2.0%us, 1.3%sy, 0.2%ni, 95.8%id, 0.3%wa, 0.0%hi, 0.2%si, 0.2%st
[21:28:34] And /public/keys?
[21:28:49] scfc_de: hmm, I can’t actually find a user named ‘root’ on ldap
[21:29:10] scfc_de: looks ok
[21:29:14] yuvipanda: "ldaplist -l passwd root"
[21:30:03] !log tools deleted root account created by scfc_de
[21:30:20] scfc_de: because of the LDAP issue
[21:30:21] root@tools-login:/$ tail -f /var/log/auth.log
[21:30:24] tail: cannot open `/var/log/auth.log' for reading: Permission denied
[21:31:20] So you as root couldn't access the ("former") root's files?
[21:31:52] scfc_de: basically, yeah
[21:32:00] I’m somewhat glad this is restricted to tools-login :P
[21:32:06] scfc_de: did you log in to tools-login?
[21:32:22] yuvipanda: +1 :-). No, I still can't log in. Can you look at puppet.log to see if that changed something?
[21:32:33] scfc_de: can’t for obvious reasons :P
[21:32:44] scfc_de: but did you attempt to log in as root?
[21:33:01] I’m just tail: cannot open `/var/log/puppet.log' for reading: Permission denied
[21:33:16] yuvipanda: You mean with my "Test for root" account? No, never got to it.
[21:33:23] hmmmm
[21:33:37] (I waited for /public/keys/root to show up, and before that happened, tools-login became inaccessible.)
[21:33:40] how do we know it’s only restricted to tools-login? :)
[21:34:18] scfc_de: so I’m going to assign tools-login’s IP to tools-bastion-01
[21:34:39] yuvipanda: No clue. But with the LDAP entry gone, you should get uid = 0 when you sudo now.
[21:35:59] !log tools associated tools-login public IP with tools-bastion-01 instead
[21:39:33] scfc_de: ^ try?
[21:40:20] scfc_de: email sent
[21:41:43] yuvipanda: please link to the protected page on wikitech
[21:41:54] valhallasw`cloud: ya trying to find it
[21:42:15] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[21:42:47] aha! cool
[21:43:36] valhallasw`cloud: done
[21:43:43] <3
[21:43:59] yuvipanda: Works; the missing motd of course sucks :-).
[21:44:12] scfc_de: Coren was supposed to be working on that.
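Editor's note on the root/LDAP problem above: the `id` output shows the name "root" resolving to uid 11924 and gid 500 rather than 0, which is why the real root could no longer read its own log files. A minimal sketch of a check that would flag this kind of shadowing, assuming input in passwd(5) colon-separated format (for example, lines collected from `getent passwd`). The function name is hypothetical, and the home and shell fields in the sample entries are invented; only the uid and gid values come from the log.

```python
def shadowed_system_accounts(passwd_lines, expected=None):
    """Return (name, uid) pairs where a reserved name maps to an unexpected uid.

    passwd_lines: iterable of passwd(5)-style lines, name:pw:uid:gid:...
    expected:     mapping of reserved names to their required uid
                  (defaults to root -> 0).
    """
    if expected is None:
        expected = {"root": 0}
    problems = []
    for line in passwd_lines:
        fields = line.strip().split(":")
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # skip blank or malformed entries
        name, uid = fields[0], int(fields[2])
        if name in expected and uid != expected[name]:
            problems.append((name, uid))
    return problems


if __name__ == "__main__":
    sample = [
        "root:x:0:0:root:/root:/bin/bash",          # the local entry
        "root:x:11924:500::/home/root:/bin/bash",   # uid/gid as seen in the log
    ]
    print(shadowed_system_accounts(sample))
```

Which entry wins on a real host depends on the name service lookup order, which the log does not show.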
[21:44:25] Coren: any news on the labstore stuff you were going to finish up over the weekend?
[21:44:31] scfc_de: also welcome back to IRC :)
[21:47:41] scfc_de, I wonder if some of those other SSH fingerprint pages should be protected
[21:47:46] like https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1001.wikimedia.org :/
[21:47:59] scfc_de: valhallasw`cloud can either of you open a phab ticket for tools-login? I have to sleep, pluse I don’t have my external keyboard and so every keystroke hurts...
[21:48:22] yuvipanda: Will do, and sleep well.
[21:48:29] thank you. <3
[21:48:33] Krenair: it is protected against me, at least?
[21:48:50] yuvipanda: bah I was just going to ask you to something for me, oh well. Enjoy the sleep :)
[21:52:18] JohnFLewis: Something related to Tools? Maybe I can help as well.
[21:52:37] scfc_de: nah, production stuff
[21:53:11] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:53:48] I just logged into tools-login and instead I hit tools-bastion
[21:53:52] is that, um, expected?
[21:53:55] scfc_de: re: bastion-01, we should make a bastion-02 and have that have the same ssh keys as -01 as well, so we can switch if needed.
[21:54:00] Magog_the_Ogre: yup, see labs-l
[21:54:05] ty :)
[21:55:11] yuvipanda: Several hosts with the same ssh host key? That sounds like trouble ...
[21:55:34] scfc_de: yeah, but csteipp and others in ops- couldn’t actually figure out why exactly, and I can’t either...
[21:55:54] it makes me feel dirty, of course, but I can’t find an actual reason...
[21:56:13] yuvipanda: Maybe with a bit distance to the incident just now I feel more comfortable with that :-).
[21:56:25] scfc_de: haha :D yeah
[21:57:00] scfc_de: ah fuck
[21:57:01] > labstore1001 : Mar 22 21:47:13 : nagios : /etc/sudoers.d/nfsmanager is owned by uid 11924, should be 0 ; TTY=unknown ; PWD=/ ;
[21:57:58] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]
[21:58:06] Uh. Uh. Uh.
[21:58:18] how to change fingerprint to get a connection by putty again
[21:59:44] is yuvipanda here?
[21:59:52] doctaxon: heya! I am
[21:59:54] Tool-Labs: Investigate lock-down of tools-login - https://phabricator.wikimedia.org/T93545#1139598 (scfc) NEW
[22:00:09] yuvipanda - how to change fingerprint to get a connection by putty again
[22:00:32] what have I to do now
[22:00:41] doctaxon: juts connect. You'll get a warning, with the new fingerprint
[22:01:14] How can I view a job status from a web node? 'denied: host "tools-webgrid-generic-01.eqiad.wmflabs" is neither submit nor admin host'
[22:02:20] yuvipanda - but now I am on tools-basation-01
[22:02:33] doctaxon: yup, that’s ok :)
[22:02:41] is this tools-login?
[22:02:44] doctaxon: it’s a new host, performs same functions as efore
[22:02:52] scfc_de: gaaah, can you add tools-login as a submit host?
[22:02:56] err, bastion-01
[22:02:56] New branding :p
[22:03:50] Doesn't Puppet do that automatically?
[22:04:22] thank you very much, yuvipanda
[22:04:59] scfc_de: puppet doesn’t do jack, no………..
[22:05:14] Coren: ^ you really aught to fix that, it’s been going on for ages.
[22:05:20] doctaxon: yw!
[22:06:11] yuvipanda: One problem I see with two hosts sharing an ssh host key is that in host-based auth we can't differentiate between them. This is probably a non-issue, but one of those things ...
[22:06:27] yuvipanda: Do you know how I can qstat from the web?
[22:06:33] scfc_de: yeah, hmm. But only one will be active at any point
[22:06:47] a930913: I think we just need to add those machines as ‘submit hosts’ on gridengine
[22:06:51] a930913: I'll do that in a jiffy; tools-dev should work until then.
[22:07:31] yuvipanda: But webgrids shouldn't be able to submit jobs, should they?
[22:07:37] Can they be set to read only?
[22:07:40] they should, afaik
[22:07:58] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0]
[22:09:27] scfc_de: Tools-dev doesn't serve web pages :p
[22:09:31] yuvipanda: "ssh tools-login.eqiad.wmflabs" points me to -bastion-01.
[22:09:47] scfc_de: That's right, no?
[22:09:55] no that ain’t right
[22:10:13] scfc_de: DNS magic?
[22:10:15] I wonder
[22:10:17] a930913: Ah, sorry, I only read yuvipanda's bit about tools-*login* needing to be set as a submit host.
[22:10:21] Oh, I thought it was the new name.
[22:10:31] scfc_de: :p
[22:11:05] scfc_de: But accorting to yuvipanda, all the webgrids can be set as submit hosts.
[22:11:08] according*
[22:11:17] they are, afaik
[22:11:29] the new tools-login needs to to :D
[22:11:35] yuvipanda: So why did I get that error message?
[22:11:37] !log qconf -ah tools-bastion-01.eqiad.wmflabs
[22:11:45] !log tools qconf -ah tools-bastion-01.eqiad.wmflabs
[22:11:52] valhallasw`cloud: can you respond on the list about putty / fingerprint?
[22:12:32] a930913: Because the -generic-* hosts were not set; I'll do that now.
[22:12:35] yuvipanda: err?
[22:12:39] oh, I see
[22:15:27] a930913: Try again, please.
[22:15:33] yuvipanda: done
[22:15:40] !log tools for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
[22:16:15] scfc_de: Er, need to make a job first :p
[22:16:19] valhallasw`cloud: <3 thanks
[22:16:33] scfc_de: <3 too. Now I shall sleep.
[22:16:34] morebots for Labs is dead?
[22:18:10] scfc_de: It works! It's alive!
[22:18:36] Ohsh** it's become sentient! Quick, kill it!
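Editor's note on the logged `for host in {...}.eqiad.wmflabs; do qconf -as "$host"; done` loop above: the nested brace expansion is easy to misread, so here is a small sketch that spells out which hosts it covers and the `qconf -as` (add submit host) calls that result. The host names and the flag are taken from the log; the function names are hypothetical, and the sketch only builds the argument lists rather than calling gridengine.

```python
def submit_hosts(domain="eqiad.wmflabs"):
    """Hosts named by {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}."""
    names = ["tools-bastion-01", "tools-webgrid-07"]
    names += [f"tools-webgrid-generic-{n}" for n in ("01", "02")]
    return [f"{name}.{domain}" for name in names]


def qconf_add_submit_commands(hosts):
    """One 'qconf -as HOST' argv per host, mirroring the logged shell loop."""
    return [["qconf", "-as", host] for host in hosts]


if __name__ == "__main__":
    for argv in qconf_add_submit_commands(submit_hosts()):
        print(" ".join(argv))
```

The separate `qconf -ah tools-bastion-01.eqiad.wmflabs` entries in the log register the bastion as an administrative host, which matches the "neither submit nor admin host" wording of the error a930913 quoted.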
[22:18:58] scfc_de: yuvipanda: Thanks.
[22:49:16] A first for a cub programmer:
[22:49:37] I've just closed a user-submitted bug on my own tool with a git commit message.
[22:49:43] Feelsgoodman
[22:49:52] Baby steps...
[22:56:17] slashme: congratulations
[22:57:03] :-D
[22:57:11] Night all!
[23:05:12] !log tools.morebots Restarted Labs morebots
[23:05:21] Ha, lied.
[23:06:48] !log tools.morebots Restarted Labs morebots
[23:06:53] Logged the message, Master
[23:07:24] !log tools copied /etc/hosts into place on tools-bastion-01
[23:07:27] Logged the message, Master
[23:07:43] scfc_de: so many manual steps...
[23:07:49] !log tools for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
[23:07:51] Logged the message, Master
[23:08:01] !log tools qconf -ah tools-bastion-01.eqiad.wmflabs
[23:08:04] Logged the message, Master
[23:12:27] Wikimedia-Labs-Infrastructure, Patch-For-Review: Move LabsDB aliases to DNS - https://phabricator.wikimedia.org/T63897#1139709 (yuvipanda) @Coren: Any updates on this? I had to manually copy another /etc/hosts file again today...
[23:23:51] Tool-Labs: Investigate lock-down of tools-login - https://phabricator.wikimedia.org/T93545#1139712 (scfc) Open>Resolved a:scfc `tools-login` became accessible again, and looking at `/var/log/puppet.log` showed a lot of ownership changes (cf. T93543). So the lock-down is probably related to some pro...